Extreme sharding: one SQLite database per user
I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
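A minimal sketch of what "one database file per user" could look like with Python's standard `sqlite3` module. The directory layout, the `messages` table, and the function names are assumptions made for illustration; none of them come from the actual app.

```python
import sqlite3
from pathlib import Path


def user_db_path(base_dir: Path, user_id: int) -> Path:
    # Hypothetical layout: fan files out into 256 subdirectories so that
    # no single directory has to hold every user's file.
    return base_dir / f"{user_id % 256:02x}" / f"user_{user_id}.sqlite"


def open_user_db(base_dir: Path, user_id: int) -> sqlite3.Connection:
    path = user_db_path(base_dir, user_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(path))
    # Hypothetical schema, created the first time a user's file is opened.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, sender TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn
```

Each request would open only the file (or two) it needs, which is what keeps contention local to a single user.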
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
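One caveat with a plain file copy is that it can capture a half-written state if a request is writing to the file at that moment. A hedged sketch of a safer per-user backup, using the `Connection.backup` method of Python's `sqlite3` module (Python 3.7+); the function name is made up, and the "remote location" is assumed to be reachable as an ordinary path such as a mounted network share:

```python
import sqlite3


def backup_user_db(src_path: str, dest_path: str) -> None:
    # Copies a consistent snapshot of one user's database, even if other
    # connections are using the source file while the backup runs.
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

A daily job would then loop over the user files and call `backup_user_db` for each one.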
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate SQLite files on adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks while a write is in progress anyway, and only on the file being written.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
UPDATE: I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 SQLite databases, to be precise. Each user is assigned and bound to a pseudo-random shard by a simple hash function.
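For illustration, a sketch of how a user could be bound to one of 256 shards with a simple, stable hash. The post does not say which hash function is used; SHA-1 and the file naming here are assumptions.

```python
import hashlib

NUM_SHARDS = 256  # fixed shard count from the update above


def shard_for_user(user_id: str) -> int:
    # A cryptographic digest is stable across processes and machines.
    # Python's built-in hash() is salted per process for strings, so it
    # would send the same user to different shards after a restart.
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def shard_filename(user_id: str) -> str:
    # Hypothetical naming scheme: shard_000.sqlite ... shard_255.sqlite
    return f"shard_{shard_for_user(user_id):03d}.sqlite"
```

Because the shard count is baked into the modulo, changing it later would move most users to a different shard, so the count has to stay fixed (or the data has to be migrated).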
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0: I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, InnoDB is 4-5 times faster. For this reason, and others, I'm dropping this architecture, but I hope someone somewhere finds a use for it... thanks.
Recommended answer
The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
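As a rough sketch of what that shard walk looks like in application code (the function name and the idea of passing in a list of shard file paths are assumptions for illustration):

```python
import sqlite3
from typing import Iterable, List, Tuple


def walk_shards(shard_paths: Iterable[str], sql: str, params: Tuple = ()) -> List[tuple]:
    # Run the same query against each shard in turn and merge the rows.
    # The cost grows linearly with the number of shards visited, and any
    # cross-shard sorting or limiting has to be done here in the app.
    rows: List[tuple] = []
    for path in shard_paths:
        conn = sqlite3.connect(path)
        try:
            rows.extend(conn.execute(sql, params).fetchall())
        finally:
            conn.close()
    return rows
```

With a single centralized database the same result would be one query with a `WHERE ... IN (...)` clause, which is where the gap on this kind of request comes from.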
If all of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and whether data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.