Database design: an "events" table
After reading the tips from this great Nettuts+ article, I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads, and at the same time lower the number of tables needed in the whole database schema. However, I'm not sure if this is a good idea since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea:
I have four types of users modeled in a Class Table Inheritance structure; in the main "user" table I store data common to all the users (id, username, password, several flags, ...) along with some TIMESTAMP fields (date_created, date_updated, date_activated, date_lastLogin, ...).
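For context, a minimal sketch of what that parent "user" table might look like, assuming InnoDB; the flag and password columns are purely illustrative, only the column names listed above come from the actual design:

CREATE TABLE user (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT,
    username       VARCHAR(32) NOT NULL,
    password       CHAR(40) NOT NULL,               -- assuming a hashed value; the hashing scheme isn't specified
    is_active      TINYINT(1) NOT NULL DEFAULT 0,   -- stands in for the "several flags"
    date_created   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    date_updated   TIMESTAMP NULL,
    date_activated TIMESTAMP NULL,
    date_lastLogin TIMESTAMP NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;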
To quote tip #16 from the Nettuts+ article mentioned above:
> Example 2: You have a "last_login" field in your table. It updates every time a user logs in to the website. But every update on a table causes the query cache for that table to be flushed. You can put that field into another table to keep updates to your users table to a minimum.
Now it gets even trickier: I need to keep track of some user statistics, like
- how many unique times a user profile was seen
- how many unique times an ad from a specific type of user was clicked
- how many unique times a post from a specific type of user was seen
- and so on...
In my fully normalized database this adds up to about 8 to 10 additional tables. It's not a lot, but I would like to keep things simple if I could, so I've come up with the following "events" table:
|------|----------------|----------------|---------------------|-----------|
| ID | TABLE | EVENT | DATE | IP |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 00:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 02:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | created | 2010-04-19 00:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | activated | 2010-04-19 02:34:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | approved | 2010-04-19 09:30:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | login | 2010-04-19 12:00:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | created | 2010-04-19 12:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | impressed | 2010-04-19 12:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:01 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:02 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:03 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:04 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:05 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | blocked | 2010-04-20 03:19:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | deleted | 2010-04-20 03:20:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
Basically the ID refers to the primary key (id) field in the TABLE table; I believe the rest should be pretty straightforward. One thing that I've come to like in this design is that I can keep track of all the user logins instead of just the last one, and thus generate some interesting metrics with that data.
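For example, the login metrics mentioned here could be pulled straight from this table; a sketch, assuming the columns shown above (with `table` back-quoted because it is a reserved word in MySQL):

SELECT COUNT(*) AS total_logins, COUNT(DISTINCT ip) AS unique_login_ips
FROM events
WHERE `table` = 'user' AND event = 'login' AND id = 1;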
Due to the growing nature of the events table I also thought of making some optimizations, such as:
- #9: Since there is only a finite number of tables and a finite (and predetermined) number of events, the TABLE and EVENT columns could be set up as ENUMs instead of VARCHARs to save some space.
- #14: Store IPs as UNSIGNED INTs with INET_ATON() instead of VARCHARs.
- Store DATEs as TIMESTAMPs instead of DATETIMEs.
- Use the ARCHIVE (or the CSV?) engine instead of InnoDB/MyISAM (a rough sketch combining these points follows this list).
  - Only INSERTs and SELECTs are supported, and data is compressed on the fly.
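Combining those points, a sketch of what the table could look like; the ENUM members are taken from the example rows ('user_posts' is inferred from the post statistics mentioned earlier), and the ARCHIVE choice is the open question from the list, so treat this as an illustration rather than a final design:

CREATE TABLE events (
    id      INT UNSIGNED NOT NULL,               -- PK of the row in the referenced table
    `table` ENUM('user', 'user_ads', 'user_posts') NOT NULL,
    event   ENUM('created', 'activated', 'approved', 'login',
                 'impressed', 'clicked', 'blocked', 'deleted') NOT NULL,
    date    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ip      INT UNSIGNED NOT NULL                -- stored with INET_ATON()
) ENGINE=ARCHIVE;

-- logging an event
INSERT INTO events (id, `table`, event, ip)
VALUES (15, 'user_ads', 'clicked', INET_ATON('127.0.0.2'));

-- reading the IP back in dotted form
SELECT id, event, INET_NTOA(ip) FROM events WHERE `table` = 'user_ads';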
Overall, each event would only consume 14 (uncompressed) bytes (presumably 4 for the INT id, 1 each for the two ENUMs, 4 for the TIMESTAMP and 4 for the UNSIGNED INT ip), which I guess is okay for my traffic.
Pros:
- Ability to store more detailed data (such as logins).
- No need to design (and code for) almost a dozen additional tables (dates and statistics).
- Removes a few columns from each table and keeps volatile data separate.
Cons:
- Non-relational (still not as bad as EAV):
SELECT * FROM events WHERE id = 2 AND `table` = 'user' ORDER BY date DESC;
- 6 bytes of overhead per event (ID, TABLE and EVENT).
I'm more inclined to go with this approach since the pros seem to far outweigh the cons, but I'm still a little bit reluctant... Am I missing something? What are your thoughts on this?
Thanks!
@coolgeek:
> One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Just to be clear, you mean I should add an additional table that maps which events are allowed in a table and use the PK of that table in the events table instead of having a TABLE/EVENT pair?
@ben:
> These are all statistics derived from existing data, aren't they?
The additional tables are mostly related to statistics, but the data doesn't already exist. Some examples:
user_ad_stats               user_post_stats
-------------               ---------------
user_ad_id (FK)             user_post_id (FK)
ip                          ip
date                        date
type (impressed, clicked)
If I drop these tables I have no way to keep track of who, what or when; I'm not sure how views can help here.
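To make that concrete: a stat that would have lived in user_ad_stats, such as unique clicks on ad #15, can still be answered from the events table; a sketch using the columns from my example above:

SELECT COUNT(DISTINCT ip) AS unique_clicks
FROM events
WHERE `table` = 'user_ads' AND event = 'clicked' AND id = 15;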
> I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things. I don't think volatility is so important.
I've heard it both ways and I couldn't find anything in the MySQL manual that states either one is right. Anyway, I agree with you that they should be separate tables because they represent different kinds of data (with the added benefit of being more descriptive than a regular approach).
> I think you're missing the forest for the trees, so to speak.
>
> The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues.
What I meant by "not as bad as EAV" is that all records follow a linear structure and they are pretty easy to query; there is no hierarchical structure, so all queries can be done with a simple SELECT.
Regarding your second statement, I think you understood me wrong here; the IP address is not necessarily associated with the user. The table structure should read something like this:
> IP address (IP) did something (EVENT) to the PK (ID) of the table (TABLE) on date (DATE).
For instance, the last row of my example above should read: IP 217.0.0.1 (some admin) deleted user #2 (whose last known IP is 127.0.0.2) at 2010-04-20 03:20:00.
> You can still join, say, user events to users, but you can't implement a foreign key constraint.
Indeed, that's my main concern. However I'm not totally sure what can go wrong with this design that couldn't go wrong with a traditional relational design. I can spot some caveats but as long as the app messing with the database knows what it is doing I guess there shouldn't be any problems.
One other thing that counts in this argument is that I will be storing many more events, and each event will more than double compared to the original design, so it makes perfect sense to use the ARCHIVE storage engine here; the only thing is that it doesn't support FKs (nor UPDATEs or DELETEs).
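On the join point: even without a foreign key constraint, events can still be joined back to their parent rows; a sketch, assuming the user table layout from the beginning of the question:

SELECT u.username, e.event, e.date, INET_NTOA(e.ip) AS ip
FROM events AS e
JOIN user AS u ON u.id = e.id AND e.`table` = 'user'
ORDER BY e.date DESC;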
I highly recommend this approach. Since you're presumably using the same database for OLTP and OLAP, you can gain significant performance benefits by adding in some stars and snowflakes.
I have a social networking app that is currently at 65 tables. I maintain a single table to track object (blog/post, forum/thread, gallery/album/image, etc) views, another for object recommends, and a third table to summarize insert/update activity in a dozen other tables.
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Clarifying for Alix - Yes, you maintain a reference table for objects, and a reference table for events (these would be your dimension tables). Your fact table would have the following fields:
id
object_id
event_id
event_time
ip_address
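A hedged sketch of that layout (the table and column names follow the description above and are illustrative, not an existing schema; the object_type column is implied by the entity_type remark even though it isn't in the five-field list):

CREATE TABLE entity_type (               -- dimension: which table the object lives in
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(32) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE event_type (                -- dimension: what happened
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(32) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE event_fact (                -- fact table
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    object_id   INT UNSIGNED NOT NULL,      -- PK of the row in the referenced table
    object_type TINYINT UNSIGNED NOT NULL,  -- references entity_type.id
    event_id    TINYINT UNSIGNED NOT NULL,  -- references event_type.id
    event_time  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ip_address  INT UNSIGNED NOT NULL       -- INET_ATON() form
) ENGINE=InnoDB;

The two small tables act as dimensions; the fact table joins to them (and, through object_id/object_type, to the object's own table) for reporting.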