First-time database design: am I overengineering?

2021-12-26 00:00:00 database schema mysql database-design database-normalization

Background

I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.

My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.

My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.
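A minimal sketch of what such a nightly import could look like. The CSV column names and the `staging_sessions` table are hypothetical, since the real export layout is not shown here, and the standard-library `sqlite3` module stands in for MySQL so the example runs without a server. With MySQL the connection would come from a driver library and the upsert syntax differs (for example `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`).

```python
import csv
import io
import sqlite3

# Hypothetical export layout; the real column names depend on the web app's CSV.
SAMPLE_EXPORT = """session_id,client_name,teacher,session_date,status
1,Alice,Mr Smith,2021-12-01,Completed
2,Bob,Ms Jones,2021-12-02,Cancelled
"""


def import_sessions(conn, csv_file):
    """Load one nightly export into a staging table and return the rows read."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staging_sessions (
               session_id   INTEGER PRIMARY KEY,
               client_name  TEXT NOT NULL,
               teacher      TEXT NOT NULL,
               session_date TEXT NOT NULL,
               status       TEXT NOT NULL
           )"""
    )
    rows = [
        (int(r["session_id"]), r["client_name"], r["teacher"],
         r["session_date"], r["status"])
        for r in csv.DictReader(csv_file)
    ]
    # INSERT OR REPLACE keeps the import idempotent: re-running the same
    # export overwrites rows by primary key instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO staging_sessions VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the office MySQL server
import_sessions(conn, io.StringIO(SAMPLE_EXPORT))
import_sessions(conn, io.StringIO(SAMPLE_EXPORT))  # second run adds no duplicates
row_count = conn.execute("SELECT COUNT(*) FROM staging_sessions").fetchone()[0]
print(row_count)
```

Loading into a staging table first keeps the raw export separate from the normalised tables, so a malformed file can be inspected before it touches the reporting data.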

We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple of big corporations as clients, each with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division).

The schema I have come up with is the following:

From the client perspective:
- Clients is the main table
- Clients are linked to the department they work for
  - Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
  - Departments are linked to the division of a company
- Divisions are linked to the parent company
From the classes perspective:
- Sessions is the main table
  - A teacher is linked to each session
  - A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
  - Sessions are grouped into "packs" of an arbitrary size
- Each pack is assigned to a client
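The structure described in the two lists above can be sketched as tables roughly like this. All table and column names are assumptions, since the actual diagram is not reproduced here, and `sqlite3` is used as a stand-in for MySQL so the example is runnable as-is.

```python
import sqlite3

# Hypothetical names throughout; the real schema lives in the Workbench diagram.
SCHEMA = """
CREATE TABLE companies   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE divisions   (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                          company_id INTEGER NOT NULL REFERENCES companies(id));
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT,
                          division_id INTEGER NOT NULL REFERENCES divisions(id));
CREATE TABLE clients     (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                          department_id INTEGER NOT NULL REFERENCES departments(id));
CREATE TABLE teachers    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE statuses    (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE packs       (id INTEGER PRIMARY KEY, size INTEGER NOT NULL,
                          client_id INTEGER NOT NULL REFERENCES clients(id));
CREATE TABLE sessions    (id INTEGER PRIMARY KEY, held_on TEXT NOT NULL,
                          pack_id    INTEGER NOT NULL REFERENCES packs(id),
                          teacher_id INTEGER NOT NULL REFERENCES teachers(id),
                          status_id  INTEGER NOT NULL REFERENCES statuses(id));
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only checks foreign keys when asked
conn.executescript(SCHEMA)
conn.executescript("""
INSERT INTO companies   VALUES (1, 'ACME');
INSERT INTO divisions   VALUES (1, 'ACME Healthcare', 1);
INSERT INTO departments VALUES (1, 'HR', 'London', 1);
INSERT INTO clients     VALUES (1, 'Alice', 1);
""")

# Walk the chain client -> department -> division -> company in one query.
row = conn.execute("""
    SELECT c.name, d.name, d.city, v.name, co.name
    FROM clients c
    JOIN departments d ON d.id = c.department_id
    JOIN divisions   v ON v.id = d.division_id
    JOIN companies  co ON co.id = v.company_id
""").fetchone()
print(row)
```

Each level of the hierarchy holds only a reference to its parent, so renaming a division or moving a department is a single-row change.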

I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd form. I then plugged it into MySQL Workbench and it made it all pretty for me:
[Schema diagram; source: maian.org]

Example queries I'll be running

- Which clients with credit still left are inactive (those without a class scheduled in the future)?
- What is the attendance rate per client/department/division (measured by the status id in each session)?
- How many classes has a teacher had in a month?
- Flag clients who have a low attendance rate
- Custom reports for HR departments with attendance rates of people in their division
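As an illustration, the attendance-rate and low-attendance queries might be sketched as follows. The table and column names, the sample data, and the 0.5 threshold are all assumptions made for the example; the only detail taken from the schema description is that status id 0 means Completed. `sqlite3` stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE packs    (id INTEGER PRIMARY KEY,
                       client_id INTEGER NOT NULL REFERENCES clients(id));
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       pack_id   INTEGER NOT NULL REFERENCES packs(id),
                       status_id INTEGER NOT NULL);  -- 0 = Completed, 1 = Cancelled

INSERT INTO clients  VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO packs    VALUES (1, 1), (2, 2);
INSERT INTO sessions VALUES (1, 1, 0), (2, 1, 0), (3, 1, 0), (4, 1, 1),
                            (5, 2, 0), (6, 2, 1), (7, 2, 1), (8, 2, 1);
""")

# Completed sessions divided by all sessions, per client.
ATTENDANCE_SQL = """
    SELECT c.name,
           SUM(CASE WHEN s.status_id = 0 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS rate
    FROM clients c
    JOIN packs    p ON p.client_id = c.id
    JOIN sessions s ON s.pack_id = p.id
    GROUP BY c.id, c.name
    ORDER BY c.name
"""


def attendance_rates(conn):
    """Return (client name, attendance rate) pairs."""
    return conn.execute(ATTENDANCE_SQL).fetchall()


def low_attendance(conn, threshold=0.5):
    """Return the names of clients whose rate falls below the threshold."""
    return [name for name, rate in attendance_rates(conn) if rate < threshold]


rates = attendance_rates(conn)
flagged = low_attendance(conn)
print(rates)
print(flagged)
```

Grouping by department or division instead would only mean joining further up the hierarchy and changing the `GROUP BY` columns.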

Question(s)

1) Is this overengineered or am I headed the right way?
2) Will the need to join multiple tables for most queries result in a big performance hit?
3) I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
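For comparison, the strictly normalised alternative to a stored 'lastsession' column could be a view that derives the value on demand. This is only a sketch: the view name `client_last_session` and all table and column names are hypothetical, and `sqlite3` stands in for MySQL (which also supports views).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE packs    (id INTEGER PRIMARY KEY,
                       client_id INTEGER NOT NULL REFERENCES clients(id));
CREATE TABLE sessions (id INTEGER PRIMARY KEY, held_on TEXT NOT NULL,
                       pack_id INTEGER NOT NULL REFERENCES packs(id));

-- Derived on demand, so it cannot drift out of sync with sessions.
CREATE VIEW client_last_session AS
SELECT c.id AS client_id, c.name, MAX(s.held_on) AS last_session
FROM clients c
LEFT JOIN packs    p ON p.client_id = c.id
LEFT JOIN sessions s ON s.pack_id = p.id
GROUP BY c.id, c.name;

INSERT INTO clients  VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO packs    VALUES (1, 1);
INSERT INTO sessions VALUES (1, '2021-11-03', 1), (2, '2021-12-15', 1);
""")

# Dates are stored as ISO strings, so MAX() picks the latest one.
last_sessions = conn.execute(
    "SELECT name, last_session FROM client_last_session ORDER BY name"
).fetchall()
print(last_sessions)
```

The `LEFT JOIN`s keep clients who have never had a session in the result, with an empty last-session value, which is the group an "inactive clients" report would care about.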

Thanks for your time

Solution

Some more answers to your questions:

1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!

2 & 3) The performance hit you take will depend largely on having the right indexes for your particular queries and procedures and, more importantly, on the volume of records. Unless you are talking about well over a million records in your main tables, you seem to be on track for a sufficiently mainstream design that performance will not be an issue on reasonable hardware.

That said, and this relates to your question 3: given the start you have, you probably shouldn't worry too much about performance or be hypersensitive to normalization orthodoxy here. This is a reporting server you are building, not a transaction-based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server have more tolerance for complex and lengthy queries, but the strategies for improving its performance are also much different.

For example, in a transaction-based application environment, your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this, but you can have an even greater impact on performance by introducing a snapshot mechanism: a scheduled process runs and stores pre-configured reports, and your users access the snapshot data without putting stress on your db tier on a per-request basis.
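A snapshot mechanism of this kind might be sketched as follows. This is an illustration under stated assumptions rather than a prescribed design: the `snapshot_teacher_monthly` table, its columns, and the sample data are hypothetical, and `sqlite3` stands in for MySQL. In practice `refresh_snapshot` would be triggered by a scheduler such as cron after the nightly import.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (id INTEGER PRIMARY KEY, teacher TEXT NOT NULL,
                       held_on TEXT NOT NULL);
CREATE TABLE snapshot_teacher_monthly (
    snapshot_date TEXT NOT NULL,
    teacher       TEXT NOT NULL,
    month         TEXT NOT NULL,
    session_count INTEGER NOT NULL,
    PRIMARY KEY (snapshot_date, teacher, month)
);
INSERT INTO sessions VALUES (1, 'Smith', '2021-11-30'), (2, 'Smith', '2021-12-01'),
                            (3, 'Smith', '2021-12-08'), (4, 'Jones', '2021-12-02');
""")


def refresh_snapshot(conn, snapshot_date):
    """Run the expensive aggregate once and store the result for cheap reads."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM snapshot_teacher_monthly WHERE snapshot_date = ?",
            (snapshot_date,),
        )
        conn.execute("""
            INSERT INTO snapshot_teacher_monthly
            SELECT ?, teacher, substr(held_on, 1, 7), COUNT(*)
            FROM sessions
            GROUP BY teacher, substr(held_on, 1, 7)
        """, (snapshot_date,))


def read_report(conn, snapshot_date):
    """What a report user hits: a plain read of the stored snapshot."""
    return conn.execute("""
        SELECT teacher, month, session_count
        FROM snapshot_teacher_monthly
        WHERE snapshot_date = ?
        ORDER BY teacher, month
    """, (snapshot_date,)).fetchall()


today = date(2021, 12, 26).isoformat()
refresh_snapshot(conn, today)
refresh_snapshot(conn, today)  # re-running replaces rather than duplicates
report = read_report(conn, today)
print(report)
```

Report readers only ever touch the small snapshot table, so the cost of the joins and aggregates is paid once per refresh instead of once per request.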

All of this is a long-winded rant to illustrate that the design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
