MySQL Binary Storage using BLOB VS OS File System:大文件、大数量、大问题

2022-01-17 00:00:00 http-headers blob streaming php mysql

我正在运行的版本(基本上最新的):
PHP:5.3.1
MySQL:5.1.41
阿帕奇:2.2.14
操作系统:CentOS(最新)

Versions I am running (basically latest of everything):
PHP: 5.3.1
MySQL: 5.1.41
Apache: 2.2.14
OS: CentOS (latest)

情况如下.

我有数千份非常重要的文件，从客户合同到语音签名(合同客户授权记录)，文件类型包括但不限于 jpg、gif、png、tiff、doc、docx、xls、wav、mp3、pdf等

I have thousands of very important documents, ranging from customer contracts to voice signatures (recordings of customer authorisation for contracts), with file types including, but not limited to jpg, gif, png, tiff, doc, docx, xls, wav, mp3, pdf, etc.

所有这些文档目前都存储在多个服务器上，包括 Windows 32 位、CentOS 和 Mac 等.有些文件还存储在员工的台式电脑和笔记本电脑上，有些仍然是存储在数百个盒子和文件柜中的硬拷贝.

All of these documents are currently stored on several servers including Windows 32 bit, CentOS and Mac, among others. Some files are also stored on employees desktop computers and laptops, and some are still hard copies stored in hundreds of boxes and filing cabinets.

现在，由于客户或律师可以随时要求提供合同证据，我的公司必须能够有效地搜索和找到正确的文件，因此所有这些文件都必须数字化(如果还没有) 并关联到某种搜索和访问顺序.

Now because customers or lawyers could demand evidence of contracts at any time, my company has to be able to search and locate the correct document(s) effectively, for this reason ALL of these files have to be digitised (if not already) and correlated into some sort of order for searching and accessing.

作为程序员，我创建了一个全公司都使用的完整的客户关系管理工具.这包括客户资料管理、订单和工作跟踪工具、工作/销售创建和管理模块等，以及目前在客户资料级别(驾驶执照、信用授权等)或工作/销售级别(合同、语音签名等)可以上传到服务器并位于父/子层次结构中，就像 Windows 资源管理器或任何其他典型的文件管理模型一样.

As the programmer, I have created a full Customer Relations Management tool that the whole company uses. This includes Customer Profiles management, Order and job Tracking tools, Job/sale creation and management modules, etc, and at the moment any file that is needed at a customer profile level (drivers licence, credit authority, etc) or at a job/sale level (contracts, voice signatures, etc) can be uploaded to the server and sits in a parent/child hierarchy structure, just like Windows Explorer or any other typical file managment model.

结构如下所示:

drivers_license
|- DL_123.jpg
语音签名
|- VS_123.wav
|- VS_4567.wav
合同

drivers_license
|- DL_123.jpg
voice_signatures
|- VS_123.wav
|- VS_4567.wav
contracts

所以文件是使用 PHP 和 Apache 上传的，并存储在操作系统的文件系统中.在上传时，有关文件的某些信息存储在 MySQL 数据库中.存储的一些信息是:

So the files are uplaoded using PHP and Apache, and are stored in the file system of the OS. At the time of uploading, certain information about the file(s) is stored in a MySQL database. Some of the information stored is:

表格:文件上传
文件ID
CustomerID(文件所属的客户id，他们都有这个.)
JobID/SaleID(相关工作/销售的 ID，如果有的话.)
文件大小
文件类型
上传日期时间
上传者
FilePath(存储文件的目录路径.)
FileName(上传文件的当前文件名，CustomerID 和 JobID/SaleID 的组合，如果适用.)
文件说明
OriginalFileName(上传时源文件的原始名称，包括扩展名.)

TABLE: FileUploads
FileID
CustomerID (the customer id that the file belongs to, they all have this.)
JobID/SaleID (the id of the job/sale associated, if any.)
FileSize
FileType
UploadedDateTime
UploadedBy
FilePath (the directory path the file is stored in.)
FileName (current file name of uploaded file, combination of CustomerID and JobID/SaleID if applicable.)
FileDescription
OriginalFileName (original name of the source file when uploaded, including extension.)

如您所见，文件通过文件名链接到数据库.当我想将客户的文件提供给用户下载时，我所要做的就是SELECT * FROM FileUploads WHERE CustomerID = 123 OR JobID = 2345;"这将输出我需要的所有文件详细信息，并使用 FilePath 和 FileName 我可以提供下载链接.

So as you can see, the file is linked to the database by the File Name. When I want to provide a customers' files for download to a user all I have to do is "SELECT * FROM FileUploads WHERE CustomerID = 123 OR JobID = 2345;" and this will output all the file details I require, and with the FilePath and FileName I can provide the link for download.

http...服务器/文件路径/文件名

http... server / FilePath / FileName

这种方法有很多问题:

在这种数据库无意识"环境中存储文件意味着无法保持数据完整性.如果一条记录被删除，该文件也可能不会被删除，反之亦然.
文件散落在各处，不同的服务器、计算机等.
文件名是唯一与数据库、客户资料和客户记录匹配的二进制文件.

等等等等.有这么多原因，这里有一些描述:http://www.dreamwerx.net/site/article01 .这里也有一篇有趣的文章:sietch.net/ViewNewsItem.aspx?NewsItemID=124.

etc, etc. There are so many reasons, some of which are described here: http://www.dreamwerx.net/site/article01 . Also there is an interesting article here too: sietch.net/ViewNewsItem.aspx?NewsItemID=124 .

所以，经过大量研究，我几乎决定将所有这些文件作为 BLOB 或 LONGBLOB 存储在数据库中，但在我这样做之前仍有许多考虑因素.

SO, after much research I have pretty much decided I am going to store ALL of these files in the database, as a BLOB or LONGBLOB, but there are still many considerations before I do this.

我知道将它们存储在数据库中是一个可行的选择，但是有许多存储它们的方法.我也知道存储它们是一回事.以可管理的方式关联和访问它们完全是另一回事.

I know that storing them in the database is a viable option, however there are a number of methods of storing them. I also know storing them is one thing; correlating and accessing them in a manageable way is another thing entirely.

此链接提供的文章:dreamwerx.net/site/article01 描述了一种将上传的二进制文件拆分为 64kb 块并使用 FileID 存储每个块，然后使用标头将实际二进制文件流式传输到客户端的方法.这是一个非常酷的想法，因为它减轻了服务器内存的压力；它不是将整个 100mb 文件加载到 RAM 中然后将其发送到客户端，而是一次执行 64kb.我已经尝试过这个(并更新了他的脚本)，并且在非常小的测试框架内这是完全成功的.

The article provided at this link: dreamwerx.net/site/article01 describes a way of splitting the uploaded binary files into 64kb chunks and storing each chunk with the FileID, and then streaming the actual binary file to the client using headers. This is a really cool idea since it alleviates preassure on the servers memory; instead of loading an entire 100mb file into the RAM and then sending it to the client, it is doing it 64kb at a time. I have tried this (and updated his scripts) and this is totally successful, in a very small frame of testing.

因此，如果您同意此方法是一种可行、稳定且稳健的长期选择，可用于存储中等大小的文件(1kb 到几百兆)以及大量此类文件，请告诉我还有哪些其他注意事项或你有什么想法.

So if you are in agreeance that this method is a viable, stable and robust long-term option to store moderately large files (1kb to couple hundred megs), and large quantities of these files, let me know what other considerations or ideas you have.

另外，我正在考虑获取一个当前的文件管理"PHP 脚本，该脚本提供了一个界面，用于管理存储在文件系统中的文件并将其转换为管理存储在数据库中的文件.如果已经有任何软件可以做到这一点，请告诉我.

Also, I am considering getting a current "File Management" PHP script that gives an interface for managing files stored in the File System and converting it to manage files stored in the database. If there is already any software out there that does this, please let me know.

我想我可以问很多问题，所有信息都在那里^^所以请讨论这方面的各个方面，我们可以来回传递想法并互相教导.

I guess there are many questions I could ask, and all the information is up there ^^ so please, discuss all aspects of this and we can pass ideas back and forth and teach each other.

干杯，

Quantico773

Quantico773

推荐答案

我在一个大型软件系统上工作，该系统已经完成了存储附件和其他内容的机制.系统的第一次迭代将所有数据存储在数据库中的 BLOB 中.我当时诅咒它.作为一名程序员，我可以编写辅助脚本来立即对数据进行操作并随时更改.

I work on a large software system that has done both mechanisms for storing attachments and other content. The first iteration of the system stored all data in BLOBs in the DB. I cursed it at the time. As a programmer, I could write side scripts to immediately operate on the data and change it whenever I wanted to.

前进了大约 10 年，我仍然管理相同的软件，但架构发生了变化，并且它是使用文件系统指针编写的.我现在诅咒它，希望它回到数据库中.我有几年的额外好处，并且在越来越多的更大情况下以更大的能力工作了这个应用程序，我觉得我现在的意见受到了更好的教育.应用程序的升级或系统迁移需要大量的脚本编写和数百万个文件的复制.有一次我们更改了操作系统并且所有文件指针都有错误的目录分隔符，或者服务器名称更改了文件所在的位置，我们不得不在周末与 DBA 一起编写和安排简单的 SQL 更新语句来修复.另一个是文件系统和数据库记录不同步，为什么不确定但经过数千天的操作，有时非事务系统(文件系统和数据库不共享事务上下文)会变得不同步.有时文件会莫名其妙地丢失.

Advance about 10 years and I still manage the same software but the architecture has changed and it was written with filesystem pointers. I curse it now and wish it were back in the DB. I have the added benefit of several years and having worked this application in much greater capacity in many more and many larger situations, I feel my opinion now is better educated. Promotion or system migration of the application requires extensive scripting and copying of millions of files. On one occasion we changed the OS and all the file pointers had the wrong directory separator, or the server name changes where the file was and we had to write and schedule simple SQL update statements with the DBA on the weekend to fix. Another is that the filesystem and DB records get out of sync, why is uncertain but after thousands of days of operation, sometimes non-transactional systems (filesystem and DB don't share transactional contexts) simply become out of sync. Sometimes files mysteriously go missing.

当所有这些都在数据库中时，迁移或环境升级就是转储和导入数据库的问题.可以正确审核行更改，同步的所有内容和日志可以在必要时重播到时间点.数据库肯定变大了，但现在是 2011 年，这对数据库来说根本不是挑战.

When all this was in the DB, migration or environment promotion was a matter of dump and import the DB. Row changes could be properly audited, everything in sync and logs can be replayed to point-in-time if necessary. Sure the DB gets big, but it is 2011 and this stuff is simply not a challenge for databases.

值得我们在流式传输某些数据时遇到一些类似的大型数据缓冲区问题，但是 A)我们可以使用 JDBC 中的 Input|OutputStreams 和 B)在使用其他工具时将数据泵送到字节缓冲区中，我们写道一个存储过程，它将 BLOB 分块到一个临时表中，并从临时表中迭代地提供这些块.效果很好.

For what it is worth we had some similar issues with large data buffers when streaming some data, but A) we could pump the data in byte buffers with the Input|OutputStreams in JDBC and B) when using other tools, we wrote a stored procedure that would chunk the BLOB into a temp table and iteratively serve the chunks from the temp table. Works great.

我不在乎不将这些东西放入数据库的技术原因是什么，但在一个合并的位置管理起来要容易得多，我可以加倍并在管理不同文件的短时间内将顾问和客户浪费的时间增加三倍的硬件或网格数据库.

I don't care what the technical reason for not putting this stuff in the DB, but it is so much easier to manage in a consolidated location I could double and triple the hardware or grid the DB for the time wasted by consultants and customers just in a short period of time managing the disparate files.

更新:对评论者放轻松，他们只是对此事发表意见.

Update: go easy on the commenters, they're just giving their opinion on the matter.

相关文章