将 base64 编码数据存储为 BLOB 或 TEXT 数据类型

2021-12-22 00:00:00 blob text mysql innodb

我们有一个 MySQL InnoDB 表,其中包含大约 10 列的小型 base64 编码 javascript 文件和 png(小于 2KB 大小)图像,也采用 base64 编码.

We have a MySQL InnoDB table holding ~10 columns of small base64 encoded javascript files and png (<2KB size) images base64 encoded as well.

相对而言,插入较少,读取较多,但输出会在 Memcached 实例上缓存几分钟,以避免后续读取.

There are few inserts and a lot of reads comparatively, however the output is being cached on a Memcached instance for some minutes to avoid subsequent reads.

现在我们正在为这些列使用 BLOB,但我想知道在性能或快照方面切换到 TEXT 数据类型是否有优势正在备份.

As it is right now we are using BLOB for those columns, but I am wondering if there is an advantage in switching to TEXT datatype in terms of performance or snapshot backing up.

我的搜索结果表明 BLOBTEXT 在我的案例中几乎相同,因为我事先不知道实际上会是什么类型的数据存储我去BLOB.

My search digging indicates that BLOB and TEXT for my case are close to identical and since I do not know before-hand what type of data are actually going to be stored I went for BLOB.

对于此特定案例,您是否有关于 TEXT 与 BLOB 辩论的任何提示?

Do you have any pointers on the TEXT vs BLOB debate for this specific case?

推荐答案

不应将 Base64 编码的数据存储在自己的数据库中...

Base64 是一种编码,其中仅使用可打印的文本字符表示任意二进制数据:它专为需要通过只能处理可打印文本的协议或介质(例如 SMTP/电子邮件)传输此类二进制数据的情况而设计).它增加了数据大小(增加了 33%)并增加了编码/解码的计算成本,因此除非绝对必要,否则应避免使用.

One shouldn't store Base64-encoded data in one's database...

Base64 is a coding in which arbitrary binary data is represented using only printable text characters: it was designed for situations where such binary data needs to be transferred across a protocol or medium that can handle only printable-text (e.g. SMTP/email). It increases the data size (by 33%) and adds the computational cost of encoding/decoding, so it should be avoided unless absolutely necessary.

相比之下,BLOB 列的全部意义在于它们存储不透明的二进制字符串.因此,只需将您的内容直接存储到您的 BLOB 列中,而无需先对它们进行 Base64 编码.(也就是说,如果 MySQL 有更适合存储的特定数据的类型,您可能希望改用它:例如,像 JavaScript 源这样的文本文件可以从存储在 TEXT 列中受益哪个 MySQL 本地跟踪特定于文本的元数据——更多内容见下文).

By contrast, the whole point of BLOB columns is that they store opaque binary strings. So just go ahead and store your stuff directly into your BLOB columns without first Base64-encoding them. (That said, if MySQL has a more suitable type for the particular data being stored, you may wish to use that instead: for example, text files like JavaScript sources could benefit from being stored in TEXT columns for which MySQL natively tracks text-specific metadata—more on this below).

SQL 数据库需要像 Base64 这样的可打印文本编码来处理任意二进制数据的(错误的)想法已被大量信息不足的教程延续.这个想法似乎是基于错误的信念,因为 SQL 只包含其他上下文中的可打印文本,它肯定也需要二进制数据(至少对于数据传输,如果不是用于数据存储).这根本不是真的:SQL 可以通过多种方式传送二进制数据,包括纯字符串文字(前提是它们像任何其他字符串一样被正确引用和转义);当然,将数据(任何类型)传递到数据库的首选方式是通过参数化查询,并且参数的数据类型可以像其他任何类型一样轻松地成为原始二进制字符串.

The (erroneous) idea that SQL databases require printable-text encodings like Base64 for handling arbitrary binary data has been perpetuated by a large number of ill-informed tutorials. This idea appears to be seated in the mistaken belief that, because SQL comprises only printable-text in other contexts, it must surely require it for binary data too (at least for data transfer, if not for data storage). This is simply not true: SQL can convey binary data in a number of ways, including plain string literals (provided that they are properly quoted and escaped like any other string); of course, the preferred way to pass data (of any type) to your database is through parameterised queries, and the data types of your parameters can just as easily be raw binary strings as anything else.

存储 Base64 编码数据可能带来一些好处的唯一情况是,它通常在从数据库中检索后立即通过需要这种编码的协议(例如通过电子邮件附件)传输.— 在这种情况下,存储 Base64 编码的表示将不必在每次获取时对其他原始数据执行编码操作.

The only situation in which there might be some benefit from storing Base64-encoded data is where it's usually transmitted across a protocol requiring such encoding (e.g. by email attachment) immediately after being retrieved from the database—in which case, storing the Base64-encoded representation would save from having to perform the encoding operation on the otherwise raw data upon every fetch.

但是,从这个意义上说,Base64 编码的存储仅充当缓存,就像出于性能原因可能存储非规范化数据一样.

However, note in this sense that the Base64-encoded storage is merely acting as a cache, much like one might store denormalised data for performance reasons.

如上所述:TEXTBLOB 列之间的唯一区别在于,对于 TEXT 列,MySQL 额外跟踪特定于文本的元数据(例如字符编码和排序规则).这个额外的元数据使 MySQL 能够在存储和连接字符集(在适当的情况下)之间转换值并执行花哨的字符串比较/排序操作.

As alluded above: the only difference between TEXT and BLOB columns is that, for TEXT columns, MySQL additionally tracks text-specific metadata (such as character encoding and collation) for you. This additional metadata enables MySQL to convert values between storage and connection character sets (where appropriate) and perform fancy string comparison/sorting operations.

一般来说:如果两个使用不同字符集的客户端应该看到相同的字节,那么你需要一个BLOB列;如果他们应该看到相同的字符,那么您需要一个 TEXT 列.

Generally speaking: if two clients working in different character sets should see the same bytes, then you want a BLOB column; if they should see the same characters then you want a TEXT column.

使用 Base64,这两个客户端必须最终发现数据解码为相同的字节;但他们应该看到存储/编码的数据具有相同的字符.例如,假设有人希望插入 'Hello world!'(即 'SGVsbG8gd29ybGQh')的 Base64 编码.如果插入应用程序在 UTF-8 字符集中工作,那么它会将字节序列 0x53475673624738676432397962475168 发送到数据库.

With Base64, those two clients must ultimately find that the data decodes to the same bytes; but they should see that the stored/encoded data has the same characters. For example, suppose one wishes to insert the Base64-encoding of 'Hello world!' (which is 'SGVsbG8gd29ybGQh'). If the inserting application is working in the UTF-8 character set, then it will send the byte sequence 0x53475673624738676432397962475168 to the database.

  • 如果该字节序列存储在 BLOB 列中,并且稍后由使用 UTF-16* 的应用程序检索,则相同将返回字节——代表'噳扇㡧搲㥹扇全',而不是所需的Base64编码值;而

  • if that byte sequence is stored in a BLOB column and later retrieved by an application that is working in UTF-16*, the same bytes will be returned—which represent '升噳扇㡧搲㥹扇全' and not the desired Base64-encoded value; whereas

如果该字节序列存储在 TEXT 列中,然后由以 UTF-16 工作的应用程序检索,MySQL 将即时转码以返回字节序列0x0053004700560073006200470038006700640032003900790062004700510068——表示原始 Base64 编码值

if that byte sequence is stored in a TEXT column and later retrieved by an application that is working in UTF-16, MySQL will transcode on-the-fly to return the byte sequence 0x0053004700560073006200470038006700640032003900790062004700510068—which represents the original Base64-encoded value 'SGVsbG8gd29ybGQh' as desired.

当然,您仍然可以使用 BLOB 列并以其他方式跟踪字符编码 - 但这只会不必要地重新发明轮子,增加了维护复杂性和引入意外错误的风险.

Of course, you could nevertheless use BLOB columns and track the character encoding in some other way—but that would just needlessly reinvent the wheel, with added maintenance complexity and risk of introducing unintentional errors.

* 实际上 MySQL 不支持使用与 ASCII 字节不兼容的客户端字符集(因此 Base64 编码在它们的任何组合中始终保持一致),但是这个例子仍然可以说明差异BLOBTEXT 列类型之间,从而解释了为什么 TEXT 在技术上是正确的,即使 BLOB 实际上会正常工作(至少在 MySQL 添加对非 ASCII 兼容客户端字符集的支持之前).

* Actually MySQL doesn't support using client character sets that are not byte-compatible with ASCII (and therefore Base64 encodings will always be consistent across any combination of them), but this example nevertheless serves to illustrate the difference between BLOB and TEXT column types and thus explains why TEXT is technically correct for this purpose even though BLOB will actually work without error (at least until MySQL adds support for non-ASCII compatible client character sets).

相关文章