Using SQL to determine word count stats of a text field

2021-11-20 00:00:00 sql mysql word-count text-processing

I've recently been working on some database search functionality and wanted to get some information, such as the average number of words per document (e.g. per text field in the database). The only thing I have found so far (without processing in a language of my choice outside the DB) is:

SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents

This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (hope to move to version 5 for this app soon), but am also interested in general solutions.

Thanks!

* I realize this is a pretty rough way to determine this, since it does not account for HTML in the content and the like. That's OK for this particular project, but are there better ways?
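
To make the roughness concrete, here is a small Python sketch that mirrors the arithmetic of the query (number of spaces plus one) and compares it with a plain whitespace split. The function names and sample strings are made up for illustration, and this only imitates the formula; it does not run against MySQL.

```python
def space_based_count(text: str) -> int:
    # Same arithmetic as LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1
    return len(text) - len(text.replace(" ", "")) + 1


def token_count(text: str) -> int:
    # Counts runs of non-whitespace characters (spaces, tabs, newlines all separate)
    return len(text.split())


# Hypothetical sample contents, chosen to hit the edge cases
samples = ["one two three", "one  two", "", "one\ntwo", "<p>one two</p>"]

for s in samples:
    print(repr(s), space_based_count(s), token_count(s))
```

The two counts agree on cleanly single-spaced text but diverge on doubled spaces, empty content, and words separated only by a newline. Neither approach strips HTML; tags simply end up glued to the neighbouring words.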

Update: To define what I mean by "better": either more accurate, performs more efficiently, or is more "correct" (easy to maintain, good practice, etc). For the content I have available, the query above is fast enough and is accurate for this project, but I may need something similar in the future (so I asked).

Recommended answer

The text handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but it will probably be slow, and stored functions require MySQL 5.0 or later, so they are not available on MySQL 4. Your best bet for processing the data within MySQL is to add a user-defined function. If you're going to build a newer version of MySQL anyway, you could also add a native function.

The "correct" way is to process the data outside the DB since DBs are for storage, not processing, and any heavy processing might put too much of a load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it when a document is changed?
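
As a rough illustration of that last suggestion, here is a minimal Python sketch that counts words in application code and stores the result next to the document whenever it is saved. Everything beyond the `documents` table and its `content` column is an assumption: the `id` and `word_count` columns, the `count_words` and `save_document` helpers, and the definition of a word as a run of ASCII letters and digits are all hypothetical. It uses an in-memory SQLite database purely as a stand-in so the example runs on its own; with MySQL you would use your usual driver, and the upsert statement would differ.

```python
import re
import sqlite3

# Assumed definition of a "word": a run of ASCII letters/digits.
# Changing this one pattern changes what counts as a word.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def count_words(text):
    return len(WORD_RE.findall(text or ""))


def save_document(conn, doc_id, content):
    # Hypothetical helper: writes the content and its word count together,
    # so the stored count is refreshed whenever the document changes.
    # INSERT OR REPLACE is SQLite syntax, used here only for the stand-in DB.
    conn.execute(
        "INSERT OR REPLACE INTO documents (id, content, word_count) "
        "VALUES (?, ?, ?)",
        (doc_id, content, count_words(content)),
    )


conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, content TEXT, word_count INTEGER)"
)

save_document(conn, 1, "The quick brown fox")
save_document(conn, 2, "Hello,  world!")
save_document(conn, 1, "The quick brown fox jumps")  # update refreshes the count

# The average is now a cheap aggregate over a precomputed column.
avg_words = conn.execute("SELECT AVG(word_count) FROM documents").fetchone()[0]
print(avg_words)
```

With the count precomputed, the statistics query no longer has to scan and measure every text field each time it runs.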

Example stored function:

DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
       RETURNS INT
       DETERMINISTIC
       SQL SECURITY INVOKER
       NO SQL
  BEGIN
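    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    -- Flags: is the current / previous character alphanumeric?
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx=char_length(str);
    SET idx = 1;
    -- Walk the string one character at a time (SUBSTRING is 1-based)
    WHILE idx <= maxIdx DO
        SET currChar=SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
        -- A word starts where a non-alphanumeric is followed by an alphanumeric
        IF NOT prevChar AND currChar THEN
            SET wordCnt=wordCnt+1;
        END IF;
        SET prevChar=currChar;
        SET idx=idx+1;
    END WHILE;
    RETURN wordCnt;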
  END
$$
DELIMITER ;
