How do I create an effective content filter for certain posts?

2022-01-03 00:00:00 cron php wordpress ajax

I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on StackOverflow rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.

The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.

Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.

What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.

We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy and at last count our species profiles alone numbered more than 1.7 million words of information.

What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.

In addition, in my WordPress theme's functions.php, I have the following:

# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a 
# pop-up.

    include "filter.php";

# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.

    add_action( 'save_post', 'my_set_content_filter' );
    function my_set_content_filter( $post_id ) {
        if ( !wp_is_post_revision( $post_id ) ) {

            $post_type = get_post_type( $post_id );

            if ( $post_type == "species" || ( $post_type == "post" && in_category( "articles", $post_id ) ) || ( $post_type == "post" && in_category( "blogs", $post_id ) ) ) {
                //get the previous value
                $ids = get_option( 'my_updated_posts' );

                //add new value if necessary
                if( !in_array( $post_id, $ids ) ) {
                    $ids[] = $post_id;
                    update_option( 'my_updated_posts', $ids );
                }
            }
        }
    }

# ==============================================================================================
# Add the filter to WP_Cron.

    add_action( 'my_filter_posts_content', 'my_filter_content' );
    if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
        wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
    }

# ==============================================================================================
# Run the filter.

    function my_filter_content() {
        //check to see if posts need to be parsed
        if ( !get_option( 'my_updated_posts' ) )
            return false;

        //parse posts
        $ids = get_option( 'my_updated_posts' );

        update_option( 'error_check', $ids );

        foreach( $ids as $v ) {
            if ( get_post_status( $v ) == 'publish' )
                run_filter( $v );

            update_option( 'error_check', "filter has run at least once" );
        }

        //make sure no values have been added while loop was running
        $id_recheck = get_option( 'my_updated_posts' );
        my_close_out_filter( $ids, $id_recheck );

        //once all options, including any added during the running of what could be a long cronjob are done, remove the value and close out
        delete_option( 'my_updated_posts' );
        update_option( 'error_check', 'working m8' );
        return true;
    }

# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.

    function my_close_out_filter( $beginning_array, $end_array ) {
        $diff = array_diff( $beginning_array, $end_array );
        if( !empty ( $diff ) ) {
            foreach( $diff as $v ) {
                run_filter( $v );
            }
        }
        my_close_out_filter( $end_array, get_option( 'my_updated_posts' ) );
    }

The way this works, as (hopefully) described by the code's comments, is that each hour WordPress operates a cron job (which is like a false cron - works upon user hits, but that doesn't really matter as the timing isn't important) which runs the filter found above.

The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.

The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe that the problem lies with the filter itself, but with one of the functions that enables the filter - i.e. the cron job, or the function that chooses which posts are filtered, or the function which prepares the wordlists etc. for the filter.

Unfortunately, diagnosing the problem is quite difficult (that I can see), thanks to it running in the background and only on an hourly basis. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.

We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.

What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.

Is a Cron Job the answer? I can set up a .php file which runs every day, that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server at the time it ran?

Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?

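Considerations

  • The size of the database/information being affected, read and written
  • Which posts are filtered
  • The impact the filter has on the server, especially as I don't seem to be able to increase the WordPress memory limit above 32 MB
  • Whether the filter itself is efficient, effective and reliable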

This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.

Thanks in advance,

Recommended Answer

Do it when the profile is created.

Try reversing the whole process. Rather than checking the content for every one of your terms, check the content's words against your term lists.

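  1. Break the post content down into words (split on spaces) as it is entered.
  2. Eliminate duplicates, words shorter than the shortest word in the database, words longer than the longest, and words that are in a "common words" list that you keep.
  3. Check what is left against each table. If some of your tables include phrases with spaces, do a %text% search; otherwise do a straight match (much faster), or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow; no sense reinventing the wheel.)
  4. Create your links with the now dramatically smaller list.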
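The four steps above can be sketched roughly as follows. This is only an illustration in plain PHP, not your existing `run_filter`: the function names `find_candidate_terms` and `link_terms`, the `term => URL` array shape, and the ASCII-only word splitting are all assumptions made for the example.

```php
<?php
// Hypothetical helper: reduce a post to the few known terms that occur in it.
// $terms maps a lowercase term to its URL (an assumed structure, e.g. a cached PHP array).
function find_candidate_terms(string $content, array $terms, array $stopWords, int $minLen, int $maxLen): array
{
    // Step 1: strip markup, lowercase, and split on anything that is not a-z, 0-9, ' or -.
    $words = preg_split("/[^a-z0-9'-]+/", strtolower(strip_tags($content)), -1, PREG_SPLIT_NO_EMPTY);

    // Step 2: drop duplicates, words outside the length bounds, and common words.
    $stop = array_flip($stopWords);
    $unique = [];
    foreach (array_unique($words) as $word) {
        $len = strlen($word);
        if ($len < $minLen || $len > $maxLen || isset($stop[$word])) {
            continue;
        }
        $unique[] = $word;
    }

    // Step 3: straight match via array keys, which PHP stores as a hash table.
    $found = [];
    foreach ($unique as $word) {
        if (isset($terms[$word])) {
            $found[$word] = $terms[$word];
        }
    }
    return $found;
}

// Step 4: link only the terms that were found, in a single pass over the text.
// Assumption: $content is plain text here; real HTML would need care not to touch tags or existing links.
function link_terms(string $content, array $found): string
{
    if ($found === []) {
        return $content;
    }
    $alternatives = array_map(function ($term) {
        return preg_quote($term, '/');
    }, array_keys($found));
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

    return preg_replace_callback($pattern, function ($m) use ($found) {
        $url = $found[strtolower($m[1])];
        return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
    }, $content);
}
```

The point of the design is that the expensive part (the regular expression) only ever sees the handful of terms that survived steps 1 to 3, not all 3100 of them.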

You should be able to easily keep this under 1 second even as you move out to even 100,000 words you are checking against. I've done exactly this, without caching the word lists, for a Bayesian Filter before.

With the smaller list, even if the matching is greedy and gathers phrases that don't fully match (for example, "clown" will catch "clown loach"), the result should be only a few to a few dozen words with links. Doing a find and replace for those over a chunk of text will take no time at all.
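One way to picture the greedy "clown" catching "clown loach" step is to index the multi-word names by their first word, then confirm each candidate against the text. This is a sketch under assumptions: `build_first_word_index` and `phrases_in_text` are made-up names, and the matching is ASCII-only and case-insensitive.

```php
<?php
// Hypothetical helper: group phrases by their lowercase first word.
function build_first_word_index(array $phrases): array
{
    $index = [];
    foreach ($phrases as $phrase) {
        $first = explode(' ', strtolower($phrase), 2)[0];
        $index[$first][] = $phrase;
    }
    return $index;
}

// Return the phrases that really occur in $text, longest first.
function phrases_in_text(string $text, array $index): array
{
    $words = array_unique(preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY));

    $hits = [];
    foreach ($words as $word) {
        // Greedy step: every phrase starting with this word is a candidate.
        foreach ($index[$word] ?? [] as $phrase) {
            // Confirm the whole phrase is present before keeping it.
            if (preg_match('/\b' . preg_quote($phrase, '/') . '\b/i', $text) === 1) {
                $hits[] = $phrase;
            }
        }
    }

    // Longest first, so "clown loach" can be linked before the bare "clown".
    usort($hits, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });
    return $hits;
}
```

Only the phrases sharing a first word with the post are ever tested, so the number of regular expression checks stays small however long the full phrase list grows.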

The above doesn't really address your concern over the older profiles. You don't say exactly how many there are, just that there is a lot of text and that it is on 1400 to 3100 (both items) put together. This older content you could do based on popularity if you have the info. Or on date entered, newest first. Regardless the best way to do this is to write a script that suspends the time limit on PHP and just batch-runs a load/process/save on all the posts. If each one takes about 1 second (probably much less but worst case) you are talking 3100 seconds which is a little less than an hour.
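For scale, 3100 posts at a worst case of 1 second each is 3100 seconds, or roughly 52 minutes. A minimal sketch of such a batch run is below. `run_batches` is a hypothetical name, and the load, process and save steps are passed in as callables so the example stays plain PHP; in WordPress they would presumably wrap something like `get_post()` and `wp_update_post()`.

```php
<?php
// Hypothetical batch runner: load, process and save every post, a chunk at a time.
// Returns the number of posts whose content actually changed.
function run_batches(array $ids, int $batchSize, callable $load, callable $process, callable $save): int
{
    // Suspend PHP's execution time limit for this long-running script.
    if (function_exists('set_time_limit')) {
        set_time_limit(0);
    }

    $changed = 0;
    foreach (array_chunk($ids, $batchSize) as $batch) {
        foreach ($batch as $id) {
            $original = $load($id);
            $filtered = $process($original);

            // Only write back when the filter changed something.
            if ($filtered !== $original) {
                $save($id, $filtered);
                $changed++;
            }
        }
        // Free what we can between batches, given the tight memory limit.
        gc_collect_cycles();
    }
    return $changed;
}
```

One thing to watch if the save step uses `wp_update_post()`: it fires `save_post`, so with the hook shown in the question every processed post would be added back to `my_updated_posts`.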
