Given a regular expression, how would I generate all strings that match it?

2021-12-16 00:00:00 recursion regex parsing vector c++

I'm using a simple language of only (), |, spaces, and alpha characters.
Given a regular expression like the following:

(hello|goodbye) (world(s|)|)

How would I go about generating the following data?

hello worlds
hello world
hello 
goodbye worlds
goodbye world
goodbye

I'm not quite sure if I need to build a tree first, or if it can be done recursively. I'm stuck on what data structures to utilize, and how to generate the strings as I go. Will I have to keep a bunch of markers, and index back into partially built strings to concatenate more data on? I don't know how best to approach this problem. Would I need to read the whole expression first, and re-order it a certain way?

The function signature would look like this:

std::vector<std::string> Generate(std::string const&){
   //...
}

What would you suggest I do?


Let me clarify that the results should always be finite here. In my particular example, there are only 6 strings that the expression can ever match. I'm not sure my terminology is correct, but what I'm looking for is a full match of the expression, not any string that merely contains a matching substring.

Recommended answer

Somewhat following Kieveli's advice, I have come up with a working solution. Although not previously mentioned, it was also important for me to get a count of how many results could potentially be generated. I was using a Python script called "exrex", which I had found on GitHub. Embarrassingly, I did not realize that it could also count. Nonetheless, I implemented it as best I could in C++ using my simplified regular expression language. If you're interested in my solution, please read on.

From an object-oriented standpoint, I wrote a scanner to take the regular expression (a string) and convert it into a list of tokens (a vector of strings). The list of tokens was then sent to a parser, which generated an n-ary tree. All of this was packed inside an "expression generator" class that could take an expression and hold the parse tree, as well as the generated count.

The scanner was important because it tokenized the empty-string case, which you can see in my question appearing as "|)". Scanning also created a pattern of [word] [operation] [word] [operation] ... [word].
For example, scanning: "(hello|goodbye) (world(s|)|)"
will create: [][(][hello][|][goodbye][)][ ][(][world][(][s][|][][)][][|][][)][]

The parse tree was a vector of vectors of nodes: the outer vector holds the "or" alternatives, and each inner vector holds the nodes that are concatenated ("and") within one alternative. Each node in turn contains its own vector of vectors of nodes as children.
The orange cells represent the "or"s, and the other boxes that draw the connections represent the "and"s. Below is my code.

Node header

#pragma once
#include <string>
#include <vector>

class Function_Expression_Node{

public:
    Function_Expression_Node(std::string const& value_in = "", bool const& more_in = false);

    std::string value;
    bool more;
    std::vector<std::vector<Function_Expression_Node>> children;

};

Node source

#include "function_expression_node.hpp"

    Function_Expression_Node::Function_Expression_Node(std::string const& value_in, bool const& more_in)
    : value(value_in)
    , more(more_in)
    {}

Scanner header

#pragma once
#include <vector>
#include <string>

class Function_Expression_Scanner{

    public: Function_Expression_Scanner() = delete;
    public: static std::vector<std::string> Scan(std::string const& expression);

};

Scanner source

#include "function_expression_scanner.hpp"

#include <cctype>

std::vector<std::string> Function_Expression_Scanner::Scan(std::string const& expression){

    std::vector<std::string> tokens;
    std::string temp; //the word collected since the last operator

    for (auto const& it: expression){

        //an operator ends the current word (possibly empty), then is emitted itself
        if (it == '(' || it == '|' || it == ')'){
            tokens.push_back(temp);
            tokens.push_back(std::string(1,it));
            temp.clear();
        }

        //cast first: isalpha is undefined for negative char values
        else if (std::isalpha(static_cast<unsigned char>(it)) || it == ' '){
            temp+=it;
        }

        //any other character is silently ignored
    }

    //the trailing word (possibly empty)
    tokens.push_back(temp);

    return tokens;
}

Parser header

#pragma once
#include <string>
#include <vector>
#include "function_expression_node.hpp"

class Function_Expression_Parser{

    Function_Expression_Parser() = delete;

//get parse tree
public: static std::vector<std::vector<Function_Expression_Node>> Parse(std::vector<std::string> const& tokens, unsigned int & amount);
    private: static std::vector<std::vector<Function_Expression_Node>> Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount);
        private: static Function_Expression_Node Recursive_Build(std::vector<std::string>::const_iterator & it, int & total); //<- recursive

    //utility
    private: static bool Is_Word(std::string const& it);
};

Parser source

#include "function_expression_parser.hpp"

bool Function_Expression_Parser::Is_Word(std::string const& it){
    return (it != "(" && it != "|" && it != ")");
}
Function_Expression_Node Function_Expression_Parser::Recursive_Build(std::vector<std::string>::const_iterator & it, int & total){

    Function_Expression_Node sub_root("",true); //<- contains the full root
    std::vector<Function_Expression_Node> root;

    const auto begin = it;

    //calculate the amount
    std::vector<std::vector<int>> multiplies;
    std::vector<int> adds;
    int sub_amount = 1;

    //note: assumes balanced parentheses; an unmatched "(" would run it past the end of the tokens
    while(*it != ")"){

        //when we see a "WORD", add it.
        if(Is_Word(*it)){
            root.push_back(Function_Expression_Node(*it));
        }

        //when we see a "(", build the subtree,
        else if (*it == "("){
            ++it;
            root.push_back(Recursive_Build(it,sub_amount));

            //adds.push_back(sub_amount);
            //sub_amount = 1;
        }

        //else we see an "OR" and we do the split
        else{
            sub_root.children.push_back(root);
            root.clear();

            //store the sub amount
            adds.push_back(sub_amount);
            sub_amount = 1;
        }

        ++it;
    }

    //add the last bit, if there is any
    if (!root.empty()){
        sub_root.children.push_back(root);

        //store the sub amount
        adds.push_back(sub_amount);
    }
    if (!adds.empty()){
        multiplies.push_back(adds);
    }


    //calculate sub total
    int or_count = 0;
    for (auto const& it: multiplies){
        for (auto const& it2: it){
            or_count+=it2;
        }

        if (or_count > 0){
            total*=or_count;
        }
        or_count = 0;
    }

    /*
    std::cout << "---SUB FUNCTION---\n";
    for (auto it: multiplies){for (auto it2: it){std::cout << "{" << it2 << "} ";}std::cout << "\n";}std::cout << "--\n";
    std::cout << total << std::endl << '\n';
    */

    return sub_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Build_Parse_Tree(std::vector<std::string>::const_iterator & it, std::vector<std::string>::const_iterator const& end, unsigned int & amount){

    std::vector<std::vector<Function_Expression_Node>> full_root;
    std::vector<Function_Expression_Node> root;

    const auto begin = it;

    //calculate the amount
    std::vector<int> adds;
    int sub_amount = 1;
    int total = 0;

    while (it != end){

        //when we see a "WORD", add it.
        if(Is_Word(*it)){
            root.push_back(Function_Expression_Node(*it));
        }

        //when we see a "(", build the subtree,
        else if (*it == "("){
            ++it;
            root.push_back(Recursive_Build(it,sub_amount));

        }

        //else we see an "OR" and we do the split
        else{
            full_root.push_back(root);
            root.clear();

            //store the sub amount
            adds.push_back(sub_amount);
            sub_amount = 1;
        }

        ++it;
    }

    //add the last bit, if there is any
    if (!root.empty()){
        full_root.push_back(root);

        //store the sub amount
        adds.push_back(sub_amount);
        sub_amount = 1;
    }

    //calculate sub total
    for (auto const& it: adds){ total+=it; }

    /*
    std::cout << "---ROOT FUNCTION---\n";
    for (auto it: adds){std::cout << "[" << it << "] ";}std::cout << '\n';
    std::cout << total << std::endl << '\n';
    */
    amount = total;

    return full_root;
}
std::vector<std::vector<Function_Expression_Node>> Function_Expression_Parser::Parse(std::vector<std::string> const& tokens, unsigned int & amount){

    auto it = tokens.cbegin();
    auto end = tokens.cend();
    auto parse_tree = Build_Parse_Tree(it,end,amount);
    return parse_tree;
}
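
The counting woven through the parser above comes down to two rules: add the counts of the alternatives separated by "|", and multiply the counts of the groups concatenated inside one alternative. Here is a minimal sketch of just that arithmetic, working directly on the expression text instead of the token list. The names CountMatches, CountAlternation and CountBranch are made up for this illustration and are not part of my classes, and the sketch assumes balanced parentheses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of the classes in this answer.
// Assumption: parentheses are balanced and only the simple language is used.

// Counts one alternation: branch ('|' branch)*, stopping at ')' or end of input.
unsigned long CountAlternation(std::string const& expr, std::size_t& pos);

// Counts one branch: literal characters leave the count unchanged,
// each parenthesized group multiplies in its own alternation count.
unsigned long CountBranch(std::string const& expr, std::size_t& pos) {
    unsigned long product = 1;
    while (pos < expr.size() && expr[pos] != '|' && expr[pos] != ')') {
        if (expr[pos] == '(') {
            ++pos;                                   // skip '('
            product *= CountAlternation(expr, pos);  // "and": multiply
            if (pos < expr.size()) ++pos;            // skip ')'
        } else {
            ++pos;                                   // literal character
        }
    }
    return product;
}

unsigned long CountAlternation(std::string const& expr, std::size_t& pos) {
    unsigned long sum = CountBranch(expr, pos);
    while (pos < expr.size() && expr[pos] == '|') {
        ++pos;                                       // skip '|'
        sum += CountBranch(expr, pos);               // "or": add
    }
    return sum;
}

// Entry point: number of strings the whole expression generates.
unsigned long CountMatches(std::string const& expr) {
    std::size_t pos = 0;
    return CountAlternation(expr, pos);
}
```

For the expression in the question, CountMatches("(hello|goodbye) (world(s|)|)") works out as 2 * (2 + 1) = 6. Like the parser, this counts generated strings, so an expression such as "(a|a)" is counted as 2 even though both results are identical.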

Generator header

#pragma once
#include "function_expression_node.hpp"

class Function_Expression_Generator{

    //constructors
    public: Function_Expression_Generator(std::string const& expression);
    public: Function_Expression_Generator();

    //transformer
    void Set_New_Expression(std::string const& expression);

    //observers
    public: unsigned int Get_Count();
    //public: unsigned int Get_One_Word_Name_Count();
    public: std::vector<std::string> Get_Generations();
        private: std::vector<std::string> Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree);
            private: std::vector<std::string> Sub_Generate(std::vector<Function_Expression_Node> const& nodes);

private:
    std::vector<std::vector<Function_Expression_Node>> m_parse_tree;
    unsigned int amount = 0; //initialized so Get_Count() is well-defined after default construction

};

Generator source

#include "function_expression_generator.hpp"

#include "function_expression_scanner.hpp"
#include "function_expression_parser.hpp"

//constructors
Function_Expression_Generator::Function_Expression_Generator(std::string const& expression){
    auto tokens = Function_Expression_Scanner::Scan(expression);
    m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}
Function_Expression_Generator::Function_Expression_Generator(){}

//transformer
void Function_Expression_Generator::Set_New_Expression(std::string const& expression){
    auto tokens = Function_Expression_Scanner::Scan(expression);
    m_parse_tree = Function_Expression_Parser::Parse(tokens,amount);
}

//observers
unsigned int Function_Expression_Generator::Get_Count(){
    return amount;
}
std::vector<std::string> Function_Expression_Generator::Get_Generations(){
    return Generate(m_parse_tree);
}
std::vector<std::string> Function_Expression_Generator::Generate(std::vector<std::vector<Function_Expression_Node>> const& parse_tree){
    std::vector<std::string> results;
    std::vector<std::string> more;

    for (auto it: parse_tree){
        more = Sub_Generate(it);
        results.insert(results.end(), more.begin(), more.end());
    }

    return results;
}
std::vector<std::string> Function_Expression_Generator::Sub_Generate(std::vector<Function_Expression_Node> const& nodes){
    std::vector<std::string> results;
    std::vector<std::string> more;
    std::vector<std::string> new_results;

    results.push_back("");
    for (auto it: nodes){
        if (!it.more){
            for (auto & result: results){
                result+=it.value;
            }
        }
        else{
            more = Generate(it.children);
            for (auto m: more){
                for (auto r: results){
                    new_results.push_back(r+m);
                }
            }
            more.clear();
            results = new_results;
            new_results.clear();
        }
    }

    return results;
}
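
As for whether a tree is needed first: for a language this small it may not be. Below is a rough, tree-free sketch that walks the expression once and has each recursive call return every string its part can produce. It is not the code I actually used. Generate matches the signature from the question, while ExpandAlternation and ExpandBranch are names invented for this illustration, and balanced parentheses are assumed.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree-free variant, not part of the classes in this answer.
// Assumption: parentheses are balanced.

// Alternation: the union of the results of each branch separated by '|'.
std::vector<std::string> ExpandAlternation(std::string const& expr, std::size_t& pos);

// Concatenation: start from {""} and extend every partial result.
std::vector<std::string> ExpandBranch(std::string const& expr, std::size_t& pos) {
    std::vector<std::string> results{""};
    while (pos < expr.size() && expr[pos] != '|' && expr[pos] != ')') {
        if (expr[pos] == '(') {
            ++pos;                                    // skip '('
            std::vector<std::string> const group = ExpandAlternation(expr, pos);
            if (pos < expr.size()) ++pos;             // skip ')'
            // Cross product: every partial result followed by every group result.
            std::vector<std::string> combined;
            combined.reserve(results.size() * group.size());
            for (auto const& prefix : results)
                for (auto const& suffix : group)
                    combined.push_back(prefix + suffix);
            results = std::move(combined);
        } else {
            for (auto& r : results) r += expr[pos];   // literal: append to all
            ++pos;
        }
    }
    return results;
}

std::vector<std::string> ExpandAlternation(std::string const& expr, std::size_t& pos) {
    std::vector<std::string> results = ExpandBranch(expr, pos);
    while (pos < expr.size() && expr[pos] == '|') {
        ++pos;                                        // skip '|'
        std::vector<std::string> more = ExpandBranch(expr, pos);
        results.insert(results.end(), more.begin(), more.end());
    }
    return results;
}

std::vector<std::string> Generate(std::string const& expr) {
    std::size_t pos = 0;
    return ExpandAlternation(expr, pos);
}
```

Calling Generate("(hello|goodbye) (world(s|)|)") yields the six strings in the order listed in the question. Note that the last one is "goodbye " with a trailing space, since the space sits outside the optional group. The trade-off compared with keeping a parse tree is that nothing is retained between calls, so the expression is re-read on every generation.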

In conclusion, I recommend using exrex, or any other programs mentioned in this thread, if you need to generate matches for regular expressions.
