How to split a string by emoji in C++

2022-09-22 00:00:00 emoji c++

I'm trying to take a string of emojis and split it into a vector containing each emoji.

Given the string:

std::string emojis = "????????????????";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"??", "??", "??", "??", "??", "??", "??", "??"};

Edit

I have tried:

std::string emojis = "????????????????";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a few seconds it fails with terminate called after throwing an instance of 'std::bad_alloc'.
When I try to check how many emojis are in the string using:

std::string emojis = "????????????????";
std::cout << emojis.size() << std::endl; // returns 32

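it returns a larger number, which I assume is because of the Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where each emoji's data begins and ends, so that I can split the string at each emoji.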

Solution

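I would definitely recommend using a library with better Unicode support (all the large frameworks do this), but in a pinch you can get by with knowing that the UTF-8 encoding spreads a Unicode character over one or more bytes, and that the leading bits of the first byte determine how many bytes the character occupies.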

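I stole a function from Boost. The split_by_codepoint function walks the input string with an iterator, constructs a new string from the next N bytes (where N is determined by the byte-count function), and pushes it onto the ret vector.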

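#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

// Splits a UTF-8 string into one std::string per code point.
std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    // Clamp to the bytes that remain so a truncated sequence
    // cannot advance the iterator past the end of the string.
    const std::ptrdiff_t remaining = input.cend() - it;
    const std::ptrdiff_t count = std::min<std::ptrdiff_t>(
        utf8_byte_count(static_cast<uint8_t>(*it)), remaining);
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}

int main() {
    // Up to C++17 a u8 literal is a char string; since C++20 it is
    // char8_t, so drop the prefix or convert when building as C++20.
    std::string emojis = u8"????????????????";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}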

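Note that this function simply splits a string into UTF-8 strings that each contain one code point. Determining whether a character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and check whether they fall in the right range.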
