Split a string that contains letters and numbers not separated by any particular delimiter in PHP

2022-01-02 00:00:00 nlp string regex algorithm php

I am currently developing a web application that fetches the Twitter stream, and I am trying to build my own natural language processing for it.

Since my data comes from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without spaces.


"Hi, my name is Bob. I m 19yo and 170cm tall"


- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall


Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.


Simply put, what I need is a way to 'explode' each token that contains a number into chunks of digits and chunks of letters, without relying on a delimiter.

'123abc' would become ['123', 'abc']

'abc123' would become ['abc', '123']

'abc123xyz' would become ['abc', '123', 'xyz']


What is the best way to achieve this in PHP?

I found something close to it, but it is in C# and specifically for day/month splitting: "How do I split a string in C# based on letters and numbers".


You can use preg_split:

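$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Split on whitespace (with an optional preceding comma) or at any zero-width letter/digit boundary.
$parts = preg_split('/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i', $string);
var_dump($parts);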


When splitting at a digit-letter boundary, the regular expression match must be zero-width: the characters themselves must not be part of the match, because preg_split discards whatever the pattern matches. Zero-width lookarounds (lookbehind and lookahead) are useful for this.
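To see the lookarounds in isolation, here is a minimal sketch that applies only the two boundary alternatives to a single token. The helper name split_alnum is hypothetical and not part of the answer above; it assumes tokens contain only ASCII letters and digits.

```php
<?php
// Hypothetical helper (an illustration, not from the original answer):
// split one token at every letter/digit boundary.
function split_alnum(string $token): array
{
    // (?<=[a-z])(?=\d) matches the empty position between a letter and a digit.
    // (?<=\d)(?=[a-z]) matches the empty position between a digit and a letter.
    // Both are zero-width, so no character is consumed by the split.
    $pattern = '/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i';

    // PREG_SPLIT_NO_EMPTY drops empty pieces, should any appear.
    return preg_split($pattern, $token, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_alnum('123abc'));    // ['123', 'abc']
print_r(split_alnum('abc123'));    // ['abc', '123']
print_r(split_alnum('abc123xyz')); // ['abc', '123', 'xyz']
print_r(split_alnum('19yo'));      // ['19', 'yo']
```

Tokens made up only of letters or only of digits come back unchanged as a one-element array, so the helper can be applied to every token after an ordinary whitespace split.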

