Extracting city names from text using Python

2022-05-18 00:00:00 python validation normalization

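Problem description

I have a dataset in which one column is titled "What is your location and time zone?"

This means we have entries like:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.

and even

My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

Is there any way to extract the city, country and time zone from this?

I was thinking of creating an array (from an open-source dataset) containing all country names (including short forms) as well as city names and time zones. Then, if any word in the dataset matches a city, country, time zone or short form, it would be filled into a new column in the same dataset and counted.

Is this practical?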
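The lookup idea described above can be sketched roughly as follows. This is a minimal illustration, not a tested solution for the full dataset: the three lookup sets are tiny hand-made stand-ins for an open dataset, and `build_pattern`, `extract` and `PATTERNS` are hypothetical names introduced only for this example.

```python
import re
from collections import Counter

# Tiny hand-made lookup sets, for illustration only. A real run would load
# these from an open dataset of country, city and time zone names.
COUNTRIES = {"Denmark", "England", "Australia", "South Korea", "Norway"}
CITIES = {"Eugene", "Seoul", "London", "Boston"}
TIMEZONES = {"CET", "GMT", "UTC", "EDT"}


def build_pattern(names):
    # Try longer names first so a multi-word name is preferred over a
    # shorter name that overlaps it.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


PATTERNS = {
    "country": build_pattern(COUNTRIES),
    "city": build_pattern(CITIES),
    "timezone": build_pattern(TIMEZONES),
}


def extract(text):
    """Return {category: [matches in order of appearance]} for one entry."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


entries = [
    "Denmark, CET",
    "Location is Devon, England, GMT time zone",
    "My location is Eugene, Oregon for most of the year or in Seoul, "
    "South Korea depending on school holidays.",
]

# One dict per entry; each key could become a new column in the dataset.
rows = [extract(entry) for entry in entries]

# Count how often each country appears across all entries.
country_counts = Counter(name for row in rows for name in row["country"])

for entry, row in zip(entries, rows):
    print(row)
print(country_counts)
```

Exact matching like this is case-sensitive and cannot tell ambiguous names apart or cope with misspellings, so it is probably best treated as a first pass.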

=== Reply based on the NLTK answer ===

Running the same code as Alecxe, I get:

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
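One plausible reading of the final line, offered as an assumption rather than a confirmed diagnosis: a plain Windows path beginning with a drive letter appears to have reached `urlopen`, and URL parsing treats the text before the first colon as the scheme, which would yield the unknown type `c`. The sketch below shows that parsing behaviour using Python 3's standard library (the traceback itself comes from Python 2.7, where the equivalent parser lives in the `urlparse` module). The path in `model_path` is hypothetical and is not the real model location.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical path, for illustration only; not the actual model location.
model_path = r"C:\nltk_data\taggers\model.pickle"

# A bare Windows path parses as a URL whose scheme is the drive letter.
scheme = urlparse(model_path).scheme
print(scheme)

# Converting the path to a file: URI gives the URL machinery a scheme it knows.
file_uri = PureWindowsPath(model_path).as_uri()
print(file_uri)
```

If this reading is right, the problem lies in how the installed NLTK version builds the resource location on Windows rather than in the script itself, so trying a different NLTK release may be worth a look.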

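Solution

I would use Natural Language Processing, and what nltk has to offer, to extract the entities.

An example (largely based on this gist) that tokenizes each line of a file, splits it into chunks, and recursively looks for NE (named entity) labels in each chunk. More explanation here: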

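import nltk

# Requires the NLTK data packages for the tokenizer, POS tagger and NE chunker
# (they can be fetched with the interactive nltk.download()).


def extract_entity_names(t):
    """Recursively collect the text of every subtree labelled 'NE'."""
    entity_names = []

    # Leaves are (word, tag) tuples and have no label; only subtrees do.
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Join the words of the named-entity chunk into one string.
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names


with open('sample.txt', 'r') as f:
    for line in f:
        # Split the line into sentences, then words, then tag parts of speech.
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        # binary=True marks every named entity with the single label 'NE'.
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)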

For a sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

it prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output is not ideal, but it might be a good starting point for you.
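One gap visible in the output above is that numeric offsets such as GMT+1 or +10h UTC are not reported as entities. A minimal sketch of one way to pick them up, assuming a regular-expression pass alongside the NLTK step is acceptable. `OFFSET_RE` and `find_offsets` are hypothetical names, and the pattern only covers the forms that appear in sample.txt.

```python
import re

# Matches "GMT+1" / "UTC -5" style offsets, or "+10h UTC" style offsets.
OFFSET_RE = re.compile(
    r"\b(?:GMT|UTC)\s*[+-]\s*\d{1,2}\b"
    r"|[+-]\d{1,2}\s*h?\s*(?:GMT|UTC)\b"
)


def find_offsets(text):
    """Return every GMT/UTC offset found in the text, in order of appearance."""
    return OFFSET_RE.findall(text)


line_a = "Australia. Australian Eastern Standard Time. +10h UTC."
line_b = (
    "For the entire May I will be in London, United Kingdom (GMT+1). "
    "For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) "
    "with limited internet access."
)

offsets_a = find_offsets(line_a)
offsets_b = find_offsets(line_b)
print(offsets_a)
print(offsets_b)
```

The matches could then be appended to the entity list for the same line, which keeps places and offsets together per entry.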
