Columns do not align when adding spaCy output to an existing DataFrame

2022-05-15 00:00:00 python pandas spacy

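Problem description

I have a CSV with a column of article titles, and I use spaCy to extract any person names that appear in the titles. When I try to add a new column to the CSV with the names spaCy extracted, they do not line up with the rows they were extracted from.

I believe this is because the spaCy results have their own index, independent of the index of the original data.

I have tried adding , index=df.index) to the new-column line, but I get "ValueError: Length of passed values is 2, index implies 10."

How can I align the spaCy output with the rows it came from?

Here is my code: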

import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
                  usecols=['article_title']))
article = [_ for _ in df['article_title']]

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)

import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())

This is the resulting DataFrame:

                                       article_title       artist_names
0  "They’re like, is that? Oh it’s!" – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...  (Dylan, Mulvaney)
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...                NaN

This is what I expected:

                                       article_title       artist_names
0  "They’re like, is that? Oh it’s!" – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...                NaN
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...   (Dylan, Mulvaney)

You can see that the 5th value in the artist_names column relates to the 5th article title. How do I get them to align?

Thanks for your help.


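Solution

I would iterate over the articles, detect the entities in each article separately, and put the detected entities in a list with one element per article: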

nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]

entities_by_article = []
for doc in nlp.pipe(article):
  people = []
  for ent in doc.ents:
    if ent.label_ == "PERSON":
      people.append(ent)
  entities_by_article.append(people)

df['artist_names'] = pd.Series(entities_by_article)
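
To see why the original attempt drifted, it may help to look at how pandas aligns a Series when it is assigned as a column. The sketch below uses hand-written stand-in data (shortened titles and names loosely based on the question, not real spaCy output) and assumes the DataFrame has the default 0..n-1 index:

```python
import pandas as pd

# Stand-in data: shortened titles and hand-written "detected" names.
df = pd.DataFrame({"article_title": [
    "Hannah Ward on ...", "Billed as ...", "Transport yourself ...",
    "Turning to art ...", "Dylan Mulvaney, head of design ...",
]})

# Flat list of hits, as in the question: two names for five rows.
flat_people = ["Hannah Ward", "Dylan Mulvaney"]

# A Series built from the flat list gets index 0 and 1, so assignment
# places the names in rows 0 and 1 and fills the rest with NaN.
df["misaligned"] = pd.Series(flat_people)

# Forcing the DataFrame's index onto two values fails with a length mismatch.
try:
    pd.Series(flat_people, index=df.index)
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# One entry per row (an empty list when nothing was found) lines up.
per_row_people = [["Hannah Ward"], [], [], [], ["Dylan Mulvaney"]]
df["artist_names"] = pd.Series(per_row_people, index=df.index)

print(df[["misaligned", "artist_names"]])
```

The flat list loses the information about which title each name came from, which is why the per-article list in the answer is needed.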

Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping over a list of texts, and can be replaced with:

for a in article:
  doc = nlp(a)
  ## rest of code within loop
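
A possible variation stores plain strings (ent.text) rather than spaCy Span objects, which tends to be easier to write back to a CSV. To keep the sketch runnable without downloading en_core_web_lg, it assumes a blank English pipeline with a hand-written entity_ruler standing in for the statistical NER; the patterns and sample titles are illustrative only. With the real model, swapping the pipeline setup for spacy.load('en_core_web_lg') should leave the loop unchanged:

```python
import pandas as pd
import spacy

# Assumption: a blank pipeline plus rule-based patterns replaces the
# en_core_web_lg model so the example runs without a model download.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Hannah Ward"},
    {"label": "PERSON", "pattern": "Dylan Mulvaney"},
])

# Illustrative titles, not the real CSV contents.
df = pd.DataFrame({"article_title": [
    "Hannah Ward on her new typeface",
    "A festival of architecture returns",
    "Dylan Mulvaney, head of design at Gretel, explains",
]})

# nlp.pipe yields one Doc per input text, in input order, so the
# resulting list has exactly one entry per DataFrame row.
names_by_article = [
    [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    for doc in nlp.pipe(df["article_title"].astype(str))
]

# Reuse the DataFrame's own index so alignment does not depend on it
# being the default 0..n-1 range.
df["artist_names"] = pd.Series(names_by_article, index=df.index)
print(df)
```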
