Columns misaligned when adding spaCy output to an existing DataFrame
Question
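I have a CSV containing a column of article titles, and I am using spaCy to extract any person names that appear in the titles. When I try to add a new column to the CSV with the names spaCy extracted, they do not line up with the rows they were extracted from.
I believe this is because the spaCy results have their own index, independent of the index of the original data.
I have tried adding , index=df.index) to the line that creates the new column, but I get "ValueError: Length of passed values is 2, index implies 10."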
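Both symptoms can be reproduced with pandas alone. The following is a minimal sketch using made-up stand-in data (five placeholder titles and two names, none of them from the real CSV); it assumes only that the extracted names arrive as a flat list shorter than the DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data: five titles, two extracted names.
df = pd.DataFrame({"article_title": ["a", "b", "c", "d", "e"]})
names = ["Hannah Ward", "Dylan Mulvaney"]

# A Series built from a plain list gets its own index 0..1, so the
# assignment aligns on those labels and fills rows 0 and 1,
# regardless of which rows the names actually came from.
df["artist_names"] = pd.Series(names)
filled_rows = df.index[df["artist_names"].notna()].tolist()

# Forcing the DataFrame's index onto the shorter list fails, because
# the number of values must match the number of index labels.
try:
    pd.Series(names, index=df.index)
    length_error = None
except ValueError as exc:
    length_error = exc
```

Here `filled_rows` ends up holding only the first two row labels, and `length_error` holds the ValueError, which suggests the flat list has already lost the information about which row each name belongs to.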
How can I align the spaCy output with the rows it came from?
Here is my code:
import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
                  usecols=['article_title']))
article = [_ for _ in df['article_title']]
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())
This is the resulting DataFrame:
   article_title                                      artist_names
0  "They’re like, is that? Oh it’s!" – ...            (Hannah, Ward)
1  Billed as London’s biggest public festival of ...  (Dylan, Mulvaney)
2  Transport yourself back to the dusky skies and...  NaN
3  Turning to art at the beginning of quarantine ...  NaN
4  Dylan Mulvaney, head of design at Gretel, expl...  NaN
This is what I expected:
   article_title                                      artist_names
0  "They’re like, is that? Oh it’s!" – ...            (Hannah, Ward)
1  Billed as London’s biggest public festival of ...  NaN
2  Transport yourself back to the dusky skies and...  NaN
3  Turning to art at the beginning of quarantine ...  NaN
4  Dylan Mulvaney, head of design at Gretel, expl...  (Dylan, Mulvaney)
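You can see that the second value in the artist_names column actually relates to the fifth article title. How can I make them line up?
Thanks for your help.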
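Answer
I would iterate over the articles, detect the entities in each one separately, and collect the detected entities in a list with one element per article: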
nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]
entities_by_article = []
# Process each title as its own doc so entities stay tied to their row
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    # Append once per article, even when no PERSON entity was found
    entities_by_article.append(people)
df['artist_names'] = pd.Series(entities_by_article)
Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping over a list of texts, and can be replaced with:
for a in article:
    doc = nlp(a)
    ## rest of code within loop
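Two details of the approach above may be worth illustrating: the per-article list only lines up with a bare pd.Series while df has the default 0..n-1 index, and articles with no person get an empty list rather than the NaN shown in the expected output. The sketch below is pandas-only and rests on assumptions: find_people is a hypothetical stand-in for the spaCy PERSON filter, and the titles and index labels are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index, e.g. after filtering rows.
df = pd.DataFrame(
    {"article_title": ["Hannah Ward on type", "A festival", "Dylan Mulvaney explains"]},
    index=[10, 20, 30],
)

def find_people(title):
    # Stand-in for the spaCy step; a real version would run nlp(title)
    # and keep the entities whose label_ is "PERSON".
    known = ["Hannah Ward", "Dylan Mulvaney"]
    return [name for name in known if name in title]

# One result per row, in row order.
people_by_article = [find_people(t) for t in df["article_title"]]

# Reusing df.index keeps each result on its source row; empty lists
# become NaN to match the output the question expected.
df["artist_names"] = pd.Series(
    [p if p else np.nan for p in people_by_article], index=df.index
)
```

Passing index=df.index works here because there is exactly one element per row, which is the condition the original attempt did not meet.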