Training spaCy for text classification
Problem description
After reading the docs and working through the tutorial, I figured I'd put together a small demo. But my model doesn't seem to train. Here is the code:
import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)
When I run this, the output suggests that very little is being learned:
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
That doesn't feel right: there should be a non-zero loss or some other meaningful signal. The predictions confirm it.
for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)
# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}
It feels like my code is missing something, but I can't figure out what.
Solution
If you have updated to spaCy 3, the code above will no longer work and needs to be migrated with a few changes. I have adapted cantdutchthis's example accordingly.
Summary of changes:
- The architecture is now chosen via a config. The old default was a bag-of-words model; the new default is a text ensemble that uses attention. Keep this in mind when tuning the model.
- Labels now need to be one-hot encoded.
- The add_pipe interface has changed slightly.
- nlp.update now expects Example objects instead of (text, annotation) tuples; see the sketch after this list.
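
To illustrate that last point, here is a minimal sketch of the new calling convention. It assumes a throwaway blank pipeline with the default textcat settings; the full, configured version follows below.

import spacy
from spacy.training import Example

# Minimal pipeline just to demonstrate the new update() call;
# the complete setup is in the example below.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")  # default single-label architecture
textcat.add_label("KAT0")
textcat.add_label("KAT1")
nlp.begin_training()

# Build a Doc from the raw text and pair it with its gold annotations.
doc = nlp.make_doc("We got a new kitten!")
example = Example.from_dict(doc, {"cats": {"KAT0": True, "KAT1": False}})

# nlp.update now takes a list of Example objects instead of parallel
# text/annotation lists.
losses = {}
nlp.update([example], losses=losses)
print(losses)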
import spacy
# Add imports for Example, as well as the textcat configs
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config
import random

# Labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}]
]

# Bag of words:
# config = Config().from_str(single_label_bow_config)
# Text ensemble with attention:
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")

# Now uses `add_pipe` instead of `create_pipe`
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        # Uses Example objects rather than text/annotation tuples
        examples = [Example.from_dict(doc, annotation) for doc, annotation in zip(
            texts, annotations
        )]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
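
After training, you can rerun the prediction check from the question to confirm the model now separates the two labels; the exact scores will vary from run to run.

for text, entities in TRAINING_DATA:
    print(text, nlp(text).cats)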