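Part-of-Speech Tagging with NLTK

Why part-of-speech tagging matters

Think back to learning English: teachers introduce parts of speech early on. By working out the part of speech of a word in a sentence, we can infer what the word means and what role it plays in the sentence, which helps a great deal in understanding it. I am still a beginner myself, and I will add to this post if I come across further uses of part-of-speech information.

Tagged corpora

NLTK (3.2.5) ships with a number of texts that are already tagged with parts of speech. The following code shows one of them: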

import nltk
nltk.corpus.brown.tagged_words()
outputs:
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]

This means that The is tagged as AT and Fulton as NP-TL. Hard to read?

We can convert these into the names of a universal tagset:

nltk.corpus.brown.tagged_words(tagset='universal')
outputs:
[(u'The', u'DET'), (u'Fulton', u'NOUN'), ...]

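DET is a determiner and NOUN is a noun.

The two outputs differ because the tag symbols used by the corpus itself are not the same as the universal ones. Specifying tagset converts them to the universal symbols, and the mapping files used for the conversion can be found in ~/nltk_data/taggers/universal_tagset.

Looking through the source code, I found that the code above uses the en-brown.map file. Opening it shows: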

....
36 AT DET
....
294 NP-TL NOUN
...
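
To make the conversion concrete, here is a minimal plain-Python sketch of what such a mapping amounts to. It is not NLTK's own code: the helper names load_tag_map and to_universal are made up for illustration, the two entries are the ones shown in the excerpt above, and the sketch assumes each mapping line is a whitespace-separated pair of corpus tag and universal tag.

```python
# Two entries copied from the excerpt above; the real en-brown.map has many more.
# Assumption: each line is "<corpus tag> <universal tag>", separated by whitespace.
MAP_TEXT = """\
AT DET
NP-TL NOUN
"""


def load_tag_map(text):
    """Parse mapping lines into a dict from corpus tag to universal tag."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            source_tag, universal_tag = parts
            mapping[source_tag] = universal_tag
    return mapping


def to_universal(tagged_words, mapping, unknown="X"):
    """Replace each corpus tag with its universal tag.

    Tags missing from the mapping become `unknown` (a choice made for this sketch).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]


tag_map = load_tag_map(MAP_TEXT)
converted = to_universal([("The", "AT"), ("Fulton", "NP-TL")], tag_map)
print(converted)  # [('The', 'DET'), ('Fulton', 'NOUN')]
```
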

Other tagged corpora

nltk.corpus.sinica_treebank.tagged_words()
nltk.corpus.indian.tagged_words()
nltk.corpus.mac_morpho.tagged_words()
...

Taking indian as an example:

nltk.corpus.indian.tagged_words()
outputs:
[(u'\u09ae\u09b9\u09bf\u09b7\u09c7\u09b0', u'NN'), (u'\u09b8\u09a8\u09cd\u09a4\u09be\u09a8', u'NN'), ...]

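The output is shown as Unicode escape sequences, which is how Python 2 displays non-ASCII strings inside a list. Joining the words and tags into a single string and printing it shows the actual characters:

print(', '.join([word + '/' + tag for (word, tag) in nltk.corpus.indian.tagged_words()][:100]))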
outputs:
মহিষের/NN, সন্তান/NN, :/SYM, তোড়া/NNP, উপজাতি/NN, ৷/SYM, বাসস্থান-ঘরগৃহস্থালি/NN, তোড়া/NNP, ভাষায়/NN, গ্রামকেও/NN, বলে/VM, `/SYM, মোদ/NN, '/SYM, ৷/SYM, মোদের/NN, আয়তন/NN, খুব/INTF, বড়ো/JJ, নয়/VM, ৷/SYM, প্রতি/QF, মোদে/NN, আছে/VM, কিছু/QF, কুঁড়েঘর/NN, ,/SYM, সাধারণ/JJ, মহিষশালা/NN, ৷/SYM, আর/CC, গ্রামের/NN, বাইরে/NST, থাকে/VM, ডেয়ারি-মন্দির/NN, ৷/SYM, আয়তনের/NN, তারতম্য/NN, অনুসারে/PSP, গ্রামগুলি/NN, দু/QC, রকমের/NN, :/SYM, এতূডমোদ/NNP, (/SYM, বড়ো/JJ, গ্রাম/NN, )/SYM, ওকিনমোদ/NNP, (/SYM, ছোট/JJ, গ্রাম/NN, )/SYM, ৷/SYM, কোন/DEM, কোন/RDP, গ্রামের/NN, আবার/CC, ধর্মীয়/JJ, বা_Cমহিষের/NN, সন্তান/NN, :/SYM, তোড়া/NNP, উপজাতি/NN, ৷/SYM, িকে/PRP, বলে/VM, `/SYM, সোতি-মোদ/NNP, '/SYM, ৷/SYM, কুঁড়েঘরগুলির/NN, আকার/NN, বাংলার/NNP, বা/CC, ভারতের/NNP, অন্য/JJ, অঞ্চলের/NN, প্রচলিত/JJ, কুঁড়ে/NN, ঘর/NN, নয়/VM, ৷/SYM, এগুলি/PRP, দেখতে/NN, শোয়ানো/JJ, পিপের/NN, মতো/PSP, ৷/SYM, এক/QC, দিকের/PSP, বাঁশের/NN, কাঠামো/NN, খিলানের/NN, মতো/PSP, বেঁকে/JJ, গিয়ে/VM, অন্যদিকের/NN, মাটিতে/NN, মিশেছে/VM

Taggers

Using a tagger

NLTK provides a ready-made tagger that you can use directly:

text = nltk.word_tokenize('This beautiful future is just his imagination so far')
nltk.pos_tag(text, tagset='universal')
outputs:
[('This', u'DET'), ('beautiful', u'ADJ'), ('future', u'NOUN'), ('is', u'VERB'), ('just', u'ADV'), ('his', u'PRON'), ('imagination', u'NOUN'), ('so', u'ADV'), ('far', u'ADV')]

How accurate do you think this tagger is?

It seems to do quite well, so let's try another sentence:

text = nltk.word_tokenize('They refuse to permit us to obtain the refuse permit')
nltk.pos_tag(text, tagset='universal')
outputs:
[('They', u'PRON'), ('refuse', u'VERB'), ('to', u'PRT'), ('permit', u'VERB'), ('us', u'PRON'), ('to', u'PRT'), ('obtain', u'VERB'), ('the', u'DET'), ('refuse', u'NOUN'), ('permit', u'NOUN')]

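Of the two occurrences of refuse in the sentence, the first is tagged as a verb and the second as a noun, which is correct.

Why does this matter? In this second sentence the two occurrences of refuse are pronounced differently: the first is refUSE and the second is REFuse. A speech system therefore has to do part-of-speech tagging first in order to pronounce the word correctly.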
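
As a rough illustration of that idea, the sketch below picks a stress pattern from a (word, tag) pair. The STRESS lexicon and the stress_forms helper are hypothetical names invented for this example, and the lexicon contains only the two forms of refuse mentioned above. The tagged input is copied from the output shown earlier.

```python
# Tagged output copied from the example above (universal tagset).
tagged = [("They", "PRON"), ("refuse", "VERB"), ("to", "PRT"), ("permit", "VERB"),
          ("us", "PRON"), ("to", "PRT"), ("obtain", "VERB"), ("the", "DET"),
          ("refuse", "NOUN"), ("permit", "NOUN")]

# Hypothetical stress lexicon: capitals mark the stressed syllable.
STRESS = {
    ("refuse", "VERB"): "refUSE",
    ("refuse", "NOUN"): "REFuse",
}


def stress_forms(tagged_words, lexicon):
    """Look up each (word, tag) pair; words not in the lexicon are left unchanged."""
    return [lexicon.get((word.lower(), tag), word) for word, tag in tagged_words]


spoken = stress_forms(tagged, STRESS)
print(" ".join(spoken))  # They refUSE to permit us to obtain the REFuse permit
```

Without the tag, the lookup key would be the bare word refuse and the two pronunciations could not be told apart.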

Automatic tagging

To better understand how taggers work, let's build a part-of-speech tagger step by step. First, load the data:

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
brown_tagged_words = brown.tagged_words(categories='news')
brown_words = brown.words(categories='news')

Default tagger

This is the simplest tagger of all: it assigns the same tag to every token. Let's first see which tag is the most likely one:

tags = [tag for (word, tag) in brown_tagged_words]
nltk.FreqDist(tags).max()
outputs:
u'NN'

So nouns are the most common. Let's create a tagger that tags every word as a noun:

default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(nltk.word_tokenize('This beautiful future is just his imagination so far'))
outputs:
[('This', 'NN'), ('beautiful', 'NN'), ('future', 'NN'), ('is', 'NN'), ('just', 'NN'), ('his', 'NN'), ('imagination', 'NN'), ('so', 'NN'), ('far', 'NN')]

As you can see, everything is now tagged NN. Let's evaluate this tagger:

default_tagger.evaluate(brown_tagged_sents)
outputs:
0.13089484257215028

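This tagger is clearly poor: its tagging accuracy is only 13.1%.

Even so, when processing large amounts of text, most new words happen to be nouns, which means a default tagger can help make a language processing system more robust.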
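
To show what the accuracy figure measures, here is a small plain-Python sketch of a default tagger and its evaluation. It does not use NLTK; the helper names are invented for this example, and the ten tagged words are a hand-made sample for illustration, not data from the Brown corpus.

```python
from collections import Counter

# Tiny hand-made tagged sample (illustrative only, not taken from the Brown corpus).
gold = [("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
        ("election", "NN"), ("was", "BEDZ"), ("a", "AT"), ("success", "NN"),
        ("for", "IN"), ("city", "NN")]


def most_common_tag(tagged_words):
    """Return the tag that occurs most often in the tagged sample."""
    return Counter(tag for _, tag in tagged_words).most_common(1)[0][0]


def default_tag(words, tag):
    """Assign the same tag to every word."""
    return [(word, tag) for word in words]


def accuracy(predicted, reference):
    """Fraction of (word, tag) pairs that agree with the reference tagging."""
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


best = most_common_tag(gold)
predicted = default_tag([word for word, _ in gold], best)
score = accuracy(predicted, gold)
print(best, score)  # NN 0.4
```

The score is simply the share of tokens whose true tag happens to be the chosen default, which is why it cannot exceed the frequency of the most common tag.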

Regular expression tagger

In English we can often guess a word's part of speech from suffixes such as ness, ing and ed. Does that work here too? Let's find out.

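patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd person singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                     # nouns (default)
]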

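The patterns are tried in order; a word that matches none of the earlier patterns falls through to the last one and is tagged NN.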

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
outputs:
[(u'``', 'NN'), (u'Only', 'NN'), (u'a', 'NN'), (u'relative', 'NN'), (u'handful', 'NN'), (u'of', 'NN'), (u'such', 'NN'), (u'reports', 'NNS'), (u'was', 'NNS'), (u'received', 'VBD'), (u"''", 'NN'), (u',', 'NN'), (u'the', 'NN'), (u'jury', 'NN'), (u'said', 'NN'), (u',', 'NN'), (u'``', 'NN'), (u'considering', 'VBG'), (u'the', 'NN'), (u'widespread', 'NN'), (u'interest', 'NN'), (u'in', 'NN'), (u'the', 'NN'), (u'election', 'NN'), (u',', 'NN'), (u'the', 'NN'), (u'number', 'NN'), (u'of', 'NN'), (u'voters', 'NNS'), (u'and', 'NN'), (u'the', 'NN'), (u'size', 'NN'), (u'of', 'NN'), (u'this', 'NNS'), (u'city', 'NN'), (u"''", 'NN'), (u'.', 'NN')]

Let's evaluate it:

regexp_tagger.evaluate(brown_tagged_sents)
outputs:
0.20326391789486245

A little better than the default tagger.
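
The first-match-wins behaviour can be reproduced with the standard re module. This is only a sketch of the matching order, not NLTK's RegexpTagger itself; first_match_tag is a name invented for this example, and the pattern list is copied from the one above.

```python
import re

# Same patterns as above, tried from top to bottom.
PATTERNS = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),
]


def first_match_tag(word, patterns=PATTERNS):
    """Return the tag of the first pattern that matches the word."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None  # unreachable here, because the last pattern matches everything


for word in ["considering", "received", "goes", "was", "this", "the"]:
    print(word, first_match_tag(word))
```

The words was and this come out as NNS simply because they end in s, which matches the mistakes visible in the tagged sentence above. The word goes shows why order matters: it matches both the es and the s pattern, and the earlier one wins.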

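Lookup tagger

Although nouns are the most frequent tag, the most frequent words are not necessarily nouns. So we can take the 100 most frequent words and tag each of them with its most likely tag.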

fd = nltk.FreqDist(brown_words)
cfd = nltk.ConditionalFreqDist(brown_tagged_words)
most_freq_words = fd.most_common()[:100]
likely_tags = dict((word, cfd[word].max()) for (word, freq) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)
outputs:
0.45578495136941344

Even with only the top 100 words, the accuracy is already much higher than before.

Let's look at what it actually produces:

baseline_tagger.tag(brown_sents[3])
outputs:
[(u'``', u'``'), (u'Only', None), (u'a', u'AT'), (u'relative', None), (u'handful', None), (u'of', u'IN'), (u'such', None), (u'reports', None), (u'was', u'BEDZ'), (u'received', None), (u"''", u"''"), (u',', u','), (u'the', u'AT'), (u'jury', None), (u'said', u'VBD'), (u',', u','), (u'``', u'``'), (u'considering', None), (u'the', u'AT'), (u'widespread', None), (u'interest', None), (u'in', u'IN'), (u'the', u'AT'), (u'election', None), (u',', u','), (u'the', u'AT'), (u'number', None), (u'of', u'IN'), (u'voters', None), (u'and', u'CC'), (u'the', u'AT'), (u'size', None), (u'of', u'IN'), (u'this', u'DT'), (u'city', None), (u"''", u"''"), (u'.', u'.')]

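Many of the tags are None, which means the word is not among the 100 most frequent words. We can hand these words over to the default tagger, that is, tag them as NN. This hand-off is called backoff.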

baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
outputs:
0.5817769556656125

Accuracy immediately goes up by more than 10 percentage points!
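
Here is a minimal plain-Python sketch of the lookup-plus-backoff idea. It is not how NLTK implements UnigramTagger; build_lookup and lookup_tag are names invented for this example, and the sample data is hand-made for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(tagged_words, n):
    """Map each of the n most frequent words to its most frequent tag."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_freq = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_freq[word][tag] += 1
    return {word: tag_freq[word].most_common(1)[0][0]
            for word, _ in word_freq.most_common(n)}


def lookup_tag(words, table, backoff_tag=None):
    """Tag words from the table; words not in it get backoff_tag."""
    return [(word, table.get(word, backoff_tag)) for word in words]


# Hand-made sample (illustrative only).
sample = [("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
          ("number", "NN"), ("of", "IN"), ("voters", "NNS"), ("of", "IN"),
          ("the", "AT"), ("city", "NN")]

table = build_lookup(sample, 2)  # keeps only "the" and "of"
sentence = ["the", "size", "of", "this", "city"]

without_backoff = lookup_tag(sentence, table)
with_backoff = lookup_tag(sentence, table, backoff_tag="NN")
print(without_backoff)
print(with_backoff)
```

Without a backoff tag the unknown words come back as None; with it they are all tagged NN, which is right for size and city and wrong for this.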

What if we use more words? Here are the figures:

Number of high-frequency words    Accuracy
200                               0.5060962269029576
800                               0.6335401873620145
1600                              0.7067247449131809
3200                              0.7813513137219802

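So accuracy keeps improving as the number of words grows.

N-gram tagging

Unigram tagging

A unigram tagger uses a simple statistical algorithm: it assigns each word its most likely tag, without taking context into account.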

unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
outputs:
0.9349006503968017

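The accuracy looks quite high, but anyone who has studied machine learning will recognize that the tagger is being scored on the very data it was trained on, so this is overfitting. Not convinced? Let's try it on ten other categories of the corpus:

print("\n".join([cate + "\t" + str(unigram_tagger.evaluate(brown.tagged_sents(categories=cate))) for cate in brown.categories()[:10]]))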
outputs:
adventure 0.787891898128
belles_lettres 0.798707075842
editorial 0.813940653204
fiction 0.799147295877
government 0.807778427485
hobbies 0.771327949481
humor 0.793178151648
learned 0.772942690007
lore 0.798085204762
mystery 0.80790288443

Accuracy is now around 80%, a drop of more than 10 percentage points, which is a substantial effect.
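
The gap between the two numbers is easier to see on a toy example with an explicit split into training and held-out sentences. The sketch below is plain Python rather than NLTK; train_unigram and accuracy are names invented for this example, the four sentences are hand-made, and the 75/25 split is an arbitrary choice for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            correct += (model.get(word) == tag)  # unseen words predict None
    return correct / total


# Hand-made sentences (illustrative only).
sents = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("the", "AT"), ("city", "NN"), ("said", "VBD")],
    [("the", "AT"), ("jury", "NN"), ("voted", "VBD")],
    [("a", "AT"), ("voter", "NN"), ("said", "VBD")],
]

split = int(len(sents) * 0.75)           # first 3 sentences for training
train, held_out = sents[:split], sents[split:]

model = train_unigram(train)
train_acc = accuracy(model, train)       # scored on data the model has seen
held_acc = accuracy(model, held_out)     # scored on data it has not seen
print(train_acc, held_acc)
```

On the training sentences every word is known, so the score is perfect. On the held-out sentence only said has been seen before, so two of three tokens are missed.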

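General N-gram tagging

An N-gram tagger infers the tag from context. Suppose a stretch of text is $w_{n-2} w_{n-1} w_{n} w_{n+1}$ with corresponding tags $t_{n-2} t_{n-1} t_{n} t_{n+1}$. A trigram tagger (n=3) looks at the current word $w_{n}$ together with the tags $t_{n-2} t_{n-1}$ of the two preceding words in order to infer $t_{n}$.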
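
For the simplest case, where only the tag of the one preceding word is used, the idea can be sketched in plain Python as a table keyed by (previous tag, current word). This is an illustration, not NLTK's implementation: train_bigram, bigram_tag and the "<s>" start marker are choices made for this example. The training sentence and its tags are taken from the pos_tag output shown earlier.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for "no previous word" (a choice made for this sketch)


def train_bigram(tagged_sents):
    """Map (previous tag, word) to the tag seen most often in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def bigram_tag(words, model):
    """Tag left to right, using the tag just assigned as context for the next word."""
    result, prev = [], START
    for word in words:
        tag = model.get((prev, word))  # None when the context was never seen
        result.append((word, tag))
        prev = tag
    return result


train_sent = [("They", "PRON"), ("refuse", "VERB"), ("to", "PRT"), ("permit", "VERB"),
              ("us", "PRON"), ("to", "PRT"), ("obtain", "VERB"), ("the", "DET"),
              ("refuse", "NOUN"), ("permit", "NOUN")]
model = train_bigram([train_sent])

seen = bigram_tag([word for word, _ in train_sent], model)
unseen = bigram_tag("They obtain the refuse permit".split(), model)
print(seen)
print(unseen)
```

On the training sentence, the context lets the model tell the two uses of refuse apart. On the new sentence, obtain after a pronoun was never seen, so it gets None, and since None is then the context for the next word, every later word gets None as well.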

Here is a bigram tagger (which considers only the previous word):

bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.evaluate(brown_tagged_sents)
outputs:
0.7860751437038805

Trigram tagger:

trigram_tagger = nltk.TrigramTagger(brown_tagged_sents)
trigram_tagger.evaluate(brown_tagged_sents)
outputs:
0.8223641028700996

The accuracy is a little higher than the bigram tagger's. Let's see how it does on other text:

print("\n".join([cate + "\t" + str(trigram_tagger.evaluate(brown.tagged_sents(categories=cate))) for cate in brown.categories()[:10]]))
outputs:
adventure 0.0947189293646
belles_lettres 0.0632885797477
editorial 0.0675930134407
fiction 0.0889498890317
government 0.0531682758817
hobbies 0.0558503855729
humor 0.0754551740032
learned 0.0566172589726
lore 0.0623759054933
mystery 0.096993125645

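Well, I suspect I've been using a fake tagger! An N-gram tagger with no backoff returns None for any context it did not see during training, and in new text unseen contexts are the norm.

Let's combine the taggers we have built: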

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)
t2.evaluate(brown_tagged_sents)
outputs:
0.9730592517453309

Hmm, the accuracy looks good. Let's try it on the other categories:

print("\n".join([cate + "\t" + str(t2.evaluate(brown.tagged_sents(categories=cate))) for cate in brown.categories()[:10]]))
outputs:
adventure 0.835626315941
belles_lettres 0.840522022462
editorial 0.849977274203
fiction 0.84151968228
government 0.844089165252
hobbies 0.825283866659
humor 0.839041253745
learned 0.836756685433
lore 0.844033037471
mystery 0.85128303801

Not bad: accuracy still drops by more than 10 percentage points, but it stays at roughly 83% to 85%.
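
To see why the combination holds up so much better, here is a plain-Python sketch of the same chain: try the bigram table first, then the unigram table, then a fixed default. It only illustrates the order of the fallbacks and is not NLTK's implementation. The function names and the "<s>" start marker are invented for this example, and because the toy training sentence uses universal tags, the default here is NOUN rather than NN.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for "no previous word" (a choice made for this sketch)


def train(tagged_sents, use_context):
    """Build a unigram table (word -> tag) or a bigram table ((prev tag, word) -> tag)."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            key = (prev, word) if use_context else word
            counts[key][tag] += 1
            prev = tag
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def tag_with_backoff(words, bigram, unigram, default_tag):
    """Try the bigram table, then the unigram table, then the default tag."""
    result, prev = [], START
    for word in words:
        tag = bigram.get((prev, word))
        if tag is None:
            tag = unigram.get(word)
        if tag is None:
            tag = default_tag
        result.append((word, tag))
        prev = tag
    return result


# Training sentence and tags taken from the pos_tag output shown earlier.
train_sent = [("They", "PRON"), ("refuse", "VERB"), ("to", "PRT"), ("permit", "VERB"),
              ("us", "PRON"), ("to", "PRT"), ("obtain", "VERB"), ("the", "DET"),
              ("refuse", "NOUN"), ("permit", "NOUN")]
bigram = train([train_sent], use_context=True)
unigram = train([train_sent], use_context=False)

tagged = tag_with_backoff("They obtain the refuse permit now".split(),
                          bigram, unigram, default_tag="NOUN")
print(tagged)
```

Here obtain is rescued by the unigram table, so the following words again have a usable context, and the unknown word now falls through to the default. No word is left as None.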
