nltk.tag package¶

Submodules¶

nltk.tag.api module¶

Interface for tagging each token in a sentence with supplementary information, such as its part of speech.

class nltk.tag.api.FeaturesetTaggerI[source]

Bases: nltk.tag.api.TaggerI

A tagger that requires tokens to be featuresets. A featureset is a dictionary that maps from feature names to feature values. See nltk.classify for more information about features and featuresets.
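For example, a single token might be represented by a featureset like this (an illustrative sketch; the feature names are arbitrary):

>>> token = {'word': 'walk', 'suffix': 'lk', 'prev-tag': 'TO'}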

class nltk.tag.api.TaggerI[source]

Bases: object

A processing interface for assigning a tag to each token in a list. Tags are case sensitive strings that identify some property of each token, such as its part of speech or its sense.

Some taggers require specific types for their tokens. This is generally indicated by the use of a sub-interface to TaggerI. For example, featureset taggers, which are subclassed from FeaturesetTaggerI, require that each token be a featureset.

Subclasses must define:
  • either tag() or tag_sents() (or both)
evaluate(gold)[source]

Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.

Parameters:gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
Return type:float
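For example, a DefaultTagger that assigns 'NN' to every token gets exactly one of the two tokens in this hand-built gold sentence right (a minimal sketch):

>>> from nltk.tag import DefaultTagger
>>> gold = [[('the', 'DT'), ('dog', 'NN')]]
>>> DefaultTagger('NN').evaluate(gold)
0.5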
tag(tokens)[source]

Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).

Return type:list(tuple(str, str))
tag_sents(sentences)[source]

Apply self.tag() to each element of sentences. I.e.:

return [self.tag(sent) for sent in sentences]
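A subclass that defines only tag() inherits working tag_sents() and evaluate() implementations. A minimal sketch (the class name and the suffix rule are illustrative, not part of nltk):

>>> from nltk.tag.api import TaggerI
>>> class SuffixTagger(TaggerI):
...     def tag(self, tokens):
...         # toy rule: words ending in 'ly' are adverbs, everything else nouns
...         return [(t, 'RB' if t.endswith('ly') else 'NN') for t in tokens]
>>> SuffixTagger().tag_sents([['run', 'quickly']])
[[('run', 'NN'), ('quickly', 'RB')]]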

nltk.tag.brill module¶

class nltk.tag.brill.BrillTagger(initial_tagger, rules, training_stats=None)[source]

Bases: nltk.tag.api.TaggerI

Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text, and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the TagRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using one of the TaggerTrainers available.
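For example, a tagger can be assembled directly from an initial tagger and a previously learned rule list (a sketch reusing the baseline tagger and tagger1 from the BrillTaggerTrainer doctest below):

>>> from nltk.tag.brill import BrillTagger
>>> retagger = BrillTagger(baseline, tagger1.rules())
>>> retagger.tag_sents(testing_data) == tagger1.tag_sents(testing_data)
True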

batch_tag_incremental(sequences, gold)[source]

Tags by applying each rule to the entire corpus (rather than all rules to a single sequence). The point is to collect statistics on the test set for individual rules.

NOTE: This is inefficient (it builds no index, so it will traverse the entire corpus N times for N rules). Usually you would not care about statistics for individual rules and would use tag_sents() instead.

Parameters:
  • sequences (list of list of strings) – lists of token sequences (sentences, in some applications) to be tagged
  • gold (list of list of strings) – the gold standard
Returns:

tuple of (tagged_sequences, ordered list of rule scores (one for each rule))

classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.BrillTagger'
print_template_statistics(test_stats=None, printunused=True)[source]

Print a list of all templates, ranked according to efficiency.

If test_stats is available, the templates are ranked according to their relative contribution (summed over all rules created from a given template, weighted by score) to the performance on the test set. If test_stats is not given, statistics collected during training are used instead. There is also an unweighted measure (just counting the rules); this is less informative, though, as many low-score rules appear towards the end of training.

Parameters:
  • test_stats (dict of str -> any (but usually numbers)) – dictionary of statistics collected during testing
  • printunused (bool) – if True, print a list of all unused templates
Returns:None
Return type:None

rules()[source]

Return the ordered list of transformation rules that this tagger has learnt

Returns:the ordered list of transformation rules that correct the initial tagging
Return type:list of Rules
tag(tokens)[source]
train_stats(statistic=None)[source]

Return a named statistic collected during training, or a dictionary of all available statistics if no name is given

Parameters:statistic (str) – name of statistic
Returns:some statistic collected during training of this tagger
Return type:any (but usually a number)
class nltk.tag.brill.Pos(positions, end=None)[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the tags of nearby tokens.

static extract_property(tokens, index)[source]

Returns:the given token’s tag

json_tag = 'nltk.tag.brill.Pos'
class nltk.tag.brill.Word(positions, end=None)[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the text (word) of nearby tokens.

static extract_property(tokens, index)[source]

Returns:the given token’s text

json_tag = 'nltk.tag.brill.Word'
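Both features read their property off (token, tag) pairs, as a quick hand-built illustration shows:

>>> from nltk.tag.brill import Pos, Word
>>> sent = [('The', 'AT'), ('dog', 'NN'), ('barked', 'VBD')]
>>> Word.extract_property(sent, 1)
'dog'
>>> Pos.extract_property(sent, 1)
'NN'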
nltk.tag.brill.brill24()[source]

Return 24 templates of the seminal TBL paper, Brill (1995)

nltk.tag.brill.describe_template_sets()[source]

Print the available template sets in this demo, with a short description.

nltk.tag.brill.fntbl37()[source]

Return 37 templates taken from the POS tagging task of the fntbl distribution (www.cs.jhu.edu/~rflorian/fntbl/). (37 is the count after excluding a handful that do not condition on Pos[0]; fntbl can do that, but the current nltk implementation cannot.)

nltk.tag.brill.nltkdemo18()[source]

Return 18 templates, from the original nltk demo, in multi-feature syntax

nltk.tag.brill.nltkdemo18plus()[source]

Return 18 templates, from the original nltk demo, and additionally a few multi-feature ones (the motivation is easy comparison with nltkdemo18)
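Each of these helpers returns a ready-made list of nltk.tbl.template.Template objects that can be passed directly to a BrillTaggerTrainer, e.g.:

>>> from nltk.tag.brill import brill24
>>> templates = brill24()
>>> len(templates)
24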

nltk.tag.brill_trainer module¶

class nltk.tag.brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=None, ruleformat='str')[source]

Bases: object

A trainer for transformation-based (TBL) taggers.

train(train_sents, max_rules=200, min_score=2, min_acc=None)[source]

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.

#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import untag, RegexpTagger, BrillTaggerTrainer

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
... (r'.*able$', 'JJ'),                # adjectives
... (r'.*ness$', 'NN'),                # nouns formed from adjectives
... (r'.*ly$', 'RB'),                  # adverbs
... (r'.*s$', 'NNS'),                  # plural nouns
... (r'.*ing$', 'VBG'),                # gerunds
... (r'.*ed$', 'VBD'),                 # past tense verbs
... (r'.*', 'NN')                      # nouns (default)
... ])
>>> baseline = backoff #see NOTE1
>>> baseline.evaluate(gold_data) 
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)

>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
Finding initial useful rules...
    Found 845 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))
>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
>>> tagger1.print_template_statistics(printunused=False)
TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
TRAIN (   2417 tokens) initial  1775 0.2656 final:  1269 0.4750
#ID | Score (train) |  #Rules     | Template
--------------------------------------------
001 |   305   0.603 |   7   0.700 | Template(Pos([-1]),Word([0]))
000 |   201   0.397 |   3   0.300 | Template(Pos([-1]))
>>> tagger1.evaluate(gold_data) 
0.43996...
>>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)
>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True
>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]

# a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
Finding initial useful rules...
    Found 845 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger2.evaluate(gold_data)  
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

# NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
# with a RegexpTagger only as backoff. For instance,
# >>> baseline = UnigramTagger(baseline_data, backoff=backoff)
# However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
# between python versions. The simplistic backoff above is a workaround to make doctests
# get consistent input.

Parameters:
  • train_sents (list(list(tuple))) – training data
  • max_rules (int) – output at most max_rules rules
  • min_score (int) – stop training when no rules better than min_score can be found
  • min_acc (float or None) – discard any rule with lower accuracy than min_acc
Returns:the learned tagger
Return type:BrillTagger

nltk.tag.crf module¶

A module for POS tagging using CRFSuite

class nltk.tag.crf.CRFTagger(feature_func=None, verbose=False, training_opt={})[source]

Bases: nltk.tag.api.TaggerI

A POS tagger based on CRFSuite (https://pypi.python.org/pypi/python-crfsuite)

>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data,'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]
>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]] 
>>> ct.evaluate(gold_sentences) 
1.0

Setting the learned model file:

>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> ct.evaluate(gold_sentences)
1.0
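A custom feature extractor can be supplied via the feature_func constructor argument: a callable that takes the list of tokens and the index of the current token and returns a list of feature strings. A minimal sketch (the feature names here are arbitrary):

>>> def simple_features(tokens, idx):
...     word = tokens[idx]
...     return ['WORD_' + word, 'SUF_' + word[-2:]]
>>> ct = CRFTagger(feature_func=simple_features)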

set_model_file(model_file)[source]
tag(tokens)[source]
Tag a sentence using the Python CRFSuite tagger. NB: before using this function, the user should specify the model_file either by
  • training a new model using the train() function, or
  • using a pre-trained model set via the set_model_file() function

Parameters:tokens (list(str)) – the list of tokens to be tagged
Returns:the list of tagged tokens
Return type:list(tuple(str, str))

tag_sents(sents)[source]
Tag a list of sentences. NB: before using this function, the user should specify the model_file either by
  • training a new model using the train() function, or
  • using a pre-trained model set via the set_model_file() function

Parameters:sents (list(list(str))) – the list of sentences to be tagged
Returns:the list of tagged sentences
Return type:list(list(tuple(str, str)))

train(train_data, model_file)[source]

Train the CRF tagger using CRFSuite.

Parameters:
  • train_data (list(list(tuple(str, str)))) – the list of annotated sentences
  • model_file (str) – the model will be saved to this file