nltk.classify package

Submodules

nltk.classify.api module

Interfaces for labeling tokens with category labels (or “class labels”).

ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.

MultiClassifierI is a standard interface for “multi-category classification”, which is like single-category classification except that each text belongs to zero or more categories.

class nltk.classify.api.ClassifierI[source]

Bases: object

A processing interface for labeling tokens with a single category label (or “class”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the classifier chooses from must be fixed and finite.

Subclasses must define:
  • labels()
  • either classify() or classify_many() (or both)
Subclasses may define:
  • either prob_classify() or prob_classify_many() (or both)
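For illustration, a minimal sketch of a ClassifierI subclass (the class MostFrequentLabelClassifier and its training data are invented for this example, not part of nltk):

from collections import Counter

from nltk.classify.api import ClassifierI

class MostFrequentLabelClassifier(ClassifierI):
    """Toy classifier: always predicts the label that occurred most
    often in the training data, ignoring the featureset entirely."""

    def __init__(self, labeled_featuresets):
        counts = Counter(label for (fs, label) in labeled_featuresets)
        self._labels = sorted(counts)
        self._best = counts.most_common(1)[0][0]

    def labels(self):
        return self._labels

    def classify(self, featureset):
        return self._best

train = [({'len': 4}, 'short'), ({'len': 5}, 'short'), ({'len': 40}, 'long')]
clf = MostFrequentLabelClassifier(train)
print(clf.classify({'len': 12}))        # -> 'short'
print(clf.classify_many([{}, {}]))      # inherited default -> ['short', 'short']

Because classify() is defined, the default classify_many() is inherited for free; prob_classify() is left undefined, which is permitted by the interface.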
classify(featureset)[source]
Returns: the most appropriate label for the given featureset.
Return type: label
classify_many(featuresets)[source]

Apply self.classify() to each element of featuresets. I.e.:

return [self.classify(fs) for fs in featuresets]
Return type: list(label)
labels()[source]
Returns: the list of category labels used by this classifier.
Return type: list of (immutable)
prob_classify(featureset)[source]
Returns: a probability distribution over labels for the given featureset.
Return type: ProbDistI
prob_classify_many(featuresets)[source]

Apply self.prob_classify() to each element of featuresets. I.e.:

return [self.prob_classify(fs) for fs in featuresets]
Return type: list(ProbDistI)
class nltk.classify.api.MultiClassifierI[source]

Bases: object

A processing interface for labeling tokens with zero or more category labels (or “labels”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the multi-classifier chooses from must be fixed and finite.

Subclasses must define:
  • labels()
  • either classify() or classify_many() (or both)
Subclasses may define:
  • either prob_classify() or prob_classify_many() (or both)
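For illustration, a minimal sketch of a MultiClassifierI subclass (KeywordTagger and its trigger table are invented for this example):

from nltk.classify.api import MultiClassifierI

class KeywordTagger(MultiClassifierI):
    """Toy multi-classifier: assigns every label whose trigger
    feature is present in the featureset (possibly no labels at all)."""

    def __init__(self, triggers):
        self._triggers = triggers   # maps label -> feature name

    def labels(self):
        return sorted(self._triggers)

    def classify(self, featureset):
        return {label for (label, fname) in self._triggers.items()
                if featureset.get(fname)}

tagger = KeywordTagger({'sports': 'contains(goal)', 'finance': 'contains(stock)'})
print(tagger.classify({'contains(goal)': True}))   # -> {'sports'}
print(tagger.classify({}))                         # -> set()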
classify(featureset)[source]
Returns: the most appropriate set of labels for the given featureset.
Return type: set(label)
classify_many(featuresets)[source]

Apply self.classify() to each element of featuresets. I.e.:

return [self.classify(fs) for fs in featuresets]
Return type: list(set(label))
labels()[source]
Returns: the list of category labels used by this classifier.
Return type: list of (immutable)
prob_classify(featureset)[source]
Returns: a probability distribution over sets of labels for the given featureset.
Return type: ProbDistI
prob_classify_many(featuresets)[source]

Apply self.prob_classify() to each element of featuresets. I.e.:

return [self.prob_classify(fs) for fs in featuresets]
Return type: list(ProbDistI)

nltk.classify.decisiontree module

A classifier model that decides which label to assign to a token on the basis of a tree structure, where branches correspond to conditions on feature values, and leaves correspond to label assignments.

class nltk.classify.decisiontree.DecisionTreeClassifier(label, feature_name=None, decisions=None, default=None)[source]

Bases: nltk.classify.api.ClassifierI

static best_binary_stump(feature_names, labeled_featuresets, feature_values, verbose=False)[source]
static best_stump(feature_names, labeled_featuresets, verbose=False)[source]
static binary_stump(feature_name, feature_value, labeled_featuresets)[source]
classify(featureset)[source]
error(labeled_featuresets)[source]
labels()[source]
static leaf(labeled_featuresets)[source]
pretty_format(width=70, prefix='', depth=4)[source]

Return a string containing a pretty-printed version of this decision tree. Each line in this string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the decision tree.

pseudocode(prefix='', depth=4)[source]

Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.

refine(labeled_featuresets, entropy_cutoff, depth_cutoff, support_cutoff, binary=False, feature_values=None, verbose=False)[source]
static stump(feature_name, labeled_featuresets)[source]
static train(labeled_featuresets, entropy_cutoff=0.05, depth_cutoff=100, support_cutoff=10, binary=False, feature_values=None, verbose=False)[source]
Parameters: binary – If true, then treat all feature/value pairs as individual binary features, rather than using a single n-way branch for each feature.
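For example, train() might be used on toy data as follows (a sketch; the weather features are invented):

from nltk.classify.decisiontree import DecisionTreeClassifier

train_data = [
    ({'outlook': 'sunny', 'windy': False}, 'play'),
    ({'outlook': 'sunny', 'windy': True},  'stay'),
    ({'outlook': 'rainy', 'windy': True},  'stay'),
    ({'outlook': 'rainy', 'windy': False}, 'play'),
] * 5   # repeated so each branch has enough supporting examples

tree = DecisionTreeClassifier.train(train_data, depth_cutoff=3, support_cutoff=2)
print(tree.classify({'outlook': 'sunny', 'windy': True}))   # -> 'stay'
print(tree.pretty_format(depth=2))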
unicode_repr

Return repr(self).

nltk.classify.decisiontree.demo()[source]
nltk.classify.decisiontree.f(x)[source]

nltk.classify.maxent module

A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if the estimated frequency with which a class and a feature vector value co-occur under that distribution is equal to the actual frequency observed in the data.

Terminology: ‘feature’

The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a “labeled” token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two different concepts:

  • An “input-feature” is a property of an unlabeled token.
  • A “joint-feature” is a property of a labeled token.

In the rest of the nltk.classify module, the term “features” is used to refer to what we will call “input-features” in this module.

In literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.

Converting Input-Features to Joint-Features

In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

joint_feat(token, label) = { 1 if input_feat(token) == feat_val
                           {      and label == some_label
                           {
                           { 0 otherwise

For all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
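This schema can be pictured directly in Python; make_joint_feature below is purely illustrative and is not part of this module:

def make_joint_feature(fname, fval, some_label):
    """Build one binary joint-feature from an input-feature name,
    a value for it, and a label, following the schema above."""
    def joint_feat(token_features, label):
        fires = token_features.get(fname) == fval and label == some_label
        return 1 if fires else 0
    return joint_feat

# Fires only when the 'prevword' input-feature is 'the' AND the label is 'NOUN'.
jf = make_joint_feature('prevword', 'the', 'NOUN')
print(jf({'prevword': 'the'}, 'NOUN'))   # -> 1
print(jf({'prevword': 'the'}, 'VERB'))   # -> 0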

class nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)[source]

Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that generates vectors containing binary joint-features of the form:

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    {
                    { 0 otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                    {      and l == label
                    {
                    { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    {
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.

describe(f_id)[source]
encode(featureset, label)[source]
labels()[source]
length()[source]
classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return a new feature encoding, based on a given training corpus train_toks. See the BinaryMaxentFeatureEncoding class description for a description of the joint-features that will be included in this encoding.

Parameters:
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature’s value is 1 for fewer than count_cutoff tokens in the training corpus, then that joint-feature is not included in the generated encoding.
  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
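For example, the encoding might be trained and inspected as follows (a sketch; the feature names and labels are invented):

from nltk.classify.maxent import BinaryMaxentFeatureEncoding

train_toks = [
    ({'prevword': 'the'}, 'NOUN'),
    ({'prevword': 'a'},   'NOUN'),
    ({'prevword': 'to'},  'VERB'),
]
encoding = BinaryMaxentFeatureEncoding.train(train_toks, alwayson_features=True)

print(encoding.length())    # total number of joint-features
# encode() returns a sparse vector as (feature_id, value) pairs:
for fid, val in encoding.encode({'prevword': 'the'}, 'NOUN'):
    print(fid, val, encoding.describe(fid))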
nltk.classify.maxent.ConditionalExponentialClassifier

Alias for MaxentClassifier.

class nltk.classify.maxent.FunctionBackedMaxentFeatureEncoding(func, length, labels)[source]

Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.

describe(fid)[source]
encode(featureset, label)[source]
labels()[source]
length()[source]
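For example, a user-supplied function might be provided as follows (a sketch; my_encoder and its feature ids are invented):

from nltk.classify.maxent import FunctionBackedMaxentFeatureEncoding

def my_encoder(featureset, label):
    """Map a (featureset, label) pair to a sparse joint-feature
    vector, given as a list of (feature_id, value) pairs."""
    vec = []
    if featureset.get('prevword') == 'the' and label == 'NOUN':
        vec.append((0, 1))
    if label == 'VERB':
        vec.append((1, 1))   # an always-on feature for VERB
    return vec

encoding = FunctionBackedMaxentFeatureEncoding(my_encoder, 2, ['NOUN', 'VERB'])
print(encoding.encode({'prevword': 'the'}, 'NOUN'))   # -> [(0, 1)]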
class nltk.classify.maxent.GISEncoding(labels, mapping, unseen_features=False, alwayson_features=False, C=None)[source]

Bases: nltk.classify.maxent.BinaryMaxentFeatureEncoding

A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:

  • At least one feature vector index must be nonzero for every token.
  • The feature vector must sum to a constant non-negative number for every token.
C

The non-negative constant that all encoded feature vectors will sum to.

describe(f_id)[source]
encode(featureset, label)[source]
length()[source]
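The sum-to-C property described above can be checked directly (a sketch on invented toy data):

from nltk.classify.maxent import GISEncoding

train_toks = [({'prevword': 'the'}, 'NOUN'), ({'prevword': 'to'}, 'VERB')]
encoding = GISEncoding.train(train_toks)

# Every encoded vector, correction feature included, sums to encoding.C.
for fs, label in train_toks:
    total = sum(val for (fid, val) in encoding.encode(fs, label))
    print(total == encoding.C)   # -> True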
class nltk.classify.maxent.MaxentClassifier(encoding, weights, logarithmic=True)[source]

Bases: nltk.classify.api.ClassifierI

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. With the default logarithmic weights, the probability of each label is then computed using the following equation:

                          exp(dotprod(weights, encode(fs,label)))
prob(label|fs) = ---------------------------------------------------------
                 sum(exp(dotprod(weights, encode(fs,l))) for l in labels)

Where dotprod is the dot product:

dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
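Spelled out as a sketch (maxent_prob is illustrative, not part of the module; with the default logarithmic weights the dot products act as log-scores and the normalization is a softmax):

import math

def maxent_prob(weights, encoding, fs, labels, label):
    """Compute prob(label|fs) from the equation above, treating the
    dot products as log-scores over the encoded joint-features."""
    def score(l):
        return sum(weights[fid] * val for (fid, val) in encoding.encode(fs, l))
    z = sum(math.exp(score(l)) for l in labels)
    return math.exp(score(label)) / z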
ALGORITHMS = ['GIS', 'IIS', 'MEGAM', 'TADM']

A list of the algorithm names that are accepted for the train() method’s algorithm parameter.

classify(featureset)[source]
explain(featureset, columns=4)[source]

Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.
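For example (a sketch on invented toy data; IIS and GIS are the pure-Python algorithms, while MEGAM and TADM rely on external tools):

from nltk.classify.maxent import MaxentClassifier

train_toks = [
    ({'prevword': 'the'}, 'NOUN'),
    ({'prevword': 'a'},   'NOUN'),
    ({'prevword': 'to'},  'VERB'),
] * 3

clf = MaxentClassifier.train(train_toks, algorithm='IIS', trace=0, max_iter=10)
print(clf.classify({'prevword': 'the'}))     # -> 'NOUN'
clf.explain({'prevword': 'the'}, columns=2)  # prints the per-feature table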

labels()[source]
prob_classify(featureset)[source]