nltk.classify package¶
Submodules¶
nltk.classify.api module¶
Interfaces for labeling tokens with category labels (or “class labels”).

ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.

MultiClassifierI is a standard interface for “multi-category classification”, which is like single-category classification except that each text belongs to zero or more categories.
class nltk.classify.api.ClassifierI [source]¶
Bases: object

A processing interface for labeling tokens with a single category label (or “class”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the classifier chooses from must be fixed and finite.

Subclasses must define:
- labels()
- either classify() or classify_many() (or both)

Subclasses may define:
- either prob_classify() or prob_classify_many() (or both)

- classify(featureset) [source]¶
  Returns: the most appropriate label for the given featureset.
  Return type: label

- classify_many(featuresets) [source]¶
  Apply self.classify() to each element of featuresets. I.e.:
  return [self.classify(fs) for fs in featuresets]
  Return type: list(label)

- labels() [source]¶
  Returns: the list of category labels used by this classifier.
  Return type: list of (immutable)

- prob_classify(featureset) [source]¶
  Returns: a probability distribution over labels for the given featureset.
  Return type: ProbDistI

- prob_classify_many(featuresets) [source]¶
  Apply self.prob_classify() to each element of featuresets. I.e.:
  return [self.prob_classify(fs) for fs in featuresets]
  Return type: list(ProbDistI)
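The contract above can be illustrated without any NLTK machinery. The class below (a hypothetical `MostFrequentLabelClassifier`, not part of NLTK) is a minimal sketch: it implements labels() and classify(), and defines classify_many() in terms of classify() exactly the way the interface documents, assuming featuresets are plain dicts.

```python
from collections import Counter

class MostFrequentLabelClassifier:
    """Toy classifier following the ClassifierI contract: a fixed, finite
    label set, classify() for one featureset, classify_many() for a batch."""

    def __init__(self, labeled_featuresets):
        # Record every label seen in training; the label set is then fixed.
        counts = Counter(label for _, label in labeled_featuresets)
        self._labels = sorted(counts)
        # Always predict the most frequent training label.
        self._best = counts.most_common(1)[0][0]

    def labels(self):
        return self._labels

    def classify(self, featureset):
        return self._best

    def classify_many(self, featuresets):
        # Same default as the interface: delegate to classify().
        return [self.classify(fs) for fs in featuresets]

train = [({'len': 3}, 'pos'), ({'len': 5}, 'pos'), ({'len': 2}, 'neg')]
clf = MostFrequentLabelClassifier(train)
print(clf.labels())                      # ['neg', 'pos']
print(clf.classify_many([{'len': 9}]))   # ['pos']
```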
class nltk.classify.api.MultiClassifierI [source]¶
Bases: object

A processing interface for labeling tokens with zero or more category labels (or “labels”). Labels are typically strs or ints, but can be any immutable type. The set of labels that the multi-classifier chooses from must be fixed and finite.

Subclasses must define:
- labels()
- either classify() or classify_many() (or both)

Subclasses may define:
- either prob_classify() or prob_classify_many() (or both)

- classify(featureset) [source]¶
  Returns: the most appropriate set of labels for the given featureset.
  Return type: set(label)

- classify_many(featuresets) [source]¶
  Apply self.classify() to each element of featuresets. I.e.:
  return [self.classify(fs) for fs in featuresets]
  Return type: list(set(label))

- labels() [source]¶
  Returns: the list of category labels used by this classifier.
  Return type: list of (immutable)

- prob_classify(featureset) [source]¶
  Returns: a probability distribution over sets of labels for the given featureset.
  Return type: ProbDistI

- prob_classify_many(featuresets) [source]¶
  Apply self.prob_classify() to each element of featuresets. I.e.:
  return [self.prob_classify(fs) for fs in featuresets]
  Return type: list(ProbDistI)
nltk.classify.decisiontree module¶
A classifier model that decides which label to assign to a token on the basis of a tree structure, where branches correspond to conditions on feature values, and leaves correspond to label assignments.
class nltk.classify.decisiontree.DecisionTreeClassifier(label, feature_name=None, decisions=None, default=None) [source]¶
Bases: nltk.classify.api.ClassifierI

- static best_binary_stump(feature_names, labeled_featuresets, feature_values, verbose=False) [source]¶

- static best_stump(feature_names, labeled_featuresets, verbose=False) [source]¶

- static binary_stump(feature_name, feature_value, labeled_featuresets) [source]¶

- classify(featureset) [source]¶

- error(labeled_featuresets) [source]¶

- labels() [source]¶

- static leaf(labeled_featuresets) [source]¶

- pretty_format(width=70, prefix='', depth=4) [source]¶
  Return a string containing a pretty-printed version of this decision tree. Each line in this string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the decision tree.

- pseudocode(prefix='', depth=4) [source]¶
  Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.

- refine(labeled_featuresets, entropy_cutoff, depth_cutoff, support_cutoff, binary=False, feature_values=None, verbose=False) [source]¶

- static stump(feature_name, labeled_featuresets) [source]¶

- static train(labeled_featuresets, entropy_cutoff=0.05, depth_cutoff=100, support_cutoff=10, binary=False, feature_values=None, verbose=False) [source]¶
  Parameters: binary – If true, then treat all feature/value pairs as individual binary features, rather than using a single n-way branch for each feature.
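The stump-growing idea behind best_stump() and train() can be sketched in plain Python (no NLTK import; the function names below are illustrative, not the library's): for each feature, build a one-level tree that predicts the majority label in each branch, and keep the feature with the lowest training error.

```python
from collections import Counter, defaultdict

def stump_error(feature_name, labeled_featuresets):
    """Training error of a one-level tree ("stump") that splits on
    feature_name and predicts the majority label in each branch."""
    by_value = defaultdict(list)
    for fs, label in labeled_featuresets:
        by_value[fs.get(feature_name)].append(label)
    errors = 0
    for labels in by_value.values():
        # Majority label in this branch; everything else is an error.
        _, count = Counter(labels).most_common(1)[0]
        errors += len(labels) - count
    return errors / len(labeled_featuresets)

def best_stump_feature(feature_names, labeled_featuresets):
    # Mirrors the idea of best_stump(): lowest one-level training error wins.
    return min(feature_names, key=lambda f: stump_error(f, labeled_featuresets))

toks = [({'outlook': 'sunny', 'windy': True}, 'no'),
        ({'outlook': 'sunny', 'windy': False}, 'no'),
        ({'outlook': 'rain', 'windy': False}, 'yes'),
        ({'outlook': 'rain', 'windy': True}, 'yes'),
        ({'outlook': 'overcast', 'windy': True}, 'yes')]
print(best_stump_feature(['outlook', 'windy'], toks))  # outlook
```

Here 'outlook' separates the labels perfectly (error 0.0) while 'windy' does not, so it would be chosen as the root split.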
- unicode_repr¶
  Return repr(self).

nltk.classify.decisiontree.demo() [source]¶

nltk.classify.decisiontree.f(x) [source]¶
nltk.classify.maxent module¶
A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if its estimated frequency with which a class and a feature vector value co-occur equals the actual frequency in the data.
Terminology: ‘feature’¶
The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a “labeled” token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two different concepts:
- An “input-feature” is a property of an unlabeled token.
- A “joint-feature” is a property of a labeled token.
In the rest of the nltk.classify module, the term “features” is used to refer to what we will call “input-features” in this module.
In literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.
Converting Input-Features to Joint-Features¶
In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

|  joint_feat(token, label) = { 1 if input_feat(token) == feat_val
|                             {      and label == some_label
|                             {
|                             { 0 otherwise

For all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
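The mapping just described can be written out directly. This is a sketch of the idea only, not the MaxentFeatureEncodingI API itself: a factory that, given one (fname, fval, label) triple, returns the corresponding 0/1 joint-feature function.

```python
def make_joint_feature(fname, fval, some_label):
    """Return a binary joint-feature: it fires (returns 1) only when the
    input-feature fname has value fval AND the candidate label matches."""
    def joint_feat(featureset, label):
        return 1 if featureset.get(fname) == fval and label == some_label else 0
    return joint_feat

# One joint-feature derived from the 'prevword' input-feature:
f = make_joint_feature('prevword', 'the', 'NOUN')
print(f({'prevword': 'the'}, 'NOUN'))  # 1
print(f({'prevword': 'the'}, 'VERB'))  # 0  (wrong label)
print(f({'prevword': 'a'}, 'NOUN'))    # 0  (wrong feature value)
```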
class nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False) [source]¶
Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that generates vectors containing binary joint-features of the form:

|  joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
|                      {
|                      { 0 otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

|  joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
|                      {      and l == label
|                      {
|                      { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

|  joint_feat(fs, l) = { 1 if (l == label)
|                      {
|                      { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.
- describe(f_id) [source]¶

- encode(featureset, label) [source]¶

- labels() [source]¶

- length() [source]¶

- classmethod train(train_toks, count_cutoff=0, labels=None, **options) [source]¶
  Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description of BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.
  Parameters:
  - train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
  - count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature takes the value 1 fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.
  - labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
  - options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
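The encoding that train() builds can be mimicked in a few lines of plain Python. This is a sketch of the behavior, not the real class: assign one feature id per (fname, fval, label) triple observed in training, then encode a (featureset, label) pair as a sparse list of (id, 1) pairs.

```python
def build_mapping(train_toks):
    """One joint-feature id per (fname, fval, label) triple seen in training."""
    mapping = {}
    for fs, label in train_toks:
        for fname, fval in fs.items():
            key = (fname, fval, label)
            if key not in mapping:
                mapping[key] = len(mapping)
    return mapping

def encode(mapping, featureset, label):
    """Sparse binary vector: (feature_id, 1) for each triple that fires."""
    return [(mapping[(fname, fval, label)], 1)
            for fname, fval in featureset.items()
            if (fname, fval, label) in mapping]

toks = [({'prevword': 'the'}, 'NOUN'), ({'prevword': 'to'}, 'VERB')]
m = build_mapping(toks)
print(len(m))                                  # 2
print(encode(m, {'prevword': 'the'}, 'NOUN'))  # [(0, 1)]
print(encode(m, {'prevword': 'the'}, 'VERB'))  # []  (combination unseen in training)
```

The empty encoding for the ('the', 'VERB') pair is exactly the situation the unseen_features option is designed to handle.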
nltk.classify.maxent.ConditionalExponentialClassifier¶
Alias of MaxentClassifier.
class nltk.classify.maxent.FunctionBackedMaxentFeatureEncoding(func, length, labels) [source]¶
Bases: nltk.classify.maxent.MaxentFeatureEncodingI

A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.

- describe(fid) [source]¶

- encode(featureset, label) [source]¶

- labels() [source]¶

- length() [source]¶
class nltk.classify.maxent.GISEncoding(labels, mapping, unseen_features=False, alwayson_features=False, C=None) [source]¶
Bases: nltk.classify.maxent.BinaryMaxentFeatureEncoding

A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:
- At least one feature vector index must be nonzero for every token.
- The feature vector must sum to a constant non-negative number for every token.
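The correction feature's role can be shown numerically. This is a sketch under the assumption that C is an upper bound on how many joint-features fire for any token (the identifiers below are illustrative): if an encoded vector's values sum to s, the correction feature contributes C - s, so every vector sums to exactly C.

```python
def add_correction(encoded, C, correction_id):
    """Append a GIS-style correction feature so the sparse vector sums to C."""
    s = sum(v for _, v in encoded)
    assert s <= C, "C must bound the number of firing joint-features"
    return encoded + [(correction_id, C - s)]

C = 3  # assumed bound on firing features per token
v1 = add_correction([(0, 1), (2, 1)], C, correction_id=99)
v2 = add_correction([(1, 1)], C, correction_id=99)
print(sum(val for _, val in v1))  # 3
print(sum(val for _, val in v2))  # 3
```

Both preconditions are now met: the correction feature is nonzero whenever the rest of the vector sums to less than C, and every vector sums to the same constant.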
- C¶
  The non-negative constant that all encoded feature vectors will sum to.

- describe(f_id) [source]¶

- encode(featureset, label) [source]¶

- length() [source]¶
class nltk.classify.maxent.MaxentClassifier(encoding, weights, logarithmic=True) [source]¶
Bases: nltk.classify.api.ClassifierI

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

|                   dotprod(weights, encode(fs,label))
|  prob(fs|label) = ---------------------------------------------------
|                   sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

|  dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
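The equation can be checked with concrete numbers. Pure Python, no NLTK; the weights and encoded vectors below are made up for illustration, and the computation follows the equation exactly as stated above:

```python
def dotprod(a, b):
    # Same definition as in the docs above.
    return sum(x * y for (x, y) in zip(a, b))

def label_probs(weights, encodings):
    """Score each label by dotprod(weights, encoding), then normalize
    by the sum of scores over all labels."""
    scores = {label: dotprod(weights, enc) for label, enc in encodings.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

weights = [0.5, 2.0, 1.0]
encodings = {'pos': [1, 1, 0],   # hypothetical encode(fs, 'pos')
             'neg': [1, 0, 1]}   # hypothetical encode(fs, 'neg')
p = label_probs(weights, encodings)
print(round(p['pos'], 3))  # 0.625  (= 2.5 / (2.5 + 1.5))
print(round(p['neg'], 3))  # 0.375
```

Note the probabilities always sum to 1 by construction of the denominator.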
- ALGORITHMS = ['GIS', 'IIS', 'MEGAM', 'TADM']¶
  A list of the algorithm names that are accepted for the train() method’s algorithm parameter.

- classify(featureset) [source]¶

- explain(featureset, columns=4) [source]¶
  Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.

- labels() [source]¶

- prob_classify(featureset) [source]¶