nltk.stem package¶
Submodules¶
nltk.stem.api module¶
- class nltk.stem.api.StemmerI[source]¶ Bases: object
A processing interface for removing morphological affixes from words. This process is known as stemming.
- stem(token)[source]¶ Strip affixes from the token and return the stem.
Parameters: token (str) – The token that should be stemmed.
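In practice, any class that subclasses StemmerI and provides a stem() method satisfies this interface. A minimal sketch (the SuffixStripper class and its suffix list are invented for illustration and are not part of NLTK; no NLTK import is needed since the interface only requires stem()):

```python
class SuffixStripper(object):
    """Toy stemmer following the StemmerI contract: strips the first
    matching suffix from a fixed list. Illustrative only."""

    def __init__(self, suffixes=('ing', 'ed', 's')):
        self._suffixes = suffixes

    def stem(self, token):
        """Strip affixes from the token and return the stem."""
        for suffix in self._suffixes:
            # Keep at least three characters of stem to avoid over-stripping.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[:-len(suffix)]
        return token
```

For example, SuffixStripper().stem('walking') returns 'walk', while a word with no matching suffix is returned unchanged.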
nltk.stem.isri module¶
ISRI Arabic Stemmer
The algorithm for this stemmer is described in:
Taghva, K., Elkoury, R., and Coombs, J. 2005. Arabic Stemming without a root dictionary. Information Science Research Institute. University of Nevada, Las Vegas, USA.
The Information Science Research Institute’s (ISRI) Arabic stemmer shares many features with the Khoja stemmer. However, the main difference is that the ISRI stemmer does not use a root dictionary. Also, if a root is not found, the ISRI stemmer returns the normalized form rather than the original unmodified word.
Additional adjustments were made to improve the algorithm:
1. Adding 60 stop words.
2. Adding the pattern (تفاعيل) to the ISRI pattern set.
3. Step 2 of the original algorithm normalized all hamza; this step is discarded because it increases word ambiguity and changes the original root.
- class nltk.stem.isri.ISRIStemmer[source]¶ Bases: nltk.stem.api.StemmerI
ISRI Arabic stemmer based on algorithm: Arabic Stemming without a root dictionary. Information Science Research Institute. University of Nevada, Las Vegas, USA.
A few minor modifications have been made to the basic ISRI algorithm. See the source code of this module for more information.
isri.stem(token) returns Arabic root for the given token.
The ISRI Stemmer requires that all tokens be Unicode strings. If you use Python IDLE on Arabic Windows, you have to decode text first using the Arabic cp1256 encoding.
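A brief usage sketch (the sample words are illustrative; exact outputs depend on the stemmer's pattern and affix tables):

```python
# -*- coding: utf-8 -*-
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()

# A bare triliteral root with no affixes or diacritics comes back unchanged.
print(st.stem(u'كتب'))

# Longer forms are reduced toward a root, or returned in
# normalized form when no root is found.
for word in [u'الكتاب', u'يكتبون']:
    print(word, st.stem(word))
```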
- end_w5(word)[source]¶ Ending step (word of length five).
- end_w6(word)[source]¶ Ending step (word of length six).
- norm(word, num=3)[source]¶ Normalization: num=1 normalizes diacritics, num=2 normalizes initial hamza, num=3 does both.
- pre1(word)[source]¶ Normalize short prefix.
- pre32(word)[source]¶ Remove length-three and length-two prefixes, in this order.
- pro_w4(word)[source]¶ Process length-four patterns and extract length-three roots.
- pro_w53(word)[source]¶ Process length-five patterns and extract length-three roots.
- pro_w54(word)[source]¶ Process length-five patterns and extract length-four roots.
- pro_w6(word)[source]¶ Process length-six patterns and extract length-three roots.
- pro_w64(word)[source]¶ Process length-six patterns and extract length-four roots.
- stem(token)[source]¶ Stem a word token using the ISRI stemmer.
- suf1(word)[source]¶ Normalize short suffix.
- suf32(word)[source]¶ Remove length-three and length-two suffixes, in this order.
- waw(word)[source]¶ Remove the connective ‘و’ if it precedes a word beginning with ‘و’.
nltk.stem.lancaster module¶
A word stemmer based on the Lancaster stemming algorithm. Paice, Chris D. “Another Stemmer.” ACM SIGIR Forum 24.3 (1990): 56-61.
- class nltk.stem.lancaster.LancasterStemmer[source]¶ Bases: nltk.stem.api.StemmerI
Lancaster Stemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('maximum')     # Remove "-um" when word is intact
'maxim'
>>> st.stem('presumably')  # Don't remove "-um" when word is not intact
'presum'
>>> st.stem('multiply')    # No action taken if word ends with "-ply"
'multiply'
>>> st.stem('provision')   # Replace "-sion" with "-j" to trigger "j" set of rules
'provid'
>>> st.stem('owed')        # Word starting with vowel must contain at least 2 letters
'ow'
>>> st.stem('ear')         # ditto
'ear'
>>> st.stem('saying')      # Words starting with consonant must contain at least 3
'say'
>>> st.stem('crying')      # letters and one of those letters must be a vowel
'cry'
>>> st.stem('string')      # ditto
'string'
>>> st.stem('meant')       # ditto
'meant'
>>> st.stem('cement')      # ditto
'cem'
- parseRules(rule_tuple)[source]¶ Validate the set of rules used in this stemmer.
- rule_tuple = ('ai*2.', 'a*1.', 'bb1.', 'city3s.', 'ci2>', 'cn1t>', 'dd1.', 'dei3y>', 'deec2ss.', 'dee1.', 'de2>', 'dooh4>', 'e1>', 'feil1v.', 'fi2>', 'gni3>', 'gai3y.', 'ga2>', 'gg1.', 'ht*2.', 'hsiug5ct.', 'hsi3>', 'i*1.', 'i1y>', 'ji1d.', 'juf1s.', 'ju1d.', 'jo1d.', 'jeh1r.', 'jrev1t.', 'jsim2t.', 'jn1d.', 'j1s.', 'lbaifi6.', 'lbai4y.', 'lba3>', 'lbi3.', 'lib2l>', 'lc1.', 'lufi4y.', 'luf3>', 'lu2.', 'lai3>', 'lau3>', 'la2>', 'll1.', 'mui3.', 'mu*2.', 'msi3>', 'mm1.', 'nois4j>', 'noix4ct.', 'noi3>', 'nai3>', 'na2>', 'nee0.', 'ne2>', 'nn1.', 'pihs4>', 'pp1.', 're2>', 'rae0.', 'ra2.', 'ro2>', 'ru2>', 'rr1.', 'rt1>', 'rei3y>', 'sei3y>', 'sis2.', 'si2>', 'ssen4>', 'ss0.', 'suo3>', 'su*2.', 's*1>', 's0.', 'tacilp4y.', 'ta2>', 'tnem4>', 'tne3>', 'tna3>', 'tpir2b.', 'tpro2b.', 'tcud1.', 'tpmus2.', 'tpec2iv.', 'tulo2v.', 'tsis0.', 'tsi3>', 'tt1.', 'uqi3.', 'ugo1.', 'vis3j>', 'vie0.', 'vi2>', 'ylb1>', 'yli3y>', 'ylp0.', 'yl2>', 'ygo1.', 'yhp1.', 'ymo1.', 'ypo1.', 'yti3>', 'yte3>', 'ytl2.', 'yrtsi5.', 'yra3>', 'yro3>', 'yfi3.', 'ycn2t>', 'yca3>', 'zi2>', 'zy1s.')¶
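Each rule string encodes a Paice/Husk rule with its ending reversed: for example, 'gni3>' means "if the word ends in -ing, remove three characters and continue"; a trailing '.' stops stemming, '*' restricts the rule to intact (not yet stemmed) words, and letters after the digit are appended to the stem. A rough decoder for this encoding (my reading of the Paice 1990 format; decode_rule is an illustrative helper, not an NLTK API):

```python
import re

# Reversed ending, optional '*' (intact-only), chars-to-remove digit,
# optional append string, then '.' (stop) or '>' (continue).
RULE_RE = re.compile(r'^([a-z]*)(\*?)(\d)([a-z]*)([.>])$')

def decode_rule(rule):
    """Decode a Paice/Husk rule string such as 'gni3>'.

    Returns (ending, intact_only, remove_count, append, cont): the
    suffix it matches (un-reversed), whether the word must be intact,
    how many characters to delete, what to append, and whether
    stemming continues ('>') or stops ('.').
    """
    match = RULE_RE.match(rule)
    if match is None:
        raise ValueError('unparsable rule: %r' % rule)
    ending, intact, remove, append, cont = match.groups()
    return (ending[::-1], intact == '*', int(remove), append, cont == '>')
```

Under this reading, decode_rule('nois4j>') yields ('sion', False, 4, 'j', True), i.e. the "-sion" -> "-j" replacement shown in the doctest above.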
- stem(word)[source]¶ Stem a word using the Lancaster stemmer.
- unicode_repr()¶
nltk.stem.porter module¶
Porter Stemmer
This is the Porter stemming algorithm, ported to Python from the version coded up in ANSI C by the author. It follows the algorithm presented in
Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.
only differing from it at the points marked --DEPARTURE-- and --NEW-- below.
For a more faithful version of the Porter algorithm, see
www.tartarus.org/~martin/PorterStemmer/
Later additions:
June 2000
The ‘l’ of the ‘logi’ -> ‘log’ rule is put with the stem, so that short stems like ‘geo’ ‘theo’ etc work like ‘archaeo’ ‘philo’ etc.
This follows a suggestion of Barry Wilkins, research student at Birmingham.
February 2000
the cvc test for not dropping final -e now looks after vc at the beginning of a word, so are, eve, ice, ore, use keep final -e. In this test c is any consonant, including w, x and y. This extension was suggested by Chris Emerson.
-fully -> -ful treated like -fulness -> -ful, and -tionally -> -tion treated like -tional -> -tion
both in Step 2. These were suggested by Hiranmay Ghosh, of New Delhi.
Invariants proceed, succeed, exceed. Also suggested by Hiranmay Ghosh.
Additional modifications were made to incorporate this module into nltk. All such modifications are marked with “--NLTK--”.
- class nltk.stem.porter.PorterStemmer[source]¶ Bases: nltk.stem.api.StemmerI
A word stemmer based on the Porter stemming algorithm.
Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137. A few minor modifications have been made to Porter’s basic algorithm. See the source code of this module for more information.
The Porter Stemmer requires that all tokens have string types.
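A brief usage sketch (the sample words are arbitrary; they exercise the plural, -ing, and -ational rules of the algorithm):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ['caresses', 'ponies', 'running', 'relational']:
    # e.g. 'caresses' -> 'caress' (step 1a), 'running' -> 'run' (step 1b)
    print(word, '->', stemmer.stem(word))
```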
- stem(word)[source]¶
- stem_word(p, i=0, j=None)[source]¶ Returns the stem of p, or, if i and j are given, the stem of p[i:j+1].
- unicode_repr()¶
- nltk.stem.porter.demo()[source]¶ A demonstration of the Porter stemmer on a sample from the Penn Treebank corpus.
nltk.stem.regexp module¶
- class nltk.stem.regexp.RegexpStemmer(regexp, min=0)[source]¶ Bases: nltk.stem.api.StemmerI
A stemmer that uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.
>>> from nltk.stem import RegexpStemmer
>>> st = RegexpStemmer('ing$|s$|e$|able$', min=4)
>>> st.stem('cars')
'car'
>>> st.stem('mass')
'mas'
>>> st.stem('was')
'was'
>>> st.stem('bee')
'bee'
>>> st.stem('compute')
'comput'
>>> st.stem('advisable')
'advis'
Parameters: - regexp (str or regexp) – The regular expression that should be used to identify morphological affixes.
- min (int) – The minimum length a word must have for it to be stemmed.
- stem(word)[source]¶
- unicode_repr()¶
nltk.stem.rslp module¶
- class nltk.stem.rslp.RSLPStemmer[source]¶ Bases: nltk.stem.api.StemmerI
A stemmer for Portuguese.
>>> from nltk.stem import RSLPStemmer
>>> st = RSLPStemmer()
>>> # opening lines of Erico Verissimo's "Música ao Longe"
>>> text = '''
... Clarissa risca com giz no quadro-negro a paisagem que os alunos
... devem copiar . Uma casinha de porta e janela , em cima duma
... coxilha .'''
>>> for token in text.split():
...     print(st.stem(token))
clariss
risc
com
giz
no
quadro-negr
a
pais
que
os
alun
dev
copi
.
uma
cas
de
port
e
janel
,
em
cim
dum
coxilh
.
- apply_rule(word, rule_index)[source]¶
- read_rule(filename)[source]¶
- stem(word)[source]¶
nltk.stem.snowball module¶
Snowball stemmers
This module provides a port of the Snowball stemmers developed by Martin Porter.
There is also a demo function: snowball.demo().
- class nltk.stem.snowball.DanishStemmer(ignore_stopwords=False)[source]¶ Bases: nltk.stem.snowball._ScandinavianStemmer
The Danish Snowball stemmer.
Variables: - __vowels – The Danish vowels.
- __consonants – The Danish consonants.
- __double_consonants – The Danish double consonants.
- __s_ending – Letters that may directly appear before a word final ‘s’.
- __step1_suffixes – Suffixes to be deleted in step 1 of the algorithm.
- __step2_suffixes – Suffixes to be deleted in step 2 of the algorithm.
- __step3_suffixes – Suffixes to be deleted in step 3 of the algorithm.
Note: A detailed description of the Danish stemming algorithm can be found under snowball.tartarus.org/algorithms/danish/stemmer.html
- stem(word)[source]¶ Stem a Danish word and return the stemmed form.
Parameters: word (str or unicode) – The word that is stemmed.
Returns: The stemmed form.
Return type: unicode
- class nltk.stem.snowball.DutchStemmer(ignore_stopwords=False)[source]¶ Bases: nltk.stem.snowball._StandardStemmer
The Dutch Snowball stemmer.
Variables: - __vowels – The Dutch vowels.
- __step1_suffixes – Suffixes to be deleted in step 1 of the algorithm.
- __step3b_suffixes – Suffixes to be deleted in step 3b of the algorithm.
Note: A detailed description of the Dutch stemming algorithm can be found under snowball.tartarus.org/algorithms/dutch/stemmer.html
- stem(word)[source]¶ Stem a Dutch word and return the stemmed form.
Parameters: word (str or unicode) – The word that is stemmed.
Returns: The stemmed form.
Return type: unicode
- class nltk.stem.snowball.EnglishStemmer(ignore_stopwords=False)[source]¶ Bases: nltk.stem.snowball._StandardStemmer
The English Snowball stemmer.
Variables: - __vowels – The English vowels.
- __double_consonants – The English double consonants.
- __li_ending – Letters that may directly appear before a word final ‘li’.
- __step0_suffixes – Suffixes to be deleted in step 0 of the algorithm.
- __step1a_suffixes – Suffixes to be deleted in step 1a of the algorithm.
- __step1b_suffixes – Suffixes to be deleted in step 1b of the algorithm.
- __step2_suffixes – Suffixes to be deleted in step 2 of the algorithm.
- __step3_suffixes – Suffixes to be deleted in step 3 of the algorithm.
- __step4_suffixes – Suffixes to be deleted in step 4 of the algorithm.
- __step5_suffixes – Suffixes to be deleted in step 5 of the algorithm.
- __special_words – A dictionary containing words which have to be stemmed specially.
Note: A detailed description of the English stemming algorithm can be found under snowball.tartarus.org/algorithms/english/stemmer.html
- stem(word)[source]¶ Stem an English word and return the stemmed form.
Parameters: word (str or unicode) – The word that is stemmed.
Returns: The stemmed form.
Return type: unicode
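For instance (a short sketch with arbitrary sample words; the other language-specific Snowball stemmers in this module are used the same way):

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
for word in ['running', 'generously', 'fairly']:
    # Snowball reduces inflected and derived forms, e.g. 'running' -> 'run'
    print(word, '->', stemmer.stem(word))
```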
- class nltk.stem.snowball.FinnishStemmer(ignore_stopwords=False)[source]¶ Bases: nltk.stem.snowball._StandardStemmer
The Finnish Snowball stemmer.
Variables: - __vowels – The Finnish vowels.
- __restricted_vowels – A subset of the Finnish vowels.
- __long_vowels – The Finnish vowels in their long forms.
- __consonants – The Finnish consonants.
- __double_consonants – The Finnish double consonants.
- __step1_suffixes – Suffixes to be deleted in step 1 of the algorithm.
- __step2_suffixes – Suffixes to be deleted in step 2 of the algorithm.
- __step3_suffixes – Suffixes to be deleted in step 3 of the algorithm.
- __step4_suffixes – Suffixes to be deleted in step 4 of the algorithm.
Note: A detailed description of the Finnish stemming algorithm can be found under snowball.tartarus.org/algorithms/finnish/stemmer.html
- stem(word)[source]¶ Stem a Finnish word and return the stemmed form.
Parameters: word (str or unicode) – The word that is stemmed.
Returns: The stemmed form.
Return type: unicode
- class nltk.stem.snowball.FrenchStemmer(ignore_stopwords=False)[source]¶ Bases: nltk.stem.snowball._StandardStemmer
The French Snowball stemmer.
Variables: - __vowels – The French vowels.
- __step1_suffixes – Suffixes to be deleted in step 1 of the algorithm.
- __step2a_suffixes – Suffixes to be deleted in step 2a of the algorithm.
- __step2b_suffixes – Suffixes to be deleted in step 2b of the algorithm.
- __step4_suffixes – Suffixes to be deleted in step 4 of the algorithm.
Note: A detailed description of the French stemming algorithm can be found under snowball.tartarus.org/algorithms/french/stemmer.html
- stem(word)[source]¶ Stem a French word and return the stemmed form.