Preprocessor
This module provides basic functions to process a Corpus and extract tokens from its documents.
To use preprocessing, first create a corpus:
>>> from orangecontrib.text import Corpus
>>> corpus = Corpus.from_file('book-excerpts')
and create a Preprocessor object with the components you want:
>>> from orangecontrib.text import preprocess
>>> p = preprocess.Preprocessor(transformers=[preprocess.LowercaseTransformer()],
... tokenizer=preprocess.WordPunctTokenizer(),
... normalizer=preprocess.SnowballStemmer('english'),
... filters=[preprocess.StopwordsFilter('english'),
... preprocess.FrequencyFilter(min_df=.1)])
Then apply the preprocessor to the corpus and access the tokens via the tokens attribute:
>>> new_corpus = p(corpus)
>>> new_corpus.tokens[0][:10]
['hous', 'say', ';', 'spoke', 'littl', 'one', 'hand', 'wall', 'hurt', '?']
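To make the pipeline concrete, here is a minimal pure-Python sketch of the same stages (lowercasing, word/punct tokenizing, stopword filtering, and frequency filtering with a min_df fraction). It does not use orangecontrib, uses a tiny made-up stopword list, and omits the stemming step:

```python
import re

STOPWORDS = {"the", "a", "and", "of", "to"}  # tiny illustrative list


def tokenize(doc):
    # WordPunctTokenizer-style split: runs of word chars or of punctuation
    return re.findall(r"\w+|[^\w\s]+", doc.lower())


def preprocess(docs, min_df=0.1):
    tokenized = [tokenize(d) for d in docs]
    # document frequency: in how many documents each token occurs
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    # keep tokens occurring in at least min_df of the documents
    keep = {t for t, n in df.items() if n / len(docs) >= min_df}
    return [[t for t in toks if t in keep and t not in STOPWORDS]
            for toks in tokenized]


docs = ["The house was small.", "A small house and a small garden."]
print(preprocess(docs, min_df=0.5))
```

The library's FrequencyFilter works analogously: min_df=.1 in the example above drops tokens that appear in fewer than 10% of the documents.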
This module also defines a default_preprocessor that is used to extract tokens from a Corpus if no preprocessing has been applied yet:
>>> from orangecontrib.text import Corpus
>>> corpus = Corpus.from_file('deerwester')
>>> corpus.tokens[0]
['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']
class orangecontrib.text.preprocess.Preprocessor(transformers=None, tokenizer=None, normalizer=None, filters=None, ngrams_range=None, pos_tagger=None)[source]

    Holds document processing objects.

    transformers : List[BaseTransformer] – transforms strings
    tokenizer : BaseTokenizer – tokenizes strings
    normalizer : BaseNormalizer – normalizes tokens
    filters : List[BaseTokenFilter] – filters out unneeded tokens
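The four attributes above compose in a fixed order: transform, tokenize, normalize, then filter. A hypothetical sketch of that composition (the names mirror the documented class, but this is not the library's implementation; callables stand in for the Base* classes):

```python
class SimplePreprocessor:
    """Hypothetical analogue of Preprocessor: chains the four stages."""

    def __init__(self, transformers=None, tokenizer=None,
                 normalizer=None, filters=None):
        self.transformers = transformers or []   # str -> str, applied in order
        self.tokenizer = tokenizer or str.split  # str -> list of tokens
        self.normalizer = normalizer             # token -> token, or None
        self.filters = filters or []             # token -> bool predicates

    def __call__(self, documents):
        result = []
        for doc in documents:
            for transform in self.transformers:
                doc = transform(doc)
            tokens = self.tokenizer(doc)
            if self.normalizer is not None:
                tokens = [self.normalizer(t) for t in tokens]
            for keep in self.filters:
                tokens = [t for t in tokens if keep(t)]
            result.append(tokens)
        return result


p = SimplePreprocessor(transformers=[str.lower],
                       filters=[lambda t: t not in {"the", "a"}])
print(p(["The Cat and the Hat"]))
```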