Preprocessor

This module provides basic functions for processing a Corpus and extracting tokens from its documents.

To use preprocessing, first create a corpus:

>>> from orangecontrib.text import Corpus
>>> corpus = Corpus.from_file('bookexcerpts')

Then create a Preprocessor object with the methods you want:

>>> from orangecontrib.text import preprocess
>>> p = preprocess.Preprocessor(transformers=[preprocess.LowercaseTransformer()],
...                             tokenizer=preprocess.WordPunctTokenizer(),
...                             normalizer=preprocess.SnowballStemmer('english'),
...                             filters=[preprocess.StopwordsFilter('english'),
...                                      preprocess.FrequencyFilter(min_df=.1)])

Then you can apply your preprocessor to the corpus and access the tokens via the tokens attribute:

>>> new_corpus = p(corpus)
>>> new_corpus.tokens[0][:10]
['hous', 'say', ';', 'spoke', 'littl', 'one', 'hand', 'wall', 'hurt', '?']

This module also defines default_preprocessor, which is used to extract tokens from a Corpus if no preprocessing has been applied yet:

>>> from orangecontrib.text import Corpus
>>> corpus = Corpus.from_file('deerwester')
>>> corpus.tokens[0]
['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']

class orangecontrib.text.preprocess.Preprocessor(transformers=None, tokenizer=None, normalizer=None, filters=None, ngrams_range=None, pos_tagger=None)

Holds document processing objects.

transformers

List[BaseTransformer] – transforms strings

tokenizer

BaseTokenizer – tokenizes strings

normalizer

BaseNormalizer – normalizes tokens

filters

List[BaseTokenFilter] – filters unneeded tokens
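
Each step is optional; per the constructor signature above, every argument defaults to None. As a minimal sketch (assuming omitted steps are simply skipped), a preprocessor can be built with only a tokenizer and a stopword filter:

>>> from orangecontrib.text import preprocess
>>> p = preprocess.Preprocessor(tokenizer=preprocess.WordPunctTokenizer(),
...                             filters=[preprocess.StopwordsFilter('english')])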

__call__(corpus, inplace=True, on_progress=None)

Runs preprocessing over a corpus.

Parameters:
  • corpus (orangecontrib.text.Corpus) – A corpus to preprocess.
  • inplace (bool) – If True (default), modify the given corpus in place; otherwise return a new Corpus instance.
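
For example, a sketch assuming inplace=False leaves the passed corpus untouched and returns the preprocessed copy:

>>> processed = p(corpus, inplace=False)  # `corpus` itself is left unchanged
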
set_up()

Called before every __call__. Used for setting up tokenizer & filters.

tear_down()

Called after every __call__. Used for cleaning up tokenizer & filters.
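
Since both hooks bracket every __call__, a subclass can override them to add its own setup or cleanup. A rough sketch (LoggingPreprocessor is a hypothetical example, not part of the module):

>>> class LoggingPreprocessor(preprocess.Preprocessor):
...     def set_up(self):
...         super().set_up()        # let the tokenizer & filters set up first
...         print('preprocessing started')
...     def tear_down(self):
...         super().tear_down()     # clean up the tokenizer & filters
...         print('preprocessing finished')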