Bag of Words

This module constructs a new corpus with tokens as features.

First create a corpus:

>>> from orangecontrib.text import Corpus
>>> corpus = Corpus.from_file('deerwester')
>>> corpus.domain
[ | Category] {Text}

Then create a BowVectorizer object and call its transform method:

>>> from orangecontrib.text.vectorization.bagofwords import BowVectorizer
>>> bow = BowVectorizer()
>>> new_corpus = bow.transform(corpus)
>>> new_corpus.domain
[a, abc, and, applications, binary, computer, engineering, eps, error, for,
generation, graph, human, in, interface, intersection, iv, lab, machine,
management, measurement, minors, of, opinion, ordering, paths, perceived,
quasi, random, relation, response, survey, system, testing, the, time, to,
trees, unordered, user, well, widths | Category] {Text}

class orangecontrib.text.vectorization.bagofwords.BowVectorizer(norm='(None)', wlocal='Count', wglobal='(None)')
__init__(norm='(None)', wlocal='Count', wglobal='(None)')
transform(corpus, copy=True, source_dict=None, callback=<function dummy_callback>)

Transforms a corpus into a new one with additional attributes (here, one feature per token).
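
To make the idea of "tokens as features" concrete, the following is a minimal pure-Python sketch of count-based bag of words. It does not use or reproduce the BowVectorizer implementation. The toy documents, the whitespace tokenization, and the names `documents`, `vocabulary` and `matrix` are all assumptions made for illustration.

```python
from collections import Counter

# Toy documents, invented for illustration (not the actual deerwester texts).
documents = [
    "human machine interface",
    "user interface system",
    "graph minors survey graph",
]

# Naive lowercase + whitespace tokenization; this is an assumption,
# the add-on applies its own preprocessing to a corpus.
tokenized = [doc.lower().split() for doc in documents]

# The sorted set of distinct tokens plays the role of the new attributes.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per token, each cell a raw count.
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[count[token] for token in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each column of `matrix` corresponds to one entry of `vocabulary`, in the same way that each token appears as an attribute in `new_corpus.domain` in the example above. Raw counts match the idea behind the default `wlocal='Count'` setting; other weightings and normalizations are not sketched here.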