class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]

Internal class for storing a corpus.


Return a copy of the table.

property dictionary

A token to id mapper.
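A token-to-id mapping of this kind can be pictured with the minimal pure-Python sketch below. It is not the object this property returns; `build_token_ids` is a hypothetical helper introduced only for illustration, and numbering tokens in order of first appearance is an assumption.

```python
def build_token_ids(tokenized_docs):
    """Assign consecutive integer ids to tokens in order of first appearance.

    Hypothetical helper for illustration; not part of the Corpus API.
    """
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            # setdefault assigns an id only the first time a token is seen
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


tokenized = [["human", "machine", "interface"], ["machine", "learning"]]
mapping = build_token_ids(tokenized)
print(mapping)
```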



property documents

Returns a list of strings representing documents, created by joining the selected text features.


feats (list) – A list of features to join.

Returns: a list of strings constructed by joining feats.
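The joining step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `join_features` is a hypothetical helper, rows are modelled as dictionaries, and both the space separator and the skipping of empty values are assumptions.

```python
def join_features(rows, feats, sep=" "):
    """Join the values of the selected fields into one string per row.

    Hypothetical helper; the separator and the handling of empty values
    are assumptions made for this sketch.
    """
    documents = []
    for row in rows:
        # Skip missing or empty values so they leave no stray separators
        parts = [str(row[f]) for f in feats if row.get(f)]
        documents.append(sep.join(parts))
    return documents


rows = [
    {"title": "Orange", "content": "A data mining toolkit.", "year": 2013},
    {"title": "Text add-on", "content": "", "year": 2016},
]
docs = join_features(rows, ["title", "content"])
print(docs)
```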

extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]

Append features to the corpus. If the feature_values argument is given, the features will be Discrete; otherwise they will be Continuous.

  • X (numpy.ndarray or scipy.sparse.csr_matrix) – Features values to append

  • feature_names (list) – List of strings containing feature names

  • feature_values (list) – A list of possible values for Discrete features.

  • compute_values (list) – Compute values for corresponding features.

  • var_attrs (dict) – Additional attributes appended to variable.attributes.

  • sparse (bool) – Whether the features should be marked as sparse.

  • rename_existing (bool) – When true and names are not unique, rename existing features; if false, rename new features

static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)[source]

Create corpus from documents.

  • documents (list) – List of documents.

  • name (str) – Name of the corpus

  • attributes (list) – List of tuples (Variable, getter) for attributes.

  • class_vars (list) – List of tuples (Variable, getter) for class vars.

  • metas (list) – List of tuples (Variable, getter) for metas.

  • title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
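The (Variable, getter) pairs can be read as "a column description plus a function that extracts that column's value from one document". The sketch below illustrates the idea in plain Python. It assumes each getter is a callable taking a single document, which the text above does not state explicitly; plain strings stand in for Variable objects, and `build_columns` is a hypothetical helper.

```python
documents = [
    {"headline": "First", "body": "Some text.", "words": 2},
    {"headline": "Second", "body": "More text here.", "words": 3},
]

# Plain strings stand in for Variable objects in this sketch (assumption).
attributes = [("words", lambda doc: doc["words"])]
metas = [
    ("headline", lambda doc: doc["headline"]),
    ("body", lambda doc: doc["body"]),
]


def build_columns(documents, spec):
    """Apply every getter in spec to every document: one output row per document."""
    return [[getter(doc) for _, getter in spec] for doc in documents]


X = build_columns(documents, attributes)
M = build_columns(documents, metas)
print(X)
print(M)
```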



classmethod from_file(filename)[source]

Read a data table from a file. The path can be absolute or relative.

  • filename (str) – File name

  • sheet (str) – Sheet in a file (optional)


Returns: a new data table

classmethod from_numpy(domain, X, Y=None, metas=None, W=None, attributes=None, ids=None, text_features=None)[source]

Construct a table from numpy arrays with the given domain. The number of variables in the domain must match the number of columns in the corresponding arrays. All arrays must have the same number of rows. Arrays may be of different numpy types, and may be dense or sparse.

  • domain – the domain for the new table

  • X (np.array) – array with attribute values

  • Y (np.array) – array with class values

  • metas (np.array) – array with meta attributes

  • W (np.array) – array with weights
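The shape constraints described above (matching column counts, equal row counts) can be checked with a sketch like the following. Nested lists stand in for numpy arrays so the example needs no third-party packages; `check_shapes` and its parameters are hypothetical and are not part of the library.

```python
def check_shapes(n_attributes, n_class_vars, n_metas, X, Y=None, metas=None):
    """Validate that each block has the expected columns and the same row count.

    Hypothetical helper; nested lists stand in for numpy arrays.
    Returns the common number of rows.
    """
    n_rows = len(X)
    blocks = (("X", X, n_attributes), ("Y", Y, n_class_vars), ("metas", metas, n_metas))
    for name, block, expected_cols in blocks:
        if block is None:
            continue  # optional block not supplied
        if len(block) != n_rows:
            raise ValueError(f"{name} has {len(block)} rows, expected {n_rows}")
        if any(len(row) != expected_cols for row in block):
            raise ValueError(f"{name} must have {expected_cols} columns")
    return n_rows


X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[0], [1]]
metas = [["doc a"], ["doc b"]]
n = check_shapes(2, 1, 1, X, Y, metas)
print(n)
```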


classmethod from_table(domain, source, row_indices=Ellipsis)[source]

Create a new table from selected columns and/or rows of an existing one. The columns are chosen using a domain. The domain may also include variables that do not appear in the source table; they are computed from source variables if possible.

The resulting data may be a view or a copy of the existing data.

  • domain – the domain for the new table

  • source – the source table

  • row_indices (a slice or a sequence) – indices of the rows to include


Returns: a new table
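The accepted forms of row_indices can be illustrated with a plain-Python sketch. `select_rows` is a hypothetical helper operating on a list, not the table implementation; treating the default Ellipsis as "all rows" is an assumption based on the signature, and unlike the real method this sketch always returns a copy.

```python
def select_rows(rows, row_indices=...):
    """Select rows by Ellipsis (all rows), a slice, or a sequence of indices.

    Hypothetical helper; always returns a new list.
    """
    if row_indices is ...:
        return list(rows)
    if isinstance(row_indices, slice):
        return rows[row_indices]
    # Any other value is treated as a sequence of integer positions
    return [rows[i] for i in row_indices]


rows = ["a", "b", "c", "d"]
print(select_rows(rows))
print(select_rows(rows, slice(1, 3)))
print(select_rows(rows, [3, 0]))
```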

classmethod from_table_rows(source, row_indices)[source]

Construct a new table by selecting rows from the source table.

  • source – an existing table

  • row_indices (a slice or a sequence) – indices of the rows to include


Returns: a new table


Return whether corpus is preprocessed or not.

property ngrams

Ngram representations of documents.
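For readers unfamiliar with n-grams, the sketch below builds them from one document's token list. It illustrates the concept only: `make_ngrams` is a hypothetical helper, and the default range (unigrams and bigrams) and the space separator are assumptions, not documented behaviour of this property.

```python
def make_ngrams(tokens, n_min=1, n_max=2, sep=" "):
    """Return all n-grams of tokens for n in [n_min, n_max], joined by sep.

    Hypothetical helper; range and separator are assumptions.
    """
    result = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            result.append(sep.join(tokens[i:i + n]))
    return result


grams = make_ngrams(["text", "mining", "rocks"])
print(grams)
```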



property pos_tags

A list of lists containing POS tags. If there are no POS tags available, return None.



property pp_documents

Preprocessed documents (transformed).

static retain_preprocessing(orig, new, key=Ellipsis)[source]

Set preprocessing of ‘new’ object to match the ‘orig’ object.

set_text_features(feats: Optional[List[Variable]]) → None[source]

Select which meta-attributes to include when mining text.


feats – List of text features to include. If None, infer them.

set_title_variable(title_variable: Optional[Union[StringVariable, str]]) → None[source]

Set the title attribute. Only one column can be a title attribute.


title_variable – Variable that needs to be set as a title variable. If it is None, do not set a variable.

store_tokens(tokens, dictionary=None)[source]

tokens (list) – List of lists containing tokens.

property titles

Returns a list of titles.

property tokens

A list of lists containing tokens. If tokens are not yet present, run default preprocessor and return tokens.
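The "compute on first access" behaviour described here can be sketched with a cached property. The class below is a hypothetical stand-in, not the Corpus implementation: its default preprocessing (lowercasing and splitting on word characters) is an assumption chosen for brevity, and the call counter exists only to show that preprocessing runs once.

```python
import re


class LazyTokens:
    """Hypothetical stand-in showing tokens computed lazily and then cached."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None
        self.preprocess_calls = 0  # only here to demonstrate the caching

    def _default_preprocess(self):
        # Assumed default: lowercase, then keep runs of word characters
        self.preprocess_calls += 1
        return [re.findall(r"\w+", doc.lower()) for doc in self.documents]

    @property
    def tokens(self):
        if self._tokens is None:
            # Tokens are not present yet, so run the default preprocessing
            self._tokens = self._default_preprocess()
        return self._tokens


lazy = LazyTokens(["Hello, world!", "Text mining"])
first = lazy.tokens
second = lazy.tokens  # served from the cache, no second preprocessing run
print(first)
```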