Corpus

class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]

Internal class for storing a corpus.

__init__(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)[source]
Parameters
  • domain (Orange.data.Domain) – the domain for this Corpus

  • X (numpy.ndarray) – attributes

  • Y (numpy.ndarray) – class variables

  • metas (numpy.ndarray) – meta attributes; e.g. text

  • W (numpy.ndarray) – instance weights

  • text_features (list) – meta attributes that are used for text mining. Infer them if None.

  • ids (numpy.ndarray) – Indices

copy()[source]

Return a copy of the table.

property dictionary

A token-to-id mapper.

Type

corpora.Dictionary
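
As a rough illustration of what a token-to-id mapper does, the sketch below builds one in plain Python. It does not use corpora.Dictionary: the helper build_token_ids, the sample tokens, and the id assignment order (first appearance) are all assumptions made for the example.

```python
def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            # setdefault stores a new id only the first time a token is seen
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


# Hypothetical tokenized documents
docs = [["human", "machine", "interface"], ["machine", "learning"]]
token_ids = build_token_ids(docs)
print(token_ids)
```

A mapping of this kind lets later steps, such as bag-of-words counting, refer to tokens by compact integer ids instead of strings.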

property documents

Returns a list of strings representing documents, created by joining the selected text features.

documents_from_features(feats)[source]
Parameters

feats (list) – A list of features to join.

Returns: a list of strings constructed by joining feats.
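
The joining step described above can be sketched in plain Python. This is not the library's code: the helper join_features, the dict-based rows, and the single-space separator are assumptions chosen for the example.

```python
def join_features(rows, feats, sep=" "):
    """Build one string per row by joining the values of the chosen features."""
    return [sep.join(str(row[f]) for f in feats) for row in rows]


# Hypothetical rows; "author" is present but not selected for joining
rows = [
    {"title": "Intro", "body": "Text mining basics.", "author": "A. N."},
    {"title": "Methods", "body": "Tokenize, then count.", "author": "B. K."},
]
documents = join_features(rows, ["title", "body"])
print(documents)
```

Only the features named in feats contribute to each document string, so unselected columns such as author are left out.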

extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]

Append features to the corpus. If the feature_values argument is present, the features will be Discrete; otherwise they will be Continuous.

Parameters
  • X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append

  • feature_names (list) – List of strings containing feature names

  • feature_values (list) – A list of possible values for Discrete features.

  • compute_values (list) – Compute values for corresponding features.

  • var_attrs (dict) – Additional attributes appended to variable.attributes.

  • sparse (bool) – Whether the features should be marked as sparse.

  • rename_existing (bool) – When true and names are not unique, rename existing features; when false, rename the new features
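
To make the rename_existing choice concrete, here is a minimal sketch of resolving a name clash in either direction. The helpers free_name and resolve_names are hypothetical, and the " (1)" suffix scheme is an assumption; the naming scheme the library actually applies may differ.

```python
def free_name(name, taken):
    """Return name with the first numeric suffix that is not already taken."""
    i = 1
    while f"{name} ({i})" in taken:
        i += 1
    return f"{name} ({i})"


def resolve_names(existing, new, rename_existing=False):
    """Return (existing, new) name lists with clashes resolved."""
    existing, new = list(existing), list(new)
    for j, name in enumerate(new):
        if name not in existing:
            continue  # no clash for this name
        taken = set(existing) | set(new)
        if rename_existing:
            # keep the new name, move the old feature out of the way
            existing[existing.index(name)] = free_name(name, taken)
        else:
            # keep the old name, give the new feature a suffixed name
            new[j] = free_name(name, taken)
    return existing, new


renamed_new = resolve_names(["count", "len"], ["count"])
renamed_old = resolve_names(["count", "len"], ["count"], rename_existing=True)
print(renamed_new)
print(renamed_old)
```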

extend_corpus(metadata, Y)[source]

Append documents to corpus.

Parameters
  • metadata (numpy.ndarray) – Meta data

  • Y (numpy.ndarray) – Class variables

static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)[source]

Create corpus from documents.

Parameters
  • documents (list) – List of documents.

  • name (str) – Name of the corpus

  • attributes (list) – List of tuples (Variable, getter) for attributes.

  • class_vars (list) – List of tuples (Variable, getter) for class vars.

  • metas (list) – List of tuples (Variable, getter) for metas.

  • title_indices (list) – List of indices into domain corresponding to features which will be used as titles.

Returns

Corpus.
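
The (Variable, getter) pairs can be read as "for each variable, call its getter on every document". The sketch below shows that idea in plain Python under stated assumptions: plain strings stand in for Variable objects, documents are dicts, and build_columns is a hypothetical helper rather than part of the library.

```python
def build_columns(documents, specs):
    """Apply each getter to every document, giving one column per variable."""
    return {name: [getter(doc) for doc in documents] for name, getter in specs}


# Hypothetical documents
documents = [
    {"title": "Intro", "text": "Text mining basics", "year": 2019},
    {"title": "Methods", "text": "Tokenize then count", "year": 2021},
]

# Strings stand in for Variable objects in this sketch
metas = [("title", lambda d: d["title"]), ("text", lambda d: d["text"])]
class_vars = [("year", lambda d: d["year"])]

meta_columns = build_columns(documents, metas)
class_columns = build_columns(documents, class_vars)
print(meta_columns)
print(class_columns)
```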

classmethod from_file(filename)[source]

Read a data table from a file. The path can be absolute or relative.

Parameters
  • filename (str) – File name

  • sheet (str) – Sheet in a file (optional)

Returns

a new data table

Return type

Orange.data.Table

classmethod from_numpy(*args, **kwargs)[source]

Construct a table from numpy arrays with the given domain. The number of variables in the domain must match the number of columns in the corresponding arrays. All arrays must have the same number of rows. Arrays may be of different numpy types, and may be dense or sparse.

Parameters
  • domain (Orange.data.Domain) – the domain for the new table

  • X (np.array) – array with attribute values

  • Y (np.array) – array with class values

  • metas (np.array) – array with meta attributes

  • W (np.array) – array with weights

Returns

classmethod from_table(domain, source, row_indices=Ellipsis)[source]

Create a new table from selected columns and/or rows of an existing one. The columns are chosen using a domain. The domain may also include variables that do not appear in the source table; they are computed from source variables if possible.

The resulting data may be a view or a copy of the existing data.

Parameters
  • domain (Orange.data.Domain) – the domain for the new table

  • source (Orange.data.Table) – the source table

  • row_indices (a slice or a sequence) – indices of the rows to include

Returns

a new table

Return type

Orange.data.Table

classmethod from_table_rows(source, row_indices)[source]

Construct a new table by selecting rows from the source table.

Parameters
  • source (Orange.data.Table) – an existing table

  • row_indices (a slice or a sequence) – indices of the rows to include

Returns

a new table

Return type

Orange.data.Table
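
Since row_indices may be either a slice or a sequence, a short sketch of both selection styles may help. It uses a plain list in place of a table, and select_rows is a hypothetical helper, not the library's implementation.

```python
def select_rows(rows, row_indices):
    """Return the rows picked by a slice or by a sequence of integer indices."""
    if isinstance(row_indices, slice):
        return rows[row_indices]
    return [rows[i] for i in row_indices]


# A plain list stands in for the source table
table = ["doc0", "doc1", "doc2", "doc3", "doc4"]
first_three = select_rows(table, slice(0, 3))
picked = select_rows(table, [4, 0, 2])
print(first_three)
print(picked)
```

A sequence of indices can reorder or repeat rows, which a slice cannot do.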

has_tokens()[source]

Return whether the corpus has been preprocessed.

property ngrams

Ngram representations of documents.

Type

generator
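
For readers unfamiliar with n-grams, the following generator sketches how they can be produced from one tokenized document. The function iter_ngrams, its n_min and n_max parameters, and the space separator are assumptions for illustration; the range and separator used by the library are not stated here.

```python
def iter_ngrams(tokens, n_min=1, n_max=2, sep=" "):
    """Yield every n-gram of the token list for n from n_min to n_max."""
    for n in range(n_min, n_max + 1):
        # a window of n tokens slides across the document
        for i in range(len(tokens) - n + 1):
            yield sep.join(tokens[i:i + n])


grams = list(iter_ngrams(["text", "mining", "rocks"]))
print(grams)
```

Because the function is a generator, n-grams are produced on demand and are only materialized when the caller wraps it in list().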

property pos_tags

A list of lists containing POS tags. If there are no POS tags available, return None.

Type

np.ndarray

property pp_documents

Preprocessed documents (transformed).

static retain_preprocessing(orig, new, key=Ellipsis)[source]

Set preprocessing of ‘new’ object to match the ‘orig’ object.

set_text_features(feats: Optional[List[Orange.data.variable.Variable]]) → None[source]

Select which meta-attributes to include when mining text.

Parameters

feats – List of text features to include. If None, infer them.

set_title_variable(title_variable: Optional[Union[Orange.data.variable.StringVariable, str]]) → None[source]

Set the title attribute. Only one column can be a title attribute.

Parameters

title_variable – Variable that needs to be set as the title variable. If None, no title variable is set.

store_tokens(tokens, dictionary=None)[source]
Parameters

tokens (list) – List of lists containing tokens.

property titles

Returns a list of titles.

property tokens

A list of lists containing tokens. If tokens are not yet present, run default preprocessor and return tokens.

Type

np.ndarray
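
The "compute on first access" behaviour of tokens can be sketched with a small stand-in class. LazyTokens is hypothetical and is not the Corpus class; in particular, the lowercase-and-split step is only an assumed placeholder for whatever the default preprocessor does.

```python
class LazyTokens:
    """Toy container that tokenizes its documents the first time tokens is read."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None  # nothing stored yet

    def has_tokens(self):
        return self._tokens is not None

    def store_tokens(self, tokens):
        self._tokens = tokens

    @property
    def tokens(self):
        if self._tokens is None:
            # Assumed default preprocessor: lowercase, then split on whitespace
            self.store_tokens([doc.lower().split() for doc in self.documents])
        return self._tokens


corpus = LazyTokens(["Hello World", "Text mining"])
before = corpus.has_tokens()
tokens = corpus.tokens
after = corpus.has_tokens()
print(before, tokens, after)
```

Reading tokens has a side effect in this sketch: afterwards has_tokens() reports that tokens are present, and later reads reuse the stored result.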