Corpus

class orangecontrib.text.corpus.Corpus(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)

Internal class for storing a corpus.

__init__(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)
Parameters:
  • domain (Orange.data.Domain) – the domain for this Corpus
  • X (numpy.ndarray) – attributes
  • Y (numpy.ndarray) – class variables
  • metas (numpy.ndarray) – meta attributes; e.g. text
  • W (numpy.ndarray) – instance weights
  • text_features (list) – meta attributes that are used for text mining. If None, they are inferred.
  • ids (numpy.ndarray) – Indices
copy()

Return a copy of the table.

dictionary

corpora.Dictionary – A token to id mapper.
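
As a rough illustration of what a token-to-id mapper does, the pure-Python sketch below assigns each distinct token an integer id in order of first appearance. The helper name build_token_ids is hypothetical and is not part of the library; the actual corpora.Dictionary object may assign ids differently.

```python
def build_token_ids(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            # setdefault inserts a new id only the first time a token is seen.
            token2id.setdefault(token, len(token2id))
    return token2id


docs = [["human", "machine", "interface"], ["machine", "learning"]]
token2id = build_token_ids(docs)
print(token2id)
```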

documents

Returns a list of strings representing documents, created by joining the selected text features.

documents_from_features(feats)
Parameters: feats (list) – A list of features to join.

Returns: a list of strings constructed by joining feats.
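
The joining step can be pictured with a small pure-Python sketch. It does not call the library: rows are modelled as plain dicts, join_features is a hypothetical helper, and both the space separator and the skipping of empty values are assumptions made for illustration.

```python
def join_features(rows, feats, sep=" "):
    """Build one document string per row by joining the chosen fields."""
    documents = []
    for row in rows:
        # Skip empty or missing values so no stray separators appear.
        parts = [str(row[f]) for f in feats if row.get(f)]
        documents.append(sep.join(parts))
    return documents


rows = [
    {"title": "Corpus basics", "body": "Storing text.", "author": "A"},
    {"title": "", "body": "Only a body.", "author": "B"},
]
docs = join_features(rows, ["title", "body"])
print(docs)
```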

extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False)

Append features to the corpus. If the feature_values argument is given, the features will be Discrete; otherwise they will be Continuous.

Parameters:
  • X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append
  • feature_names (list) – List of strings containing feature names
  • feature_values (list) – A list of possible values for Discrete features.
  • compute_values (list) – Compute values for corresponding features.
  • var_attrs (dict) – Additional attributes appended to variable.attributes.
  • sparse (bool) – Whether the features should be marked as sparse.
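
The Discrete-versus-Continuous rule and the column append can be sketched without the library. The example below uses nested lists in place of NumPy or SciPy matrices, and extend_columns is a hypothetical helper rather than the method's actual implementation.

```python
def extend_columns(rows, names, new_rows, new_names, feature_values=None):
    """Append columns to each row and record the kind of every new feature."""
    if len(rows) != len(new_rows):
        raise ValueError("new values must have one row per existing row")
    # Mirrors the documented rule: Discrete when feature_values is given,
    # Continuous otherwise.
    kind = "discrete" if feature_values is not None else "continuous"
    kinds = {name: kind for name in new_names}
    extended = [old + new for old, new in zip(rows, new_rows)]
    return extended, names + new_names, kinds


rows = [[1.0, 2.0], [3.0, 4.0]]
extended, names, kinds = extend_columns(
    rows, ["a", "b"], [[0.5], [0.7]], ["score"]
)
print(extended, names, kinds)
```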
extend_corpus(metadata, Y)

Append documents to corpus.

Parameters:
  • metadata (numpy.ndarray) – Meta data
  • Y (numpy.ndarray) – Class variables
static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)

Create corpus from documents.

Parameters:
  • documents (list) – List of documents.
  • name (str) – Name of the corpus
  • attributes (list) – List of tuples (Variable, getter) for attributes.
  • class_vars (list) – List of tuples (Variable, getter) for class vars.
  • metas (list) – List of tuples (Variable, getter) for metas.
  • title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
Returns:

Corpus.
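
The (Variable, getter) pairs can be read as "a column description plus a function that extracts that column's value from one document". The sketch below shows that idea in plain Python only: strings stand in for Orange Variable objects, documents are assumed to be dicts, and build_columns is a hypothetical helper, not the library's code.

```python
def build_columns(documents, specs):
    """Apply each (name, getter) pair to every document, one row per document."""
    names = [name for name, _ in specs]
    rows = [[getter(doc) for _, getter in specs] for doc in documents]
    return names, rows


documents = [
    {"headline": "First", "text": "Alpha beta.", "label": "news"},
    {"headline": "Second", "text": "Gamma delta.", "label": "blog"},
]
# Each pair names a column and says how to read its value from a document.
metas = [("title", lambda d: d["headline"]), ("content", lambda d: d["text"])]
class_vars = [("category", lambda d: d["label"])]

meta_names, meta_rows = build_columns(documents, metas)
class_names, class_rows = build_columns(documents, class_vars)
print(meta_names, meta_rows)
print(class_names, class_rows)
```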

has_tokens()

Return whether corpus is preprocessed or not.

ngrams

generator – Ngram representations of documents.
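
For readers unfamiliar with n-grams, a minimal generator sketch follows. It is not the library's implementation: the n-gram range (1 to 2), the space separator, and the function names are all assumptions chosen for illustration.

```python
def ngrams(tokens, n_min=1, n_max=2, sep=" "):
    """Yield every n-gram of length n_min..n_max from one token list."""
    for n in range(n_min, n_max + 1):
        # A list shorter than n yields nothing for that n.
        for start in range(len(tokens) - n + 1):
            yield sep.join(tokens[start:start + n])


def corpus_ngrams(tokenized_docs, n_min=1, n_max=2):
    """Lazily yield the list of n-grams for each document in turn."""
    for tokens in tokenized_docs:
        yield list(ngrams(tokens, n_min, n_max))


result = list(corpus_ngrams([["text", "mining", "rocks"]]))
print(result)
```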

static retain_preprocessing(orig, new, key=Ellipsis)

Set preprocessing of ‘new’ object to match the ‘orig’ object.

set_text_features(feats)

Select which meta-attributes to include when mining text.

Parameters: feats (list or None) – List of text features to include. If None, they are inferred.
store_tokens(tokens, dictionary=None)
Parameters: tokens (list) – List of lists containing tokens.
titles

Returns a list of titles.

tokens

np.ndarray – A list of lists containing tokens. If tokens are not yet present, the default preprocessor is run and the resulting tokens are saved.
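
The interplay between has_tokens(), store_tokens() and the tokens attribute amounts to lazy computation with caching. The sketch below models that pattern in plain Python. TokenCache is a hypothetical class, and lower-casing plus whitespace splitting is only a stand-in for whatever the default preprocessor actually does.

```python
class TokenCache:
    """Toy model of a corpus that tokenizes on first access and caches the result."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None

    def has_tokens(self):
        return self._tokens is not None

    def store_tokens(self, tokens):
        self._tokens = tokens

    @property
    def tokens(self):
        if not self.has_tokens():
            # Stand-in for the default preprocessor (an assumption).
            self.store_tokens([doc.lower().split() for doc in self.documents])
        return self._tokens


cache = TokenCache(["Hello World", "Text mining"])
before = cache.has_tokens()
tokens = cache.tokens  # first access triggers tokenization
after = cache.has_tokens()
print(before, tokens, after)
```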