Corpus
-
class
orangecontrib.text.corpus.
Corpus
(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None) Internal class for storing a corpus.
-
__init__
(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)
Parameters: - domain (Orange.data.Domain) – the domain for this Corpus
- X (numpy.ndarray) – attributes
- Y (numpy.ndarray) – class variables
- metas (numpy.ndarray) – meta attributes; e.g. text
- W (numpy.ndarray) – instance weights
- text_features (list) – meta attributes used for text mining. If None, they are inferred.
- ids (numpy.ndarray) – Indices
-
dictionary
corpora.Dictionary – A token-to-id mapper.
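To make the idea of a token-to-id mapper concrete, here is a minimal pure-Python sketch. It is not the corpora.Dictionary implementation; the helper name build_token_ids and the rule of numbering tokens in order of first appearance are assumptions made for illustration only.

```python
def build_token_ids(tokenized_docs):
    """Map each distinct token to an integer id.

    Hypothetical helper: ids are assigned in order of first appearance,
    which is an illustrative choice, not the library's rule.
    """
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


docs = [["text", "mining"], ["mining", "corpus"]]
token_ids = build_token_ids(docs)
print(token_ids)  # {'text': 0, 'mining': 1, 'corpus': 2}
```

Each document can then be represented by the ids of its tokens instead of the strings themselves.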
-
documents
Returns – a list of strings representing documents, created by joining the selected text features.
-
documents_from_features
(feats)
Parameters: feats (list) – A list of features to join.
Returns: a list of strings constructed by joining feats.
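The joining step can be sketched in plain Python. This is only an illustration of the idea, not the Corpus implementation: rows are modelled as dictionaries, and the function name join_features, the space separator, and the skipping of empty values are all assumptions.

```python
def join_features(rows, feature_names, separator=" "):
    """Build one string per row by joining the values of the chosen features.

    Hypothetical helper; the separator and the skipping of empty values
    are illustrative assumptions.
    """
    documents = []
    for row in rows:
        parts = [str(row[name]) for name in feature_names if row.get(name)]
        documents.append(separator.join(parts))
    return documents


rows = [
    {"title": "Intro", "body": "Text mining basics", "year": 2016},
    {"title": "", "body": "Tokens and n-grams", "year": 2017},
]
joined = join_features(rows, ["title", "body"])
print(joined)  # ['Intro Text mining basics', 'Tokens and n-grams']
```

Only the features named in the list contribute to the resulting strings; the year column is ignored here.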
-
extend_attributes
(X, feature_names, feature_values=None, compute_values=None, var_attrs=None) Append features to corpus. If the feature_values argument is present, the features will be Discrete; otherwise they will be Continuous.
Parameters: - X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append
- feature_names (list) – List of strings containing feature names
- feature_values (list) – A list of possible values for Discrete features.
- compute_values (list) – Compute values for corresponding features.
- var_attrs (dict) – Additional attributes appended to variable.attributes.
-
extend_corpus
(metadata, Y) Append documents to corpus.
Parameters: - metadata (numpy.ndarray) – Meta data
- Y (numpy.ndarray) – Class variables
-
static
from_documents
(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None) Create corpus from documents.
Parameters: - documents (list) – List of documents.
- name (str) – Name of the corpus
- attributes (list) – List of tuples (Variable, getter) for attributes.
- class_vars (list) – List of tuples (Variable, getter) for class vars.
- metas (list) – List of tuples (Variable, getter) for metas.
- title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
Returns: Corpus.
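The (Variable, getter) pairs can be read as column specifications. The sketch below illustrates that reading in plain Python, assuming each getter is a callable applied to a single document; plain strings stand in for Variable objects, and the helper name build_columns is hypothetical. It does not call the Orange API.

```python
def build_columns(documents, specs):
    """Apply each getter to each document to fill a table row by row.

    specs is a list of (column_name, getter) pairs; column names stand in
    for Variable objects in this illustration.
    """
    names = [name for name, _ in specs]
    table = [[getter(doc) for _, getter in specs] for doc in documents]
    return names, table


raw_docs = [
    {"headline": "Intro", "content": "Text mining basics"},
    {"headline": "Tokens", "content": "Tokens and n-grams"},
]
meta_specs = [
    ("Title", lambda doc: doc["headline"]),
    ("Length", lambda doc: len(doc["content"].split())),
]
names, table = build_columns(raw_docs, meta_specs)
print(names)  # ['Title', 'Length']
print(table)  # [['Intro', 3], ['Tokens', 3]]
```

Under this reading, the attributes, class_vars, and metas arguments would each yield one such block of columns.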
-
ngrams
generator – Ngram representations of documents.
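As a rough illustration of what an n-gram representation of tokenized documents looks like, the following generator-based sketch may help. The names token_ngrams and corpus_ngrams, the n_min/n_max range, and the space joiner are assumptions; the library's own defaults may differ.

```python
def token_ngrams(tokens, n_min=1, n_max=2, joiner=" "):
    """Yield every n-gram of the token list for n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield joiner.join(tokens[start:start + n])


def corpus_ngrams(tokenized_docs, n_min=1, n_max=2):
    """Lazily yield one list of n-grams per document."""
    for tokens in tokenized_docs:
        yield list(token_ngrams(tokens, n_min, n_max))


example = list(token_ngrams(["a", "b", "c"]))
print(example)  # ['a', 'b', 'c', 'a b', 'b c']
```

Because corpus_ngrams is a generator, the n-grams of each document are produced only when they are consumed.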
-
static
retain_preprocessing
(orig, new, key=Ellipsis) Set the preprocessing of the ‘new’ object to match that of the ‘orig’ object.
-
set_text_features
(feats) Select which meta-attributes to include when mining text.
Parameters: feats (list or None) – List of text features to include. If None, they are inferred.
-
store_tokens
(tokens, dictionary=None)
Parameters: tokens (list) – List of lists containing tokens.
-
titles
Returns a list of titles.
-
tokens
np.ndarray – A list of lists containing tokens. If tokens are not yet present, the default preprocessor is run and the resulting tokens are saved.
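The compute-on-first-access behaviour described for tokens can be sketched with a small stand-in class. TinyCorpus, its lowercase-and-split preprocessor, and the preprocess_calls counter are hypothetical; they only illustrate the caching pattern and are not the Corpus implementation.

```python
class TinyCorpus:
    """Stand-in class showing tokens computed lazily and then cached."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None
        self.preprocess_calls = 0  # counts how often preprocessing ran

    def _default_preprocess(self):
        # Assumed default: lowercase each document and split on whitespace.
        self.preprocess_calls += 1
        return [doc.lower().split() for doc in self.documents]

    def store_tokens(self, tokens):
        self._tokens = tokens

    @property
    def tokens(self):
        # Run the default preprocessor only if no tokens are stored yet.
        if self._tokens is None:
            self.store_tokens(self._default_preprocess())
        return self._tokens


tiny = TinyCorpus(["Hello World", "Text mining"])
first = tiny.tokens
second = tiny.tokens
print(first)                  # [['hello', 'world'], ['text', 'mining']]
print(tiny.preprocess_calls)  # 1
```

Tokens supplied explicitly through store_tokens would bypass the default preprocessing step in this sketch.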
-