Document Embedding
==================

Embeds documents from the input corpus into a vector space using the pre-trained [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) models described in E. Grave et al. (2018).

**Inputs**

- Corpus: A collection of documents.

**Outputs**

- Corpus: Corpus with new features appended.

**Document Embedding** parses the n-grams of each document in the corpus, obtains an embedding for each n-gram from the pre-trained model for the chosen language, and aggregates the n-gram embeddings into a single vector per document using one of the offered aggregators. The method works on any n-grams, but it gives the best results when the corpus is preprocessed so that the n-grams are words, because the models were trained to embed words.

![](images/Document-Embedding-stamped.png)

1. Widget parameters:
   - Language: the widget uses a model trained on documents in the chosen language.
   - Aggregator: the operation that combines the n-gram embeddings into a single document vector.
2. Cancel the current execution.
3. If *Apply automatically* is checked, changes in parameters are sent automatically. Alternatively, press *Apply*.

Embedding retrieval
-------------------

**Document Embedding** takes n-grams (tokens), usually produced by the [Preprocess Text](preprocesstext.md) widget. You can inspect the tokens in the [Corpus Viewer](corpusviewer.md) widget by selecting *Show tokens and tags*, or in [Word Cloud](wordcloud.md). The tokens are sent to a server, where each token is [vectorized](https://fasttext.cc/docs/en/python-module.html#model-object) separately and the aggregation function then combines the token vectors into a document embedding. The server returns one vector per document. Currently, the server runs `fasttext==0.9.1`. For out-of-vocabulary (OOV) words, fastText obtains a vector by summing the vectors of the word's component character n-grams.
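The server-side computation is roughly equivalent to running fastText locally. Below is a minimal sketch, assuming the pre-trained English model `cc.en.300.bin` has been downloaded from the fastText website; the model path and the token list are illustrative placeholders, not part of the widget itself:

```python
import numpy as np
import fasttext  # fasttext==0.9.1, the version the server runs

# Placeholder path: the pre-trained English model (cc.en.300.bin) can be
# downloaded from https://fasttext.cc/docs/en/crawl-vectors.html
model = fasttext.load_model("cc.en.300.bin")

def embed_document(tokens, aggregator=np.mean):
    """Vectorize each token separately, then aggregate into one vector."""
    # get_word_vector also handles OOV tokens: fastText sums the vectors
    # of the token's component character n-grams.
    vectors = np.array([model.get_word_vector(t) for t in tokens])
    return aggregator(vectors, axis=0)

tokens = ["the", "little", "prince", "asked", "many", "questions"]
doc_vector = embed_document(tokens, aggregator=np.sum)
print(doc_vector.shape)  # (300,)
```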
Examples
--------

In the first example, we inspect how the widget works. Load *book-excerpts.tab* with the [Corpus](corpus-widget.md) widget and connect it to **Document Embedding**. Check the output data by connecting **Document Embedding** to a **Data Table**. We see 300 additional features that the widget has appended.

![](images/Document-Embedding-Example1.png)

In the second example, we try to predict the document category. We keep working on *book-excerpts.tab*, loaded with the [Corpus](corpus-widget.md) widget and sent through [Preprocess Text](preprocesstext.md) with default parameters. Connect **Preprocess Text** to **Document Embedding** to obtain features for predictive modelling. Here we set the aggregator to Sum. Connect **Document Embedding** to **Test and Score**, and also connect a learner of choice to the left side of **Test and Score**. We chose SVM and changed its kernel to Linear. **Test and Score** now computes the performance of each learner on the input. The results are very good. Let's inspect the confusion matrix: connect **Test and Score** to **Confusion Matrix**. Clicking *Select Misclassified* outputs the documents that were misclassified; we can inspect them further by connecting [Corpus Viewer](corpusviewer.md) to **Confusion Matrix**.

![](images/Document-Embedding-Example2.png)
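Outside the Orange canvas, the evaluation step of this workflow can be reproduced in a few lines with scikit-learn. This is a hedged sketch, not Orange's internal implementation: `X` stands in for the 300 embedding features appended by **Document Embedding** and `y` for the document categories, both replaced here by random placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: in the workflow above, X would hold the 300 embedding
# features and y the book categories of book-excerpts.tab.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 300))
y = rng.integers(0, 2, size=140)

# SVM with a linear kernel, scored by cross-validation, mirroring the
# Test and Score step in the canvas.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())
```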
References
----------

E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov. "Learning Word Vectors for 157 Languages." *Proceedings of the International Conference on Language Resources and Evaluation*, 2018.