In natural language processing, the term topic means a set of words that “go together”.

Topic modeling is the task of automatically discovering the topics occurring in a collection of documents. A trained model may then be used to discern which of these topics occur in new documents. The model can also pick out which portions of a document cover which topics.

Consider Wikipedia. It has millions of documents covering hundreds of thousands of topics. Wouldn’t it be great if these could be discovered automatically? Plus a finer map of which documents cover which topics. These would be useful adjuncts for people seeking to explore Wikipedia.

We could also discover emerging topics, as documents get written about them. In some settings (such as news) where new documents are constantly being produced and recency matters, this would help us detect trending topics.

This post covers a statistically powerful and widely used approach to this problem.

Latent Dirichlet Allocation

This approach involves building explicit statistical models of topics and documents.

A topic is modeled as a probability distribution over a fixed set of words (the lexicon). This formalizes “the set of words that come to mind when referring to this topic”. A document is modeled as a probability distribution over a fixed set of topics. This reveals the topics the document covers.

Learning aims to discover, from a corpus of documents, good word distributions of the various topics, as well as good topic proportions in the various documents. The number of topics is a parameter to this learning.
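To make these two distributions concrete, here is a minimal sketch in Python; the variable names and toy probabilities are illustrative assumptions, not part of any particular LDA library.

# A topic: a probability distribution over the words of the lexicon
# (probabilities sum to 1).
topic = {"athlete": 0.4, "football": 0.3, "soccer": 0.2, "tennis": 0.1}

# A document: a probability distribution over a fixed set of topics
# (here the topics are identified by integer ids).
document = {1: 0.7, 2: 0.3}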

Generating A Document

At this stage, it will help to describe how to generate a synthetic document from a learned model. This will reveal key aspects of how this model operates that we haven’t delved into yet.

First, we’ll pick the topics this document will cover. One way to do this is to first pick a random document from our corpus, then set the new document’s topic proportions to those of the seed document.

Next, we’ll set the document length, call it n.

Next, we will repeat the following n times:

sample a topic from the document’s topic proportions
sample a word from the chosen topic’s word distribution

This will emit a sequence of n words. These words will come annotated with the topics they were sampled from.

The resulting document is gibberish. A bag of words sampled from a mix of topics. That’s not a problem — it wasn’t meant to be read. It does reveal which words were generated from which topics, which can be insightful.
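Here is a minimal sketch of this generative procedure in Python, assuming topics and topic proportions are represented as dictionaries like those shown earlier; the function name generate_document is my own.

import random

def generate_document(topics, topic_proportions, n):
    # topics: {topic_id: {word: probability}}
    # topic_proportions: {topic_id: proportion} for this document
    # Returns a list of n (topic, word) pairs.
    doc = []
    topic_ids = list(topic_proportions)
    topic_weights = list(topic_proportions.values())
    for _ in range(n):
        # Sample a topic from the document's topic proportions.
        t = random.choices(topic_ids, weights=topic_weights)[0]
        # Sample a word from that topic's word distribution.
        words = list(topics[t])
        word_weights = list(topics[t].values())
        w = random.choices(words, weights=word_weights)[0]
        doc.append((t, w))
    return doc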

Example

Lexicon: {athlete, football, soccer, tennis, computer, smartphone, laptop, printer, Intel, Apple, Google}
Num Topics: 3
1: {athlete, football, soccer, tennis}
2: {computer, smartphone, laptop, printer}
3: {Intel, Apple, Google}
Topic proportions in a document: { 2 ⇒ 70%, 3 ⇒ 30% }

In the above, we’ve described a topic as a set of words. We interpret this as: all the words in the set are equiprobable, and the remaining words in the lexicon have zero probability.

Let’s see a 4-word generated document.

Topic:  2      3       2          2
Word: laptop Intel smartphone computer

Topic 3’s proportion in this generated document (1 word out of 4, i.e. 25%) is close to its proportion (30%) in the sampling distribution.
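For instance, the 4-word document above could be produced by a call like the following, using the example’s topics with equiprobable words and assuming the generate_document sketch from the previous section; the printed output is just one possible random draw.

topics = {
    1: {w: 1 / 4 for w in ["athlete", "football", "soccer", "tennis"]},
    2: {w: 1 / 4 for w in ["computer", "smartphone", "laptop", "printer"]},
    3: {w: 1 / 3 for w in ["Intel", "Apple", "Google"]},
}
proportions = {2: 0.7, 3: 0.3}

print(generate_document(topics, proportions, n=4))
# One possible output: [(2, 'laptop'), (3, 'Intel'), (2, 'smartphone'), (2, 'computer')]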

Learning

As usual, this is where things get especially interesting.

First, let’s remind ourselves of the aim of learning. It is to discover, from a corpus of documents, the word distributions of the various topics, and the topic proportions in the various documents. In short, what words describe which topic, and which topics are covered in which document.

The algorithm we’ll describe is in wide use. It is also not hard to understand. It is a form of Gibbs Sampling.

This algorithm works by initially assigning the topics to the various words in the corpus somehow, then iteratively improving these assignments. During its operation, the algorithm keeps track of certain statistics on the current assignments. These statistics help the algorithm in its subsequent learning. When the algorithm terminates, it is easy to “read off” the per-topic word distributions and the per-document topic proportions from the final topic assignments.

Let’s start by describing the statistics mentioned in the previous paragraph. These take the form of two matrices of counts: topic_word and doc_topic. Both are derived from the current assignment of topics to the words in the corpus. topic_word(t, w) counts the number of occurrences of word w in the corpus that are currently assigned topic t. doc_topic(d, t) counts the number of word occurrences in document d that are currently assigned topic t.
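Here is a small sketch of how these two count matrices could be tallied from a current assignment; representing each document as a list of (word, topic) pairs is my own illustrative choice.

from collections import defaultdict

def count_matrices(corpus):
    # corpus: list of documents; each document is a list of (word, topic)
    # pairs reflecting the current topic assignment.
    topic_word = defaultdict(lambda: defaultdict(int))
    doc_topic = defaultdict(lambda: defaultdict(int))
    for d, doc in enumerate(corpus):
        for word, topic in doc:
            topic_word[topic][word] += 1
            doc_topic[d][topic] += 1
    return topic_word, doc_topic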

Let’s see a numeric example to make sure we got it right. Below we see a two-document corpus along with an assignment of topics to its words. The lexicon is A, B, C.

Doc 1’s words:  A B A C A        Doc 2’s words:  B C C B
Doc 1’s topics: 1 1 1 2 2        Doc 2’s topics: 2 2 2 2

Actually, let’s first use this opportunity to muse about some peculiarities we see. In doc 1, notice that A is assigned sometimes to topic 1 and sometimes to topic 2. This is plausible if word A has a high probability in both topics. In doc 2, notice that B is consistently assigned to topic 2. This is plausible if Doc 2 covers only topic 2, and B has a positive probability in topic 2’s distribution.

Okay, now to the two matrices of counts.

topic_word:             doc_topic:
   A  B  C                  1  2
1  2  1  0              d1  3  2
2  1  2  3              d2  0  4

A couple of entries are a bit striking: doc_topic(d2, 2) = 4 and topic_word(2, C) = 3. Perhaps doc 2 prefers topic 2. Perhaps topic 2 prefers word C.
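Running the count_matrices sketch from above on this two-document corpus yields the same counts (entries with zero counts simply don’t appear in the dictionaries):

corpus = [
    list(zip("ABACA", [1, 1, 1, 2, 2])),  # doc 1
    list(zip("BCCB",  [2, 2, 2, 2])),     # doc 2
]
topic_word, doc_topic = count_matrices(corpus)
print(dict(topic_word[1]))  # {'A': 2, 'B': 1}
print(dict(topic_word[2]))  # {'C': 3, 'A': 1, 'B': 2}
print(dict(doc_topic[0]))   # {1: 3, 2: 2}
print(dict(doc_topic[1]))   # {2: 4}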
