# Topic Modeling using Latent Dirichlet Allocation

We will use the same Twitter data that we previously downloaded from Kaggle:
[twitter-threads](https://www.kaggle.com/danielgrijalvas/twitter-threads/version/1).

Some of the code below is adapted from this [blog post](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/).

In [25]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import sqlite3
import pprint

conn = sqlite3.connect('twitter.db')
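# Decode the raw bytes as Latin-1. This never raises, but any multi-byte
# UTF-8 characters come out garbled (e.g. 'Â£'), as seen in the output below.
conn.text_factory = lambda x: str(x, 'latin1')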
curs = conn.cursor()

curs.execute('SELECT content FROM tweets')
#curs.execute('SELECT content FROM tweets LIMIT 200')

data = np.array(curs.fetchall()).flatten()

pprint.pprint(data[:3])

array([ 'Extraordinary evidence at Treasury committee from Jon Thompson, CEO of HMRC on customs and Brexit today https://t.co/DJhIQhmVwJ',
       "The Brexiter favourite Max Fac - would cost business between Â£17 and Â£20bn a year\r\r\n\r\r\n- that's almost 1% of GDP\r\r\n\r\r\n- jusâ?¦ https://t.co/0MwIcwre4t",
       'How does he arrive at the figure\r\r\n\r\r\n200m export consignments at an average cost of Â£32.50 each = Â£6.5bn (times two beâ?¦ https://t.co/KxnkU2QiVO'],
      dtype='<U182')
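
The `Â£` sequences in the output above look like UTF-8 bytes that were decoded as Latin-1. The following is a minimal sketch of that effect, assuming the tweets were stored as UTF-8 (the notebook does not confirm this); the sample string is made up for illustration.

```python
# Assumption: the original text was UTF-8 encoded.
original = "£17"

# Decoding UTF-8 bytes as Latin-1 maps each byte to one character,
# so the two-byte '£' turns into two characters.
garbled = original.encode("utf-8").decode("latin1")
print(garbled)   # Â£17

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)  # £17
```

Sequences such as `â?¦` in the output suggest that some bytes were already replaced before they reached the database, so this round trip would not repair every tweet.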


## Clean the data

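* Get rid of stopwords (compared in lowercase; the words we keep retain their original case)
* Get rid of tokens that contain digits or non-word characters
* TODO: get rid of non-English words
* TODO: stem/lemmatize the words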
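
The filtering rules in the list above can be sketched with the standard library alone. This is an illustration, not the notebook's actual pipeline: the stopword set and the sample tokens are made-up stand-ins, and `clean` is a hypothetical helper name.

```python
import re

# Tiny stand-in stopword set (assumption; the notebook merges two real lists).
STOPWORDS = {"the", "a", "of", "https", "http"}

# Matches any digit or non-word character anywhere in a token.
REJECT = re.compile(r"[\d\W]")

def clean(tokens):
    """Keep tokens that contain only word characters and are not stopwords."""
    kept = []
    for tok in tokens:
        if REJECT.search(tok):
            continue  # contains a digit or punctuation/symbol
        if tok.lower() in STOPWORDS:
            continue  # stopword, compared case-insensitively
        kept.append(tok)  # original case is preserved
    return kept

tokens = ["The", "cost", "of", "Brexit", "£17", "20bn", "https", "didnâ"]
print(clean(tokens))  # ['cost', 'Brexit', 'didnâ']
```

Note that `didnâ` survives: `â` counts as a word character in Python 3, so this pattern does not catch encoding debris made of accented letters.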

In [13]:
from textblob import TextBlob
from textblob import Word
from stop_words import get_stop_words
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
import re

# load stopwords from different stopword packages
# (a set makes the membership test in the loop below fast)
mystopwords = set(stopwords.words('english'))
mystopwords.update(get_stop_words('en'))
# add your own stop words here
mystopwords.update(['https', 'http'])

# regular expression to detect digits and non-word characters
# anywhere in a word (raw string so the backslashes reach the regex engine)
p = re.compile(r'[\d\W]')

wdata = []
for t in data:
    tb = TextBlob(t)
    wlist = []
    for w in tb.words:
        # add your own data cleaning code here
        # if numbers or non-alpha are found, ignore
        if p.search(w) is not None:
            continue
        # if w is a stopword, ignore
        if w.lower() in mystopwords:
            continue
        wlist.append(w)
        #ww = Word(w)
        #print(ww.synsets)
    wdata.append(wlist)
    
#pprint.pprint(wdata)

[nltk_data] Downloading package punkt to /Users/lipyeow/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lipyeow/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lipyeow/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Preparing the cleaned data

In [14]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Create Dictionary
id2word = corpora.Dictionary(wdata)

# Create Corpus
texts = wdata

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]]
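
Each pair in the output above is `(word id, count)` for one document. A rough standard-library stand-in for what the dictionary and `doc2bow` produce might look like the sketch below; `build_vocab` and `to_bow` are hypothetical helpers, the documents are made up, and gensim's actual id assignment may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = [["tax", "budget", "tax"], ["budget", "congress"]]
vocab = build_vocab(docs)
print(vocab)                   # {'tax': 0, 'budget': 1, 'congress': 2}
print(to_bow(docs[0], vocab))  # [(0, 2), (1, 1)]
```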


## Build the LDA Model

In [21]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=20,
                                           alpha='auto',
                                           per_word_topics=True)
lda_model.print_topics()

[(0,
  '0.023*"didnâ" + 0.016*"aâ" + 0.015*"money" + 0.013*"thing" + 0.012*"spent" + 0.011*"however" + 0.011*"fit" + 0.011*"coulâ" + 0.010*"â" + 0.010*"Trump"'),
 (1,
  '0.020*"one" + 0.015*"years" + 0.014*"another" + 0.014*"big" + 0.014*"horrid" + 0.014*"Porkulus" + 0.014*"gave" + 0.014*"omnibusbill" + 0.014*"bills" + 0.006*"need"'),
 (2,
  '0.025*"QAnon" + 0.015*"TheStorm" + 0.014*"GreatAwakening" + 0.009*"internetbillofrights" + 0.008*"realDonaldTrump" + 0.008*"OIGReport" + 0.008*"â" + 0.006*"Democrat" + 0.005*"take" + 0.004*"tâ"'),
 (3,
  '0.043*"â" + 0.037*"POTUS" + 0.012*"President" + 0.012*"must" + 0.012*"tell" + 0.012*"realized" + 0.012*"likeâ" + 0.012*"basically" + 0.012*"Omnibus" + 0.011*"Per"'),
 (4,
  '0.037*"Congress" + 0.029*"ð" + 0.026*"Obama" + 0.022*"like" + 0.014*"spend" + 0.013*"mâ" + 0.013*"never" + 0.012*"saw" + 0.012*"suspect" + 0.012*"pleases"')]

## Measuring the goodness of a topic model

Perplexity and topic coherence can both be used to choose the number of topics: train a model for each candidate number of topics and look for the "elbow" in the resulting score curve.
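
One way to run such a search is sketched below. `sweep_num_topics` and `gensim_coherence_for_k` are hypothetical helpers; the gensim calls inside mirror the ones used elsewhere in this notebook, and the scores in the demo are made up purely to exercise the helper.

```python
def sweep_num_topics(score_for_k, candidates):
    """Score one model per candidate topic count; return all scores and the best k."""
    scores = [(k, score_for_k(k)) for k in candidates]
    best_k = max(scores, key=lambda pair: pair[1])[0]
    return scores, best_k


def gensim_coherence_for_k(corpus, id2word, texts):
    """Build a scorer that trains LDA with k topics and returns its c_v coherence."""
    # Imported lazily so the sweep helper itself has no gensim dependency.
    import gensim
    from gensim.models import CoherenceModel

    def score(k):
        model = gensim.models.ldamodel.LdaModel(
            corpus=corpus, id2word=id2word, num_topics=k,
            random_state=100, passes=20, alpha='auto')
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=id2word, coherence='c_v')
        return cm.get_coherence()

    return score


# Demo with made-up scores (NOT results from the Twitter data).
toy_scores = {2: 0.31, 5: 0.46, 8: 0.47, 11: 0.44}
scores, best_k = sweep_num_topics(toy_scores.get, [2, 5, 8, 11])
print(scores)
print(best_k)  # 8
```

With the notebook's variables, the call would be `sweep_num_topics(gensim_coherence_for_k(corpus, id2word, wdata), range(2, 15, 3))`. Taking the maximum is only one choice; when the curve flattens, the elbow may favor a smaller number of topics than the maximum does.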

Topic Coherence Score: each topic is a ranked list of words, and coherence is computed over the top N words of the topic. It is defined as the average or median of the pairwise word-similarity scores of those words (e.g. PMI, pointwise mutual information). A higher score means the top words of a topic tend to occur together.
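
The definition above can be illustrated with a small PMI-based sketch. It is not the `c_v` measure computed in this notebook, which is more elaborate; `pmi_coherence` is a hypothetical helper, probabilities are estimated from document co-occurrence, and the toy documents are made up.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, documents, eps=1e-12):
    """Mean pairwise PMI of top_words, estimated from document co-occurrence.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    pair_scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        # eps keeps the logarithm finite when two words never co-occur.
        pair_scores.append(math.log((joint + eps) / (prob(w1) * prob(w2))))
    return sum(pair_scores) / len(pair_scores)

docs = [["tax", "budget", "congress"], ["tax", "budget"],
        ["cat", "dog"], ["cat", "dog", "pet"]]
coherent = pmi_coherence(["tax", "budget"], docs)
mixed = pmi_coherence(["tax", "cat"], docs)
print(coherent > mixed)  # True
```

Words that always appear together (`tax`, `budget`) score about log 2 here, while words that never share a document (`tax`, `cat`) score strongly negative.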

In [22]:
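# Compute Perplexity
# log_perplexity() returns the per-word log-likelihood bound, not the
# perplexity itself. gensim derives perplexity as 2^(-bound), so a bound
# closer to zero means a lower (better) perplexity.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  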

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=wdata, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -8.77383090521

Coherence Score:  0.458083782689
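
The value printed under "Perplexity" above is negative because it is a per-word log-likelihood bound. Assuming gensim's convention of perplexity = 2^(-bound), the bound can be converted as follows:

```python
# Bound copied from the output above.
bound = -8.77383090521

# gensim's convention: perplexity = 2 ** (-bound).
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

The result is in the hundreds, which is easier to compare across models than the raw bound.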


## Visualizing the topic models

The interactive visualization below should help you make sense of the topics. Note that LDA, like k-means, is somewhat sensitive to initial conditions, so a different random seed can produce different topics.

In [23]:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this; in pyLDAvis 3.x the module is named pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis