lda: Topic modeling with latent Dirichlet Allocation

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and can be installed without a compiler on Linux and macOS.

The interface follows conventions found in scikit-learn. See the Getting started section below for a worked example on a subset of the Reuters news dataset.

NOTE: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.

Getting started

The following demonstrates how to inspect a model of a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).

>>> import numpy as np
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic bernardin cardinal bishop wright death
Topic 18: harriman clinton u.s ambassador paris president churchill
Topic 19: city museum art exhibition century million churches

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
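
Each row of doc_topic_ is a point estimate of a distribution over topics, so it sums to one. A quick sanity check (illustrative, not part of the original example):

>>> assert np.allclose(doc_topic.sum(axis=1), 1.0)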

Document-topic distributions may be inferred for out-of-sample texts using the transform method:

>>> X = lda.datasets.load_reuters()
>>> titles = lda.datasets.load_reuters_titles()
>>> X_train = X[10:]
>>> X_test = X[:10]
>>> titles_test = titles[:10]
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X_train)
>>> doc_topic_test = model.transform(X_test)
>>> for title, topics in zip(titles_test, doc_topic_test):
...     print("{} (top topic: {})".format(title, topics.argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 7)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 11)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 4)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 7)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 4)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 4)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 4)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 4)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 4)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 11)

(Note that the topic numbers have changed because LDA is not an identifiable model; this phenomenon is known as label switching in the literature.)
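
If topics from two fitted models must be compared despite label switching, one workable approach (not part of lda's API; it assumes SciPy is installed) is to pair topics by the similarity of their topic-word rows using the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_topics(topic_word_a, topic_word_b):
    """Map each topic index in model A to its most similar topic in model B."""
    a = topic_word_a / np.linalg.norm(topic_word_a, axis=1, keepdims=True)
    b = topic_word_b / np.linalg.norm(topic_word_b, axis=1, keepdims=True)
    similarity = a @ b.T                             # pairwise cosine similarities
    rows, cols = linear_sum_assignment(-similarity)  # maximize total similarity
    return dict(zip(rows, cols))                     # A topic k <-> B topic cols[k]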

Convergence may be monitored by accessing the loglikelihoods_ attribute on a fitted model. The attribute is a list recording the model's log likelihood at successive iterations, thinned by the refresh parameter (one entry every refresh iterations).

(The following code assumes matplotlib is installed.)

>>> import matplotlib.pyplot as plt
>>> # skipping the first few entries makes the graph more readable
>>> plt.plot(model.loglikelihoods_[5:])
[figure: log likelihood trace across recorded iterations]

Judging convergence from the plot, the model should be fit with a slightly greater number of iterations.
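
One hedged way to act on that observation (the n_iter value and the convergence heuristic here are illustrative, not a prescription):

model = lda.LDA(n_topics=20, n_iter=2500, random_state=1).fit(X_train)
tail = model.loglikelihoods_[-10:]  # last few recorded values, one per `refresh` iterations
print(np.diff(tail))                # differences hovering near zero suggest a plateau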

Installing lda

lda requires Python (>= 3.6) and NumPy (>= 1.13.0). If these requirements are satisfied, lda should install successfully on Linux and macOS with:

pip install lda

If you encounter problems, consult the platform-specific instructions below.

macOS

lda and its dependencies are all available as wheel packages for macOS:

pip install lda

Linux

lda and its dependencies are all available as wheel packages for most distributions of Linux:

pip install lda

Windows

lda must be built from source on Windows. There are no wheels at this time.

Installation from source

Installing from source requires you to have installed the Python development headers and a working C/C++ compiler. Under Debian-based operating systems, which include Ubuntu, you can install all these requirements by issuing:

sudo apt-get install build-essential python3-dev python3-setuptools \
                     python3-numpy

Before attempting a command such as python setup.py install, you will need to run Cython to generate the relevant C files:

make cython

API Reference

This page contains auto-generated API reference documentation [1].

lda

Submodules

lda._setup_hooks
Module Contents
Functions
sdist_pre_hook(cmdobj) Ensure Cython has compiled all pyx files to C.
lda._setup_hooks.sdist_pre_hook(cmdobj)

Ensure Cython has compiled all pyx files to C.

lda.datasets
Module Contents
Functions
load_reuters()
load_reuters_vocab()
load_reuters_titles()
lda.datasets._test_dir
lda.datasets.load_reuters()
lda.datasets.load_reuters_vocab()
lda.datasets.load_reuters_titles()
lda.lda

Latent Dirichlet allocation using collapsed Gibbs sampling

Module Contents
Classes
LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10) Latent Dirichlet allocation using collapsed Gibbs sampling
lda.lda.logger
lda.lda.PY2
lda.lda.range
class lda.lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)

Latent Dirichlet allocation using collapsed Gibbs sampling

Parameters:
n_topics : int

Number of topics

n_iter : int, default 2000

Number of sampling iterations

alpha : float, default 0.1

Dirichlet parameter for distribution over topics

eta : float, default 0.01

Dirichlet parameter for distribution over words

random_state : int or RandomState, optional

The generator used for the initial topics.
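
For orientation, alpha and eta are the symmetric Dirichlet hyperparameters of the standard LDA generative process, in the notation of the references below:

    \theta_d \sim \mathrm{Dirichlet}(\alpha) \qquad \phi_k \sim \mathrm{Dirichlet}(\eta)
    z_{dn} \mid \theta_d \sim \mathrm{Categorical}(\theta_d) \qquad w_{dn} \mid z_{dn} \sim \mathrm{Categorical}(\phi_{z_{dn}})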

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.

Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (2004): 5228–5235. doi:10.1073/pnas.0307752101.

Wallach, Hanna, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1973–1981, 2009.

Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno. “Evaluation Methods for Topic Models.” In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), 1105–1112. New York: ACM, 2009. doi:10.1145/1553374.1553515.

Buntine, Wray. “Estimating Likelihoods for Topic Models.” In Advances in Machine Learning, First Asian Conference on Machine Learning (2009): 51–64. doi:10.1007/978-3-642-05224-8_6.

Examples

>>> import numpy
>>> X = numpy.array([[1,1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> import lda
>>> model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
>>> model.fit(X) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
LDA(alpha=...
>>> model.components_
array([[ 0.85714286,  0.14285714],
       [ 0.45      ,  0.55      ]])
>>> model.loglikelihood() #doctest: +ELLIPSIS
-40.395...
Attributes:
`components_` : array, shape = [n_topics, n_features]

Point estimate of the topic-word distributions (Phi in the literature)

`topic_word_` :

Alias for components_

`nzw_` : array, shape = [n_topics, n_features]

Matrix of counts recording topic-word assignments in final iteration.

`ndz_` : array, shape = [n_samples, n_topics]

Matrix of counts recording document-topic assignments in final iteration.

`doc_topic_` : array, shape = [n_samples, n_topics]

Point estimate of the document-topic distributions (Theta in the literature)

`nz_` : array, shape = [n_topics]

Array of topic assignment counts in final iteration.

fit(self, X, y=None)

Fit the model with X.

Parameters:
X : array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features. Sparse matrices are allowed.

Returns:
self : object

Returns the instance itself.

fit_transform(self, X, y=None)

Apply dimensionality reduction to X

Parameters:
X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features. Sparse matrices are allowed.

Returns:
doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions

transform(self, X, max_iter=20, tol=1e-16)

Transform the data X according to the previously fitted model

Parameters:
X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features.

max_iter : int, optional

Maximum number of iterations in iterated-pseudocount estimation.

tol : double, optional

Tolerance value used in stopping condition.

Returns:
doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions

_transform_single(self, doc, max_iter, tol)

Transform a single document according to the previously fit model

Parameters:
doc : 1D numpy array of integers

Each element represents a word in the document

max_iter : int

Maximum number of iterations in iterated-pseudocount estimation.

tol : double

Tolerance value used in stopping condition.

Returns:
doc_topic : 1D numpy array of length n_topics

Point estimate of the topic distributions for the document

_fit(self, X)

Fit the model to the data X

Parameters:
X : array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features. Sparse matrices are allowed.

_initialize(self, X)
loglikelihood(self)

Calculate complete log likelihood, log p(w,z)

Formula used is log p(w,z) = log p(w|z) + log p(z)
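
Under the symmetric priors above, the two terms have the closed form given by Griffiths and Steyvers (2004); the counts n_{kw}, n_{dk}, and n_k below correspond to the nzw_, ndz_, and nz_ attributes, with n_d the length of document d:

    \log p(w \mid z) = K \log\frac{\Gamma(V\eta)}{\Gamma(\eta)^V} + \sum_{k=1}^{K}\Big[\sum_{w=1}^{V}\log\Gamma(n_{kw}+\eta) - \log\Gamma(n_k+V\eta)\Big]
    \log p(z) = D \log\frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} + \sum_{d=1}^{D}\Big[\sum_{k=1}^{K}\log\Gamma(n_{dk}+\alpha) - \log\Gamma(n_d+K\alpha)\Big]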

_sample_topics(self, rands)

Samples all topic assignments. Called once per iteration.
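
Each sweep resamples every token's topic from the standard collapsed Gibbs full conditional (Griffiths and Steyvers, 2004), where the superscript -i denotes counts excluding token i:

    p(z_i = k \mid z_{-i}, w) \propto \frac{n^{-i}_{k w_i} + \eta}{n^{-i}_{k} + V\eta}\,\big(n^{-i}_{d_i k} + \alpha\big)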

lda.utils
Module Contents
Functions
check_random_state(seed)
matrix_to_lists(doc_word) Convert a (sparse) matrix of counts into arrays of word and doc indices
lists_to_matrix(WS, DS) Convert array of word (or topic) and document indices to doc-term array
dtm2ldac(dtm, offset=0) Convert a document-term matrix into an LDA-C formatted file
ldac2dtm(stream, offset=0) Convert an LDA-C formatted file to a document-term array
lda.utils.PY2
lda.utils.zip
lda.utils.logger
lda.utils.check_random_state(seed)
lda.utils.matrix_to_lists(doc_word)

Convert a (sparse) matrix of counts into arrays of word and doc indices

Parameters:
doc_word : array or sparse matrix (D, V)

document-term matrix of counts

Returns:
(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word

lda.utils.lists_to_matrix(WS, DS)

Convert array of word (or topic) and document indices to doc-term array

Parameters:
(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word

Returns:
doc_word : array (D, V)

document-term array of counts

lda.utils.dtm2ldac(dtm, offset=0)

Convert a document-term matrix into an LDA-C formatted file

Parameters:
dtm : array of shape (N, V)
Returns:
doclines : iterable of LDA-C lines suitable for writing to file

Notes

If a format similar to SVMLight is desired, an offset of 1 may be used.

lda.utils.ldac2dtm(stream, offset=0)

Convert an LDA-C formatted file to a document-term array

Parameters:
stream : file object

File yielding unicode strings in LDA-C format.

Returns:
dtm : array of shape (N, V)

Notes

If a format similar to SVMLight is the source, an offset of 1 may be used.
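
These four helpers compose into round trips. A minimal sketch (the exact element ordering of WS/DS, the LDA-C string formatting shown in the comment, and passing a plain iterable of lines as the stream are assumptions of this illustration, not documented guarantees):

import numpy as np
from lda import utils

dtm = np.array([[2, 0, 1],
                [0, 1, 1]])
WS, DS = utils.matrix_to_lists(dtm)  # one entry per word token
assert np.array_equal(utils.lists_to_matrix(WS, DS), dtm)

doclines = list(utils.dtm2ldac(dtm))  # LDA-C lines, e.g. '2 0:2 2:1' for the first row
assert np.array_equal(utils.ldac2dtm(iter(doclines), offset=0), dtm)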

Package Contents

Classes
LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10) Latent Dirichlet allocation using collapsed Gibbs sampling
class lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)

Re-export of lda.lda.LDA; see the lda.lda module above for the full parameter, attribute, and method documentation.

lda.__version__
[1] Created with sphinx-autoapi

Contributing

Style Guidelines

Before contributing a patch, please read the Python “Style Commandments” written by the OpenStack developers: http://docs.openstack.org/developer/hacking/

Building in Develop Mode

To build in develop mode on macOS, first install Cython and pbr. Then run:

git clone https://github.com/lda-project/lda.git
cd lda
make cython
python setup.py develop

What’s New

v2.0.0 (17. August 2020)

  • Drop support for Python 2.7
  • Wheels for Python 3.8

v1.1.0 (9. September 2018)

  • Wheels for Python 3.7
  • Minimum required NumPy version is 1.13.0.
  • Major speed increase in data loading. Thanks @luoshao23.
  • Bugfix in Cython searchsorted function. Thanks @luoshao23.

v1.0.5 (18. June 2017)

  • Wheels for Python 3.6

v1.0.4 (13. July 2016)

  • Linux wheels (manylinux1)

v1.0.3 (5. November 2015)

  • Python 3.5 wheels
  • Release GIL during sampling
  • Many minor fixes
