Latent Dirichlet Allocation for Topic Modelling

Overview

Latent Dirichlet Allocation (LDA) is used to model text corpora and discover topics addressed by a document.

The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

LDA assumes the following generative process for each document w in a corpus D:

Choose N ~ Poisson(ξ).
Choose θ ~ Dir(α).
For each of the N words wn: (a) Choose a topic zn ~ Multinomial(θ). (b) Choose a word wn from p(wn |zn, β), a multinomial probability conditioned on the topic zn.

Assuming the above described generative process we estimate the parameters. The posterior distribution is intractable for exact inference, hence we need to use approximate inference algorithms like Laplace approximation, variational approximation, and Markov chain Monte Carlo etc.

This code uses Collapsed Gibbs Sampling (MCMC) method.

Input Format

The input corpus is expected to in the following format

	### doc_id 1 ###
	word1
	word2
	...
	###EOF###
	...
	...
	### doc_id 100 ###
	word500
	word501
	...
	###EOF###

To create an input file in the above format, do:

	for each document d:
		print ### doc id ###
		for each word in document d:
			print word
		print ###EOF###

Example: Topic Modelling on NIPS Papers

These are some of the topic learned by LDA from NIPS papers. The topics names are inferred from its top words.

Topic Modelling/LDA: topic, lda, words, document, topics, documents, word, text, dirichlet, distribution, latent, corpus, semantic, perplexity, figure, probability, sense, distributions, web, collection, inference, specific, space, vocabulary, multinomial, terms, models, prior, nips, number, related, content, allocation

Image Peocessing/ Computer Vision: image, images, pixels, patches, pixel, patch, figure, color, resolution, natural, set, results, fig, scale, texture, high, input, denoising, field, method, intensity, gaussian, work, statistics, low, real, shown, values, small, filter, range, filters, non, similar, based, reconstruction, left, single, applied, row

Encoding/Decoding: sparse, coding, basis, code, representation, dictionary, reconstruction, bases, codes, error, representations, vectors, coefficients, linear, number, figure, energy, sparseness, decoder, encoding, overcomplete, learned, decoding, sparsity, coordinate, shown, compression, term, non, average, encoder

Kernels: kernel, kernels, space, mkl, gaussian, linear, operator, rkhs, hilbert, section, rbf, positive, defined, spaces, svm, based, set, feature, definite, reproducing, case, operators, eq, theorem, dimensional, compact, product, paper, methods, characteristic, kx, functions, measure, note, basis, mapping, svms

Clustering: clustering, cluster, clusters, data, means, spectral, number, similarity, based, affinity, normalized, cut, partition, points, measure, set, dataset, partitioning, approach, distance, results, centers, method, pairwise, quality, center, clusterings, partitions, eigenvectors, average, objective,

Neural Networks: layer, hidden, units, training, input, layers, network, deep, output, networks, trained, rbm, unit, neural, visible, set, learning, weights, architecture, unsupervised, convolutional, features, results, function, recognition, mnist, vector, test, inputs, representations, top, connections, energy, weight, error

Belief Propagation: bp, propagation, belief, message, inference, messages, marginals, variables, marginal, product, exact, mrf, approximate, approximation, variable, potentials, passing, xs, energy, loopy, factor, graphical, node, nodes, max, assignment, pairwise, sum, free, graph, field, bethe, tree, partition, fixed, figure, local

Boosting: boosting, weak, adaboost, classifier, examples, base, classifiers, learner, error, distribution, risk, margin, hypothesis, ensemble, bound, training, set, pac, learners, ht, dt, probability, bayes, hypotheses, gq, ln, oost, algorithms, random, weighted, positive, algorithm, h1, noise, pr, final, def, decision

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.gitignore		.gitignore
README.md		README.md
lda.py		lda.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Latent Dirichlet Allocation for Topic Modelling

Overview

Input Format

Example: Topic Modelling on NIPS Papers

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Latent Dirichlet Allocation for Topic Modelling

Overview

Input Format

Example: Topic Modelling on NIPS Papers

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages