I’m excited to share that my first Python package, tinytopics, is now available on PyPI. You can install it with:

```bash
pip3 install tinytopics
```
tinytopics is a minimalist solution designed to scale up topic modeling tasks on CPUs and GPUs using PyTorch.
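If you want a quick taste before reading further, a minimal fit looks roughly like this. This is a sketch from my reading of the package docs; the helper names `generate_synthetic_data` and `fit_model` and their exact signatures are assumptions on my part, so check the vignettes below for the authoritative version.

```python
import torch
from tinytopics.fit import fit_model
from tinytopics.utils import set_random_seed, generate_synthetic_data

set_random_seed(42)

# Simulate a small document-term count matrix, then fit k topics.
# (Helper names assumed from the docs; verify against the vignettes.)
n, m, k = 1000, 500, 10
X, true_L, true_F = generate_synthetic_data(n, m, k)
model, losses = fit_model(X, k)
```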
Motivation
Fitting topic models at scale using classical algorithms on CPUs can be slow. Carbonetto et al. (2022) demonstrated the equivalence between Poisson non-negative matrix factorization (NMF) and multinomial topic model likelihoods. They proposed a novel optimization strategy: fit a Poisson NMF via coordinate descent, then recover the corresponding topic model through a simple transformation. This method was implemented in their R package, fastTopics.
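Concretely, the recovery step is a normalization. Up to notation, this is my paraphrase rather than the paper’s exact statement, so treat the details as an assumption and see Carbonetto et al. (2022) or the fastTopics `poisson2multinom()` helper for the authoritative form: if the Poisson NMF factorizes the $n \times m$ document-term count matrix as $X \approx LF$ with $L \in \mathbb{R}^{n \times k}$ and $F \in \mathbb{R}^{k \times m}$, then

$$
\hat{f}_{kj} = \frac{f_{kj}}{\sum_{j'} f_{kj'}},
\qquad
\hat{l}_{ik} = \frac{l_{ik} \sum_{j} f_{kj}}{\sum_{k'} l_{ik'} \sum_{j'} f_{k'j'}},
$$

so that each topic’s term distribution and each document’s topic proportions sum to one.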
Building on this theoretical insight, tinytopics takes a different approach by directly optimizing a sum-to-one constrained neural Poisson NMF problem with stochastic gradient methods.
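To make that concrete, here is a minimal, self-contained sketch of the idea, written by me for illustration rather than taken from the tinytopics source: parametrize the factors through a softmax so the sum-to-one constraints hold by construction, then minimize the Poisson negative log-likelihood with a stochastic gradient optimizer.

```python
import torch

# Toy document-term counts: n documents, m terms, k topics.
n, m, k = 100, 500, 10
X = torch.randint(0, 5, (n, m)).float()

# Unconstrained parameters; softmax maps each row onto the simplex,
# which enforces the sum-to-one constraints by construction.
L_raw = torch.randn(n, k, requires_grad=True)
F_raw = torch.randn(k, m, requires_grad=True)

optimizer = torch.optim.Adam([L_raw, F_raw], lr=0.01)
loss_fn = torch.nn.PoissonNLLLoss(log_input=False)

for epoch in range(200):
    L = torch.softmax(L_raw, dim=1)  # (n, k) document-topic proportions
    F = torch.softmax(F_raw, dim=1)  # (k, m) topic-term distributions
    # Rows of LF sum to one, so scale the Poisson rate by document totals.
    rate = X.sum(dim=1, keepdim=True) * (L @ F)
    loss = loss_fn(rate, X)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The softmax reparameterization is one standard way to handle simplex constraints in gradient-based fitting, and the same loop runs on a GPU after moving `X` and the parameters to the device.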
When to use tinytopics
For standard topic modeling tasks, I think fastTopics is already an excellent solution: it is fast and generates high-quality models with sensible defaults. Plus, I can’t praise its ergonomic API design enough, which can be summarized as “topic modeling for humans”.
You might find tinytopics a viable alternative if you care more about:
- Scale and speed. For extra-large datasets, tinytopics can leverage GPUs to accelerate computation. You can also use PyTorch distributed training to scale across multiple GPUs or machines if single-card VRAM is insufficient.
- Model customization. The constrained neural Poisson NMF in tinytopics is a flexible, differentiable model. You can adapt it by adding new layers, incorporating regularization, or even integrating other data modalities, such as images or videos, for joint modeling (see the sketch after this list for one small example).
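As one small example of the second point, here is a hypothetical tweak to the sketch above; it is my illustration, not a tinytopics feature. Because the softmax already keeps each row on the simplex, an L1 penalty would be a constant, but an entropy penalty on the topic-term rows nudges each topic toward a sparser term distribution.

```python
import torch
import torch.nn.functional as nnf

def regularized_loss(rate, X, F, lam=0.1):
    """Poisson NLL plus an entropy penalty on topic-term rows.

    Lower entropy means each topic concentrates on fewer terms.
    (Hypothetical extension of the earlier sketch, not tinytopics' API.)
    """
    nll = nnf.poisson_nll_loss(rate, X, log_input=False)
    entropy = -(F * torch.log(F + 1e-12)).sum(dim=1).mean()
    return nll + lam * entropy
```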
When not to use tinytopics
tinytopics might not be the best option if you need:
- Theoretical guarantees. Since tinytopics solves the Poisson NMF problem approximately with stochastic gradient methods, it may lack the convergence, consistency, and identifiability guarantees that classical algorithms often provide.
- Minimal parameter tuning. While tinytopics uses modern stochastic gradient optimizers and schedulers, you might still need to adjust hyperparameters to get optimal results, depending on your dataset. This can require some empirical fine-tuning and can be tricky to get right (the sketch after this list shows the kind of knobs involved).
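To illustrate the kind of knobs I mean, in plain PyTorch rather than tinytopics’ own API (which I won’t guess at here): the optimizer family, learning rate, weight decay, and learning-rate schedule all interact, and the best combination is dataset-dependent.

```python
import torch

# Stand-in parameters; in practice these would be the model's factors.
params = [torch.randn(100, 10, requires_grad=True)]

optimizer = torch.optim.AdamW(params, lr=5e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    optimizer.zero_grad()
    loss = params[0].square().mean()  # placeholder loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate over epochs
```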
Examples
I created three vignettes to demonstrate tinytopics’ functionality, result accuracy, and performance on GPU.
Python toolchain that simplified development
I want to thank the creators of the following software for improving the Python package development experience:
- PyTorch. It just works. If you are going to build for GPU, choose PyTorch.
- mkdocs-material. A Markdown-first documentation website framework that, with mkdocs and mkdocstrings, makes package documentation generation efficient and enjoyable.
- Rye. The package and project environment manager I wish I had known earlier! I’m grateful that my friend Simo suggested Rye and Neovim so that I could focus on writing code and be more productive.