Text data topic modeling

Tip

Prerequisite: run example-text.R to get the count data and the model fitted with fastTopics for comparison.

To run the code from this article as a Python script:

python3 examples/example-text.py

We show a minimal example of topic modeling on text data using tinytopics. The NIPS dataset contains a count matrix of 2,483 research papers (rows) by 14,036 terms (columns). More details about the dataset can be found in this GitHub repo.
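To make the shape of the input concrete, here is a minimal sketch of how a document-term count matrix is laid out, with one row per document and one column per term. The toy documents and the `build_count_matrix` helper are made up for illustration; they are not part of tinytopics or the NIPS dataset.

```python
from collections import Counter

import numpy as np


def build_count_matrix(docs):
    """Hypothetical helper: build a (documents x terms) count matrix."""
    # Fixed, sorted vocabulary so that every column maps to exactly one term.
    vocab = sorted({word for doc in docs for word in doc.split()})
    col = {term: j for j, term in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for i, doc in enumerate(docs):
        # Count how often each term occurs in this document.
        for term, n in Counter(doc.split()).items():
            X[i, col[term]] = n
    return X, vocab


toy_docs = ["neural network network", "bayesian inference", "neural inference"]
X_toy, vocab = build_count_matrix(toy_docs)
print(vocab)
print(X_toy)
```

Each entry `X_toy[i, j]` is the number of times term `vocab[j]` occurs in document `i`, which is the same layout as the count matrix used in this article.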

Import tinytopics

import numpy as np
import pandas as pd
import torch
from pyreadr import read_r
from tinytopics.fit import fit_model
from tinytopics.plot import plot_loss, plot_structure, plot_top_terms
from tinytopics.utils import (
    set_random_seed,
    align_topics,
    sort_documents,
)

Read count data

def read_rds_numpy(file_path):
    # read_r returns a dict-like object; an RDS file holds a single entry
    X0 = read_r(file_path)
    X = X0[list(X0.keys())[0]]
    return X.to_numpy()


def read_rds_torch(file_path):
    X = read_rds_numpy(file_path)
    return torch.from_numpy(X)


X = read_rds_torch("counts.rds")

with open("terms.txt", "r") as file:
    terms = [line.strip() for line in file]

Fit topic model

set_random_seed(42)

k = 10
model, losses = fit_model(X, k)
plot_loss(losses, output_file="loss.png")

Post-process results

We extract the normalized L (document-topic) and F (topic-term) matrices from the tinytopics model, load the L and F matrices fitted by fastTopics, and compare the two. For easier visual comparison, we align the topics fitted by tinytopics with those from fastTopics and sort the documents so that they are grouped by dominant topic.

L_tt = model.get_normalized_L().numpy()
F_tt = model.get_normalized_F().numpy()

L_ft = read_rds_numpy("L_fastTopics.rds")
F_ft = read_rds_numpy("F_fastTopics.rds")

aligned_indices = align_topics(F_ft, F_tt)
F_aligned_tt = F_tt[aligned_indices]
L_aligned_tt = L_tt[:, aligned_indices]

sorted_indices_ft = sort_documents(L_ft)
L_sorted_ft = L_ft[sorted_indices_ft]
sorted_indices_tt = sort_documents(L_aligned_tt)
L_sorted_tt = L_aligned_tt[sorted_indices_tt]
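The alignment and sorting steps above can be sketched as follows. This is a minimal illustration, not the tinytopics implementation: `greedy_align` and `sort_by_dominant_topic` are hypothetical helpers, the matching is a simple greedy pairing on cosine similarity (an optimal assignment method could be swapped in), and the sketch assumes both F matrices store the same number of topics as rows.

```python
import numpy as np


def greedy_align(F_ref, F_new):
    """For each reference topic, return the index of its match in F_new."""
    # Scale each topic (row) to unit length so dot products are cosine similarities.
    A = F_ref / np.linalg.norm(F_ref, axis=1, keepdims=True)
    B = F_new / np.linalg.norm(F_new, axis=1, keepdims=True)
    sim = A @ B.T  # sim[i, j]: similarity of reference topic i and new topic j
    k = sim.shape[0]
    order = np.full(k, -1)
    for _ in range(k):
        # Pick the most similar pair that is still unmatched.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        order[i] = j
        sim[i, :] = -np.inf  # reference topic i is taken
        sim[:, j] = -np.inf  # new topic j is taken
    return order


def sort_by_dominant_topic(L):
    """Group documents by dominant topic, strongest membership first."""
    dominant = L.argmax(axis=1)
    strength = L.max(axis=1)
    # lexsort uses the last key as the primary key: topic, then descending strength.
    return np.lexsort((-strength, dominant))


# Toy check: shuffle the topics of a reference matrix and recover the order.
F_ref = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
F_new = F_ref[[2, 0, 1]]
order = greedy_align(F_ref, F_new)
F_new_aligned = F_new[order]

L_toy = np.array([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
doc_order = sort_by_dominant_topic(L_toy)
```

Because topics are rows of F and columns of L, the same index vector reorders both: `F[order]` and `L[:, order]`, as in the code above.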

Visualize results

Use Structure plots to compare the document-topic distributions from the two models:

plot_structure(
    L_sorted_ft,
    title="fastTopics document-topic distributions (sorted)",
    output_file="L-fastTopics.png",
)

plot_structure(
    L_sorted_tt,
    title="tinytopics document-topic distributions (sorted and aligned)",
    output_file="L-tinytopics.png",
)
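For readers unfamiliar with Structure plots, the idea can be sketched as a stacked bar chart with one bar per document, split by topic proportions. The helpers below are hypothetical and only approximate what a Structure plot shows; they are not the `plot_structure` implementation, and the drawing function assumes matplotlib is installed.

```python
import numpy as np


def stacked_segments(L):
    """Return the bottom edge of every topic segment in a stacked bar chart."""
    # Segment j of a document starts where segments 0..j-1 end.
    return np.cumsum(L, axis=1) - L


def draw_structure(L, output_file):
    """Hypothetical helper: draw one stacked bar per document."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    bottoms = stacked_segments(L)
    x = np.arange(L.shape[0])
    fig, ax = plt.subplots(figsize=(8, 3))
    for j in range(L.shape[1]):
        ax.bar(x, L[:, j], bottom=bottoms[:, j], width=1.0, label=f"Topic {j}")
    ax.set_xlabel("Documents (sorted)")
    ax.set_ylabel("Topic proportion")
    ax.legend()
    fig.savefig(output_file, dpi=150)
    plt.close(fig)


# Toy document-topic matrix: each row sums to 1.
L_toy = np.array([[0.9, 0.1, 0.0], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
bottoms_toy = stacked_segments(L_toy)
```

Sorting documents by dominant topic before plotting is what turns the bars into contiguous colored blocks, which makes two fitted models easier to compare by eye.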

Plot the probabilities of the top 15 terms in each topic from both models to inspect their concordance:

plot_top_terms(
    F_ft,
    n_top_terms=15,
    term_names=terms,
    title="fastTopics top terms per topic",
    output_file="F-top-terms-fastTopics.png",
)

plot_top_terms(
    F_aligned_tt,
    n_top_terms=15,
    term_names=terms,
    title="tinytopics top terms per topic (aligned)",
    output_file="F-top-terms-tinytopics.png",
)
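To inspect the same information as the top-terms plots in text form, the most probable terms per topic can be pulled directly from a topic-term matrix. This is a minimal sketch: `top_terms` is a hypothetical helper, not a tinytopics function, and it assumes topics are stored as rows of F with one column per term.

```python
import numpy as np


def top_terms(F, term_names, n_top_terms=15):
    """Hypothetical helper: list the most probable terms of each topic."""
    result = []
    for topic in F:
        # argsort is ascending, so take the last n indices and reverse them.
        idx = np.argsort(topic)[-n_top_terms:][::-1]
        result.append([(term_names[i], float(topic[i])) for i in idx])
    return result


# Toy topic-term matrix with two topics over four terms.
F_toy = np.array([[0.5, 0.3, 0.15, 0.05], [0.05, 0.1, 0.25, 0.6]])
terms_toy = ["network", "neural", "prior", "bayesian"]
for topic_id, pairs in enumerate(top_terms(F_toy, terms_toy, n_top_terms=2)):
    print(topic_id, pairs)
```

With the matrices from this article, the call would look like `top_terms(F_aligned_tt, terms)`; comparing its output with `top_terms(F_ft, terms)` topic by topic gives a quick textual check of concordance.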