Text data topic modeling

Tip

Prerequisite: run text.R to get the count data and the model fitted with fastTopics for comparison:

Rscript docs/articles/static/example-text/text.R

To run the code from this article as a Python script:

python3 examples/text.py

This article shows a minimal example of topic modeling on text data with tinytopics. The NIPS dataset contains a count matrix of 2483 research papers by 14036 terms. More details about the dataset can be found in this GitHub repo.

Import tinytopics

import torch
import numpy as np
import pandas as pd
import tinytopics as tt
from pyreadr import read_r

Read count data

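def read_rds_numpy(file_path):
    # read_r returns a dict-like object of data frames;
    # an RDS file holds a single object, so take the first entry.
    X0 = read_r(file_path)
    X = X0[list(X0.keys())[0]]
    return X.to_numpy()


def read_rds_torch(file_path):
    # Convert the NumPy array to a torch tensor for model fitting.
    X = read_rds_numpy(file_path)
    return torch.from_numpy(X)


X = read_rds_torch("counts.rds")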

with open("terms.txt", "r") as file:
    terms = [line.strip() for line in file]
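
Before fitting, it can help to confirm that the count matrix and the vocabulary agree. The following is a minimal sketch on a tiny synthetic matrix and is not part of tinytopics: the helper name `check_counts` is hypothetical, and it assumes that documents are rows and terms are columns.

```python
import numpy as np


def check_counts(X, terms):
    """Validate a document-term count matrix against its vocabulary.

    Hypothetical helper for illustration; assumes rows are documents
    and columns are terms.
    """
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("count matrix must be 2-dimensional")
    if X.shape[1] != len(terms):
        raise ValueError(
            f"matrix has {X.shape[1]} columns but vocabulary has {len(terms)} terms"
        )
    if (X < 0).any():
        raise ValueError("counts must be non-negative")
    return X.shape


# Tiny synthetic stand-in for the real data: 3 documents, 4 terms.
toy_counts = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 1, 0, 0]])
toy_terms = ["network", "learning", "model", "data"]
n_docs, n_terms = check_counts(toy_counts, toy_terms)
print(n_docs, n_terms)
```

With the real data, the same check could be applied to `X.numpy()` and `terms` after loading them.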

Fit topic model

tt.set_random_seed(42)

k = 10
model, losses = tt.fit_model(X, k)
tt.plot_loss(losses, output_file="loss.png")
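
To make the shapes concrete, here is a small sketch of the factorization a topic model of this kind produces. It uses random toy factors rather than a fitted model, and it assumes that L is a documents-by-topics matrix and F is a topics-by-terms matrix whose rows each sum to one after normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 6, 8, 3  # documents, terms, topics (toy sizes)

# Random non-negative factors, normalized so that each row sums to one.
L = rng.random((n, k))
L /= L.sum(axis=1, keepdims=True)  # document-topic proportions, n x k
F = rng.random((k, m))
F /= F.sum(axis=1, keepdims=True)  # topic-term probabilities, k x m

# Each row of L @ F is a mixture of topic-term distributions,
# so it is itself a probability distribution over terms.
P = L @ F
print(P.shape)
print(np.allclose(P.sum(axis=1), 1.0))
```

Under these assumptions, each document's term distribution is a weighted average of the k topic distributions, with the weights given by the corresponding row of L.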

Post-process results

We extract the normalized L and F matrices from the tinytopics model, load the L and F matrices fitted by fastTopics, and then compare the two. For easier visual comparison, we “align” the topics fitted by tinytopics with those from fastTopics, and sort the documents so that they are grouped by dominant topic.

L_tt = model.get_normalized_L().numpy()
F_tt = model.get_normalized_F().numpy()

L_ft = read_rds_numpy("L_fastTopics.rds")
F_ft = read_rds_numpy("F_fastTopics.rds")

aligned_indices = tt.align_topics(F_ft, F_tt)
F_aligned_tt = F_tt[aligned_indices]
L_aligned_tt = L_tt[:, aligned_indices]

sorted_indices_ft = tt.sort_documents(L_ft)
L_sorted_ft = L_ft[sorted_indices_ft]
sorted_indices_tt = tt.sort_documents(L_aligned_tt)
L_sorted_tt = L_aligned_tt[sorted_indices_tt]
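
The alignment and sorting steps above can be illustrated with a standalone sketch. This is not the tinytopics implementation, which may differ: the function names `align_topics_sketch` and `sort_documents_sketch` are hypothetical, and the sketch assumes topics are matched by cosine similarity with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`), and that documents are ordered by dominant topic and then by that topic's weight.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_topics_sketch(F_ref, F_other):
    """Return row indices of F_other that best match the rows of F_ref.

    Illustrative only: cosine similarity plus the Hungarian algorithm.
    """
    ref = F_ref / np.linalg.norm(F_ref, axis=1, keepdims=True)
    oth = F_other / np.linalg.norm(F_other, axis=1, keepdims=True)
    similarity = ref @ oth.T  # k x k cosine similarities
    # linear_sum_assignment minimizes cost, so negate the similarity.
    _, col_ind = linear_sum_assignment(-similarity)
    return col_ind


def sort_documents_sketch(L):
    """Order documents by dominant topic, then by descending topic weight."""
    dominant = L.argmax(axis=1)
    weight = L.max(axis=1)
    # np.lexsort treats the last key as the primary sort key.
    return np.lexsort((-weight, dominant))


F_ref = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
F_perm = F_ref[[2, 0, 1]]  # the same topics in a different order
idx = align_topics_sketch(F_ref, F_perm)
print(idx)  # indexing F_perm with idx recovers the order of F_ref

L_toy = np.array([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
order = sort_documents_sketch(L_toy)
print(order)
```

As in the article's code, the returned indices are applied to the rows of the topic-term matrix and to the columns of the document-topic matrix, so that both stay consistent.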

Visualize results

Use a Structure plot to check the document-topic distributions:

tt.plot_structure(
    L_sorted_ft,
    title="fastTopics document-topic distributions (sorted)",
    output_file="L-fastTopics.png",
)

tt.plot_structure(
    L_sorted_tt,
    title="tinytopics document-topic distributions (sorted and aligned)",
    output_file="L-tinytopics.png",
)

Plot the probabilities of the top 15 terms in each topic from both models to inspect their concordance:

tt.plot_top_terms(
    F_ft,
    n_top_terms=15,
    term_names=terms,
    title="fastTopics top terms per topic",
    output_file="F-top-terms-fastTopics.png",
)

tt.plot_top_terms(
    F_aligned_tt,
    n_top_terms=15,
    term_names=terms,
    title="tinytopics top terms per topic (aligned)",
    output_file="F-top-terms-tinytopics.png",
)
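
To inspect the top terms as text rather than as a plot, the ranking can be reproduced with a few lines of NumPy. The sketch below is an assumption about what "top terms" means here (the highest-probability terms in each row of a topics-by-terms matrix); the helper `top_terms` is hypothetical and runs on toy data.

```python
import numpy as np


def top_terms(F, terms, n_top_terms=3):
    """Return, per topic, the highest-probability terms with their probabilities."""
    result = []
    for row in F:
        order = np.argsort(row)[::-1][:n_top_terms]  # indices, largest first
        result.append([(terms[i], float(row[i])) for i in order])
    return result


# Toy topic-term matrix: 2 topics over a 4-term vocabulary.
F_toy = np.array([[0.5, 0.3, 0.15, 0.05], [0.05, 0.1, 0.25, 0.6]])
toy_terms = ["network", "learning", "model", "data"]
ranked = top_terms(F_toy, toy_terms, n_top_terms=2)
for topic, pairs in enumerate(ranked):
    print(topic, pairs)
```

Applied to `F_ft` and `F_aligned_tt` with `terms`, the same helper would list the terms behind each panel of the plots, which makes it easier to compare the two models topic by topic.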