Data

tinytopics.data

NumpyDiskDataset

Bases: Dataset

A PyTorch Dataset class for loading document-term matrices from .npy files.

The dataset can be initialized with either a path to a .npy file or a NumPy array. When a file path is provided, the data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory.
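
The lazy, memory-mapped access described above can be illustrated with NumPy alone. The following is a minimal sketch of the underlying mechanism, not the class's actual implementation; the file name `dtm.npy` and the toy matrix are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Write a small document-term matrix (4 documents x 3 terms) to a .npy file.
# The file name and contents are illustrative only.
counts = np.arange(12, dtype=np.float32).reshape(4, 3)
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "dtm.npy")
np.save(npy_path, counts)

# mmap_mode="r" returns a read-only np.memmap: rows are read from disk
# when accessed instead of loading the whole array up front.
lazy = np.load(npy_path, mmap_mode="r")

num_docs, num_terms = lazy.shape  # shape is known without reading the data
first_row = np.array(lazy[0])     # copy a single row into ordinary memory
```

Only the rows that are actually indexed get paged in, which is why this pattern suits matrices larger than available RAM.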

num_terms property

Return vocabulary size (number of columns).

__init__(data, indices=None)

Parameters:

Name     Type                    Description                                                   Default
data     str | Path | ndarray    Either a path to a .npy file (str or Path) or a NumPy array.  required
indices  Sequence[int] | None    Optional sequence of indices to use as valid indices.         None
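
One plausible reading of the `indices` parameter is that it restricts the dataset to a subset of rows, for example a training split. The sketch below shows that idea with a hypothetical `RowSubset` helper; the class name and its behavior are assumptions, not the library's code.

```python
import numpy as np


class RowSubset:
    """Hypothetical helper: expose only the selected rows of a matrix."""

    def __init__(self, data, indices=None):
        self.data = data
        # With no indices given, every row is treated as valid.
        self.indices = list(range(len(data))) if indices is None else list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Map the position within the subset back to a row of the full matrix.
        return self.data[self.indices[i]]


counts = np.arange(12).reshape(4, 3)
subset = RowSubset(counts, indices=[3, 1])  # keep rows 3 and 1, in that order
```

Under this reading, `len(subset)` reflects the number of selected rows rather than the size of the underlying matrix.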

TorchDiskDataset

Bases: Dataset

A PyTorch Dataset class for loading document-term matrices from .pt files.

The dataset can be initialized with either a path to a .pt file or a PyTorch tensor. When a file path is provided, the data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory. The input .pt file should contain a single tensor with document-term matrix data.

num_terms property

Return vocabulary size (number of columns).

__init__(data, indices=None)

Parameters:

Name     Type                  Description                                             Default
data     str | Path            Path to a .pt file (str or Path).                       required
indices  Sequence[int] | None  Optional sequence of indices to use as valid indices.   None

IndexTrackingDataset

Bases: Dataset

A Dataset wrapper that tracks indices through shuffling.
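
A common way to track indices through shuffling is to return each item together with its original position. The pure-Python sketch below illustrates that pattern; `IndexTrackingSketch` is a hypothetical stand-in, and the actual return format of `IndexTrackingDataset` is not stated here.

```python
import random


class IndexTrackingSketch:
    """Hypothetical wrapper: return each item together with its original index."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Pairing the item with idx lets downstream code recover the
        # original position even after the access order is shuffled.
        return self.items[idx], idx


docs = ["doc-a", "doc-b", "doc-c", "doc-d"]
wrapped = IndexTrackingSketch(docs)

# Simulate a shuffling sampler by visiting positions in random order.
order = list(range(len(wrapped)))
random.Random(0).shuffle(order)
batch = [wrapped[i] for i in order]
```

Each element of `batch` carries its source index, so per-document results can be written back to the right row regardless of the visiting order.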