Data

`tinytopics.data`

`NumpyDiskDataset`

Bases: Dataset

A PyTorch Dataset class for loading document-term matrices from .npy files.

The dataset can be initialized with either a path to a .npy file or a NumPy array. When a file path is provided, the data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory.
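
To illustrate the lazy, memory-mapped access described above, here is a minimal sketch using plain NumPy. It assumes the class relies on something like `np.load(..., mmap_mode="r")`, which this page does not confirm. The toy matrix and the file name `counts.npy` are made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy document-term matrix: 4 documents x 3 terms (made-up counts).
counts = np.array(
    [[2, 0, 1],
     [0, 3, 0],
     [1, 1, 1],
     [0, 0, 4]],
    dtype=np.float32,
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.npy"
    np.save(path, counts)

    # mmap_mode="r" returns a read-only np.memmap backed by the file,
    # so rows are read from disk only when they are accessed.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    num_docs, num_terms = mapped.shape

    # Copy one row (document) into ordinary in-memory storage.
    second_doc = np.array(mapped[1])

    # Release the mapping before the temporary directory is removed.
    del mapped

print(is_memmap, num_docs, num_terms, second_doc)
```

A file saved this way is the kind of input the `data` argument accepts as a path; passing the in-memory array instead skips the memory-mapping step.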

`num_terms` `property`

Return vocabulary size (number of columns).

`__init__(data, indices=None)`

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `str \| Path \| ndarray` | Either a path to a `.npy` file (`str` or `Path`) or a NumPy array. | required |
| `indices` | `Sequence[int] \| None` | Optional sequence of indices to use as valid indices. | `None` |
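
The role of `indices` can be pictured with a small stand-in class. This is a sketch only: `RowSubset` is a hypothetical name, not part of tinytopics. It assumes that `indices` selects rows (documents) and that `None` means every row is valid.

```python
from __future__ import annotations

from collections.abc import Sequence

import numpy as np


class RowSubset:
    """Hypothetical stand-in, not the tinytopics implementation."""

    def __init__(self, data: np.ndarray, indices: Sequence[int] | None = None) -> None:
        self.data = data
        # Assumption: with indices=None, every row of the matrix is valid.
        self.indices = range(data.shape[0]) if indices is None else indices

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, position: int) -> np.ndarray:
        # Map a position within the subset to a row of the underlying matrix.
        return self.data[self.indices[position]]


counts = np.arange(12).reshape(4, 3)  # 4 documents x 3 terms
train = RowSubset(counts, indices=[0, 2])
everything = RowSubset(counts)
```

A pattern like this lets several subsets (for example, a train and a validation split) share one underlying matrix without copying it.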

`TorchDiskDataset`

Bases: Dataset

A PyTorch Dataset class for loading document-term matrices from .pt files.

The dataset is initialized with a path to a .pt file, which should contain a single tensor holding the document-term matrix. The data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory.

`num_terms` `property`

Return vocabulary size (number of columns).

`__init__(data, indices=None)`

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `str \| Path` | Path to a `.pt` file (`str` or `Path`). | required |
| `indices` | `Sequence[int] \| None` | Optional sequence of indices to use as valid indices. | `None` |

`IndexTrackingDataset`

Bases: Dataset

Dataset wrapper that tracks indices through shuffling.
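
The idea of tracking indices through shuffling can be sketched in plain Python. `WithIndex` is a hypothetical name, and returning `(index, item)` pairs is an assumption about how such a wrapper might behave, not a description of the tinytopics class.

```python
import random


class WithIndex:
    """Hypothetical wrapper, not the tinytopics implementation."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Return the original position together with the item.
        return idx, self.dataset[idx]


docs = ["doc-a", "doc-b", "doc-c", "doc-d"]
tracked = WithIndex(docs)

# Visit items in a shuffled order, as a shuffling data loader would.
order = list(range(len(tracked)))
random.Random(0).shuffle(order)
visited = [tracked[i] for i in order]

# The attached indices make it possible to restore the original order.
restored = [item for _, item in sorted(visited)]
```

Because each item carries its original position, per-document results computed on shuffled batches can be written back to the correct rows.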
