Data
tinytopics.data
NumpyDiskDataset
Bases: Dataset
A PyTorch Dataset class for loading document-term matrices from .npy files.
The dataset can be initialized with either a path to a .npy file or a NumPy array. When a file path is provided, the data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory.
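To illustrate the lazy, memory-mapped access described above, here is a minimal sketch using only NumPy. It is not the implementation of NumpyDiskDataset; the helper `load_rows_lazily` and the file name `counts.npy` are hypothetical names chosen for this example.

```python
import os
import tempfile

import numpy as np


def load_rows_lazily(path, row_indices):
    """Hypothetical helper: memory-map a .npy file and copy out selected rows."""
    # mmap_mode="r" opens the file as a read-only memory map instead of
    # reading the whole array into memory.
    mmapped = np.load(path, mmap_mode="r")
    # Indexing with a list of rows pulls only those rows into a regular array.
    rows = np.array(mmapped[list(row_indices)])
    return mmapped.shape, rows


# Build a small document-term matrix (4 documents x 3 terms) and save it.
counts = np.arange(12, dtype=np.float32).reshape(4, 3)
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "counts.npy")  # hypothetical file name
    np.save(npy_path, counts)
    shape, selected = load_rows_lazily(npy_path, [0, 2])
```

Only the requested rows are materialized in memory, which is the property that makes this approach attractive for matrices larger than available RAM.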
num_terms (property)
Return vocabulary size (number of columns).
__init__(data, indices=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | str \| Path \| ndarray | Either a path to a .npy file or a NumPy array. | required |
indices | Sequence[int] \| None | Optional sequence of indices to use as valid indices. | None |
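One plausible reading of the `indices` parameter is that it restricts the dataset to a subset of rows, for example a training split. The sketch below shows that idea in plain Python. The class `IndexedRows` is hypothetical, and the mapping from position to row is an assumption about the behavior, not something this page confirms.

```python
from typing import Optional, Sequence


class IndexedRows:
    """Hypothetical helper: exposes only the rows of `data` listed in `indices`."""

    def __init__(
        self,
        data: Sequence[Sequence[float]],
        indices: Optional[Sequence[int]] = None,
    ):
        self.data = data
        # With no indices given, every row is treated as valid (assumption).
        self.indices = (
            list(indices) if indices is not None else list(range(len(data)))
        )

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, position: int) -> Sequence[float]:
        # Translate a position within the subset to a row of the full matrix.
        return self.data[self.indices[position]]

    @property
    def num_terms(self) -> int:
        # Vocabulary size is the number of columns.
        return len(self.data[0])


matrix = [[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]]
train = IndexedRows(matrix, indices=[0, 2, 3])
```

Here `len(train)` is 3 and `train[1]` is the third row of `matrix`, while `train.num_terms` still reports the full vocabulary size.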
TorchDiskDataset
Bases: Dataset
A PyTorch Dataset class for loading document-term matrices from .pt files.
The dataset can be initialized with either a path to a .pt file or a PyTorch tensor. When a file path is provided, the data is accessed lazily using memory mapping, which is useful for handling large datasets that do not fit entirely in (CPU) memory.
The input .pt file should contain a single tensor with document-term matrix data.
IndexTrackingDataset
Bases: Dataset
Dataset wrapper that tracks indices through shuffling.
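The idea of tracking indices through shuffling can be sketched in plain Python: a wrapper returns each item together with its original index, so results produced in shuffled order can be mapped back afterwards. The class `WithIndex` and its `(item, index)` return format are assumptions for illustration; this page does not specify what IndexTrackingDataset returns.

```python
import random


class WithIndex:
    """Hypothetical wrapper: returns each item together with its original index."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Pair the item with the index it was requested under.
        return self.dataset[idx], idx


docs = ["doc-a", "doc-b", "doc-c", "doc-d"]
wrapped = WithIndex(docs)

# Visit the items in a shuffled order, as a shuffling data loader would.
order = list(range(len(wrapped)))
random.Random(0).shuffle(order)
visited = [wrapped[i] for i in order]

# The recorded indices allow the original order to be restored afterwards.
restored = [item for item, _ in sorted(visited, key=lambda pair: pair[1])]
```

In a PyTorch setting, a wrapper like this would typically subclass `torch.utils.data.Dataset` and be passed to a `DataLoader` with `shuffle=True`.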