A parallel iterator for large machine learning datasets that don't fit into memory, inspired by PyTorch's `DataLoader` class.
Author: lorenzoh · 41 stars · Last updated: 1 year ago · Started: March 2020


Documentation (latest)

A threaded data iterator for machine learning on out-of-memory datasets. Inspired by PyTorch's DataLoader.

It uses worker threads to load data in parallel while keeping the primary thread free. It can also load data in-place to avoid allocations.
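As a minimal sketch of in-place loading, the buffer interface from MLDataPattern.jl (`getobs!`) reuses a preallocated buffer instead of allocating a fresh array per observation:

```julia
using MLDataPattern: getobs!

x = rand(128, 10000)   # 10000 observations of size 128
buf = zeros(128)       # preallocated buffer for one observation

# Fills `buf` with the first observation without allocating a new array.
getobs!(buf, x, 1)
```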

Many data containers work out of the box and it is easy to extend with your own.

DataLoaders is built on top of and fully compatible with MLDataPattern.jl's Data Access Pattern, a functional interface for machine learning datasets.
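For instance, a tuple of arrays is already a valid data container under this interface; `nobs` gives the number of observations and `getobs` retrieves one (a sketch using MLDataPattern.jl's exported functions):

```julia
using MLDataPattern: nobs, getobs

x = rand(128, 10000)
y = rand(1, 10000)

nobs((x, y))             # 10000 observations
xi, yi = getobs((x, y), 1)  # first observation: a length-128 vector and a length-1 vector
```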


```julia
using DataLoaders

x = rand(128, 10000)  # 10000 observations of size 128
y = rand(1, 10000)

dataloader = DataLoader((x, y), 16)

for (xs, ys) in dataloader
    @assert size(xs) == (128, 16)
    @assert size(ys) == (1, 16)
end
```

Of course, in the above example the dataset fits in memory, so parallel workers are unnecessary. See Custom data containers for a more realistic example.
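A custom out-of-memory container only needs to implement the data access interface. The sketch below defines a lazy image-folder dataset; `loadimage` is a hypothetical placeholder for whatever decoding function you use, and the `LearnBase` import assumes the interface functions live there, as in MLDataPattern.jl's ecosystem:

```julia
import LearnBase: nobs, getobs

# Hypothetical container: holds only file paths, loads images on demand.
struct ImageDataset
    files::Vector{String}
end

nobs(ds::ImageDataset) = length(ds.files)
getobs(ds::ImageDataset, i::Int) = loadimage(ds.files[i])  # reads from disk lazily
```

With this in place, `DataLoader(ImageDataset(files), 16)` can batch and load images on worker threads while the primary thread consumes the batches.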

Getting Started

If you already know the concept from PyTorch, see Quickstart for PyTorch users.

Otherwise, read on here.

Available methods are documented here.

