Package for accessing UCI Machine Learning Repository datasets in a common format
Author: JackDunnNZ
Stars: 29
Last updated: 1 year ago
Started: January 2015


This package provides access to UCI Machine Learning Repository datasets (and some from other sources) from within Julia. The UCI ML repository is a useful source of machine learning datasets for testing and benchmarking, but its datasets do not share a consistent format, so each new dataset has to be read differently before it can be used.

Instead, the aim is to convert the datasets into a common format (CSV), where each line is as follows:


The attribute header names start with C or N, indicating categoric or numeric variables.
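
The prefix convention above can be used to split columns by type. The following is a minimal sketch in plain Julia, not code from the package: the header names and the `column_kind` helper are hypothetical and chosen only for illustration, and any name that starts with neither C nor N is labelled `:other`.

```julia
# Hypothetical header names, invented for illustration only.
header = ["C1", "N2", "N3", "C4"]

# Classify a column by the first character of its header name
# (hypothetical helper, not part of UCIData).
function column_kind(name::AbstractString)
    if startswith(name, "C")
        return :categoric
    elseif startswith(name, "N")
        return :numeric
    else
        return :other   # anything that does not follow the C/N convention
    end
end

# Split the header into categoric and numeric column names.
categoric_cols = filter(n -> column_kind(n) == :categoric, header)
numeric_cols   = filter(n -> column_kind(n) == :numeric, header)
```

A consumer of the CSV files could use such a split to decide which columns to treat as categories and which to parse as numbers.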

These datasets can be loaded as DataFrames in Julia using the following, with categoric columns pooled into the PooledDataArray type (here we load the "iris" dataset):

using UCIData

You can get a list of dataset types with


and then a list of the available datasets for a given type with


The datasets are not checked in to git, both to minimise the size of the repository and to avoid rehosting the data. Instead, the script downloads any missing datasets directly from UCI as it runs, using DataDeps.jl.
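
For readers unfamiliar with DataDeps.jl, a download-on-demand dependency is typically declared roughly as follows. This is only a sketch of the general DataDeps.jl pattern and not the registration code this package actually uses: the dependency name is hypothetical, and the URL is an assumed location for the iris data that may have changed.

```julia
using DataDeps

# Hypothetical registration, for illustration only.
register(DataDep(
    "uci-iris-example",  # hypothetical dependency name
    "Iris data from the UCI Machine Learning Repository.",  # message shown before download
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",  # assumed URL
))

# Nothing is downloaded at registration time. The file is fetched the first
# time the dependency is resolved, e.g. via the string macro
# datadep"uci-iris-example", which returns the local directory path.
```

Because no checksum is given here, DataDeps.jl warns on first download; a real registration would normally pin one.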


Please feel free to add new datasets via pull request!

