Socrata.jl is a Julia wrapper for accessing the Socrata Open Data API (http://dev.socrata.com) and importing data into a DataFrame. Socrata is an open data platform used by many local and State governments as well as by the Federal Government.
Here are just a few examples of Socrata datasets/repositories:
- Socrata's Open Data Site
- HealthCare.gov Health Plans
- Centers for Medicare & Medicaid Services
- New York City OpenData
More Open Data Resources can be found here.
Pkg.clone("https://github.com/dreww2/Socrata.jl.git")
The Socrata.jl API consists of a single function, socrata, which at a minimum takes a Socrata url and returns a DataFrame:
julia> using Socrata
julia> df = socrata("http://soda.demo.socrata.com/resource/4334-bgaj")
100x9 DataFrame
|-------|--------------------|------------|---------|
| Col # | Name | Eltype | Missing |
| 1 | Source | UTF8String | 0 |
| 2 | Earthquake_ID | UTF8String | 0 |
| 3 | Version | UTF8String | 0 |
| 4 | Datetime | UTF8String | 0 |
| 5 | Magnitude | Float64 | 0 |
| 6 | Depth | Float64 | 0 |
| 7 | Number_of_Stations | Int64 | 0 |
| 8 | Region | UTF8String | 0 |
| 9 | Location | UTF8String | 0 |
The url may be a Socrata API Endpoint or the URL from the address bar (in which case Socrata.jl will automatically attempt to parse the string into a usable format). For example, the following are all valid urls for the same dataset:
- http://soda.demo.socrata.com/resource/4334-bgaj
- http://soda.demo.socrata.com/resource/4334-bgaj.json
- http://soda.demo.socrata.com/resource/4334-bgaj.csv
- https://soda.demo.socrata.com/dataset/USGS-Earthquakes-for-2012-11-01-API-School-Demo/4334-bgaj
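How Socrata.jl parses these strings internally is not described here. As a rough sketch, assuming that only the host and the dataset identifier at the end of the path (4334-bgaj above) matter, all four forms could be reduced to the same endpoint as follows. The name normalize_socrata_url is hypothetical and is not part of the package.

```julia
# Hypothetical helper, not part of Socrata.jl: reduce any of the URL forms
# listed above to the bare resource endpoint.
function normalize_socrata_url(url::AbstractString)
    # Scheme and host, e.g. "http://soda.demo.socrata.com"
    host = match(r"^(https?://[^/]+)/", url)
    # Dataset identifier at the end of the path, with an optional
    # .json/.csv extension and an optional trailing slash
    id = match(r"/([a-z0-9]{4}-[a-z0-9]{4})(?:\.(?:json|csv))?/?$", url)
    if host === nothing || id === nothing
        throw(ArgumentError("not a recognizable Socrata dataset URL: $url"))
    end
    return string(host.captures[1], "/resource/", id.captures[1])
end

normalize_socrata_url("http://soda.demo.socrata.com/resource/4334-bgaj.csv")
# "http://soda.demo.socrata.com/resource/4334-bgaj"
```

The scheme is preserved, so the https address-bar form maps to an https endpoint.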
There are several optional keyword string arguments:
- app_token is your Socrata application token, which allows for more API requests per unit of time.
- limit is the number of rows you would like to retrieve. The default is 100 and the maximum is 1,000 (Socrata's limit). If you want to download a large dataset, set fulldataset=true (see below).
- offset indicates the first row from which to start pulling data.
- fulldataset ignores all query parameters, including limit, offset, and any of the Socrata Query Language (SoQL) arguments, and downloads the entire dataset.
- usefieldids is not yet implemented, but will substitute the default human-readable column headers with API field IDs.
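Because a single request returns at most 1,000 rows, one way to fetch a larger slice by hand is to issue several requests with increasing offset values. The sketch below only computes the limit/offset pairs. page_params is a hypothetical helper, and it is not claimed that fulldataset=true works this way internally.

```julia
# Hypothetical helper: split a request for `nrows` rows into pages of at
# most `page_size` rows (1,000 is the per-request maximum noted above).
function page_params(nrows::Integer; page_size::Integer=1000)
    nrows >= 0 || throw(ArgumentError("nrows must be non-negative"))
    page_size > 0 || throw(ArgumentError("page_size must be positive"))
    pages = Tuple{String,String}[]
    for offset in 0:page_size:(nrows - 1)
        limit = min(page_size, nrows - offset)
        # socrata takes limit and offset as strings, so convert here
        push!(pages, (string(limit), string(offset)))
    end
    return pages
end

page_params(2500)
# ("1000", "0"), ("1000", "1000"), ("500", "2000")
```

Each pair could then be passed along as socrata(url, limit=l, offset=o), with the resulting DataFrames combined afterwards.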
Socrata.jl supports SoQL queries using the following arguments:
- select
- where
- order
- group
- q
- limit and offset, as described above
Note that any references to columns inside these arguments must reference the dataset's API Field ID, which can be found on any Socrata dataset page under Export => SODA API => Column IDs.
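For orientation, the SODA API itself receives these arguments as URL query parameters prefixed with a dollar sign ($select, $where, and so on). The sketch below shows what such a query string looks like. Both percent_encode and soql_query are hypothetical helpers written for illustration, and it is an assumption that Socrata.jl builds its requests in a similar way.

```julia
# Percent-encode everything except RFC 3986 unreserved characters.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(String(s))
        c = Char(b)
        if ('A' <= c <= 'Z') || ('a' <= c <= 'z') || ('0' <= c <= '9') ||
           c in ('-', '_', '.', '~')
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: turn name => value pairs into a SODA query string.
function soql_query(params)
    parts = String[]
    for (name, value) in params
        push!(parts, string('$', name, '=', percent_encode(value)))
    end
    return join(parts, "&")
end

soql_query(["where" => "magnitude > 5.5 AND depth < 30", "limit" => "10"])
# $where=magnitude%20%3E%205.5%20AND%20depth%20%3C%2030&$limit=10
```

The column names inside the where clause (magnitude, depth) are the API Field IDs, not the human-readable headers shown in the DataFrame.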
using Socrata
url = "http://soda.demo.socrata.com/resource/4334-bgaj"
token = "your_app_token_goes_here"
A basic query, getting the first 5 rows:
df = socrata(url, app_token=token, limit="5")
Get rows 6-10 of the data (offset skips the first 5 rows):
df = socrata(url, app_token=token, limit="5", offset="5")
Get only the first 10 rows and the Source, Earthquake_ID, Magnitude, and Region columns:
df = socrata(url, app_token=token, limit="10", select="source, earthquake_id, magnitude, region")
You can add multiple conditions within a single argument. For example, get only rows where magnitude is greater than 5.5 and depth is less than 30:
df = socrata(url, app_token=token, where="magnitude > 5.5 AND depth < 30")
Search for Hawaii in the dataset where Magnitude > 2 and only select certain columns:
df = socrata(url, app_token=token, q="hawaii", where="magnitude > 2", select="datetime, magnitude, region, location")
- Add support for automatically getting API Field IDs
- Implement better app_token system
- Add support for JSON and XML