SparkSQL.jl

SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.
Author propelledanalytics
Popularity
1 Star
Updated Last
2 Years Ago
Started In
April 2021

SparkSQL.jl

Purpose

Submits Structured Query Language (SQL), Data Manipulation Language (DML) and Data Definition Language (DDL) statements to Apache Spark. Has functions to move data from Spark into Julia DataFrames and Julia DataFrame data into Spark.

Use Case

Apache Spark is one of the world's most popular open source big data processing engines. Spark supports programming in Java, Scala, Python, SQL, and R. This package enables Julia programs to utilize Apache Spark for structured data processing using SQL.

The design goal of this package is to enable Julia centric programming using Apache Spark with just SQL. There are only 8 functions. No need to use Java, Scala, Python or R. Work with Spark data all from within Julia. The SparkSQL.jl package uses the Dataset APIs internally giving Julia users the performance benefit of Spark's catalyst optimizer and tungsten engine. The earlier Spark RDD API is not supported.

This package is for structured and semi-structured data in Data Lakes, Lakehouses (Delta Lake) on premise and in the cloud.

Available Functions

Use ? in the Julia REPL to see help for each function.

  • initJVM: initializes the Java Virtual Machine (JVM) in Julia.
  • SparkSession: submits application to Apache Spark cluster with config options.
  • sql: function to submit SQL, DDL, and DML statements to Spark.
  • cache: function to cache Spark Dataset into memory.
  • createOrReplaceTempView: creates temporary view that lasts the duration of the session.
  • createGlobalTempView: creates temporary view that lasts the duration of the application.
  • toJuliaDF: move Spark data into a Julia DataFrame.
  • toSparkDS: move Julia DataFrame data to a Spark Dataset.

Quick Start

Install and Setup

Download Apache Spark 3.1.1 or later and set the environmental variables for Spark and Java home:

export SPARK_HOME=/path/to/apache/spark
export JAVA_HOME=/path/to/java

Usage

Start Julia with "JULIA_COPY_STACKS=yes" required for JVM interop:

JULIA_COPY_STACKS=yes julia

In Julia include the DataFrames package. Also include the Dates and Decimals packages if your Spark data contains dates or decimal numbers.

using SparkSQL, DataFrames, Dates, Decimals

Initialize the JVM and start the Spark Session:

initJVM()
sparkSession = SparkSession("spark://example.com:7077", "Julia SparkSQL Example App")

Query data from Spark and load it into a Julia Dataset.

stmt = sql(sparkSession, "SELECT _c0 AS columnName1, _c1 AS columnName2 FROM CSV.`/pathToFile/fileName.csv`")
createOrReplaceTempView(stmt, "TempViewName")
sqlQuery = sql(sparkSession, "SELECT columnName1, columnName2 FROM TempViewName;")
juliaDataFrame = toJuliaDF(sqlQuery)
describe(juliaDataFrame)

Move Julia DataFrame data into an Apache Spark Dataset.

sparkDataset = toSparkDS(sparkSession, juliaDataFrame,",")
createOrReplaceTempView(sparkDataset, "tempTable")

The Dataset is a delimited string. To generate columns use the SparkSQL "split" function.

sqlQuery = sql(sparkSession, "Select split(value, ',' )[0] AS columnName1, split(value, ',' )[1] AS columnName2 from tempTable")

Spark Data Sources

Supported data-sources include:

  • File formats including: CSV, JSON, arrow, parquet
  • Data Lakes including: Hive, ORC, Avro
  • Data Lake Houses: Delta Lake, Apache Iceberg.
  • Cloud Object Stores: S3, Azure Blob Storage, Swift Object.

Data Source Examples

CSV file example:

Comma Separated Value (CSV) format.

stmt = sql(session, "SELECT * FROM CSV.`/pathToFile/fileName.csv`;")

Parquet file example:

Apache Parquet format.

stmt = sql(session, "SELECT * FROM PARQUET.`/pathToFile/fileName.parquet`;")

Delta Lake Example:

Delta Lake is an open-source storage layer for Spark. Delta Lake offers:

  • ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
  • Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files.

Example shows create table (DDL) statements using Delta Lake and SparkSQL:

sql(session, "CREATE DATABASE demo;")
sql(session, "USE demo;")
sql(session, "CREATE TABLE tb(col STRING) USING DELTA;" )

The Delta Lake feature requires adding the Delta Lake jar to the Spark jars folder.

Used By Packages

No packages found.