Lingual executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.
Immediate Data Access
Using the Lingual Shell or JDBC Driver with existing Desktop and BI tools, unlock data stored on an Apache Hadoop cluster.
Simplify SQL Migration
Using the Lingual JDBC Driver or native Cascading APIs, simply migrate existing SQL code to Apache Hadoop.
Getting Started
Lingual is currently under active development. Builds are continuously published for use (see below), or binaries can be built directly from the git repository.
To get started see the project README, and review the Lingual client documentation. JavaDoc is coming soon.
Install:
curl http://files.concurrentinc.com/lingual/1.0/lingual-client/install-lingual-client.sh | bash
or:
Announcements and Support
To follow updates about Lingual or to get help, please sign up for our email list on Google Groups.
To register issues or feature requests, use the Lingual Issues app.
About
Lingual was created in collaboration by the developers of Cascading and Optiq, and relies on the robustness of both. Cascading is the de facto Java API for creating complex data processing workloads and the engine underneath Scalding, Cascalog, and others. Optiq is a SQL parser and optimizer written by the author of Mondrian.
Using the SQL Command Shell
The following example is based on the MySQL Sample Employee Database:
$ mkdir -p ~/src/lingual $ cd ~/src/lingual $ curl http://data.cascading.org/employees.tgz | tar xvz
That creates a employees subdirectory for the table data, which has several CSV files. Next, load the schema for these into Lingual:
$ curl http://data.cascading.org/create-employees.sh > create-employees.sh $ chmod +x ./create-employees.sh $ ./create-employees.sh local
That loads data + schema, so let’s try the command line SQL interpreter:
$ lingual shell 0: jdbc:lingual:local> !tables
That lists the available tables: EMPLOYEE, TITLES, SALARIES.
Next, let’s try a simple query:
0: jdbc:lingual:local> SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina';
There are you go — listings for a whole bunch of people named Gina.
Using the JDBC Driver
The Java source code used to execute a query via a JDBC connection is much the same as with any other JDBC driver:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
public class JdbcExample
{
public static void main( String[] args ) throws Exception
{
new JdbcExample().run();
}
public void run() throws ClassNotFoundException, SQLException
{
Class.forName( "cascading.lingual.jdbc.Driver" );
Connection connection = DriverManager.getConnection(
"jdbc:lingual:local;schemas=src/main/resources/data/example" );
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
"select *\n"
+ "from \"example\".\"sales_fact_1997\" as s\n"
+ "join \"example\".\"employee\" as e\n"
+ "on e.\"EMPID\" = s.\"CUST_ID\"" );
while( resultSet.next() )
{
int n = resultSet.getMetaData().getColumnCount();
StringBuilder builder = new StringBuilder();
for( int i = 1; i <= n; i++ )
{
builder.append(( i > 1 ? "; " : "" )
+ resultSet.getMetaData().getColumnLabel( i )
+ "="
+ resultSet.getObject( i ) );
}
System.out.println( builder );
}
resultSet.close();
statement.close();
connection.close();
}
}
Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
To build and run:
$ gradle clean jar $ hadoop jar build/libs/lingual-examples–1.0.0-wip-dev.jar
This sample app uses Lingual to open a JDBC connection and run a SQL query:
SELECT *
FROM "example"."sales_fact_1997" AS s
JOIN "example"."employee" AS e
ON e."EMPID\" = s."CUST_ID"
;
Example data is listed in the src/main/resources/data/example/ directory, as CSV files.
The results look like:
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
An equivalent Cascading app would be:
Tap empTap = new FileTap(new TextDelimited(true, ",", "\""), "src/test/data/employee.txt");
Tap salesTap = new FileTap(new TextDelimited(true, ",", "\""), "src/test/data/salesfact.txt");
Tap resultsTap = new FileTap(new TextDelimited(true, ",", "\""), "build/test/output/results.txt", SinkMode.REPLACE);
Pipe empPipe = new Pipe("emp");
Pipe salesPipe = new Pipe("sales");
Pipe join = new CoGroup(empPipe, new Fields("empid"), salesPipe, new Fields("cust_id"));
FlowDef flowDef = flowDef()
.setName("flow")
.addSource(empPipe, empTap)
.addSource(salesPipe, salesTap)
.addTailSink(join, resultsTap);
Flow flow = new LocalFlowConnector().connect(flowDef);
flow.start();
TupleEntryIterator iterator = resultTap.openForRead();
JDBC Access from R
The following example is based on the RJDBC package for R, assuming that the MySQL Sample Employee Database has been used as described above:
# JDBC support in R is provided by the RJDBC package http://www.rforge.net/RJDBC/
# install the RJDBC package; only needed once -- uncomment next line the first time
#install.packages("RJDBC", dep=TRUE)
# load the library
library(RJDBC)
# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver", "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
# set up a database connection to a local repository
connection <- dbConnect(drv, "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
# query the repository
df <- dbGetQuery(connection, "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
# uncomment next line the first time
#install.packages("ggplot2")
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
The last part of that R script calculates the age (in years) at time of hire for all people named ‘Gina’, summarized as:
> summary(df$hire_age) Min. 1st Qu. Median Mean 3rd Qu. Max. 20.86 27.89 31.70 31.61 35.01 43.92
and visualized in the plot:
