by Oleksandr Berchenko

Setup Cassandra + Spark + Tableau (including DSE 5.0)

22 min read

This blog post shows how to set up Cassandra + Spark + Tableau using both DataStax Community Edition and DataStax Enterprise, and compares the differences between the two.

With Amazon AWS as our environment and CentOS 7.1 as our OS, we will set up five instances:

  1. Cassandra 2.1 + Spark 1.4.1
  2. Cassandra 3.0 + Spark 1.6.1
  3. DSE 4.8 (Cassandra 2.1 + Spark 1.4.2)
  4. DSE 5.0 (Cassandra 3.0 + Spark 1.6.1)
  5. Tableau 9.3

Please note that, for the sake of simplicity, we will set up single Cassandra + Spark instances, not clusters. For the same reason, we will not enable any security features such as user authentication.
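
If you do enable authentication in a real deployment, the Spark Cassandra Connector used later in this post can pass the credentials via its configuration. The following is only a hypothetical sketch for a standalone Spark application, assuming Cassandra's PasswordAuthenticator and placeholder credentials; none of this is needed for the walkthrough below:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: hand Cassandra credentials to the Spark Cassandra
    // Connector (relevant only if authentication is enabled in cassandra.yaml).
    val conf = new SparkConf()
      .setAppName("cassandra-auth-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.auth.username", "cassandra")  // placeholder user
      .set("spark.cassandra.auth.password", "cassandra")  // placeholder password
    val sc = new SparkContext(conf)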

Cassandra + Spark + Tableau using DataStax Community Edition

Instructions for Cassandra 2.1 and 3.0 are almost identical, so we will just emphasize the differences.

1. Create Cassandra + Spark instance.

  • Go to EC2 Management Console and click Launch Instance
  • Choose an AMI and click Select. I have chosen CentOS 7.1 x86_64 with cloud-init (HVM) (ami-91e416fa)
  • Choose an Instance Type and click Next. I have chosen m4.xlarge (4 vCPUs and 16 GiB)
  • Choose a network and click Next (write down its IP range)
  • Add storage and click Next (I have chosen 8 GiB of GP2)
  • Assign tags if needed and click Next
  • Create a new or select an existing security group. The following inbound rules must exist:
    All traffic | All | All | <IP range of the network above>
    SSH | TCP | 22 | 0.0.0.0/0
    RDP | TCP | 3389 | 0.0.0.0/0
    Custom TCP Rule | TCP | 8888 | 0.0.0.0/0

  • Click Review and Launch
  • Click Launch
  • Either create a new key pair or choose an existing key pair, check the checkbox and click Launch Instances
  • Click View Instances
  • Wait until the instance is launched (all status checks have passed)
  • Write down the Private and Public IPs of the instance
  • Done!

2. Setup the environment.

  • Make an SSH connection to the instance (if you have chosen ami-91e416fa, its default user is "ec2-user")
  • Update packages:
    sudo yum -y install yum-plugin-fastestmirror
    sudo yum -y update
  • Download Java (choose jdk-8u92-linux-x64.rpm) and then scp it to the instance
  • Install Java:
    sudo rpm -ivh jdk-8u92-linux-x64.rpm
    sudo alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_92/bin/java 200000
  • Make sure the correct Java version is used:
    sudo alternatives --config java

    Choose /usr/java/jdk1.8.0_92/bin/java if needed.

  • Set JAVA_HOME in your environment:
    vi ~/.bashrc

    Add a line:
    export JAVA_HOME=/usr/java/latest

    Run:
    source ~/.bashrc

  • Verify that Java is correctly installed:
    java -version
  • Done!

3. Install Cassandra.

  • Add DataStax repo:
    sudo vi /etc/yum.repos.d/datastax.repo

    Add:

    [datastax]
    name = DataStax Repo for Apache Cassandra
    baseurl = https://rpm.datastax.com/community
    enabled = 1
    gpgcheck = 0
  • Install Cassandra 2.1:
    sudo yum -y install dsc21
    sudo yum -y install cassandra21-tools
  • If you install Cassandra 3.0 instead, run:
    sudo yum -y install dsc30
    sudo yum -y install cassandra30-tools

    A bit of hacking (for Cassandra 3.0 only):

    sudo rm /usr/lib/python2.7/site-packages/cqlshlib/copyutil.so
  • If you want to change the name of your cluster:
    sudo vi /etc/cassandra/conf/cassandra.yaml

    Change the line:

    cluster_name: 'Test Cluster'
  • Start Cassandra:
    sudo service cassandra start
    sudo service cassandra status
  • Verify that Cassandra is up and running:
    nodetool status
    cqlsh
    exit
  • Cassandra log files are located here:
    /var/log/cassandra/
  • Done!

4. Install OpsCenter.

This step is optional and applies only if you are using Cassandra 2.1 or DataStax Enterprise (both 4.8 and 5.0). Unfortunately, the OpsCenter that ships with Community Edition does not work with Cassandra 3.0 yet.

  • Install OpsCenter:
    sudo yum -y install opscenter
  • Start OpsCenter:
    sudo service opscenterd start
    sudo service opscenterd status
  • Open http://<Public IP>:8888 in a browser
  • Choose Manage Existing Cluster
  • Enter 127.0.0.1
  • Follow any further instructions in the UI
  • Done!

5. Create a test database.

Let's use the well-known KillrVideo database for a fictitious video-sharing website.

  • Install Git, if needed:
    sudo yum -y install git
  • Download the sources:
    git clone https://github.com/pmcfadin/killrvideo-sample-schema
    cd killrvideo-sample-schema
    git reset --hard 1078160c267aaf17e22e6569bfdd8ec593da7534
    cd ..

    Please note that we reset to a specific commit that is compatible with both Cassandra 2.1 and 3.0.

  • Create schema:
    cqlsh -f killrvideo-sample-schema/killrvideo-schema.cql
  • Insert data:
    cqlsh -f killrvideo-sample-schema/killrvideo-inserts.cql
  • Verify the database:
    cqlsh
    use killrvideo;
    describe tables;
    select * from videos;
    exit
  • Done!

6. Install Spark.

For Cassandra 2.1 we will use Spark 1.4.1 and for Cassandra 3.0 we will use Spark 1.6.1.

  • Get a link to a Spark binary. For Cassandra 2.1 choose "Spark 1.4.1", "Pre-built for Hadoop 2.6 or later" and "Direct Download". For Cassandra 3.0 choose "Spark 1.6.1", "Pre-built for Hadoop 2.6 or later" and "Direct Download".
  • Install wget, if needed:
    sudo yum -y install wget
  • Download Spark binary:
    wget <link to a Spark binary>
  • If you have Spark 1.4.1:
    tar xvfz spark-1.4.1-bin-hadoop2.6.tgz
    cd spark-1.4.1-bin-hadoop2.6
  • If you have Spark 1.6.1:
    tar xvfz spark-1.6.1-bin-hadoop2.6.tgz
    cd spark-1.6.1-bin-hadoop2.6
  • Run Spark shell and execute a quick smoke test (the sum should come back as 1275.0):
    ./bin/spark-shell
    sc.parallelize( 1 to 50 ).sum()
    sys.exit()
  • Done!

7. Install Spark Cassandra Connector.

  • Download the sources:
    git clone https://github.com/datastax/spark-cassandra-connector
    cd spark-cassandra-connector
  • Build the sources.

    If you have Spark 1.4.1:

    git checkout tags/v1.4.1
    ./sbt/sbt assembly
    cd ..
    ln -s spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.4.1-SNAPSHOT.jar spark-cassandra-connector.jar

    If you have Spark 1.6.1:

    git checkout tags/v1.6.0-M2
    ./sbt/sbt assembly
    cd ..
    ln -s spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.6.0-M2.jar spark-cassandra-connector.jar

    I also tried to use JARs from here, instead of compiling them manually, but for some reason none of them worked for me (I tried different combinations).

  • Set connection.host:
    cp conf/spark-defaults.conf.template conf/spark-defaults.conf
    vi conf/spark-defaults.conf

    Add the line:
    spark.cassandra.connection.host 127.0.0.1

  • Run Spark shell with Spark Cassandra Connector (an RDD-based alternative is sketched right after this list).

    If you have Spark 1.4.1:

    ./bin/spark-shell --jars spark-cassandra-connector.jar
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val results = hc.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "killrvideo", "table" -> "videos")).load()
    results.collect().foreach(println)
    sys.exit()

    If you have Spark 1.6.1:

    ./bin/spark-shell --jars spark-cassandra-connector.jar
    val results = sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "killrvideo", "table" -> "videos")).load()
    results.collect().foreach(println)
    sys.exit()

    In both Spark 1.4.1 and Spark 1.6.1 you may also use the following syntax, but it's considered deprecated:

    ./bin/spark-shell --jars spark-cassandra-connector.jar
    import org.apache.spark.sql.cassandra.CassandraSQLContext
    val csc = new CassandraSQLContext(sc)
    csc.setKeyspace("killrvideo")
    val results = csc.sql("select * from videos")
    results.collect().foreach(println)
    sys.exit()
  • Done!
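
Besides the Data Sources API used above, the connector also exposes Cassandra tables directly as Spark RDDs. Here is a minimal sketch of that alternative, assuming the same spark-shell session started with --jars spark-cassandra-connector.jar:

    import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

    // Read killrvideo.videos as an RDD of CassandraRow objects.
    val videos = sc.cassandraTable("killrvideo", "videos")
    println(videos.count())

    // Project a single column; the selection is pushed down to Cassandra.
    videos.select("name").take(5).foreach(println)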

8. Run Spark SQL Thrift Server and create Hive tables.

  • Download MySQL connector:
    wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.38.tar.gz
    tar xvfz mysql-connector-java-5.1.38.tar.gz
  • Run Spark SQL Thrift Server:
    ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=<Private IP> --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/hive/warehouse --jars spark-cassandra-connector.jar --driver-class-path mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar

    Please note that you need to set a path for the Hive warehouse, and it must be writable by the Spark process (create the directory up front and grant write access if needed).

  • Verify it has successfully started:
    tail -f logs/*.out
  • Connect to Spark SQL Thrift Server using beeline:
    ./bin/beeline -u jdbc:hive2://<Private IP>:10000
    show databases;

    Keep in mind that Cassandra keyspaces are not automatically visible here, so you need to create the corresponding databases and tables manually.

  • If you have Spark 1.6.1, create a new Hive database:
    create database killrvideo;
    use killrvideo;

    If you have Spark 1.4.1, Cassandra tables may be created only in the "default" database.

  • Manually create all the tables (a scripted alternative is sketched after this list):
    create table uploaded_videos using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'uploaded_videos');
    create table uploaded_videos_by_jobid using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'uploaded_videos_by_jobid');
    create table users using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'users');
    create table user_credentials using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'user_credentials');
    create table encoding_job_notifications using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'encoding_job_notifications');
    create table latest_videos using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'latest_videos');
    create table video_rating using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'video_rating');
    create table comments_by_user using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'comments_by_user');
    create table tags_by_letter using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'tags_by_letter');
    create table videos using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'videos');
    create table user_videos using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'user_videos');
    create table videos_by_tag using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'videos_by_tag');
    create table video_event using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'video_event');
    create table video_ratings_by_user using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'video_ratings_by_user');
    create table comments_by_video using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'comments_by_video');
    show tables;
  • Verify that the new tables are correctly mapped to Cassandra tables:
    select * from videos;
    !quit
  • Done!
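
Since the fifteen CREATE TABLE statements above differ only in the table name, you can also generate and run them from the Spark shell instead of typing them into beeline. This is only a sketch, assuming a spark-shell session (Spark 1.6.1, started with the connector JAR) that points at the same Hive metastore as the Thrift server; for Spark 1.4.1, skip the database statements and use hc instead of sqlContext:

    // The Cassandra tables from the KillrVideo schema above.
    val tables = Seq(
      "uploaded_videos", "uploaded_videos_by_jobid", "users", "user_credentials",
      "encoding_job_notifications", "latest_videos", "video_rating",
      "comments_by_user", "tags_by_letter", "videos", "user_videos",
      "videos_by_tag", "video_event", "video_ratings_by_user", "comments_by_video")

    sqlContext.sql("create database if not exists killrvideo")
    sqlContext.sql("use killrvideo")

    // Issue one CREATE TABLE per Cassandra table, mirroring the statements above.
    tables.foreach { t =>
      sqlContext.sql(
        s"create table $t using org.apache.spark.sql.cassandra " +
        s"options (cluster 'Test Cluster', keyspace 'killrvideo', table '$t')")
    }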

9. Create Tableau instance.

  • Go to EC2 Management Console and click Launch Instance
  • Choose an AMI and click Select. I have chosen Windows_Server-2012-R2_RTM-English-64Bit-Base-2016.04.13 (ami-c8a9baa2)
  • Choose an Instance Type and click Next. I have chosen m4.large (2 vCPUs and 8 GiB)
  • Choose the same network as for Cassandra + Spark instance and click Next
  • Add storage and click Next (I have used the default)
  • Assign tags if needed and click Next
  • Select the same security group as for Cassandra + Spark instance and click Review and Launch
  • Click Launch
  • Choose the same key pair as for Cassandra + Spark instance, check the checkbox and click Launch Instances
  • Click View Instances
  • Wait until the instance is launched (all status checks have passed)
  • Click Get Windows Password
  • Choose the correct Key Pair Path and click Decrypt Password
  • Write down the public DNS, user name and password, and click Close
  • Done!

10. Install Tableau.

11. Connect Tableau to Spark SQL Thrift Server.

  • Click Connect To a server -> More Servers... and choose Spark SQL
  • Enter the following values:
    Server: <Private IP>
    Port: 10000
    Type: SparkThriftServer...
    Authentication: User Name
    Username: test (doesn't matter)
  • Click OK
  • Click Select Schema dropdown and then Search icon
  • If you have Spark 1.4.1, choose "default". If you have Spark 1.6.1, choose "killrvideo".
  • Click Search icon to display all tables
  • Double click "videos" table
  • Click Update Now
  • Double click "videos_by_tag" table
  • Click Update Now
  • Done!

12. Test materialized views in Cassandra 3.0.

Materialized views are a new feature of Cassandra 3.0, so let's test them as well.

  • Create a materialized view:
    cqlsh
    use killrvideo;
    create materialized view videos_view as select name, videoid, description from videos where videoid is not null and name is not null primary key (name, videoid);
    select * from videos_view where name = 'My funny cat';
    exit
  • Run Spark shell:
    ./sbin/stop-thriftserver.sh
    ./bin/spark-shell --jars spark-cassandra-connector.jar
    import org.apache.spark.sql.cassandra.CassandraSQLContext
    val csc = new CassandraSQLContext(sc)
    csc.setKeyspace("killrvideo")
    val results = csc.sql("select * from videos_view")
    results.collect().foreach(println)
    sys.exit()
  • Run Spark SQL Thrift Server and connect to it using beeline:
    ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=<Private IP> --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.metastore.warehouse.dir=/hive/warehouse --jars spark-cassandra-connector.jar --driver-class-path mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar
    ./bin/beeline -u jdbc:hive2://<Private IP>:10000
    use killrvideo;
    create table videos_view using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'videos_view');
    select * from videos_view;
    !quit
  • Connect Tableau to Spark SQL Thrift Server
  • Click Select Schema dropdown and then Search icon
  • Choose "killrvideo"
  • Click Search icon to display all tables
  • Double click "videos_view" table
  • Click Update Now

    The output is fine. You can use materialized views in exactly the same way as regular Cassandra tables (for reads; writes still go through the base table).
  • Done!

Cassandra + Spark + Tableau using DataStax Enterprise 4.8

1. Create a DataStax Enterprise instance.

  • Follow steps 1 and 2 from "Cassandra + Spark + Tableau using DataStax Community Edition"
  • Done!

2. Install DataStax Enterprise.

  • Register at DataStax Academy (if you haven't yet). You will need your email address and password
  • Make an SSH connection to the instance
  • Add DataStax repo:
    sudo vi /etc/yum.repos.d/datastax.repo

    Add:
    [datastax]
    name = DataStax Repo for DataStax Enterprise
    baseurl=https://<email address, make sure "@" is replaced with "%40">:<password>@rpm.datastax.com/enterprise
    enabled=1
    gpgcheck=0

  • Install DataStax Enterprise and enable Spark mode:
    sudo yum -y install dse-full-4.8.*
    sudo vi /etc/default/dse

    Set the line:
    SPARK_ENABLED=1
  • If you want to change the name of your cluster:
    sudo vi /etc/dse/cassandra/cassandra.yaml

    Change the line:

    cluster_name: 'Test Cluster'
  • Start DSE:
    sudo service dse start
    sudo service dse status
    sudo service datastax-agent start
    sudo service datastax-agent status
  • Verify that Cassandra is up and running:
    nodetool status
    cqlsh
    exit
  • Cassandra log files are located here:
    /var/log/cassandra/
  • Follow step 4 from “Cassandra + Spark + Tableau using DataStax Community Edition”
  • Done!

3. Create a test database.

  • Follow step 5 from “Cassandra + Spark + Tableau using DataStax Community Edition”
  • Additionally, download another test database and then scp all files to the instance.
  • Create the database:
    chmod 755 ScriptCQL.sh
    ./ScriptCQL.sh
  • Done!

4. Verify Spark.

  • Run Spark (DSE's Spark shell pre-creates a Hive context, hc):
    dse spark
    sc.parallelize( 1 to 50 ).sum()
    val results = hc.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "killrvideo", "table" -> "videos")).load()
    results.collect().foreach(println)
    sys.exit()
  • Done!

5. Run Spark SQL Thrift Server and create Hive tables.

  • Run Spark SQL Thrift Server:
    dse spark-sql-thriftserver start
  • Connect to Spark SQL Thrift Server using DSE beeline:
    dse beeline -u jdbc:hive2://<Private IP>:10000
    show databases;

    Please note that all Cassandra databases are automatically visible. Wow, right? 😊

  • In addition to the automatic tables, create the tables manually in the "default" database (as for Community Edition). See step 8 from “Cassandra + Spark + Tableau using DataStax Community Edition”.
  • Done!

6. Run test queries.

This step is not so much an instruction as a log of my actions and the issues they revealed. Note that I ran all of the queries below against Community Edition as well, and no issues appeared there.

  • Run test queries:
    use ks_music;
    select * from albums limit 10;

    The output is fine.

    use killrvideo;
    select videoid from videos;

    Note that the output is unusable binary data.

    select tags from videos;

    Oops... I get the following error:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: ARRAY_TYPE

    Select another field:

    select preview_thumbnails from videos;

    Almost the same error appears:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: MAP_TYPE

    Run:

    use default
    select videoid from videos;

    The output is fine (no binary data).

    Run:

    select tags from videos;

    The error persists:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: ARRAY_TYPE

    Run:

    select preview_thumbnails from videos;

    The error persists:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: MAP_TYPE

    Quit:

    !quit

    Well, there are definitely issues with UUID and collection types. Interestingly, the UUID value is displayed incorrectly only when we use the automatically imported tables. Let's try the original beeline tool (not DSE beeline); a schema check that helps explain this behavior is sketched at the end of this step.

  • Get a link to a Spark binary. Choose "Spark 1.4.1", "Pre-built for Hadoop 2.6 or later" and "Direct Download".
  • Install wget, if needed:
    sudo yum -y install wget
  • Download Spark binary:
    wget <link>
    tar xvfz spark-1.4.1-bin-hadoop2.6.tgz
  • Connect to Spark SQL Thrift Server using beeline:
    cd spark-1.4.1-bin-hadoop2.6
    ./bin/beeline -u jdbc:hive2://<Private IP>:10000
    show databases;
  • Run the same queries:
    use killrvideo;
    select videoid from videos;

    A new error appears:

    Error: java.lang.String cannot be cast to [B (state=,code=0)

    Run:

    select tags, preview_thumbnails from videos;

    The output is fine.

    Run:

    use default;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Quit:

    !quit

    With the original beeline, the only error concerns UUID values in the automatically imported tables. Manually created tables work fine.

  • Stop Spark SQL Thrift Server and run DSE Spark SQL:
    dse spark-sql-thriftserver stop
    dse spark-sql
    show databases;
  • Run the same queries:
    use killrvideo;
    select videoid from videos;

    The output is binary.

    Run:

    select tags, preview_thumbnails from videos;

    The output is fine.

    Run:

    use default;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Quit:

    exit;

    With DSE Spark SQL everything works fine, except for the binary output of UUID values from the automatically imported tables.

  • Restart Spark SQL Thrift Server:
    dse spark-sql-thriftserver start
  • Done!
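
The behavior above gets clearer if you inspect the schema that the connector's Data Sources API infers for the table. This is just my own sanity check, run inside dse spark:

    // Inspect the inferred schema for killrvideo.videos (inside "dse spark").
    val videos = hc.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "killrvideo", "table" -> "videos")).load()

    videos.printSchema()
    // Expect videoid to be inferred as a plain string (which is why the manually
    // created tables behave), while tags is an array and preview_thumbnails is
    // a map -- the two types behind the "Unregonized Thrift TTypeId" errors.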

7. Connect Tableau to Spark SQL Thrift Server.

  • Make an RDP connection to the Tableau instance
  • Run Tableau
  • Click Connect To a server -> More Servers... and choose Spark SQL
  • Enter the following values:
    Server: <Private IP>
    Port: 10000
    Type: SparkThriftServer...
    Authentication: User Name
    Username: test (doesn't matter)
  • Click OK
  • Click Select Schema dropdown and then Search icon
  • Choose "ks_music"
  • Click Search icon to display all tables
  • Double click "albums" table
  • Click Update Now
  • Double click "performers" table
  • Click Update Now

    So far, everything works fine.

  • Remove the tables from the area
  • Click Select Schema dropdown and then Search icon
  • Choose "killrvideo"
  • Click Search icon to display all tables
  • Double click "videos" table
  • Click Update Now

    Oops... The following error appears:

    [Simba][Hardy] (35) Error from server: error code: '0' error message: 'java.lang.String cannot be cast to [B'.

    We have already seen this error with the original beeline tool. Note that beeline uses JDBC, while Tableau uses ODBC.

  • Remove the table from the area
  • Click Select Schema dropdown and then Search icon
  • Choose "default"
  • Click Search icon to display all tables
  • Double click "videos" table
  • Click Update Now

    The output is fine.

  • Double click "videos_by_tag" table
  • Click Update Now

    The output is fine. It looks like there are no issues with manually created tables, while the automatically imported tables have an issue with the UUID type.

  • Done!

Cassandra + Spark + Tableau using DataStax Enterprise 5.0

1. Create a DataStax Enterprise instance.

  • Follow steps 1 and 2 from "Cassandra + Spark + Tableau using DataStax Community Edition"
  • Done!

2. Install DataStax Enterprise.

  • Register at DataStax Academy (if you haven't yet). You will need your email address and password.
  • Make an SSH connection to the instance
  • Add DataStax repo:
    sudo vi /etc/yum.repos.d/datastax.repo

    Add:
    [datastax]
    name = DataStax Repo for DataStax Enterprise
    baseurl=https://<email address, make sure "@" is replaced with "%40">:<password>@rpm.datastax.com/enterprise
    enabled=1
    gpgcheck=0

  • Install DataStax Enterprise (including the DataStax Agent) and enable Spark mode:
    sudo yum -y install dse-full-5.0.*
    sudo yum -y install datastax-agent
    sudo vi /etc/default/dse

    Set the line:
    SPARK_ENABLED=1
  • If you want to change the name of your cluster:
    sudo vi /etc/dse/cassandra/cassandra.yaml

    Change the line:
    cluster_name: 'Test Cluster'

  • Start DSE:
    sudo service dse start
    sudo service dse status
    sudo service datastax-agent start
    sudo service datastax-agent status
  • Verify that Cassandra is up and running:
    nodetool status
    cqlsh
    exit
  • Cassandra log files are located here:
    /var/log/cassandra/
  • Follow step 4 from “Cassandra + Spark + Tableau using DataStax Community Edition”
  • Done!

3. Create a test database.

  • Follow step 5 from “Cassandra + Spark + Tableau using DataStax Community Edition”
  • Additionally, download another test database and then scp all files to the instance
  • Create the database:
    chmod 755 ScriptCQL.sh
    ./ScriptCQL.sh
  • Done!

4. Verify Spark.

  • Run Spark:
    dse spark
    sc.parallelize( 1 to 50 ).sum()
    val results = sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "killrvideo", "table" -> "videos")).load()
    results.collect().foreach(println)
    sys.exit()
  • Done!

5. Run Spark SQL Thrift Server and create Hive tables.

  • Run Spark SQL Thrift Server:
    dse spark-sql-thriftserver start
  • Connect to Spark SQL Thrift Server using DSE beeline:
    dse beeline -u jdbc:hive2://<Private IP>:10000
    show databases;

    Please note that all Cassandra databases are automatically visible.

  • In addition to the automatic tables, create the tables manually in the "default" database (as for Community Edition). See step 8 from “Cassandra + Spark + Tableau using DataStax Community Edition”.
  • Done!

6. Run test queries.

Again, this step is not so much an instruction as a log of my actions and the issues they revealed.

  • Run test queries:
    use ks_music;
    select * from albums limit 10;

    The output is fine.

    use killrvideo;
    select videoid from videos;

    The output is fine.

    select tags from videos;

    I get the following error:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: ARRAY_TYPE

    Select another field:

    select preview_thumbnails from videos;

    Almost the same error appears:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: MAP_TYPE

    Run:

    use default
    select videoid from videos;

    The output is fine.

    Run:

    select tags from videos;

    The error persists:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: ARRAY_TYPE

    Run:

    select preview_thumbnails from videos;

    The error persists:

    java.lang.IllegalArgumentException: Unregonized Thrift TTypeId value: MAP_TYPE

    Quit:

    !quit

    Well, there are still issues with collection types. Let's try the original beeline tool (not DSE beeline).

  • Get a link to a Spark binary. Choose "Spark 1.6.1", "Pre-built for Hadoop 2.6 or later" and "Direct Download".
  • Install wget, if needed:
    sudo yum -y install wget
  • Download Spark binary:
    wget <link>
    tar xvfz spark-1.6.1-bin-hadoop2.6.tgz
  • Connect to Spark SQL Thrift Server using beeline:
    cd spark-1.6.1-bin-hadoop2.6
    ./bin/beeline -u jdbc:hive2://<Private IP>:10000
    show databases;
  • Run:
    use killrvideo;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Run:

    use default;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Quit:

    !quit

    With the original beeline everything works fine.

  • Stop Spark SQL Thrift Server and run DSE Spark SQL:
    dse spark-sql-thriftserver stop
    dse spark-sql
    show databases;
  • Run:
    use killrvideo;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Run:

    use default;
    select videoid, tags, preview_thumbnails from videos;

    The output is fine.

    Quit:

    exit;

    With DSE Spark SQL everything works fine.

  • Restart Spark SQL Thrift Server:
    dse spark-sql-thriftserver start
  • Done!

7. Connect Tableau to Spark SQL Thrift Server.

  • Make an RDP connection to the Tableau instance
  • Run Tableau
  • Click Connect To a server -> More Servers... and choose Spark SQL
  • Enter the following values:
    Server: <Private IP>
    Port: 10000
    Type: SparkThriftServer...
    Authentication: User Name
    Username: test (doesn't matter)
  • Click OK
  • Click Select Schema dropdown and then Search icon
  • Choose "ks_music"
  • Click Search icon to display all tables
  • Double click "albums" table
  • Click Update Now
  • Double click "performers" table
  • Click Update Now

    So far, everything works fine.

  • Remove the tables from the area
  • Click Select Schema dropdown and then Search icon
  • Choose "killrvideo"
  • Click Search icon to display all tables
  • Double click "videos" table
  • Click Update Now

    The output is fine.

  • Remove the table from the area
  • Click Select Schema dropdown and then Search icon
  • Choose "default"
  • Click Search icon to display all tables
  • Double click "videos" table
  • Click Update Now

    The output is fine.

  • Double click "videos_by_tag" table
  • Click Update Now

    The output is fine. It looks like there are no issues using either manually created tables or automatically imported tables.

  • Done!

8. Test materialized views.

  • Create a materialized view:
    cqlsh
    use killrvideo;
    create materialized view videos_view as select name, videoid, description from videos where videoid is not null and name is not null primary key (name, videoid);
    select * from videos_view where name = 'My funny cat';
    exit
  • Run Spark shell:
    dse spark-sql-thriftserver stop
    dse spark
    import org.apache.spark.sql.cassandra.CassandraSQLContext
    val csc = new CassandraSQLContext(sc)
    csc.setKeyspace("killrvideo")
    val results = csc.sql("select * from videos_view")
    results.collect().foreach(println)
    sys.exit()
  • Run Spark SQL Thrift Server and connect to it using DSE beeline:
    dse spark-sql-thriftserver start
    dse beeline -u jdbc:hive2://<Private IP>:10000
    use killrvideo;
    show tables;

    Please note that "videos_view" is not in the list of automatically imported tables.

    Run:

    select * from videos_view;

    Unsurprisingly, the following error appears:

    Error: org.apache.spark.sql.AnalysisException: Table not found: videos_view; line 1 pos 14 (state=,code=0)

    Create the table manually:

    use default;
    create table videos_view using org.apache.spark.sql.cassandra options (cluster 'Test Cluster', keyspace 'killrvideo', table 'videos_view');
    select * from videos_view;

    Everything is fine now.

    Quit:

    !quit
  • Connect to Spark SQL Thrift Server using beeline:
    ./bin/beeline -u jdbc:hive2://<Private IP>:10000
    use default;
    select * from videos_view;
    !quit

    Everything is fine.

  • Run DSE Spark SQL:
    dse spark-sql-thriftserver stop
    dse spark-sql
    use default;
    select * from videos_view;
    exit;

    Everything is fine.

  • Restart Spark SQL Thrift Server:
    dse spark-sql-thriftserver start
  • Connect Tableau to Spark SQL Thrift Server
  • Click Select Schema dropdown and then Search icon
  • Choose "default"
  • Click Search icon to display all tables
  • Double click "videos_view" table
  • Click Update Now

    The output is fine. You can use materialized views in exactly the same way as regular Cassandra tables.

  • Done!

Conclusions

It's possible to set up Cassandra + Spark + Tableau using both DataStax Community Edition and DataStax Enterprise.

The first option is a bit complicated (but still doable). Also, you have to manually create a Hive table for each Cassandra table that you want to query via Spark SQL. Of course, this option is free.

The second option is much simpler and provides all the additional advantages of DataStax Enterprise (not directly related to Cassandra and Spark integration). It also automatically makes all Cassandra tables visible in Spark SQL (except for materialized views). In DSE 4.8 these automatic tables do not work very well with certain data types such as UUID, which could be a show-stopper for using them in a real application. See also this open question on Stack Overflow. Luckily, this issue has already been fixed in DSE 5.0. Of course, this option is not free.

Here’s a summary of the use cases that don't work as expected:

  • OpsCenter from Community Edition does not work with Cassandra 3.0.
  • DSE beeline (both 4.8 and 5.0) fails on collection types (list, map) with "Unregonized Thrift TTypeId" errors.
  • In DSE 4.8, UUID values from automatically imported tables come back as binary or fail to cast (affecting beeline, Spark SQL and Tableau); DSE 5.0 fixes this.
  • Materialized views are never imported automatically, so they must always be mapped manually.

The actual performance of each Cassandra + Spark solution (using Cassandra 2.1, Cassandra 3.0, DataStax Enterprise 4.8 and DataStax Enterprise 5.0) remains an open question. Look out for my next article for that investigation.

Useful Reads