Scaling Terabytes of Data with Apache Spark and the Scala DSL in Production
Since Apache Spark became one of the leading Apache community projects, demand for Scala skills has grown drastically at a global level. As the entire Apache Spark framework is written in Scala from the ground up, it is a real pleasure to explore the beauty of the functional Scala DSL with Spark.
This talk intends to present:
The usage of the primary data structures (RDD, Dataset, DataFrame) in large-scale data processing with HBase (data lake) and Hive (analytical engine); a short sketch follows after this list.
The importance of physical data split-up techniques such as partition, repartition, and coalesce in scaling terabytes of data, using ~17 billion transactional records as a case study; see the partitioning sketch after this list.
The crucial and very interesting aspects of parallel and concurrent distributed data processing: tuning memory, caching, disk I/O, memory leaks, internal shuffle, Spark executors, Spark drivers, and so on; see the tuning sketch after this list.
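As a taste of the first topic, here is a minimal sketch of the three core abstractions alongside a Hive read. The database and table name (analytics.transactions) are hypothetical, and Hive support assumes a configured metastore:

```scala
import org.apache.spark.sql.SparkSession

object AbstractionsDemo {
  // Simple record type used to show the typed Dataset API.
  case class Txn(id: Long, amount: Double)

  def main(args: Array[String]): Unit = {
    // enableHiveSupport assumes Hive classes and a metastore are available.
    val spark = SparkSession.builder()
      .appName("spark-abstractions")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // RDD: the low-level distributed collection, no schema awareness.
    val rdd = spark.sparkContext.parallelize(Seq(Txn(1L, 10.5), Txn(2L, 99.0)))

    // Dataset: typed, compile-time checked, optimized by Catalyst.
    val ds = rdd.toDS()

    // DataFrame: a Dataset[Row], schema known only at runtime.
    val df = ds.toDF()

    // Reading from Hive; the table name is a hypothetical example.
    val hiveDf = spark.sql("SELECT * FROM analytics.transactions LIMIT 10")
    hiveDf.show()
  }
}
```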
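For the partitioning topic, the following sketch contrasts repartition and coalesce. The input/output paths, column name, and partition counts are illustrative assumptions, not tuned values:

```scala
import org.apache.spark.sql.SparkSession

object PartitioningDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioning").getOrCreate()

    // Hypothetical input path.
    val txns = spark.read.parquet("/data/transactions")

    // repartition(n) triggers a full shuffle; use it to increase
    // parallelism or to redistribute skewed data evenly.
    val wide = txns.repartition(400)

    // Repartitioning by a column co-locates rows with the same key,
    // which helps later joins and aggregations on that key.
    val byKey = txns.repartition(400, txns("account_id"))

    // coalesce(n) only merges existing partitions (no full shuffle);
    // use it to shrink the partition count cheaply, e.g. before a write.
    val narrow = wide.coalesce(50)

    narrow.write.mode("overwrite").parquet("/data/transactions_out")
  }
}
```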
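For the tuning topic, here is a sketch of executor/driver sizing, shuffle parallelism, and caching with an explicit unpersist. All configuration values are illustrative assumptions, and in cluster deployments the memory settings are normally passed at submit time rather than in code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative settings only; real values depend on cluster and workload.
    val spark = SparkSession.builder()
      .appName("tuning")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", "4")
      .config("spark.driver.memory", "4g")
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

    // Hypothetical input path.
    val txns = spark.read.parquet("/data/transactions")

    // MEMORY_AND_DISK spills blocks to disk instead of recomputing them
    // when the cache does not fit, avoiding out-of-memory failures.
    txns.persist(StorageLevel.MEMORY_AND_DISK)

    // Two actions reuse the cached data instead of re-reading the source.
    println(txns.count())
    txns.groupBy("account_id").count().show()

    // Release executor memory when the data is no longer needed,
    // preventing "leaked" cached blocks from accumulating.
    txns.unpersist()
  }
}
```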
Session length
40 minutes
Language of the presentation
English
Target audience
Beginner: no prior knowledge required
Who is your session intended for?
Those who understand basic functional programming with Scala or have an understanding of Java.
Those who understand concurrent programming or multithreading in Java / Scala.
Those with an interest in distributed data processing and a keen interest in data-scaling optimization.
Those who have worked with Big Data or Fast Data before, or have a keen interest in them.
Speaker
CHETANKUMAR KHATRI
(Lead, Data Science at Accion Labs)