Scaling Terabytes of Data with Apache Spark and the Scala DSL in Production
Since Apache Spark became one of the leading Apache community projects, demand for Scala skills has grown drastically at a global level. As the entire Apache Spark framework is written in Scala from the ground up, it is a real pleasure to explore the beauty of the functional Scala DSL with Spark.
This talk intends to present:
The usage of the primary data structures (RDD, Dataset, DataFrame) in large-scale data processing with HBase (data lake) and Hive (analytical engine); a short sketch follows after this list.
The importance of physical data split-up techniques such as partition, repartition, and coalesce in scaling terabytes of data, using ~17 billion transactional records as a case study; see the partitioning sketch after this list.
The crucial and very interesting aspects of parallel and concurrent distributed data processing: tuning memory, caching, disk I/O, memory leaks, internal shuffle, Spark executors, Spark drivers, and so on; see the tuning sketch after this list.
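As a taste of the first topic, here is a minimal sketch of the three core abstractions alongside a Hive read. The database and table name (analytics.transactions) are hypothetical, and Hive support assumes a configured metastore:

```scala
import org.apache.spark.sql.SparkSession

object AbstractionsDemo {
  // Simple record type used to show the typed Dataset API.
  case class Txn(id: Long, amount: Double)

  def main(args: Array[String]): Unit = {
    // enableHiveSupport assumes Hive classes and a metastore are available.
    val spark = SparkSession.builder()
      .appName("spark-abstractions")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // RDD: the low-level distributed collection, no schema awareness.
    val rdd = spark.sparkContext.parallelize(Seq(Txn(1L, 10.5), Txn(2L, 99.0)))

    // Dataset: typed, compile-time checked, optimized by Catalyst.
    val ds = rdd.toDS()

    // DataFrame: a Dataset[Row], schema known only at runtime.
    val df = ds.toDF()

    // Reading from Hive; the table name is a hypothetical example.
    val hiveDf = spark.sql("SELECT * FROM analytics.transactions LIMIT 10")
    hiveDf.show()
  }
}
```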
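For the partitioning topic, the following sketch contrasts repartition and coalesce. The input/output paths, column name, and partition counts are illustrative assumptions, not tuned values:

```scala
import org.apache.spark.sql.SparkSession

object PartitioningDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioning").getOrCreate()

    // Hypothetical input path.
    val txns = spark.read.parquet("/data/transactions")

    // repartition(n) triggers a full shuffle; use it to increase
    // parallelism or to redistribute skewed data evenly.
    val wide = txns.repartition(400)

    // Repartitioning by a column co-locates rows with the same key,
    // which helps later joins and aggregations on that key.
    val byKey = txns.repartition(400, txns("account_id"))

    // coalesce(n) only merges existing partitions (no full shuffle);
    // use it to shrink the partition count cheaply, e.g. before a write.
    val narrow = wide.coalesce(50)

    narrow.write.mode("overwrite").parquet("/data/transactions_out")
  }
}
```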
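For the tuning topic, here is a sketch of executor/driver sizing, shuffle parallelism, and caching with an explicit unpersist. All configuration values are illustrative assumptions, and in cluster deployments the memory settings are normally passed at submit time rather than in code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative settings only; real values depend on cluster and workload.
    val spark = SparkSession.builder()
      .appName("tuning")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", "4")
      .config("spark.driver.memory", "4g")
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

    // Hypothetical input path.
    val txns = spark.read.parquet("/data/transactions")

    // MEMORY_AND_DISK spills blocks to disk instead of recomputing them
    // when the cache does not fit, avoiding out-of-memory failures.
    txns.persist(StorageLevel.MEMORY_AND_DISK)

    // Two actions reuse the cached data instead of re-reading the source.
    println(txns.count())
    txns.groupBy("account_id").count().show()

    // Release executor memory when the data is no longer needed,
    // preventing "leaked" cached blocks from accumulating.
    txns.unpersist()
  }
}
```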
Session length
40 minutes
Language of the presentation
English
Target audience
Beginner: no prior knowledge required
Who is your session intended for?
Those who understand basic functional programming with Scala or have an understanding of Java.
Those who understand concurrent programming or multithreading in Java / Scala.
Those with an interest in distributed data processing and a keen interest in data-scaling optimization.
Those who have worked with Big Data or Fast Data before, or have a keen interest in them.
Speaker
CHETANKUMAR KHATRI
(Lead, Data Science at Accion Labs)