Description
This hands-on course introduces Apache Spark 3.x, the powerful distributed computing engine. It is designed for developers, data analysts, architects, and technical managers who want to work with Spark in a practical, effective way.
You’ll gain a strong technical foundation in Spark’s architecture and operations, starting with core concepts like RDDs and Spark’s compute engine. From there, you’ll dive into higher-level APIs including DataFrames, DataSets, and Spark SQL—now the preferred tools for building robust and optimized data applications.
The course also covers key performance topics like query optimization, memory management, and caching, as well as advanced features such as Spark Structured Streaming and integration with Kafka for processing real-time data streams.
This course is highly interactive, featuring numerous hands-on labs. You’ll write code using the Spark Shell for quick, exploratory work and build full applications using the Spark API in Scala.
Note: Labs and exercises are in Scala. If you’re using Python, check out our companion course for PySpark.
Training Objectives
By the end of this course you will:
- Understand Spark’s role in modern data processing
- Grasp the architecture behind Spark and how it runs distributed computations
- Set up and run Spark locally or on a cluster
- Use the Spark Shell for interactive analysis
- Understand RDDs and how to transform and operate on them (e.g. map(), filter())
- Work with DataFrames and DataSets using Spark SQL
- Explore the Catalyst and Tungsten optimizers for performance
- Tune jobs for better memory and execution efficiency
- Use Spark's caching mechanisms
- Write and run standalone Spark applications
- Use Structured Streaming to process real-time data
- Ingest and process data from Kafka
- Optimize streaming pipelines for performance and scalability
Course Outline
- Session 1 (Optional): Scala Ramp-Up
- Introduction to Scala
- Variables, Data Types, Control Flow
- Scala Interpreter & Collections (map(), etc.)
- Functions, Methods, and Functional Programming
- Classes, Objects, Traits, Case Classes
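The Scala constructs covered in this ramp-up can be previewed in a short sketch. All names and values below are illustrative examples, not taken from the course materials:

```scala
// A small preview of the Scala constructs covered in Session 1.
// Names and values here are illustrative only.

// Case class: an immutable data type with equality and a readable
// toString generated automatically
case class Word(text: String, length: Int)

object ScalaRampUp {
  def main(args: Array[String]): Unit = {
    // Immutable values and type inference
    val words = List("spark", "scala", "rdd")

    // Collections expose map(), filter(), etc. -- the same style of
    // transformation Spark's RDD API uses later in the course
    val annotated = words.map(w => Word(w, w.length))
    val short     = annotated.filter(_.length <= 3)

    println(annotated) // List(Word(spark,5), Word(scala,5), Word(rdd,3))
    println(short)     // List(Word(rdd,3))
  }
}
```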
- Session 2: Spark Overview
- What is Spark?
- Spark Ecosystem and Use Cases
- Spark vs. Hadoop
- Installing Spark
- Using the Spark Shell & SparkContext
- Session 3: RDDs and Spark Internals
- RDD Concepts & Lifecycle
- Lazy Evaluation & Partitioning
- Transformations: map(), filter(), etc.
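Because map() and filter() on RDDs mirror the same methods on ordinary Scala collections, the transformation style can be sketched without a running cluster. The data below is an invented example; an RDD version would build the dataset with sc.parallelize(...) instead of a List:

```scala
// Sketch of the map()/filter() transformation style from Session 3.
// Shown on a plain Scala List so it runs without a Spark cluster; on an
// RDD the calls look the same but are evaluated lazily across partitions
// and only executed when an action (collect(), count(), ...) is called --
// the lazy evaluation topic above.
object TransformSketch {
  val lines = List("error: disk full", "info: started", "error: timeout")

  // filter() keeps only matching elements; map() transforms each element
  val errorMessages = lines
    .filter(_.startsWith("error"))
    .map(_.stripPrefix("error: "))
}
```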
- Session 4: DataFrames, DataSets, and Spark SQL
- Introduction to Spark SQL
- Creating and Managing DataFrames/DataSets
- Supported Formats (JSON, CSV, Parquet, etc.)
- Querying with DSL and SQL
- Working with Typed APIs (flatMap(), explode(), split())
- Comparing RDDs, DataFrames, and DataSets
- Session 5: Shuffles and Query Optimization
- Grouping, Reducing, Joining
- Narrow vs. Wide Dependencies
- Catalyst Optimizer: explain(), query plans
- Tungsten: binary formats, code generation
- Session 6: Performance Tuning
- Caching Strategies
- Minimizing Shuffle Overhead
- Using Broadcast Variables and Accumulators
- General Best Practices
- Session 7: Standalone Spark Applications
- SparkSession and Configuration
- Building Apps with sbt and spark-submit
- Application Lifecycle: Driver, Executors, Tasks
- Cluster Managers (Standalone, YARN, Mesos)
- Logging and Debugging
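A minimal sbt setup for a standalone application like those built in Session 7 might look like the following sketch. The project name and version numbers are illustrative assumptions, not prescribed by the course:

```scala
// build.sbt -- minimal sketch; name and versions are illustrative
name := "spark-example-app"
scalaVersion := "2.12.18"

// Marked "provided" because spark-submit supplies the Spark runtime
// on the cluster, so it should not be bundled into the application jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
```

The packaged jar would then be launched with spark-submit, passing the main class via `--class` and a cluster manager URL (or `local[*]` for local testing) via `--master`, as covered in the session.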
- Session 8: Structured Streaming with Kafka
- Introduction to Spark Structured Streaming
- Table-based Streaming Model
- Setting Up Streaming Pipelines
- Connecting to Kafka as a Source
- Consuming and Processing Kafka Streams
- Performance Considerations in Streaming