Description
This hands-on course introduces Apache Spark 3.x, the powerful distributed computing engine. It is designed for developers, data analysts, architects, and technical managers who want to work with Spark in a practical, effective way.
You’ll gain a strong technical foundation in Spark’s architecture and operations, starting with core concepts like RDDs and Spark’s compute engine. From there, you’ll dive into higher-level APIs including DataFrames, DataSets, and Spark SQL—now the preferred tools for building robust and optimized data applications.
The course also covers key performance topics like query optimization, memory management, and caching, as well as advanced features such as Spark Structured Streaming and integration with Kafka for processing real-time data streams.
This course is highly interactive, featuring numerous hands-on labs. You’ll write code using the Spark Shell for quick, exploratory work and build full applications using the Spark API in Scala.
Note: Labs and exercises are in Scala. If you’re using Python, check out our companion course for PySpark.
Training Objectives
By the end of this course you will:
- Understand Spark’s role in modern data processing
- Grasp the architecture behind Spark and how it runs distributed computations
- Set up and run Spark locally or on a cluster
- Use the Spark Shell for interactive analysis
- Understand RDDs and how to transform and operate on them (e.g. map(), filter())
- Work with DataFrames and DataSets using Spark SQL
- Explore the Catalyst and Tungsten optimizers for performance
- Tune jobs for better memory and execution efficiency
- Use Spark's caching mechanisms
- Write and run standalone Spark applications
- Use Structured Streaming to process real-time data
- Ingest and process data from Kafka
- Optimize streaming pipelines for performance and scalability
Course Outline
- Session 1 (Optional): Scala Ramp-Up
- Introduction to Scala
- Variables, Data Types, Control Flow
- Scala Interpreter & Collections (map(), etc.)
- Functions, Methods, and Functional Programming
- Classes, Objects, Traits, Case Classes
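The Scala constructs covered in this ramp-up can be previewed in a short sketch. All names and values below are illustrative examples, not taken from the course materials:

```scala
// A small preview of the Scala constructs covered in Session 1.
// Names and values here are illustrative only.

// Case class: an immutable data type with equality and a readable
// toString generated automatically
case class Word(text: String, length: Int)

object ScalaRampUp {
  def main(args: Array[String]): Unit = {
    // Immutable values and type inference
    val words = List("spark", "scala", "rdd")

    // Collections expose map(), filter(), etc. -- the same style of
    // transformation Spark's RDD API uses later in the course
    val annotated = words.map(w => Word(w, w.length))
    val short     = annotated.filter(_.length <= 3)

    println(annotated) // List(Word(spark,5), Word(scala,5), Word(rdd,3))
    println(short)     // List(Word(rdd,3))
  }
}
```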
- Session 2: Spark Overview
- What is Spark?
- Spark Ecosystem and Use Cases
- Spark vs. Hadoop
- Installing Spark
- Using the Spark Shell & SparkContext
- Session 3: RDDs and Spark Internals
- RDD Concepts & Lifecycle
- Lazy Evaluation & Partitioning
- Transformations: map(), filter(), etc.
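Because map() and filter() on RDDs mirror the same methods on ordinary Scala collections, the transformation style can be sketched without a running cluster. The data below is an invented example; an RDD version would build the dataset with sc.parallelize(...) instead of a List:

```scala
// Sketch of the map()/filter() transformation style from Session 3.
// Shown on a plain Scala List so it runs without a Spark cluster; on an
// RDD the calls look the same but are evaluated lazily across partitions
// and only executed when an action (collect(), count(), ...) is called --
// the lazy evaluation topic above.
object TransformSketch {
  val lines = List("error: disk full", "info: started", "error: timeout")

  // filter() keeps only matching elements; map() transforms each element
  val errorMessages = lines
    .filter(_.startsWith("error"))
    .map(_.stripPrefix("error: "))
}
```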
- Session 4: DataFrames, DataSets, and Spark SQL
- Introduction to Spark SQL
- Creating and Managing DataFrames/DataSets
- Supported Formats (JSON, CSV, Parquet, etc.)
- Querying with DSL and SQL
- Working with Typed APIs (flatMap(), explode(), split())
- Comparing RDDs, DataFrames, and DataSets
- Session 5: Shuffles and Query Optimization
- Grouping, Reducing, Joining
- Narrow vs. Wide Dependencies
- Catalyst Optimizer: explain(), query plans
- Tungsten: binary formats, code generation
- Session 6: Performance Tuning
- Caching Strategies
- Minimizing Shuffle Overhead
- Using Broadcast Variables and Accumulators
- General Best Practices
- Session 7: Standalone Spark Applications
- SparkSession and Configuration
- Building Apps with sbt and spark-submit
- Application Lifecycle: Driver, Executors, Tasks
- Cluster Managers (Standalone, YARN, Mesos)
- Logging and Debugging
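A minimal sbt setup for a standalone application like those built in Session 7 might look like the following sketch. The project name and version numbers are illustrative assumptions, not prescribed by the course:

```scala
// build.sbt -- minimal sketch; name and versions are illustrative
name := "spark-example-app"
scalaVersion := "2.12.18"

// Marked "provided" because spark-submit supplies the Spark runtime
// on the cluster, so it should not be bundled into the application jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
```

The packaged jar would then be launched with spark-submit, passing the main class via `--class` and a cluster manager URL (or `local[*]` for local testing) via `--master`, as covered in the session.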
- Session 8: Structured Streaming with Kafka
- Introduction to Spark Structured Streaming
- Table-based Streaming Model
- Setting Up Streaming Pipelines
- Connecting to Kafka as a Source
- Consuming and Processing Kafka Streams
- Performance Considerations in Streaming