Introduction to Apache Spark for Data Science | Analyzing Big Data with Spark

Learn to Use Spark to Build Unified Big Data Applications Combining Batch, Streaming, and Interactive Analytics

TTSK7513

Introductory and Beyond

3 Days

Course Overview

Apache Spark is a powerful, open-source processing engine for data in the Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

Apache Spark for Data Science is a three-day, hands-on course geared for technical business professionals who wish to solve real-world, data-related problems using Apache Spark. This course explores using Apache Spark for common data-related activities. Students will learn to build unified big data applications combining batch, streaming, and interactive analytics on all their data.

NOTE: The hands-on treatment and focus in this course are geared towards the data science aspects of Spark and related tools. Students who want a more developer-oriented edition of this course should consider the TTSK7503 Spark Developer | Spark for Big Data, Hadoop & Machine Learning course, which aligns in subject coverage but is geared for developers instead of data scientists.

Course Objectives

This course is approximately 50% hands-on, combining expert lecture, real-world demonstrations, and group discussions with machine-based practical labs and exercises. Working in a hands-on learning environment led by our expert practitioner, students will explore:

  • Spark Essentials
  • DataFrames
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Streaming with Kafka
  • Data Flow with NiFi
  • Spark GraphX
  • Performance and Tuning
  • Cluster Mode
  • Spark - the Big Picture

Need different skills or topics? If your team requires different topics or tools, additional skills, or a custom approach, this course can be easily adjusted to accommodate. We offer additional related Spark, Hadoop, data science, programming, and development courses, which may be blended with this course for a track that best suits your development objectives. Our team will collaborate with you to understand your needs and will target the course to focus on your specific learning objectives and goals.

Course Prerequisites

This is an introductory-level-and-beyond course. Typical attendees include systems administrators, testers, and those in technical data-related roles who need to learn to use Spark for data analysis or data processing.

Attending students should have the following background:

  • Basic knowledge of Python Programming (or students who know R and can pick up Python easily)
  • Basic prior exposure to Java syntax (those without that background can copy and paste the labs)
  • Introduction to SQL (familiarity with SQL basics)
  • Basic knowledge of statistics, probability, and data science

Please see the Related Courses tab for specific Pre-Requisite courses, Related Courses that offer similar skills or topics, and next-step Learning Path recommendations.

Course Agenda

Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.

Getting Started

  • Our Data and our problem set
  • Accessing the cluster, the data, and the tools
  • The Continuous Workshop approach
  • "Let's build a model together"
  • Focus on analysis, exploration, data munging, algorithms
  • Tooling and fundamentals as necessary to get the job done

Spark Overview

  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • MLlib Overview
  • Accessing HDFS data
  • Lab Focus
  • Working with HDFS data
  • Distributed vs. Local Run Modes
  • Spark vs. Other tools (when is Spark the right tool for the job?)
  • Spark vs. SAS
  • Spark Languages (Java, R, Python, and Scala)
  • Hello, Spark

Spark Essentials

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • Lab
  • MLlib
  • Spark Streaming
  • Spark API

DataFrames

  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • Adding variables to a DataFrame
  • DataFrame Types
  • DataFrame Operations
  • Dependent vs. Independent variables
  • Map/Reduce with DataFrames

Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • Queries

Spark MLlib

  • MLlib overview
  • MLlib Algorithms Overview
  • Classification Algorithms
  • Regression Algorithms
  • Lab Focus
  • Brief Comparison to SAS
  • Here's your split, how to tune regression
  • Decision Trees and forests
  • Lab Focus
  • Brief Comparison to SAS
  • Stepwise approach to Decision Trees
  • Working with Exit Criteria
  • Recommendation with ALS
  • Clustering Algorithms
  • Lab Focus
  • Key Clustering Algorithms
  • Choosing Clustering Algorithms
  • Working with key algorithms
  • Machine Learning Pipelines
  • Linear Algebra (SVD, PCA)
  • Statistics in MLlib

Spark Streaming

  • Streaming overview
  • Real-time data ingestion
  • State
  • Window Operations

Streaming with Kafka

  • Kafka overview
  • Kafka and Spark Streaming

Data Flow with NiFi

  • Apache NiFi overview
  • NiFi data flows with Spark/R

Spark GraphX

  • GraphX overview
  • ETL with GraphX
  • Graph computation

Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory Management

Cluster Mode

  • Standalone Cluster
  • Masters and Workers
  • Configurations
  • Working with large data sets

Spark - the Big Picture

  • Spark in Real-Time and near-Real-Time Decision Support Systems
  • Spark in the Enterprise
  • Best Practices

Course Materials

Our course materials include more than a simple slideshow presentation handout. Each student will receive a comprehensive course Student Guide, complete with detailed course notes, code samples, software tutorials, diagrams, and related reference materials and links. Our courses also include our detailed Student Workbook, with step-by-step hands-on lab instructions, project files (as necessary), and solutions, clearly illustrated for users to complete hands-on work in class and to revisit to review or refresh skills at any time. Students will also receive the course setup files, project files (or code, if applicable), and solutions required for the hands-on work.

Attend a Course

Upcoming open-enrollment course dates are posted below. Register online, or call 844-475-4559 toll-free to connect with our Registrar for assistance. If you need additional date options, please contact us for scheduling.

Course Title                                                                   Days    Date             Time                      Price
Introduction to Apache Spark for Data Science | Analyzing Big Data with Spark  3 Days  Aug 30 to Sep 1  10:00 AM to 06:00 PM EST  $2,395.00
Introduction to Apache Spark for Data Science | Analyzing Big Data with Spark  3 Days  Oct 4 to Oct 6   10:00 AM to 06:00 PM EST  $2,395.00
Introduction to Apache Spark for Data Science | Analyzing Big Data with Spark  3 Days  Nov 29 to Dec 1  10:00 AM to 06:00 PM EST  $2,395.00