Introduction to Apache Spark for R Experienced Data Scientists and Analysts

Getting Started with Spark for Experienced Data Scientists Already Working with R

TTSK7516

Introductory

2 Days

Course Overview

Spark is a highly optimized Data Science environment running on Hadoop YARN, with support for Machine Learning through MLlib and Mahout, SQL, DataFrames, and Streaming. In this course, Data Scientists dive into the details of practical data science on the Spark platform, including real-world interaction with other systems in modern Data Science environments.

Quick Start to Spark for R Experienced Data Scientists & Analysts is intended for existing Data Scientists already fluent in data science techniques in other languages such as SAS and already comfortable with R. This course is presented in a "rolling lab" format: a continuous workshop of real-world data exploration involving real-world problems. As such, problems and opportunities will be explored as the data suggests and as questions arise. "Lecture" material will be provided only as necessary to explain the background of the approach being used at the moment. Times and ordering of the material are highly flexible and should be treated only as estimates. Student questions and requests will also significantly alter the direction of the workshop.

The objective of the course is to practically transition these data scientists to the R/Spark/Hadoop environment, so that they become comfortable with the tools and machine learning libraries and can conduct the statistical and machine learning analyses they have already been performing in SAS or similar environments.

Course Objectives

This course is approximately 50% hands-on, combining expert lecture, real-world demonstrations and group discussions with machine-based practical labs and exercises. The objective of the course is to practically transition these data scientists to the R/Spark/Hadoop environment, so that they become comfortable with the tools and machine learning libraries and can conduct the statistical and machine learning analyses they have already been performing in SAS or similar environments.

Course Prerequisites

This course is intended for existing Data Scientists already fluent in data science techniques in other languages such as SAS and already comfortable with R.  

Please see the Related Courses tab for specific Pre-Requisite courses, Related Courses or Follow On training options. Our team will be happy to help you with recommendations for next steps in your Learning Journey.

Course Agenda

Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We’ll work with you to tune this course and level of coverage to target the skills you need most.

Getting Started - Overview
  • Our Data and our problem set
  • Accessing the cluster, the data, and the tools
  • The Continuous Workshop approach
  • "Let's build a model together"
  • Focus on analysis, exploration, data munging, algorithms
  • Tooling and fundamentals as necessary to get the job done

Spark Overview

  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • MLlib Overview
  • Accessing HDFS data
  • Lab Focus
  • Working with HDFS data
  • Distributed vs. Local Run Modes
  • Spark vs. Other tools (when is Spark the right tool for the job?)
  • Spark vs. SAS
  • Spark Languages (Java, R, Python, and Scala)
  • Hello, Spark

Spark Overview

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • Lab
  • MLlib
  • Spark Streaming
  • Spark API

DataFrames

  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • Adding variables to a DataFrame
  • DataFrame Types
  • DataFrame Operations
  • Dependent vs. Independent variables
  • Map/Reduce with DataFrames

Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • Queries

Spark MLlib

  • MLlib overview
  • MLlib Algorithms Overview
  • Classification Algorithms
  • Regression Algorithms
  • Lab Focus
  • Brief Comparison to SAS
  • Here's your split, how to tune regression
  • Decision Trees and forests
  • Lab Focus
  • Brief Comparison to SAS
  • Stepwise approach to Decision Trees
  • Working with Exit Criteria
  • Recommendation with ALS
  • Clustering Algorithms
  • Lab Focus
  • Key Clustering Algorithms
  • Choosing Clustering Algorithms
  • Working with key algorithms
  • Machine Learning Pipelines
  • Linear Algebra (SVD, PCA)
  • Statistics in MLlib

Spark Streaming

  • Streaming overview

Streaming with Kafka

  • Kafka overview
  • Kafka and Spark Streaming

Data Flow with NiFi

  • Apache NiFi overview
  • NiFi data flows with Spark/R

Cluster Mode

  • Standalone Cluster
  • Masters and Workers

Spark - the Big Picture

  • Spark in Real-Time and near-Real-Time Decision Support Systems
  • Spark in the Enterprise
  • Best Practices

Course Materials

Student Materials: Each participant will receive a Student Guide with course notes, code samples, software tutorials, step-by-step written lab instructions, diagrams and related reference materials and resource links. Students will also receive the project files (or code, if applicable) and solutions required for the hands-on work.

Hands-On Setup Made Simple! Our dedicated tech team will work with you to ensure our ‘easy-access’ cloud-based course environment is accessible, fully-tested and verified as ready to go well in advance of the course start date, ensuring a smooth start to class and effective learning experience for all participants. Please inquire for details and options.
