Hadoop Developer Foundation | Explore Hadoop, HDFS, Hive, Yarn, Spark and More

Learn the Modern Skills & Tools Required to Process Large Data Streams in the Hadoop Ecosystem

TTDS6509

Intermediate

4 Days

Course Overview

Apache Hadoop is the established framework for processing Big Data, and Spark is a newer in-memory processing engine that complements it.

Hadoop Developer Foundation | Working with Hadoop, HDFS, Hive, Yarn, Spark and More is a lab-intensive, hands-on Hadoop course that explores processing large data streams in the Hadoop Ecosystem. Working in a hands-on learning environment, students will learn techniques and tools for ingesting, transforming, and exporting data to and from the Hadoop Ecosystem, as well as processing data using MapReduce and other critical tools, including Hive and Pig. Towards the end of the course, we’ll introduce other useful tools such as Spark and Oozie and discuss essential security in the ecosystem.

NOTE: This course agenda can be adjusted to add review and discussion of specific exams and certifications your team is pursuing. We’ll collaborate with your organization to tune the agenda as needed to accommodate additional prep topics and review.

Course Objectives

This “skills-centric” course is about 50% hands-on lab and 50% lecture, designed to train attendees in core big data/ Spark development and use skills, coupling the most current, effective techniques with the soundest industry practices. Throughout the course students will be led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.

Working in a hands-on learning environment led by our expert Hadoop team, students will explore:

  • Introduction to Hadoop
  • HDFS
  • YARN
  • Data Ingestion
  • HBase
  • Oozie
  • Working with Hive
  • Hive (Advanced)
  • Hive in Cloudera
  • Working with Spark
  • Spark Basics
  • Spark Shell
  • RDDs (Condensed coverage)
  • Spark Dataframes & Datasets
  • Spark SQL
  • Spark API programming
  • Spark and Hadoop
  • Machine Learning (ML / MLlib)
  • GraphX
  • Spark Streaming

Need different skills or topics? If your team requires other tools, additional skills, or a custom approach, this course can be adjusted to accommodate. We offer additional Big Data / Data Science, Hadoop, development, programming, analytics, Python/R, Spark, and related topics that may be blended with this course for a track that best suits your needs.

Course Prerequisites

This intermediate-level course is geared for experienced developers seeking to become proficient in Hadoop, Spark, and related technologies. Attendees should be experienced developers who are comfortable with programming languages, able to navigate the Linux command line, and familiar with basic Linux editors (such as vi / nano) for editing code.

In order to gain the most from this course, attending students should be:

  • Familiar with a programming language
  • Comfortable in Linux environment (be able to navigate Linux command line, edit files using vi or nano)

Please see the Related Courses tab for specific Pre-Requisite courses, Related Courses, or Follow-On training options. Our team will be happy to help you with recommendations for next steps in your Learning Journey.

Course Agenda

Please note that this list of topics is based on our standard course offering, evolved from typical industry uses and trends. We will work with you to tune this course and level of coverage to target the skills you need most. Each section below also has an accompanying hands-on lab specific to the topics and concepts in that chapter. Please inquire for additional details.

Day One

Introduction to Hadoop

  • Hadoop history, concepts
  • Ecosystem
  • Distributions
  • High-level architecture
  • Hadoop myths
  • Hadoop challenges
  • Hardware and software

HDFS

  • Design and architecture
  • Concepts (horizontal scaling, replication, data locality, rack awareness)
  • Daemons: Namenode, Secondary Namenode, Datanode
  • Communications and heart-beats
  • Data integrity
  • Read and write path
  • Namenode High Availability (HA), Federation
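
Two of the HDFS concepts above, fixed-size block splitting and rack-aware replica placement, can be sketched in a few lines of plain Python. This is an illustration of the policy, not the Namenode implementation; node and rack names are invented.

```python
BLOCK_SIZE = 128  # HDFS default is 128 MB; treated as plain units here

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(datanodes_by_rack, local_rack):
    """Default HDFS policy: 1st replica on the local rack, 2nd on a remote
    rack, 3rd on a different node of that same remote rack."""
    remote_racks = [r for r in datanodes_by_rack if r != local_rack]
    first = datanodes_by_rack[local_rack][0]
    second, third = datanodes_by_rack[remote_racks[0]][:2]
    return [first, second, third]

blocks = split_into_blocks(300)  # a 300-unit file becomes [128, 128, 44]
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(racks, "rack1")  # ["dn1", "dn3", "dn4"]
```

The placement spreads replicas across two racks while keeping write traffic between racks low, which is the trade-off the real policy makes.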

Day Two

YARN

  • YARN Concepts and architecture
  • Evolution from MapReduce to YARN

Data Ingestion

  • Flume for logs and other data ingestion into HDFS
  • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  • Copying data between clusters (distcp)
  • Using S3 as complementary to HDFS
  • Data ingestion best practices and architectures
  • Oozie for scheduling events on Hadoop
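
To make the Sqoop topic concrete, here is the general shape of an `sqoop import` invocation, assembled as a command list in Python. The flags shown (`--connect`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import options; the host, database, and path names are invented for the example.

```python
def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Build an `sqoop import` command that pulls one SQL table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,         # JDBC connection string to the RDBMS
        "--table", table,              # source table to import
        "--target-dir", target_dir,    # destination directory in HDFS
        "--num-mappers", str(mappers), # parallelism of the import job
    ]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/sales", "orders", "/data/orders")
```

In the lab the command is run directly from the shell; building it programmatically like this is just a way to show each flag's role.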

HBase

  • (Covered in brief)
  • Concepts and architecture
  • HBase vs RDBMS vs Cassandra
  • HBase Java API
  • Time series data on HBase
  • Schema design

Oozie

  • Introduction to Oozie
  • Features of Oozie
  • Oozie Workflow
  • Creating a MapReduce Workflow
  • Start, End, and Error Nodes
  • Parallel Fork and Join Nodes
  • Workflow Jobs Lifecycle
  • Workflow Notifications
  • Workflow Manager
  • Creating and Running a Workflow
  • Oozie Coordinator Sub-groups
  • Oozie Coordinator Components, Variables, and Parameters
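
The start, fork/join, and end node semantics above can be mimicked with a toy runner: forked actions proceed independently, but the join node fires only after all of them complete. Action names are invented, and real workflows are defined in XML rather than Python; this sketch only shows the ordering guarantees.

```python
def run_workflow(actions_in_fork):
    """Run start -> fork(actions) -> join -> end, returning the visit log."""
    log = ["start", "fork"]
    for action in actions_in_fork:  # in Oozie these run in parallel
        log.append(action)
    log += ["join", "end"]          # join waits for every forked action
    return log

trace = run_workflow(["import-logs", "import-orders"])
```

Whatever order the forked actions finish in, `join` always appears after both of them and `end` is always last, which is the contract the real fork/join nodes enforce.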

Day Three

Working with Hive

  • Architecture and design
  • Data types
  • SQL support in Hive
  • Creating Hive tables and querying
  • Partitions
  • Joins
  • Text processing
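
Hive partitioning, one of the topics above, maps each partition-column value to its own warehouse directory, so queries filtering on that column scan only the matching directories. A plain-Python sketch of the idea, with an invented table and column:

```python
def partition_paths(rows, table="sales", part_col="country"):
    """Map each row to the HDFS-style directory its Hive partition would use."""
    layout = {}
    for row in rows:
        path = f"/warehouse/{table}/{part_col}={row[part_col]}"
        layout.setdefault(path, []).append(row)
    return layout

rows = [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"},
        {"id": 3, "country": "US"}]
layout = partition_paths(rows)
# two partition directories: country=US (2 rows) and country=DE (1 row)
```

In class this is done with `PARTITIONED BY` in the table DDL; the directory-per-value layout is exactly what Hive writes under the warehouse path.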

Hive (Advanced)

  • Transformation, Aggregation
  • Working with Dates, Timestamps, and Arrays
  • Converting Strings to Date, Time, and Numbers
  • Create new Attributes, Mathematical Calculations, Windowing Functions
  • Use Character and String Functions
  • Binning and Smoothing
  • Processing JSON Data
  • Execution Engines (Tez, MR, Spark)
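
The "Binning and Smoothing" topic can be previewed in plain Python: equal-width binning of a numeric column, then smoothing each value to its bin mean. In the course this is expressed in HiveQL; the data here is made up.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def smooth_by_bin_mean(values, bins):
    """Replace each value with the mean of its bin."""
    sums, counts = {}, {}
    for v, b in zip(values, bins):
        sums[b] = sums.get(b, 0) + v
        counts[b] = counts.get(b, 0) + 1
    return [sums[b] / counts[b] for b in bins]

vals = [1, 2, 3, 10, 11, 12]
bins = equal_width_bins(vals, 2)           # -> [0, 0, 0, 1, 1, 1]
smoothed = smooth_by_bin_mean(vals, bins)  # -> [2.0, 2.0, 2.0, 11.0, 11.0, 11.0]
```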

Day Four

Hive in Cloudera (or tools of choice)

Working with Spark

Spark Basics

  • Big Data, Hadoop, Spark
  • What’s new in Spark v2
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)

Spark Shell

  • Spark web UIs
  • Analyzing dataset – part 1

RDDs (Condensed coverage)

  • RDDs concepts
  • RDD Operations / transformations
  • Labs: Unstructured data analytics using RDDs
  • Data model concepts
  • Partitions
  • Distributed processing
  • Failure handling
  • Caching and persistence
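
The classic RDD word-count lab can be mimicked in plain Python so the transformation chain (flatMap → map to pairs → reduceByKey) is visible without a cluster. In the lab itself this is written against the Spark RDD API; the sample lines are invented.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do or not to do"]

words = chain.from_iterable(l.split() for l in lines)  # ~ flatMap
pairs = ((w, 1) for w in words)                        # ~ map to (key, 1)
counts = Counter()                                     # ~ reduceByKey(add)
for w, n in pairs:
    counts[w] += n

# counts["to"] == 4, counts["be"] == 2
```

The Spark version distributes each stage across partitions, but the per-record logic is the same.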

Spark Dataframes & Datasets

  • Intro to Dataframe / Dataset
  • Programming in Dataframe / Dataset API
  • Loading structured data using Dataframes

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
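
The Spark SQL workflow, register a table, then query it with SQL, can be shown in miniature using Python's built-in sqlite3 as a stand-in for a SparkSession, so the example runs anywhere. The table and data are invented; in class the same query text would go to `spark.sql(...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, hits INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("home", 40), ("docs", 25), ("home", 10)])

# The same query shape you would hand to spark.sql(...)
rows = conn.execute(
    "SELECT page, SUM(hits) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
# -> [("docs", 25), ("home", 50)]
```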

Spark API programming (Scala and Python)

  • Introduction to the Spark API
  • Submitting the first program to Spark
  • Debugging / logging
  • Configuration properties

Spark and Hadoop

  • Hadoop Primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on YARN
  • Processing HDFS files using Spark
  • Spark & Hive

Capstone project (Optional)

  • Team design workshop
  • The class will be broken into teams
  • The teams will get a name and a task
  • They will architect a complete solution to a specific useful problem, present it, and defend the architecture based on the best practices they have learned in class

Optional Additional Topics – Please Inquire for Details

Machine Learning (ML / MLlib)

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (newer Spark2 version)
  • Algorithms: Clustering, Classifications, Recommendations
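
As a primer for the clustering topic, here is a one-dimensional k-means in plain Python: assign each point to the nearest centroid, then move each centroid to the mean of its points. The data and starting centroids are arbitrary; MLlib's `KMeans` does the same loop at scale.

```python
def kmeans_1d(points, centroids, iters=10):
    """Tiny 1-D k-means: repeat assignment and centroid-update steps."""
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:                              # assignment step
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c   # update step
                     for c, ps in clusters.items()]
    return sorted(centroids)

centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
# converges to [2.0, 11.0]
```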

GraphX

  • GraphX library overview
  • GraphX APIs

Spark Streaming

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous streaming
  • Spark & Kafka streaming
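
The sliding-window operations listed above can be previewed with a plain-Python window of the last three events, sliding by one as each event arrives. The Spark Streaming analogue is `window()` / aggregation over a DStream or structured stream; the event values here are made up.

```python
from collections import deque

def sliding_counts(events, window=3):
    """Yield the sum of the last `window` events after each arrival."""
    buf = deque(maxlen=window)  # events older than the window fall out
    out = []
    for e in events:
        buf.append(e)
        out.append(sum(buf))
    return out

window_sums = sliding_counts([1, 2, 3, 4, 5])  # -> [1, 3, 6, 9, 12]
```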

 

Course Materials

Student Materials: Each participant will receive a Student Guide with course notes, code samples, software tutorials, step-by-step written lab instructions, diagrams and related reference materials and resource links. Students will also receive the project files (or code, if applicable) and solutions required for the hands-on work.

Hands-On Setup Made Simple! Our dedicated tech team will work with you to ensure our ‘easy-access’ cloud-based course environment is accessible, fully-tested and verified as ready to go well in advance of the course start date, ensuring a smooth start to class and effective learning experience for all participants. Please inquire for details and options.

Raise the bar for advancing technology skills

Attend a Class!

Live scheduled classes are listed below, or browse our full course catalog anytime.

Special Offers

We regularly offer discounts for individuals, groups and corporate teams. Contact us

Custom Team Training

Check out custom training solutions planned around your unique needs and skills.

EveryCourse Extras

Exclusive materials, ongoing support and a free live course refresh with every class.

Mix, Match & Master!
2FOR1: Two Courses, One Price!

Enroll in *any* two public courses (for 2023 *OR* 2024 dates!) by December 31, for one price!  Learn something new, or share the promo!

Click for Details & Additional Offers

Learn. Explore. Advance!

Extend your training investment! Recorded sessions, free re-sits and after course support included with Every Course
Trivera MiniCamps
Gain the skills you need with less time in the classroom with our short course, live-online hands-on events
Trivera QuickSkills: Free Courses and Webinars
Training on us! Keep your skills current with free live events, courses & webinars
Trivera AfterCourse: Coaching and Support
Expert level after-training support to help organizations put new training skills into practice on the job
