  • Introduction to Distributed Programming
  • Introduction to MapReduce
  • How Hadoop works on Cloudera
  • Hive
  • Sqoop
  • Pig
  • HBase
  • Druid


  • This hands-on project covers all aspects of ETL on large Hadoop clusters.
  • How to optimize the performance of jobs.
  • How to deal with real-time issues.
  • Handling terabytes of data.
Apache Spark (PySpark/Scala)

What is Scala?

  • Why Scala for Spark?
  • Scala in other Frameworks
  • Introduction to Scala REPL
  • Basic Scala Operations
  • Variable Types in Scala
  • Control Structures in Scala
  • Foreach loop, Functions and Procedures
  • Collections in Scala: Array
  • ArrayBuffer, Map, Tuples, Lists, and more

Fundamentals of Scala for Spark

  • Functional Programming
  • Higher Order Functions
  • Anonymous Functions
  • Class in Scala
  • Getters and Setters
  • Custom Getters and Setters
  • Properties with only Getters
  • Auxiliary Constructor and Primary Constructor
  • Singletons
  • Extending a Class
  • Overriding Methods
  • Traits as Interfaces and Layered Traits

Apache Spark

  • Spark’s Place in Hadoop Ecosystem
  • Spark Components & Architecture
  • Spark Deployment Modes
  • Introduction to Spark Shell
  • Writing your first Spark Job Using SBT
  • Submitting Spark Job
  • Spark Web UI
  • Data Ingestion using Sqoop
  • Building and Running Spark Application
  • Spark Application Web UI
  • Configuring Spark Properties

Deep Dive into Spark Framework

  • Challenges in Existing Computing Methods
  • Probable Solution & How RDD Solves the Problem
  • What is RDD, Its Operations, Transformations & Actions
  • Data Loading and Saving Through RDDs
  • Key-Value Pair RDDs
  • Other Pair RDDs, Two Pair RDDs
  • RDD Lineage
  • RDD Persistence
  • WordCount Program Using RDD Concepts
  • RDD Partitioning & How It Helps Achieve Parallelization
  • Passing Functions to Spark
  • Loading data in RDDs
  • Saving data through RDDs
  • RDD Transformations
  • RDD Actions and Functions
  • RDD Partitions
  • WordCount through RDDs

Need for Spark SQL

  • What is Spark SQL?
  • Spark SQL Architecture
  • SQL Context in Spark SQL
  • User Defined Functions
  • Data Frames & Datasets
  • Interoperating with RDDs
  • JSON and Parquet File Formats
  • Loading Data through Different Sources
  • Spark – Hive Integration
  • Spark SQL – Creating Data Frames
  • Loading and Transforming Data through Different Sources
  • Stock Market Analysis
  • Spark-Hive Integration

Need for Kafka

  • What is Kafka?
  • Core Concepts of Kafka
  • Kafka Architecture
  • Where is Kafka Used?
  • Understanding the Components of Kafka Cluster
  • Configuring Kafka Cluster
  • Kafka Producer and Consumer Java API
  • Need for Apache Flume
  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • Integrating Apache Flume and Apache Kafka
  • Configuring Single Node Single Broker Cluster
  • Configuring Single Node Multi Broker Cluster
  • Producing and consuming messages
  • Flume Commands
  • Setting up Flume Agent
  • Streaming Twitter Data into HDFS

Drawbacks in Existing Computing Methods

  • Why Streaming is Necessary?
  • What is Spark Streaming?
  • Spark Streaming Features
  • Spark Streaming Workflow
  • How Uber Uses Streaming Data
  • Streaming Context & DStreams
  • Transformations on DStreams
  • Windowed Operators and Why They Are Useful
  • Important Windowed Operators
  • Slice, Window and ReduceByWindow Operators
  • Stateful Operators