Apache Hadoop for Developers


Apache Hadoop is an open-source, scalable, massively parallel environment for distributed storage and processing, widely used for data farms and data lakes. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

SKU: BD-011


This course prepares developers for the Apache Hadoop platform. There are two primary distributions of Hadoop: Hortonworks and Cloudera. The course is architected to allow maximum customization for your needs.

You need to make three decisions. The first is which Core Essentials track to take; this course covers Development. The remaining decisions determine the content and length of the course. The Essentials class is three days.

Next, decide whether students need a deeper dive into the Ecosystem with code-level coverage, or simply a discussion of the Ecosystem components. This decision affects two days of course length.

Finally, decide whether the students need Hortonworks, Cloudera, or both. This affects the labs but has no effect on course length.

The Administrator course deals with 3-5 deployed servers; Docker may be used to deploy these servers if you wish.

Why We Learn Hadoop:

Hadoop increases access to Big Data, makes use of existing Big Data investments, and keeps pace with growing enterprise adoption; above all, demand for Hadoop developers has risen sharply in recent years.

Course Audience:

Developers who wish to explore Data Science and Big Data

Course Duration: 3-5 days


Course Outline

Introduction

  • What is Big Data?
  • Typical Distributed Systems
  • A Short History of Hadoop
  • Who are the players?
  • Hadoop Alternatives


Hadoop Overview

  • What is Hadoop?
  • YARN
  • Key differences between 1.X and 2.X


HDFS

  • What is HDFS?
  • HDFS Architecture
  • Writing and reading files
  • Understanding Block storage
  • Nodes
  • HDFS client connections
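
The block storage model above can be sketched in a few lines. This is an illustrative simulation, not the HDFS API: it shows how a file larger than the block size is divided into fixed-size blocks (128 MB is the HDFS 2.x default).

```python
# Sketch (illustrative, not the HDFS API): how a file is split into
# fixed-size blocks, as HDFS does with a default block size of 128 MB.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS 2.x default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][1] // (1024 * 1024))   # 44
```

Note that the last block only occupies as much space as its actual data, which is why many small files (each taking a block entry in the NameNode) are more costly than a few large ones.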


MapReduce

  • MapReduce as a Pattern
  • MapReduce YARN Style
  • Tracing MapReduce Job on Hadoop 2.0
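
The map, shuffle/sort, and reduce phases can be simulated in a single process. This is a conceptual sketch of the MapReduce pattern applied to word count, not Hadoop code; a real job runs the same phases distributed across the cluster under YARN.

```python
# A minimal, single-process simulation of the MapReduce pattern
# (map -> shuffle/sort -> reduce); real jobs run these phases
# distributed across the cluster under YARN.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Mapper: emit (word, 1) for every word in every input line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_sort(pairs):
    """Shuffle: bring all values for the same key together, sorted by key."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    for key, values in grouped:
        yield (key, sum(v for _, v in values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(result["the"])  # 3
print(result["fox"])  # 2
```

The same three-phase shape underlies every MapReduce job; only the map and reduce functions change.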


Planning Your Hadoop Cluster

  • Most common cluster topology
  • Sizing considerations
  • Hardware, Software, OS, network considerations
  • File Systems – Windows, and Linux
  • Hadoop Configuration Files
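
Sizing usually starts with back-of-the-envelope arithmetic. The sketch below uses assumed rules of thumb (replication factor 3 and 25% headroom for intermediate and temporary data); real sizing should follow your vendor's guidance.

```python
# Back-of-the-envelope cluster sizing (assumed rules of thumb, not
# vendor guidance): usable data * replication factor * headroom.
import math

def raw_storage_needed(data_tb, replication=3, headroom=1.25):
    """Raw disk (TB) needed to hold data_tb of user data."""
    return data_tb * replication * headroom

def nodes_needed(data_tb, disk_per_node_tb, replication=3, headroom=1.25):
    """Minimum DataNode count given per-node disk capacity."""
    raw = raw_storage_needed(data_tb, replication, headroom)
    return math.ceil(raw / disk_per_node_tb)

# 100 TB of user data, 12 TB of disk per node:
print(raw_storage_needed(100))   # 375.0 (TB raw)
print(nodes_needed(100, 12))     # 32 (nodes)
```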


Advanced Hadoop

  • Advanced HDFS
  • Advanced MapReduce
  • Advanced YARN configurations
  • Rack Aware Clusters
  • Including and Excluding Hosts
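
The default HDFS replica-placement policy for a rack-aware cluster can be sketched as follows. This is a conceptual simulation with a hypothetical topology and node names, not the NameNode's actual implementation: replica 1 goes on the writer's node, replica 2 on a node in a different rack, and replica 3 on a different node in the same rack as replica 2.

```python
# Sketch of the default HDFS replica-placement policy for 3 replicas
# (illustrative topology; not the NameNode's implementation).
import random

def place_replicas(topology, writer_node, rng=random):
    """topology maps rack name -> list of node names."""
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer_node                                   # replica 1: local node
    other_racks = [r for r in topology if r != node_rack[first]]
    rack2 = rng.choice(other_racks)                       # replica 2: remote rack
    second = rng.choice(topology[rack2])
    third = rng.choice(                                   # replica 3: same rack
        [n for n in topology[rack2] if n != second])      # as 2, different node
    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas(topology, "n1")
# Replicas 2 and 3 always share a rack that differs from replica 1's rack,
# balancing write cost against the ability to survive a full rack failure.
```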


MapReduce Design Patterns

  • MapReduce Design Patterns Overview
  • Deep Dive into following Patterns
  • Filtering Patterns
  • Join Patterns
  • Input and Output Patterns
  • Other Patterns Overview
  • Summarization Patterns
  • Data Organization Patterns
  • MetaPatterns
  • Comparison chart of when to use which design patterns
  • Best Practices


Data Formats

  • Text vs. Binary Data Formats
  • Avro vs Parquet vs Sequence File
  • Working with Avro Tools
  • Working with Parquet Tools
  • Using Avro with MR
  • Using Avro with Hive
  • Using Parquet with Impala
  • Compression Codec
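
The motivation for binary formats can be seen with Python's `struct` module as a stand-in; real formats such as Avro and Parquet add schemas, splittability, and compression on top. The record values below are illustrative.

```python
# Why binary formats matter: the same records serialized as delimited
# text vs. a fixed-width binary layout. (struct here is only a stand-in
# for Avro/Parquet, which add schemas and compression on top.)
import struct

records = [(1000001, 123456.78), (1000002, 87654.32), (1000003, 5432.1)]

# Text: variable-width, must be parsed on every read
text = "\n".join(f"{rid},{value}" for rid, value in records).encode()

# Binary: fixed 8 bytes per record (4-byte int + 4-byte float)
binary = b"".join(struct.pack("<if", rid, value) for rid, value in records)

print(len(text), len(binary))  # binary is fixed-width and smaller here
```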


Hadoop Ecosystem

  • The Hadoop Ecosystem
  • Ecosystem components
  • The Major Vendors
  • Use Cases of Ecosystem Components


Sqoop

  • The Sqoop Import Tool
  • Importing & Exporting data with Sqoop
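
Conceptually, Sqoop parallelizes an import by taking the minimum and maximum of the `--split-by` column and dividing that range evenly among the mappers, each of which issues its own bounded query. A sketch of that split computation (illustrative, not Sqoop's actual code):

```python
# How Sqoop parallelizes an import (conceptual sketch): divide the
# [min, max] range of the --split-by column evenly among the mappers.
def compute_splits(min_id, max_id, num_mappers):
    """Return (low, high) inclusive ranges, one per mapper."""
    size = (max_id - min_id + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        low = round(min_id + i * size)
        high = round(min_id + (i + 1) * size) - 1
        splits.append((low, high))
    return splits

print(compute_splits(1, 100, 4))
# [(1, 25), (26, 50), (51, 75), (76, 100)]
```

This is also why a skewed or sparse split column leads to unbalanced mappers: the ranges are equal, but the row counts inside them may not be.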


Flume

  • Flume Sources, Sinks and Channels
  • Flume Interceptors
  • Flume configuration
  • Monitoring Flume
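
Flume's pipeline can be sketched conceptually as source, then interceptors, then channel, then sink. The classes below are illustrative stand-ins, not the Flume Java API.

```python
# Conceptual sketch of Flume's source -> interceptor -> channel -> sink
# pipeline (illustrative classes, not the Flume Java API).
from collections import deque
import time

class TimestampInterceptor:
    """Stamps each event's headers, as Flume's timestamp interceptor does."""
    def intercept(self, event):
        event["headers"]["timestamp"] = int(time.time() * 1000)
        return event

class MemoryChannel:
    """Buffers events between source and sink, like Flume's memory channel."""
    def __init__(self):
        self._queue = deque()
    def put(self, event):
        self._queue.append(event)
    def take(self):
        return self._queue.popleft() if self._queue else None

channel = MemoryChannel()
interceptor = TimestampInterceptor()

# Source side: receive a raw event, run interceptors, put it on the channel
raw = {"headers": {}, "body": b"log line"}
channel.put(interceptor.intercept(raw))

# Sink side: drain the channel and deliver (here, just collect)
delivered = []
event = channel.take()
while event is not None:
    delivered.append(event)
    event = channel.take()

print(len(delivered), "timestamp" in delivered[0]["headers"])  # 1 True
```

The channel is the durability boundary: a memory channel is fast but loses events on failure, while a file channel survives agent restarts.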


Hive &amp; Pig

  • Introduction to Hive &amp; Pig
  • Comparing Hive with RDBMS
  • Hive & Pig Components & Metastore
  • HiveServer2
  • Hive Command Line Interface
  • Defining Hive Tables
  • Loading Data into Hive & Performing Queries
  • Hive Security
  • Pig Tables & Syntax
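
One reason Hive tables can be queried selectively at scale is partitioning: each partition is a directory under the table's warehouse path, and a WHERE clause on the partition column prunes whole directories before any data is read. A sketch with illustrative paths (not an actual warehouse layout):

```python
# Sketch of Hive's partitioned-table layout on HDFS and why a WHERE
# clause on the partition column prunes whole directories
# (illustrative paths, not an actual warehouse layout).
table_root = "/user/hive/warehouse/sales"
partitions = ["dt=2024-01-01", "dt=2024-01-02", "dt=2024-01-03"]
paths = [f"{table_root}/{p}" for p in partitions]

# A query such as  SELECT * FROM sales WHERE dt = '2024-01-02'
# only needs to scan the matching partition directory:
wanted = "2024-01-02"
scanned = [p for p in paths if p.endswith(f"dt={wanted}")]
print(scanned)  # ['/user/hive/warehouse/sales/dt=2024-01-02']
```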


Oozie

  • Introduction to Oozie
  • Architecture
  • Administration of Oozie
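
Conceptually, an Oozie workflow is a DAG of actions in which each action runs only after the actions it depends on succeed. The sketch below uses hypothetical action names; real workflows are defined in `workflow.xml`.

```python
# Conceptual sketch of an Oozie workflow: actions form a DAG and each
# action runs only after its dependencies succeed (hypothetical action
# names; real workflows are defined in workflow.xml).
workflow = {
    "sqoop-import": [],
    "pig-cleanse": ["sqoop-import"],
    "hive-load": ["pig-cleanse"],
    "end": ["hive-load"],
}

def run_order(dag):
    """Topologically order the actions so dependencies run first."""
    done, order = set(), []
    def visit(action):
        for dep in dag[action]:
            if dep not in done:
                visit(dep)
        if action not in done:
            done.add(action)
            order.append(action)
    for action in dag:
        visit(action)
    return order

print(run_order(workflow))
# ['sqoop-import', 'pig-cleanse', 'hive-load', 'end']
```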


NFS Gateway

  • HDFS NFS Gateway
  • Configuring NFS Gateway
  • Using NFS Gateway Services


Data Movement

  • Data Movement with Hadoop
  • ETL using Hadoop
  • Ingesting data with Hadoop
  • Using Hue
  • Distributed Copy
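
Distributed Copy (DistCp) conceptually turns a copy job into a list of per-file tasks executed in parallel. The sketch below uses local threads and temp files as stand-ins; real DistCp runs MapReduce map tasks copying between HDFS clusters.

```python
# Sketch of what DistCp does conceptually: turn a copy job into
# per-file tasks and run them in parallel (threads here; real DistCp
# uses MapReduce map tasks across the cluster).
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(files, dst_dir, workers=4):
    """Copy each file into dst_dir, using a pool of workers."""
    os.makedirs(dst_dir, exist_ok=True)
    def copy_one(src):
        dst = os.path.join(dst_dir, os.path.basename(src))
        shutil.copyfile(src, dst)
        return dst
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, files))

# Demo with local temp files standing in for HDFS paths
src_dir = tempfile.mkdtemp()
files = []
for i in range(3):
    path = os.path.join(src_dir, f"part-{i}")
    with open(path, "w") as f:
        f.write(f"data {i}")
    files.append(path)

copied = parallel_copy(files, tempfile.mkdtemp())
print(len(copied))  # 3
```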

Additional information

Course Duration

5 Days



Lab Count

20 Labs

