Big Data/Hadoop - Corporate Training

Drawing on our wealth of experience in Big Data technologies, we provide comprehensive technical training for motivated individuals and corporations. We offer courses in Scala, Python, and Hadoop, with hands-on programming and implementation practice on popular cloud platforms such as AWS. Along the way, students receive assistance in preparing for the job search.

  • Courses include:
    • Introduction to Python and Scala
    • Big Data with Hadoop
    • Big Data with Spark
  • We offer training for a variety of certifications, including:
    • “HDP Certified Administrator (HDPCA)” certification
    • “HDP Certified Developer (HDPCD)” certification
    • “HDP Certified Spark Developer (HDPCD:SPARK)” certification

Hadoop Training

Curriculum (80 Hours)

  • Hadoop Architecture
  • Build a production-like Hadoop cluster in the Amazon EC2 cloud – focus on the Apache distribution
  • YARN
  • HIVE
  • SQOOP
  • PIG
  • Excel and HIVE Integration
  • HBASE
  • PIG and HIVE integration with HBASE
  • Hands-on training on real projects
  • Preparation for certification exam (HDPCD)
  • SPARK SQL – Building analytics on HDFS, HIVE and HBASE

What is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
This course enables participants to build complete, unified Big Data applications combining batch and interactive analytics on the Hadoop platform.
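
Production MapReduce jobs are usually written in Java, but to give a flavor of Hadoop's "simple programming models", here is a minimal WordCount sketch in Scala against the standard org.apache.hadoop.mapreduce API. This is a hedged illustration, not course material: Scala 2.13 is assumed for CollectionConverters, and the input and output HDFS paths come from the command line.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.jdk.CollectionConverters._

    // Map phase: emit (word, 1) for every word in the input split.
    class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: Object, value: Text,
                       ctx: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w)
          ctx.write(word, one)
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
        ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenizerMapper])
        job.setMapperClass(classOf[TokenizerMapper])
        job.setReducerClass(classOf[IntSumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))    // e.g. an HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args(1)))  // must not already exist
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }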

Hadoop Advanced Training

Curriculum (40 Hours)

  • End-to-end Integration
    • Build a production-like Hadoop cluster in the Amazon EC2 cloud – focus on the Apache distribution
    • YARN
    • HIVE
    • SQOOP
    • PIG
  • Design and build project
  • Prerequisites:
    • Strong Java or Scala skills
    • ambariCloud Hadoop Solution Architecture Class
    • MapReduce
    • Eclipse IDE

What is Lambda Architecture?

Batch workflows are too slow; the data in a batch view is stale and cannot be used right now. These are common concerns among architects who have implemented Hadoop.
Organizations want (near) real-time data so they can make accurate decisions on massive amounts of data and gain a competitive advantage. The era of batch-only processing is over!
The Lambda Architecture is an approach to building stream processing applications on top of Hadoop using Spark, Kafka and NiFi.
In this advanced course you will take a deep dive into real-time data ingestion tools such as Flume, Spark, Kafka, and HBase, and into the end-to-end integration of these tools.
  • Learn about tool selection
  • Learn about integration
  • Design and build an end-to-end project
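
To give a flavor of how the speed and batch layers meet, here is a minimal, hedged sketch in Scala using Spark Structured Streaming: counts arriving on a Kafka topic are folded into a pre-computed batch view on each micro-batch. The broker address, topic, and table names are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object LambdaSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("lambda-sketch").enableHiveSupport().getOrCreate()
        import spark.implicits._

        // Batch layer: a pre-computed view, assumed to be a Hive table
        // named events_daily_agg with columns (event, cnt). Hypothetical.
        val batchView = spark.table("events_daily_agg")

        // Speed layer: fresh events arriving on a Kafka topic.
        val fresh = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .select($"value".cast("string").as("event"))

        // Serving layer: on each micro-batch, merge real-time counts into the batch view.
        val query = fresh.writeStream
          .option("checkpointLocation", "/tmp/lambda-ckpt")
          .foreachBatch { (micro: DataFrame, _: Long) =>
            val realtime = micro.groupBy("event").agg(count("*").as("cnt"))
            batchView.unionByName(realtime)
              .groupBy("event").agg(sum("cnt").as("cnt"))
              .write.mode("overwrite").saveAsTable("events_merged_view")
          }
          .start()

        query.awaitTermination()
      }
    }

In a full Lambda deployment, NiFi or Flume would feed the Kafka topic and the merged view would be served from a low-latency store such as HBase; the course covers that end-to-end integration.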

Spark Training

Curriculum (50 Hours)

  • Spark Architecture
  • Spark SQL, Zeppelin
  • Spark Streaming
    • Kafka
  • Unified data view from HBASE, MYSQL, Oracle, HDFS, HIVE
  • Spark Data Frame, Dataset
  • Intro to NIFI
  • Preparation for certification exam
  • AWS AMI
    • AMI setup
    • Spark, Kafka, HBase, Hadoop, NiFi, Zeppelin
  • PyCharm setup
    • Python basics
    • How to build an application and run it in the AMI
  • Intro to Spark
    • Why Spark
  • Scala programming – Spark
    • Employ functional programming practices
    • Explain the purpose and function of RDDs
    • Perform Spark transformations and actions
      • map, flatMap
      • DataFrame
      • DataSet
  • Explore and manipulate data using a Spark REPL
    • Spark Shell
  • Explore and manipulate data using Zeppelin
    • Interpreters
    • Run SQLs and build reports
  • Hands on Practice
    • Read: local and HDFS files
      • JSON
      • CSV
      • Parquet files
      • XML
    • DataFrame
      • API
      • Window functions
      • Pivoting
    • DataFrame from HIVE tables
    • Data Frame from JDBC
      • Join Hive, JDBC tables and HDFS files
    • Store data
      • Parquet, JSON, JDBC, HBASE, etc.
  • HBase
    • Read and store HBase data from Spark
      • Join multiple HBase tables in Spark
      • Join tables from many sources (HBase, Hive, HDFS, MySQL)
  • Spark infrastructure architecture
    • Node architecture – driver and workers
  • Kafka and Spark streaming
    • Kafka architecture
    • NiFi to load data into Kafka
    • Read data from Kafka into Spark
      • Store stream data to HBase, Hadoop, MySQL
    • Lambda architecture
      • Combine batch and stream data
    • Streaming data from Raspberry Pi to HBase, Elasticsearch, and MySQL via NiFi, Kafka, and Spark
  • Integration with Elastic Search (ES)
    • Streaming data from NiFi => Kafka => Spark => ES
  • Introduction to Spark Machine Learning (a minimal pipeline sketch follows this list)
    • Spark's new ML library – the Pipeline model
      • Label, features
    • Basic stats
      • Mean, max, mode, correlation
    • Linear regression
    • K-means
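
Here is a minimal sketch of the spark.ml Pipeline model referenced above, assuming nothing beyond a local Spark installation: a VectorAssembler builds the "features" vector and a LinearRegression stage fits it, with toy data invented purely for illustration.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("lr-pipeline").master("local[*]").getOrCreate()
        import spark.implicits._

        // Toy training data: two features and a label, invented for illustration.
        val training = Seq(
          (1.0, 2.0, 5.0),
          (2.0, 1.0, 4.0),
          (3.0, 4.0, 11.0),
          (4.0, 3.0, 10.0)
        ).toDF("x1", "x2", "label")

        // Assemble the raw columns into the single "features" vector spark.ml expects.
        val assembler = new VectorAssembler()
          .setInputCols(Array("x1", "x2"))
          .setOutputCol("features")

        val lr = new LinearRegression()

        // Pipeline model: feature preparation and estimator fitted as one unit.
        val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
        model.transform(training).select("features", "label", "prediction").show()

        spark.stop()
      }
    }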

What is Apache Spark?

Apache Spark™ is a lightning-fast, open-source cluster computing engine that processes huge amounts of data in a short time. Spark is optimized for speed, ease of use, and advanced analytics.
The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional “Hadoop MapReduce” programs.
This course enables participants to build complete, unified Big Data applications combining batch, streaming, and interactive analytics on the Spark platform.
With Spark, developers can write sophisticated applications for faster business decisions and better user outcomes, applied to a wide variety of use cases, architectures, and industries.
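
As a small, hedged illustration of the interactive side, the sketch below loads a JSON file into a DataFrame and queries it with Spark SQL. The file name and columns (orders.json, customer_id, amount) are invented for the example; the same query could just as well be run from a Zeppelin notebook.

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("sql-sketch").master("local[*]").getOrCreate()

        // Load semi-structured data into a DataFrame (hypothetical input file).
        val orders = spark.read.json("orders.json")

        // Register a temp view so plain SQL works against it.
        orders.createOrReplaceTempView("orders")
        spark.sql(
          """SELECT customer_id, SUM(amount) AS total
            |FROM orders
            |GROUP BY customer_id
            |ORDER BY total DESC
            |LIMIT 10""".stripMargin).show()

        spark.stop()
      }
    }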

Falcon and Oozie Training

Curriculum (15 Hours)

  • Falcon
  • Oozie
  • Need experience in:
    • ambariCloud Hadoop Solution Architecture Class
    • PIG
    • Hive
    • XML

What is Falcon & Oozie?

Apache Falcon is a data governance engine that defines, schedules, and monitors data management policies. Falcon allows Hadoop administrators to centrally define their data pipelines, and then Falcon uses those definitions to auto-generate workflows in Apache Oozie.
As the demand for big data continues to grow, big data governance has become a critical issue for most organizations.
Big data often deals with sensitive personal information and confidential enterprise records, and big data governance to ensure the security of this information is paramount.
Superior governance can enable organizations to avoid the costs associated with low quality data re-work, and to provide big data reporting in compliance with government regulations like Sarbanes-Oxley, HIPAA, and Basel II/Basel III.

In this training, developers learn about data replication, job scheduling, and more.

Spark and Data Scientist Type B

Curriculum

  • Python
    • Python modules, Classes, Functional Programming
  • Advanced Spark
    • Spark SQL, Zeppelin
    • Spark Streaming
    • Spark Data Frame
  • Machine Learning
    • Linear Regression
    • K-Means
    • Classification
    • Recommendation
  • Prerequisites
    • Basic Python
    • Hive, SQL, Linux

Spark and Data Scientist Type B

Spark Engineer: Spark™ is a lightning-fast, open-source cluster computing engine that processes huge amounts of data in a short time. Spark is optimized for speed, ease of use, and advanced analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional “Hadoop MapReduce” programs.
Type B Data Scientist: The B in Type B Data Scientist refers to building models. Type B Data Scientists predict the unknown by asking questions from different perspectives of the business, writing complex algorithms, and developing statistical models on structured/unstructured data in the Big Data domain.
Some of the important skills and tools for Type B Data Scientists include expertise in Python/Scala, Hadoop, Data Analysis, NoSQL, Machine Learning, and Software Development.
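
As a taste of that model-building work, here is a hedged K-means sketch using Spark's ML library. The 2-D points are invented for illustration; a real project would load structured/unstructured data from HDFS or Hive instead.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("kmeans-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Two obvious clusters of 2-D points, invented for illustration.
        val points = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.2), (9.1, 8.9)).toDF("x", "y")

        // KMeans expects a single vector column named "features" by default.
        val features = new VectorAssembler()
          .setInputCols(Array("x", "y"))
          .setOutputCol("features")
          .transform(points)

        // Fit k = 2 clusters and show which cluster each point lands in.
        val model = new KMeans().setK(2).setSeed(1L).fit(features)
        model.transform(features).show()

        spark.stop()
      }
    }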

NiFi Training

Curriculum (16 Hours):

  • NiFi Architecture
  • Develop data migration Processors
  • Convert migration process to Templates
  • IoT – MiNiFi
  • Prerequisites:
    • ambariCloud Hadoop Solution Architecture Class

What is Apache NiFi?

Apache NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflow.

The NiFi data-flow orchestration tool, drafted in as part of the NSA's duty "to respond to foreign-intelligence requirements", now finds itself on the front line of “Internet of Things” technology, according to Hortonworks CTO Scott Gnau.

"Instead of a one-way traditional streaming or data flow, it's bidirectional and point to point. That's a really big difference technologically and from a requirements perspective."

In this training, developers learn simplified batch/real-time data ingestion into Hadoop from internal sources such as Oracle or HANA, and from external sources such as Twitter.

Kerberos Training

Curriculum (16 Hours):

  • Hadoop Security
  • Kerberos 101
  • KDC Server/Client Installation and Configuration
  • Kerberos Encryption types
  • Kerberos Operations
  • Kerberos Troubleshooting
  • Kerberos setup in Hadoop
  • Kerberos configuration in the Hadoop ecosystem (Hive/Pig/Oozie)
  • Prerequisites:
    • ambariCloud Hadoop Solution Architecture Class

Kerberos for Hadoop

Security has become a real concern for critical enterprise Hadoop IT services.
Kerberos is the de facto standard for Hadoop authentication, and strong Kerberos skills are required to manage secured Hadoop projects.
Through instructor-led discussion and interactive, hands-on exercises, participants will navigate Hadoop authentication security with Kerberos.
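
As one concrete touchpoint, a client application authenticates to a Kerberized cluster through Hadoop's UserGroupInformation API. The sketch below is a minimal Scala example; the principal and keytab path are hypothetical and would come from your realm's administrator.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    object KerberosLoginSketch {
      def main(args: Array[String]): Unit = {
        // Tell the Hadoop client libraries that the cluster requires Kerberos.
        val conf = new Configuration()
        conf.set("hadoop.security.authentication", "kerberos")
        UserGroupInformation.setConfiguration(conf)

        // Principal and keytab path are hypothetical; substitute your realm's values.
        UserGroupInformation.loginUserFromKeytab(
          "etl-svc@EXAMPLE.COM",
          "/etc/security/keytabs/etl-svc.keytab")

        // HDFS/Hive/HBase client calls made after this point carry the ticket.
        println(s"Logged in as: ${UserGroupInformation.getCurrentUser}")
      }
    }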