Hadoop
Questions
What are the two components of Hadoop ?
- HDFS
- MapReduce
What are the two types of nodes ?
- Master Node
- Slave Node
What is the role of the HDFS master ?
- Partition data
- Keep track of block positions
What is the role of the MapReduce master ? Schedule work
What are the HDFS daemons ?
- Name node
- Data node
- Secondary Name node
How many name nodes per cluster ? 1
How many data nodes per cluster ? Multiple
HDFS breaks large data into smaller pieces called ? Blocks
What is the default block size ? 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later)
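
To make the block arithmetic concrete, here is a minimal sketch that splits a file size into block sizes. It assumes the 64 MB figure quoted above; the function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Return the sizes of the blocks a file of the given size would be cut into.

    Hypothetical helper: it only models the arithmetic, not real HDFS behaviour.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(200 * MB)
print(len(blocks), [b // MB for b in blocks])  # prints: 4 [64, 64, 64, 8]
```

A 200 MB file therefore needs three full blocks plus one 8 MB block.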
What is the identifier used by the NN for a rack called ? Rack ID
What is a rack ? Set of data nodes in a cluster
What is the primary job of a name node ? Managing the File System Namespace
What is a File System Namespace ? Collection of files in a cluster
What is contained in an FsImage ?
- Mapping of block to file
- File metadata
What is the replication factor ? The number of times a file has to be stored in HDFS, that is, the number of copies kept of each of its blocks
Where is the replication factor stored ? Name node
What are the two files used by a NN ?
- EditLog
- FsImage
What happens when NN starts ?
- It reads the FsImage and EditLog from local disk and applies all transactions from the EditLog to the in-memory representation of the FsImage.
- Then it flushes out a new version of the FsImage to disk and truncates the old EditLog, because its changes are now reflected in the FsImage.
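
The startup steps above can be sketched as a toy replay loop. This is only an illustration under simplifying assumptions: the namespace is a plain dictionary, and the operation names (`create`, `add_block`, `set_replication`, `delete`) are hypothetical stand-ins, not the real EditLog format.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace (path -> metadata)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0], "blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(args[0])
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def start_name_node(fs_image, edit_log):
    """Replay the edit log onto a copy of the image, then checkpoint.

    Returns (in-memory namespace, new FsImage, truncated EditLog).
    """
    namespace = copy.deepcopy(fs_image)
    for edit in edit_log:
        apply_edit(namespace, edit)
    new_fs_image = copy.deepcopy(namespace)  # flushed to disk in a real system
    new_edit_log = []                        # old edits are now part of the image
    return namespace, new_fs_image, new_edit_log


fs_image = {"/a.txt": {"replication": 3, "blocks": ["blk_1"]}}
edit_log = [
    ("create", "/b.txt", 3),
    ("add_block", "/b.txt", "blk_2"),
    ("set_replication", "/a.txt", 2),
    ("delete", "/a.txt"),
]
namespace, new_image, new_log = start_name_node(fs_image, edit_log)
print(namespace)  # prints: {'/b.txt': {'replication': 3, 'blocks': ['blk_2']}}
```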
Explain with a diagram when data replication happens in HDFS ?
Is the secondary name node a backup name node ? No. It is a separate helper daemon that keeps copies of both the EditLog and the FsImage and merges them periodically to keep the EditLog size reasonable. It is usually better to run it on a node different from the name node.
How does a client read work in HDFS ?
Explain the HDFS Replica Strategy ?
- The same node as the writer (the local node)
- A node on a different rack
- Another node on that same remote rack
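
As a rough sketch of this strategy for a replication factor of 3: the first replica stays on the writer's node, the second goes to a node on a different rack, and the third to another node on that second rack. The function `place_replicas` and the dictionary-based topology are assumptions made for illustration; the real placement policy also weighs factors such as free space and load.

```python
import random


def place_replicas(writer_node, topology, rng=random):
    """Pick three nodes for a block's replicas.

    topology maps a rack ID to the list of data nodes on that rack.
    """
    local_rack = None
    for rack_id, nodes in topology.items():
        if writer_node in nodes:
            local_rack = rack_id
            break
    if local_rack is None:
        raise ValueError(f"{writer_node} is not part of the topology")

    # Only remote racks with at least two nodes can host replicas 2 and 3.
    remote_racks = [
        rack_id for rack_id, nodes in topology.items()
        if rack_id != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("no remote rack with two or more nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))
```

With this two-rack topology, the result always starts with `n1` and contains `n3` and `n4` in some order.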
How does a client write work in HDFS ?
How would you create a new folder in HDFS ?
hdfs dfs -mkdir /sample
How would you copy a file from local FS to HDFS ?
hdfs dfs -put ./sample.txt /sample/sample.txt
How would you copy a file from HDFS to local FS ?
hdfs dfs -get /sample/sample.txt sample.txt
What are the two special features of Hadoop ?
- Data Replication: The client is automatically redirected to the nearest replica to ensure maximum performance. The client doesn't need to keep track of the blocks.
- Data Pipeline: The client just writes to the first Data Node in the pipeline. The changes are automatically forwarded to the next node, which forwards them to the node after it, and so on. The process continues until all the replicas are updated.
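
The data pipeline idea can be sketched with a chain of toy nodes, each of which stores the block and then forwards it to its successor. The `DataNode` class and `build_pipeline` helper below are hypothetical and run in a single process; they model only the forwarding order, not networking or acknowledgements.

```python
class DataNode:
    """Toy data node that stores a block and forwards it down the pipeline."""

    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.blocks = {}

    def write(self, block_id, data):
        """Store the block locally, forward it, and return the nodes that stored it."""
        self.blocks[block_id] = data
        if self.next_node is None:
            return [self.name]
        return [self.name] + self.next_node.write(block_id, data)


def build_pipeline(names):
    """Link the named nodes into a chain and return them in pipeline order."""
    nodes = []
    next_node = None
    for name in reversed(names):
        next_node = DataNode(name, next_node)
        nodes.append(next_node)
    nodes.reverse()
    return nodes


pipeline = build_pipeline(["dn1", "dn2", "dn3"])
# The client only talks to the first node in the pipeline.
stored_on = pipeline[0].write("blk_1", b"hello")
print(stored_on)  # prints: ['dn1', 'dn2', 'dn3']
```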
What are the two phases in MapReduce ?
- Map
- Reduce
What are the daemons used in MapReduce ?
- JobTracker
- TaskTracker
Where are the JobTracker and TaskTracker executed ? JobTracker is executed in the Master Node and TaskTracker is executed in the Slave Node
Draw a diagram showing the interaction between JobTracker and TaskTracker ?
Explain the MapReduce Workflow ?
What is the hidden phase in between map and reduce ? Shuffle and Sort
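
A word count is the usual way to see the three phases side by side. The sketch below runs them in memory in plain Python. It is not the Hadoop MapReduce API, and the function names are chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


print(word_count(["big data big cluster", "data node"]))
# prints: [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

The reducer never sees raw lines. It only receives each key together with all of the values that the shuffle step grouped under it.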
What are the 5 limitations of Hadoop Architecture ?
- One NameNode is responsible for the entire cluster
- MapReduce takes care of both cluster resource management and data processing
- Only suitable for batch-oriented MapReduce tasks
- Not suitable for interactive analysis
- Not suitable for ML, graph processing, or memory-intensive tasks
What is the full form of YARN ? Yet Another Resource Negotiator
What is the primary reason for introducing YARN ? Separate resource management from data processing
What are the two main components of YARN ?
- ResourceManager
- NodeManager
What are the two main components of the ResourceManager ?
- Scheduler
- ApplicationsManager
What runs on a node managed by a NodeManager ?
- Container
- ApplicationMaster
What are the functions of the ApplicationsManager ? The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
What are the functions of the ApplicationMaster ? The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.
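
The division of labour described above can be sketched as a toy model: the ApplicationsManager obtains the first container and starts the ApplicationMaster in it, and the ApplicationMaster then asks the Scheduler for the containers its tasks need. Every class and method below is a simplified stand-in written for illustration, not the YARN API. Containers are modelled as plain slot counts per node.

```python
class Scheduler:
    """Hands out container slots; does no monitoring or restarting."""

    def __init__(self, slots_by_node):
        self.free = dict(slots_by_node)

    def allocate(self, count):
        """Grant up to `count` containers, returning the node of each one."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Per-application master: negotiates task containers from the Scheduler."""

    def __init__(self, app_name, node, scheduler):
        self.app_name = app_name
        self.node = node
        self.scheduler = scheduler
        self.task_containers = []

    def run(self, task_count):
        self.task_containers = self.scheduler.allocate(task_count)


class ApplicationsManager:
    """Accepts submissions and launches the ApplicationMaster in the first container."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def submit(self, app_name, task_count):
        first = self.scheduler.allocate(1)
        if not first:
            raise RuntimeError("no container available for the ApplicationMaster")
        master = ApplicationMaster(app_name, first[0], self.scheduler)
        master.run(task_count)
        return master


scheduler = Scheduler({"node1": 2, "node2": 2})
master = ApplicationsManager(scheduler).submit("wordcount", task_count=3)
print(master.node, master.task_containers)  # prints: node1 ['node1', 'node2', 'node2']
```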
Draw the YARN workflow ?
Explain Pig ? Pig is a data flow system for Hadoop. Its scripting language, Pig Latin, can be used as an alternative to writing MapReduce programs directly.
Explain Hive ? Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries can be done using an SQL-like language
Explain Sqoop ? Sqoop is a tool that transfers data between Hadoop and relational databases. With Sqoop, you can import data from an RDBMS into HDFS and export it back again.
Explain HBase ? HBase is a column-oriented NoSQL database for Hadoop. It is used to store tables with billions of rows and millions of columns, and it provides random read/write operations.