Hadoop interview questions and answers for freshers and experienced - Part 1

1. What is HDFS?

  • HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to it.
  • Files are stored redundantly across multiple machines, which keeps them durable when machines fail and highly available to highly parallel applications.

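2. What are the Hadoop configuration files?

  1. hdfs-site.xml (settings for the HDFS daemons)
  2. core-site.xml (common settings, such as the default file system)
  3. mapred-site.xml (settings for MapReduce)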

3. How does the NameNode handle DataNode failures?

  • The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
  • When the NameNode has not received a Heartbeat from a DataNode for a certain amount of time, it marks that DataNode as dead. The blocks that were stored on it are now under-replicated, so the NameNode begins re-replicating them onto other DataNodes.
  • The NameNode only coordinates this replication. The data is transferred directly between DataNodes and never passes through the NameNode.
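
The timeout logic described above can be sketched in plain Java. This is a toy model, not the NameNode's actual code: the class name HeartbeatMonitor, its methods, and the 10-second timeout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of heartbeat-based failure detection; not the real NameNode code.
class HeartbeatMonitor {
    private final long timeoutMillis; // hypothetical timeout chosen by the caller
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a DataNode reported in at the given time.
    void onHeartbeat(String dataNodeId, long nowMillis) {
        lastHeartbeat.put(dataNodeId, nowMillis);
    }

    // Return the DataNodes whose last heartbeat is older than the timeout.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.onHeartbeat("dn1", 0);
        monitor.onHeartbeat("dn2", 0);
        monitor.onHeartbeat("dn1", 9_000); // dn1 keeps reporting, dn2 goes silent
        System.out.println(monitor.deadNodes(15_000)); // prints [dn2]
    }
}
```

In a real cluster, the blocks of every node returned by deadNodes would then be scheduled for re-replication on other DataNodes.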

4. What is MapReduce in Hadoop?

  • Hadoop MapReduce is a framework for distributed processing of large data sets on clusters of commodity hardware.
  • The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
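
To make the programming model concrete, here is a small in-memory imitation of the map, shuffle, and reduce steps for a word count. It is only a sketch: it uses no Hadoop classes, runs on a single machine, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the map -> shuffle -> reduce flow; no Hadoop classes involved.
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big cluster")));
        // prints {big=2, cluster=1, data=1}
    }
}
```

In Hadoop itself, the map and reduce calls run as separate tasks on different machines, and the framework performs the shuffle between them.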

5. What is the responsibility of the NameNode in HDFS?

  • The NameNode is the master daemon that maintains the file system metadata, including which blocks make up each file and which DataNodes store them. Every DataNode sends heartbeats and block reports to the NameNode.
  • If the NameNode stops receiving heartbeats from a DataNode, it treats that DataNode as dead. In a cluster without NameNode high availability, the NameNode itself is a single point of failure: if it goes down, the HDFS cluster is inaccessible.

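6. What is the responsibility of the SecondaryNameNode in HDFS?

  • The SecondaryNameNode is a helper daemon that does housekeeping for the NameNode: it periodically merges the NameNode's edit log into the file system image (a checkpoint).
  • The SecondaryNameNode is not a backup NameNode and cannot take over from it, but it does hold a checkpointed copy of the NameNode's metadata.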

7. What is the DataNode in HDFS?

  • The DataNode is the slave daemon that stores the actual data blocks under the direction of the NameNode. Each DataNode stores many blocks, 64 MB or 128 MB each by default, depending on the Hadoop version.

8. What is the JobTracker in Hadoop?

  • The JobTracker is the master daemon of MapReduce. It assigns tasks to TaskTrackers, preferring TaskTrackers on the nodes that hold the data blocks of the input file.

9. How can we list all jobs running in a cluster?

  • hadoop job -list

10. How can we kill a job?

  • hadoop job -kill jobid

11. What is the default port that the JobTracker web UI listens on?

  • 50030, for example http://localhost:50030

12. What is the default port that the HDFS NameNode web UI listens on?

  • 50070, for example http://localhost:50070

13. What is Hadoop Streaming?

  • Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations. The programs read their input from standard input and write their output to standard output.

14. What is the Distributed Cache in Hadoop?

  • The Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job.
  • The framework copies the necessary files to a slave node before any tasks for the job are executed on that node.

15. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?

  • The Distributed Cache is much faster. It copies the file once to each TaskTracker node that runs tasks for the job, before those tasks start.
  • Whether a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy.
  • If the MapReduce code reads the file from HDFS instead, every task fetches it separately: a TaskTracker that runs 100 map tasks reads the file from HDFS 100 times.
  • HDFS is also not very efficient when used this way.

16. Is it possible to provide multiple inputs to a Hadoop job? If so, how can you give multiple directories as input?

  • Yes. The input format class provides methods to add multiple directories as input to a Hadoop job, for example FileInputFormat.addInputPath, which can be called once per directory.

17. What will a Hadoop job do if you run it with an output directory that already exists? Will it overwrite it, warn you and continue, or throw an exception and exit?

  • The Hadoop job will throw an exception and exit.

18. How can you set an arbitrary number of mappers to be created for a job in Hadoop?

  • This is a trick question. You cannot set it directly: the number of map tasks is determined by the number of input splits.

19. How can you set an arbitrary number of reducers to be created for a job in Hadoop?

  • You can either do it programmatically, by using the setNumReduceTasks method in the JobConf class, or set it as a configuration setting.

20. How will you write a custom partitioner for a Hadoop job?

  • To have Hadoop use a custom partitioner, you have to do at least the following three things:
  1. Create a new class that extends the Partitioner class.
  2. Override the getPartition method.
  3. In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add it to the job through configuration (if your wrapper reads from a config file or Oozie).
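
A minimal sketch of the first two steps follows. So that it compiles without Hadoop on the classpath, it declares its own stand-in Partitioner class with the same getPartition shape; in a real job you would extend org.apache.hadoop.mapreduce.Partitioner instead. FirstLetterPartitioner and its routing rule are hypothetical, chosen only for illustration.

```java
// Hypothetical example: route keys by their first letter, so that all words
// starting with the same letter (ignoring case) reach the same reducer.
class FirstLetterPartitioner extends Partitioner<String, Integer> {
    @Override
    int getPartition(String key, Integer value, int numPartitions) {
        if (key.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(key.charAt(0));
        // The result must lie in the range [0, numPartitions).
        return first % numPartitions;
    }

    public static void main(String[] args) {
        FirstLetterPartitioner p = new FirstLetterPartitioner();
        System.out.println(p.getPartition("apple", 1, 4));   // 'a' is 97, and 97 % 4 = 1
        System.out.println(p.getPartition("Avocado", 1, 4)); // same partition as "apple"
    }
}

// Stand-in for Hadoop's Partitioner so this sketch is self-contained.
// ASSUMPTION: it only mirrors the method shape, not the real class.
abstract class Partitioner<K, V> {
    abstract int getPartition(K key, V value, int numPartitions);
}
```

For the third step, with the real Hadoop class in place, the driver would register it with a call such as job.setPartitionerClass(FirstLetterPartitioner.class).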

21. How did you debug your Hadoop code?

  • There are several ways of doing this, but the most common are:
  1. Using counters
  2. Using the web interface provided by the Hadoop framework

22. What does the term "replication factor" mean?

  • The replication factor is the number of copies of each block of a file that HDFS stores.

23. What is the default replication factor in HDFS?

  • The default replication factor is 3.

24. What is the typical block size of an HDFS block?

  • The default HDFS block size is 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x and later).

25. What is the benefit of having such a big block size (compared to the block size of a Linux file system like ext)?

  • It allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file gets smaller as the size of individual blocks increases). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data laid out sequentially on the disk.
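
The effect on the per-file block list is simple arithmetic. The sketch below counts the blocks needed for a 1 GB file at three block sizes; the 4 KB figure is an assumed size for a typical local file system block, and the class name is made up for this example.

```java
class BlockCount {
    // Number of blocks needed to hold a file: ceiling of fileSize / blockSize.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long KB = 1024L, MB = 1024L * KB, GB = 1024L * MB;
        long fileSize = 1 * GB;
        System.out.println(blocksFor(fileSize, 4 * KB));   // 262144 blocks at 4 KB
        System.out.println(blocksFor(fileSize, 64 * MB));  // 16 blocks at 64 MB
        System.out.println(blocksFor(fileSize, 128 * MB)); // 8 blocks at 128 MB
    }
}
```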

26. Why is it recommended to have a few very large files instead of a lot of small files in HDFS?

  • The NameNode holds the metadata of every file in HDFS, so more files mean more metadata. Because the NameNode loads all of this metadata into memory for speed, a very large number of files can make the metadata bigger than the memory available on the NameNode.
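
A rough back-of-the-envelope estimate shows the scale of the difference. The sketch assumes about 150 bytes of NameNode memory per file or block object, a commonly quoted rule of thumb rather than a figure from this post, and compares small files of about 1 MB with 1 GB files split into eight 128 MB blocks. All names and numbers here are illustrative assumptions.

```java
class NameNodeMemoryEstimate {
    // ASSUMPTION: roughly 150 bytes of NameNode heap per namespace object (file or block).
    static final long BYTES_PER_OBJECT = 150;

    // Each file costs one object for the file itself plus one object per block.
    static long estimateBytes(long fileCount, long blocksPerFile) {
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million small files of about 1 MB, one block each
        System.out.println(estimateBytes(10_000_000L, 1)); // 3000000000 (about 3 GB)
        // roughly the same data as 10 thousand 1 GB files of 8 blocks each
        System.out.println(estimateBytes(10_000L, 8));     // 13500000 (about 13.5 MB)
    }
}
```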

27. What alternate way does HDFS provide to recover data if a NameNode without a backup fails and cannot be recovered?

  • There is none. If the NameNode dies and there is no backup of its metadata, the data cannot be recovered.

28. Describe how an HDFS client reads a file in HDFS. Does it talk to the DataNodes or to the NameNode, and how does the data flow?

  • To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file.
  • These locations identify the DataNodes that hold each block. The client then reads the file data directly from the DataNodes, possibly in parallel.
  • The NameNode is not directly involved in this bulk data transfer, which keeps its overhead to a minimum.
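
The two-step read can be modelled with a few in-memory maps. This is a toy sketch rather than the HDFS client API: the maps stand in for the NameNode's metadata and the DataNodes' storage, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the HDFS read path; all class and field names are hypothetical.
class ReadPathSketch {
    // "NameNode" metadata: file path -> ordered block IDs, and block ID -> DataNode holding it.
    static final Map<String, List<String>> fileToBlocks = new HashMap<>();
    static final Map<String, String> blockToDataNode = new HashMap<>();
    // "DataNode" storage: DataNode ID -> (block ID -> block contents).
    static final Map<String, Map<String, String>> dataNodes = new HashMap<>();

    static String read(String path) {
        StringBuilder contents = new StringBuilder();
        // Step 1: ask the "NameNode" for the block list (metadata only).
        for (String blockId : fileToBlocks.get(path)) {
            // Step 2: fetch each block directly from the "DataNode" that stores it.
            String dataNodeId = blockToDataNode.get(blockId);
            contents.append(dataNodes.get(dataNodeId).get(blockId));
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        fileToBlocks.put("/logs/a.txt", List.of("blk_1", "blk_2"));
        blockToDataNode.put("blk_1", "dn1");
        blockToDataNode.put("blk_2", "dn2");
        dataNodes.put("dn1", Map.of("blk_1", "hello "));
        dataNodes.put("dn2", Map.of("blk_2", "world"));
        System.out.println(read("/logs/a.txt")); // prints hello world
    }
}
```

Note that the file contents live only in the dataNodes map; the two metadata maps never hold any file data, which mirrors why the NameNode stays out of the bulk transfer.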

29. Using the Linux command line, how will you list the files in an HDFS directory?

  • hadoop fs -ls <path>
  • To get only the number of files and directories, use hadoop fs -count <path>

30. Using the Linux command line, how will you create a directory in HDFS?

  • hadoop fs -mkdir <path>

This post first appeared on Java Tutorial - InstanceOfJava.
