
Hadoop Interview Questions and Answers


Hadoop interview questions and answers for beginners and experts: a list of frequently asked Hadoop interview questions with answers from Besant Technologies. We hope these questions and answers are useful and help you get the best job in the industry. They were prepared by Hadoop professionals based on the expectations of MNC companies. Stay tuned; we will update the list with new Hadoop interview questions and answers frequently. If you want practical Hadoop training, please go through our Hadoop Training in Chennai and Hadoop Training in Bangalore.

Best Hadoop Interview Questions and answers

Besant Technologies supports students by providing Hadoop interview questions and answers for job placements and interviews. Hadoop is a leading course at present because of the number of job openings and the high salaries paid for Hadoop and related roles. We also provide Hadoop online training for students around the world through the Gangboard platform. These top Hadoop interview questions and answers were prepared by our institute's experienced trainers.

Hadoop Interview Questions and answers for the job placements

Here is the list of the most frequently asked Hadoop interview questions and answers in technical interviews. These questions and answers are suitable for both freshers and experienced professionals at any level. The questions target intermediate to somewhat advanced Hadoop professionals, but even if you are just a beginner or fresher you should be able to understand the answers and explanations we give here.

1) Explain in detail the Kafka Producer in the context of Hadoop.

Before explaining the Kafka Producer, we first have to know what Kafka is and why it came into existence.

Kafka is an open-source distributed streaming platform for processing stream data.

Kafka includes these core APIs – Producer API, Consumer API, Streams API and Connect API.

Typical use cases of the Kafka APIs are website activity tracking, messaging, metrics, log aggregation, event sourcing, stream processing and commit logs.

Let's go into detail about the Producer API:

This API is mainly used for publishing (and, together with the Consumer API, consuming) messages using a Java client.

The Apache Kafka Producer API has a class called "KafkaProducer" which takes the Kafka broker details in its constructor and provides the following methods – send, flush and metrics.

  • send() –

e.g. producer.send(new ProducerRecord<>(topic, partition, key, value), userCallback);

In the above example code:

ProducerRecord – a key/value pair to be sent to Kafka; it takes the topic, partition, key and value as parameters. The producer manages a buffer of such records waiting to be sent.

userCallback – a user callback function to execute when the record has been acknowledged by the server. If it is null, there is no callback.

  • flush() – this method makes all buffered records immediately available to send and blocks until all previously sent requests complete.

e.g. public void flush ()

  • metrics() – returns the producer's internal metrics at runtime. A related method, partitionsFor(topic), returns the partition metadata for a given topic and is useful for custom partitioning.

e.g.  public Map<MetricName, ? extends Metric> metrics()

After all the methods have been executed and every send request is completed, we need to call the close method.

e.g. public void  close()

Overview of Kafka Producer API’s:

There are 2 types of producers i.e. Synchronous (Sync) and Asynchronous (Async)

Sync – this producer sends a message directly and blocks until the broker has acknowledged it.

e.g. kafka.producer.SyncProducer

Async – Kafka also provides an asynchronous send method to send a record to a topic. The big difference between sync and async is that with async the records are batched and sent in the background, and we typically define a callback (for example, as a lambda expression) that is invoked when the send completes.

e.g. kafka.producer.async.AsyncProducer.

Example Program-

class Producer<K, V>

{

/* Sends the data, partitioned by key, to the topic using either the synchronous or the asynchronous producer */

public void send(kafka.javaapi.producer.ProducerData<K, V> producerData);

public void send(java.util.List<kafka.javaapi.producer.ProducerData<K, V>> producerData);

/* Finally, close the producer to clean up */

public void close();

}
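For recent Kafka versions, the same flow is expressed with the org.apache.kafka.clients.producer.KafkaProducer client. Below is a minimal sketch in Java; the broker address ("localhost:9092") and the topic name ("my-topic") are placeholder assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // send() is asynchronous; the callback runs once the record is acknowledged
        producer.send(new ProducerRecord<>("my-topic", "key1", "value1"),
                      (metadata, exception) -> {
                          if (exception != null) exception.printStackTrace();
                      });
        producer.flush();   // make all buffered records available to send
        producer.close();   // release resources once all requests are complete
    }
}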

2) Explain Monad class?

A monad can be thought of as a wrapper class for objects: identity corresponds to unit and bind corresponds to flatMap (map). It provides two operations, as below:

identity (return in Haskell, unit in Scala)

bind (>>= in Haskell, flatMap in Scala)

Scala doesn't have a built-in monad type, so we need to model the monad ourselves. However, Scala libraries such as Scalaz ship with a built-in Monad, together with the related type-class family (applicatives, functors, monoids and so on).

The sample program below models a monad with a generic trait in Scala that provides methods like unit() and flatMap(). Let's denote the monad by M for short.

trait M[A]

{

  def flatMap[B](f: A => M[B]): M[B]

}

def unit[A](x: A): M[A]
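As an analogy in plain Java (an assumption added for illustration, not part of the Scala answer), java.util.Optional behaves like a monad: Optional.of plays the role of unit and Optional.flatMap plays the role of bind.

import java.util.Optional;

public class MonadExample {
    public static void main(String[] args) {
        // unit: wrap a plain value
        Optional<Integer> wrapped = Optional.of(21);
        // bind / flatMap: apply a function that itself returns a wrapped value
        Optional<Integer> doubled = wrapped.flatMap(x -> Optional.of(x * 2));
        System.out.println(doubled.get());   // prints 42
    }
}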

3) Explain the reliability of Flume-NG data?

Apache Flume provides a reliable and distributed system for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

This next generation of Flume is informally referred to as Flume NG. It went through two internal milestones, NG Alpha 1 and NG Alpha 2, before a formal incubator release of Flume NG.

The core concepts of Flume NG are Event, Flow, Client, Agent, Source, Channel and Sink. These core concepts shape the Flume NG architecture so that it can achieve this objective.

4) What is Interceptor?

This is a Flume plug-in that listens to incoming events and can alter an event's content on the fly.

e.g. Interceptor Implementation for JSON data.
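Below is a minimal sketch of such an interceptor in Java, assuming the standard org.apache.flume.interceptor.Interceptor interface; the JSON check itself is deliberately simplistic and only illustrative.

import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class JsonInterceptor implements Interceptor {
    @Override public void initialize() { }

    // Tag each event whose body looks like JSON; a real implementation would parse it properly
    @Override public Event intercept(Event event) {
        String body = new String(event.getBody());
        if (body.trim().startsWith("{")) {
            event.getHeaders().put("format", "json");
        }
        return event;
    }

    @Override public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>();
        for (Event e : events) {
            out.add(intercept(e));
        }
        return out;
    }

    @Override public void close() { }

    // Flume instantiates interceptors through a Builder declared in the agent configuration
    public static class Builder implements Interceptor.Builder {
        @Override public Interceptor build() { return new JsonInterceptor(); }
        @Override public void configure(Context context) { }
    }
}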

5) What are the different Flume-NG Channel types?

The main channel types of Flume-NG are Memory Channel, JDBC Channel, Kafka Channel, File Channel, Spillable Memory Channel, Pseudo Transaction Channel.

In basic Flume configurations, the commonly used channel types are memory, JDBC, file and Kafka.

6) What is Base class in java?

A base class is a class that facilitates the creation of other classes. In object-oriented programming terms, a class that inherits from it is referred to as a derived class. This helps to reuse the base class code implicitly, except for constructors and destructors.

7) What is Base class in scala?

The base class concept is the same for both Java and Scala; only the syntax differs, as the Base and Derived classes in the example below show.

Ex.

abstract class Base(val x: String)

final class Derived(x: String) extends Base("Base's " + x)

{

  override def toString = x

}

8) What is Resilient Distributed Dataset(RDD)?

A Resilient Distributed Dataset (RDD) is the core of Apache Spark and provides its primary data abstraction.

These are the features of RDDs (a short example follows the list):

  • Resilient means fault-tolerant with the help of the RDD lineage graph, so it is easy to re-compute missing or damaged partitions caused by node failures.
  • Distributed means the data resides on multiple nodes in a cluster.
  • Dataset means a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects.
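A small illustration using the Spark Java API (a sketch that assumes a local Spark installation; the app name and data are arbitrary):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A dataset partitioned across the cluster (here, the local machine)
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations are lazy; the lineage graph lets Spark recompute lost partitions
        int sum = numbers.map(x -> x * x).reduce(Integer::sum);

        System.out.println("Sum of squares: " + sum);
        sc.close();
    }
}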

9) Give a brief description of Fault tolerance in Hadoop?

Fault tolerance means the system keeps functioning, without any data loss, even if some of its hardware components fail. This feature of Hadoop allows large data sets to be computed with parallel, distributed algorithms in the cluster without being stopped by failures. It uses the heart of Hadoop, i.e. MapReduce.

10) What is Immutable data with respect to Hadoop?

Immutability is the idea that data or objects cannot be modified once they are created. This concept underpins how Hadoop computes large data sets without data loss or failures. Programming languages like Java and Python treat strings as immutable objects, which means we cannot change them after creation.
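A quick Java illustration of the concept (not Hadoop-specific): operations on a String return a new object instead of modifying the original.

public class ImmutableExample {
    public static void main(String[] args) {
        String original = "hadoop";
        String upper = original.toUpperCase();   // returns a NEW string
        System.out.println(original);            // still prints "hadoop"
        System.out.println(upper);               // prints "HADOOP"
    }
}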

11) What are the modes in which Hadoop can be executed?

We have three modes in which Hadoop can run:

  • Standalone (local) mode: the default mode of Hadoop; it uses the local file system for input and output operations. This mode is used for debugging purposes, and it does not support the use of HDFS.
  • Pseudo-distributed mode: in this case, you need to configure all three configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml). All daemons run on one node, so both the Master and Slave roles are on the same machine.
  • Fully distributed mode: this is the production phase of Hadoop, where data is distributed across several nodes of a Hadoop cluster. Different nodes are allotted as Master and Slaves.
12) How is formatting done in HDFS?

The Hadoop Distributed File System (HDFS) is formatted using the bin/hadoop namenode -format command. This command formats the HDFS via the NameNode and is only used the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. If you execute this command on an existing filesystem, you will delete all the data stored on your NameNode. Formatting the NameNode does not format the DataNodes.

13) What are the contents of the masters file in Hadoop?

The masters file contains information about the location of the Secondary NameNode server.

14) Describe the main hdfs-site.xml properties.

The three important hdfs-site.xml properties are:

  • dfs.name.dir gives you the location where the NameNode stores its metadata (FsImage and edit logs) and where DFS is located – on the local disk or on a remote directory.
  • dfs.data.dir gives the location of the DataNodes, where the data is stored.
  • fs.checkpoint.dir is the directory on the filesystem where the Secondary NameNode stores the temporary images of the edit logs, which are merged with the FsImage for backup.
15) Explain the spill factor with respect to RAM.

The map output is stored in an in-memory buffer; when this buffer is almost full, the spill phase begins in order to move the data to a temporary folder on disk.

The map output is first written to this buffer, whose size is decided by mapreduce.task.io.sort.mb. By default, it is 100 MB.

When the buffer reaches a certain threshold, it starts spilling the buffer data to disk. This threshold is specified in mapreduce.map.sort.spill.percent (0.80 by default).
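Both settings can be overridden per job through the standard Hadoop Configuration API; a small sketch (the values shown are arbitrary examples, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "256");          // sort buffer size in MB (example value)
        conf.set("mapreduce.map.sort.spill.percent", "0.90");  // spill threshold (example value)
        Job job = Job.getInstance(conf, "spill-config-example");
        // ... set mapper, reducer, input/output formats and paths as usual ...
    }
}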


16) Why do we require password-less SSH in a Fully Distributed environment?

We require password-less SSH in a Fully Distributed environment because when the cluster is live and working, communication is very frequent. The DataNode and the NodeManager should be able to send messages to the master server quickly.

17) Does this requirement lead to security issues?

No. A Hadoop cluster is an isolated cluster that generally has nothing to do with the internet and has a different kind of configuration, so we need not worry about that kind of security breach, such as someone hacking in through the internet. Hadoop also has a very secure way of connecting to other machines to fetch and process data.

18) What will happen to the NameNode when the ResourceManager is down?

When the ResourceManager is down, it will not be functional (for submitting jobs), but the NameNode will still be available. So the cluster is accessible as long as the NameNode is running, even if the ResourceManager is not in a working state.

19) Tell us about the features of Fully Distributed mode.

This is an important question because Fully Distributed mode is used in the production environment, where we have 'n' machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines: there is one node on which the NameNode runs and other nodes on which DataNodes run. A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode. The ResourceManager manages all of these NodeManagers; it also receives the processing requests and passes the relevant parts of each request to the corresponding NodeManagers.

20) Explain fsck.

fsck stands for File System Check. The Hadoop Distributed File System supports the fsck command to check for various inconsistencies. It is designed to report problems with the files in HDFS, for example missing blocks of a file or under-replicated blocks.

21.) How do you copy a file from the local hard disk to HDFS?

hadoop fs -copyFromLocal localfilepath hdfsfilepath
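The same copy can be done programmatically through the HDFS Java API; a minimal sketch (both paths are placeholder examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS
        fs.copyFromLocalFile(new Path("/tmp/local-file.txt"),
                             new Path("/user/hadoop/hdfs-file.txt"));
        fs.close();
    }
}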

22.) Is it possible to set the number of reducers to zero?

Yes, it is possible to set the number of reducers to zero in MapReduce (Hadoop). When the number of reducers is set to zero, no reducers are executed, and the output of each mapper is stored in a separate file on HDFS.
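In the Java MapReduce API this is a one-line job setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-job");
        job.setNumReduceTasks(0);   // map-only job: mapper output is written directly to HDFS
        // ... set mapper class, input/output formats and paths as usual ...
    }
}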

23.) Map-side join / Hive join

To optimize the performance of Hive queries, we can use a map-side join. We use a map-side join when one of the tables in the join is small enough to be loaded into main memory, so the join can be performed within the mapper without a reduce step.

24.) Managed Table Vs External Table

A managed table stores its data in the /user/hive/warehouse/tablename folder. Once you drop the table, the data is lost along with the table schema.

An external table stores its data in a user-specified location. Once you drop the table, only the table schema is lost; the data is still available in HDFS for further use.

25.) Difference between bucketing and partitioning

Bucketing – bucketing is mainly used for data sampling. We can use Hive bucketing on both managed and external tables. We can bucket on a single column only, and the values of that column are distributed into a number of buckets using a hash algorithm. Bucketing is an optimization technique that improves performance.

Partitioning – we can partition by one or more columns, and sub-partitioning (a partition within a partition) is allowed. In static partitioning, we have to specify the partition values manually when loading the data. In dynamic partitioning, the number of partitions is decided by the number of unique values in the partitioned column.

26.) Syntax to create hive table with partitioning

create table tablename
(
  var1 datatype1,
  var2 datatype2,
  var3 datatype3
)
PARTITIONED BY (var4 datatype4, var5 datatype5)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1")

28.) Sqoop split-by

Sqoop imports/exports data between an RDBMS and HDFS in parallel using multiple mappers, so the workload can be distributed into multiple parts.

--split-by specifies the column of the table used to generate the splits for the import, i.e. it declares which column is used to create the splits.

Generally, select min(split-by column) from table and select max(split-by column) from table decide the outer boundaries for the splits (the boundary query). We need to define the column used to create splits for parallel imports; otherwise, Sqoop splits the workload based on the primary key of the table.

Syntax: bin/sqoop import --connect jdbc:mysql://localhost/database --table tablename --split-by column

29.) File formats available in Sqoop import

Delimited text and SequenceFile.

Delimited text is the default import file format; it can be specified explicitly with --as-textfile.

SequenceFile is a binary format.

30.) Default number of mappers in a Sqoop command

The default number of mappers in a Sqoop command is 4.

31.) Maximum number of mappers used by a Sqoop import command

The maximum number of mappers depends on many variables:

1. The database type.
2. The hardware used for your database server.
3. The impact on other requests that your database needs to process.

32.) Flume Architecture

External data source ==> Source ==> Channel ==> Sink ==> HDFS

33.) In Unix, command to show all processes

ps -ef (or ps aux); plain ps lists only the processes of the current user's terminal session.

34.) Partitions in Hive

Partitions allow us to store the data in different sub-folders under the main table folder, based on a partitioned column.

Static partitions: the user has to load the data into the static partitioned table manually.

Dynamic partitions: we can load the data from a non-partitioned table into a partitioned table using dynamic partitions, after enabling the settings below.

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.max.dynamic.partitions=10000;

set hive.exec.max.dynamic.partitions.pernode=1000;

35.) File formats in hive

ORC File format – Optimized Row Columnar file format

RC File format – Row Columnar file format

TEXT file format – the default file format

Sequence file format – if the size of a file is smaller than the HDFS data block size, it is considered a small file. Many small files increase the metadata, which becomes an overhead for the NameNode. To solve this problem, sequence files were introduced; they act as containers that store multiple small files.

Avro file format

Custom INPUT FILE FORMAT and OUTPUT FILE FORMAT

36.) Syntax to create bucketed table

create table tablename
(
  var1 datatype1,
  var2 datatype2,
  var3 datatype3
)
PARTITIONED BY (var4 datatype4, var5 datatype5)
CLUSTERED BY (var1) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1")

37.) Custom Partitioning

A custom partitioner is a mechanism that lets us decide which reducer each record goes to, based on a user-defined condition. By partitioning on the key, we ensure that records with the same key go to the same reducer.
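A minimal sketch of a custom partitioner with the Hadoop MapReduce API; the routing rule (first letter of the key) and the class name are illustrative assumptions. It would be registered on the job with job.setPartitionerClass(AlphabetPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Example rule: keys starting with "a"-"m" go to reducer 0, the rest to reducer 1
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                       // map-only job, nothing to route
        }
        char first = Character.toLowerCase(key.toString().charAt(0));   // assumes non-empty keys
        int partition = (first <= 'm') ? 0 : 1;
        return partition % numReduceTasks;  // stay within the available reducers
    }
}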

38.) Difference between order by and sort by

Hive supports both: sort by sorts the data within each reducer, while order by sorts the data across all reducers (i.e. it sorts the total data).

39.) Purpose of ZooKeeper

ZooKeeper assists in cluster management.

Managing configuration across nodes: a Hadoop cluster can have hundreds of systems, and ZooKeeper helps keep configuration synchronized across the cluster.

Because many systems are involved, race conditions and deadlocks are common problems when implementing distributed applications.

A race condition occurs when a system tries to perform two or more operations at the same time; this is taken care of by ZooKeeper's serialization property.

A deadlock occurs when two or more systems try to access the same shared resource at the same time; synchronization helps to solve deadlocks.

Partial failure of a process can lead to uncertainty about the data. ZooKeeper handles this through atomicity, which means either the whole process finishes or nothing is carried through after a failure.
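Below is a minimal sketch of the ZooKeeper Java client storing a piece of shared configuration; the connection string, znode path and value are placeholder assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address and timeout are placeholders)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Store a configuration value in a znode so every node in the cluster sees the same value
        zk.create("/app/config", "replication=3".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}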

40.) Sqoop Incremental last modified

bin/sqoop import --connect jdbc:mysql://localhost/database --table table_name --incremental lastmodified --check-column column_name --last-value 'value' -m 1

41.) Difference MR1 vs MR2

MR1 – it consists of the JobTracker and TaskTrackers (for processing) and the NameNode and DataNodes (for storage). It supports only the MapReduce framework.

MR2 – the JobTracker has been split into two parts: the ApplicationMaster (one per MR job) and the ResourceManager (only one). It supports the MapReduce framework as well as other frameworks (Spark, Storm, etc.).

42.) What does select * from table give for a normal table and for a partitioned table?

It gives the same results in both scenarios.

43.) Explode and implode in Hive

Explode – expands an array of values into individual rows, one per value.

Syntax – select pageid, adid from page LATERAL VIEW explode (adid_list) mytable as adid;

Implode – collects/aggregates records from multiple rows into an array or map; it is the opposite of explode(). In Hive this is done with aggregate functions such as collect_set() or collect_list().

syntax – select userid, collect_set(actor_id) from actor group by userid;

44.) Interceptors in Flume:

Interceptors are designed to modify or drop an event of data in flight. Flume is designed to pick up data from a source and drop it into a sink.

Timestamp interceptor: adds the timestamp at which the event was processed to the event headers.

Host interceptor: writes the hostname or IP address of the host on which the agent or process is running to the event headers.

Static interceptor: adds a static string with a static header to all events.

UUID interceptor: Universally Unique Identifier; this sets a UUID on all events that are intercepted.

Search and Replace interceptor: searches for a string in the event data and replaces it with a value.

Regex filtering interceptor: used to include/exclude events; it filters events selectively by interpreting the event body as text and matching it against a configured regular expression.

Regex extractor interceptor: extracts matches of a configured regular expression from the event body.

45.) Different types of distributed file systems:

HDFS – Hadoop Distributed File system

GFS – Google File System

MapR File system

Ceph File system

IBM General Parallel file system (GPFS)

46.) Write a Pig script to read a Hive table.

First we need to enter the Pig shell with the useHCatalog option (pig -useHCatalog).

A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

A = LOAD 'airline.airdata' USING org.apache.hive.hcatalog.pig.HCatLoader();

47.) Predefined value in Sqoop to extract data from a database for the current date minus one

sqoop import --connect jdbc:mysql://localhost/database --table table_name --where "time_stamp > day(now()-1)"

48.) Are the keywords UNION, UNION ALL, MINUS and INTERSECT available in Hive?

Hive supports: select_statement UNION [ALL | DISTINCT] select_statement

The MINUS and INTERSECT keywords are not available in older versions of Hive.

49.) Difference between Distribute by, cluster by, order by, sort by

Distribute by – distributes the data among n reducers (in an unsorted manner).

Cluster by – distributes the data among n reducers and sorts it within each reducer (distribute by + sort by).

Order by – sorts the data across all reducers.

Sort by – sorts the data within each reducer.

