Big-Data Interview Questions

  • What do the four V’s of Big Data denote?

The four critical features of big data:
Volume – Scale of data
Velocity – Analysis of streaming data
Variety – Different forms of data
Veracity – Uncertainty of data

  • How does big data analysis help businesses increase their revenue? Give an example.

Big data analysis helps businesses differentiate themselves. For example, Walmart, the world’s largest retailer by revenue in 2014, uses big data analytics to increase its sales through better predictive analytics, customized recommendations, and new products launched on the basis of customer preferences and needs. Walmart observed a 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. Many other companies, such as Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, and Bank of America, use big data analytics to boost their revenue.

  • Name some companies that use Hadoop

Yahoo (one of the biggest users of Hadoop and the contributor of more than 80% of its code)

  • Differentiate between Structured and Unstructured data

Data that can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, and web logs are all examples of unstructured data.

  • On what concept does the Hadoop framework work?

The Hadoop framework works on the following two core components:

HDFS – The Hadoop Distributed File System is a Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.

Hadoop MapReduce – This is the Java-based programming paradigm of the Hadoop framework that provides scalability across a Hadoop cluster. MapReduce distributes the workload into tasks that can run in parallel. A Hadoop job performs two separate tasks: the map job and the reduce job. The map job breaks the data sets down into key-value pairs, or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
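
The map and reduce steps described above can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names (map_phase, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single, smaller result."""
    return key, sum(values)

def run_job(lines):
    # Group the map output by key; in Hadoop the framework does this
    # between the map and the reduce step.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    # The reduce step only starts once the map output is available.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

counts = run_job(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

On a real cluster the map calls would run in parallel on different nodes; here they run one after another.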

  • What are the main components of a Hadoop Application?

Hadoop applications draw on a wide range of technologies that help in solving complex business problems.
Core components of a Hadoop application are:
Hadoop Common
HDFS
Hadoop MapReduce
YARN
Data Access Components – Pig and Hive
Data Storage Component – HBase
Data Integration Components – Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components – Ambari, Oozie and Zookeeper
Data Serialization Components – Thrift and Avro
Data Intelligence Components – Apache Mahout and Drill

  • What is Hadoop streaming?

The Hadoop distribution provides a generic application programming interface for writing Map and Reduce jobs in any desired programming language, such as Python, Perl, or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any shell script or executable as the Mapper or the Reducer.
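
A streaming Mapper and Reducer exchange plain text lines, by convention one tab-separated key and value per line. The sketch below shows what such a pair might look like in Python. The function names and the word-count task are assumptions, and the last lines only imitate a local "mapper, sort, reducer" pipeline instead of running on a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit one "word<TAB>1" line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run. In a real streaming script each function would read
# from sys.stdin and print its output lines to standard output.
mapped = sorted(mapper(["big data", "big cluster"]))
result = list(reducer(mapped))
print(result)  # ['big\t2', 'cluster\t1', 'data\t1']
```

On a cluster, the two scripts would typically be passed to the Hadoop Streaming jar through its -mapper and -reducer options, together with -input and -output paths.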

  • What is the best hardware configuration to run Hadoop?

The best configuration for executing Hadoop jobs is dual-core or dual-processor machines with 4 GB or 8 GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not a low-end component. ECC memory is recommended because many Hadoop users have experienced checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

  • What are the most commonly defined input formats in Hadoop?

The most common Input Formats defined in Hadoop are:

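Text Input Format – This is the default input format defined in Hadoop. Each line of a plain text file is one record; the key is the byte offset of the line and the value is the content of the line.
Key Value Input Format – This input format is used for plain text files in which each line is split into a key and a value at a separator (a tab by default).
Sequence File Input Format – This input format is used for reading sequence files, Hadoop’s binary key-value file format.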
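
To make the difference between the two text-based formats concrete, here is a small Python sketch of the records each would hand to a mapper. It is an illustration only: the function names are made up, and the tab separator is assumed because it is the usual default for the key-value format.

```python
def text_input_records(data: bytes):
    """Default text format: key = byte offset of the line, value = the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

def key_value_records(data: bytes, separator: str = "\t"):
    """Key-value text format: split each line at the first separator."""
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        yield key, value

sample = b"id1\tfirst row\nid2\tsecond row\n"
print(list(text_input_records(sample)))
# [(0, 'id1\tfirst row'), (14, 'id2\tsecond row')]
print(list(key_value_records(sample)))
# [('id1', 'first row'), ('id2', 'second row')]
```

A line without a separator becomes a key with an empty value in this sketch.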


  • What are the steps involved in deploying a big data solution?

Data Ingestion – The first step in deploying a big data solution is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images, and social media feeds. This data needs to be stored in HDFS. Data can be ingested either through batch jobs that run every 15 minutes, once every night and so on, or through real-time streaming with latencies from 100 ms to 120 seconds.

Data Storage – The next step after ingesting the data is to store it either in HDFS or in a NoSQL database like HBase. HBase storage works well for random read/write access, whereas HDFS is optimized for sequential access.

Data Processing – The final step is to process the data using one of the processing frameworks such as MapReduce, Spark, Pig, or Hive.

  • How will you choose various file formats for storing and processing data using Apache Hadoop?

The decision to choose a particular file format is based on the following factors:

Schema evolution to add, alter and rename fields.
Usage pattern, such as accessing 5 columns out of 50 versus accessing most of the columns.
Splittability, so that the file can be processed in parallel.
Read/write/transfer performance versus the storage space saved by block compression.

File formats that can be used with Hadoop – CSV, JSON, Columnar, Sequence files, AVRO, and Parquet file.

CSV Files
CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable not to use header and footer lines.

  • What is Big Data?

Big data is defined as a voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by high velocity, volume and variety, which require cost-effective and innovative methods of information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data.

  • What is a block and block scanner in HDFS?

Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

Block Scanner – The Block Scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block Scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
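
The following Python sketch illustrates both ideas with simple arithmetic. It assumes the block size is passed in explicitly, since it is a configurable value, and it uses CRC32 from the standard library as a stand-in for the checksums a block scanner would verify. All names are hypothetical.

```python
import zlib

MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int):
    """Return the size in bytes of each block a file of file_size would use."""
    full, remainder = divmod(file_size, block_size)
    # The last block only holds the bytes that are left over.
    return [block_size] * full + ([remainder] if remainder else [])

def verify_block(block: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum of a block and compare it with the stored one."""
    return zlib.crc32(block) == stored_checksum

print([b // MB for b in split_into_blocks(300 * MB, 64 * MB)])
# [64, 64, 64, 64, 44]
print([b // MB for b in split_into_blocks(300 * MB, 128 * MB)])
# [128, 128, 44]

block = b"hello block"
stored = zlib.crc32(block)
print(verify_block(block, stored), verify_block(b"hellO block", stored))
# True False
```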

  • Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

NameNode – The NameNode is at the heart of the HDFS file system and manages the metadata: it does not store the data of the files, but holds the directory tree of all the files present in HDFS on a Hadoop cluster. The NameNode uses two files for the namespace:

fsimage file – It keeps track of the latest checkpoint of the namespace.

edits file – It is a log of the changes that have been made to the namespace since the checkpoint.

Checkpoint Node – The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.

Backup Node – The Backup Node also provides checkpointing functionality like that of the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
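
The checkpoint step, merging the edits log into the fsimage, can be pictured with a toy model in Python. This is a rough sketch under strong assumptions: the namespace is reduced to a set of paths and the edit log to a list of ("create" or "delete", path) entries, neither of which matches the real on-disk formats.

```python
def apply_edits(fsimage: set, edits: list) -> set:
    """Replay an edit log on top of a namespace snapshot."""
    namespace = set(fsimage)  # work on a copy, keep the old checkpoint intact
    for op, path in edits:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

fsimage = {"/data", "/data/a.txt"}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
new_fsimage = apply_edits(fsimage, edits)
print(sorted(new_fsimage))  # ['/data', '/data/b.txt']
```

After the merge, the new image replaces the old checkpoint and the edit log can start again from empty.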

  • What is commodity hardware?

Commodity hardware refers to inexpensive systems that do not offer high availability or high quality. Commodity hardware still includes RAM, because some services need to run in memory. Hadoop can run on any commodity hardware and does not require supercomputers or a high-end hardware configuration to execute jobs.

  • What is the port number for NameNode, Task Tracker and Job Tracker?

These are the default web UI ports in Hadoop 1.x:
NameNode 50070
Job Tracker 50030
Task Tracker 50060

  • Explain about the process of inter cluster data copying.

HDFS provides a distributed data copying facility, DistCp, for copying from a source to a destination. If the copy takes place within a single Hadoop cluster, it is referred to as intra-cluster data copying; if it takes place between two clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to run the same or a compatible version of Hadoop.

  • How can you overwrite the replication factors in HDFS?

The replication factor in HDFS can be modified or overwritten in 2 ways:

Using the Hadoop FS shell, the replication factor can be changed on a per-file basis with the following command:
$ hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
Using the Hadoop FS shell, the replication factor of all files under a given directory can be changed with the following command:
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all the files in it will have their replication factor set to 5)

  • Explain the difference between NAS and HDFS.

NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of machines, so data is redundant because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computation is moved to the data.

  • Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.

Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times each block is replicated, which ensures high data availability. For every block stored in HDFS with a replication factor of n, the cluster holds n-1 duplicate blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, only a single copy of the data exists. In that case, if the DataNode holding it crashes, the only copy of the data is lost.
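
The effect of the replication factor can be worked out with a short calculation. The Python sketch below is illustrative only; the function name and the 100 MB file are assumptions, and it presumes that every replica sits on a different DataNode.

```python
def replication_summary(file_size_mb: int, replication: int) -> dict:
    """Summarize storage cost and failure tolerance for one file."""
    return {
        "raw_storage_mb": file_size_mb * replication,
        "duplicate_copies": replication - 1,
        # The data survives as long as one replica remains available.
        "datanode_failures_tolerated": replication - 1,
    }

print(replication_summary(100, 3))
# {'raw_storage_mb': 300, 'duplicate_copies': 2, 'datanode_failures_tolerated': 2}
print(replication_summary(100, 1))
# {'raw_storage_mb': 100, 'duplicate_copies': 0, 'datanode_failures_tolerated': 0}
```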

  • What is the process to change the files at arbitrary locations in HDFS?

HDFS does not support modifications at arbitrary offsets in a file, nor multiple writers. Files are written by a single writer in append-only format, i.e. writes to a file in HDFS are always made at the end of the file.

  • Explain about the indexing process in HDFS.

The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which in turn points to the address where the next chunk of data is stored.

  • What is a rack awareness and on what basis is data stored in a rack?

All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.

The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting the NameNode, the client is allocated 3 data nodes for each data block. For each data block, two copies are stored in one rack and the third copy is stored in another rack. This is generally referred to as the Replica Placement Policy.
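
The placement rule, two copies in one rack and the third in another, can be sketched as a small selection function. This is a simplified illustration in Python, not the NameNode's actual algorithm; the cluster layout, the node names, and the deterministic "take the first nodes" choice are all assumptions.

```python
def place_replicas(nodes_by_rack: dict, first_rack: str) -> list:
    """Pick three DataNodes: two in first_rack and one in another rack."""
    same_rack = nodes_by_rack[first_rack]
    if len(same_rack) < 2:
        raise ValueError("need at least two DataNodes in the first rack")
    other_racks = [rack for rack in sorted(nodes_by_rack)
                   if rack != first_rack and nodes_by_rack[rack]]
    if not other_racks:
        raise ValueError("need a second rack for the third replica")
    return [same_rack[0], same_rack[1], nodes_by_rack[other_racks[0]][0]]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas(cluster, "rack1")
print(placement)  # ['dn1', 'dn2', 'dn3']
```

With this layout the block stays readable even if a whole rack becomes unreachable.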


  • What happens to a NameNode that has no data?

There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.

  • What happens when a user submits a Hadoop job when the NameNode is down: is the job put on hold, or does it fail?

The Hadoop job fails when the NameNode is down.

  • What happens when a user submits a Hadoop job when the Job Tracker is down: is the job put on hold, or does it fail?

The Hadoop job fails when the Job Tracker is down.

  • Whenever a client submits a Hadoop job, who receives it?

The JobTracker receives the Hadoop job. It consults the NameNode, which looks up the data requested by the client and provides the block information. The JobTracker then takes care of resource allocation for the job to ensure timely completion.

  • What do you understand by edge nodes in Hadoop?

Edge nodes are the interface between the Hadoop cluster and the external network. They are used for running cluster administration tools and client applications. Edge nodes are also referred to as gateway nodes.

  • Explain the usage of Context Object

The Context object helps the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress, and providing application-level status updates. The Context object holds the configuration details for the job, and also the interfaces that help it generate the output.

  • What are the core methods of a Reducer?

The 3 core methods of a reducer are:

setup() – This method is used for configuring various parameters such as the input data size, distributed cache, and heap size.
Function definition – public void setup(context)
reduce() – This is the heart of the reducer. It is called once per key, with the list of values associated with that key.
Function definition – public void reduce(Key, Value, context)
cleanup() – This method is called only once, at the end of the reduce task, for clearing all the temporary files.
Function definition – public void cleanup(context)
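
The order in which these three methods are called can be shown with a small Python imitation of a reduce task. It is a sketch only, not the Hadoop Java API: the SumReducer class, the run_reduce_task driver, and the dictionary used as a context are all hypothetical.

```python
class SumReducer:
    def setup(self, context):
        # Called once before any key is processed.
        self.keys_seen = 0

    def reduce(self, key, values, context):
        # Called once per key with all values of that key.
        self.keys_seen += 1
        context["output"].append((key, sum(values)))

    def cleanup(self, context):
        # Called once after the last key.
        context["counters"]["keys"] = self.keys_seen

def run_reduce_task(reducer, grouped_input):
    """Drive one reduce task: setup, then reduce per key, then cleanup."""
    context = {"output": [], "counters": {}}
    reducer.setup(context)
    for key, values in grouped_input:
        reducer.reduce(key, values, context)
    reducer.cleanup(context)
    return context

context = run_reduce_task(SumReducer(), [("big", [1, 1]), ("data", [1])])
print(context)
# {'output': [('big', 2), ('data', 1)], 'counters': {'keys': 2}}
```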

  • Explain about the partitioning, shuffle and sort phase

Shuffle Phase – Once the first map tasks are completed, the nodes continue to perform further map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of the map tasks to the reducers is referred to as shuffling.

Sort Phase – Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.

Partitioning Phase – The process that determines which intermediate keys and values are received by each reducer instance is referred to as partitioning. The destination partition of a key is the same irrespective of the mapper instance that generated it.
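
The three phases can be imitated in one Python function. This is a single-process sketch under stated assumptions: the partition is chosen with a CRC32 hash of the key modulo the number of reducers (a stand-in for a hash partitioner), and the function names are made up.

```python
import zlib
from collections import defaultdict

def get_partition(key: str, num_reducers: int) -> int:
    """Same key, same partition, whichever mapper produced it."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers):
    """Route every (key, value) pair to its reducer, then sort by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for mapper_output in map_outputs:          # one list per map task
        for key, value in mapper_output:
            partitions[get_partition(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order.
    return [sorted(partition.items()) for partition in partitions]

map_outputs = [[("big", 1), ("data", 1)], [("big", 1), ("node", 1)]]
reducer_inputs = shuffle_and_sort(map_outputs, 2)
for index, reducer_input in enumerate(reducer_inputs):
    print(index, reducer_input)
```

Both occurrences of "big" end up with the same reducer, even though they were emitted by different map tasks.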

  • How to write a custom partitioner for a Hadoop MapReduce job?

Steps to write a custom partitioner for a Hadoop MapReduce job:

A new class must be created that extends the pre-defined Partitioner class.
The getPartition method of the Partitioner class must be overridden.
The custom partitioner can be added to the job as a config file in the wrapper that runs Hadoop MapReduce, or it can be set on the job with the setPartitionerClass method.

  • What is the use of RecordReader in Hadoop?

Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row1: Welcome to
Row2: Intellipaat
It will be read as “Welcome to Intellipaat” using RecordReader.
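
How a record that straddles a split boundary can still be read whole is easier to see in code. The Python sketch below is a simplified illustration for newline-terminated records, loosely inspired by line-oriented readers and not taken from Hadoop itself. The function name and the 11-byte split size are assumptions.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (offset, line) for the lines that belong to one split."""
    end = min(start + length, len(data))
    pos = start
    if start != 0:
        # Skip to the next line start; the line in progress (or the line
        # beginning exactly here) is read by the previous split's reader.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # A line may run past the end of the split; it is still read whole.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1

data = b"Welcome to Intellipaat\nHadoop\n"
splits = [(0, 11), (11, 11), (22, 11)]
records = [list(read_split(data, start, length)) for start, length in splits]
print(records)
# [[(0, 'Welcome to Intellipaat')], [], [(23, 'Hadoop')]]
```

The first split ends in the middle of "Welcome to Intellipaat", yet its reader returns the full line, and the second split produces nothing because the only line it overlaps has already been read.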

  • How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, check whether any orphaned jobs are running; if there are, determine the location of the ResourceManager (RM) logs.
Run: “ps -ef | grep -i ResourceManager”
and look for the log directory in the displayed result. Find the job-id in the displayed list and check whether there is any error message associated with that job.
On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
Now, log in to that node and run: “ps -ef | grep -i NodeManager”
Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

  • What is the difference between Map Side join and Reduce Side Join?

A map-side join is performed when the data reaches the map stage. It requires a strict structure for the input data. A reduce-side join (repartitioned join), on the other hand, is simpler than a map-side join, since the input datasets need not be structured. However, it is less efficient, because the data has to go through the sort and shuffle phases, which adds network overhead.
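
The two join styles can be contrasted with a short Python sketch. It is an illustration under assumptions, not Hadoop code: the map-side variant shown here is the case where one dataset is small enough to be held in memory as a lookup table, and the sample orders and customers are invented.

```python
from collections import defaultdict

def map_side_join(big_records, small_table: dict):
    """Join inside the map step by looking keys up in an in-memory table."""
    return [(key, value, small_table[key])
            for key, value in big_records if key in small_table]

def reduce_side_join(left, right):
    """Tag records by source, group them by key, then join per key."""
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        grouped[key]["L"].append(value)   # this grouping stands in for
    for key, value in right:              # the shuffle and sort phases
        grouped[key]["R"].append(value)
    return [(key, l, r)
            for key in sorted(grouped)
            for l in grouped[key]["L"]
            for r in grouped[key]["R"]]

orders = [(1, "book"), (2, "pen"), (3, "ink")]
customers = {1: "Ann", 2: "Bob"}
print(map_side_join(orders, customers))
# [(1, 'book', 'Ann'), (2, 'pen', 'Bob')]
print(reduce_side_join(orders, list(customers.items())))
# [(1, 'book', 'Ann'), (2, 'pen', 'Bob')]
```

Both functions return the same rows; the difference is that the reduce-side version has to move and group every record first.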

  • What is Job Tracker role in Hadoop?

The Job Tracker’s primary functions are resource management (managing the Task Trackers), tracking resource availability, and task life cycle management (tracking task progress and fault tolerance).
It is a process that usually runs on a separate node rather than on a DataNode.
The Job Tracker communicates with the NameNode to identify the data location.
It finds the best Task Tracker nodes to execute the tasks.
It monitors individual Task Trackers and submits the overall job status back to the client.
It tracks the execution of MapReduce workloads local to the slave node.

