Hadoop Interview Questions
Apache Hadoop is the solution for dealing with Big Data. It is an open-source framework that offers several tools and services to store, manage, process, and analyze Big Data. This allows organizations to make significant business decisions in an effective and efficient manner, which was not possible with traditional methods and systems.
The different Hadoop configuration files include the following (a short sketch of how these properties can be read programmatically follows the list) :
 
* hadoop-env.sh
* mapred-site.xml
* core-site.xml
* yarn-site.xml
* hdfs-site.xml
* Master and Slaves
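The properties defined in these XML files (for example, fs.defaultFS in core-site.xml and dfs.replication in hdfs-site.xml) are exposed to Java programs through the org.apache.hadoop.conf.Configuration class. The snippet below is a minimal sketch, not part of the original article; the fallback values passed to get()/getInt() are illustrative.

import org.apache.hadoop.conf.Configuration;

public class ConfigProbe {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads core-default.xml and core-site.xml
        conf.addResource("hdfs-site.xml");          // pull in the HDFS-specific settings as well
        // "fs.defaultFS" comes from core-site.xml; "file:///" is just an illustrative fallback.
        String fsUri = conf.get("fs.defaultFS", "file:///");
        // "dfs.replication" comes from hdfs-site.xml; 3 is the usual default.
        int replication = conf.getInt("dfs.replication", 3);
        System.out.println("fs.defaultFS = " + fsUri);
        System.out.println("dfs.replication = " + replication);
    }
}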
Hadoop can be run in three modes :
 
Standalone mode : The default mode of Hadoop, uses a local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, there is no custom configuration required for mapred-site.xml, core-site.xml, and hdfs-site.xml files. This mode works much faster when compared to other modes.

Pseudo-distributed mode (Single-node Cluster) : In this mode, you need to configure all three files mentioned above. All daemons run on one node, so the Master and Slave nodes are the same.

Fully distributed mode (Multi-node Cluster) : This is the production phase of Hadoop (what Hadoop is known for) where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slave.
Regular FileSystem : In regular FileSystem, data is maintained in a single system. If the machine crashes, data recovery is challenging due to low fault tolerance. Seek time is more and hence it takes more time to process the data.

HDFS : Data is distributed and maintained on multiple systems. If a DataNode crashes, data can still be recovered from other nodes in the cluster. However, reading data takes comparatively more time, since data is read from the local disk of each node and then coordinated across multiple systems.
HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes.
“Big Data” is the term for a collection of data sets so large and complex that it becomes difficult to process them using relational database management tools or traditional data processing applications. Big Data is difficult to capture, curate, store, search, share, transfer, analyze, and visualize. It has also emerged as an opportunity for companies: they can now derive value from their data and gain a distinct advantage over their competitors through enhanced business decision-making capabilities.
 
Tip : It will be a good idea to talk about the 5Vs in such questions, whether it is asked specifically or not!
 
Volume : The volume represents the amount of data which is growing at an exponential rate i.e. in Petabytes and Exabytes. 

Velocity : Velocity refers to the rate at which data is growing, which is very fast. Data generated yesterday is already considered old today. Nowadays, social media is a major contributor to the velocity of growing data.

Variety : Variety refers to the heterogeneity of data types. In other words, the data that is gathered comes in a variety of formats such as videos, audio, CSV, etc. These various formats represent the variety of data.

Veracity : Veracity refers to doubt or uncertainty about the available data due to data inconsistency and incompleteness. The data available can sometimes be messy and difficult to trust. With many forms of Big Data, quality and accuracy are difficult to control, and the sheer volume is often the reason behind the lack of quality and accuracy.

Value : It is all well and good to have access to Big Data, but unless we can turn it into value, it is useless. Turning it into value means asking: is it adding to the benefits of the organization? Is the organization working on Big Data achieving a high ROI (return on investment)? Unless working on Big Data adds to its profits, it is of little use.

As we know Big Data is growing at an accelerating rate, so the factors associated with it are also evolving.
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.
 
 * NameNode : NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors etc.

 * DataNode : DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.


YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
 
 * ResourceManager : It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.

 * NodeManager : NodeManager is installed on every DataNode and it is responsible for the execution of the task on every single DataNode.
 
 
 
In this question, first explain NAS and HDFS, and then compare their features as follows :
 
* Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides services for storing and accessing files. The Hadoop Distributed File System (HDFS), on the other hand, is a distributed filesystem that stores data using commodity hardware.

* In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS data is stored on dedicated hardware.

* HDFS is designed to work with MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.

* HDFS uses commodity hardware, which is cost-effective, whereas NAS uses high-end storage devices, which come at a higher cost.
 
* Only one NameNode can be configured.
* The Secondary NameNode only takes hourly backups of the NameNode's metadata.
* It is only suitable for Batch Processing of a vast amount of Data, which is already in the Hadoop System.
* It is not ideal for Real-time Data Processing.
* It supports up to 4000 Nodes per Cluster.
* It has a single component: JobTracker to perform many activities like Resource Management, Job Scheduling, Job Monitoring, Re-scheduling Jobs etc.
* JobTracker is the single point of failure.
* It supports only one NameNode and one Namespace per cluster.
* It does not help the Horizontal Scalability of NameNode.
* It runs only Map/Reduce jobs.
Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.
Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important. Big data analysis provides some early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.
Assuming the default block size of 64 MB, Hadoop makes 5 splits as follows :
 
* One split for the 64KB file
* Two splits for the 65MB file, and
* Two splits for the 127MB file
 
WebDAV is a set of extensions to HTTP that supports editing and uploading files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Sqoop is a tool used to transfer data between a Relational Database Management System (RDBMS) and Hadoop HDFS. Using Sqoop, you can import data from an RDBMS such as MySQL or Oracle into HDFS, as well as export data from HDFS back to an RDBMS.
Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high-performance, and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today.
 
Here are some of the instances where Hadoop is used : 
 
* Managing traffic on streets
* Streaming processing
* Content management and archiving e-mails
* Processing rat brain neuronal signals using a Hadoop computing cluster
* Fraud detection and prevention
* Advertisements targeting platforms are using Hadoop to capture and analyze clickstream, transaction, video, and social media data
* Managing content, posts, images, and videos on social media platforms
* Analyzing customer data in real-time for improving business performance
* Public sector fields such as intelligence, defense, cyber security, and scientific research
* Getting access to unstructured data such as output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data
 
 
 
 
A Combiner is a mini version of a reducer that performs local reduction. It runs on each mapper's output on the same node before that output is sent to the reducer. This reduces the quantum of data that needs to be transferred to the reducers, improving the efficiency of MapReduce.
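As an illustrative sketch (the class name IntSumReducer is an assumption, not from the article), the same summing reducer used in the classic word count can also be registered as the combiner, because summing partial counts locally gives the same final result :

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word; usable both as the reducer and as the combiner.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// In the job driver, one extra line enables the local reduction on each mapper node:
//     job.setCombinerClass(IntSumReducer.class);
//     job.setReducerClass(IntSumReducer.class);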
Apache Spark is an open-source framework engine known for its speed and ease of use in big data processing and analysis. It also provides built-in modules for graph processing, machine learning, streaming, SQL, etc. The execution engine of Apache Spark supports in-memory computation and cyclic data flow. It can also access diverse data sources such as HBase, HDFS, Cassandra, etc.
Hadoop is a distributed file system that lets you store and handle massive amounts of data on a cloud of machines, handling data redundancy.
 
The primary benefit of this is that since data is stored in several nodes, it is better to process it in a distributed manner. Each node can process the data stored on it instead of spending time moving the data over the network.
 
On the contrary, in the relational database computing system, we can query data in real-time, but it is not efficient to store data in tables, records, and columns when the data is huge.
 
Hadoop also provides a scheme to build a column database with Hadoop HBase for runtime queries on rows.
 
Listed below are the main components of Hadoop :
 
* HDFS : HDFS or Hadoop Distributed File System is Hadoop’s storage unit.
* MapReduce : MapReduce is Hadoop's processing unit.
* YARN : YARN is the resource management unit of Apache Hadoop.
Distributed cache in Hadoop is a service by MapReduce framework to cache files when needed.
 
Once a file is cached for a specific job, Hadoop makes it available on each DataNode (on disk and in memory) where map and reduce tasks are executed. You can then easily read the cached file and populate any collection (such as an array or hashmap) in your code, as shown in the sketch after the list below.
 
Benefits of using distributed cache are as follows :
 
* It distributes simple, read-only text/data files and/or complex types such as jars, archives, and others. These archives are then un-archived at the slave node.

* Distributed cache tracks the modification timestamps of cached files, which ensures that the files are not modified while a job is executing.
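The sketch below is illustrative only (the file path, the symlink name "lookup", and the LookupMapper class are assumptions): a small lookup file is registered with the job and read into memory in the mapper's setup() method.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    // In the driver, the file is registered before the job is submitted:
    //     job.addCacheFile(new java.net.URI("/user/hadoop/lookup.txt#lookup"));
    // The "#lookup" fragment creates a symlink named "lookup" in each task's
    // working directory on every node.

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);   // populate an in-memory collection
                }
            }
        }
    }
}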
Following are the components of the Region Server of HBase :
 
BlockCache : It resides on Region Server and stores data in the memory that is read frequently.
WAL : WAL or Write Ahead Log is a file that is attached to each Region Server located in the distributed environment.
MemStore : MemStore is the write cache that stores the input data before it is stored in the disk or permanent memory.
HFile : The HFile is stored in HDFS and holds the actual cells on disk.
NameNode :  is the core of HDFS that manages the metadata—the information of which file maps to which block locations and which blocks are stored on which DataNode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for namespace:
  * fsimage file : It keeps track of the latest checkpoint of the namespace.
  * edits file : It is a log of changes that have been made to the namespace since Checkpoint.

Checkpoint NameNode : has the same directory structure as the NameNode and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in a local directory. The new image after merging is then uploaded to the NameNode. There is a similar node, commonly known as the Secondary NameNode, but it does not support the ‘upload to NameNode’ functionality.

Backup Node : provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of the file system namespace and doesn’t require getting hold of changes after regular intervals. The Backup Node needs to save the current state in-memory to an image file to create a new Checkpoint.
The various Features of Hadoop are :
 
* Open Source : Apache Hadoop is an open-source software framework. Open source means it is freely available, and we can even change its source code as per our requirements.

* Distributed processing : HDFS stores data in a distributed manner across the cluster, and MapReduce processes the data in parallel on the cluster of nodes.

* Fault Tolerance : Apache Hadoop is highly fault-tolerant. By default, each block has 3 replicas across the cluster, and we can change this as per the requirement. So if any node goes down, its data can be recovered from another node. The framework recovers from node or task failures automatically.

* Reliability : It stores data reliably on the cluster despite machine failure.

* High Availability : Data is highly available and accessible despite hardware failure. In Hadoop, when a machine or hardware crashes, then we can access data from another path.

* Scalability : Hadoop is highly scalable, as new hardware (nodes) can be added to the cluster.

* Economic : Hadoop runs on a cluster of commodity hardware which is not very expensive. We do not need any specialized machine for it.

* Easy to use : The client does not need to deal with distributed computing; the framework takes care of all of it, so Hadoop is easy to use.
 
Just like in Standalone mode, Hadoop can also run on a single node in this mode. The difference is that each Hadoop daemon runs in a separate Java process. In Pseudo-distributed mode, we need to configure the files mentioned above. All daemons run on one node, and thus both the Master and Slave nodes are the same.

Pseudo-distributed mode is suitable for both development and testing environments. In this mode, all the daemons run on the same machine.
* In Hadoop 2, the minimum supported version of Java is Java 7, while in Hadoop 3 it is Java 8.

* Hadoop 2 handles fault tolerance through replication (which wastes storage space), while Hadoop 3 handles it through erasure coding.

* For data balancing, Hadoop 2 uses the HDFS balancer, while Hadoop 3 uses the intra-DataNode balancer.

* In Hadoop 2, some default ports fall within the Linux ephemeral port range, so they may fail to bind at startup. In Hadoop 3, these ports have been moved out of the ephemeral range.

* In Hadoop 2, HDFS has a 200% overhead in storage space, while Hadoop 3 has about a 50% overhead.

* Hadoop 2 has features to overcome the SPOF (single point of failure), so whenever the NameNode fails, it recovers automatically. Hadoop 3 also recovers from the SPOF automatically, with no manual intervention needed.
A major drawback of Hadoop was cross-switch network traffic caused by the huge volume of data. Data locality was introduced to overcome this drawback. It refers to the ability to move the computation close to where the actual data resides on a node, instead of moving large amounts of data to the computation. Data locality increases the overall throughput of the system.

In Hadoop, HDFS stores datasets. Datasets are divided into blocks and stored across the DataNodes in the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes on which the data relevant to the job is available.

Data locality has three categories :
 
Data local : The data is on the same node as the mapper working on it, so the data is as close to the computation as possible. This is the most preferred scenario.

Intra-rack : The mapper runs on a different node but on the same rack as the data. This happens when it is not possible to execute the mapper on the same DataNode due to constraints.

Inter-rack : The mapper runs on a node in a different rack. This happens when it is not possible to execute the mapper on any node in the same rack due to resource constraints.
Safemode in Apache Hadoop is a maintenance state of the NameNode during which the NameNode doesn't allow any modifications to the file system. In Safemode, the HDFS cluster is read-only and doesn't replicate or delete blocks. At the startup of the NameNode :
 
  * It loads the file system namespace from the last saved FsImage and the edits log file into its main memory.
  * It merges the edits log into the FsImage, resulting in a new file system namespace.
  * It then receives block reports containing information about block locations from all the DataNodes.

In Safemode, the NameNode collects block reports from the DataNodes. The NameNode enters Safemode automatically during startup and leaves Safemode after the DataNodes have reported that most blocks are available. Use the commands :

hadoop dfsadmin -safemode get : To know the status of Safemode
bin/hadoop dfsadmin -safemode enter : To enter Safemode
hadoop dfsadmin -safemode leave : To come out of Safemode

NameNode front page shows whether safemode is on or off.
 
Apache Hadoop achieves security by using Kerberos.

At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server (a minimal client login sketch follows the list).
 
* Authentication : The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).

* Authorization : The client uses the TGT to request a service ticket from the Ticket Granting Server.

* Service Request : The client uses the service ticket to authenticate itself to the server.
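From a Hadoop client's point of view, only the initial login is usually explicit; the service-ticket exchanges are handled by Hadoop's RPC layer. The sketch below is illustrative only (the principal and keytab path are assumptions) and uses Hadoop's UserGroupInformation API :

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Usually already set in core-site.xml; shown explicitly here for clarity.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Authentication step: log in from a keytab, which obtains the TGT.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM",                     // hypothetical principal
                "/etc/security/keytabs/analyst.keytab");   // hypothetical keytab path
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
        // The authorization and service-request steps (service tickets) happen
        // transparently on subsequent HDFS or YARN calls.
    }
}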
For processing large data sets in parallel across a Hadoop cluster, Hadoop MapReduce framework is used. Data analysis uses a two-step map and reduce process.
In the classic word-count example of MapReduce, the map phase counts the words in each document, while the reduce phase aggregates the counts per word across the entire collection. During the map phase, the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop cluster (a minimal mapper sketch follows).
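A minimal sketch of the map side of such a word count (the class name TokenizerMapper is an assumption); each map task emits a (word, 1) pair per occurrence, and the reduce phase then sums these pairs per word :

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // one (word, 1) record per word occurrence
        }
    }
}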
The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in Hadoop Cluster, Distributed Cache is used. The files could be an executable jar files or simple properties file.
In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
 
The JobTracker performs the following actions in Hadoop :
 
* Client applications submit jobs to the JobTracker.
* The JobTracker communicates with the NameNode to determine the data location.
* The JobTracker locates TaskTracker nodes near the data or with available slots.
* It submits the work to the chosen TaskTracker nodes.
* When a task fails, the JobTracker is notified and decides how to proceed.
* The JobTracker monitors the TaskTracker nodes.
A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it assumes there is an issue with the DataNode or TaskTracker.
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs. It is a generic API that allows programs written in virtually any language to be used as the Hadoop mapper or reducer.
Hive is an open-source system that processes structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and facilitate analysis and queries. Hive enables SQL developers to write Hive Query Language (HQL) statements, similar to standard SQL statements, for data query and analysis. It was created to make MapReduce programming easier, so that you do not need to know Java or write lengthy Java code.
MapReduce requires programs to be translated into map and reduce stages. As not all data analysts are accustomed to MapReduce, Yahoo researchers introduced Apache Pig to bridge the gap. Apache Pig was built on top of Hadoop, providing a high level of abstraction and enabling programmers to spend less time writing complex MapReduce programs.
The Apache Pig architecture includes a Pig Latin interpreter that uses Pig Latin scripts to process and analyze massive datasets. Programmers use the Pig Latin language to examine huge datasets in the Hadoop environment. Apache Pig has a rich set of operators for data operations like join, filter, sort, load, group, etc.

Programmers write scripts in the Pig Latin language to perform a particular task. Pig transforms these scripts into a series of MapReduce jobs, reducing the programmers' work. Pig Latin programs are executed via various mechanisms such as UDFs, embedded programs, and the Grunt shell.
 
Apache Pig architecture consists of the following major components :
 
Parser : The Parser handles the Pig Scripts and checks the syntax of the script.
Optimizer : The optimizer receives the logical plan (a DAG) and carries out logical optimizations such as projection and pushdown.
Compiler : The compiler converts the logical plan into a series of MapReduce jobs.
Execution Engine : In the end, the MapReduce jobs get submitted to Hadoop in sorted order.
Execution Mode : Apache Pig is executed in local and Map Reduce modes. The selection of execution mode depends on where the data is stored and where you want to run the Pig script.
Apache ZooKeeper is an open-source service that supports coordinating and managing a huge set of hosts. Management and coordination in a distributed environment are complex; ZooKeeper automates this process and enables developers to concentrate on building software features rather than worrying about its distributed nature.

ZooKeeper helps to maintain configuration information, naming, and group services for distributed applications. It implements various protocols on the cluster so that applications do not have to implement them on their own. It provides a single coherent view of multiple machines.
Simple distributed coordination process: The coordination process among all nodes in Zookeeper is straightforward.

Synchronization : Mutual exclusion and co-operation among server processes. 

Ordered Messages : ZooKeeper stamps each update with a number denoting its order, and with the help of this, messages are ordered.

Serialization : Encode the data according to specific rules. Ensure your application runs consistently. 

Reliability : The zookeeper is very reliable. In case of an update, it keeps all the data until forwarded.

Atomicity : Data transfer either succeeds or fails, but no transaction is partial.
Persistent Znodes : The default znode type in ZooKeeper is the persistent znode. It stays in the ZooKeeper server permanently until a client explicitly deletes it.

Ephemeral Znodes : These are temporary znodes. An ephemeral znode is destroyed whenever the client that created it disconnects from the ZooKeeper server. For example, assume client1 created eznode1; once client1's session with the ZooKeeper server ends, eznode1 is destroyed.

Sequential Znodes : A sequential znode is assigned a 10-digit number, in numerical order, at the end of its name. Assume client1 creates sznode1. In the ZooKeeper server, sznode1 will be named like this:
sznode0000000001
If the client1 generates another sequential znode, it will bear the following number in a sequence. So the subsequent sequential znode is <znode name>0000000002.
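As an illustrative sketch (the paths and the localhost:2181 connection string are assumptions), the three znode types map directly onto CreateMode values in the ZooKeeper Java client :

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Persistent znode: survives until explicitly deleted.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this client's session ends.
        zk.create("/app-config/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: the server appends a 10-digit counter to the name,
        // e.g. /app-config/task-0000000001.
        String seqPath = zk.create("/app-config/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created sequential znode: " + seqPath);

        zk.close();
    }
}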
Robust : Sqoop is highly robust, easy to use, and has strong community support and contribution.

Full Load : Sqoop can load the whole table just by a single Sqoop command. It also allows us to load all the tables of the database by using a single Sqoop command.

Incremental Load : It supports incremental load functionality. Using Sqoop, we can load parts of the table whenever it is updated.

Parallel import/export : It uses the YARN framework for importing and exporting the data, which provides fault tolerance on top of parallelism.

Import results of SQL query : It allows us to import the output from the SQL query into the Hadoop Distributed File System.
It is a tool used for copying very large amounts of data to and from Hadoop file systems in parallel. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
By default, the size of an HDFS data block is 128 MB. The reasons for the large block size are as follows (a short sketch for inspecting and setting block sizes follows the list) :
 
* To reduce the cost of seeks : With large blocks, the time spent transferring data from the disk is much longer than the time spent seeking to the start of the block. As a result, data is transferred at close to the disk transfer rate.

* If blocks were small, there would be too many blocks in Hadoop HDFS and too much metadata to store. Managing such a vast number of blocks and their metadata would create overhead and lead to increased network traffic.
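As an illustrative sketch (the paths, the 256 MB block size, the 4 KB buffer, and replication 3 are assumptions), the block size of an existing file can be inspected, and an explicit block size can be requested when creating a file, through the FileSystem Java API :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Block size actually used for an existing file (128 MB by default).
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/input/data.txt"));
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Create a file with an explicit 256 MB block size (4 KB buffer, replication 3).
        long blockSize = 256L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/user/hadoop/output/big.bin"), true, 4096, (short) 3, blockSize)) {
            out.writeBytes("example payload");
        }
    }
}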
By default, the replication factor is 3. No two copies are stored on the same DataNode. Usually, the first two copies are on the same rack, and the third copy is on a different rack. It is advised to set the replication factor to at least three so that one copy is always safe, even if something happens to a rack.

We can set the default replication factor for the file system, as well as for each file and directory individually. For files that are not essential, we can lower the replication factor, while critical files should have a high replication factor (a programmatic sketch follows).
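A minimal sketch (the file path and the value 5 are assumptions) of changing a single file's replication factor through the FileSystem API; the shell equivalent, hadoop fs -setrep, appears in a later answer :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path critical = new Path("/user/hadoop/reports/monthly.csv");

        // Raise the replication factor of a critical file to 5 ...
        boolean changed = fs.setReplication(critical, (short) 5);
        System.out.println("Replication change requested: " + changed);

        // ... and read the current value back from the file status.
        System.out.println("Replication is now: " + fs.getFileStatus(critical).getReplication());
    }
}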
Hadoop provides an option by which a certain set of bad input records can be skipped when processing map inputs. Applications can control this feature through the SkipBadRecords class.

This feature can be used when map tasks fail deterministically on a particular input. This usually happens due to faults in the map function. The user would have to fix these issues.
The command used to find the status of the blocks is : hdfs fsck <path> -files -blocks
And the command used to find the file-system health is : hdfs fsck / -files -blocks -locations > dfs-fsck.log
The command used for copying data from the local system to HDFS is : hadoop fs -copyFromLocal [source] [destination]
Apache Spark stores data in memory for faster processing, which helps in developing machine learning models: many machine learning algorithms require multiple iterations and various conceptual steps over the data to create an optimized model. Similarly, graph algorithms traverse all the nodes and edges to build a graph. These low-latency workloads, which need many iterations, benefit greatly from in-memory computation.
The Metastore is used to store Hive's metadata. It uses an RDBMS together with an open-source ORM layer that converts object representations into a relational schema. It is the central repository of Apache Hive metadata: it stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and gives clients access to this information through the metastore service API. Disk storage for the Hive metadata is separate from HDFS storage.
The following commands will help you restart NameNode and all the daemons :
 
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.

You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.
In general Apache Flume architecture is composed of the following components :
 
Flume Source : A Flume source receives data from data generators such as social networking platforms like Facebook or Instagram, and transfers it to a Flume channel in the form of Flume events.

Flume Channel : The data from the Flume source is sent to an intermediate store, which buffers the events until they are transferred to a sink. This intermediate store is called the Flume channel; it is a bridge between a source and a sink. Flume supports both memory channels and file channels. The file channel is non-volatile, which means that once data is entered into the channel, it will never be lost unless you delete it. In contrast, in the memory channel, events are stored in memory, so it is volatile and data may be lost, but the memory channel is very fast.

Flume Sink : A Flume sink takes Flume events from the Flume channel and stores them in a destination such as HDFS, delivering the events either to the final store or to another agent. Various sinks like the HDFS sink, Hive sink, Thrift sink, etc. are supported by Flume.

Flume Agent : A Java process that manages a source, channel, and sink combination is called a Flume agent. Flume can have one or more agents; connected, distributed Flume agents can also collectively be called Flume.

Flume Event : An Event is the unit of data transported in Flume. The general representation of the Data Object in Flume is called Event. The event is made up of a payload of a byte array with optional headers.
Heterogeneity : The design of applications should allow the users to access services and run applications over a heterogeneous collection of computers and networks taking into consideration Hardware devices, OS, networks, Programming languages.

Transparency : Distributed system Designers must hide the complexity of the system as much as they can. Some Terms of transparency are location, access, migration, Relocation, and so on.

Openness : It is a characteristic that determines whether the system can be extended and reimplemented in various ways.

Security : Distributed system Designers must take care of confidentiality, integrity, and availability.

Scalability : A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.
The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, i.e., NameNode, DataNode, ResourceManager, NodeManager, etc.
This answer includes many points, so we will go through them sequentially.
 
We cannot perform “aggregation” (addition) in mapper because sorting does not occur in the “mapper” function. Sorting occurs only on the reducer side and without sorting aggregation cannot be done.

During “aggregation”, we need the output of all the mapper functions, which may not be possible to collect in the map phase because mappers may be running on different machines where the data blocks are stored.

And lastly, if we try to aggregate data at mapper, it requires communication between all mapper functions which may be running on different machines. So, it will consume high network bandwidth and can cause network bottlenecking.
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
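A minimal sketch of that relationship (the class name PlainTextInputFormat is an assumption): a custom InputFormat decides which RecordReader turns each split into (key, value) pairs; here the stock LineRecordReader is reused, producing (byte offset, line) pairs for the Mapper.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class PlainTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // The framework hands this class the slice of work (the InputSplit); it hands
        // back the RecordReader that knows how to read (key, value) pairs from it.
        return new LineRecordReader();
    }
}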
Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework makes it available on every DataNode where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer job.
Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
 
Atomic data types : Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[].
 
Complex Data Types : Complex data types are Tuple, Map and Bag.
Apache Hive, developed by Facebook, is a data warehouse system built on top of Hadoop and used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.
 
The “SerDe” interface allows you to instruct “Hive” about how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s row.
The components of a Region Server are :
 
WAL : Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.

Block Cache : The Block Cache resides in the Region Server and stores frequently read data in memory.

MemStore : It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.

HFile : HFile is stored in HDFS. It stores the actual cells on the disk.
Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The data in an RDD is partitioned, immutable, and distributed, and RDDs are a key component of Apache Spark.
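A minimal sketch (local mode, illustrative values) of creating and using an RDD from Java via JavaSparkContext :

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The collection is partitioned across the available cores; the RDD itself
            // is immutable, and lost partitions can be recomputed from its lineage.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
            long evens = numbers.filter(n -> n % 2 == 0).count();
            int sum = numbers.reduce(Integer::sum);
            System.out.println("evens=" + evens + ", sum=" + sum);
        }
    }
}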
To check the status of the blocks, use the command :
 
hdfs fsck <path> -files -blocks
 
To check the health status of FileSystem, use the command :
 
hdfs fsck / -files -blocks -locations > dfs-fsck.log
The commands below are used to refresh the node information while commissioning, or when the decommissioning of nodes is completed. 
 
dfsadmin -refreshNodes
 
This is used to run the HDFS client and it refreshes node configuration for the NameNode. 
 
rmadmin -refreshNodes
 
This is used to perform administrative tasks for ResourceManager.
Yes, the following are ways to change the replication of files on HDFS :
 
We can set the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, which will apply that replication factor to any new content that comes in.
 
If you want to change the replication factor for a particular file or directory, use :
 
$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path of the file
 
Example : $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv

In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information regarding over- and under-replicated blocks.
 
Under-replicated blocks : These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.
 
Consider a cluster with three nodes and replication set to three. If at any point one of the DataNodes crashes, the blocks on it become under-replicated: a replication factor was set, but there are not enough replicas to meet it. If the NameNode does not receive information about the replicas, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes.
 
Over-replicated blocks : These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.
 
Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes. 
The major components of a Pig execution environment are :
 
Pig Scripts : They are written in Pig Latin using built-in operators and UDFs, and submitted to the execution environment.

Parser : Completes type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).

Optimizer : Performs optimization using merge, transform, split, etc. Optimizer aims to reduce the amount of data in the pipeline.

Compiler : Converts the optimized code into MapReduce jobs automatically.

Execution Engine : MapReduce jobs are submitted to execution engines to generate the desired results.
Pig has three complex data types, which are primarily Tuple, Bag, and Map.
 
Tuple : A tuple is an ordered set of fields that can contain different data types for each field. It is represented by parentheses ().
 
Example : (1,3)
 
Bag  : A bag is a set of tuples represented by curly braces {}.
 
Example : {(1,4), (3,5), (4,6)}
 
Map  : A map is a set of key-value pairs used to represent data elements. It is represented in square brackets [ ].
 
Example : [key#value, key1#value1,….]

The Group statement collects various records with the same key and groups the data in one or more relations.
 
Example : Group_data = GROUP Relation_name BY AGE
 

The Order statement is used to display the contents of relation in sorted order based on one or more fields.
 
Example : Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC)
 

The Distinct statement removes duplicate records. It works on entire records, not on individual fields.
 
Example : Relation_2 = DISTINCT Relation_name1

The relational operators in Pig are as follows :
 
COGROUP : It joins two or more tables and then performs GROUP operation on the joined table result.
 
CROSS : This is used to compute the cross product (cartesian product) of two or more relations.
 
FOREACH : This will iterate through the tuples of a relation, generating a data transformation.
 
JOIN : This is used to join two or more tables in a relation.
 
LIMIT : This will limit the number of output tuples.
 
SPLIT : This will split the relation into two or more relations.
 
UNION : It will merge the contents of two relations.
 
ORDER : This is used to sort a relation based on one or more fields.
The following code is used to open a connection in HBase :
 
Configuration myConf = HBaseConfiguration.create();
 
HTableInterface usersTable = new HTable(myConf, "users");
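Continuing the snippet above as an illustrative sketch (the row key, the "info" column family, and the "name" qualifier are assumptions; org.apache.hadoop.hbase.client.Put/Get/Result and org.apache.hadoop.hbase.util.Bytes are used, and in newer HBase clients Put.addColumn replaces Put.add), a write and a read against the users table look like this :

// Write one cell to the "users" table ...
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
usersTable.put(put);

// ... then read it back.
Result result = usersTable.get(new Get(Bytes.toBytes("row1")));
String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
System.out.println(name);

usersTable.close();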

The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.
 
The following commands alter the hbase1 table and set the replication_scope to 1. A replication_scope of 0 indicates that the table is not replicated.
 
disable 'hbase1'
alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'hbase1'

Yes, it is possible to import and export tables from one HBase cluster to another. 
 
HBase export utility :

hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"
 
Example : hbase org.apache.hadoop.hbase.mapreduce.Export "employee_table" "/export/employee_table"
 
HBase import utility :

create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"
 
Example : create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
 
hbase org.apache.hadoop.hbase.mapreduce.Import "emp_table_import" "/export/employee_table"
The following code is used to list the contents of an HBase table :
 
scan 'table_name'
 
Example : scan 'employee_table'


To update column families in the table, use the following command :
 
alter 'table_name', 'column_family_name'
 
Example : alter 'employee_table', 'emp_address'
The Codegen tool in Sqoop generates the Data Access Object (DAO) Java classes that encapsulate and interpret imported records.
 
The following example generates Java code for an “employee” table in the “testdb” database.
 
$ sqoop codegen \
--connect jdbc:mysql://localhost/testdb \
--username root \
--table employee