Hadoop - Quiz (MCQ)
A)
Bell Labs
B)
Sun Microsystems
C)
Apache Software Foundation
D)
Hadoop Software Foundation

Correct Answer :   Apache Software Foundation


Explanation :

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003.
 
The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop. The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.

A)
1st April 2005
B)
1st April 2006
C)
1st April 2007
D)
1st April 2008

Correct Answer :   1st April 2006

A)
Apache License 1.0
B)
Apache License 2.0
C)
Apache License 2.3
D)
Apache License 2.7

Correct Answer :   Apache License 2.0


Explanation : Hadoop is Open Source, released under Apache 2 license.

A)
C
B)
C++
C)
Python
D)
Java

Correct Answer :   Java


Explanation : The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.

A)
Unix-like
B)
Debian
C)
Bare metal
D)
Cross-platform

Correct Answer :   Cross-platform


Explanation : Hadoop supports cross-platform operating systems.

A)
ZFS
B)
Operating system
C)
RAID
D)
Standard RAID levels

Correct Answer :   RAID


Explanation : With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.

A)
MapReduce
B)
Google
C)
Facebook
D)
Functional programming

Correct Answer :   MapReduce


Explanation : The MapReduce engine is used to distribute work around a cluster.

A)
Artificial intelligence
B)
Machine learning
C)
Pattern recognition
D)
Statistical classification

Correct Answer :   Machine learning


Explanation : The Apache Mahout project’s goal is to build a scalable machine learning tool.

A)
JAX-RS
B)
Distributed file system
C)
Java Message Service
D)
Relational Database Management System

Correct Answer :   Distributed file system


Explanation : The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to the user.

A)
Google
B)
Google Variations
C)
Google Latitude
D)
Android (operating system)

Correct Answer :   Google


Explanation : Google and IBM announced a university initiative to address Internet-scale computing.

A)
Improved data warehousing functionality
B)
Improved data storage and information retrieval
C)
Improved security, workload management, and SQL support
D)
Improved extract, transform and load features for data integration

Correct Answer :   Improved security, workload management, and SQL support


Explanation : Adding security to Hadoop is challenging because all the interactions do not follow the classic client-server pattern.

A)
Management of Hadoop clusters
B)
Collecting and storing unstructured data
C)
Data warehousing and business intelligence
D)
Big data management and data mining

Correct Answer :   Big data management and data mining


Explanation : Data warehousing integrated with Hadoop would give a better understanding of data.

A)
MapReduce, Heron and Trumpet
B)
MapReduce, Hummer and Iguana
C)
MapReduce, MySQL and Google Apps
D)
MapReduce, Hive and HBase

Correct Answer :   MapReduce, Hive and HBase


Explanation : To use Hive with HBase you’ll typically want to launch two clusters, one to run HBase and the other to run Hive.

A)
Cutting’s high school rock band
B)
The toy elephant of Cutting’s son
C)
Creator Doug Cutting’s favorite circus act
D)
A sound Cutting’s laptop made during Hadoop development

Correct Answer :   The toy elephant of Cutting’s son


Explanation : Doug Cutting, Hadoop creator, named the framework after his child’s stuffed toy elephant.

A)
Real-time
B)
Java-based
C)
Open-source
D)
Distributed computing approach

Correct Answer :   Real-time


Explanation : Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

A)
Oozie
B)
Mahout
C)
MapReduce
D)
All of the above

Correct Answer :   MapReduce


Explanation : MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm.

A)
Apple
B)
Datamatics
C)
Facebook
D)
None of the above

Correct Answer :   Facebook


Explanation : Facebook has many Hadoop clusters, the largest among them is the one that is used for Data warehousing.

A)
Pig
B)
Hive
C)
Oozie
D)
Pig Latin

Correct Answer :   Pig


Explanation : Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

A)
Scalding
B)
Cascalog
C)
HCatalog
D)
All of the above

Correct Answer :   Cascalog


Explanation : Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the name “Cascalog” is a contraction of Cascading and Datalog.

A)
Scalding
B)
HCatalog
C)
Cascalog
D)
Cascading

Correct Answer :   Cascading


Explanation : Cascading hides many of the complexities of MapReduce programming behind more intuitive pipes and data flow abstractions.

A)
Drill
B)
Mapreduce
C)
Oozie
D)
None of the above

Correct Answer :   Mapreduce


Explanation : MapReduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.

A)
XML
B)
JSON
C)
SQL
D)
All of the above

Correct Answer :   SQL


Explanation : Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce.

A)
Avro
B)
Drill
C)
BigTop
D)
Chukwa

Correct Answer :   Avro


Explanation : In the context of Hadoop, Avro can be used to pass data from one program or language to another.

A)
TaskTracker
B)
Mapper
C)
JobTracker
D)
MapReduce

Correct Answer :   TaskTracker


Explanation : The TaskTracker receives the information necessary for the execution of a task from the JobTracker, executes the task, and sends the results back to the JobTracker.

A)
Hadoop Stream
B)
Hadoop Strdata
C)
Hadoop Streaming
D)
None of the above

Correct Answer :   Hadoop Streaming


Explanation : Hadoop streaming is one of the most important utilities in the Apache Hadoop distribution.

A)
Reducer
B)
Mapper
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Mapper


Explanation : Maps are the individual tasks that transform input records into intermediate records.
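
For illustration, here is a minimal sketch of an old-API (org.apache.hadoop.mapred) Mapper that turns each input line into intermediate (word, 1) records; the class name TokenMapper is hypothetical:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical example: transforms each input line into (word, 1) intermediate records.
    public class TokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, ONE);   // emit an intermediate record
        }
      }
    }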

A)
tasks
B)
outputs
C)
Both (A) and (B)
D)
inputs

Correct Answer :   inputs


Explanation : Total size of inputs means the total number of blocks of the input files.

A)
HashPar
B)
Partitioner
C)
HashPartitioner
D)
None of the above

Correct Answer :   HashPartitioner


Explanation : The default partitioner in Hadoop is HashPartitioner, which uses its getPartition method to choose the partition for each key.
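
As a sketch of the same logic the default HashPartitioner applies (the key's hash, made non-negative, modulo the number of reducers), written against the old mapred Partitioner interface; the class name is hypothetical:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Mirrors the behaviour of the default HashPartitioner.
    public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
      public void configure(JobConf job) { }

      public int getPartition(K key, V value, int numReduceTasks) {
        // Non-negative hash of the key, modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }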

A)
JobConfigurable.configure
B)
JobConfigure.configure
C)
JobConfigurable.configurable
D)
None of the above

Correct Answer :   JobConfigurable.configure


Explanation : Implementations override the JobConfigurable.configure method to initialize themselves.

A)
Shuffle
B)
Reducer
C)
Mapper
D)
All of the above

Correct Answer :   Reducer


Explanation : In the Shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

A)
Mapper
B)
Scalding
C)
Cascader
D)
None of the above

Correct Answer :   None of the above


Explanation : The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted.

A)
Shuffle and Sort
B)
Shuffle and Map
C)
Reduce and Sort
D)
All of the above

Correct Answer :   Shuffle and Sort


Explanation : The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

A)
Reporter
B)
Partitioner
C)
OutputCollector
D)
All of the above

Correct Answer :   Reporter


Explanation : Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

A)
MemoryConf
B)
Map Parameters
C)
JobConf
D)
None of the above

Correct Answer :   JobConf


Explanation : JobConf represents a MapReduce job configuration.
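
For illustration, a minimal driver sketch that builds a MapReduce job configuration with JobConf (old API); the mapper/reducer classes and paths are hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");              // job name shown by the JobTracker

        conf.setMapperClass(TokenMapper.class);    // hypothetical mapper
        conf.setReducerClass(SumReducer.class);    // hypothetical reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setNumReduceTasks(2);                 // number of reduce tasks

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);                    // submit and wait for completion
      }
    }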

A)
SQL
B)
NoSQL
C)
NewSQL
D)
All of the above

Correct Answer :   NoSQL


Explanation : NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation.

A)
Hive
B)
Hbase
C)
Both (A) and (B)
D)
HCatalog

Correct Answer :   HCatalog


Explanation : Other means of tagging the values also can be used.

A)
Scale up
B)
Scale out
C)
Both Scale up and out
D)
None of the above

Correct Answer :   Scale out


Explanation : HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up.

A)
Hbase
B)
Cassandra
C)
MongoDB
D)
None of the above

Correct Answer :   Hbase


Explanation : HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware.

A)
DataCache
B)
DistributedData
C)
DistributedCache
D)
All of the above

Correct Answer :   DistributedCache


Explanation : The child-jvm always has its current working directory added to the java.library.path and LD_LIBRARY_PATH.
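
A short sketch of adding a file to the DistributedCache with the old API; the HDFS path and symlink name are hypothetical:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
      public static void addLookupFile(JobConf conf) throws Exception {
        // Hypothetical HDFS path; the "#lookup" fragment creates a symlink
        // named "lookup" in the task's working directory.
        DistributedCache.addCacheFile(new URI("/apps/data/lookup.txt#lookup"), conf);
        DistributedCache.createSymlink(conf);
      }
    }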

A)
Bigtable
B)
BigTop
C)
TopTable
D)
None of the above

Correct Answer :   Bigtable


Explanation : Google Bigtable leverages the distributed data storage provided by the Google File System.

A)
tool
B)
task
C)
library
D)
generic

Correct Answer :   generic


Explanation : Place the generic options before the streaming options, otherwise the command will fail.

A)
mapper executable
B)
input directoryname
C)
output directoryname
D)
All of the above

Correct Answer :   All of the above


Explanation : The required parameters specify the mapper executable and the input and output locations.

A)
-cmenv EXAMPLE_DIR=/home/example/dictionaries/
B)
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
C)
-cmden EXAMPLE_DIR=/home/example/dictionaries/
D)
-cmdev EXAMPLE_DIR=/home/example/dictionaries/

Correct Answer :   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/


Explanation : Environment variables are set using the -cmdenv option.

A)
Copy
B)
Paste
C)
Cut
D)
Move

Correct Answer :   Cut


Explanation : The map function defined in the class treats each input key/value pair as a list of fields.

A)
KeyFieldBasedComparator
B)
KeyFieldBased
C)
KeyFieldComparator
D)
All of the above

Correct Answer :   KeyFieldBasedComparator


Explanation : Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications.

A)
Map
B)
Reduce
C)
Reducer
D)
None of the above

Correct Answer :   Reducer


Explanation : Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a sequence of values.

A)
KeyFieldBased
B)
KeyFieldPartitioner
C)
Both (A) and (B)
D)
KeyFieldBasedPartitioner

Correct Answer :   KeyFieldBasedPartitioner


Explanation : The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.

A)
Replication
B)
NameNode
C)
Data Node
D)
Data block

Correct Answer :   NameNode


Explanation : All the metadata related to HDFS including the information about data nodes, files stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

A)
worker/slave
B)
master-worker
C)
master-slave
D)
All of the above

Correct Answer :   master-worker


Explanation : The NameNode serves as the master and each DataNode serves as a worker/slave.

A)
Data
B)
Rack
C)
Secondary
D)
None of the above

Correct Answer :   Secondary


Explanation : The Secondary NameNode periodically merges the NameNode's edit log with the filesystem image, keeping an up-to-date checkpoint that improves reliability.

A)
HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
B)
HDFS is suitable for storing data related to applications requiring low latency data access
C)
HDFS is suitable for storing data related to applications requiring low latency data access
D)
None of the above

Correct Answer :   HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file


Explanation : HDFS can be used for storing archive data since it is cheaper as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of fault-tolerance.

A)
DataNode
B)
Data block
C)
Replication
D)
NameNode

Correct Answer :   DataNode


Explanation : A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has more than one DataNode, with data replicated across them.

A)
"DFS Shell"
B)
"FS Shell"
C)
"HDFS Shell"
D)
None of the above

Correct Answer :   "FS Shell"


Explanation : The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS).

A)
Data Node
B)
NameNode
C)
Replication
D)
Resource

Correct Answer :   Resource


Explanation : All the metadata related to HDFS including the information about data nodes, files stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

A)
DataNode
B)
ActionNode
C)
NameNode
D)
None of the above

Correct Answer :   NameNode


Explanation : Because HDFS is implemented in Java, any computer that can run Java can host a NameNode or DataNode.

A)
Pig
B)
Hive
C)
Lucene
D)
MapReduce

Correct Answer :   MapReduce


Explanation : MapReduce is the heart of Hadoop.

A)
job-tracker
B)
map-tracker
C)
reduce-tracker
D)
all of the above

Correct Answer :   job-tracker


Explanation : MapReduce jobs are submitted to the JobTracker.

A)
TaskTracker
B)
DataNodes
C)
ActionNodes
D)
All of the above

Correct Answer :   DataNodes


Explanation : A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status whether the node is dead or alive.

A)
puts
B)
gets
C)
getSplits
D)
all of the above

Correct Answer :   getSplits


Explanation : getSplits() computes the input splits; the JobTracker then uses their storage locations to schedule map tasks to process them on the TaskTrackers.

A)
InputFormat
B)
TextFormat
C)
TextInputFormat
D)
All of the above

Correct Answer :   TextInputFormat


Explanation : A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs.

A)
shuffling
B)
forking
C)
reducing
D)
secondary sorting

Correct Answer :   shuffling


Explanation : All values corresponding to the same key will go to the same reducer.

A)
outstream
B)
inputstream
C)
datastream
D)
filesystem

Correct Answer :   filesystem


Explanation : An input stream opened from the FileSystem (an FSDataInputStream) is used to read data from a file.

A)
Utils
B)
IOUtils
C)
IUtils
D)
All of the above

Correct Answer :   IOUtils


Explanation : IOUtils is a utility class of static methods for common I/O operations.
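
A sketch, based on the common Hadoop pattern, of streaming an HDFS file to standard output with the IOUtils static helpers; the file URI is supplied on the command line:

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0];                       // e.g. hdfs://namenode/user/data.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));              // returns an FSDataInputStream
          IOUtils.copyBytes(in, System.out, 4096, false);  // static helper: copy stream
        } finally {
          IOUtils.closeStream(in);                  // static helper: close quietly
        }
      }
    }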

A)
write()
B)
read()
C)
readwrite()
D)
All of the above

Correct Answer :   write()


Explanation : The readFully() method can also be used instead of the read() method.

A)
Mapper
B)
Writable
C)
Reducer
D)
Readable

Correct Answer :   Reducer


Explanation : Reducer implementations can access the JobConf for the job.

A)
OutputCollect
B)
InputCollector
C)
OutputCollector
D)
All of the above

Correct Answer :   OutputCollector


Explanation : In the reduce phase, the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
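
A minimal old-API Reducer sketch whose reduce method receives each key with an iterator over its grouped values and emits output through the OutputCollector; the class name SumReducer is hypothetical:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical example: sums the grouped values for each key.
    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));  // one output record per key
      }
    }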

A)
classes
B)
methods
C)
commands
D)
None of the above

Correct Answer :   None of the above


Explanation : Hadoop I/O consists of primitives for serialization and deserialization.

A)
Putfile
B)
SequenceFile
C)
GetFile
D)
All of the above

Correct Answer :   SequenceFile


Explanation : SequenceFile is append-only.

A)
3
B)
4
C)
5
D)
6

Correct Answer :   3


Explanation : SequenceFile has three available formats: an "Uncompressed" format, a "Record Compressed" format and a "Block-Compressed" format.
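
A short sketch of creating and appending to a SequenceFile with the classic createWriter call; the output path and records are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("numbers.seq");        // hypothetical output file

        SequenceFile.Writer writer = null;
        try {
          // Key and value classes are fixed when the file is created.
          writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
          for (int i = 0; i < 5; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));  // append-only
          }
        } finally {
          if (writer != null) {
            writer.close();
          }
        }
      }
    }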

A)
Uncompressed
B)
Block-Compressed
C)
Partition Compressed
D)
Record Compressed

Correct Answer :   Block-Compressed


Explanation : The SequenceFile metadata is a list of key/value pairs (for example Text/Text) that is written to the file during the initialization that happens in the SequenceFile.

A)
Array
B)
Index
C)
Immutable
D)
All of the above

Correct Answer :   Index


Explanation : The index doesn’t contain all the keys, just a fraction of them.

A)
SetFile
B)
BloomMapFile
C)
ArrayFile
D)
None of the above

Correct Answer :   ArrayFile


Explanation : SetFile, instead of append(key, value), has just the key field, append(key); the value is always the NullWritable instance.

A)
Avro
B)
Oozie
C)
cTakes
D)
Lucene

Correct Answer :   Avro


Explanation : Avro is a splittable data format with a metadata section at the beginning followed by a sequence of Avro-serialized objects.

A)
JS
B)
XML
C)
XHTML
D)
JSON

Correct Answer :   JSON


Explanation : The JSON schema content is put into a file.

A)
DatumReader
B)
DatumRead
C)
DatReader
D)
None of the above

Correct Answer :   DatumReader


Explanation : DatumReader reads the content through the DataFileReader implementation.
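
A sketch of reading Avro records by passing a GenericDatumReader to a DataFileReader; the file name is hypothetical:

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;

    public class AvroReadDemo {
      public static void main(String[] args) throws Exception {
        File file = new File("users.avro");  // hypothetical Avro data file

        // The DatumReader deserializes records; DataFileReader iterates over the file.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> dataFileReader =
            new DataFileReader<GenericRecord>(file, datumReader);
        try {
          while (dataFileReader.hasNext()) {
            GenericRecord record = dataFileReader.next();
            System.out.println(record);
          }
        } finally {
          dataFileReader.close();
        }
      }
    }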

A)
Mapper
B)
AvroMapper
C)
AvroReducer
D)
None of the above

Correct Answer :   AvroMapper


Explanation : AvroMapper is used to provide the ability to collect or map data.

A)
Mapper
B)
AvroReducer
C)
AvroMapper
D)
None of the above

Correct Answer :   AvroReducer


Explanation : AvroReducer summarizes them by looping through the values.

A)
kafka
B)
Lucene
C)
MapReduce
D)
None of the above

Correct Answer :   MapReduce


Explanation : You can use Avro and MapReduce together to process many items serialized with Avro’s small binary format.

A)
Snappy
B)
Snapcheck
C)
FileCompress
D)
None of the above

Correct Answer :   Snappy


Explanation : Snappy has fast compression and decompression speeds.

A)
Gzip
B)
Bzip2
C)
Both (A) and (B)
D)
LZO

Correct Answer :   LZO


Explanation : LZO is only really desirable if you need to compress text files.

A)
LZO
B)
Gzip
C)
Bzip2
D)
All of the above

Correct Answer :   LZO


Explanation : LZO enables the parallel processing of compressed text file splits by your MapReduce jobs.

A)
.g
B)
.gz
C)
.gzp
D)
.gzip

Correct Answer :   .gz


Explanation : You can use the gunzip command to decompress files that were created by a number of compression utilities, including Gzip.

A)
LZO
B)
Bzip2
C)
Gzip
D)
All of the above

Correct Answer :   Gzip


Explanation : gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman Coding.

A)
24k
B)
36k
C)
128k
D)
256k

Correct Answer :   256k


Explanation :  LZO was designed with speed in mind : it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds.

A)
parity
B)
checksum
C)
metastore
D)
none of the above

Correct Answer :   checksum


Explanation : When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.

A)
NameNode
B)
ActionNode
C)
DataNode
D)
All of the above

Correct Answer :   NameNode


Explanation : If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

A)
FsImage
B)
DsImage
C)
FsImages
D)
All of the above

Correct Answer :   FsImage


Explanation : A corruption of these files can cause the HDFS instance to be non-functional.

A)
Datanots
B)
Snapshots
C)
Data Image
D)
All of the above

Correct Answer :   Snapshots


Explanation : One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.

A)
end
B)
failover
C)
scalability
D)
all of the above

Correct Answer :   failover


Explanation : If the NameNode machine fails, manual intervention is necessary.

A)
1, 2
B)
2, 3
C)
3, 2
D)
All of the above

Correct Answer :   3, 2


Explanation : HDFS has a simple yet robust architecture that was explicitly designed for data reliability in the face of faults and failures in disks, nodes and networks.

A)
ActionNode
B)
DataNode
C)
Both (A) and (B)
D)
NameNode

Correct Answer :   NameNode


Explanation : HDFS tolerates failures of storage servers (called DataNodes) and its disks.

A)
Dynamic typing
B)
Untagged data
C)
No manually-assigned field IDs
D)
All of the above

Correct Answer :   Dynamic typing


Explanation : Avro does not require that code be generated.

A)
Avro
B)
Thrift
C)
Protocol Buffers
D)
None of the above

Correct Answer :   Avro


Explanation : Avro is optimized to minimize the disk space needed by our data and it is flexible.

A)
UID
B)
Static number
C)
Name
D)
None of the above

Correct Answer :   Static number


Explanation : Avro resolves possible conflicts through the name of the field.

A)
RDC
B)
RMC
C)
RPC
D)
All of the above

Correct Answer :   RPC


Explanation : When Avro is used in RPC, the client and server exchange schemas in the connection handshake.

A)
AvroJob.Reflect(jConf);
B)
AvroJob.setReflect(jConf);
C)
Job.setReflect(jConf);
D)
None of the above

Correct Answer :   Job.setReflect(jConf);


Explanation : For strongly typed languages like Java, it also provides a code generation layer, including RPC services code generation.

A)
alter
B)
set
C)
reset
D)
select

Correct Answer :   alter


Explanation : Alter is the command used to make changes to an existing table.

A)
MAX_FILESIZE
B)
MEMSTORE_FLUSH
C)
MEMSTORE_FLUSHSIZE
D)
All of the above

Correct Answer :   MEMSTORE_FLUSH


Explanation : Using alter, you can set and remove table scope operators such as MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

A)
delColumn()
B)
removeColumn()
C)
Both (A) and (B)
D)
deleteColumn()

Correct Answer :   deleteColumn()

A)
Collector
B)
Configuration
C)
Component
D)
None of the above

Correct Answer :   Configuration


Explanation : You can create a configuration object using the create() method of the HBaseConfiguration class.

A)
hbase.xml
B)
hbase-site-conf.xml
C)
hbase-site.xml
D)
None of the above

Correct Answer :   hbase-site.xml


Explanation : Set the data directory to an appropriate location by opening the HBase home folder in /usr/local/HBase.

A)
map
B)
reduce
C)
reducer
D)
mapper

Correct Answer :   map


Explanation : The Mapper outputs are sorted and then partitioned per Reducer.

A)
InputSplit
B)
OutputSplit
C)
InputSplitStream
D)
All of the mentioned

Correct Answer :   InputSplit


Explanation : Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and override it to initialize themselves.

A)
Reporter
B)
Partitioner
C)
OutputSplit
D)
All of the above

Correct Answer :   Partitioner


Explanation : Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

A)
Reporter
B)
Partitioner
C)
OutputSplit
D)
All of the above

Correct Answer :   Reporter


Explanation : Reporter is also used to update Counters, or just indicate that they are alive.

A)
OutputCollector.put
B)
OutputCollector.get
C)
OutputCollector.receive
D)
OutputCollector.collect

Correct Answer :   OutputCollector.collect

A)
JobConf.setNumTasks(int)
B)
JobConf.setNumMapTasks(int)
C)
Both (A) and (B)
D)
JobConf.setNumReduceTasks(int)

Correct Answer :   JobConf.setNumReduceTasks(int)


Explanation : Reducer has 3 primary phases : Shuffle, Sort and Reduce.

A)
MergePartitioner
B)
HashedPartitioner
C)
HashPartitioner
D)
None of the above

Correct Answer :   HashPartitioner


Explanation : The total number of partitions is the same as the number of reduce tasks for the job.

A)
Collector
B)
Partitioner
C)
Compactor
D)
All of the above

Correct Answer :   Partitioner


Explanation : Partitioner controls the partitioning of the keys of the intermediate map-outputs.

A)
JobConf
B)
JobConfig
C)
JobConfiguration
D)
All of the above

Correct Answer :   JobConf


Explanation : JobConf is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations.

A)
io.sort.factor
B)
mapred.inmem.merge.threshold
C)
mapred.job.shuffle.merge.percent
D)
mapred.job.reduce.input.buffer.percent

Correct Answer :   mapred.job.reduce.input.buffer.percent


Explanation : When the reduce begins, map outputs will be merged to disk until those that remain are under the resource limit this defines.

A)
Config
B)
Configuration
C)
OutputConfig
D)
None of the above

Correct Answer :   Configuration


Explanation : Configurations are specified by resources.

A)
coredefault.xml
B)
core-default.xml
C)
core-site.xml
D)
All of the above

Correct Answer :   core-site.xml


Explanation : core-default.xml contains the read-only defaults for Hadoop.

A)
Clear
B)
getClass
C)
addResource
D)
None of the above

Correct Answer :   Clear


Explanation : getClass is used to get the value of the name property as a Class.

A)
isDeprecatedif
B)
isDeprecated
C)
setDeprecated
D)
All of the above

Correct Answer :   isDeprecated


Explanation : The method returns true if the key is deprecated and false otherwise.

A)
addResource
B)
addDefaultResource
C)
None of the above
D)
setDeprecatedProperties

Correct Answer :   setDeprecatedProperties


Explanation : setDeprecatedProperties sets all deprecated properties that are not currently set but have a corresponding new property that is set.

A)
addDeprecation
B)
addDefaultResource
C)
setDeprecatedProperties
D)
addResource

Correct Answer :   addResource


Explanation : The properties of this resource will override the properties of previously added resources unless they were marked final. addResource adds a configuration resource.
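
A sketch of layering configuration resources with addResource; later resources override earlier ones unless a property was marked final. The file paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults are loaded first; each added resource overrides earlier ones
        // unless a property was marked <final>true</final>.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));   // hypothetical path
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));   // hypothetical path

        System.out.println(conf.get("fs.defaultFS"));
      }
    }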

A)
Hiver
B)
Serde
C)
Mover
D)
None of the above

Correct Answer :   Mover


Explanation : Mover periodically scans the files in HDFS to check if the block placement satisfies the storage policy.

A)
hdfs storagepolicies
B)
hdfs storage
C)
hd storagepolicies
D)
All of the above

Correct Answer :   hdfs storagepolicies


Explanation : The hdfs storagepolicies command takes no arguments.

A)
getPriority()
B)
getJobState()
C)
Both (A) and (B)
D)
getJobName()

Correct Answer :   getJobName()

A)
getPriority()
B)
getJobState()
C)
getJobName()
D)
getTaskCompletionEvents(int startFrom)

Correct Answer :   getTaskCompletionEvents(int startFrom)

A)
0.1.-1.0
B)
0.0-1.0
C)
1.0-2.0
D)
2.0-3.0

Correct Answer :   0.0-1.0


Explanation : mapProgress() is used to get the progress of the job’s map-tasks, as a float between 0.0 and 1.0.

A)
SSL
B)
SSH
C)
Kerberos
D)
None of the above

Correct Answer :   Kerberos


Explanation : Each service reads its authentication information from a keytab file with appropriate permissions.

A)
SSL
B)
SSH
C)
Kerberos
D)
None of the above

Correct Answer :   SSL


Explanation : AES offers the greatest cryptographic strength and the best performance.

A)
WebProxy
B)
ProxyServer
C)
WebAppProxy
D)
None of the above

Correct Answer :   WebAppProxy


Explanation : If security is enabled it will warn users before accessing a potentially unsafe web application. Authentication and authorization using the proxy is handled just like any other privileged web application.

A)
LinuxController
B)
LinuxTaskController
C)
TaskController
D)
None of the above

Correct Answer :   LinuxTaskController


Explanation : The LinuxTaskController keeps track of all paths and directories on the DataNode.

A)
NodeManager
B)
DataManager
C)
ValidationManager
D)
None of the above

Correct Answer :   NodeManager


Explanation : To recap, local file-system permissions need to be modified.

A)
ROM_DISK
B)
RAM_DISK
C)
ARCHIVE
D)
All of the above

Correct Answer :   RAM_DISK


Explanation : DISK is the default storage type.

A)
ARCHIVE
B)
ROM_DISK
C)
RAM_DISK
D)
All of the above

Correct Answer :   ARCHIVE


Explanation : ARCHIVE storage has high density but little compute power, and is added for supporting archival storage.

A)
Hot
B)
All_SSD
C)
Lazy_Persist
D)
One_SSD

Correct Answer :   One_SSD


Explanation : The remaining replicas are stored in DISK.

A)
Hot
B)
All_SSD
C)
One_SSD
D)
Lazy_Persist

Correct Answer :   Lazy_Persist


Explanation : The replica is first written in RAM_DISK and then it is lazily persisted in DISK.

A)
Hive
B)
Chuckwa
C)
YARN
D)
Incubator

Correct Answer :   YARN


Explanation : YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

A)
Hive
B)
Imphala
C)
MapReduce
D)
All of the above

Correct Answer :   MapReduce


Explanation : Multi-tenant data processing improves an enterprise’s return on its Hadoop investments.

A)
NodeManager
B)
ApplicationMaster
C)
ResourceManager
D)
All of the above

Correct Answer :   ApplicationMaster


Explanation : Each ApplicationMaster has the responsibility for negotiating appropriate resource containers from the Scheduler.

A)
0.23
B)
0.24
C)
0.26
D)
0.30

Correct Answer :   0.23


Explanation : The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker.

A)
Master
B)
Manager
C)
Both (A) and (B)
D)
Scheduler

Correct Answer :   Scheduler


Explanation : The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application.

A)
Partition
B)
Networked
C)
Hierarchical
D)
None of the above

Correct Answer :   Hierarchical


Explanation : The Scheduler has a pluggable policy plugin, which is responsible for partitioning the cluster resources among the various queues, applications etc.

A)
bin
B)
hive
C)
home
D)
hadoop

Correct Answer :   bin


Explanation : Running the yarn script without any arguments prints the description for all commands.

A)
rear
B)
root
C)
domain
D)
All of the above

Correct Answer :   root

A)
25%
B)
50%
C)
75%
D)
100%

Correct Answer :   100%


Explanation : Queues cannot be deleted, only the addition of new queues is supported.

A)
xml
B)
jar
C)
java
D)
C code

Correct Answer :   jar


Explanation : Usage: yarn jar <jar> [mainClass] args…

A)
-format-state
B)
-form-state-store
C)
-format-state-store
D)
None of the above

Correct Answer :   -format-state-store


Explanation : -format-state-store formats the RMStateStore.

A)
run
B)
admin
C)
proxyserver
D)
rmadmin

Correct Answer :   rmadmin

A)
TextInputFormat
B)
OutputInputFormat
C)
TextOutputFormat
D)
None of the above

Correct Answer :   TextInputFormat

A)
Streaming
B)
Mapreduce
C)
Orchestration
D)
All of the above

Correct Answer :   Streaming

A)
datanode
B)
split
C)
textformat
D)
None of the above

Correct Answer :   split


Explanation : Each split is divided into records, and the map processes each record—a key-value pair—in turn.

A)
TextInputFormat
B)
TextOutputFormat
C)
OutputInputFormat
D)
InputFormat

Correct Answer :   InputFormat


Explanation : As a MapReduce application writer, you don’t need to deal with InputSplits directly, as they are created by an InputFormat.

A)
MultithreadedMap
B)
MultithreadedRunner
C)
MultithreadedMapRunner
D)
SinglethreadedMapRunner

Correct Answer :   MultithreadedMapRunner


Explanation : A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.

A)
FileTextFormat
B)
FileInputFormat
C)
FileOutputFormat
D)
None of the above

Correct Answer :   FileInputFormat


Explanation : FileInputFormat provides implementation for generating splits for the input files.

A)
TextFileInputFormat
B)
CombineFileOutputFormat
C)
CombineFileInputFormat
D)
None of the above

Correct Answer :   CombineFileInputFormat


Explanation : CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MapReduce job.

A)
LongWritable
B)
ShortReadable
C)
LongReadable
D)
All of the above

Correct Answer :   LongWritable


Explanation : The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object.

A)
FileValueTextInputFormat
B)
KeyValueTextInputFormat
C)
KeyValueTextOutputFormat
D)
All of the above

Correct Answer :   KeyValueTextOutputFormat


Explanation : To interpret such files correctly, KeyValueTextInputFormat is appropriate.

A)
HDFS
B)
Library
C)
Generic
D)
Task

Correct Answer :   Task


Explanation : FileInputFormat splits only large files (here "large" means larger than an HDFS block).

A)
fsk
B)
fsck
C)
fetchdt
D)
None of the above

Correct Answer :   fsck


Explanation : fsck is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks.

A)
rec
B)
fsk
C)
fetdt
D)
fetchdt

Correct Answer :   fetchdt


Explanation : The delegation token can later be used to access a secure server from a non-secure client.

A)
full
B)
partial
C)
commit
D)
recovery

Correct Answer :   recovery


Explanation : Because recovery mode can cause data loss, you should always back up your edit log and fsimage before using it.

A)
Safe
B)
Recover
C)
Rollback
D)
None of the above

Correct Answer :   Rollback


Explanation : dfsadmin runs an HDFS dfsadmin client.

A)
jobtracker
B)
mradmin
C)
tasktracker
D)
None of the above

Correct Answer :   jobtracker

A)
Lack of tools
B)
Lack of web interface
C)
Lack of configuration management
D)
None of the above

Correct Answer :   Lack of configuration management


Explanation : Without a centralized configuration management framework, you end up with a number of issues that can cascade just as your usage picks up.

A)
Alex
B)
Puppet
C)
Acem
D)
None of the above

Correct Answer :   Puppet


Explanation : Administrators may use configuration management systems such as Puppet and Chef to manage processes.

A)
Upgrade Hadoop
B)
React to incidents
C)
Remove worker nodes
D)
All of the above

Correct Answer :   All of the above


Explanation : The most common reason administrators restart Hadoop processes is to enact configuration changes.

A)
JVM
B)
JMX
C)
JVX
D)
None of the above

Correct Answer :   JMX


Explanation : Hadoop includes several managed beans (MBeans), which expose Hadoop metrics to JMX-aware applications.

A)
Two
B)
Three
C)
Four
D)
Five

Correct Answer :   Two


Explanation : You can run Pig (execute Pig Latin statements and Pig commands) in two modes: interactive mode and batch mode.

A)
A LOAD statement to read data from the file system
B)
A series of “transformation” statements to process the data
C)
A DUMP statement to view results or a STORE statement to save the results
D)
All of the above

Correct Answer :   All of the above


Explanation : A DUMP or STORE statement is required to generate output.

A)
LOAD
B)
READ
C)
WRITE
D)
None of the above

Correct Answer :   LOAD


Explanation : PigStorage is the default load function.

A)
$ pig …
B)
$ pig -x local ...
C)
$ pig -x tez_local …
D)
None of the above

Correct Answer :   $ pig -x local ...


Explanation : Specify local mode using the -x flag (pig -x local).

A)
LoadCaster
B)
LoadPushDown
C)
LoadMetadata
D)
All of the above

Correct Answer :   LoadMetadata


Explanation : Most loader implementations don’t need to implement this unless they interact with some metadata system.

A)
getShipFiles()
B)
getCacheFiles()
C)
relativeToAbsolutePath()
D)
setUdfContextSignature()

Correct Answer :   setUdfContextSignature()


Explanation : The signature can be used to store into the UDFContext any information which the Loader needs to store between various method invocations in the front end and back end.

A)
getShipFiles()
B)
getCacheFiles()
C)
relativeToAbsolutePath()
D)
setUdfContextSignature()

Correct Answer :   getShipFiles()


Explanation : The default implementation provided in LoadFunc handles this for FileSystem locations.

A)
getCacheFiles()
B)
relativeToAbsolutePath()
C)
setLocation()
D)
setUdfContextSignature()

Correct Answer :   setLocation()


Explanation : setLocation() method is called by Pig to communicate the load location to the loader.

A)
getNext()
B)
prepareToRead()
C)
relativeToAbsolutePath()
D)
All of the above

Correct Answer :   prepareToRead()


Explanation : The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
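
As an illustrative sketch (not a production loader), a custom Pig LoadFunc showing where setLocation(), prepareToRead() and getNext() fit; it delegates record reading to TextInputFormat and returns each line as a single-field tuple:

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical loader: returns each input line as a single-field tuple.
    public class SimpleLineLoader extends LoadFunc {
      private RecordReader reader;
      private final TupleFactory tupleFactory = TupleFactory.getInstance();

      @Override
      public void setLocation(String location, Job job) throws IOException {
        // Pig calls this to communicate the load location to the loader.
        FileInputFormat.setInputPaths(job, location);
      }

      @Override
      public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
      }

      @Override
      public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;   // kept for use by getNext()
      }

      @Override
      public Tuple getNext() throws IOException {
        try {
          if (!reader.nextKeyValue()) {
            return null;        // end of data
          }
          Tuple t = tupleFactory.newTuple(1);
          t.set(0, reader.getCurrentValue().toString());
          return t;
        } catch (InterruptedException e) {
          throw new IOException(e);
        }
      }
    }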

A)
LoadCaster
B)
LoadMetadata
C)
LoadPushDown
D)
All of the above

Correct Answer :   LoadCaster


Explanation : LoadCaster has methods to convert byte arrays to specific types.

A)
DUMP
B)
STORE
C)
EXPLAIN
D)
DESCRIBE

Correct Answer :   DESCRIBE


Explanation : DESCRIBE returns the schema of a relation.

A)
DUMP
B)
EXPLAIN
C)
STORE
D)
DESCRIBE

Correct Answer :   EXPLAIN

A)
STORE
B)
EXPLAIN
C)
DESCRIBE
D)
ILLUSTRATE

Correct Answer :   ILLUSTRATE


Explanation : ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.

A)
Pig Stats
B)
PStatistics
C)
Pig Statistics
D)
None of the above

Correct Answer :   Pig Statistics


Explanation : The new Pig statistics and the existing Hadoop statistics can also be accessed via the Hadoop job history file.

A)
$pig_ ant pigunit-jar
B)
$pig_tr ant pigunit-jar
C)
$pig_trunk ant pigunit-jar
D)
None of the above

Correct Answer :   $pig_trunk ant pigunit-jar


Explanation : The compile will create the pigunit.jar file.

A)
\q
B)
\d alias
C)
\de alias
D)
None of the above

Correct Answer :   \d alias


Explanation : If the alias is omitted, the last defined alias will be used.

A)
exec
B)
throw
C)
error
D)
execute

Correct Answer :   exec


Explanation : With the exec command, store statements will not trigger execution; rather, the entire script is parsed before execution starts.

A)
pig.jar
B)
tutorial.jar
C)
excite.log.bz2
D)
script2-local.pig

Correct Answer :   tutorial.jar


Explanation : tutorial.jar also contains Java classes.

A)
{%declare | %default}
B)
{%declare | %default} param_name param_value
C)
{%declare | %default} param_name param_value cmd
D)
pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script

Correct Answer :   pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script


Explanation : Parameter Substitution is used to substitute values for parameters at run time.

A)
Parameter files
B)
Command line parameters
C)
Declare and default preprocessors
D)
Both parameter files and command line parameters

Correct Answer :   Both parameter files and command line parameters


Explanation : Parameters and command parameters are scanned in FIFO manner.

A)
functional
B)
declarative
C)
procedural
D)
All of the above

Correct Answer :   procedural


Explanation : In SQL users can specify that data from two tables must be joined, but not what join implementation to use.

A)
ETL
B)
Lazy evaluation
C)
Supports pipeline splits
D)
All of the above

Correct Answer :   All of the above


Explanation : Pig Latin’s ability to include user code at any point in the pipeline is useful for pipeline development.

A)
pig.input.dirs
B)
pig.job
C)
pig.feature
D)
None of the above

Correct Answer :   pig.input.dirs


Explanation : pig.input.dirs contains comma-separated list of input directories for the job.

A)
logj4
B)
log4j
C)
log4i
D)
log4l

Correct Answer :   log4j


Explanation : By default Hive will use hive-log4j.default in the conf/ directory of the Hive installation.

187.
What does the hive.root.logger property specify in the following statement?
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console
A)
Log level
B)
Log source
C)
Log modes
D)
All of the above

Correct Answer :   Log level


Explanation : hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to standard error.

A)
SqlLine
B)
CLilLine
C)
BeeLine
D)
HiveLine

Correct Answer :   BeeLine


Explanation : Beeline is a JDBC client based on SQLLine.

A)
0.9.0
B)
0.11.0
C)
0.10.0
D)
0.12.0

Correct Answer :   0.11.0


Explanation : hcat commands can be issued as hive commands, and vice versa.

A)
set hive.variable
B)
set hive.variable.substitute=true;
C)
set hive.variable.substitutevalues=false;
D)
set hive.variable.substitute=false;

Correct Answer :   set hive.variable.substitute=false;


Explanation : Variable substitution is on by default (hive.variable.substitute=true).

A)
Remote
B)
HTTP
C)
Interactive
D)
Embedded

Correct Answer :   Remote


Explanation : In HTTP mode, the message body contains Thrift payloads.

A)
“STORED AS AVRO”
B)
“STORED AS HIVE”
C)
“STORED AS SERDE”
D)
“STORED AS AVROHIVE”

Correct Answer :   “STORED AS AVRO”


Explanation : AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema.

A)
Set
B)
Intersection
C)
Union
D)
All of the above

Correct Answer :   Union


Explanation : A null in a field that is not so defined will result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.

A)
Avro
B)
Hive
C)
Map Reduce
D)
All of the above

Correct Answer :   Hive


Explanation : If you copy these files out, you’ll likely want to rename them with .avro.

A)
row.literal
B)
schema.lit
C)
schema.literal
D)
all of the above

Correct Answer :   schema.literal


Explanation : You can embed the schema directly into the create statement.

A)
$ROW
B)
$SCHEMA
C)
$NAMESPACES
D)
$SCHEMASPACES

Correct Answer :   $SCHEMA


Explanation : Use none to ignore either avro.schema.literal or avro.schema.url.

A)
ORC
B)
OPC
C)
ODC
D)
None of the above

Correct Answer :   ORC


Explanation : The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.

A)
Row
B)
Row-oriented
C)
Tuple-oriented
D)
Column-oriented

Correct Answer :   Column-oriented


Explanation : HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data.

A)
Bigtable
B)
BigTop
C)
Scanner
D)
FoundationDB

Correct Answer :   Bigtable


Explanation : Bigtable works on top of the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.

A)
user
B)
status
C)
whoami
D)
version

Correct Answer :   whoami


Explanation : status command provides the status of HBase, for example, the number of servers.

A)
HTable
B)
HDescriptor
C)
HTabDescriptor
D)
HTableDescriptor

Correct Answer :   HTableDescriptor


Explanation : Java provides an Admin API to achieve DDL functionalities through programming.

A)
Put
B)
Result
C)
Get
D)
Value

Correct Answer :   Result


Explanation : Get the result by passing your Get class instance to the get method of the HTable class. This method returns the Result class object, which holds the requested result.
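
A sketch using the classic HTable/Get/Result client API; the table, column family and qualifier names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml
        HTable table = new HTable(conf, "employee");          // hypothetical table

        Get get = new Get(Bytes.toBytes("row1"));              // hypothetical row key
        Result result = table.get(get);                        // returns a Result object

        byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }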

A)
Htable
B)
Master Server
C)
Region Server
D)
All of the above

Correct Answer :   Region Server


Explanation : The Region Server handles read and write requests for all the regions under it.

A)
Scala
B)
Hadoop
C)
Imphala
D)
Hive

Correct Answer :   Hadoop


Explanation : The data storage will be in the form of regions (tables). These regions will be split up and stored in region servers.

A)
ensemble
B)
chunks
C)
subdomains
D)
None of the above

Correct Answer :   ensemble


Explanation : As long as a majority of the servers are available, the ZooKeeper service will be available.

A)
Reliability
B)
Flexibility
C)
Scalability
D)
Interactivity

Correct Answer :   Reliability


Explanation : Once an update has been applied, it will persist from that time forward until a client overwrites the update.

A)
write
B)
read-write
C)
read-dominant
D)
none of the above

Correct Answer :   read-dominant


Explanation : ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

A)
2.0.0
B)
3.0.0
C)
4.0.0
D)
6.0.0

Correct Answer :   3.0.0


Explanation : Old pre-3.0.0 clients are not guaranteed to operate against upgraded 3.0.0 servers and vice-versa.

A)
rnodes
B)
hnodes
C)
vnodes
D)
znodes

Correct Answer :   znodes


Explanation : Every znode is identified by a path, with path elements separated by a slash.
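
A sketch of creating and reading a znode with the ZooKeeper Java client; the connect string and znode path are hypothetical:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeDemo {
      public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; 3000 ms session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Every znode is identified by a slash-separated path.
        zk.create("/demo", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        byte[] data = zk.getData("/demo", false, null);
        System.out.println(new String(data));

        zk.close();
      }
    }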

A)
iwrite
B)
iread
C)
icount
D)
inotify

Correct Answer :   inotify


Explanation : A client can request that ZooKeeper generate the node name to avoid collisions.

A)
availability
B)
flexibility
C)
scalability
D)
interactivity

Correct Answer :   availability


Explanation : The clients can thus ask another ZooKeeper master if the first fails to answer.

A)
Neo4j
B)
101tec
C)
Katta
D)
Helprace

Correct Answer :   Katta


Explanation : Zookeeper is used for node, master and index management in the grid.

A)
Katta
B)
Rackspace
C)
Helprace
D)
None of the above

Correct Answer :   Rackspace


Explanation : ZooKeeper also provides distributed locking for connections to prevent a cluster from overwhelming servers.

A)
3-node
B)
4-node
C)
5-node
D)
6-node

Correct Answer :   3-node


Explanation : ZooKeeper is used to manage a system built out of Hadoop, Katta, Oracle batch jobs and a web component.

A)
BigInt
B)
SmallInt
C)
MediumInt
D)
BigInteger

Correct Answer :   BigInteger


Explanation : The BigDecimal/BigInteger can also return itself as a ‘long’ value.

A)
LobSerializer
B)
LargeObjectLoader
C)
FieldMapProcessor
D)
DelimiterSet

Correct Answer :   DelimiterSet


Explanation : Delimiter set is created with the specified delimiters.

A)
DelimiterSet
B)
SmallObjectLoader
C)
JdbcWritableBridge
D)
FieldMapProcessor

Correct Answer :   JdbcWritableBridge


Explanation : JdbcWritableBridge class contains a set of methods which can read db columns from a ResultSet into Java types.

A)
FIELD_LIMITER
B)
RECORD_DELIMITER
C)
FIELD_DELIMITER
D)
None of the above

Correct Answer :   RECORD_DELIMITER


Explanation : Class RecordParser parses a record containing one or more fields.

A)
RecordParser
B)
LargeObjectLoader
C)
ProcessingException
D)
None of the above

Correct Answer :   RecordParser


Explanation : Multiple threads must use separate instances of RecordParser.

A)
SqoopWrite
B)
SqoopRead
C)
SqoopRecord
D)
None of the above

Correct Answer :   SqoopRecord


Explanation : SqoopRecord is an interface implemented by the classes generated by Sqoop’s orm.ClassWriter.

A)
IBM
B)
Cloudera
C)
Microsoft
D)
All of the above

Correct Answer :   Microsoft


Explanation : Sqoop allows users to import data from their relational databases into HDFS and vice versa.

A)
Hive
B)
BigTOP
C)
Imphala
D)
Map reduce

Correct Answer :   Map reduce


Explanation : While fetching, it throttles the number of mappers accessing data on RDBMS to avoid DDoS.

A)
Hive
B)
Sqoop
C)
Oozie
D)
Imphala

Correct Answer :   Sqoop


Explanation : Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop.

A)
Oracle
B)
MySQL
C)
PostreSQL
D)
SQL Server

Correct Answer :   SQL Server


Explanation : Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

A)
BLOB
B)
CLOB
C)
LONGVARBINARY
D)
All of the above

Correct Answer :   All of the above


Explanation : Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.

A)
10.2.0
B)
10.3.0
C)
11.2.0
D)
12.2.0

Correct Answer :   10.2.0


Explanation : Oracle is notable in its different approach to SQL from the ANSI standard and its non-standard JDBC driver. Therefore, several features work differently.

A)
JBat
B)
JBash
C)
JBatch
D)
None of the above

Correct Answer :   JBatch


Explanation : BatchEE provides a set of useful extensions for this specification.

A)
Blur
B)
Calcite
C)
JBatch
D)
All of the above

Correct Answer :   Calcite


Explanation : Calcite also provides advanced query optimization, for data not residing in a traditional database.

A)
Ignite
B)
Droids
C)
DataFu
D)
Corinthia

Correct Answer :   Corinthia


Explanation : The toolkit is small, portable, and flexible, with minimal dependencies.

A)
Lens
B)
Kylin
C)
MRQL
D)
log4cxx2

Correct Answer :   Kylin


Explanation : MRQL is a query processing and optimization system for large-scale, distributed data analysis.

A)
set
B)
relational
C)
structured
D)
flow-based

Correct Answer :   flow-based


Explanation : NiFi entered the Apache Incubator with Billie Rinaldi as its champion.

A)
HTML5
B)
C++
C)
Java
D)
Javascript

Correct Answer :   HTML5


Explanation : Ripple is a cross platform and cross runtime testing/debugging tool.

A)
Zeppelin
B)
ACE
C)
Abdera
D)
Accumulo

Correct Answer :   Zeppelin


Explanation : Zeppelin is used for general-purpose data processing systems such as Apache Spark, Apache Flink, etc.

A)
Abdera
B)
Zeppelin
C)
ACE
D)
Accumulo

Correct Answer :   ACE


Explanation : ACE allows you to manage and distribute artifacts.

A)
Buildr
B)
Bloodhound
C)
Cassandra
D)
All of the above

Correct Answer :   Bloodhound


Explanation : Buildr is a simple and intuitive build system for Java projects written in Ruby.

A)
Cazerra
B)
Cordova
C)
CouchDB
D)
All of the above

Correct Answer :   Cordova


Explanation : The project entered incubation as Callback, but decided to change its name to Cordova on 2011-11-28.

A)
CXF
B)
DeltaSpike
C)
DeltaCloud
D)
None of the above

Correct Answer :   CXF


Explanation : DeltaSpike is a collection of JSR-299 (CDI) Extensions for building applications on the Java SE and EE platforms.

A)
Oozie
B)
BigTop
C)
Imphala
D)
Chukwa

Correct Answer :   Chukwa


Explanation : Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness.

A)
0.90.5+.
B)
0.10.4+.
C)
0.90.4+.
D)
None of the above

Correct Answer :   0.90.4+.


Explanation : The Chukwa cluster management scripts rely on ssh; these scripts, however, are not required if you have some alternate mechanism for starting and stopping daemons.

A)
Agents
B)
HCatalog
C)
Collectors
D)
HBase Table

Correct Answer :   Agents


Explanation : Setting the option chukwaAgent.control.remote will disallow remote connections to the agent control socket.

A)
Agents
B)
Collectors
C)
HBase Table
D)
None of the above

Correct Answer :   Collectors


Explanation : Most commonly, collectors simply write all received data to HBase or HDFS.

A)
8008
B)
8080
C)
8070
D)
None of the above

Correct Answer :   8080


Explanation : The port number can be configured in chukwa-collector.conf.xml.

A)
PipelineWriter
B)
PipelineStageWriter
C)
SocketTeeWriter
D)
None of the above

Correct Answer :   SocketTeeWriter


Explanation : PipelineStageWriter lets you string together a series of PipelineableWriters for pre-processing or post-processing incoming data.

A)
Hive
B)
Oozie
C)
Imphala
D)
Ambari

Correct Answer :   Ambari


Explanation : The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.

A)
Ganglia
B)
Nagios
C)
Nagaond
D)
All of the above

Correct Answer :   Ganglia


Explanation : Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.

A)
view
B)
trigger
C)
schema
D)
none of the above

Correct Answer :   view

A)
RestLess
B)
RESTful
C)
Web Service
D)
None of the above

Correct Answer :   RESTful


Explanation : RESTful APIs enable integration with enterprise systems.

A)
SSL
B)
SSH
C)
Kerberos
D)
REST

Correct Answer :   Kerberos


Explanation : Kerberos requires a client side library and complex client side configuration.

A)
collector
B)
comparator
C)
load balancer
D)
all of the above

Correct Answer :   load balancer


Explanation : Knox is a stateless reverse proxy framework.

A)
SSL
B)
SSO
C)
SSH
D)
Kerberos

Correct Answer :   SSO


Explanation : Knox allows identities from those enterprise systems to be used for seamless, secure access to Hadoop clusters.

A)
SSH
B)
SSO
C)
SSL
D)
All of the above

Correct Answer :   SSH

A)
Zero
B)
Double
C)
Multiple
D)
Single

Correct Answer :   Single


Explanation : The Apache Knox Gateway is a system that provides a single point of authentication and access for Apache.

A)
AFS
B)
HCC
C)
ADF
D)
ASF

Correct Answer :   ASF


Explanation : Projects undergoing incubation at the Apache Software Foundation (ASF) are sponsored by the Apache Incubator PMC.

A)
3.4
B)
3.5
C)
3.6
D)
3.7

Correct Answer :   3.6


Explanation : The user should be able to install using a single update site for all Hadoop-related Eclipse tools.

A)
Indiavo
B)
Indigo
C)
Hadovo
D)
Rainbow

Correct Answer :   Indigo


Explanation : HDT aims at bringing plugins in eclipse to simplify development on Hadoop platform.

A)
RAP
B)
RBP
C)
RVP
D)
RXP

Correct Answer :   RAP


Explanation : RCP/RAP developers package has the core Eclipse PDE tools.

A)
Driver
B)
Mapper
C)
Reducer
D)
All of the above

Correct Answer :   All of the above


Explanation : HDT provides wizards for the creation of Hadoop Based Projects.

A)
Stonebraker
B)
Doug Cutting
C)
Matei Zaharia
D)
Mahek Zaharia

Correct Answer :   Matei Zaharia


Explanation : Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley.

A)
RDDs
B)
Spark SQL
C)
Spark Streaming
D)
All of the above

Correct Answer :   RDDs


Explanation : Spark SQL provides SQL language support, with command-line interfaces and ODBC/JDBC server.

A)
MLlib
B)
RDDs
C)
GraphX
D)
Spark Streaming

Correct Answer :   Spark Streaming


Explanation : Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.

A)
RDDs
B)
MLlib
C)
GraphX
D)
Spark Streaming

Correct Answer :   MLlib


Explanation : MLlib implements many common machine learning and statistical algorithms to simplify large scale machine learning pipelines.

A)
10
B)
20
C)
30
D)
80

Correct Answer :   10


Explanation : Spark architecture has proven scalability to over 8000 nodes in production.

A)
NF
B)
NR
C)
NG
D)
ND

Correct Answer :   NG


Explanation : Flume 1.3.0 has been put through many stress and regression tests, is stable, production-ready software, and is backwards-compatible with Flume 1.2.0.

A)
Sinks
B)
Decorator
C)
Source
D)
All of the above

Correct Answer :   Source


Explanation : A source can be any data source, and Flume has many predefined source adapters.

A)
Basic
B)
Mid
C)
Collector Tier Event
D)
Agent Tier Event

Correct Answer :   Agent Tier Event


Explanation : All agents in a specific tier could be given the same name; one configuration file with … Clients send events to agents; agents host a number of Flume components.

A)
Basic
B)
Agent Tier Event
C)
Collector Tier Event
D)
None of the above

Correct Answer :   Basic

A)
Lucy
B)
Lucene Core
C)
Solr
D)
All of the above

Correct Answer :   Lucene Core


Explanation : Lucene provides spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

A)
Lucy
B)
Solr
C)
PyLucene
D)
Lucene Core

Correct Answer :   Solr


Explanation : Solr provides hit highlighting, faceted search, caching, replication, and a web admin interface.

A)
OPR
B)
ORS
C)
ORP
D)
OSR

Correct Answer :   ORP


Explanation : The Open Relevance Project is used for relevance and performance testing.

A)
PMC
B)
RPC
C)
CPM
D)
All of the above

Correct Answer :   PMC


Explanation : PyLucene was previously hosted at the Open Source Applications Foundation.

A)
2 TB
B)
1 TB
C)
500 GB
D)
150GB

Correct Answer :   150GB


Explanation : Lucene offers powerful features through a simple API.

A)
20%
B)
50%
C)
70%
D)
100%

Correct Answer :   20%


Explanation : Lucene provides incremental indexing as fast as batch indexing.

A)
2
B)
3
C)
4
D)
5

Correct Answer :   3


Explanation : Just like Hadoop, Hama distinguishes between three modes.

A)
Local Mode
B)
Distributed Mode
C)
Pseudo Distributed Mode
D)
All of the above

Correct Answer :   Local Mode


Explanation : This mode is configured by setting the bsp.master.address property to local.

A)
groom
B)
grsvers
C)
grervers
D)
groomservers

Correct Answer :   groomservers


Explanation : Distributed Mode is used when you have multiple machines.

A)
ISP
B)
USP
C)
BSP
D)
MPP

Correct Answer :   BSP


Explanation : Running, completed, and failed jobs are detailed in the web UI.

A)
Pig
B)
Hama
C)
Hive
D)
Hadoop

Correct Answer :   Hama


Explanation : HAMA is a distributed framework on Hadoop for massive matrix algorithms.

A)
SerDE
B)
DocSear
C)
SaerDear
D)
All of the above

Correct Answer :   SerDE


Explanation : By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

A)
DCL
B)
DDL
C)
DML
D)
TCL

Correct Answer :   DDL


Explanation : HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands.

A)
HCatLoad
B)
HCLoader
C)
Both (A) and (B)
D)
HCatLoader

Correct Answer :   HCatLoader


Explanation : HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement.

A)
InputFormat
B)
OutputFormat
C)
HCatInputFormat
D)
HCatOutputFormat

Correct Answer :   HCatInputFormat


Explanation : The HCatalog interface for MapReduce — HCatInputFormat and HCatOutputFormat — is an implementation of Hadoop InputFormat and OutputFormat.
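
For example, a MapReduce job can read an HCatalog-managed table by configuring HCatInputFormat. A minimal sketch, assuming the org.apache.hive.hcatalog.mapreduce package and a hypothetical table mytable in the default database:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "hcat-read");
            // Tell HCatalog which database and table to read from.
            HCatInputFormat.setInput(job, "default", "mytable");
            // HCatInputFormat is a regular Hadoop InputFormat implementation.
            job.setInputFormatClass(HCatInputFormat.class);
            // ... set the mapper, reducer and output as usual, then submit.
        }
    }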

A)
put
B)
get
C)
setOutput
D)
setOut

Correct Answer :   setOutput


Explanation : You can write to multiple partitions if the partition key(s) are columns in the data being stored.
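
A minimal write-side sketch, assuming the same org.apache.hive.hcatalog.mapreduce package, a hypothetical table mytable and a hypothetical partition column named date; setOutput must be the first HCatOutputFormat call made on the job:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

    public class HCatWriteSketch {
        public static void configure(Job job) throws Exception {
            // Writing to a single partition: supply the partition key and value.
            Map<String, String> partition = new HashMap<>();
            partition.put("date", "20240101");   // hypothetical partition column
            HCatOutputFormat.setOutput(job,
                    OutputJobInfo.create("default", "mytable", partition));
            // Passing null instead of the map writes to multiple partitions,
            // provided the partition key(s) are columns in the stored data.
            job.setOutputFormatClass(HCatOutputFormat.class);
        }
    }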

A)
Hive
B)
Pig
C)
Hama
D)
Oozie

Correct Answer :   Hive


Explanation : Partitions are multi-dimensional and not hierarchical. Records are divided into columns.

A)
hcat.append.limit
B)
hcat.desired.partition.num.splits
C)
hcatalog.hive.client.cache.disabled
D)
hcatalog.hive.client.cache.expiry.time

Correct Answer :   hcatalog.hive.client.cache.expiry.time


Explanation : This property is an integer and specifies a number of seconds.
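
For instance, the expiry can be set on the job configuration; a small sketch with an arbitrary value of 120 seconds:

    import org.apache.hadoop.conf.Configuration;

    public class HCatClientCacheSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Expire cached Hive clients after 120 seconds (the value is in seconds).
            conf.setInt("hcatalog.hive.client.cache.expiry.time", 120);
        }
    }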

A)
HCatStam
B)
HCatStorer
C)
HamaStorer
D)
All of the above

Correct Answer :   HCatStorer


Explanation : HCatStorer is accessed via a Pig store statement.

A)
short
B)
datetime
C)
decimal
D)
biginteger

Correct Answer :   short


Explanation : Hive 0.12.0 and earlier releases support writing Pig primitive data types with HCatStorer.

A)
setOut
B)
OutputSchema
C)
setOutput
D)
setOutputSchema

Correct Answer :   setOutput


Explanation : Any other call will throw an exception saying the output format is not initialized.

A)
DROP TABLE
B)
CREATE VIEW
C)
SHOW FUNCTIONS
D)
ALTER INDEX … REBUILD

Correct Answer :   ALTER INDEX … REBUILD


Explanation : Any command which is not supported throws an exception with the message “Operation Not Supported”.

A)
Perl
B)
Java
C)
Python
D)
Javascript

Correct Answer :   Java


Explanation : Math operations are focused on linear algebra and statistics.

A)
collocation
B)
collection
C)
compaction
D)
none of the above

Correct Answer :   collocation


Explanation : The log-likelihood score indicates the relative usefulness of a collocation with regards other term combinations in the text.

A)
Collfilter
B)
ShngleFil
C)
SingleFilter
D)
ShingleFilter

Correct Answer :   ShingleFilter


Explanation : The tools in which the collocation identification algorithm is embedded either consume tokenized text as input or allow an implementation of the Lucene Analyzer class to be specified to perform tokenization and form n-grams.

A)
lcr
B)
lbr
C)
llr
D)
lar

Correct Answer :   llr


Explanation : The –minLLR option can be used to control the cutoff that prevents collocations below the specified LLR score from being emitted.

A)
CarDriver
B)
CollocDriver
C)
CollocationDriver
D)
All of the above

Correct Answer :   CollocDriver


Explanation : Each call to the mapper passes in the full set of tokens for the corresponding document using a StringTuple.

A)
CollocMerger
B)
CollocReducer
C)
CollocCombiner
D)
None of the above

Correct Answer :   CollocCombiner


Explanation : The combiner treats the entire GramKey as the key, so identical tuples from separate documents are passed into a single call to the combiner’s reduce method; their frequencies are summed and a single tuple is emitted via the collector.

A)
semi-structured
B)
structured
C)
unstructured
D)
none of the above

Correct Answer :   semi-structured


Explanation : Drill is an Apache open-source SQL query engine for Big Data exploration.

A)
Oozie
B)
MapR
C)
Impala
D)
All of the above

Correct Answer :   MapR


Explanation : The MapR Sandbox with Apache Drill is a fully functional single-node cluster that can be used to get an overview on Apache Drill in a Hadoop environment.

A)
JDBC
B)
ODBC
C)
ODBC-JDBC
D)
None of the above

Correct Answer :   ODBC


Explanation : Drill conforms to the stringent ANSI SQL standards ensuring compatibility with existing BI environments as well as Hive deployments.

A)
Oozie
B)
Mahout
C)
Both (A) and (B)
D)
Drill

Correct Answer :   Drill


Explanation : Users can explore live data on their own as it arrives versus spending weeks or months on data preparation, modeling, ETL and subsequent schema management.

A)
int
B)
simple
C)
nested
D)
all of the above

Correct Answer :   nested


Explanation : Users can also plug-and-play with Hive environments to enable ad-hoc low latency queries on existing Hive tables and reuse Hive’s metadata, hundreds of file formats and UDFs out of the box.

A)
union
B)
OR
C)
intersection
D)
None of the above

Correct Answer :   union


Explanation : Union operation takes a series of distinct PCollections that all have the same data type and treats them as a single virtual PCollection.
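
A minimal Crunch sketch of the union operation, assuming two hypothetical input files that contain records of the same type:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    public class UnionSketch {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(UnionSketch.class);
            PCollection<String> logsA = pipeline.readTextFile("/data/logs-a");
            PCollection<String> logsB = pipeline.readTextFile("/data/logs-b");
            // Both inputs have the same data type, so they can be treated as
            // a single virtual PCollection.
            PCollection<String> all = logsA.union(logsB);
            pipeline.writeTextFile(all, "/data/logs-all");
            pipeline.done();
        }
    }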

A)
Pipeline
B)
WritePipe
C)
MyPipeline
D)
ReadPipeline

Correct Answer :   MyPipeline


Explanation : Inner classes contain references to their parent outer classes, so unless MyPipeline implements the Serializable interface, the NotSerializableException will be thrown when Crunch tries to serialize the inner DoFn.
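
One common fix, sketched below with illustrative names, is to define the DoFn as a static nested (or top-level) class so it holds no hidden reference to the enclosing pipeline object:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class MyPipeline {
        // A static nested DoFn carries no reference to MyPipeline, so Crunch
        // can serialize it without throwing NotSerializableException.
        static class ToUpperFn extends DoFn<String, String> {
            @Override
            public void process(String input, Emitter<String> emitter) {
                emitter.emit(input.toUpperCase());
            }
        }
    }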

A)
TaskInputContext
B)
TaskInputOutputContext
C)
TaskOutputContext
D)
All of the above

Correct Answer :   TaskInputOutputContext


Explanation : There are also a number of helper methods for working with the objects associated with the TaskInputOutputContext.

A)
org.apache.scrunch
B)
org.apache.kcrunch
C)
Both (A) and (B)
D)
org.apache.crunch

Correct Answer :   org.apache.crunch


Explanation : Each of these specialized DoFn implementations has associated methods on the PCollection, PTable, and PGroupedTable interfaces to support common data processing steps.
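
For instance, FilterFn is one such specialized DoFn, paired with the filter method on PCollection; a small sketch with an illustrative predicate:

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.PCollection;

    public class FilterSketch {
        // FilterFn is a specialized DoFn: implement accept() instead of process().
        static class NonEmptyFn extends FilterFn<String> {
            @Override
            public boolean accept(String input) {
                return input != null && !input.isEmpty();
            }
        }

        public static PCollection<String> keepNonEmpty(PCollection<String> lines) {
            // PCollection.filter() is the helper method backed by FilterFn.
            return lines.filter(new NonEmptyFn());
        }
    }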

A)
NLineInputFormat
B)
LineInputFormat
C)
InputLineFormat
D)
None of the above

Correct Answer :   NLineInputFormat


Explanation : We can set the value of this parameter via the Source interface’s inputConf method.

A)
Grouping
B)
RowGrouping
C)
GroupingOptions
D)
None of the above

Correct Answer :   GroupingOptions


Explanation : The GroupingOptions class is immutable.
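
Because the class is immutable, instances are assembled with its builder; a minimal sketch using an arbitrary reducer count:

    import org.apache.crunch.GroupingOptions;

    public class GroupingOptionsSketch {
        public static GroupingOptions tenReducers() {
            // The builder collects settings; build() returns an immutable object.
            return GroupingOptions.builder()
                    .numReducers(10)
                    .build();
        }
    }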

A)
Oozie v2
B)
Oozie v3
C)
Oozie v4
D)
Oozie v5

Correct Answer :   Oozie v4


Explanation : Oozie combines multiple jobs sequentially into one logical unit of work.

A)
Elliptical
B)
Acyclical
C)
Cyclical
D)
All of the above

Correct Answer :   Acyclical


Explanation : Oozie is a framework that allows multiple Map/Reduce jobs to be combined into one logical unit of work.

A)
START
B)
PREP
C)
RESUME
D)
END

Correct Answer :   PREP


Explanation : Possible states for a workflow jobs are: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED and FAILED.

A)
BigTop
B)
Oozie
C)
Flume
D)
Impala

Correct Answer :   BigTop


Explanation : Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark.

A)
SUSE
B)
Ubuntu
C)
Fedora
D)
Solaris

Correct Answer :   Solaris


Explanation : Bigtop components power the leading Hadoop distros and support many Operating Systems, including Debian/Ubuntu, CentOS, Fedora, SUSE and many others.

A)
“Bigtop”
B)
“Big-trunk”
C)
“Bigtop-trunk”
D)
None of the above

Correct Answer :   “Bigtop-trunk”


Explanation : The Jenkins server in turn runs several test jobs.

A)
Bigtop-VM-matrix
B)
Bigtop-trunk-repository
C)
Bigtop-trunk-packagetest
D)
None of the above

Correct Answer :   Bigtop-trunk-repository


Explanation : Bigtop-trunk-packagetest runs the package tests.

A)
BigTop
B)
Impala
C)
Oozie
D)
Lucene

Correct Answer :   Impala

A)
Cloudera
B)
IBM
C)
MicroSoft
D)
All of the above

Correct Answer :   Cloudera


Explanation : Impala is open source (Apache License), so you can self-support in perpetuity if you wish.

A)
Kafka
B)
BigTop
C)
Storm
D)
Lucene

Correct Answer :   Storm


Explanation : Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

A)
YARN
B)
Scheduler
C)
Both (A) and (B)
D)
Compaction

Correct Answer :   Compaction

A)
MapR
B)
Cloudera
C)
Hortonworks
D)
Local Cloudera

Correct Answer :   Hortonworks


Explanation : The Storm community is working to improve capabilities related to three important themes: business continuity, operations and developer productivity.

A)
Nimbus
B)
Supervisor
C)
Zookeeper
D)
None of the above

Correct Answer :   Supervisor


Explanation : ZooKeeper nodes coordinate the Storm cluster.

A)
log aggregation
B)
collection
C)
compaction
D)
all of the above

Correct Answer :   log aggregation


Explanation : Log aggregation typically collects physical log files off servers and puts them in a central place.

A)
BigTop
B)
Impala
C)
ActiveMQ
D)
Zookeeper

Correct Answer :   Zookeeper


Explanation : You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

A)
log.index.enable
B)
log.retention
C)
log.cleaner.enable
D)
log.flush.interval.message

Correct Answer :   log.retention


Explanation : The log.cleaner.enable configuration must be set to true for log compaction to run.
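
A sketch of the broker-side setting, which normally lives in the broker's server.properties file but is shown here as Java properties for illustration:

    import java.util.Properties;

    public class BrokerConfigSketch {
        public static Properties compactionEnabled() {
            Properties props = new Properties();
            // Log compaction only runs when the log cleaner is enabled.
            props.put("log.cleaner.enable", "true");
            return props;
        }
    }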

A)
Drill
B)
Directory
C)
DeviceMap
D)
DirectMemory

Correct Answer :   DeviceMap


Explanation : Drill is a distributed system for interactive analysis of large-scale datasets.

A)
ESME
B)
Directory
C)
Empire-db
D)
All of the above

Correct Answer :   ESME


Explanation : ESME allows people to discover and meet one another and get controlled access to other sources of information, all in a business process context.

A)
Etch
B)
ESME
C)
Empire-db
D)
DirectoryMap

Correct Answer :   Etch


Explanation : Etch is a cross-platform, language- and transport-independent framework.

A)
Flex
B)
Flume
C)
Flink
D)
ESME

Correct Answer :   Flink


Explanation : Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with efficiency and out-of-core execution.

A)
Oozie
B)
FtpServer
C)
Giraph
D)
Gereition

Correct Answer :   FtpServer


Explanation : Giraph is a large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing framework.

A)
iBix
B)
iBAT
C)
Helix
D)
iBATIS

Correct Answer :   iBATIS


Explanation : iBATIS couples objects with stored procedures or SQL statements using an XML descriptor.

A)
Nutch
B)
Oozie
C)
Imphala
D)
Manmgy

Correct Answer :   Nutch


Explanation : Oozie is server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop.

A)
Olingo
B)
Bigred
C)
Nuvem
D)
Onami

Correct Answer :   Onami


Explanation : Apache Onami aims to create a community focused on the development and maintenance of a set of Google Guice extensions.

A)
RAT
B)
RTA
C)
Qpid
D)
All of the above

Correct Answer :   Qpid


Explanation : RAT became part of new Apache Creadur TLP.

A)
Rave
B)
Samza
C)
ServiceMix
D)
All of the above

Correct Answer :   ServiceMix


Explanation : The ServiceMix project is related to Geronimo and was developed by James.

A)
ADH
B)
CDH
C)
BDH
D)
MDH

Correct Answer :   CDH


Explanation : Cloudera’s open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology.

A)
Manager
B)
Express
C)
Standard
D)
Enterprise

Correct Answer :   Manager


Explanation : All versions may be downloaded from Cloudera’s website.

A)
Zero
B)
One
C)
Two
D)
Three

Correct Answer :   Three


Explanation : Cloudera Enterprise comes in three editions: Basic, Flex, and Data Hub.

A)
flexibilty
B)
scalability
C)
multi-tenancy
D)
all of the above

Correct Answer :   multi-tenancy


Explanation : Cloudera Express offers the fastest and easiest way to get your Hadoop cluster up and running and to explore your first use cases.

A)
Ubuntu
B)
Windows 7
C)
Windows 8
D)
Windows Server

Correct Answer :   Windows Server


Explanation : Win32 is supported as a development platform.

A)
Collection
B)
Streaming
C)
Orchestration
D)
All of the above

Correct Answer :   Streaming


Explanation : These external jobs can be written in various programming languages such as Python or Ruby.

A)
two
B)
three
C)
four
D)
five

Correct Answer :   two


Explanation : One is the standard Hadoop Hive console; the other is unique in the Hadoop world and is based on JavaScript.

A)
had
B)
start
C)
hadoop
D)
hadstrat

Correct Answer :   hadoop


Explanation : HDInsight is the framework for the Microsoft Azure cloud implementation of Hadoop.

A)
.NET
B)
Ubuntu
C)
Hadoop
D)
None of the above

Correct Answer :   .NET


Explanation : The Microsoft .NET Library for Avro implements the Apache Avro compact binary data interchange format for serialization for the Microsoft .NET environment.

A)
Pig
B)
Hive
C)
Ubuntu
D)
Windows Server

Correct Answer :   Hive


Explanation : Amazon EMR supports several versions of Hive, which you can install on any running cluster.

A)
org.apache.hadoop.hive.ql.io.CombineFormat
B)
org.apache.hadoop.hive.ql.iont.CombineHiveInputFormat
C)
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
D)
All of the above

Correct Answer :   org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
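
This class is supplied through Hive's hive.input.format property; a hedged Java sketch that sets it on a Hadoop Configuration (the same value is commonly applied with a Hive set command):

    import org.apache.hadoop.conf.Configuration;

    public class HiveInputFormatSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Ask Hive to combine small input files into larger splits.
            conf.set("hive.input.format",
                    "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat");
        }
    }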

A)
EC2
B)
EC3
C)
EC4
D)
None of the above

Correct Answer :   EC2


Explanation : Amazon EMR has made enhancements to Hadoop and other open-source applications to work seamlessly with AWS.

A)
AMR
B)
ASQ
C)
AWES
D)
AWS

Correct Answer :   AWS


Explanation : Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to process large amounts of data efficiently.

A)
MPA
B)
MAP
C)
MPP
D)
None of the above

Correct Answer :   MPP


Explanation : Impala avoids Hive’s overhead from creating MapReduce jobs, giving it faster query times than Hive.

A)
IamWatch
B)
CloudWatch
C)
AmWatch
D)
All of the above

Correct Answer :   CloudWatch


Explanation : The current AMIs for all CoreOS channels and EC2 regions are updated frequently.

A)
EC2
B)
EC3
C)
EC4
D)
All of the above

Correct Answer :   EC2


Explanation : Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use.

A)
EC2
B)
EC3
C)
EC4
D)
All of the above

Correct Answer :   EC3


Explanation : Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

A)
S5
B)
S4
C)
S3
D)
S2

Correct Answer :   S2


Explanation : Amazon S3 stands for Amazon Simple Storage Service.

A)
InfoData
B)
InfoSphere
C)
InfoStream
D)
InfoSurface

Correct Answer :   InfoStream


Explanation : The InfoStream platform provides an enterprise-class foundation for information-intensive projects, delivering the performance, scalability, reliability and acceleration needed to simplify difficult challenges and deliver trusted information to your business faster.

A)
1
B)
2
C)
3
D)
4

Correct Answer :   3


Explanation : InfoSphere DataStage also facilitates extended metadata management and enterprise connectivity.

A)
Solaris
B)
Debian
C)
Ubuntu
D)
Windows

Correct Answer :   Windows


Explanation : IBM InfoSphere DataStage can integrate high volumes of data on demand across multiple data sources and target applications using a high-performance parallel framework.

A)
TX
B)
MVS Edition
C)
Server Edition
D)
Enterprise Edition

Correct Answer :   Enterprise Edition


Explanation : DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.

A)
TX
B)
PX
C)
MVS Edition
D)
Server Edition

Correct Answer :   TX


Explanation : MVS Edition jobs are developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.

A)
Info Server
B)
Information Server
C)
Data Server
D)
All of the above

Correct Answer :   Information Server


Explanation : IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of products that enable you to understand, cleanse, monitor, transform, and deliver data.