ETL Testing Interview Questions
ETL stands for Extraction, Transformation, and Loading. It is an essential concept in Data Warehousing systems. There are three basic steps in the data integration process.
 
* Extraction refers to extracting the data from different data sources such as transactional systems or applications. 
* Transformation refers to applying conversion rules to the data so that it becomes suitable for analytical reporting. 
* Loading involves moving the data into the target system, i.e., the Data Warehouse.
ETL Testing refers to the process of validating, verifying, and qualifying data while avoiding duplication and preventing data loss. This process is important to ensure that data is delivered securely to the right destination and that the data quality is impeccable before its usage in BI (Business Intelligence) reports.
Some of the notable benefits of ETL Testing are : 
 
* Ensures data is transformed efficiently and quickly from one system to another.
* Data quality issues during ETL processes, such as duplicate data or data loss, can also be identified and prevented by ETL testing.
* Assures that the ETL process itself is running smoothly and is not hampered.
* Ensures that all data implemented is in line with client requirements and provides accurate output.
* Ensures that bulk data is moved to the new destination completely and securely.
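As a minimal sketch of the duplication and data-loss checks mentioned above, an ETL tester typically reconciles row counts between source and target and looks for duplicate business keys; the table and column names here (src_orders, dw_orders, order_id) are hypothetical.

```sql
-- Row-count reconciliation between source and loaded target (hypothetical tables)
SELECT 'source' AS side, COUNT(*) AS row_count FROM src_orders
UNION ALL
SELECT 'target' AS side, COUNT(*) AS row_count FROM dw_orders;

-- Duplicate check on the business key in the target
SELECT order_id, COUNT(*) AS dup_count
FROM   dw_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;
```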
The three layers in the ETL are:
 
Staging Layer : Staging layer is used to store the data which is extracted from the different data source systems.

Data Integration Layer : The integration layer transforms the data from the staging layer and moves it into a database. In the database, the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of fact and dimension tables in a data warehouse system is called a schema.

Access Layer : Access layer is used by the end-users to retrieve the data for analytical reporting.
* To keep an eye on data that is being transferred from one system to another
* To keep track of the efficiency and speed of the process
* To achieve fast and accurate results
* Stage Tables
* Business Logic Transformation
* Loading the target table from the staging table once the transformation has been applied.

Responsibilities of an ETL tester are :
 
* Tests the ETL software thoroughly.
* Checks the test components of the ETL Data Warehouse.
* Executes data-driven tests in the backend.
* Creates the design and executes the test cases, test plans, test harness, etc.
* Identifies problems and suggests the best solutions.
* Approves the requirements and design specifications.
* Transfers the data from flat files.
* Writes SQL queries for the different test scenarios.
* Cognos Decision Stream
* Oracle Warehouse Builder
* Business Objects XI
* SAS Business Warehouse
* SAS Enterprise ETL Server
* Open-source CloverETL
* IBM WebSphere DataStage
* Informatica PowerCenter
 
Business Intelligence refers to gathering, storage, and analysis of data with the objective of converting raw data into actionable information which can be used to make better business decisions.
| ETL Tools | BI Tools |
| --- | --- |
| ETL tools are used to extract data from different data sources, transform the data, and load it into a data warehouse system. | BI tools are used to generate interactive and ad-hoc reports for end users and data visualizations for monthly, quarterly, and annual board meetings. |
| Commonly used ETL tools are Informatica, SAP BO Data Services, Microsoft SSIS, Oracle Data Integrator (ODI), the open-source CloverETL, etc. | Commonly used BI tools are SAP Lumira, IBM Cognos, the Microsoft BI platform, Tableau, Oracle Business Intelligence Enterprise Edition, etc. |
ETL Testing includes :
 
* Verifying whether the data is transformed accurately according to business requirements
* Verifying that the projected data is loaded into the data warehouse without any truncation or data loss
* Making sure that the ETL application reports any invalid data and replaces it with default values
* Making sure that the data loads within the expected time frame to improve scalability and performance
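One hedged way to express these checks in SQL is a set-difference query that flags source records missing or altered in the target, plus a check that invalid values were replaced with the agreed default. The object names (src_customer, dw_customer, ref_country, the 'UNK' default) are assumptions for illustration; Oracle uses MINUS where other databases use EXCEPT.

```sql
-- Records present in the source but missing or changed in the target
SELECT customer_id, customer_name, country_code
FROM   src_customer
EXCEPT                                   -- MINUS in Oracle
SELECT customer_id, customer_name, country_code
FROM   dw_customer;

-- Verify invalid country codes were replaced with the agreed default ('UNK')
SELECT COUNT(*) AS unexpected_rows
FROM   dw_customer d
WHERE  d.country_code IS NULL
   OR (d.country_code <> 'UNK'
       AND NOT EXISTS (SELECT 1 FROM ref_country r
                       WHERE  r.country_code = d.country_code));
```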
Input/Output Bug : This type of bug causes the application to take invalid values in place of valid ones.

Boundary Value Analysis Bug :
Bugs of this type appear at the minimum and maximum boundary values.

Calculation bugs :
These bugs are usually mathematical errors causing incorrect results.

ECP (Equivalence Class Partitioning) Bug :
A bug of this type produces results of an invalid type.

User Interface Bug :
 GUI bugs include issues with color selection, font style, navigation, spelling check, etc.

Hardware Bugs :
This type of bug prevents the device from responding to an application as expected.

Load Condition Bugs :
 A bug like this prevents multiple users from using the application and may reject user-accepted data.

Version Control Bugs :
 These bugs normally surface during regression testing, when no version details are provided.

Race Condition Bugs :
 This type of bug interferes with your system’s ability to function properly and causes it to crash or hang.

Help Source Bugs :
The help documentation will be incorrect due to this bug.
| ETL Testing | Manual Testing |
| --- | --- |
| It is an automated process, so no special technical knowledge is needed beyond understanding the software. | It is a manual process, so it requires technical expertise in SQL and shell scripting. |
| It is extremely fast and systematic, and it delivers excellent results. | It is time-consuming and highly prone to errors. |
| Databases and their record counts are central to ETL testing. | Manual testing focuses on the program's functionality. |
| Metadata is included and can easily be altered. | It lacks metadata, and changes require more effort. |
| It covers error handling, log summaries, and load progress, which eases the developer's and maintainer's workload. | From a maintenance perspective, it requires maximum effort. |
| It is very good at handling historical data. | As data increases, the processing time increases. |
Data warehousing is a broader concept than data mining. Data mining involves extracting hidden information from the data and interpreting it for future forecasting. In contrast, data warehousing includes operations such as analytical reporting to generate detailed and ad-hoc reports, and information processing to generate interactive dashboards and charts.
| OLTP | OLAP |
| --- | --- |
| OLTP stands for Online Transactional Processing. | OLAP stands for Online Analytical Processing. |
| OLTP is a relational database, and it is used to manage day-to-day transactions. | OLAP is a multidimensional system, and it is also called a data warehouse. |
* Before ETL tools, users had to write extensive code for data transformation and data loading.
* ETL tools make life more comfortable: one tool manages all the scenarios of transforming and loading the data.

Here are a few examples of where ETL is used:
 
Example : Data Warehousing
 
ETL is used in the data warehousing concept. Here, we need to fetch the data from multiple different systems and load it into the data warehouse database. The ETL concept is used here to extract the data from the source, transform the data, and load it into the target system.
 
Example : Data Migration
 
Data migration is a difficult task if we are writing everything by hand in PL/SQL. If we want to migrate the data in a simpler way, we use ETL tools.
 
Example : Mergers and Acquisitions
 
These days, many companies are being merged into larger MNCs. To move the data from one company to another, the ETL concept is needed.
A fact table without measures is known as a factless fact table. It can be used to count the occurrences of events. For example, it is used to record an event such as the employee headcount in a company.
 
The numeric data stored in the columns of a fact table is known as measures.
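A hedged sketch of a factless fact table: it holds only dimension keys and no numeric measures, so counting rows answers "how many events occurred". The names (fact_employee_attendance, dim_employee, dim_date) are illustrative.

```sql
-- Factless fact table: only foreign keys, no measures
CREATE TABLE fact_employee_attendance (
    employee_key INTEGER NOT NULL REFERENCES dim_employee (employee_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key)
);

-- The "measure" is derived simply by counting event rows
SELECT date_key, COUNT(*) AS employees_present
FROM   fact_employee_attendance
GROUP  BY date_key;
```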
The Lookup Transformation is useful for :
 
* Getting a related value from a table using a column value
* Updating a slowly changing dimension table
* Verifying whether records already exist in the table (a SQL analogue is sketched below)
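In plain SQL, a lookup behaves much like an outer join against a reference table: fetch a related value by key and check whether the record already exists. The names used here (stg_sales, dim_product) are hypothetical.

```sql
-- Look up the product name and flag rows missing from the dimension table
SELECT s.product_code,
       d.product_name,
       CASE WHEN d.product_code IS NULL THEN 'NOT FOUND' ELSE 'EXISTS' END AS lookup_status
FROM   stg_sales      s
LEFT JOIN dim_product d
       ON d.product_code = s.product_code;
```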
Cubes are data processing units comprised of fact tables and dimensions from the data warehouse. They provide a multi-dimensional analysis.
 
OLAP stands for ‘Online Analytical Processing,’ and OLAP cubes store voluminous data in a multi-dimensional form for reporting purposes. They consist of facts, called ‘measures,’ categorized by dimensions.
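A real OLAP cube is built in a dedicated engine, but standard SQL's GROUP BY CUBE (supported by Oracle, SQL Server, and PostgreSQL, among others) gives a feel for the idea: a measure aggregated across every combination of dimensions. The fact_sales table and its columns are assumed for illustration.

```sql
-- Aggregate the sales measure across all combinations of two dimensions
SELECT region,
       product_category,
       SUM(sales_amount) AS total_sales
FROM   fact_sales
GROUP  BY CUBE (region, product_category);
```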
There are needs and situations when data has to be deleted from the data warehouse, and deleting data in bulk is a daunting task. Purging is an approach that can delete multiple files or records at the same time while maintaining speed and efficiency. A lot of extra space can be freed up this way.
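A minimal sketch of purging, assuming a hypothetical fact_sales table with a load_date column and a fixed retention cutoff; in practice, dropping whole partitions is usually faster than row-by-row deletes.

```sql
-- Purge warehouse rows older than the retention cutoff date
DELETE FROM fact_sales
WHERE  load_date < DATE '2018-01-01';
```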
When it comes to updating the master table, the dynamic cache is the one to opt for. Users are also free to use it for changing dimensions. On the other side, users can simply manage flat files through the static cache.

It is possible to deploy both the Dynamic and the Static Cache at the same time depending on the task and the overall complexity of the final outcome.
| Power Mart | Power Center |
| --- | --- |
| It only processes small amounts of data and is considered good when the processing requirements are low. | It is considered good when the amount of data to be processed is high, as it processes bulk data in a short period of time. |
| ERP sources are not supported. | ERP sources such as SAP, PeopleSoft, etc. are supported. |
| It only supports local repositories. | Local and global repositories are supported. |
| There is no facility for turning a local repository into a global repository. | It is capable of converting local repositories into global ones. |
| Session partitions are not supported. | It supports session partitioning to improve the performance of ETL transactions. |
A Data Mart is a simple form of a Data Warehouse, and it is focused on a single functional area. It gets data from only a few sources.
 
For example : In an organization, data marts may exist for marketing, finance, human resources, and other individual departments, each storing the data related to its specific functions.
Here, we take an example to describe how a dimension table is distinguished from a fact table.
 
Suppose a company sells its products to its customers. Every sale is a fact that occurs within the company, and the fact table is used to record these facts. Each fact table stores the keys that join it to the dimension tables, along with the measures/facts.
 
Example: Fact Units

| Cust_ID | Prod_ID | Time_ID | No. of units sold |
| --- | --- | --- | --- |
| 101 | 24 | 1 | 25 |
| 102 | 25 | 2 | 15 |
| 103 | 26 | 3 | 30 |

A dimension table stores the attributes, or dimensions, that describe the objects in a fact table. Dimension tables are a set of companion tables to a fact table.

| Cust_ID | Cust_Name | Gender |
| --- | --- | --- |
| 101 | Sana | F |
| 102 | Jass | M |
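To tie the example tables together, a typical star-schema query joins the fact table to the customer dimension on the shared key and aggregates the measure. The table names (fact_units, dim_customer) and the units_sold column are assumptions standing in for the columns shown above.

```sql
-- Units sold per customer, joining the fact table to the customer dimension
SELECT d.Cust_Name,
       SUM(f.units_sold) AS total_units
FROM   fact_units   f
JOIN   dim_customer d ON d.Cust_ID = f.Cust_ID
GROUP  BY d.Cust_Name;
```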
Both data mining and data warehousing are powerful data analysis and storage techniques.  
 
Data warehousing : To generate meaningful business insights, it involves compiling and organizing data from various sources into a common database. In a data warehouse, data are cleaned, integrated, and consolidated to support management decision-making processes. Subject-oriented, integrated, time-variant, and nonvolatile data can be stored within a data warehouse.
 
Data mining : Also referred to as KDD (Knowledge Discovery in Databases), it involves searching for and identifying hidden, relevant, and potentially valuable patterns in large data sets. An important goal of data mining is to discover previously unknown relationships among the data. Through data mining, insights can be extracted that can be used for things such as marketing, fraud detection, and scientific discoveries.
| Data Warehousing | Data Mining |
| --- | --- |
| It involves gathering all relevant data for analytics in one place. | It extracts data from large datasets. |
| Data extraction and storage help facilitate easier reporting. | It identifies patterns by using pattern recognition techniques. |
| Engineers are solely responsible for data warehousing, and data is stored periodically. | Data mining is carried out by business users in conjunction with engineers, and data is analyzed regularly. |
| In addition to making data mining easier and more convenient, it helps sort and upload important data to databases. | It makes analyzing information and data easier. |
| It can accumulate a large amount of irrelevant and unnecessary data; loss and erasure of data can also be problematic. | Done incorrectly, it can lead to data breaches and hacking, since data mining isn't always 100% accurate. |
| Data mining cannot take place without this process, since it compiles and organizes data into a common database. | Because the process requires compiled data, it always takes place after data warehousing. |
| Data warehouses simplify every type of business data. | Comparatively, data mining techniques are inexpensive. |
ETL pipelines are the mechanisms to perform ETL processes. This involves a series of processes or activities required for transferring data from one or more sources into the data warehouse for analysis, reporting and data synchronization. It is important to move, consolidate, and alter source data from multiple systems to match the parameters and capabilities of the destination database in order to provide valuable insights. 
Benefits are : 
 
* They reduce errors, bottlenecks, and latency, ensuring the smooth flow of information between systems.
* With ETL pipelines, businesses are able to achieve competitive advantage.
* The ETL pipeline can centralize and standardize data, allowing analysts and decision-makers to easily access and use it.
* It facilitates data migrations from legacy systems to new repositories.
* A Data Warehouse is a database that is separate from the operational database and stores historical data.
* The Data Warehouse database contains analytical as well as transactional data.
* A Data Warehouse is used for data analysis and reporting purposes.
* A Data Warehouse helps higher management take strategic and tactical decisions using historical or current data.
* A Data Warehouse helps business users analyze current trends to run the business.
ETL tools are generally used in data migration projects. If an organization was previously managing its data in Oracle 10g and now wants to use a SQL Server cloud database, the data needs to be moved from source to target. ETL tools are very useful for this kind of movement. Writing the ETL code by hand in PL/SQL or T-SQL is a very time-consuming process; ETL tools keep it simple. So the ETL process is useful in data migration projects.
Facts can be divided into three basic types, as follows : 
 
Additive : Facts that are fully additive are the most flexible and useful. We can sum up additive facts across any dimension associated with the fact table.

Semi-additive : We can sum up semi-additive facts across some dimensions associated with the fact table, but not all.

Non-Additive : Non-additive facts cannot be summed up across any dimension. A ratio is an example of a non-additive fact.
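A hedged illustration of the difference: an additive measure such as units_sold can be summed across any dimension, while a non-additive ratio such as a profit margin must be recomputed from its additive components rather than summed. Table and column names are assumptions.

```sql
-- Additive fact: safe to sum across any dimension
SELECT prod_id, SUM(units_sold) AS total_units
FROM   fact_units
GROUP  BY prod_id;

-- Non-additive fact: do not sum the stored ratio; recompute it from additive parts
SELECT prod_id,
       SUM(profit_amount) / SUM(sales_amount) AS profit_margin
FROM   fact_sales
GROUP  BY prod_id;
```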
An Operational Data Store (ODS) is a type of database that's often used as an interim logical area for a data warehouse. ODSes are designed to integrate data from multiple sources for lightweight data processing activities such as operational reporting and real-time analysis.
 
While in the ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules. An ODS can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried out while business operations are occurring. This is where most of the data used in current operations is housed before it's transferred to the data warehouse for longer-term storage or archiving.
 
ODSes are commonly used in online transaction processing applications, which involve processing transactional data. These applications use the quick, lightweight processing that ODS tools provide. ODS systems enable more comprehensive trend analysis reporting and processing across many different systems simultaneously.
An operational data store usually stores and processes data in real time. An ODS is connected to multiple data sources and pulls data into a central location.

The way operational data stores work is comparable to the extract, transform and load (ETL) process. ODS systems import raw data from production systems and store it in its original form.
 
In the ETL process, data is extracted from source systems, transformed, and loaded into its destination. In the ODS process, data is not transformed, but rather it's presented as is to business intelligence (BI) applications for analysis and operational decision-making.
 
In some cases, data from an ODS is replicated and then ETL is used to transport the replicated data to a data warehouse.
 
As operational data stores ingest data, new incoming data overwrites existing data.

Source : Techtarget
An operational data store typically pulls data from multiple transactional systems for operational reporting and business reporting. They combine various real-time data sources together in their original format in a central location.
 
ODS tools contain up-to-date versions of business data integrated from data sources, which is useful for BI tasks such as managing logistics, tracking orders and monitoring customer activity. ODSes are also useful for troubleshooting integration issues with data when they occur. They can compare recent versions of data to copies on other systems to determine if there is a continuity error.
 
ODSes also lend themselves to easy systems integration. Administrators can program rules into an ODS that synchronize data across multiple systems. When it changes on one system, it can trigger a change on another system.
 
ODSes can also facilitate a real-time data stream from data sources into the data pipeline.
 
For a real-world example, an ODS could pull batches of data from a billing application at weekly intervals, ingest transaction data in real time and integrate them into a relational database.
 
An ODS usually focuses on the operational requirements of a specific business process like customer service, for example. ODSes allow updates and propagate those updates back to the operational system that the data originated from.
During the extract, transform, and load (ETL) process, a staging area or landing zone is used as an intermediate storage area. It serves as temporary storage between the data sources and the data warehouse. Staging areas are primarily used to extract data quickly from the respective data sources, minimizing the impact on those sources. In the staging area, data from multiple sources is combined, transformed, validated, and cleaned before it is loaded into the data warehouse.
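A minimal sketch of the staging pattern, with all object names hypothetical: land the raw extract into a staging table first, then transform and load it into the warehouse table in a separate step.

```sql
-- Step 1: land raw source data quickly, with minimal transformation
INSERT INTO stg_orders (order_id, customer_id, order_date, amount_text)
SELECT order_id, customer_id, order_date, amount_text
FROM   src_orders;

-- Step 2: cleanse and transform from staging into the warehouse table
INSERT INTO dw_orders (order_id, customer_key, order_date, order_amount)
SELECT s.order_id,
       d.customer_key,
       s.order_date,
       CAST(s.amount_text AS DECIMAL(12, 2))
FROM   stg_orders   s
JOIN   dim_customer d ON d.customer_id = s.customer_id;
```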
Adding additional dimension tables to a Star Schema makes it a Snowflake Schema. In the Snowflake schema model, multiple hierarchies of dimension tables surround a central fact table.  Alternatively, a dimension table is called a snowflake if its low-cardinality attribute has been segmented into separate normalized tables. These normalized tables are then joined with referential constraints (foreign key constraints) to the original dimensions table. Snowflake schema complexity increases linearly with the level of hierarchy in the dimension tables.

Advantages :
 
* Data redundancy is reduced because the data is normalized and structured.
* The data is highly structured, so it requires less disk space.
* Updating or maintaining snowflaked tables is easy.
 
Disadvantages :
 
* Snowflake reduces the space consumed by dimension tables, but the space saved is usually insignificant compared with the entire data warehouse.
* Due to the number of tables added, you may need complex joins to perform a query, which will reduce query performance.
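A hedged DDL sketch of snowflaking: the product dimension's low-cardinality category attribute is moved into its own normalized table and linked back with a foreign key. The names are illustrative.

```sql
-- Normalized (snowflaked) category table
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

-- The product dimension references the category table instead of repeating its columns
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)
);
```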
Data extraction means extracting the data from multiple different sources using ETL tools.
 
There are two types of data extraction :
 
* Full Extraction : All the data from an operational or source system is extracted and loaded into the staging area.

* Partial Extraction : Sometimes, we get a notification from the source system that specific data has changed, and only that changed data is extracted. This is called a delta load (see the sketch below).

Source System Performance : The data extraction strategy should not affect the performance of the source system.
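A hedged sketch of the two extraction styles, assuming the source table keeps a last_updated timestamp and a hypothetical etl_control table stores the previous run's high-water mark.

```sql
-- Full extraction: pull every row from the source into staging
INSERT INTO stg_customer
SELECT * FROM src_customer;

-- Partial (delta) extraction: pull only the rows changed since the last run
INSERT INTO stg_customer
SELECT *
FROM   src_customer
WHERE  last_updated > (SELECT last_run_time
                       FROM   etl_control
                       WHERE  job_name = 'customer_load');
```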
A Data Pipeline refers to any set of processing elements that move data from one system to another. A data pipeline can be built for any kind of application that uses data to bring value. It can be used for integrating data across applications, building data-driven web products, and carrying out data mining activities. Data engineers build data pipelines.
Mapping : A mapping represents the flow of data from source to target.

Workflow : Workflow is a set of instructions which tells the Informatica server how to execute the tasks.

Mapplet : A mapplet creates or configures a set of transformations.

Worklet : It is an object that represents a set of tasks.

Session : Session is a set of instructions that describe how and when to move the data from sources to target.
ADO recordsets generally consist of columns and records. The DataReader Destination Adapter is very useful for populating them in a simple manner. It exposes the data flow and lets users impose various restrictions on the data, which is required in many cases.
Yes, it is possible, and users can perform this task quite simply. They generally have several options to accomplish it easily. The methods that can be helpful here are using a staging table, using a SQL command, using MSSQL, as well as using the cache.
To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables the Informatica Server to create multiple connections to various sources.
 
The types of partitions are : 
 
Round-Robin Partitioning :
 
* Informatica distributes the data evenly among all partitions.
* This partitioning is applicable when the number of rows to process in each partition is approximately the same.

Hash Partitioning :
 
* The Informatica Server applies a hash function to the partitioning keys in order to group data among partitions.
* It is used when we need to ensure that groups of rows with the same partitioning key are processed in the same partition.
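These are Informatica session-level options rather than SQL features, but the idea can be sketched in plain SQL: round-robin assignment spreads rows evenly regardless of content, while hash assignment keeps rows with the same key together. The stg_orders table is hypothetical, and MOD is written as in Oracle/PostgreSQL (SQL Server uses the % operator).

```sql
-- Round-robin style: assign rows to 4 partitions in turn, regardless of content
SELECT order_id,
       MOD(ROW_NUMBER() OVER (ORDER BY order_id), 4) AS partition_no
FROM   stg_orders;

-- Hash style: rows sharing the same customer_id always land in the same partition
SELECT order_id,
       customer_id,
       MOD(customer_id, 4) AS partition_no
FROM   stg_orders;
```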
Vendors of all stripes create a diverse range of software for large corporations. As a result, no single vendor is responsible for the entire process. To illustrate, imagine a telecom project in which one business handles invoicing, and another handles customer relationship management (CRM). 
 
For example, if the CRM needs billing information from the other company, it can now receive a data feed from that company. As a result, you can use the ETL method to load the data from the feed.
Schema objects are database objects such as tables, views, indexes, synonyms, database links, etc. These are the logical structures created by users that can either hold data or definitions and are arranged in models such as star or snowflake schema.
Data warehousing is a core component of business intelligence. By bringing different data sources under a single, unified library, analysts can work more efficiently, get more in-depth insights, and spot patterns across different datasets. 
 
Ultimately, it helps businesses be more competitive by improving their decision-making process.
Once data has been successfully transported into the data warehouse, analysts typically use third-party Business Intelligence (BI) applications such as Tableau to turn the raw data into graphs and charts, based on which business decisions can be made. 

Some of the latest ETL tools are equipped with their own data analysis mechanisms.
Snapshots are read-only copies of data from the master table. Before a change to the data warehouse is made, a snapshot is taken and stored locally for reference. This works to preserve records as they were before the change was made.
 
Each snapshot has three components :
 
* The time it was taken
* A key to identify the snapshot
* The data relating to the key
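A hedged sketch of storing a snapshot with the three components listed above: a key, the time it was taken, and the data tied to that key. The snapshot_customer table is hypothetical, and the CREATE TABLE ... AS SELECT form is used by Oracle and PostgreSQL (SQL Server uses SELECT ... INTO).

```sql
-- Capture a read-only copy of the customer dimension before a change is applied
CREATE TABLE snapshot_customer AS
SELECT customer_key,                         -- key identifying the snapshot row
       CURRENT_TIMESTAMP AS snapshot_time,   -- time the snapshot was taken
       customer_name,                        -- data relating to the key
       customer_segment
FROM   dim_customer;
```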
ETL testing is a demanding process that should be completed in the following order :
 
Define the business requirements : Liaise with the client to understand their reporting needs and define the scope of the project

Validate data sources :
Perform a data count check and ensure check keys are in place

Design ETL logic :
Design the mapping sheet, SQL script, and transformational codes 

Extract source data :
Identify any bugs during the extraction phase

Transform data :
Make sure data is transformed consistently

Load data :
Perform a record count check and verify the accuracy of the loaded data

Review process :
Verify the validity, layout, and export functionality of the summary report

File test report :
Share test results with the relevant stakeholders
Regression testing is used after developing functional repairs to the data warehouse. Its purpose is to check if said repairs have impaired other areas of the ETL process. 
 
Regression testing should always be performed after system modifications to see if they have introduced new defects.
A Surrogate key is a sequence-generated number with no business meaning, used just to identify the row uniquely. It is not visible to users or applications. It is also called a synthetic or artificial key.
A Surrogate key has sequence-generated numbers with no meaning. It is meant to identify the rows uniquely.
 
A Primary key is used to identify the rows uniquely. It is visible to users and can be changed as per requirement.
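A hedged DDL sketch of the distinction: customer_key is a system-generated surrogate key with no business meaning, while customer_number is the business key that users actually see. The identity syntax shown is one common form; some databases use sequences instead, and all names are illustrative.

```sql
CREATE TABLE dim_customer (
    customer_key    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key, hidden from users
    customer_number VARCHAR(20) NOT NULL UNIQUE,                       -- business/natural key visible to users
    customer_name   VARCHAR(100)
);
```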
| Connected Lookup | Unconnected Lookup |
| --- | --- |
| A connected lookup participates directly in the mapping data flow. | It is used when the lookup function is called from another transformation (such as an expression transformation) in the mapping. |
| Multiple values can be returned. | It returns only one output port. |
| It can be connected to other transformations and returns a value. | It cannot be connected to another transformation. |
| A static or dynamic cache can be used for a connected lookup. | An unconnected lookup uses only a static cache. |
| A connected lookup supports user-defined default values. | An unconnected lookup does not support user-defined default values. |
| In a connected lookup, multiple columns can be returned from the same row or inserted into the dynamic lookup cache. | An unconnected lookup designates one return port and returns one column from each row. |