ETL Testing Interview Questions
ETL stands for Extraction, Transformation, and Loading. It is an essential concept in Data Warehousing systems. There are three basic steps in the data integration process.
 
* Extraction refers to extracting the data from different data sources such as transactional systems or applications. 
* Transformation refers to applying conversion rules to the data so that it becomes suitable for analytical reporting. 
* Loading involves moving the data into the target system, i.e., the Data Warehouse.
ETL Testing refers to the process of validating, verifying, and qualifying data while avoiding duplication and preventing data loss. This process is important to ensure that data is delivered securely to the right destination and that the data quality is impeccable before its usage in BI (Business Intelligence) reports.
Some of the notable benefits of ETL Testing are : 
 
* Ensures data is transformed efficiently and quickly from one system to another.
* Data quality issues during ETL processes, such as duplicate data or data loss, can also be identified and prevented by ETL testing.
* Assures that the ETL process itself is running smoothly and is not hampered.
* Ensures that all data implemented is in line with client requirements and provides accurate output.
* Ensures that bulk data is moved to the new destination completely and securely.
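As a minimal sketch of the duplication and data-loss checks mentioned above, an ETL tester typically reconciles row counts between source and target and looks for duplicate business keys; the table and column names here (src_orders, dw_orders, order_id) are hypothetical.

```sql
-- Row-count reconciliation between source and loaded target (hypothetical tables)
SELECT 'source' AS side, COUNT(*) AS row_count FROM src_orders
UNION ALL
SELECT 'target' AS side, COUNT(*) AS row_count FROM dw_orders;

-- Duplicate check on the business key in the target
SELECT order_id, COUNT(*) AS dup_count
FROM   dw_orders
GROUP  BY order_id
HAVING COUNT(*) > 1;
```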
The three layers in the ETL are:
 
Staging Layer : Staging layer is used to store the data which is extracted from the different data source systems.

Data Integration Layer : The integration layer transforms the data from the staging layer and moves it into a database. In the database, the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of fact and dimension tables in a data warehouse system is called a schema.

Access Layer : Access layer is used by the end-users to retrieve the data for analytical reporting.
* To keep an eye on data that is being transferred from one system to another
* To keep track of the efficiency and speed of the process
* To achieve fast and accurate results
* Stage Tables
* Business Logic Transformation
* Loading the target table from the staging table once the transformation has been applied.

Responsibilities of an ETL tester are :
 
* Tests the ETL software thoroughly.
* Checks the test components of the ETL Data Warehouse.
* Executes data-driven tests in the backend.
* Creates the design and executes the test cases, test plans, test harness, etc.
* Identifies problems and suggests the best solutions.
* Approves the requirements and design specifications.
* Transfers the data from flat files.
* Writes SQL queries for the different test scenarios.
* Cognos Decision Stream
* Oracle Warehouse Builder
* Business Objects XI
* SAS Business Warehouse
* SAS Enterprise ETL Server
* Open-source CloverETL
* IBM WebSphere DataStage
* Informatica PowerCenter
 
Business Intelligence refers to gathering, storage, and analysis of data with the objective of converting raw data into actionable information which can be used to make better business decisions.
| ETL Tools | BI Tools |
| --- | --- |
| ETL tools are used to extract data from different data sources, transform the data, and load it into a data warehouse system. | BI tools are used to generate interactive and ad-hoc reports for end users and data visualizations for monthly, quarterly, and annual board meetings. |
| Commonly used ETL tools are Informatica, SAP BO Data Services, Microsoft SSIS, Oracle Data Integrator (ODI), the open-source CloverETL, etc. | Commonly used BI tools are SAP Lumira, IBM Cognos, the Microsoft BI platform, Tableau, Oracle Business Intelligence Enterprise Edition, etc. |
ETL Testing includes :
 
* Verifying whether the data is transformed accurately according to business requirements
* Verifying that the projected data is loaded into the data warehouse without any truncation or data loss
* Making sure that the ETL application reports any invalid data and replaces it with default values
* Making sure that the data loads within the expected time frame to improve scalability and performance
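One hedged way to express these checks in SQL is a set-difference query that flags source records missing or altered in the target, plus a check that invalid values were replaced with the agreed default. The object names (src_customer, dw_customer, ref_country, the 'UNK' default) are assumptions for illustration; Oracle uses MINUS where other databases use EXCEPT.

```sql
-- Records present in the source but missing or changed in the target
SELECT customer_id, customer_name, country_code
FROM   src_customer
EXCEPT                                   -- MINUS in Oracle
SELECT customer_id, customer_name, country_code
FROM   dw_customer;

-- Verify invalid country codes were replaced with the agreed default ('UNK')
SELECT COUNT(*) AS unexpected_rows
FROM   dw_customer d
WHERE  d.country_code IS NULL
   OR (d.country_code <> 'UNK'
       AND NOT EXISTS (SELECT 1 FROM ref_country r
                       WHERE  r.country_code = d.country_code));
```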
Input/Output Bug : This type of bug causes the application to take invalid values in place of valid ones.

Boundary Value Analysis Bug :
Bugs of this type appear at the minimum and maximum boundary values.

Calculation bugs :
These bugs are usually mathematical errors causing incorrect results.

ECP (Equivalence Class Partitioning) Bug :
A bug of this type produces results of an invalid type.

User Interface Bug :
 GUI bugs include issues with color selection, font style, navigation, spelling check, etc.

Hardware Bugs :
This type of bug prevents the device from responding to an application as expected.

Load Condition Bugs :
 A bug like this prevents multiple users from using the application and may reject user-accepted data.

Version Control Bugs :
 These bugs normally surface during regression testing, when no version details are provided.

Race Condition Bugs :
 This type of bug interferes with your system’s ability to function properly and causes it to crash or hang.

Help Source Bugs :
The help documentation will be incorrect due to this bug.
| ETL Testing | Manual Testing |
| --- | --- |
| It is an automated process, so no special technical knowledge is needed beyond understanding the software. | It is a manual process, so it requires technical expertise in SQL and shell scripting. |
| It is extremely fast and systematic, and it delivers excellent results. | It is time-consuming and highly prone to errors. |
| Databases and their record counts are central to ETL testing. | Manual testing focuses on the program's functionality. |
| Metadata is included and can easily be altered. | It lacks metadata, and changes require more effort. |
| It covers error handling, log summaries, and load progress, which eases the developer's and maintainer's workload. | From a maintenance perspective, it requires maximum effort. |
| It is very good at handling historical data. | As data increases, the processing time increases. |
Data warehousing is a broader concept than data mining. Data mining involves extracting hidden information from the data and interpreting it for future forecasting. In contrast, data warehousing includes operations such as analytical reporting to generate detailed and ad-hoc reports, and information processing to generate interactive dashboards and charts.
| OLTP | OLAP |
| --- | --- |
| OLTP stands for Online Transactional Processing. | OLAP stands for Online Analytical Processing. |
| OLTP is a relational database, and it is used to manage day-to-day transactions. | OLAP is a multidimensional system, and it is also called a data warehouse. |
* Before ETL tools, users had to write extensive code for data transformation and data loading.
* ETL tools make life more comfortable: one tool manages all the scenarios of transforming and loading the data.

Here are a few examples of where ETL is used:
 
Example : Data Warehousing
 
ETL is used in the data warehousing concept. Here, we need to fetch the data from multiple different systems and load it into the data warehouse database. The ETL concept is used here to extract the data from the source, transform the data, and load it into the target system.
 
Example : Data Migration
 
Data migration is a difficult task if we are writing everything by hand in PL/SQL. If we want to migrate the data in a simpler way, we use ETL tools.
 
Example : Mergers and Acquisitions
 
These days, many companies are being merged into larger MNCs. To move the data from one company to another, the ETL concept is needed.
A fact table without measures is known as a factless fact table. It can be used to count the occurrences of events. For example, it is used to record an event such as the employee headcount in a company.
 
The numeric data stored in the columns of a fact table is known as measures.
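A hedged sketch of a factless fact table: it holds only dimension keys and no numeric measures, so counting rows answers "how many events occurred". The names (fact_employee_attendance, dim_employee, dim_date) are illustrative.

```sql
-- Factless fact table: only foreign keys, no measures
CREATE TABLE fact_employee_attendance (
    employee_key INTEGER NOT NULL REFERENCES dim_employee (employee_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key)
);

-- The "measure" is derived simply by counting event rows
SELECT date_key, COUNT(*) AS employees_present
FROM   fact_employee_attendance
GROUP  BY date_key;
```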
The Lookup Transformation is useful for :
 
* Getting a related value from a table using a column value
* Updating a slowly changing dimension table
* Verifying whether records already exist in the table (a SQL analogue is sketched below)
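In plain SQL, a lookup behaves much like an outer join against a reference table: fetch a related value by key and check whether the record already exists. The names used here (stg_sales, dim_product) are hypothetical.

```sql
-- Look up the product name and flag rows missing from the dimension table
SELECT s.product_code,
       d.product_name,
       CASE WHEN d.product_code IS NULL THEN 'NOT FOUND' ELSE 'EXISTS' END AS lookup_status
FROM   stg_sales      s
LEFT JOIN dim_product d
       ON d.product_code = s.product_code;
```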
Cubes are data processing units comprised of fact tables and dimensions from the data warehouse. They provide a multi-dimensional analysis.
 
OLAP stands for ‘Online Analytical Processing,’ and OLAP cubes store voluminous data in a multi-dimensional form for reporting purposes. They consist of facts, called ‘measures,’ categorized by dimensions.
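A real OLAP cube is built in a dedicated engine, but standard SQL's GROUP BY CUBE (supported by Oracle, SQL Server, and PostgreSQL, among others) gives a feel for the idea: a measure aggregated across every combination of dimensions. The fact_sales table and its columns are assumed for illustration.

```sql
-- Aggregate the sales measure across all combinations of two dimensions
SELECT region,
       product_category,
       SUM(sales_amount) AS total_sales
FROM   fact_sales
GROUP  BY CUBE (region, product_category);
```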
There are needs and situations when data has to be deleted from the data warehouse, and deleting data in bulk is a daunting task. Purging is an approach that can delete multiple files or records at the same time while maintaining speed and efficiency. A lot of extra space can be freed up this way.
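A minimal sketch of purging, assuming a hypothetical fact_sales table with a load_date column and a fixed retention cutoff; in practice, dropping whole partitions is usually faster than row-by-row deletes.

```sql
-- Purge warehouse rows older than the retention cutoff date
DELETE FROM fact_sales
WHERE  load_date < DATE '2018-01-01';
```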
When it comes to updating the master table, the dynamic cache is the one to opt for. Users are also free to use it for changing dimensions. On the other side, users can simply manage flat files through the static cache.

It is possible to deploy both the Dynamic and the Static Cache at the same time depending on the task and the overall complexity of the final outcome.
| Power Mart | Power Center |
| --- | --- |
| It only processes small amounts of data and is considered good when the processing requirements are low. | It is considered good when the amount of data to be processed is high, as it processes bulk data in a short period of time. |
| ERP sources are not supported. | ERP sources such as SAP, PeopleSoft, etc. are supported. |
| It only supports local repositories. | Local and global repositories are supported. |
| There is no facility for turning a local repository into a global repository. | It is capable of converting local repositories into global ones. |
| Session partitions are not supported. | It supports session partitioning to improve the performance of ETL transactions. |
A Data Mart is a simple form of a Data Warehouse, and it is focused on a single functional area. It gets data from only a few sources.
 
For example : In an organization, data marts may exist for marketing, finance, human resources, and other individual departments, each storing the data related to its specific functions.
Here, we take an example to describe how a dimension table is distinguished from a fact table.
 
Suppose a company sells its products to its customers. Every sale is a fact that occurs within the company, and the fact table is used to record these facts. Each fact table stores the keys that join it to the dimension tables, along with the measures/facts.
 
Example: Fact Units

| Cust_ID | Prod_ID | Time_ID | No. of units sold |
| --- | --- | --- | --- |
| 101 | 24 | 1 | 25 |
| 102 | 25 | 2 | 15 |
| 103 | 26 | 3 | 30 |

A dimension table stores the attributes, or dimensions, that describe the objects in a fact table. Dimension tables are a set of companion tables to a fact table.

| Cust_ID | Cust_Name | Gender |
| --- | --- | --- |
| 101 | Sana | F |
| 102 | Jass | M |
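To tie the example tables together, a typical star-schema query joins the fact table to the customer dimension on the shared key and aggregates the measure. The table names (fact_units, dim_customer) and the units_sold column are assumptions standing in for the columns shown above.

```sql
-- Units sold per customer, joining the fact table to the customer dimension
SELECT d.Cust_Name,
       SUM(f.units_sold) AS total_units
FROM   fact_units   f
JOIN   dim_customer d ON d.Cust_ID = f.Cust_ID
GROUP  BY d.Cust_Name;
```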
Both data mining and data warehousing are powerful data analysis and storage techniques.  
 
Data warehousing : To generate meaningful business insights, it involves compiling and organizing data from various sources into a common database. In a data warehouse, data are cleaned, integrated, and consolidated to support management decision-making processes. Subject-oriented, integrated, time-variant, and nonvolatile data can be stored within a data warehouse.
 
Data mining : Also referred to as KDD (Knowledge Discovery in Databases), it involves searching for and identifying hidden, relevant, and potentially valuable patterns in large data sets. An important goal of data mining is to discover previously unknown relationships among the data. Through data mining, insights can be extracted that can be used for things such as marketing, fraud detection, and scientific discoveries.
| Data Warehousing | Data Mining |
| --- | --- |
| It involves gathering all relevant data for analytics in one place. | It extracts data from large datasets. |
| Data extraction and storage help facilitate easier reporting. | It identifies patterns by using pattern recognition techniques. |
| Engineers are solely responsible for data warehousing, and data is stored periodically. | Data mining is carried out by business users in conjunction with engineers, and data is analyzed regularly. |
| In addition to making data mining easier and more convenient, it helps sort and upload important data to databases. | It makes analyzing information and data easier. |
| It can accumulate a large amount of irrelevant and unnecessary data; loss and erasure of data can also be problematic. | Done incorrectly, it can lead to data breaches and hacking, since data mining isn't always 100% accurate. |
| Data mining cannot take place without this process, since it compiles and organizes data into a common database. | Because the process requires compiled data, it always takes place after data warehousing. |
| Data warehouses simplify every type of business data. | Comparatively, data mining techniques are inexpensive. |
ETL pipelines are the mechanisms to perform ETL processes. This involves a series of processes or activities required for transferring data from one or more sources into the data warehouse for analysis, reporting and data synchronization. It is important to move, consolidate, and alter source data from multiple systems to match the parameters and capabilities of the destination database in order to provide valuable insights. 
Benefits are : 
 
* They reduce errors, bottlenecks, and latency, ensuring the smooth flow of information between systems.
* With ETL pipelines, businesses are able to achieve competitive advantage.
* The ETL pipeline can centralize and standardize data, allowing analysts and decision-makers to easily access and use it.
* It facilitates data migrations from legacy systems to new repositories.
* A Data Warehouse is a database that is separate from the operational database and stores historical data.
* The Data Warehouse database contains analytical as well as transactional data.
* A Data Warehouse is used for data analysis and reporting purposes.
* A Data Warehouse helps higher management take strategic and tactical decisions using historical or current data.
* A Data Warehouse helps business users analyze current trends to run the business.
ETL tools are generally used in data migration projects. If an organization was previously managing its data in Oracle 10g and now wants to use a SQL Server cloud database, the data needs to be moved from source to target. ETL tools are very useful for this kind of movement. Writing the ETL code by hand in PL/SQL or T-SQL is a very time-consuming process; ETL tools keep it simple. So the ETL process is useful in data migration projects.
Facts can be divided into three basic types, as follows : 
 
Additive : Facts that are fully additive are the most flexible and useful. We can sum up additive facts across any dimension associated with the fact table.

Semi-additive : We can sum up semi-additive facts across some dimensions associated with the fact table, but not all.

Non-Additive : Non-additive facts cannot be summed up across any dimension. A ratio is an example of a non-additive fact.
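A hedged illustration of the difference: an additive measure such as units_sold can be summed across any dimension, while a non-additive ratio such as a profit margin must be recomputed from its additive components rather than summed. Table and column names are assumptions.

```sql
-- Additive fact: safe to sum across any dimension
SELECT prod_id, SUM(units_sold) AS total_units
FROM   fact_units
GROUP  BY prod_id;

-- Non-additive fact: do not sum the stored ratio; recompute it from additive parts
SELECT prod_id,
       SUM(profit_amount) / SUM(sales_amount) AS profit_margin
FROM   fact_sales
GROUP  BY prod_id;
```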
An Operational Data Store (ODS) is a type of database that's often used as an interim logical area for a data warehouse. ODSes are designed to integrate data from multiple sources for lightweight data processing activities such as operational reporting and real-time analysis.
 
While in the ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules. An ODS can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried out while business operations are occurring. This is where most of the data used in current operations is housed before it's transferred to the data warehouse for longer-term storage or archiving.
 
ODSes are commonly used in online transaction processing applications, which involve processing transactional data. These applications use the quick, lightweight processing that ODS tools provide. ODS systems enable more comprehensive trend analysis reporting and processing across many different systems simultaneously.
An operational data store usually stores and processes data in real time. An ODS is connected to multiple data sources and pulls data into a central location.

The way operational data stores work is comparable to the extract, transform and load (ETL) process. ODS systems import raw data from production systems and store it in its original form.
 
In the ETL process, data is extracted from source systems, transformed, and loaded into its destination. In the ODS process, data is not transformed, but rather it's presented as is to business intelligence (BI) applications for analysis and operational decision-making.
 
In some cases, data from an ODS is replicated and then ETL is used to transport the replicated data to a data warehouse.
 
As operational data stores ingest data, new incoming data overwrites existing data.

Source : Techtarget
An operational data store typically pulls data from multiple transactional systems for operational reporting and business reporting. They combine various real-time data sources together in their original format in a central location.
 
ODS tools contain up-to-date versions of business data integrated from data sources, which is useful for BI tasks such as managing logistics, tracking orders and monitoring customer activity. ODSes are also useful for troubleshooting integration issues with data when they occur. They can compare recent versions of data to copies on other systems to determine if there is a continuity error.
 
ODSes also lend themselves to easy systems integration. Administrators can program rules into an ODS that synchronize data across multiple systems. When it changes on one system, it can trigger a change on another system.
 
ODSes can also facilitate a real-time data stream from data sources into the data pipeline.
 
For a real-world example, an ODS could pull batches of data from a billing application at weekly intervals, ingest transaction data in real time and integrate them into a relational database.
 
An ODS usually focuses on the operational requirements of a specific business process like customer service, for example. ODSes allow updates and propagate those updates back to the operational system that the data originated from.
During the extract, transform, and load (ETL) process, a staging area or landing zone is used as an intermediate storage area. It serves as temporary storage between the data sources and the data warehouse. Staging areas are primarily used to extract data quickly from the respective data sources, minimizing the impact on those sources. In the staging area, data from multiple sources is combined, transformed, validated, and cleaned before it is loaded into the data warehouse.
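A minimal sketch of the staging pattern, with all object names hypothetical: land the raw extract into a staging table first, then transform and load it into the warehouse table in a separate step.

```sql
-- Step 1: land raw source data quickly, with minimal transformation
INSERT INTO stg_orders (order_id, customer_id, order_date, amount_text)
SELECT order_id, customer_id, order_date, amount_text
FROM   src_orders;

-- Step 2: cleanse and transform from staging into the warehouse table
INSERT INTO dw_orders (order_id, customer_key, order_date, order_amount)
SELECT s.order_id,
       d.customer_key,
       s.order_date,
       CAST(s.amount_text AS DECIMAL(12, 2))
FROM   stg_orders   s
JOIN   dim_customer d ON d.customer_id = s.customer_id;
```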
Adding additional dimension tables to a Star Schema makes it a Snowflake Schema. In the Snowflake schema model, multiple hierarchies of dimension tables surround a central fact table.  Alternatively, a dimension table is called a snowflake if its low-cardinality attribute has been segmented into separate normalized tables. These normalized tables are then joined with referential constraints (foreign key constraints) to the original dimensions table. Snowflake schema complexity increases linearly with the level of hierarchy in the dimension tables.

Advantages :
 
* Data redundancy is reduced because the data is normalized and structured.
* The data is highly structured, so it requires less disk space.
* Updating or maintaining snowflaked tables is easy.
 
Disadvantages :
 
* Snowflake reduces the space consumed by dimension tables, but the space saved is usually insignificant compared with the entire data warehouse.
* Due to the number of tables added, you may need complex joins to perform a query, which will reduce query performance.
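A hedged DDL sketch of snowflaking: the product dimension's low-cardinality category attribute is moved into its own normalized table and linked back with a foreign key. The names are illustrative.

```sql
-- Normalized (snowflaked) category table
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

-- The product dimension references the category table instead of repeating its columns
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)
);
```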
Data extraction means extracting the data from multiple different sources using ETL tools.
 
There are two types of data extraction :
 
* Full Extraction : All the data from an operational or source system is extracted and loaded into the staging area.

* Partial Extraction : Sometimes, we get a notification from the source system that specific data has changed, and only that changed data is extracted. This is called a delta load (see the sketch below).

Source System Performance : The data extraction strategy should not affect the performance of the source system.
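A hedged sketch of the two extraction styles, assuming the source table keeps a last_updated timestamp and a hypothetical etl_control table stores the previous run's high-water mark.

```sql
-- Full extraction: pull every row from the source into staging
INSERT INTO stg_customer
SELECT * FROM src_customer;

-- Partial (delta) extraction: pull only the rows changed since the last run
INSERT INTO stg_customer
SELECT *
FROM   src_customer
WHERE  last_updated > (SELECT last_run_time
                       FROM   etl_control
                       WHERE  job_name = 'customer_load');
```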
A Data Pipeline refers to any set of processing elements that move data from one system to another. A data pipeline can be built for any kind of application that uses data to bring value. It can be used for integrating data across applications, building data-driven web products, and carrying out data mining activities. Data engineers build data pipelines.
Mapping : A mapping represents the flow of data from source to target.

Workflow : Workflow is a set of instructions which tells the Informatica server how to execute the tasks.

Mapplet : A mapplet creates or configures a set of transformations.

Worklet : It is an object that represents a set of tasks.

Session : Session is a set of instructions that describe how and when to move the data from sources to target.
ADO recordsets generally consist of columns and records. The DataReader Destination Adapter is very useful for populating them in a simple manner. It exposes the data flow and lets users impose various restrictions on the data, which is required in many cases.
Yes, it is possible, and users can perform this task quite simply. They generally have several options to accomplish it easily. The methods that can be helpful here are using a staging table, using a SQL command, using MSSQL, as well as using the cache.
To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables the Informatica Server to create multiple connections to various sources.
 
The types of partitions are : 
 
Round-Robin Partitioning :
 
* Informatica distributes the data evenly among all partitions.
* This partitioning is applicable when the number of rows to process in each partition is approximately the same.

Hash Partitioning :
 
* The Informatica Server applies a hash function to the partitioning keys in order to group data among partitions.
* It is used when we need to ensure that groups of rows with the same partitioning key are processed in the same partition.
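These are Informatica session-level options rather than SQL features, but the idea can be sketched in plain SQL: round-robin assignment spreads rows evenly regardless of content, while hash assignment keeps rows with the same key together. The stg_orders table is hypothetical, and MOD is written as in Oracle/PostgreSQL (SQL Server uses the % operator).

```sql
-- Round-robin style: assign rows to 4 partitions in turn, regardless of content
SELECT order_id,
       MOD(ROW_NUMBER() OVER (ORDER BY order_id), 4) AS partition_no
FROM   stg_orders;

-- Hash style: rows sharing the same customer_id always land in the same partition
SELECT order_id,
       customer_id,
       MOD(customer_id, 4) AS partition_no
FROM   stg_orders;
```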
Vendors of all stripes create a diverse range of software for large corporations. As a result, no single vendor is responsible for the entire process. To illustrate, imagine a telecom project in which one business handles invoicing, and another handles customer relationship management (CRM). 
 
For example, if the CRM needs billing information from the other company, it can now receive a data feed from that company. As a result, you can use the ETL method to load the data from the feed.
Schema objects are database objects such as tables, views, indexes, synonyms, database links, etc. These are the logical structures created by users that can either hold data or definitions and are arranged in models such as star or snowflake schema.
Data warehousing is a core component of business intelligence. By bringing different data sources under a single, unified library, analysts can work more efficiently, get more in-depth insights, and spot patterns across different datasets. 
 
Ultimately, it helps businesses be more competitive by improving their decision-making process.
Once data has been successfully transported into the data warehouse, analysts typically use third-party Business Intelligence (BI) applications such as Tableau to turn the raw data into graphs and charts, based on which business decisions can be made. 

Some of the latest ETL tools are equipped with their own data analysis mechanisms.
Snapshots are read-only copies of data from the master table. Before a change to the data warehouse is made, a snapshot is taken and stored locally for reference. This works to preserve records as they were before the change was made.
 
Each snapshot has three components :
 
* The time it was taken
* A key to identify the snapshot
* The data relating to the key
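A hedged sketch of storing a snapshot with the three components listed above: a key, the time it was taken, and the data tied to that key. The snapshot_customer table is hypothetical, and the CREATE TABLE ... AS SELECT form is used by Oracle and PostgreSQL (SQL Server uses SELECT ... INTO).

```sql
-- Capture a read-only copy of the customer dimension before a change is applied
CREATE TABLE snapshot_customer AS
SELECT customer_key,                         -- key identifying the snapshot row
       CURRENT_TIMESTAMP AS snapshot_time,   -- time the snapshot was taken
       customer_name,                        -- data relating to the key
       customer_segment
FROM   dim_customer;
```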
ETL testing is a demanding process that should be completed in the following order :
 
Define the business requirements : Liaise with the client to understand their reporting needs and define the scope of the project

Validate data sources :
Perform a data count check and ensure check keys are in place

Design ETL logic :
Design the mapping sheet, SQL script, and transformational codes 

Extract source data :
Identify any bugs during the extraction phase

Transform data :
Make sure data is transformed consistently

Load data :
Perform a record count check and verify the accuracy of the loaded data

Review process :
Verify the validity, layout, and export functionality of the summary report

File test report :
Share test results with the relevant stakeholders
Regression testing is used after developing functional repairs to the data warehouse. Its purpose is to check if said repairs have impaired other areas of the ETL process. 
 
Regression testing should always be performed after system modifications to see if they have introduced new defects.
A Surrogate key is a sequence-generated number with no business meaning, used just to identify the row uniquely. It is not visible to users or applications. It is also called a synthetic or artificial key.
A Surrogate key has sequence-generated numbers with no meaning. It is meant to identify the rows uniquely.
 
A Primary key is used to identify the rows uniquely. It is visible to users and can be changed as per requirement.
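A hedged DDL sketch of the distinction: customer_key is a system-generated surrogate key with no business meaning, while customer_number is the business key that users actually see. The identity syntax shown is one common form; some databases use sequences instead, and all names are illustrative.

```sql
CREATE TABLE dim_customer (
    customer_key    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key, hidden from users
    customer_number VARCHAR(20) NOT NULL UNIQUE,                       -- business/natural key visible to users
    customer_name   VARCHAR(100)
);
```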
| Connected Lookup | Unconnected Lookup |
| --- | --- |
| A connected lookup participates directly in the mapping data flow. | It is used when the lookup function is called from another transformation (such as an expression transformation) in the mapping. |
| Multiple values can be returned. | It returns only one output port. |
| It can be connected to other transformations and returns a value. | It cannot be connected to another transformation. |
| A static or dynamic cache can be used for a connected lookup. | An unconnected lookup uses only a static cache. |
| A connected lookup supports user-defined default values. | An unconnected lookup does not support user-defined default values. |
| In a connected lookup, multiple columns can be returned from the same row or inserted into the dynamic lookup cache. | An unconnected lookup designates one return port and returns one column from each row. |