ETL testing in a Business Intelligence application

What is ETL testing all about?
ETL stands for the extract, transform and load operations performed on the source data in order to build a strong base for an application that serves the decision-making teams in large enterprise systems.

Such is the importance of a Business Intelligence (BI) application that at times big enterprises end up developing a complete application just so a single user can access and analyze the reports available in it. Developing such an application involves huge effort and cost, not only because of the huge chunks of data and their testing but also because of the degree of complexity associated with developing the underlying logic. Reports, however, are not solely dependent on the ETL; they also depend on several other logically inter-related objects and on the access and authorization rules implemented at the cube level as well as the UI level. Being an expert at BI application development alone is not sufficient to deliver such an application: the end user's requirements need to be documented, a great deal of analysis of the nature of the data needs to be done to frame a volley of questions for the end user, and clarification documents need to be tracked right throughout the project development life cycle. This is because in these systems a bug that gets discovered at a very late stage generally carries a very high cost to fix.

Let's not get out of context, and instead try to understand the approach and priority of testing just the ETL logic within the application under test. For a Business Intelligence application built on Microsoft technologies we have Microsoft SQL Server in place, and the ETL can be developed using the SSIS (SQL Server Integration Services) feature available therein.
Using SSIS, packages can be developed that contain the ETL logic, based on which the source data is filtered as per the requirements traced out for the application.

Once the ETL packages are developed, testing them becomes an uphill task because of the huge amount of data in the source environment that gets extracted, transformed and loaded into the destination, or rather a sort of pre-destination (staging) environment. Just imagine verifying each and every record set from the source against the target environment. From common principle we might conclude that the data load is based on SQL queries which will in any case go fine, but the main target area is verifying the logic that decides which data should be ignored while loading into the target environment. There might be cases where we have duplicate records in the source and we cannot load both of them, simply because that would create a high level of discrepancy when we browse through the reports in the end product. Running the ETL packages loads the data from the source into the staging environment, and depending on the context and nature of the application the same data also gets loaded into the data mart, based on the transformation logic applied and on the nature of the load, which may be a full load (truncate and load) or an incremental load (only the additional data gets loaded into the environment).
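To make the two load patterns concrete, here is a minimal T-SQL sketch. The database, table and column names (Source.dbo.Employee, Staging.dbo.Employee, EmployeeID and so on) are hypothetical placeholders, not the actual package logic:

-- Full load (truncate and load): the staging table is emptied and completely reloaded.
TRUNCATE TABLE Staging.dbo.Employee;

INSERT INTO Staging.dbo.Employee (EmployeeID, EmployeeName, Department)
SELECT EmployeeID, EmployeeName, Department
FROM   Source.dbo.Employee;

-- Incremental load: only rows not yet present in staging are appended.
INSERT INTO Staging.dbo.Employee (EmployeeID, EmployeeName, Department)
SELECT s.EmployeeID, s.EmployeeName, s.Department
FROM   Source.dbo.Employee AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   Staging.dbo.Employee AS t
                   WHERE  t.EmployeeID = s.EmployeeID);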

When we have the uphill task of validating such huge chunks of data, we take the help of database automation tools that help us verify each and every record. There are many tools available in the industry, and at times we can build our own using Excel and macro programming, but what I prefer to do is utilize the database unit test project feature available within the Visual Studio IDE. Now is the time to build up the logic that will help us validate each record set in the source against the target. The general approach considered sufficient for this is a two-staged SQL query verification: first, the row counts of the source and target environments must match after applying the filters documented by the client, and second, the data itself must match. We genuinely find the option of an empty return value sufficient for both. All we need to do is apply the EXCEPT keyword between the two queries and add the corresponding test condition from within the added database unit test file.
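The whole approach rests on how EXCEPT behaves: it returns the rows of the first query that are not present in the second, so an empty result set means the two sides agree. A tiny self-contained illustration (the literal values are purely for demonstration):

-- The two row sets match, so EXCEPT returns an empty result set.
SELECT Val FROM (VALUES (1), (2)) AS SourceSide(Val)
EXCEPT
SELECT Val FROM (VALUES (1), (2)) AS TargetSide(Val);

-- The sets differ, so EXCEPT returns the mismatching row (3).
SELECT Val FROM (VALUES (1), (3)) AS SourceSide(Val)
EXCEPT
SELECT Val FROM (VALUES (1), (2)) AS TargetSide(Val);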
Just browse through the following screenshots of the working analogy; they should definitely make things easy for SSIS testers.

[Screenshot: the database unit test configuration dialog that appears when the database unit test is added to the project]
Just hit Cancel on the above dialog box. It is actually creating the database configuration file, which we can create directly ourselves by adding an app.config file. To do so, right-click the project, click Add New Item, and select an Application Configuration File as shown:

[Screenshot: the Add New Item dialog with Application Configuration File selected]
The content of this file will be something like the following:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="DatabaseUnitTesting"
             type="Microsoft.Data.Schema.UnitTesting.Configuration.DatabaseUnitTestingSection, Microsoft.Data.Schema.UnitTesting, Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
  </configSections>
  <DatabaseUnitTesting>
    <DataGeneration ClearDatabase="true" />
    <ExecutionContext Provider="System.Data.SqlClient"
                      ConnectionString="Data Source=DB_Server_Name;Initial Catalog=Master;Integrated Security=True;Pooling=False"
                      CommandTimeout="220" />
  </DatabaseUnitTesting>
</configuration>

Once this has been set up, we can go ahead with the validation logic, which we build as follows.
Here we have just renamed the method from DatabaseTest to a more relevant one, Employee_RowCount; similarly, we can click the plus icon to add another test method to the same database unit test class file (Employee) to validate the data content, as shown further below.

[Screenshot: the Employee_RowCount test query and test conditions in the database unit test designer]
So what have I done in this first level of verification, shown in the image above? It is simple: I have just written the query to fetch the row counts and utilized the EXCEPT keyword. Now, if the counts returned by the two queries match, then because of the EXCEPT in place we expect an "Empty ResultSet" as the return value of the complete query execution. Hence, in the lower section of the image you can see that I have removed the default test condition that got added and added a new condition, namely Empty ResultSet. We are now ready with one validation, the row count.
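Since the query itself lives in the screenshot, here is a sketch of what the row-count check looks like; the Source and Staging database names and the Employee table are assumptions standing in for the real objects:

-- Row-count check: EXCEPT returns nothing when both counts are equal,
-- which satisfies the Empty ResultSet test condition.
SELECT COUNT(*) AS RowCnt FROM Source.dbo.Employee
EXCEPT
SELECT COUNT(*) AS RowCnt FROM Staging.dbo.Employee;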

The second level of verification is for the data match. We add a new method via the plus icon at the top, rename it to Employee_DataCheck, provide the query for it, add the EXCEPT keyword between the two queries, and the rest is just as above, so that an empty result set is the expected return value of the query execution. This will look as under:
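Again, the actual query sits in the screenshot; a sketch under the same assumptions (hypothetical Source/Staging databases and Employee columns) would be:

-- Data check: every source row must have an identical row in the staging table.
-- An empty result from the EXCEPT means the two data sets match.
SELECT EmployeeID, EmployeeName, Department
FROM   Source.dbo.Employee
EXCEPT
SELECT EmployeeID, EmployeeName, Department
FROM   Staging.dbo.Employee;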
As an experience tip: at times we run into issues with data validation, especially for string-datatype attributes. Check whether a collation conflict between the two environments is creating the issue, and provide an explicit collation to sort it out.
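For instance, if the two databases use different collations, the string columns can be forced to a common collation in both halves of the query (Latin1_General_CI_AI here is only an example, and the table names remain hypothetical):

-- Forcing a common collation on string columns avoids
-- "Cannot resolve the collation conflict" errors in the EXCEPT comparison.
SELECT EmployeeName COLLATE Latin1_General_CI_AI AS EmployeeName
FROM   Source.dbo.Employee
EXCEPT
SELECT EmployeeName COLLATE Latin1_General_CI_AI
FROM   Staging.dbo.Employee;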

A third level of verification that adds quality to the testing of the SSIS packages and the ETL execution is verifying the schema of the database objects in the source against the destination environment. This adds value because it helps us verify whether the data will be loaded with the same precision or not.
A general query that fetches the schema details of any database table is as under:

-- Schema comparison: column name, data type, size and precision must match
-- between the source and destination environments.
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE   COLLATE Latin1_General_CI_AI,
       CHARACTER_MAXIMUM_LENGTH,
       NUMERIC_PRECISION,
       DATETIME_PRECISION
FROM   Source.INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'Employee'
EXCEPT
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE   COLLATE Latin1_General_CI_AI,
       CHARACTER_MAXIMUM_LENGTH,
       NUMERIC_PRECISION,
       DATETIME_PRECISION
FROM   Destination.INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'Employee'
  AND  COLUMN_NAME NOT IN ('Columns not to be validated, especially auto-generated ID columns')

The code above validates the column names across the two environments. Do keep in mind that certain columns get auto-generated, especially in the staging environment (a staging ID and so on), so exclude them from the verification as has been addressed in the NOT IN clause. The remaining columns are verified on the attributes we select above, namely data type, size (CHARACTER_MAXIMUM_LENGTH), numeric precision and datetime precision. We could easily have validated another aspect such as nullability, but the general approach in Business Intelligence (BI) is that when we have some column validating the identity of the complete record set we tend to ignore the IsNull aspect, and that is reflected in the code above.
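If nullability did need to be covered, the check would only require adding the IS_NULLABLE column (a standard INFORMATION_SCHEMA.COLUMNS attribute) to both halves of the query, for example:

-- Extending the schema check with nullability, if required.
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE   COLLATE Latin1_General_CI_AI,
       IS_NULLABLE COLLATE Latin1_General_CI_AI
FROM   Source.INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'Employee'
EXCEPT
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE   COLLATE Latin1_General_CI_AI,
       IS_NULLABLE COLLATE Latin1_General_CI_AI
FROM   Destination.INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'Employee'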


Thus we have successfully automated the ETL testing using three levels of verification, namely row count, data check and schema check. This gives the team a higher level of confidence in the quality of testing of the ETL packages, as the data verification has been done rigorously rather than in a sampling manner.

So we have explored ETL testing and ways to achieve a high degree of quality for it.
