Articles On Testing

Welcome to http://www.articlesontesting.com !!!


ETL testing in a Business Intelligence application

What is ETL testing all about?
ETL is the Extract, Transform and Load operation performed on the source data in order to build a strong foundation for developing an application that serves the decision-making teams of large enterprise systems.

Such is the importance of a Business Intelligence (BI) application that at times a big enterprise ends up developing a complete application for just a single user to access and analyze the reports available in it. Developing it involves huge effort and cost, not only because of the huge chunks of data and their testing but also because of the complexity of the logic needed to achieve it. Reports, however, do not depend solely on the ETL but on several other logically inter-related objects, and on the access and authorization rules implemented at the cube level as well as the UI level. Expertise in BI application development alone is not sufficient to build such an application: the end user's requirements need to be documented, a great deal of analysis of the nature of the data needs to be done to frame a volley of questions for the end user, and clarification documents need to be tracked right through the project development life cycle. This is because in these systems a bug discovered at a very late stage generally carries a very high cost of fixing.

Let's not get out of context, and try to understand the approach to, and priority of, testing just the ETL logic within the application under test. For a Business Intelligence application developed on Microsoft technologies we have Microsoft SQL Server in place, and the ETL can be developed with SSIS, the SQL Server Integration Services feature available therein.
Using SSIS, packages can be developed that hold the ETL logic, based on which the source data is filtered as per the requirements traced out for the application.

Once the ETL packages are developed, testing them becomes an uphill task due to the huge amount of data in the source environment that gets extracted, transformed and loaded into the destination, or rather a sort of pre-destination, environment. Just imagine verifying each and every record set from the source against the target environment. Common sense might suggest that the data load is based on SQL queries which will in any case go fine, but the main target area is verifying the logic that decides which data must be ignored when loading the target environment. There might be cases where we have duplicate records in the source, and we may not be able to load both records simply because that would create a high level of discrepancy when we browse the reports in the end product.

Running the ETL packages loads the data from the source into the staging environment and, depending on the context and nature of the application, the same gets loaded into the data mart as well, based on the transformation logic applied and on the nature of the load, which may be a full load (truncate and load) or an incremental load (only the additional data gets loaded into the environment).
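As a purely illustrative sketch of the two load types (the database, table and column names here, such as Source.dbo.Employee and Staging.dbo.Employee, are hypothetical), the difference could be expressed in plain T-SQL as under:

-- Full load: truncate the target and reload it completely from the source.
TRUNCATE TABLE Staging.dbo.Employee;
INSERT INTO Staging.dbo.Employee (EmployeeID, EmployeeName, Department)
SELECT EmployeeID, EmployeeName, Department
FROM Source.dbo.Employee;

-- Incremental load: append only the records not already present in the target.
-- The change-detection rule is package-specific; a business key lookup is
-- shown here, though a modified-date watermark is equally common.
INSERT INTO Staging.dbo.Employee (EmployeeID, EmployeeName, Department)
SELECT s.EmployeeID, s.EmployeeName, s.Department
FROM Source.dbo.Employee AS s
WHERE NOT EXISTS (SELECT 1
                  FROM Staging.dbo.Employee AS t
                  WHERE t.EmployeeID = s.EmployeeID);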

When we have the uphill task of validating such huge chunks of data, we take the help of database automation tools that help us verify each and every record. There are many tools available in the industry, and at times we can create tools ourselves using Excel and macro programming, but what I prefer is to utilize the database unit test project feature available within the Visual Studio IDE. Now it is time to build up the logic that will help us validate each record set in the source against the target. The general approach considered sufficient for this is a two-stage verification with SQL queries: first, the counts of the source and target data environments must match once the filters documented by the client are applied; second, the data itself must match. We genuinely find an empty return value a sufficient check for both. All we need to apply is the Except keyword between the two queries, plus the test condition added from within the DB unit test file.
Just browse through some screenshots of the working analogy for doing the same; they should make things easy for SSIS testers.


 



Just hit Cancel on the above dialog box. It is actually for creating the database configuration file, which we can create directly by adding an app.config file as shown below. To do so, right-click the project, click Add New Item, and select Application Configuration File:




The content of the same will be something as under:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="DatabaseUnitTesting"
             type="Microsoft.Data.Schema.UnitTesting.Configuration.DatabaseUnitTestingSection, Microsoft.Data.Schema.UnitTesting, Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
  </configSections>
  <DatabaseUnitTesting>
    <DataGeneration ClearDatabase="true" />
    <ExecutionContext Provider="System.Data.SqlClient"
                      ConnectionString="Data Source=DB_Server_Name;Initial Catalog=Master;Integrated Security=True;Pooling=False"
                      CommandTimeout="220" />
  </DatabaseUnitTesting>
</configuration>

Once this has been set up, we can go ahead with the validation logic itself, which we do as under.
Here we have just renamed the method from DatabaseTest to the more relevant Employee_RowCount. Similarly, by clicking the plus icon, we add another test method to the same DB unit class file (that is, Employee) to validate the data content, as under.




So what is it that I have done in this first level of verification shown in the image above? It is simple: I have just written the queries to fetch the row counts and utilized the Except keyword. Now, if the counts of the two queries match, then because of the Except in place we expect an "Empty ResultSet" as the return value of the complete query execution. Hence, in the lower section of the image, you can see I have removed the default condition that got added and added a new condition, namely Empty ResultSet. We are thus ready with one validation, the row count.
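In case the screenshot does not come across, a minimal sketch of the sort of query that goes into the Employee_RowCount method might be as under (Source and Destination are placeholder database names):

-- Row count check: Except returns the rows of the first result set that are
-- absent from the second, so matching counts yield an empty resultset.
SELECT COUNT(*) AS RecordCount FROM Source.dbo.Employee
EXCEPT
SELECT COUNT(*) AS RecordCount FROM Destination.dbo.Employee;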

The second level of verification is the data match. We add a new method via the upper plus icon, rename it Employee_DataCheck, provide the two queries with the Except keyword in between, and the rest is as was done above to get the empty resultset as the return value of the query execution. A sketch of the query follows below.
As an experience tip: at times we have issues with data validation, especially for string datatype attributes. Check for a collation conflict, which creates such issues, and provide an explicit collation to sort them out.
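Under the same assumptions as the row count sketch above (placeholder database, table and column names), the Employee_DataCheck query might look as under, with an explicit collation on the string column to pre-empt the conflict just mentioned:

-- Data check: every source row must also exist in the destination.
-- The explicit COLLATE clause avoids the collation conflict error that can
-- occur when the two databases use different default collations.
SELECT EmployeeID,
       EmployeeName COLLATE Latin1_General_CI_AI AS EmployeeName
FROM Source.dbo.Employee
EXCEPT
SELECT EmployeeID,
       EmployeeName COLLATE Latin1_General_CI_AI AS EmployeeName
FROM Destination.dbo.Employee;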

A third level of verification, which adds quality to the testing of the SSIS packages and the ETL execution, is verifying the schema of the database objects in the source against the destination environment. This adds quality because it helps us verify whether the data will be loaded with the same precision values or not.
The general query that fetches the schema details of any database table is as under:

SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE COLLATE Latin1_General_CI_AI,
       CHARACTER_MAXIMUM_LENGTH,
       NUMERIC_PRECISION,
       DATETIME_PRECISION
FROM Source.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Employee'
EXCEPT
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE COLLATE Latin1_General_CI_AI,
       CHARACTER_MAXIMUM_LENGTH,
       NUMERIC_PRECISION,
       DATETIME_PRECISION
FROM Destination.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Employee'
  AND COLUMN_NAME NOT IN ('column not to be validated, e.g. an auto-generated ID')

The code above validates the column names across the two environments. Do keep in mind that certain columns get auto-generated, especially in the staging environment (a staging ID and so on), so exclude them from verification as has been done above. The remaining columns get verified on the attributes we provide, namely data type, size (that is, character maximum length), numeric precision and datetime precision. We could easily have validated another aspect, such as IsNull, but the general approach in Business Intelligence (BI) is that when we have some column validating the ID of the complete record set, we tend to ignore the IsNull feature, and the same is reflected in the code above.
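Should nullability matter for a given project, the extension is simply the IS_NULLABLE column that INFORMATION_SCHEMA.COLUMNS already exposes; a sketch of the variant is as under:

-- Schema check extended with nullability; the Except pattern stays the same.
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE COLLATE Latin1_General_CI_AI,
       IS_NULLABLE COLLATE Latin1_General_CI_AI
FROM Source.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Employee'
EXCEPT
SELECT COLUMN_NAME COLLATE Latin1_General_CI_AI,
       DATA_TYPE COLLATE Latin1_General_CI_AI,
       IS_NULLABLE COLLATE Latin1_General_CI_AI
FROM Destination.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Employee';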


Thus we have successfully automated the ETL testing using three levels of verification, namely row count, data check and schema check. This gives the team a higher level of confidence in the quality of the testing of the ETL packages, as data verification has been done at a rigorous level rather than in a sampling manner.

So we have explored ETL testing and ways to achieve a high degree of quality for the same.

What is Business Intelligence Testing?


What is a BI software application all about?
What is BI testing all about?
How does a BI application ease the business process in the current world scenario?
What type of industry enterprise looks forward to developing a BI application?
What are the challenges one faces while testing a BI application?
How much of an impact can a BI application make for an organization within its peer group?

So many questions strike at the back of one's mind when we approach a Business Intelligence domain software application from a testing perspective.

What is a BI software application all about?
 
Let me just list out the terms of BI testing:
source data, target data, data warehouse, data mart, reports, ETL (Extract Transform Load), ETL logic (incremental or full), integration services, reporting services, analysis services, database engine, cubes, measures, measure groups, dimensions, facts, reports testing (drill down, drill through), build deployment. I mean, there are just so many of them.

What does the chain look like? Maybe something like this:
Source  -> Staging -> Datawarehouse -> Cube -> Reports

Source: multiple data sources such as MS SQL Server, Oracle, MySQL, SAP, Teradata, DB2 and Sybase. It can be any combination of the listed as well as unlisted ones, even something as simple as flat text files. These multiple data sources can be the input to an application that helps the end user take future strategic decisions.

Staging: an intermediary state of the source data that acts as the input to the data warehouse, into which data is loaded from the source by the ETL logic using integration services. It is more or less the source data in its original form, with those tables discarded that are not needed for the report generation within the application.

Datawarehouse: the final entity developed within the database engine, created in the sequence mentioned above. It is here that the final objects get created, and based on these objects the cubes are created for analysis purposes. The object types into which the complete data is organized here are called dimensions and facts. They are just like simple tables, but the attributes within each dimension and fact have specific associations among themselves so as to represent realistic facts and associated information as per the business requirement.
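As a purely illustrative sketch (all table and column names hypothetical), a dimension and a fact in the warehouse might be shaped as under:

-- A dimension holds the descriptive attributes of a business entity.
CREATE TABLE dbo.DimEmployee (
    EmployeeKey  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    EmployeeID   INT NOT NULL,                  -- business key from the source
    EmployeeName NVARCHAR(100) NOT NULL,
    Department   NVARCHAR(50) NOT NULL
);

-- A fact holds the measurable events, keyed to the dimensions around it.
CREATE TABLE dbo.FactSales (
    SalesKey    INT IDENTITY(1,1) PRIMARY KEY,
    EmployeeKey INT NOT NULL REFERENCES dbo.DimEmployee (EmployeeKey),
    DateKey     INT NOT NULL,                   -- key into a date dimension
    SalesAmount DECIMAL(18,2) NOT NULL          -- the measure the cube aggregates
);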

Cube: the Analysis Services object within the complete BI application development phase. As the name suggests, it is not just the two-dimensional tables that hold relevance in a typical RDBMS; it is indeed a multi-dimensional data modeling technique wherein we analyze a data set in more than just two dimensions. The facts associated with an application are analyzed through their association with various parameters, which in BI terms are called measures or measure groups; a measure group is a combination of more than one related measure. Thus we get greater insight into how a specific aspect of business decision making is impacted by variations in the various parameters.

Reports: reports are nothing but the cube data represented on a user-friendly interface, with the option to parameterise the reports as per the business needs. In simple terms, it is the end product that gets developed; the data we look for in the cubes is available for viewing in the reports. We can drill down and drill through them based on the scenario we need. For example, we can by default see in a report how much in sales materialized for a specific fiscal year, then drill down to the quarterly basis, the monthly basis, the weekly basis, and finally the daily basis. Similarly, drill-through is applied to the reports so that the data can be visualized as per our needs. Authorization and authentication also have their role to play in reporting services, but the authorization on the cubes overrides the privileges granted on the reports.

I hope by now we have gained some insight into what these applications resemble and how they differ from any other typical web application. It is more of an application that delves into data and into data modeling and management techniques.

What is BI testing all about?
BI testing is pretty different from testing any other application, in more ways than we can list out. Only domain expertise can make the tester's life easy in the otherwise unknown terrain that such an application presents. There is no way one can wait until the reports are developed so that we have a UI to start testing with! And that is what makes it even more interesting and challenging.

So what do we do?
When do we do it?
How do we do it?

Do we wait until the last phase of the development cycle, when the reports get deployed onto a web application, before we start testing? Or do we have scope for starting our testing activities well within the initial development cycle, or maybe even before the full-flow development activities start?
There is scope for automation as well, loads of scope for performance testing, and of course manual testing has to be the core of any testing activity.
Things really start once the Integration Services packages get into shape, wherein the data in its raw format gets extracted, transformed and loaded into the new environment; that is what we call, in short, ETL execution. Test case authoring can start once the ETL packages get developed, because that is where a tester's activity gets eased out. This is also the area of high automation ROI, wherein the ETL-related test cases can be automated using the various database automation techniques.
Once Integration Services are in place, we target the Analysis Services section, wherein cube testing starts; at its base exist the dimensions, measures and measure groups. These cubes have the raw data organized in such an effectively designed and inter-related manner that real-time data analysis is possible. These cubes form the basis of the reporting service, which is in turn the base of the UI testing.
UI testing, or Reporting Services testing, is more about the filters, and the various combinations of filters, that can be provided for a set of inter-related measures and measure groups, for the various facts, as per the dimensions supplied. Within the same lies the validation of the drill-down and drill-through capabilities of the reports, by right clicks and left clicks on their graphs and various axes. The terms that matter during the data validation for the complete application are the dimension and fact tables; these are the actual tables that form the basis of the cube data.

As far as data quality is concerned, the major target area is the Integration Services packages, that is, the ETL packages and the logic with which the data gets extracted, transformed and loaded into the target systems. It is here that we design the data loading logic, which shall ensure that junk data, or data that can in no way be utilized for the reports being generated, does not load into the staging system. This is simply because the data mart is the next stage within the complete BI application development flow, and for the data mart to deal only with relevant data, the staging environment must hold only the needed data with its quality intact.

Then stored procedures come into play to arrange the data from the staging tables into the data mart tables, and two different categories of data tables get created, namely the dimensions and the facts. The data arrangement logic has to be validated in these tables, with the major target area being the stored procedures. These also get created during the ETL execution phase, that is, the Integration Services phase. (A sketch of such a procedure follows below.)

Once the mart is finalized, the cubes need to be tested, wherein the testing involves no longer the execution of SQL queries but of MDX queries. Here the measures and measure groups, and their data for the various inter-related entities, come up. This is done by browsing the cubes within the Analysis Services engine. Finally, this data becomes the base for showing the relevant reports in the UI as per the business user's requirements.
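As a hedged sketch of the kind of stored procedure meant here (all names hypothetical, reusing the DimEmployee shape sketched earlier), the arrangement of staging data into a dimension might look as under:

-- Illustrative mart-load procedure: insert the staging rows whose business
-- key is not yet present in the dimension.
CREATE PROCEDURE dbo.Load_DimEmployee
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.DimEmployee (EmployeeID, EmployeeName, Department)
    SELECT s.EmployeeID, s.EmployeeName, s.Department
    FROM Staging.dbo.Employee AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.DimEmployee AS d
                      WHERE d.EmployeeID = s.EmployeeID);
END;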

There we end up with a BI application in place, ready to be delivered to the business users. Post-production defects are unavoidable in this type of application. The major cause is that the application is developed offshore, at a location far away from the end users; in most cases this means the data provided does not resemble their real-time data, owing to various data security issues. However, that is the nature of these applications, so do learn to live with it as a challenge.