Automation Testing of Azure Data Factory Pipelines

What is Azure Data Factory?
It is an Azure service for ingesting data from a variety of source systems.

What to test in Azure Data Factory (ADF)?
Trigger a pipeline, then monitor and validate its successful completion.
Monitor the successful execution of each activity of the pipeline. This in turn assures the correctness of the ingestion of the source data into the target.

Why should this testing be automated?
Simply because it scores well on an automation feasibility assessment and falls into the unattended-run category.

Some sample code -
A few prerequisites for starting the automation testing of ADF - several parameters must be available.
1. Service principal - This account must have access to the ADF instance to trigger pipelines, monitor them, and perform various admin activities for tracking their execution. The automation script uses these credentials to authenticate the requests that trigger the pipelines.
2. Azure resource group, subscription id, tenant id, and similar details.
A smart approach is to keep this information in a config file to start with and, as DevOps maturity comes in, pick it up from a key vault.
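Here is a minimal sketch of reading such a secret from Azure Key Vault, assuming the Azure.Security.KeyVault.Secrets and Azure.Identity NuGet packages; the vault URL and secret name are placeholders, not values from any real setup.

        //Hedged sketch: fetch a service principal secret from Key Vault instead of the config file.
        //Namespaces assumed: Azure.Identity and Azure.Security.KeyVault.Secrets.
        private string GetSecretFromKeyVault(string secretName)
        {
            //placeholder vault URL - replace with your own key vault
            var client = new SecretClient(new Uri("https://my-test-keyvault.vault.azure.net/"), new DefaultAzureCredential());
            var secret = client.GetSecret(secretName).Value;
            return secret.Value;
        }
        //example usage: this.clientKey = this.GetSecretFromKeyVault("clientKey");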

//This is our sample test method that automates the testing of an ADF pipeline.
        [TestCategory("ADF Pipeline")]
        [TestMethod]
        public void R1_RunPipeline_PipelineSample()
        {
            this.InitialiseTest();
            string strPipeline = pipelineName_PipelineSample;
            MonitorPipeline(strPipeline);

            //This code ensures each activity within the ADF pipeline is successful.
            string[] strActivityarray = new[] { "Activity1", "Activity2", "Activity3" };
            bool flag = VerifyActivities(strActivityarray);
            Assert.IsTrue(flag, "One or more activities did not complete successfully.");
        }

//Initialises the test variables from the config file and other declared variables.
        private void InitialiseTest()
        {
            try
            {
                this.clientId = ConfigurationManager.AppSettings.Get("clientId");
                this.clientKey = ConfigurationManager.AppSettings.Get("clientKey");
                this.resourceGroup = ConfigurationManager.AppSettings.Get("resourceGroup");
                this.dataFactory = ConfigurationManager.AppSettings.Get("dataFactory");
                this.pipelineName_PipelineSample = ConfigurationManager.AppSettings.Get("pipelinename_PipelineSample");
                this.RunType = ConfigurationManager.AppSettings.Get("RunType");
                this.run_date = ConfigurationManager.AppSettings.Get("run_date");
                this.start_date = ConfigurationManager.AppSettings.Get("start_date");
                this.startDate = ConfigurationManager.AppSettings.Get("startDate");
                this.ContainerVersion = ConfigurationManager.AppSettings.Get("ContainerVersion");
                this.retries = Convert.ToInt32(ConfigurationManager.AppSettings.Get("retries"));
                this.secondsToWait = Convert.ToInt32(ConfigurationManager.AppSettings.Get("secondsToWait"));
            }
            catch (Exception e)
            {
                this.TestContext.WriteLine("Problem reading a property. Details: " + e.Message + " : " + e.StackTrace);
                throw;
            }
        }

//Here the execution is controlled: the run is triggered and its status polled until it succeeds, fails, or the retries are exhausted.
        public void MonitorPipeline(string strPipelineName)
        {
            try
            {
                bool result = this.CreateAdfClient(this.clientId, this.clientKey).GetAwaiter().GetResult();
                if (result)
                {
                    if (string.IsNullOrEmpty(this.runId))
                    {
                        this.StartPipeline(this.resourceGroup, this.dataFactory, strPipelineName);
                    }
                    this.TestContext.WriteLine("Run Id is" + this.runId);
                    int count = 0;
                    PipelineRun run = null;
                    while (count < this.retries)
                    {
                        count++;
                        run = this.innerClient.PipelineRuns.Get(this.resourceGroup, this.dataFactory, this.runId);
                        if (run.Status == "Failed")
                        {
                            this.TestContext.WriteLine("Failed status on ADF");
                            Assert.Fail();
                            break;
                        }
                        else if (run.Status == "Succeeded")
                        {
                            this.TestContext.WriteLine("Success on ADF pipeline");
                            return;
                        }
                        var ts = TimeSpan.FromSeconds(this.secondsToWait);
                        int milliseconds = (int)ts.TotalMilliseconds;
                        System.Threading.Thread.Sleep(milliseconds);
                        TestContext.WriteLine(run.Status + " : retry " + count.ToString() + " : " + DateTime.Now.ToLongTimeString());
                    }
                    Assert.Fail("Did not complete to succeeded in " + (this.secondsToWait * this.retries).ToString() + " seconds");
                }               
                else
                {
                    this.TestContext.WriteLine("Did not create adf client.");
                    Assert.Fail();
                }
            }
            catch (Exception e)
            {
                System.Diagnostics.Debug.WriteLine(e.ToString());
                this.TestContext.WriteLine(e.ToString());
                throw;
            }
        }

//Here the ADF management client is created, authenticated with the service principal credentials.
        private async Task<bool> CreateAdfClient(string client_id, string client_key)
        {
            string subscription_id = ConfigurationManager.AppSettings.Get("subscription_id");
            string tenant_id = ConfigurationManager.AppSettings.Get("tenant_id");
            var authenticationContext = new Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext($"https://login.windows.net/{tenant_id}");
            var credential = new ClientCredential(clientId: client_id, clientSecret: client_key);
            var result = await authenticationContext.AcquireTokenAsync(resource: "https://management.core.windows.net/", clientCredential: credential);
            if (result == null)
            {
                throw new InvalidOperationException("Failed to obtain the JWT token");
            }
            var cr = new TokenCredentials(result.AccessToken);
            ServiceClientCredentials cred = cr;
            this.innerClient = new DataFactoryManagementClient(cred);
            this.innerClient.SubscriptionId = subscription_id;

            return true;
        }

//Here the pipeline starts running
        private void StartPipeline(string resourceGroup, string dataFactory, string pipelineName)
        {
            System.Collections.Generic.Dictionary<string, object> parameters = new System.Collections.Generic.Dictionary<string, object>();
            parameters.Add("RunType", this.RunType);
            string runId = Guid.NewGuid().ToString();
            Console.WriteLine(runId);
            parameters.Add("RunId", runId);
            parameters.Add("run_date", this.run_date);
            parameters.Add("start_date", this.start_date);
            parameters.Add("startDate", this.startDate);
            parameters.Add("ContainerVersion", this.ContainerVersion);
            var dt = DateTime.Now;
            string description = "Executed by Automation Script on " + dt.ToShortDateString() + " : " + dt.ToShortTimeString();
            parameters.Add("RunDescription", description);
            this.TestContext.WriteLine("Description : " + description);
            CreateRunResponse response = this.innerClient.Pipelines.CreateRun(resourceGroup, dataFactory, pipelineName, null, false, null, parameters);
            this.runId = response.RunId;                   
        }

//Here is the code that verifies that each activity passed in the activity list has succeeded.
        public bool VerifyActivities(string[] strActivitiesArray)
        {
            #region Activity Verification
            RunFilterParameters filterParams = new RunFilterParameters(DateTime.UtcNow.AddMinutes(-10), DateTime.UtcNow.AddMinutes(10));
            ActivityRunsQueryResponse queryResponse = this.innerClient.ActivityRuns.QueryByPipelineRun(this.resourceGroup, this.dataFactory, this.runId, filterParams);

            for (int i = 0; i < strActivitiesArray.Length; i++)
            {
                bool found = false;
                for (int j = 0; j < queryResponse.Value.Count; j++)
                {
                    if (queryResponse.Value[j].ActivityName == strActivitiesArray[i] && queryResponse.Value[j].Status == "Succeeded")
                    {
                        found = true;
                        break;
                    }
                }
                if (!found)
                {
                    //fail fast if any expected activity is missing or did not succeed
                    return false;
                }
            }
            return true;
            #endregion
        }


So what did we achieve?
1. Ran an ADF pipeline.
2. Monitored its execution.
3. Validated that each activity in the ADF pipeline succeeded.
4. Validated that the entire ADF execution succeeded.

If you have 200-plus Azure Data Factory pipelines and they need to be tested in a smart way, you can automate each of them and give them a nightly run; the results will be available to view in the morning.
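For instance, a nightly job could run just the ADF test category through the VSTest command line; the assembly name below is a placeholder, not a real test project.

vstest.console.exe AdfAutomationTests.dll /TestCaseFilter:"TestCategory=ADF Pipeline" /Logger:trx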

It also spares testers the monotony of performing the same activity time and again over the long run.

Most useful SQL queries


To get the row count of each table within a particular database, along with some other attributes like schema name:

SELECT '[' + SCHEMA_NAME(t.schema_id) + '].[' + t.name + ']' AS fulltable_name,
SCHEMA_NAME(t.schema_id) AS schema_name, t.name AS table_name, i.rows
FROM sys.tables AS t
INNER JOIN sys.sysindexes AS i ON t.object_id = i.id AND i.indid < 2 order by i.rows desc


To generate queries that find tables containing a particular text value:
declare @text varchar(255)
set @text = 'Vip'
select 'select * from ' + rtrim(tbl.name) + ' where ' +
          rtrim(col.name) + ' like ''%' + rtrim(@text) + '%'''
from sysobjects tbl
inner join syscolumns col on tbl.id = col.id
and col.xtype in (167, 175, 231, 239) -- (n)char and (n)varchar, there may be others to include
and col.length > 30 -- arbitrary min length into which you might store a URL
where tbl.type = 'U'

To get all tables within a database along with each column in each table, its datatype, and whether it is nullable:

SELECT sh.name+'.'+o.name AS TableName ,s.name as ColumnName
        ,CASE
             WHEN t.name IN ('char','varchar') THEN t.name+'('+CASE WHEN s.max_length<0 then 'MAX' ELSE CONVERT(varchar(10),s.max_length) END+')'
             WHEN t.name IN ('nvarchar','nchar') THEN t.name+'('+CASE WHEN s.max_length<0 then 'MAX' ELSE CONVERT(varchar(10),s.max_length/2) END+')'
            WHEN t.name IN ('numeric') THEN t.name+'('+CONVERT(varchar(10),s.precision)+','+CONVERT(varchar(10),s.scale)+')'
             ELSE t.name END AS DataType
,CASE
             WHEN s.is_nullable=1 THEN 'NULL'
            ELSE 'NOT NULL' END AS Nullable
    FROM sys.columns  s
        INNER JOIN sys.types                   t ON s.system_type_id=t.user_type_id
        INNER JOIN sys.objects                 o ON s.object_id=o.object_id and o.type_desc = 'USER_TABLE'
        INNER JOIN sys.schemas                sh on o.schema_id=sh.schema_id
        LEFT OUTER JOIN sys.identity_columns  ic ON s.object_id=ic.object_id AND s.column_id=ic.column_id
        LEFT OUTER JOIN sys.computed_columns  sc ON s.object_id=sc.object_id AND s.column_id=sc.column_id
        LEFT OUTER JOIN sys.check_constraints cc ON s.object_id=cc.parent_object_id AND s.column_id=cc.parent_column_id
    ORDER BY sh.name+'.'+o.name,s.column_id

If you want to import data from Excel using OPENROWSET and sp_configure throws an error, enable ad hoc distributed queries:
sp_configure 'show advanced options', 1;
RECONFIGURE;
GO  --Added      
sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;
GO

If you want to get the details of all hosts connected to a database, along with the program each is using to interact with it:
SELECT
    c.session_id, c.net_transport, c.encrypt_option,
    c.auth_scheme, s.host_name, s.program_name,
    s.client_interface_name, s.login_name, s.nt_domain,
    s.nt_user_name, s.original_login_name, c.connect_time,
    s.login_time
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s
    ON c.session_id = s.session_id

Terminologies and Technologies in BigData

Big Data

Big Data is one of the most in-demand technologies. Startups, big tech vendors, and companies big and small are all jumping on the big data craze, the reason being that the amount of data roughly doubles every two years.

It is used to handle massive amounts of information in all sorts of formats -- tweets, posts, e-mails, documents, audio, video, feeds etc.

There is a growing number of technologies that make up the Big Data world, including all sorts of analytics, in-memory databases, NoSQL databases, and Hadoop, to name a few. We will briefly look at the various tools and terminologies that are making huge inroads into this technology stack.

Hadoop 

Hadoop is a crucial technology at the center of the whole Big Data movement.

It is open source software used to gather and store vast amounts of data and analyze it on low-cost commodity hardware. For instance, banks may use Hadoop for fraud detection, and online shopping services could use it to analyze customers' buying patterns, which can make a huge impact once integrated into a CRM system.

Cassandra 

Cassandra is a free and open source NoSQL database.

It's a kind of database that can handle and store data of different types and sizes, and it's increasingly the go-to database for mobile and cloud applications. Several companies, including Apple and Netflix, use Cassandra to great effect.

MapReduce

MapReduce has been called "the heart of Hadoop."

MapReduce is the programming model that allows Hadoop to process all kinds of data spread across many low-cost computer servers. To get meaningful results out of Hadoop, a programmer writes MapReduce programs (often in the popular language Java).

Cloudera

Cloudera is a company that makes a commercial version of Hadoop.

Although Hadoop is a free and open-source project for storing large amounts of data on inexpensive computer servers, the free version of Hadoop is not easy to use. Several companies have created friendlier versions of Hadoop, and Cloudera is arguably the most popular one.

HBase

HBase is yet another project based on the popular Hadoop technology.

Hadoop is a way to store all kinds of data across many low-cost computer servers. Once that data is stored using the Hadoop Distributed File System (HDFS), HBase can sort through that data and group bits of data together, somewhat similar to how a traditional database organizes data.

Pig 

Pig is another hot skill, thanks to demand for technologies like Big Data and Hadoop.

Pig is a high-level programming language that helps extract information from data stored in Hadoop, for example to find answers to certain questions or otherwise make use of the data.

Flume 

Flume is yet another skill spawned from the Big Data craze and the popularity of Hadoop.

Hadoop is a way to store all kinds of data across many low-cost computer servers. Flume is a method to move massive amounts of data from the place it was created into a Hadoop system.

Hive 

Hive is yet another hot, in-demand skill, courtesy of Big Data and the popularity of Hadoop.

Hadoop is a way to store all kinds of data across many low-cost computer servers. Hive provides a way to extract information from Hadoop using the same kind of traditional methods used by regular databases. (In geek speak: it gives Hadoop a database query interface).

NoSQL 

NoSQL is a new kind of database that is part of the big data phenomenon.

NoSQL has sometimes been called the cloud database. Regular databases need data to be organized. Names and account numbers need to be structured and labeled. But NoSQL doesn't care about that. It can work with all kinds of documents.

There are a number of popular noSQL databases including Mongo, Couchbase and Cassandra.

Zookeeper 

Zookeeper is a free and open-source project that also came from the big data craze, particularly the uber popular tech called Hadoop.

Hadoop is a way to store all kinds of data across many low-cost computer servers. Zookeeper is like a file system, a way to name, locate, and sync files used in Hadoop. But now it's being used with other big-data technologies beyond Hadoop.

Arista 

Arista makes a computer network switch used in big data centers.

Its claim to fame is its operating system software which users can programme to add features, write apps or make changes to the network.



At the center of the much-in-demand Big Data technology is something called "analytics," the ability to sift through humongous amounts of data and gather business intelligence out of it.

R is the language of choice for this. It is used for statistical analysis and graphics/visualization.

Sqoop -

Sqoop is one of those skills that has zoomed into popularity, thanks to Big Data craze.

It's a free and open source tool that lets you transfer data between the popular Big Data storage system, Hadoop, and classic relational databases like the ones made by Oracle, IBM and Microsoft.

It's a command-line interface tool, meaning you have to know the commands and type them directly into the system, rather than click on them with a mouse.


While Big Data options like Hadoop are the new-age way of dealing with data, Documentum (EMC Documentum is an "enterprise content management" system) remains a popular tool in industries that still use a lot of paper or electronic forms, like legal, medical, and insurance - major sectors where Big Data can bring about a revolution.

While NoSQL databases are increasingly becoming popular for new applications, many companies still run RDBMS-based systems.
Relational Database Management System is the full form of RDBMS, a type of database management system. This is the traditional kind of database that uses the structured query language (SQL), as used by databases like Oracle, Microsoft SQL Server, and IBM DB2.

There are data scientists working on the tech side, the marketing side, and just about every other area of business in enterprise systems, and in just about every size of company. They figure out how to get meaningful numbers and information from large volumes of data. And Big Data is the most magical word for them.

What is API testing?

How you approach testing an API depends on a lot of things. 
Will the API be a public API that will be consumed by some external people/systems, or is it a part of a larger product's infrastructure?
Logically, an API can be a COM interface, a DLL, a JAR, or a REST web service.
There are different approaches which can be applied to test it out.
An internal API can be tested through unit testing and through the use of the product that consumes it.
If it is an externally consumable API, then you need to be much more thorough, because people could use it in different ways than you might expect and send data in much different formats, etc. It also usually needs to make sense, be intuitive, and be well documented.
Testing an API nearly always requires you to create some sort of consumer for testing purposes. You have to create an application to interact with the API. The application is usually very simple and driven by automated test cases, with little or no manual interaction.
If the API has dependencies, you may choose to mock those dependencies out so you can more thoroughly test all of those interactions and hit all of the positive and negative code paths. 
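For instance, a database dependency can be mocked so that negative paths are exercised deterministically. A minimal sketch, assuming the Moq library and a hypothetical ICustomerRepository/CustomerApi pair (neither is from a real project described here):

        [TestMethod]
        public void GetCustomer_WhenRecordMissing_ReturnsNull()
        {
            var repository = new Mock<ICustomerRepository>();             //mock the database dependency
            repository.Setup(r => r.GetById(42)).Returns((Customer)null); //simulate a missing record
            var api = new CustomerApi(repository.Object);                 //hypothetical API under test

            var result = api.GetCustomer(42);

            Assert.IsNull(result); //negative path covered without touching a real database
        }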
Suppose an API interacts with a database and supports CRUD operations (a small test sketch follows this list):
Create an invalid record with some business rule violation, such as a foreign key violation, a unique key constraint violation, a NOT NULL violation, or even a precision violation, and analyse how the system behaves in each of these cases.
Read records when the row count for a table is huge, assess the response times, and note whether the query times out on the server - a typical bug you see in almost all systems when they are first developed.
Update a set of records with valid business rules, then with invalid business rules and invalid constraints; try simultaneous updates involving pessimistic locks and optimistic locks.
Delete a set of records that involves a cascade delete, and delete non-existent records. Many of the use-case scenarios that come up in a system become criteria for API testing.
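Here is a minimal sketch of such a negative Create test against a REST endpoint using HttpClient; the host, route, payload, and expected status code are assumptions for illustration only.

        //Hedged sketch of a negative CRUD test over REST, assuming a hypothetical /api/customers endpoint.
        [TestMethod]
        public void CreateCustomer_WithNullMandatoryField_ReturnsBadRequest()
        {
            using (var client = new HttpClient())
            {
                client.BaseAddress = new Uri("https://myapp.example.com/");          //placeholder host
                var payload = "{ \"name\": null, \"email\": \"test@example.com\" }"; //violates a NOT NULL style rule
                var content = new StringContent(payload, Encoding.UTF8, "application/json");

                HttpResponseMessage response = client.PostAsync("api/customers", content).GetAwaiter().GetResult();

                //the API should reject the record rather than fail with a 500 or silently accept it
                Assert.AreEqual(HttpStatusCode.BadRequest, response.StatusCode);
            }
        }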
To be contd..

Test Cases Design and Execution Technique

                    Software testing has never been as difficult as software development !!!
I would prefer not to get into discussing the statement outlined above. Let us focus on something that adds more value to the time we spend on this portal. Managing testing with optimum utilization of the test management system is the primary goal of the article outlined below.

When we talk of designing the test cases in a system, there are some basic parameters offered by various tools which enhance the overall ease of managing the work. Let us discuss some aspects that will add more to what we generally keep doing. We will not discuss the basic ones like title, steps description, automation status etc., as we all know them and are adept at managing them.

In addition to what we already have, we should also have the parameters outlined below. Let us visit them one by one, looking at why we need these additional parameters and in what type of situation we can cash in on the opportunity they offer.

1. Iteration : This parameter provides us the opportunity to segregate test cases based on the iteration in which their feature was implemented in the product. Our software is developed in multiple cycles, and this is one aspect which helps us track a test case's creation and its first inception in the system. Gradually, as the iterations progress, we end up with test cases belonging to various iterations. Any test case can belong to just one iteration, and that property also helps us map feature to implementation to iteration.
     Iteration 1
     Iteration 2
     Iteration 3

2. Functional Area : This property enables us to identify the feature to which a test case belongs. We may have multiple features implemented within a piece of software. Let us take an example from the e-commerce domain: the primary features in this case would be the various business segments such as apparel, home decor, electrical appliances, etc. This is an ideal property within a test case to give a tree structure, simply because implementation of such big features may not be completed in full within a particular iteration. So how do we track such timelines? Here is the solution the tree structure provides: Apparel can be broken into multiple areas like women, men, children, and so on and so forth, and this depth can go to any level. One thing we must keep at the back of our mind: a single feature should not flow into multiple iterations, as that results in multiple complications in terms of reporting and traceability. A simple rule of thumb will be - "A single iteration can have multiple Functional Areas, but the reverse is not possible".
     Payment
     Authentication/Authorization
     Reports
     Usability

3. Test Case Category :
     BAT [build acceptance test]
     Sprint Related
     Regression :
            High Impact Breadth - wise
            Medium Impact Breadth - wise
            Low Impact Breadth - wise

4. Priority :
     P1 - Critical Path Business Scenarios
                     Important to reject/accept a build.
     P2 - Average Path Business Scenarios
                      Important to decide on pre - scheduling of any interim drop
     P3 - Low Path Business Scenarios
                      Non - functional aspects of the system under test

Will continue it once I have had some sleep. Too tired to type. Though thinking continues to make it more interesting !!