
Continuous Integration / Continuous Deployment with Data

Review of CI/CD challenges when delivering data.

I am going to approach this topic through the lens of data, which I maintain presents a different set of Continuous Integration / Continuous Deployment (CI/CD) project challenges compared to, for example, application development. Firstly, as is my habit, let's start with a couple of definitions.

Continuous Integration (CI)

Continuous Integration, as the name implies, is the practice of continuously integrating changes back into the main code 'branch'. The changes are validated by running automated tests, and by doing so the project team avoids the train wreck of waiting for a release day to merge changes.
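
To make this concrete, here is a minimal sketch, in Python with pytest and an in-memory SQLite database, of the kind of automated data quality test a CI server could run on every merge. The customers table and its columns are purely illustrative, not from any real schema:

    # test_customer_load.py - a hypothetical data quality test that a CI server
    # could run on every merge. The customers table and columns are illustrative.
    import sqlite3

    import pytest

    @pytest.fixture
    def conn():
        # A small in-memory database standing in for the integration schema.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
        conn.executemany(
            "INSERT INTO customers (id, email) VALUES (?, ?)",
            [(1, "a@example.com"), (2, "b@example.com")],
        )
        yield conn
        conn.close()

    def test_no_duplicate_ids(conn):
        # A duplicated business key should fail the build before the merge lands.
        dupes = conn.execute(
            "SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1"
        ).fetchall()
        assert dupes == []

    def test_no_missing_emails(conn):
        missing = conn.execute(
            "SELECT COUNT(*) FROM customers WHERE email IS NULL"
        ).fetchone()[0]
        assert missing == 0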

Continuous Deployment (CD)

Continuous Deployment is the practice where every change that successfully passes through the production pipeline is released. There is minimal human intervention, and only a failed test will prevent a new change from being deployed to production.
Note that I am deliberately using the terms Deployment and Delivery to mean the same thing; I realise that some choose to make a distinction between them.
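
Conceptually, the gate looks something like the sketch below. The test command and deploy script names are hypothetical, and a real pipeline would express this in the CI/CD tool's own configuration rather than a Python script:

    # deploy_gate.py - a conceptual sketch of the rule that only a failed test
    # stops a release. The test command and deploy script are hypothetical.
    import subprocess
    import sys

    # Run the automated test suite; a non-zero exit code means at least one failure.
    tests = subprocess.run(["pytest", "tests/"])

    if tests.returncode != 0:
        sys.exit("Tests failed - the change will NOT be deployed to production.")

    # Every passing change flows straight through to production, with no human gate.
    subprocess.run(["./deploy_to_production.sh"], check=True)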

Challenges

I have excluded generic challenges (e.g. trying to automate too much too soon) and only discuss D&A issues. So let's dive in and see what makes us special!

  • Data and Analytics (D&A) projects do not natively sit well with agile delivery. Whilst projects start off as agile, they very quickly turn iterative, or even waterfall, due to the amount of integration and modelling required before a feature is delivered. Check out Agile Data Warehouse Design by Lawrence Corr and Jim Stagnitto to help avoid this.
  • D & A delivery teams are not traditionally strong in code management. There I have said it!! Mostly it is because we have used tools that have some form of version control embedded within them, for example Informatica, Matillion, etc. Now I am not saying these are as good as a dedicated source control solution like GIT, TFS, etc, they are often used out of convenience.
  • Project teams often have to share environments due to the cost of each team having their own. This means they cannot achieve the velocity needed to operate in an agile manner, and it prevents them from completely automating the pipeline.
  • Database changes strike fear into project teams, so they are frequently done manually; the consequence of getting one wrong is a database restore, assuming the data is small enough to take that approach. (A versioned migration approach is sketched after this list.)
  • D & A features tend not to be simple enough to be released in a sprint, often we are building a iceberg where the bit above the water represents the features the end user actually cares about.
  • Automated tests can take a long time to run when we are dealing with ‘large’ datasets. This can prevent the use of a CI/CD pipeline.
  • When D & A projects release a feature, it can include tasks that do not sit comfortably in a automated deployment to production. Examples are:
    • Historical load of data that may take a day or more.
    • Remediation of existing data.
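
On the database change point above, one way to take the fear out is to script every change as a versioned migration rather than applying it by hand. Below is a minimal sketch of such a runner, assuming a SQLite target and a hypothetical migrations folder of numbered .sql files; mature tools such as Flyway or Liquibase do this properly, but the principle is the same:

    # migrate.py - a minimal sketch of a versioned migration runner, assuming a
    # SQLite target and a hypothetical migrations/ folder of numbered .sql files
    # (e.g. 001_create_customers.sql). Scripted, repeatable changes replace
    # manual ones, so every environment can be brought to the same version.
    import sqlite3
    from pathlib import Path

    def run_migrations(db_path: str, migrations_dir: str = "migrations") -> None:
        conn = sqlite3.connect(db_path)
        # Track which scripts have already been applied so reruns are harmless.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_version (script TEXT PRIMARY KEY)"
        )
        applied = {row[0] for row in conn.execute("SELECT script FROM schema_version")}

        for script in sorted(Path(migrations_dir).glob("*.sql")):
            if script.name in applied:
                continue
            conn.executescript(script.read_text())  # apply the change
            conn.execute(
                "INSERT INTO schema_version (script) VALUES (?)", (script.name,)
            )
            conn.commit()  # record it so a rerun skips this script
            print(f"applied {script.name}")
        conn.close()

    if __name__ == "__main__":
        run_migrations("warehouse.db")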

Solutions

So how do we get around these challenges? Some steps we can take are:

  • Include a data lake as a layer within the architecture; this makes data rapidly available to end users without it having to wade through multiple layers of architecture (e.g. Land, Stage, Presentation) before an end user gets the data.
  • Embed and adopt Bi-Modal (or even Tri-Modal if you feel the urge!) delivery approaches.
  • Educate the D&A team in code control fundamentals. This is going to become an essential skill set as languages like Python become more prevalent in future solutions.
  • Educate the business that if they want rapid continuous delivery, there will be additional costs to automate the CI/CD pipeline.
  • If you are delivering multiple 'subject areas' at once, across multiple projects, you need an appropriately sized integration environment for each project team. If you don't, they will grind to a halt.
  • Make the features as small as possible. If you can't, you will end up with agile aspirations that result in a waterfall delivery: lots of CI and little CD.
  • Work with a subset of data as often as possible, especially when running automated tests, unless of course we are dealing with trivial amounts of data. (A sampling approach is sketched after this list.)
  • Work with production-quality data within the integration and QA environments. Volume and variety are essential to effectively run automated testing.
  • Finally, and most importantly, we need our D&A engineers to start behaving more like software engineers. It needs to become second nature for us to manage our code, whether we are delivering ETL, Streaming Analytics, Dashboards, Advanced Analytics or model changes.
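
On the subset point above, the trick is to make the sample deterministic so that automated tests stay reproducible from run to run. Here is a minimal sketch, assuming a hypothetical transactions table in SQLite:

    # sample_for_tests.py - a sketch of carving a small, deterministic subset out
    # of a large table for automated tests. Table and column names are hypothetical.
    import sqlite3

    def build_test_subset(conn, fraction=0.01):
        # Copy a stable ~1% slice of 'transactions' into a test table. Keying the
        # sample off the id (rather than ORDER BY RANDOM()) means the same rows
        # are selected on every run, so tests stay reproducible.
        conn.execute("DROP TABLE IF EXISTS transactions_test")
        conn.execute(
            "CREATE TABLE transactions_test AS "
            "SELECT * FROM transactions WHERE id % 100 < ?",
            (int(fraction * 100),),
        )
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO transactions (id, amount) VALUES (?, ?)",
            [(i, i * 1.5) for i in range(10000)],
        )
        build_test_subset(conn)
        count = conn.execute("SELECT COUNT(*) FROM transactions_test").fetchone()[0]
        print(f"test subset holds {count} of 10000 rows")  # prints 100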
