I was recently approached by a friend who runs a growing company to help with Data Management. Their business involves hardware and software installation on certain entities within a location and tracking the interaction between these installations. So they have lots of disparate and unwieldy data, and he wants to commission a system for taming and collating the data and making it ready for analytics and supporting functions like fraud detection and decision support. He also wants to establish a robust Data Governance framework, especially to ensure data security and quality.
Since they're on a growth trajectory, the solutions need to be low-cost or free. They currently use MySQL as their database, so they pay only for data storage (they own their servers and will switch to a self-managed, private cloud later) and not for the tool itself. This is not necessarily a daunting constraint, as there are a number of good solutions in the Data Management stack whose open-source/free (libre and cost-wise) versions are state-of-the-art.
For this use case, a formidable candidate for Data Warehousing is Greenplum, a Massively Parallel Processing system that's more popular as Massively Parallel PostGres for Analytics. I've used Greenplum extensively for processing very large volumes of data and have integrated it with downstream systems for reporting, advanced analytics & data science, as well as customer engagement and marketing. Its architecture enables it to handle large amounts of data and complex compute-workloads in a parallel fashion. In terms of features, being a fork of PostGres, it inherits its parent's useful features like PostGIS and a datatype for storing JSON documents. Additionally, it also has a solid Apache Solr based full-text search and provides Apache MadLib ML libraries that can be run in SQL, thereby taking advantage of Greenplum's MPP features. What's more, its open-source version will fit the bill in this case.
Connecting their current MySQL (OLTP) Database to the Greenplum Data Warehouse can be done using a data integration tool. Most of my professional data warehousing experience has been in enterprises that were able to afford the steep pricing of the market leader, Informatica. On the personal front, though, I've tried and read a fair bit about Talend and have admired their products and thought leadership. If we go with Talend, a simple Data Architecture diagram will look like:
ETL vs. ELT
The diagram uses the term ELT to describe the Talend component. It's not a typo for ETL but a slight paradigm shift: "Extract-Load-Transform" instead of "Extract-Transform-Load". The former is sometimes also referred to as "Push-Down Optimization". There can be a number of reasons for choosing one over the other and there are pros and cons for both but typically if the data warehousing features of the target system are used, ELT is the chosen path. For example, we can use Greenplum's MPP features and can write queries to implement the desired SCD strategies in this case.
In terms of the specific component, Talend's Open Studio for Data Integration will be appropriate. Considering Talend's vast product line for Data Management, we might be able to use their other offerings as we address further use cases. While trying to understand the features of the ELT tool (juxtaposing them with Informatica's) and researching its integration with Greenplum, I came across this highly informative video:
The project is still in its nascent stages, so I'll keep updating this blog through various stages of the project including evaluation of the tools and implementation. Please reach me at firstname.lastname@example.org if you have a similar requirement.