#simpleit: How to build a data platform without a budget

Mon 11 2022

by bernt & torsten

This article is for small companies that have no dedicated data team, or only a single person to operate a data platform. It aims to show how to create the best data platform for your organization without a data budget. All you need is a laptop and a bit of technical know-how.

Before I explain how to set up a data platform, let us look at what the best data platform for your business would be. Data is a big topic today, and there are many articles you can read about data platforms. No data platform is universal. Instead, you assemble building blocks into a platform that fits your company.

The fact is that a data platform for one company will look slightly different than for another company. When creating the best data platform for you and your company, it’s essential to answer a few questions about your company’s culture, business objectives, structure, and more.

Data

Building a data platform starts with a few questions about your company. For example, do you need a central repository for all your company’s data, one that enables the acquisition, storage, delivery, and governance of that data while maintaining security across the data lifecycle? Let’s have a look at some of the more critical questions.

How will you gain stakeholder buy-in?

A data platform is only helpful if its users, the stakeholders across the business, are open to and familiar with it. Before creating a data platform, it’s critical to get everyone who might take advantage of it on board.

Employees in every area across the business should understand how the data platform will ultimately provide value to them. That’s the initial job of the data team: to explain and showcase that value and establish a method of measuring success even as the company scales.

Who owns what in the data stack?

How will the data be used? Will it be a shared resource viewed across the business? Who owns the company’s data at various points in the data lifecycle? The data team may own the raw data, for example, before they hand it off to the marketing team for analysis and insights, which can then be parsed and applied to a dashboard for the leadership team.

The end-to-end data stack comprises multiple building blocks that support each of these teams.

How will you measure success?

When building a data platform, it’s essential to measure how stakeholders can leverage data to support business needs and ascertain the quality and efficiency of the data team’s performance.

Will you centralize or decentralize your data platform?

Should your company choose to consolidate the data team? Will centralization impose too many bottlenecks? Will a decentralized approach lead to duplication and complexity? Understanding what each option looks like, and choosing the best model for your business, is an essential consideration as you build your data platform.

How will you tackle data reliability and trust?

As volumes of data increase, data reliability becomes increasingly important. Whether you build your own data reliability tool or buy one, it will become an essential part of a functional data platform.

Technology

Let’s look at the technology considerations that you need to think about upfront before you start building a data platform. Here are some of my thoughts on the topic:

Incremental Thinking

The first logical step is to design your data platform incrementally. When a particular step fails, you can resume from the last step that succeeded instead of recomputing the entire process. When handling large data loads, you will quickly realize that building an incremental stack cannot be an afterthought.
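One common way to make a pipeline incremental is to keep a small "watermark" that records the last record loaded successfully. The sketch below is only an illustration of that idea, not a description of any particular tool: the function names, the `id` field, and the JSON state file are all assumptions made for the example.

```python
import json
from pathlib import Path


def load_watermark(path):
    """Return the id of the last row loaded successfully, or 0 on the first run."""
    state_file = Path(path)
    if not state_file.exists():
        return 0
    return json.loads(state_file.read_text())["last_id"]


def save_watermark(path, last_id):
    """Persist the id of the last row loaded successfully."""
    Path(path).write_text(json.dumps({"last_id": last_id}))


def run_incremental(source_rows, sink, state_path, batch_size=2):
    """Load only rows newer than the watermark, one batch at a time.

    `sink` is any callable that stores a batch (hypothetical here). If it
    raises, the watermark still points at the last batch that succeeded,
    so the next run resumes from there instead of starting over.
    """
    last_id = load_watermark(state_path)
    pending = sorted(
        (row for row in source_rows if row["id"] > last_id),
        key=lambda row: row["id"],
    )
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        sink(batch)
        save_watermark(state_path, batch[-1]["id"])
    return load_watermark(state_path)
```

If a run fails halfway, calling `run_incremental` again with the same state file picks up only the rows after the last saved id.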

Lego Block Assembly

If you encounter a problem, the instinct is to write a piece of code to fix it. A smarter approach is to find existing building blocks that fix the problem instead.

Design a data platform that requires a limited amount of coding. The less custom code you create, the better off the business is, especially if you are a one-person data team.

Why? Because the more custom code you write, the more code your business will have to maintain, the more unit testing you’ll have to do, and the harder your code becomes for others to understand.

Instead, look for pre-existing blocks provided by your data stack’s different components (orchestrator, cloud provider, warehouse, etc.) and assemble them to serve your project needs. It will be cheaper and easier to maintain, and it will free up your time for the core aspects of your work.

Effective Monitoring

Even after you have created your first data pipeline, it can still fail, so setting up proper alerting and monitoring is vital. You want to be aware of things as they start misbehaving, before someone else in your business, your internal client, complains.

Set up alerting practices that generate fewer, higher-level alerts, and treat those as production incidents. You can create a dashboard that shows mission-critical failures so that errors can be taken care of in order of priority.
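To make "fewer, higher-level alerts" concrete, here is a minimal sketch that collapses raw failure events into one alert per pipeline and orders them by urgency. The event fields (`pipeline`, `severity`, `message`) and the severity scale are assumptions chosen for the example; a real monitoring tool will have its own format.

```python
from collections import defaultdict

# Hypothetical severity scale: a lower number means more urgent.
SEVERITY = {"critical": 0, "warning": 1, "info": 2}


def summarize_failures(events):
    """Collapse raw failure events into one alert per pipeline, most urgent first."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["pipeline"]].append(event)

    alerts = []
    for pipeline, items in grouped.items():
        # The alert takes the worst severity seen for that pipeline.
        worst = min(items, key=lambda e: SEVERITY[e["severity"]])
        alerts.append({
            "pipeline": pipeline,
            "severity": worst["severity"],
            "count": len(items),
            "latest_message": items[-1]["message"],
        })

    # Order by severity, then by number of failures, then by name.
    alerts.sort(key=lambda a: (SEVERITY[a["severity"]], -a["count"], a["pipeline"]))
    return alerts
```

The sorted list could feed a dashboard or a single daily message, so that ten failures in one pipeline show up as one item rather than ten.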

Data Product Management

If you are a one-person data team, taking on Data Product Management as well may be demanding. Data Product Management requires different skills. While you should have the same empathy for clients that any product manager needs, you will also need a deep technical understanding of inputs and outputs. And, of course, if you live in a SQL world, you will need a sense of database structure and SQL queries. You may have all that, so not to worry.

The Data Platform

My philosophy has always leaned towards #simpleit, the principle of simplifying systems by having as few IT systems as possible provide a company’s complete service. The same principle applies to data platforms.

The Basic Pipeline

The basic pipeline requires a bit of programming: a data extraction script, which I write in Python. The script loads the data into a MySQL database installed on my laptop. This is not a scheduled job, so you need to run it whenever you want to extract new data. In this scenario I use Google Data Studio to connect to my local laptop and pull the data into a report.
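The shape of such an extract-and-load script can be sketched as follows. This is not the script used for this article: the sample data, table name, and columns are made up, and `sqlite3` from the standard library stands in for MySQL so that the example runs without a database server. With a MySQL driver, the parameter placeholders and the upsert statement would differ (`INSERT OR REPLACE` is SQLite syntax).

```python
import csv
import io
import sqlite3

# Hypothetical extract: in practice this would come from an API or an exported file.
SAMPLE_CSV = "order_id,amount\n1,19.90\n2,5.00\n"


def extract(csv_text):
    """Parse CSV text into a list of (order_id, amount) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(row["order_id"]), float(row["amount"])) for row in reader]


def load(conn, rows):
    """Create the table if needed, upsert the rows, and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    # Upserting by primary key makes re-running the script safe.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load(connection, extract(SAMPLE_CSV)))
```

Keeping `extract` and `load` as separate functions makes it easier to swap the destination later without touching the extraction logic.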

This is an example of a data pipeline that you can set up in your own time.

The Basic Data Pipeline in the Cloud

You can take the same script and adjust the code to be executed as a Cloud Function. Then you can sign up for the Google Cloud Platform and run a scheduled data pipeline in the Cloud. How often you execute the Cloud Function determines the cost; if you run it 2 to 3 times a day as a batch job, you will stay within the free tier. The only difference from the Basic Pipeline is that we use BigQuery instead of MySQL and have a scheduler set up. I also use Cloud Storage as a data lake to store the extract files.
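As a rough illustration of what "adjusting the code" can look like, the sketch below wraps the pipeline in an entry point that takes a request and returns text, which is the general shape of an HTTP-triggered Python function. Everything specific here is an assumption for the example: the function names, the object naming scheme, and the `upload` callable, which stands in for whatever client code writes to Cloud Storage. Deployment and scheduling are not shown. The extract is serialized as newline-delimited JSON, a format BigQuery load jobs accept.

```python
import json
from datetime import datetime, timezone


def extract():
    """Hypothetical extract step; replace with the real source call."""
    return [{"order_id": 1, "amount": 19.9}, {"order_id": 2, "amount": 5.0}]


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows) + "\n"


def object_name(now, prefix="extracts/orders"):
    """Build a dated object name so each run lands in its own file."""
    return f"{prefix}/{now:%Y/%m/%d}/orders_{now:%H%M%S}.ndjson"


def run_pipeline(request, upload=None, now=None):
    """Entry point: takes a request object (unused here) and returns a summary.

    `upload` is a hypothetical callable (name, payload) that would wrap the
    storage client; it is injected so the sketch runs without cloud access.
    """
    now = now or datetime.now(timezone.utc)
    name = object_name(now)
    payload = to_ndjson(extract())
    if upload is not None:
        upload(name, payload)
    return f"wrote {name} ({len(payload.splitlines())} rows)"
```

Writing each run to its own dated file keeps the raw extracts around, so the warehouse tables can be rebuilt from the data lake if needed.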

The No Budget Open Source Data Platform

The previous two examples are basic, but you still need to maintain the code for your pipeline, and as I said before, the less code you have to maintain, the more time you have for other tasks.

As a one-person data team in my company, I base my ultimate data platform on two tools, Airbyte and Superset. Both are open source, and you can run them with Docker.

Airbyte

Airbyte is an open-source data integration tool with which you can, in just a few clicks, set up all your ELT data pipelines in minutes, even your custom ones. This allows your team to focus on insights and innovation.

With Airbyte, you connect a source to a destination, which creates a data connection. There should not be any need to write code, as many source and destination connectors are already available. If you do need to write code for a very custom source connector, Airbyte comes with a Connector Development Kit (CDK) that allows you to write your own connector.

Superset

Apache Superset is an open-source application for data exploration and visualization. You can also build dashboards and schedule them to be sent to your stakeholders.

The Data Platform setup with Docker

Setting up this data platform on your local desktop or laptop takes just minutes. The first step is to download Docker Desktop.

When you have Docker Desktop installed and running, you need to clone Airbyte and Apache Superset from their respective GitHub repositories.

Airbyte Quick Start

The Airbyte quick start is straightforward. Just do the following.

$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up

Superset Quick Start

For Superset, read the Superset documentation. The installation is as simple as Airbyte:

$ git clone https://github.com/apache/superset.git
$ cd superset
$ docker-compose -f docker-compose-non-dev.yml up

This will launch the Airbyte and Superset containers in Docker Desktop.
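If you prefer to confirm from a script that both web interfaces are reachable, a small check like the one below can help. It is only a sketch: the ports 8000 (Airbyte) and 8088 (Superset) are assumed defaults, so check each project's compose file or documentation if your setup differs.

```python
import socket

# Assumed default ports; adjust if your compose files expose different ones.
SERVICES = {"Airbyte": ("localhost", 8000), "Superset": ("localhost", 8088)}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

A port that accepts connections does not guarantee the application has finished starting, so treat this as a quick first check only.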

Now you are ready to use this powerful open-source data platform. The beauty of this setup is that I can use the MySQL instance that I run locally for loading data from Airbyte and use Superset to build a dashboard from that data. I am not limited to that, and I can still use the Airbyte BigQuery connector as a destination and then do some reporting with Data Studio.

Conclusion

This is a very flexible way of setting up your data pipeline. You can grow by adding other building blocks like dbt, Airflow, etc.

If you’re a small team, start with just a few Lego blocks for your data platform and grow as the demand increases.

One piece of advice that I have is not to get carried away by new tools or influenced by others at peer gatherings or conferences. Follow your own direction without being dictated to. The more you are influenced by others or by new tools, the more time is taken away from your main goal of delivering a data platform for your company. Please stay away from highly opinionated colleagues, as theirs is not always the best direction.

If you’re a one-person data department or a small team, you don’t have time to learn and fight with other tools and technologies. Using these tools, you get straight into the good stuff of your transforms, with nothing to set up and manage.