#simpleit: How to build a data platform without a budget

This article is for small companies that have no dedicated data team, or at most a single person operating a data platform. It shows how to set up a data platform for a small business without a data budget. All you need is a laptop and a bit of technical know-how.

Before I explain how to set up a data platform, let us look at what the best data platform for your business actually is. Data is a big topic today, and there are plenty of articles about data platforms, but no data platform is universal. Building one means assembling building blocks into a setup that fits your company.

The fact is that one company’s data platform will look slightly different from another’s. When creating the best data platform for you and your company, you must answer questions about your company’s culture, business objectives, structure, and more.

Data

Building a data platform starts with asking a few questions about your company, for example: do you need a central repository for all your company’s data, one that enables the acquisition, storage, delivery, and governance of that data while maintaining security across the data lifecycle? Let’s have a look at some of the more critical questions.

How will you gain stakeholder buy-in?

A data platform is only helpful if its users, stakeholders across the business, are open to and familiar with it. Before creating a data platform, it’s critical to get everyone who might take advantage of it on board.

Employees in every area across the business should understand how the data platform will ultimately provide value to them. That’s the initial job of the data team: to explain and showcase that value and establish a method of measuring success even as the company scales.

Who owns what in the data stack?

How will the data be used? Will it be a shared resource viewed across the business? Who owns the company’s data at various points in the data lifecycle? The data team may own the raw data, for example, before handing it off to the marketing team for analysis and insights, which can then be parsed and applied to a dashboard for the leadership team.

The end-to-end data stack comprises multiple building blocks supporting each team.

How will you measure success?

When building a data platform, it is essential to measure how well stakeholders can leverage data to support business needs, and to ascertain the quality and efficiency of the data team’s performance.

Will you centralize or decentralize your data platform?

Should your company choose to consolidate the data team? Will centralization impose too many bottlenecks? Will a decentralized approach lead to duplication and complexity? Understanding what each option looks like and choosing the best model for your business are essential considerations as you build your data platform.

How will you tackle data reliability and trust?

As volumes of data increase, data reliability becomes increasingly important. Whether you build your own data reliability tooling or buy it, it will become an essential part of a functional data platform.
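If you start by building your own, even a tiny freshness check takes you a long way. Below is a minimal sketch in Python, assuming a local MySQL database and a hypothetical orders table with a loaded_at timestamp column; adjust the names to your schema.

# freshness_check.py - minimal data reliability check (sketch).
# Assumes a local MySQL database and a hypothetical "orders" table
# with a "loaded_at" timestamp column; adjust names to your schema.
from datetime import datetime, timedelta

import mysql.connector  # pip install mysql-connector-python

MAX_STALENESS = timedelta(hours=12)  # alert if data is older than this

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="analytics"
)
cur = conn.cursor()
cur.execute("SELECT MAX(loaded_at) FROM orders")
(last_load,) = cur.fetchone()
conn.close()

if last_load is None or datetime.now() - last_load > MAX_STALENESS:
    # In practice, raise an alert here (see the monitoring section below).
    print(f"STALE DATA: last load was {last_load}")
else:
    print(f"OK: last load was {last_load}")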

Technology

Let’s look at the technology considerations you need to address upfront before building a data platform. Here are some of my thoughts on the topic:

Incremental Thinking

The first logical step is to design your data platform incrementally: when a particular step fails, you can resume from the previous one instead of recomputing the entire process. Once you handle large data loads, you will quickly realize that an incremental stack cannot be an afterthought.
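As a sketch of what incremental design can mean in practice, the script below persists a high-water mark (the newest timestamp it has processed) to disk, so a rerun picks up where the last one stopped instead of recomputing everything. All of the names here are illustrative, not from any specific library.

# incremental_extract.py - sketch of an incremental pipeline step.
# A high-water mark (the newest "updated_at" seen so far) is persisted
# to disk, so a failed or repeated run never reprocesses old records.
import json
from pathlib import Path

STATE_FILE = Path("state.json")

def load_watermark() -> str:
    """Return the last processed timestamp, or an epoch default on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return "1970-01-01T00:00:00"

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_updated_at": value}))

def extract_since(watermark: str) -> list[dict]:
    """Placeholder: query your source for records newer than the watermark,
    e.g. an API call or a SQL query filtered on updated_at > watermark."""
    return []

def run() -> None:
    watermark = load_watermark()
    rows = extract_since(watermark)
    if rows:
        # your load step would go here (e.g., inserts into MySQL)
        save_watermark(max(row["updated_at"] for row in rows))

if __name__ == "__main__":
    run()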

Lego Block Assembly

When you encounter a problem, the instinct is to write a piece of code to fix it; a smarter approach is to find existing building blocks that solve the problem instead.

Design a data platform that requires a limited amount of coding. The less custom code you create, the better off the business is, especially if you are a one-person data team.

Why? The more custom code you write, the more code your business has to maintain, the more unit testing you have to do, and the harder your code becomes for others to understand.

Instead, look for pre-existing blocks provided by your data stack’s different components (orchestrator, cloud provider, warehouse, etc.) and assemble them to serve your project’s needs. It will be cheaper and easier to maintain, and it will free up your time for the core aspects of your work.

Effective Monitoring

Even after you have created your first data pipeline, it can still fail, so setting up proper alerting and monitoring is vital. You want to be aware of things as they start misbehaving, before someone else in your business, your internal client, complains.

Set up alerting practices that generate fewer, higher-level alerts, and treat those as production incidents. You can create a dashboard that shows mission-critical failures so that errors can be handled in order of priority.
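A minimal version of this is a wrapper that posts one high-level message to a chat webhook whenever a pipeline step fails. The sketch below assumes a Slack-style webhook; the URL and step names are placeholders.

# alerting.py - sketch of fail-loudly alerting for pipeline steps.
# Posts one high-level alert to a chat webhook on failure instead of
# a flood of low-level errors. The webhook URL is a placeholder.
import json
import traceback
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/replace-me"

def alert(message: str) -> None:
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def run_step(name, func):
    """Run one pipeline step; send an alert and re-raise on failure."""
    try:
        func()
    except Exception:
        alert(f"Pipeline step '{name}' failed:\n{traceback.format_exc()}")
        raise  # treat it as a production incident, do not swallow it

You would then run each stage as run_step("extract", extract) and so on, so one failure produces one actionable alert.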

Data Product Management

Being a one-person data team makes the data product management task demanding, because it requires a different set of skills. You should have the same empathy for clients, but you will also need a deep technical understanding of inputs and outputs. And, of course, if you live in an SQL world, you will need a sense of database structure and SQL queries. You may well have all that, so don’t worry.

The Data Platform

My philosophy has always leaned towards #simpleit, the principle of simplifying by running as few IT systems as possible to provide a company’s complete service. The same principle applies to data platforms.

The Basic Pipeline

The basic pipeline requires a bit of programming to write a data-extracting script, which I do with Python. I load the data into a MySQL database installed on my laptop. This is not a scheduled job, so you must run it yourself whenever you want to extract new data. In this scenario, I use Google Data Studio to connect to my local laptop and pull the data into a report.

This is an example of a data pipeline you can set up in no time.
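As a rough illustration, the whole pipeline can be a single script like the one below. The API URL, credentials, and table definition are placeholders for whatever your source and local MySQL instance look like.

# basic_pipeline.py - sketch of the manual extract-and-load script.
# Pulls JSON from a source API and loads it into a local MySQL table.
# URL, credentials, and table/column names are placeholders.
import requests                 # pip install requests
import mysql.connector          # pip install mysql-connector-python

rows = requests.get("https://api.example.com/orders", timeout=30).json()

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="analytics"
)
cur = conn.cursor()
cur.execute(
    """CREATE TABLE IF NOT EXISTS orders (
           id INT PRIMARY KEY, amount DECIMAL(10, 2), created_at DATETIME
       )"""
)
cur.executemany(
    "REPLACE INTO orders (id, amount, created_at) VALUES (%s, %s, %s)",
    [(r["id"], r["amount"], r["created_at"]) for r in rows],
)
conn.commit()
conn.close()
print(f"Loaded {len(rows)} rows")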

The Basic Data Pipeline in the Cloud

You can take the same script and adjust the code to be executed as a Cloud Function. Sign up for the Google Cloud Platform and you can run a scheduled data pipeline in the cloud. Depending on how often you need to execute the Cloud Function, you will stay within the free tier if you run it two or three times a day as a batch job. The only differences from the basic pipeline are that we use BigQuery instead of MySQL and add a scheduler. I also use Cloud Storage as a data lake to store the extracted files.
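Here is a sketch of what the Cloud Function version could look like, loading the extract straight into BigQuery with the official client library. The project, dataset, table, and source URL are placeholders, and Cloud Scheduler would call the function’s HTTP endpoint on your chosen schedule.

# main.py - sketch of the same extract as a Google Cloud Function (HTTP).
# Triggered on a schedule by Cloud Scheduler hitting the function's URL.
# The project/dataset/table and source URL are placeholders.
import requests
from google.cloud import bigquery   # pip install google-cloud-bigquery

def run_pipeline(request):
    rows = requests.get("https://api.example.com/orders", timeout=30).json()
    client = bigquery.Client()
    # Streaming insert into a table identified as "project.dataset.table".
    errors = client.insert_rows_json("my_project.analytics.orders", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return f"Loaded {len(rows)} rows", 200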

The No Budget Open Source Data Platform

The previous two examples are basic, and you would still need to maintain the code for your pipeline; as I said before, the less code you have to maintain, the more time you have for other tasks.

Being a one-person data team in my company, the ultimate data platform I use is based on two open-source tools, Airbyte and Superset, both of which you can run with Docker.

Airbyte

Airbyte is an open-source data integration tool that lets you set up all your ELT data pipelines in minutes with just a few clicks, even the custom ones, allowing your team to focus on insights and innovation.

With Airbyte, you connect a source with a destination to create a data connection. There should be no need to write any code, as many source and destination connectors are already available. If you do need code for a very custom source, Airbyte comes with a Connector Development Kit (CDK) that allows you to write your own connector.

Superset

Apache Superset is an open-source application for data exploration and visualization; you can also build dashboards and schedule them to be sent to your stakeholders.

The Data Platform Setup with Docker

Setting up this data platform on your local desktop or laptop takes just minutes. The first step is to download Docker Desktop.

Once you have Docker Desktop installed and running, you must clone Airbyte and Apache Superset from their respective GitHub repositories.

Airbyte Quick Start

The Airbyte quick start is straightforward. Just do the following:

$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up

Superset Quick Start

For Superset, read the Superset documentation. The installation is as simple as Airbyte’s:

$ git clone https://github.com/apache/superset.git
$ cd superset
$ docker-compose -f docker-compose-non-dev.yml up

This will launch the Airbyte and Superset containers in Docker Desktop.

Now you are ready to use this powerful open-source data platform. The beauty of this setup is that I can use the MySQL instance I run locally as an Airbyte destination and build a Superset dashboard from that data. I am not limited to that: I can still use the Airbyte BigQuery connector as a destination and do some reporting with Data Studio.
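For example, when registering the local MySQL instance as a database in Superset, the SQLAlchemy URI looks something like the line below. The credentials and database name are placeholders, and host.docker.internal is how a Docker Desktop container reaches services running on the host machine.

mysql://root:secret@host.docker.internal:3306/analytics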

Conclusion

It is a very flexible way of setting up your data pipeline, and you can grow it by adding other building blocks like dbt, Airflow, etc.

If you’re a small team, start with just a few Lego blocks for your data platform and grow as the demand increases.

One piece of advice: don’t get carried away by new tools or be swayed by others at peer gatherings or conferences. Follow your own direction without being dictated to. The more you are influenced by others or by new tools, the more time is taken away from your main goal of delivering a data platform for your company. And be wary of highly opinionated colleagues; their direction is not always the best one.

If you’re a one-person data department or a small team, you don’t have time to learn and fight with other tools and technologies. Using these tools, you get straight to the good stuff, your transforms, with nothing to set up and manage.

