Analyze and visualize a BigQuery dataset with Google Cloud Datalab

In this article, I will cover the essential skills for using Google Cloud Datalab to analyze a public BigQuery dataset: how to bring BigQuery data into Cloud Datalab, and how to visualize that data within the same notebook.

Have you used IPython or Jupyter notebooks before?

Then you are already familiar with what a notebook is. For the newcomer to data analysis with notebooks: data scientists work in these collaborative, self-describing, executable notebooks whenever they want to do some kind of data analysis or machine learning task.

Google Cloud Datalab provides an excellent integration of the IPython/Jupyter notebook system with Google's BigQuery. Google has also integrated standard Python libraries, such as graphics packages and scikit-learn, as well as Google's own machine learning toolkit, TensorFlow.

To use it, you will need a Google Cloud account. You can get started with the free account, which will be sufficient if you are interested in just trying it out.

You may ask why you need a Google account when you can run Jupyter, IPython, and TensorFlow on your own resources.

The reason is that you can easily access BigQuery-sized data collections directly from the notebook running on your laptop. To get started, you can read about Google Cloud Datalab; the documentation explains the installation and gives you two choices: you may either install the Datalab package locally on your machine, or you may install it on a VM in the Google Cloud.

If you continue reading, I will explain how to install Google Cloud Datalab in the Google Cloud and give you an example of how to create your first notebook.

We all love open source, do we not? Google Cloud Datalab is based on Jupyter, and it's open source.

Why Google Cloud Datalab?

You could use other tools for your data analysis, but with Google Cloud Datalab you can run an experiment, run a query, look at the output, update your documentation, add links, run more experiments, and then share those results and collaborate with others.

Google Cloud Datalab is more interactive and dynamic than just doing something like writing a SQL query. Collaborating with other data scientists with notebooks is straightforward.

Let’s get started

Let’s jump straight in: we are going to start Cloud Datalab. The prerequisite is that you have a Google Cloud Platform (GCP) account and that you have created a project.

Go to your GCP home and click on the Cloud Shell icon in the top right-hand corner of your Google Cloud Platform console. A terminal window opens at the bottom of your screen. When you get the prompt, all you’ve got to do is type in:

datalab create cloud-datalab-vm

Give your instance a name of your choice.

cloud-datalab-vm

Specify a machine type that you want to use. You can look here for details about machine types.

--machine-type n1-highmem-8

And specify which zone your instance should run in. You can look here for details on Regions and Zones.

--zone us-central1-a

You could, of course, give the full information to create in one command:

datalab create cloud-datalab-vm --machine-type n1-highmem-8 --zone us-central1-a

Next, your machine will be started; just follow the instructions prompted to you.

This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/
This tool needs to create the directory [/home/tzetter/.ssh] before
being able to generate SSH keys.

Next, you want to click the preview button.

One common error that I have seen is that the Cloud Source Repositories API is not enabled. If you get that error, go to APIs & Services in GCP and enable the API.

Enable the Cloud Source Repositories API

Another type of error is that when you click preview, you cannot connect to the port. In that case, you need to change the port: under the preview button, you will find the option to change it. Your console message will hint at which port Google Cloud Datalab will start on.

Change the port

If you are successful when clicking preview, you will see the Google Cloud Datalab notebook started. Voilà, now you can get started with creating your analysis notebooks.

Getting started with your first Cloud Datalab notebooks

Now that you have the Google Cloud Datalab running, the first step would be to take a look at the samples and familiarize yourself with the sample notebooks.

A Simple Example

If you continue reading, I will show you how to create your first notebook with this simple example notebook that I created for you, using a publicly available BigQuery dataset called Bay Area Bike Share Trips Data.

I started with a blank notebook by clicking +Notebook on the menu bar, then gave it a name. A good practice is to start your blank notebook by adding a markdown box defining what your notebook is about and what the outcome of your analysis should be.

Next, I added a code box in which I declared that I wanted to use BigQuery. You can skip this box if you run in the cloud; if you run your notebook locally on your laptop, you need to declare the use of BigQuery, as sketched below.
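A minimal sketch of that declaration, assuming the google.datalab Python API that ships with Cloud Datalab:

# Only needed when the notebook runs outside the Cloud Datalab VM
import google.datalab.bigquery as bq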

In my next code box, I added a statement describing the database table that I’m going to use. This is helpful because it gives you the details of your table, which will help you define your query.
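For example, a code box along these lines describes the public Bay Area Bike Share table (the table name below refers to the public bigquery-public-data project; adjust it if you use another table):

%%bq tables describe --name bigquery-public-data.san_francisco.bikeshare_trips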

It’s good practice to add a markdown box between the code boxes so you can explain what you are doing. This will help others with whom you share your notebook understand what is going on.

In my example, in the following code box, I have defined a SQL statement. You can write a new SQL statement, or, if you already have one created and saved in BigQuery, you can copy and paste that SQL statement from BigQuery into your notebook.

What you have to add is a line before your SQL query: the namespace. In this example, I call it “cyclesharing”, which is the name under which the result of the SQL query is stored. A sketch of such a code box follows.
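As an illustration, assuming the public bike share table (the SQL below is an example query, not necessarily the one from my notebook):

%%bq query -n cyclesharing
SELECT start_station_name, COUNT(*) AS trips
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
GROUP BY start_station_name
ORDER BY trips DESC
LIMIT 10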

To see the result in your notebook, you can use the command

%%bq execute -q cyclesharing

This is handy for checking what the result of your query is, but it is not a required step; you can go directly from the query command to the visualization.

The namespace “cyclesharing” exists in the background, and you can use it to create visualizations. Here is a simple column chart that visualizes the “cyclesharing” data.
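As a sketch, assuming the %%chart magic that Cloud Datalab provides and the field names from the example query above:

%%chart columns --data cyclesharing --fields start_station_name,trips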

You will notice that the output here isn’t just command-line output: you can output charts and graphs, and there are more options available to you.

This concludes my walkthrough of how to create your first notebook; you can copy this basic notebook from GitHub.

Save your notebook source code

If you don’t want to lose your work, you can download your notebook to your local machine or commit your changes to a version control system using Git. Google Cloud Datalab stores the code in the Google Cloud Platform code repository.
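For example, assuming the default repository name datalab-notebooks that Datalab uses for this (check the repositories in your own project), you could clone it from Cloud Shell:

gcloud source repos clone datalab-notebooks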

Shutdown your Google Cloud Datalab notebook

You can shut down the notebook you created by selecting Notebooks from the menu bar and then clicking the Shutdown button for the notebook you want to stop. There is no need to keep your notebook active if you are not using it.

Shutdown your Google Cloud Datalab notebook VM

Don’t forget to shut down your notebook VM. You can go to Compute Engine within GCP and view which instances are running.
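Equivalently, you can list your instances and their status from Cloud Shell:

gcloud compute instances list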

The Google Cloud Datalab notebooks themselves run on Compute Engine, and you can think of that as rented infrastructure. You should not keep the Compute Engine instance running continuously, as you will pay for those compute cycles. Simply stop the Google Cloud Datalab VM instance when you are not using it, and fire it back up when you need to do some work.

You can shut down your Google Cloud Datalab VM from the Compute Engine view, or you can shut down your instance from within Google Cloud Datalab.
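A third option is the datalab command-line tool in Cloud Shell; a sketch, using the instance name created earlier:

datalab stop cloud-datalab-vm

The stopped VM keeps its persistent disk, and you can restart and reconnect to it later with:

datalab connect cloud-datalab-vm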

When the Compute Engine instance with your Google Cloud Datalab notebook goes away, what do you think happens to the notebook?

Well, it disappears, so make sure you make a copy or save the notebook before you delete your Compute Engine VM instance.


Conclusion

Data is the new gold mine for organizations: trends in historical data are used to make decisions, and by adding machine learning to the mix, you can use your historical data to predict the future. Who does not want to predict the future?

What’s your view on this? I’d like to hear from you; you can comment in the comment box below or contact me through the contact form.

