Creating a dataset using an API with Python

For any data analytics project I do, the first thing I need is a dataset. I have put together a list of online datasets that you can use for your analytics, but sometimes you want to extract data on your own and begin your own investigation.

One way to collect data and create your own dataset is to use a device like your mobile phone or a specialized device, which is what I did in "Data project: I became an environmental data collector".

In this article, I am going to show how you can extract data from an API, using NewsAPI to fetch news articles based on keywords and storing the news data in Firestore to create a dataset for a mobile app.

An application programming interface (API) is code that allows a connection between computers or between computer applications. The API defines the structure a developer uses to write a program that requests services from another application.

An API is a tool that allows you to access information from a given website or service. Some APIs may require you to sign in, for example using OAuth, to get access to the data. The code that I will show can be run on the Google Cloud Platform as a Cloud Function.

Import Python Libraries

As part of accessing the API content and writing to Firestore, we'll have to import a number of Python libraries:

  1. requests – helps us get the content from the API by using the get() method.
  2. json – needed so that we can work with the JSON content we get from the API.
  3. logging – this module provides a flexible event logging system.
  4. os – this module provides a portable way of using operating-system-dependent functionality.
  5. urllib – a package that collects several modules for working with URLs.
  6. datetime – supplies classes for manipulating dates and times.
  7. pytz – allows accurate and cross-platform timezone calculations.
  8. firebase-admin – the Firebase Admin Python SDK enables access to Firebase services.
  9. google-api-python-client – the Google API Python client library for Google's discovery-based APIs.
  10. google-cloud-secret-manager – the Python client library for the GCP Secret Manager API, which I use to store my API keys.
  11. newspaper – a Python module used for extracting and parsing newspaper articles.
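
Of these, json, logging, os, urllib, and datetime are part of the Python standard library, so only the rest need to be installed. For a Cloud Function deployment, a requirements.txt along these lines should cover them (note that the newspaper library is published on PyPI as newspaper3k; pin versions as you see fit):

pytz
requests
firebase-admin
google-api-python-client
google-cloud-secret-manager
newspaper3k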

Understand the API

The first step is to understand NewsAPI and what you can do with it. Whichever API you are going to use, I recommend studying its documentation first; for this project that means the NewsAPI documentation.

Request parameters

# date today
today = date.today()

parameters = {
   'q': query, # query phrase
   'from': today,
   'sortBy': 'publishedAt',
   'pageSize': 20,  # maximum is 100
   'apiKey': newsapiapikey # your own API key
}

I first define the parameters that are part of the request. According to the NewsAPI documentation, I need to provide the keyword phrase (stored in query), the date I want news from, how many news items to return, and the API key.
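
If you want to inspect the exact URL that requests will build from these parameters before sending anything, you can prepare the request first. A quick illustration with placeholder values:

from requests import Request

# Build, but do not send, the request so we can inspect the final URL
req = Request('GET', 'https://newsapi.org/v2/everything',
              params={'q': 'climate change', 'from': '2021-06-01',
                      'sortBy': 'publishedAt', 'pageSize': 20,
                      'apiKey': 'YOUR_API_KEY'}).prepare()
print(req.url)
# https://newsapi.org/v2/everything?q=climate+change&from=2021-06-01&sortBy=publishedAt&pageSize=20&apiKey=YOUR_API_KEY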

Keyword phrases

I store the keyword phrases that I want to use in a collection in Firestore. Since I have more than one keyword phrase, I use a for loop to go through all of them.

    for doc in searchquery_ref.where(u'active', u'==', True).stream():
        query = u'{}'.format(doc.to_dict()['keyword'])
        query = urllib.parse.quote_plus(query)
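
For reference, each document in the news_keywords collection only needs the two fields that the query above relies on. Adding a new keyword phrase would look something like this (the phrase itself is just an example):

# Illustrative: add a keyword phrase to the news_keywords collection
db.collection(u'news_keywords').add({
    u'keyword': u'climate change',  # example phrase, replace with your own
    u'active': True,                # only documents with active == True are queried
})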

API Key

I store the API key I received from NewsAPI in Google Cloud Platform Secret Manager, and I access it like this:

# Get the secret for Newsapi
secret_name = "Newsapi_apikey"
resource_name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
response = client.access_secret_version(request={"name": resource_name})
newsapiapikey = response.payload.data.decode('UTF-8')
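
The secret itself only needs to be created once. A minimal sketch using the same client library (the key value is a placeholder):

# One-time setup: create the secret and store the NewsAPI key in it
client.create_secret(request={
    "parent": f"projects/{project_id}",
    "secret_id": "Newsapi_apikey",
    "secret": {"replication": {"automatic": {}}},
})
client.add_secret_version(request={
    "parent": f"projects/{project_id}/secrets/Newsapi_apikey",
    "payload": {"data": b"YOUR_NEWSAPI_KEY"},
})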

requests.get

To access NewsAPI, I use the function call requests.get(url, params=parameters), which gets me the response from NewsAPI for that URL.

# Define the endpoint
url = 'https://newsapi.org/v2/everything?'

newsitems = requests.get(url, params=parameters).json()

I store the parsed JSON output from requests.get() in newsitems; the response contains a lot of information about the news items we have received.

for i in newsitems['articles']:
   url = i['url']

I loop over the articles field of the response, which contains all the returned articles.
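
For orientation, each element of newsitems['articles'] is a dictionary that, per the NewsAPI documentation, looks roughly like this (values are illustrative):

# One element of newsitems['articles'] (illustrative values)
{
    'source': {'id': None, 'name': 'Example News'},
    'author': 'Jane Doe',
    'title': 'Example headline',
    'description': 'A short description of the article',
    'url': 'https://example.com/news/article-1',
    'urlToImage': 'https://example.com/news/article-1.jpg',
    'publishedAt': '2021-06-01T12:00:00Z',
    'content': 'The first part of the article content…'
}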

Check Duplicates

From the response I am mainly interested in the article URL, which I use both for duplicate checking against my Firestore collection and as input to the newspaper library's Article class to get more data about the article.

docsurl = db.collection(u'news_articles').where(u'url', u'==', url).stream()
if len(list(docsurl)):
   logging.info("URL exists, we will ignore it")
else:
   logging.info("URL not found, we will add it to the database")

Create the records for my dataset

Now that I have the data, the next step is to create the records for the dataset that I store in my Firestore news_articles collection. First, I use the Article class from the Python newspaper library to get more details about each news item from its URL.

try:
   article = Article(url)
   article.download()  # fetch the article HTML
   article.parse()     # extract text, authors, images and publish date
   # ... (the matching except block is shown in the full source below)

Next, I create the record for that news item to be stored in my Firestore collection. I combine data from the request made to NewsAPI with additional data that I get from the Article object of the Python newspaper library.

newsdata = {
   'active': True,
   'title': i['title'],
   'source': i['source'],
   'article': article.text,
   'summary': article.summary,
   'author': i['author'],
   'description': i['description'],
   'authors': article.authors,
   'publishdate': article.publish_date,
   'publishedAt': _now(),
   'url': url, # the article URL, also used for duplicate checking
   'content': i['content'],
   'urlToImage': i['urlToImage'],
   'image': article.top_image,
   'keywords': article.keywords,
   'movies': article.movies,
   'category': 'online news',
   'keyword': query,
}

Now I have a record, which I can write to my Firestore database collection news_articles.

db.collection('news_articles').document().set(newsdata) 
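
Calling document() with no argument generates a random document ID. The Firestore client also offers add(), which does the same thing in a single call:

# Equivalent: add() creates a document with an auto-generated ID
db.collection('news_articles').add(newsdata)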

I also send a notification to my Slack channel so I can follow what news has been added.

payload = json.dumps({"text": "*Title: {0}*\n{1}\n{2}\nRead More: {3}".format(i['title'], i['publishedAt'], i['description'], url)})
response = requests.post(slackurl, headers=slackheaders, data=payload.encode('utf-8'))
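
Note that I build the payload with json.dumps instead of formatting the JSON string by hand, so a title containing quotes cannot produce an invalid payload, and the Read More link points to the article URL itself rather than the source's front page.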

That is all. The dataset I have created is updated daily, since the code I have written is deployed as a Cloud Function that runs twice a day. My dataset grows every day and is used by a mobile application I have developed, which shows me the news items I am interested in; the app is similar to Flipboard.

Full Source Code

import logging
import json
import os
import urllib
from datetime import datetime
from datetime import date
import pytz
import requests
from newspaper import Article

# Google Cloud libraries
from google.cloud import secretmanager

# Integrated Firebase services
import firebase_admin
from firebase_admin import credentials, firestore

# Setup the Secret manager Client
client = secretmanager.SecretManagerServiceClient()
# Get the sites environment credentials
project_id = os.environ["PROJECT_NAME"]
#project_id = 'social-climate-tech'

# initialize firebase sdk
CREDENTIALS = credentials.ApplicationDefault()
firebase_admin.initialize_app(CREDENTIALS, {
    'projectId': project_id,
})

# Get the secret for Slackkey
secret_name = "newsAPISLackKey"
resource_name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
response = client.access_secret_version(request={"name": resource_name})
slackurl = response.payload.data.decode('UTF-8')

# Get the secret for Newsapi
secret_name = "Newsapi_apikey"
resource_name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
response = client.access_secret_version(request={"name": resource_name})
newsapiapikey = response.payload.data.decode('UTF-8')

# Integrate Slack Channel
slackheaders = {
    'Content-Type': 'application/json'
}

# get firestore client
db = firestore.client()

def getNews_http(request):

    searchquery_ref = db.collection(u'news_keywords')
    
    for doc in searchquery_ref.where(u'active', u'==', True).stream():
        query = u'{}'.format(doc.to_dict()['keyword'])
        query = urllib.parse.quote_plus(query)

        # date today
        today = date.today()

        parameters = {
            'q': query, # query phrase
            'from': today,
            'sortBy': 'publishedAt',
            'pageSize': 20,  # maximum is 100
            'apiKey': newsapiapikey # your own API key
        }

        # Define the endpoint
        url = 'https://newsapi.org/v2/everything?'

        newsitems = requests.get(url, params=parameters).json()

        for i in newsitems['articles']:
            url = i['url']
            # Duplicate check: skip articles that are already stored
            docsurl = db.collection(u'news_articles').where(u'url', u'==', url).stream()
            if len(list(docsurl)):
                logging.info("URL exists, we will ignore it")
                continue
            logging.info("URL not found, we will add it to the database")
            try:
                article = Article(url)
                article.download()
                article.parse() # Parse the HTML
#                article.nlp() # Apply NLP; required to populate article.summary and article.keywords

                newsdata = {
                    'active': True,
                    'title': i['title'],
                    'source': i['source'],
                    'article': article.text,
                    'summary': article.summary,
                    'author': i['author'],
                    'description': i['description'],
                    'authors': article.authors,
                    'publishdate': article.publish_date, # publish date extracted from the article
                    'publishedAt': _now(), # timestamp for when this record was created
                    'url': url, # the article URL, also used for duplicate checking
                    'content': i['content'],
                    'urlToImage': i['urlToImage'],
                    'image': article.top_image,
                    'keywords': article.keywords,
                    'movies': article.movies,
                    'category': 'online news',
                    'keyword': query,
                }
                db.collection('news_articles').document().set(newsdata)  # Add a new document with an auto-generated ID
                # Send to Slack Channel
                payload = json.dumps({"text": "*Title: {0}*\n{1}\n{2}\nRead More: {3}".format(i['title'], i['publishedAt'], i['description'], url)})
                response = requests.post(slackurl, headers=slackheaders, data=payload.encode('utf-8'))
            except Exception:
                logging.exception("Failed to process article %s", url)

def _now():
#     return datetime.now().timestamp()
    return datetime.utcnow().replace(tzinfo=pytz.utc).strftime('%Y-%m-%d %H:%M:%S')

Please feel free to share your thoughts and hit me up with any questions you might have.

