Saving Realtime Transit Data to a DataFrame

In the post Getting Realtime Transit Data from the STM API using Python, I looked at getting realtime transit data and printing it to a notebook cell.

Here, I'll instead save it to a Polars DataFrame so it's easy to analyze. I like working with Polars because it has a really intuitive API.

Set up

Install polars into the same environment used in the Getting Realtime Transit Data post:

pip install polars

Notebook setup

In the first cell of the notebook, import the libraries we'll use:

import requests
import gtfs_realtime_pb2
import polars as pl

The first two here are the same as in the "Getting Realtime Transit Data" post.

Getting the data

The following code comes from the earlier post, with two small changes: it's now wrapped in a function, realtime, and instead of hardcoding the API key, the function takes an api_key parameter. This makes the code easier to work with in the notebook, and we can call it multiple times to get realtime data for multiple points in time.

def realtime(api_key):
    url = "https://api.stm.info/pub/od/gtfs-rt/ic/v2/vehiclePositions"
    headers = {
        "accept": "application/x-protobuf",
        "apiKey": api_key,
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on a bad key or network error

    # Parse the protobuf payload into a FeedMessage
    message = gtfs_realtime_pb2.FeedMessage()
    message.ParseFromString(response.content)

Processing to a DataFrame

Now message contains our realtime data, and we can extract the fields we want from it. We'll represent each returned entity as a dictionary and store each of those dictionaries in a list, giving us something like this:

[
    {'trip_id': '123', 'route_id': '45', 'longitude': ...},
    {'trip_id': '456', 'route_id': '29', 'longitude': ...},
    ...
]

Once we have this list of dictionaries, with each dictionary representing one entity, we can convert it to a DataFrame in a single call.

Here's what the code will look like:

    ...
    # Create a list to store each entity in
    data = []
    
    # Get the timestamp from the message header
    header_timestamp = message.header.timestamp
    
    # Loop through the entities
    for entity in message.entity:
    
        # Create an empty dict to store the entity information
        entity_data = {}
        
        # Extract all the relevant fields and add them to the empty dict
        entity_data['header_timestamp'] = header_timestamp
        entity_data['entity_id'] = entity.id
     
        trip = entity.vehicle.trip
        entity_data['trip_id'] = trip.trip_id
        entity_data['start_time'] = trip.start_time
        entity_data['start_date'] = trip.start_date
        entity_data['route_id'] = trip.route_id
        
        position = entity.vehicle.position
        entity_data['latitude'] = position.latitude
        entity_data['longitude'] = position.longitude
        entity_data['bearing'] = position.bearing
        entity_data['speed'] = position.speed
        
        entity_data['current_stop_sequence'] = entity.vehicle.current_stop_sequence
        entity_data['current_status'] = entity.vehicle.current_status
        entity_data['timestamp'] = entity.vehicle.timestamp
        
        vehicle = entity.vehicle.vehicle
        entity_data['vehicle_id'] = vehicle.id
        
        entity_data['occupancy_status'] = entity.vehicle.occupancy_status
        
        # Add this record to the list
        data.append(entity_data)


    # Convert the list of dicts to a polars DataFrame
    df = pl.DataFrame(data)

    return df

Now calling the function with an API key:

df = realtime("<api_key>")

We get a polars DataFrame object back.

We can now start exploring the data. For example, we can filter for the buses currently running on a particular route. Here, I use route 45...because it's the best in Montreal.

df.filter(pl.col("route_id")=="45")

In a future post, I'll look at exploring the data in more detail.