Trying out Code Interpreter for ChatGPT

Today I had the chance to experiment with an early alpha release of the Code Interpreter plugin for ChatGPT by OpenAI. Code Interpreter is “an experimental model that can use Python, and that handles uploads and downloads.” You can find information about this plugin here, and you can add your name to the waitlist to get access to this plugin here.

To take this model on a test flight, I used the Air Traffic Passenger Statistics CSV file on data.gov that shows monthly passenger counts for airlines going into and out of San Francisco International Airport from July 2005 through the end of last year.

As shown in the video above and the screenshots and code below, I experimented with Code Interpreter to see what it could do when I uploaded this 50K-row spreadsheet, asked questions about the data, and had it analyze the data and create charts for me. In a nutshell, I found its capabilities to be totally mind-blowing and groundbreaking. Never in my career have I been able to interact with a tool in such a conversational way to analyze and visualize data using code like this.

The kicker, to me, was when it reasoned what must’ve caused the drop in passenger counts in 2020. The data set itself includes nothing whatsoever about COVID-19. But ChatGPT with Code Interpreter knew that COVID-19 was the reason for the huge drop in passenger count that year. Of course, this event took place before September 2021, the upper time bound of its training data (largely internet sites). But one can imagine how OpenAI might combine the Code Interpreter plugin of ChatGPT with the internet browsing plugin to come up with plausible explanations for signals it finds in the data.
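
A sharp one-year drop like this is the kind of signal a year-over-year comparison surfaces. Here is a minimal sketch of that calculation, not the code Code Interpreter produced. The yearly totals are made-up round numbers for illustration, not SFO figures, and the names `yearly_totals` and `yoy_change` are hypothetical.

```python
import pandas as pd

# Toy yearly totals (illustrative values only, NOT real SFO passenger counts)
yearly_totals = pd.Series(
    {2017: 100, 2018: 104, 2019: 106, 2020: 30, 2021: 45},
    name="Passenger Count",
)

# Percent change relative to the previous year (the first year has no prior year, so NaN)
yoy_change = yearly_totals.pct_change() * 100

# The year with the steepest decline (idxmin skips the NaN entry)
worst_year = yoy_change.idxmin()
print(f"Largest year-over-year drop: {worst_year} ({yoy_change.loc[worst_year]:.1f}%)")
```

A calculation like this only locates the drop. Attributing it to COVID-19 requires knowledge from outside the data, which is what made that step stand out to me.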

In this scenario, the data itself is fairly clean, and the analysis itself is fairly basic. But Code Interpreter took my questions about the data – both very broad questions and very narrow ones – and it successfully generated Python code that allowed me to learn a great deal about the data set in a short amount of time. And attempts to find inaccuracies by cross-checking the numbers using Tableau came up empty. Code Interpreter’s output was accurate.
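
The cross-check described above was done in Tableau. For readers without it, one alternative is to recompute a figure in plain Python, without pandas, so that the check does not share code with the thing being checked. This is a rough sketch under stated assumptions: the rows are toy values standing in for the real file, and the airline names are placeholders. The column names match those used in the generated code further down.

```python
import csv
import io
from collections import defaultdict

# Toy rows standing in for the real CSV (illustrative values, not SFO data)
sample_csv = io.StringIO(
    "Operating Airline,Passenger Count\n"
    "Airline A,100\n"
    "Airline B,250\n"
    "Airline A,50\n"
)

# Sum passenger counts per airline using only the standard library
totals = defaultdict(int)
for row in csv.DictReader(sample_csv):
    totals[row["Operating Airline"]] += int(row["Passenger Count"])

print(dict(totals))
```

With the real file, replacing `sample_csv` with an opened file handle would give totals to compare against the numbers in the generated charts.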

This does not mean that it’s a perfect tool by any means, and fact-checking will be critical for the foreseeable future. But it didn’t seem to exhibit the kinds of hallucinations seen in earlier attempts at data analysis using ChatGPT. LLMs may not be great at math, but they are great at language, and code is a form of language. So they can create code to analyze your data. Upload your data, submit questions in the form of prompts, and then watch it generate code that answers your questions.

Amazing, terrifying.

The following screenshots show my interaction with ChatGPT’s Code Interpreter plugin. The code snippets below the images show the Python code that Code Interpreter generated to answer the prompts shown in the images directly above them. This code is what you would see if you opened the “Show work” dropdown in each prompt.
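
The snippets all work on a DataFrame named `air_traffic_df`, which was created in a step not reproduced here. To run them yourself, something along these lines would be needed first. This is a sketch under assumptions: the rows are toy values so the example runs on its own, the `Activity Period` column name and its YYYYMM format are assumed, and the airline and region values are placeholders. The other column names are taken from the generated code.

```python
import io

import pandas as pd

# Stand-in for the uploaded file so this sketch runs on its own.
# With the real download, pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "Activity Period,Operating Airline,GEO Region,Activity Type Code,Passenger Count\n"
    "200507,Airline A,US,Deplaned,100\n"
    "200507,Airline B,Asia,Enplaned,250\n"
    "200508,Airline A,US,Enplaned,50\n"
)

air_traffic_df = pd.read_csv(sample_csv)

# Quick sanity checks before any analysis
print(air_traffic_df.shape)   # (rows, columns)
print(air_traffic_df.dtypes)  # confirm Passenger Count was parsed as a number
```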

import matplotlib.pyplot as plt

# Descriptive statistics of the data
desc_stats = air_traffic_df.describe()

# Visualizations

# Bar chart showing the number of passengers per operating airline (top 10)
top_operating_airlines = air_traffic_df.groupby('Operating Airline')['Passenger Count'].sum().nlargest(10)
top_operating_airlines.plot(kind='bar', figsize=(10, 6))
plt.title('Top 10 Operating Airlines by Passenger Count')
plt.xlabel('Operating Airline')
plt.ylabel('Passenger Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Bar chart showing the number of passengers per GEO Region (top 10)
top_geo_regions = air_traffic_df.groupby('GEO Region')['Passenger Count'].sum().nlargest(10)
top_geo_regions.plot(kind='bar', figsize=(10, 6), color='orange')
plt.title('Top 10 GEO Regions by Passenger Count')
plt.xlabel('GEO Region')
plt.ylabel('Passenger Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Pie chart showing the distribution of Activity Type Code
activity_type_counts = air_traffic_df['Activity Type Code'].value_counts()
activity_type_counts.plot(kind='pie', figsize=(8, 8), autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Activity Type Code')
plt.ylabel('')  # Hide ylabel
plt.tight_layout()
plt.show()

# Output the descriptive statistics
desc_stats

# Combine the two levels of operating airline that both include "United Airlines"
air_traffic_df['Operating Airline'] = air_traffic_df['Operating Airline'].apply(lambda x: 'United Airlines' if 'United Airlines' in x else x)

# Recalculate the top operating airlines with the combined "United Airlines" level
top_operating_airlines_combined = air_traffic_df.groupby('Operating Airline')['Passenger Count'].sum().nlargest(10)

# Plot the modified bar chart showing the number of passengers per operating airline (top 10) with the combined "United Airlines" level
ax = top_operating_airlines_combined.plot(kind='barh', figsize=(10, 6), color='gray')
plt.title('Top 10 Operating Airlines by Passenger Count (Combined United Airlines)')
plt.xlabel('Passenger Count')
plt.ylabel('Operating Airline')

# Adding data labels with thousands separator to the end of each bar
for i, v in enumerate(top_operating_airlines_combined.values):
    ax.text(v + 5000, i, f'{v:,}', color='black', fontsize=10, verticalalignment='center')

plt.tight_layout()
plt.gca().invert_yaxis()  # Invert the y-axis to display the top airline at the top
plt.show()
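
The PDF snippet that follows plots a variable named `passenger_counts_by_period`, which is not defined in any of the snippets shown here. One plausible way to build it is to total the passenger counts for each activity period. This is a sketch, not the code Code Interpreter actually ran: it assumes a column named `Activity Period` holding YYYYMM integers, and the rows are toy values.

```python
import pandas as pd

# Toy rows (illustrative values only); 'Activity Period' as YYYYMM is an assumption
air_traffic_df = pd.DataFrame({
    "Activity Period": [200507, 200507, 200508, 200508],
    "Passenger Count": [100, 250, 50, 75],
})

# Parse YYYYMM into real dates so the x-axis is spaced by time
period = pd.to_datetime(air_traffic_df["Activity Period"].astype(str), format="%Y%m")

# Total passengers per month, indexed by date
passenger_counts_by_period = (
    air_traffic_df.groupby(period)["Passenger Count"].sum().sort_index()
)
print(passenger_counts_by_period)
```

The result is a Series, so its `.index` and `.values` can be passed straight to `plt.plot` as the snippet does.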

from matplotlib.backends.backend_pdf import PdfPages

# Create a PDF file to save the chart
with PdfPages('/mnt/data/Passenger_Counts_Over_Time.pdf') as pdf:
    # Create the time series plot
    plt.figure(figsize=(12, 6))
    plt.plot(passenger_counts_by_period.index, passenger_counts_by_period.values, color='blue')
    plt.title('Passenger Counts Over Time')
    plt.xlabel('Activity Period')
    plt.ylabel('Passenger Count')
    plt.xticks(rotation=45, ha='right')
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tight_layout()
    
    # Add title and description of the key insight
    title_text = 'Passenger Counts Over Time\n'
    description_text = ('This time series plot shows the trend and fluctuations in passenger counts over time. '
                        'There are observable seasonal patterns in passenger traffic. A significant drop in passenger '
                        'counts is evident in 2020, likely attributed to the impact of the COVID-19 pandemic.')
    plt.suptitle(title_text, fontsize=14, fontweight='bold', y=1.05)
    plt.figtext(0.5, -0.1, description_text, wrap=True, horizontalalignment='center', fontsize=12)
    
    # Save the chart to the PDF file
    pdf.savefig(bbox_inches='tight')
    
    # Close the plot
    plt.close()

# Provide the link to download the PDF file
pdf_link = '/mnt/data/Passenger_Counts_Over_Time.pdf'
pdf_link
