This is an example of how I plot statistics using Plotly, pandas, and NumPy.

The data I’m plotting

I’m plotting a table that contains the evaluation of an LLM called Kosmos 2, which I used to find the bounding boxes of certain objects in computer-generated images. While computing statistics, I found a correlation between the average bounding-box dimensions of an object and the precision of the model: the model is more precise when the object is big, and much less precise when the object is small.

| entity_type | Average BBox Dimensions (All) | Percentage of Matches |
| --- | --- | --- |
| Total | 0.0222 | 22.2700 |
| Painting | 0.0355 | 40.3462 |
| Cell Phone | 0.0011 | 2.0701 |
| Remote Control | 0.0010 | 1.7375 |
| Book | 0.0043 | 8.8660 |
| Chair | 0.0196 | 25.0564 |
| Pen | 0.0004 | 0.0000 |
| Dining Table | 0.0549 | 58.6751 |
| Box | 0.0062 | 10.2804 |
| Key Chain | 0.0004 | 0.4831 |
| Counter Top | 0.1126 | 37.7451 |
| Bowl | 0.0033 | 5.6701 |
| Bottle | 0.0023 | 6.7708 |
| House Plant | 0.0172 | 46.3542 |
| Television | 0.0716 | 63.7838 |
| Statue | 0.0059 | 20.7101 |
| Plate | 0.0018 | 4.2424 |
| Sofa | 0.0772 | 65.8385 |
| Laptop | 0.0084 | 16.7702 |
| Fridge | 0.1582 | 66.2420 |
| Knife | 0.0018 | 2.6667 |
| Bed | 0.0869 | 52.4823 |
| Dresser | 0.0830 | 61.7021 |
| Wine Bottle | 0.0059 | 13.7681 |
| Garbage Can | 0.0217 | 66.6667 |
| Fork | 0.0007 | 0.0000 |
| Spoon | 0.0004 | 0.0000 |
| Pillow | 0.0054 | 15.2000 |
| Mug | 0.0019 | 4.2017 |
| Arm Chair | 0.0419 | 54.3860 |
| Bread | 0.0029 | 7.8947 |
| Spray Bottle | 0.0027 | 15.0442 |
| Vase | 0.0031 | 8.1818 |
| Soap Bottle | 0.0030 | 6.6038 |
| Spatula | 0.0020 | 0.0000 |
| Pencil | 0.0003 | 1.0989 |
| Toaster | 0.0060 | 12.3596 |
| Shelving Unit | 0.1030 | 61.3636 |
| Toilet | 0.0598 | 66.2791 |
| Kettle | 0.0044 | 5.9524 |
| TV Stand | 0.0538 | 35.7143 |
| Butter Knife | 0.0002 | 0.0000 |
| Newspaper | 0.0024 | 5.2632 |
| Apple | 0.0019 | 2.6667 |
| Cup | 0.0017 | 4.1096 |
| Washing Machine | 0.0764 | 69.8630 |
| Side Table | 0.0327 | 40.2778 |
| Candle | 0.0008 | 1.4286 |
| Sink | 0.0604 | 71.8750 |
| Floor Lamp | 0.0830 | 59.3750 |
| Credit Card | 0.0003 | 0.0000 |
| Pepper Shaker | 0.0019 | 0.0000 |
| Potato | 0.0021 | 5.2632 |
| Salt Shaker | 0.0005 | 0.0000 |
| Tomato | 0.0008 | 8.9286 |
| Stool | 0.0181 | 56.3636 |
| Pan | 0.0048 | 7.4074 |
| Garbage Bag | 0.0167 | 68.5185 |
| Faucet | 0.0120 | 11.1111 |
| Dish Sponge | 0.0006 | 0.0000 |
| Lettuce | 0.0019 | 11.7647 |
| Microwave | 0.0133 | 13.0435 |
| Toilet Paper | 0.0013 | 4.3478 |
| Watch | 0.0003 | 0.0000 |
| Teddy Bear | 0.0088 | 48.8372 |
| Paper Towel Roll | 0.0025 | 7.8947 |
| Desk Lamp | 0.0056 | 8.1081 |
| Plunger | 0.0061 | 24.3243 |
| Basket Ball | 0.0038 | 31.4286 |
| Pot | 0.0049 | 8.5714 |
| Dog Bed | 0.0273 | 44.1176 |
| Ladle | 0.0013 | 2.9412 |
| Baseball Bat | 0.0065 | 15.1515 |
| Cart | 0.0559 | 46.8750 |
| Tissue Box | 0.0026 | 3.8462 |
| Egg | 0.0005 | 0.0000 |
| Alarm Clock | 0.0046 | 18.1818 |
| Desk | 0.0876 | 47.0588 |
| Coffee Machine | 0.0151 | 35.7143 |
| Soap Bar | 0.0005 | 0.0000 |
| Tennis Racket | 0.0052 | 9.0909 |
| Safe | 0.0094 | 36.3636 |
| Cloth | 0.0026 | 0.0000 |
| Laundry Hamper | 0.0352 | 44.4444 |
| Vacuum Cleaner | 0.0250 | 57.1429 |
| Boots | 0.0011 | 0.0000 |
| Desktop | 0.0316 | 0.0000 |
| Room Decor | 0.0130 | 0.0000 |
| Table Top Decor | 0.0027 | 0.0000 |
| Ottoman | 0.1021 | 100.0000 |

So I plotted it to find out whether there actually was a correlation between those two values.

plot 1.html

And there clearly is. For a more in-depth analysis of this correlation, see "Kosmos 2 lavoro svolto finora" (Kosmos 2: work done so far).
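To put a rough number on what the plot shows, here is a minimal sketch that computes two correlation coefficients on a handful of rows copied from the table above. It is only an illustration on a small subset, not the full analysis; the short column names `avg_bbox` and `pct_matches` are made up for this example.

```python
import numpy as np
import pandas as pd

# A handful of rows copied from the table above (not the full dataset).
# The short column names are hypothetical, chosen for this sketch only.
sample = pd.DataFrame(
    {
        "entity_type": ["Pen", "Cell Phone", "Chair", "Painting",
                        "Dining Table", "Television", "Sofa", "Fridge"],
        "avg_bbox": [0.0004, 0.0011, 0.0196, 0.0355,
                     0.0549, 0.0716, 0.0772, 0.1582],
        "pct_matches": [0.0, 2.0701, 25.0564, 40.3462,
                        58.6751, 63.7838, 65.8385, 66.2420],
    }
)

# Pearson correlation: strength of the linear relationship.
pearson_r = np.corrcoef(sample["avg_bbox"], sample["pct_matches"])[0, 1]

# Spearman correlation: Pearson correlation computed on the ranks,
# which only asks whether the relationship is monotonic.
spearman_r = np.corrcoef(
    sample["avg_bbox"].rank(), sample["pct_matches"].rank()
)[0, 1]

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

Both coefficients come out strongly positive on this subset, which matches the visual impression from the plot. Running the same two lines on the full DataFrame would give the figure for all entity types.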

The code

First of all, we read the CSV file and exclude the total from the plot: we want to know the correlation across entity types, so we don’t really care about the total.

```python
import pandas as pd
import plotly.graph_objects as go
import numpy as np

# Read the CSV file into a DataFrame
df = pd.read_csv("entity_statistics_with_std_rounded.csv")

# Exclude the 'Total' row from the DataFrame
df = df[df['entity_type'] != 'Total']
```
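If you want to try the filtering step without the real CSV file, the following sketch does the same thing on three rows from the table above, read from an in-memory string. The `csv_text` and `per_entity` names are only for this example.

```python
import io

import pandas as pd

# Three rows from the table above, standing in for the real CSV file.
csv_text = """entity_type,Average BBox Dimensions (All),Percentage of Matches
Total,0.0222,22.2700
Painting,0.0355,40.3462
Cell Phone,0.0011,2.0701
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep every row except the aggregate 'Total' row, then renumber the index.
per_entity = df[df["entity_type"] != "Total"].reset_index(drop=True)

print(per_entity)
```

The boolean mask `df["entity_type"] != "Total"` is `True` for every per-entity row, so indexing with it drops only the aggregate row.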

Now we extract the columns required to compute these statistics: Number of Occurrences, Average BBox Dimensions (All), and Percentage of Matches.

```python
# Extract required columns
percentage_matches = df['Percentage of Matches']
avg_bbox_dimensions = df['Average BBox Dimensions (All)']
num_occurrences = df['Number of Occurrences']
```

Now, one cool thing I did was to scale the color of the points: a green dot means the entity has a lot of occurrences, so the dot can be “trusted”; a red dot means the opposite. If I just used a raw scale, where dark green is an entity that occurred 700 times (the maximum), there would be very few green dots, since only a handful of entities occur around 700 times, and many red dots. So the gradient needs to be centered on the mean value of “Number of Occurrences”.

```python
# Calculate the mean of the values of "Number of Occurrences"
mean_num_occurrences = num_occurrences.mean()

# Create normalized values based on the mean
normalized_values = 0.5 + (num_occurrences - mean_num_occurrences) / (2 * mean_num_occurrences)
normalized_values = np.clip(normalized_values, 0, 1)  # Clip values to [0, 1] range
```

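This gives us values in the range [0, 1] that can be used as the color parameter in a Plotly figure: an entity with zero occurrences maps to 0, one with the mean number of occurrences maps to 0.5, and anything at or above twice the mean is clipped to 1.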
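To see what the normalization does to concrete numbers, here is a small worked sketch. The occurrence counts are hypothetical; only the maximum of 700 comes from the text. It also checks that the formula is algebraically the same as dividing by twice the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical occurrence counts; only the maximum of 700 comes from the text.
num_occurrences = pd.Series([0, 50, 100, 200, 700], dtype=float)
mean_num_occurrences = num_occurrences.mean()  # 210.0 for these values

# Same formula as in the snippet above.
normalized_values = np.clip(
    0.5 + (num_occurrences - mean_num_occurrences) / (2 * mean_num_occurrences),
    0,
    1,
)

# Algebraically, 0.5 + (n - mean) / (2 * mean) simplifies to n / (2 * mean).
simplified = np.clip(num_occurrences / (2 * mean_num_occurrences), 0, 1)

print(pd.DataFrame({
    "occurrences": num_occurrences,
    "normalized": normalized_values.round(3),
}))
```

With these counts the entity at 700 occurrences is well above twice the mean (420), so it is clipped to 1, while the entities near the mean land around the middle of the scale. That is exactly the effect described above: the gradient is spent on the region where most entities actually sit.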

```python
# Create scatter plot using Plotly
fig = go.Figure(data=go.Scatter(
    x=avg_bbox_dimensions,
    y=percentage_matches,
    mode='markers',
    marker=dict(
        color=normalized_values,
        colorscale='RdYlGn',  # Red-Yellow-Green colormap
        line_width=1
    )
))
```

The color scale is red-yellow-green and is driven by the normalized values. Finally, we set the layout and export the plot as an HTML file:

```python
# Update layout
fig.update_layout(
    title='Correlation between BBox Dimensions and Percentage of Matches',
    xaxis_title='Average BBox Dimensions (All)',
    yaxis_title='Percentage of Matches',
    plot_bgcolor='rgba(0,0,0,0)'
)

# Export the plot to an HTML file
fig.write_html("plot.html")
```

This is the whole file:

```python
import pandas as pd
import plotly.graph_objects as go
import numpy as np

# Read the CSV file into a DataFrame
df = pd.read_csv("entity_statistics_with_std_rounded.csv")

# Exclude the 'Total' row from the DataFrame
df = df[df['entity_type'] != 'Total']

# Extract required columns
percentage_matches = df['Percentage of Matches']
avg_bbox_dimensions = df['Average BBox Dimensions (All)']
num_occurrences = df['Number of Occurrences']

# Calculate the mean of the values of "Number of Occurrences"
mean_num_occurrences = num_occurrences.mean()

# Create normalized values based on the mean
normalized_values = 0.5 + (num_occurrences - mean_num_occurrences) / (2 * mean_num_occurrences)
normalized_values = np.clip(normalized_values, 0, 1)  # Clip values to [0, 1] range

# Create scatter plot using Plotly
fig = go.Figure(data=go.Scatter(
    x=avg_bbox_dimensions,
    y=percentage_matches,
    mode='markers',
    marker=dict(
        color=normalized_values,
        colorscale='RdYlGn',  # Red-Yellow-Green colormap
        line_width=1
    )
))

# Update layout
fig.update_layout(
    title='Correlation between BBox Dimensions and Percentage of Matches',
    xaxis_title='Average BBox Dimensions (All)',
    yaxis_title='Percentage of Matches',
    plot_bgcolor='rgba(0,0,0,0)'
)

# Export the plot to an HTML file
fig.write_html("plot.html")
```