The Origins of Formula 1: (Part 1) Webscraping

Formula 1 data

In the modern era of the sport each car can have between 150 and 300 sensors, generating an approximate of 300gb of data per car per grand prix weekend. This is of great use for the engineers who can make data-driven decisions in order to extract the maximum performance out of the car on each event; however, this wasn’t always the case. In the early days of the sport it was technologically impossible to generate or store these amounts of data.
Most of this information is publicly available in tidy format in some websites, APIs and python packages, such as, the fastf1 python library. Early formula 1 data is also available on the FIA’s website, but mostly in the form of html tables. Therefore, if we want to analyze this data we will have to scrape it.

Webscrapping code

We will import the tidyverse library, which is a collection of packages very useful for data wrangling. The main package we’ll use for webscrapping is the rvest package, we will also need the stringr package, which allows us to manipulate more easily string-type variables.

  
library(tidyverse)
library(rvest)

We will store our main site links from where the scrapping will start.

  
link_fia <- "https://fiaresultsandstatistics.motorsportstats.com"
link_seasons <- "https://fiaresultsandstatistics.motorsportstats.com/series/formula-one/season/"

If we add a specif year to the link_seasons string, it will take us to the main page of that F1 season. The read_html, extracts all html code from the site, from there we can retrieve specific nodes. Let’s go to the link of the first Formula 1 grand prix ever.

  
link_british_1950 <- "https://fiaresultsandstatistics.motorsportstats.com/results/1950-british-grand-prix/classification"
british_1950_page <- read_html(link_british_1950)
british_1950_race <- british_1950_page |> 
  html_table()

british_1950_race

## [[1]]
## # A tibble: 21 × 12
##    Pos      No Driver   Nat   Team   Laps Time  Gap   Interval   Kph Best    Lap
##    <chr> <int> <chr>    <chr> <chr> <int> <chr> <chr> <chr>    <dbl> <chr> <int>
##  1 1         2 Nino Fa… Italy Alfa…    70 2:13… ""    ""        146. "1:5…     2
##  2 2         3 Luigi F… Italy Alfa…    70 2:13… "+ 2… "+ 2.60…  146. ""       NA
##  3 3         4 Reg Par… Grea… Alfa…    70 2:14… "+ 5… "+ 49.4…  145. ""       NA
##  4 4        14 Yves Gi… Fran… Auto…    68 2:13… "2 L… "2 Laps"  142. "1:5…    41
##  5 5        15 Louis R… Fran… Auto…    68 2:14… "2 L… "+ 1:03…  141. "1:5…    10
##  6 6        12 Bob Ger… Grea… Bob …    67 2:13… "3 L… "1 Lap"   140. ""       NA
##  7 7        11 Cuth Ha… Grea… Cuth…    67 2:13… "3 L… "+ 0.40…  140. ""       NA
##  8 8        16 Philipp… Fran… Auto…    65 2:14… "5 L… "2 Laps"  135. "1:5…    31
##  9 9         6 David H… Grea… Scud…    64 2:14… "6 L… "1 Lap"   133. ""       NA
## 10 10       10 Joe Fry… Grea… Joe …    64 2:15… "6 L… "+ 56.8…  132. ""       NA
## # ℹ 11 more rows

Here the html_table function retrieves a list of tibbles of all html tables on the page, fortunately, in this case there was only table. The tibble consists of the position a certain driver finished, the laps ran, the total race time, average speed, and their fastest lap of the race and in which lap was completed, as well as, the drivers’ team and nationality. Let’s look at a screenshot of the webpage for comparison.

If we navigate to the site, we can see that there’s a menu right above the race positions table, which has the buttons: “Qualifying”, “Grid” and “Race”, these buttons have hyperlinks which will take you to another page. We already have the race information, but we are also interested in the qualifying data, we know that in html the hyperlinks are stored in “a” nodes, so we can use the following code to retrieve the link to the qualifying page.

  
href_qual <- british_1950_page |> 
  html_node(xpath = "//a[text() = 'Qualifying']") |> 
  html_attr("href")

british_1950_qual_link <- paste0(link_fia, href_qual)

british_1950_qual_link

## [1] "https://fiaresultsandstatistics.motorsportstats.com/results/1950-british-grand-prix/classification/310d5889-6079-4841-a8df-cb6d92de54d1"

Here we are asking R to retrieve the “a” html node that contains the text “Qualifying”, and get the hyperlink of that node, then we paste that to the url for our main site. We can use the same function as with race data, to get a tibble of qualifying positions.

  
british_1950_qual <- read_html(british_1950_qual_link) |> 
  html_table()

british_1950_qual

## [[1]]
## # A tibble: 21 × 12
##      Pos    No Driver   Nat   Team  Laps  Time  Gap   Interval   Kph Best  Lap  
##    <int> <int> <chr>    <chr> <chr> <lgl> <chr> <chr> <chr>    <dbl> <lgl> <lgl>
##  1     1     2 Nino Fa… Italy Alfa… NA    1:50… ""    ""        151. NA    NA   
##  2     2     3 Luigi F… Italy Alfa… NA    1:51… "+ 0… "+ 0.20…  151. NA    NA   
##  3     3     1 Juan Ma… Arge… Alfa… NA    1:51… "+ 0… ""        151. NA    NA   
##  4     4     4 Reg Par… Grea… Alfa… NA    1:52… "+ 1… "+ 1.20…  149. NA    NA   
##  5     5    21 Prince … Thai… Enri… NA    1:52… "+ 1… "+ 0.40…  149. NA    NA   
##  6     6    14 Yves Gi… Fran… Auto… NA    1:53… "+ 2… "+ 0.80…  148. NA    NA   
##  7     7    17 Eugene … Fran… Auto… NA    1:55… "+ 4… "+ 2.00…  145. NA    NA   
##  8     8    20 Emmanue… Swit… Enri… NA    1:55… "+ 5… "+ 0.40…  145. NA    NA   
##  9     9    15 Louis R… Fran… Auto… NA    1:56… "+ 5… "+ 0.20…  144. NA    NA   
## 10    10     9 Peter W… Grea… Pete… NA    1:56… "+ 5… "+ 0.60…  144. NA    NA   
## # ℹ 11 more rows

Above the Qualifying, Grid and Race menu there’s another menu with buttons: “Event Info”, “Classification”, “Session Facts” and “Standings”. The race and qualifying data are under the classification page; however, under event info. there’s information about the engine and car each driver was running, which might be important in other analysis. We can use the same trick to get the link to this page, and then retrieve the table.

  
href_event <- british_1950_page |> 
  html_nodes(xpath = "//a[text() = 'Event Info']") |> 
  html_attr("href")

british_1950_event <- read_html(paste0(link_fia, href_event)) |> 
  html_table()

british_1950_event

## [[1]]
## # A tibble: 23 × 6
##      `#` Driver             Nat           Team                Car         Engine
##    <int> <chr>              <chr>         <chr>               <chr>       <chr> 
##  1     1 Juan Manuel Fangio Argentina     Alfa Romeo SpA      Alfa Romeo… Alfa …
##  2     2 Nino Farina        Italy         Alfa Romeo SpA      Alfa Romeo… Alfa …
##  3     3 Luigi Fagioli      Italy         Alfa Romeo SpA      Alfa Romeo… Alfa …
##  4     4 Reg Parnell        Great Britain Alfa Romeo SpA      Alfa Romeo… Alfa …
##  5     5 David Murray       Great Britain Scuderia Ambrosiana Maserati 4… Maser…
##  6     6 David Hampshire    Great Britain Scuderia Ambrosiana Maserati 4… Maser…
##  7     8 Leslie Johnson     Great Britain T A S O Mathieson   ERA E       ERA   
##  8     9 Peter Walker       Great Britain Peter Walker        ERA E       ERA   
##  9     9 Tony Rolt          Great Britain Peter Walker        ERA E       ERA   
## 10    10 Joe Fry            Great Britain Joe Fry             Maserati 4… Maser…
## # ℹ 13 more rows

Extracting lap by lap positions

Under the “Session Facts” section there’s a lap by lap positions chart, which might be useful to retrieve. The horizontal axis of this chart are the race positions, while the vertical axis has the lap numbers, while the cell value is the driver’s number.

  
href <- british_1950_page |> 
  html_nodes(xpath = "//a[text() = 'Session Facts']") |> 
  html_attr("href")

href2 <- read_html(paste0(link_fia, href)) |> 
  html_nodes(xpath = "//a[text() = 'Lap Chart']") |> 
  html_attr("href")

lap_chart_link <- paste0(link_fia, href2)
lap_chart_link

## [1] "https://fiaresultsandstatistics.motorsportstats.com/results/1950-british-grand-prix/session-facts/4817f4b9-6d84-4ea2-9c5b-ff9dcff6f2e7?fact=LapChart"

  
read_html(lap_chart_link) |> 
  html_table()

## [[1]]
## # A tibble: 71 × 1
##    Pos. 
##    <chr>
##  1 Grid 
##  2 Lap 1
##  3 Lap 2
##  4 Lap 3
##  5 Lap 4
##  6 Lap 5
##  7 Lap 6
##  8 Lap 7
##  9 Lap 8
## 10 Lap 9
## # ℹ 61 more rows

If we go to the site we can see that there’s a lap by lap positions table, but this can’t be retrieved using the html_table function, so we will have to go a little deeper into the html code. Using inspect mode in the web browser we see that the elements of the chart are under div nodes with a specif class:

  
lap_chart_page <- read_html(lap_chart_link) 

div_nodes <- lap_chart_page |> 
  html_nodes(xpath = "//div[@class='_1BvfV']")

div_nodes

## {xml_nodeset (1491)}
##  [1] <div class="_1BvfV">2</div>\n
##  [2] <div class="_1BvfV">2</div>\n
##  [3] <div class="_1BvfV">2</div>\n
##  [4] <div class="_1BvfV">2</div>\n
##  [5] <div class="_1BvfV">2</div>\n
##  [6] <div class="_1BvfV">2</div>\n
##  [7] <div class="_1BvfV">2</div>\n
##  [8] <div class="_1BvfV">2</div>\n
##  [9] <div class="_1BvfV">2</div>\n
## [10] <div class="_1BvfV">2</div>\n
## [11] <div class="_1BvfV">3</div>\n
## [12] <div class="_1BvfV">3</div>\n
## [13] <div class="_1BvfV">3</div>\n
## [14] <div class="_1BvfV">3</div>\n
## [15] <div class="_1BvfV">3</div>\n
## [16] <div class="_1BvfV">1</div>\n
## [17] <div class="_1BvfV">2</div>\n
## [18] <div class="_1BvfV">2</div>\n
## [19] <div class="_1BvfV">2</div>\n
## [20] <div class="_1BvfV">2</div>\n
## ...

If this was a regular square chart, we could just use the html_text function to retrieve a string-type array of the text inside the div nodes and then convert it to matrix; however, if we take a look at the chart on the site we can see that it has an irregular form since some drivers retire from the race at different stages. Therefore, the number of columns (positions in the race) is constantly shrinking. This leaves a lot empty cells in the table, i.e div nodes with no text attribute, so they can’t be retrieve with the html_text function.
To work around this issue we will have to consider each div node as a character variable, from the above output we can see that all nodes follow the same pattern. We will use the regexpr function to extract the values between <div> and <\div>, if there’s nothing between them, we can just store an empty string.

  
div_char <- as.character(div_nodes)
my_numbers <- character(length(div_char))

for (k in seq_along(div_char)){
  match <- regexpr("(?<=<div class=\"_1BvfV\">)\\d*(?=</div>\\n)", div_char[k], perl = TRUE)
    if (match > 0) { # If there's something there store it in the vector
      my_numbers[k] <- substr(div_char[k], match, match + attr(match, "match.length") - 1)
    } else { # If there isn't anything there store an empty value
      my_numbers[k] <- " "
    }
}
head(my_numbers, 20)

##  [1] "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "3" "3" "3" "3" "3" "1" "2" "2" "2"
## [20] "2"

The next step is to build a tibble from this array, for that we need to retrieve the number of drivers who started the race, which is the same as the number of columns of our chart. We use the inspect mode again to retrieve the div class in which the column names are stored. We extract the text under these nodes, and pull the last value.

  
total_drivers <- read_html(lap_chart_link) |> 
      html_nodes(xpath = "//div[@class='_3DVzL']") |> 
      html_text() |> tail(1) |> as.numeric()
# Make the character vector a matrix of total_drivers columns, just realize that not all charts 
lap_chart <- as_tibble(matrix(my_numbers, ncol = total_drivers)) 

lap_chart

## # A tibble: 71 × 21
##    V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   V11   V12   V13  
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 2     3     1     4     21    14    17    20    15    19    8     12    16   
##  2 2     3     1     4     21    20    17    14    15    16    12    11    6    
##  3 2     3     1     4     21    20    17    14    15    16    12    11    6    
##  4 2     3     1     4     21    17    20    14    15    16    12    11    6    
##  5 2     3     1     4     21    17    20    14    15    16    12    11    6    
##  6 2     3     1     4     21    17    20    14    15    16    12    11    6    
##  7 2     3     1     4     21    17    20    14    15    16    12    11    6    
##  8 2     3     1     4     21    17    20    14    15    16    12    11    6    
##  9 2     3     1     4     21    17    20    14    15    16    12    11    6    
## 10 2     3     1     4     21    20    14    15    16    12    11    6     5    
## # ℹ 61 more rows
## # ℹ 8 more variables: V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>, V18 <chr>,
## #   V19 <chr>, V20 <chr>, V21 <chr>

Thank you for reading this webscrapping exercise, hope you find it useful. If you want to see the full code and subsequent analysis check out my github repo.

The Origins of Formula 1: (Part 1) Webscraping

Formula 1 data

Webscrapping code

Extracting lap by lap positions

Further Reading

The Origins of Formula 1: (Part 2) Bayesian Analysis

Time Series Analysis of the Covid-19 pandemic in Mexico

Fine-tune Sentence Transformers Models for Science Exam Questions