Applying Data Wrangling Skills

Data Wrangling, Day 2 (Tour de France edition)

Author

Madeline Gillman & Austin Daigle

Objectives of Data Wrangling: Class 2

  • Be able to apply the objectives covered in Data Wrangling: Class 1 to a new dataset

Case Study: Tour de France 🚴

Bienvenue en France! 🇫🇷 Today we are traveling across the Atlantic Ocean to conduct urgent data analysis for Team Visma-Lease A Bike (Team VLAB) for the Tour de France. The Tour de France is an annual multi-week bicycle race held primarily in France. The race comprises 21 stages held over 23 days. Each stage covers a different area of France and different terrain; and the course changes each year. There are 20 to 22 teams that participate each year and each team is made up of 8 riders.

The year is 2024 and this year’s tour, as in prior years, is a nail-biter between Team UAE’s Tadej Pogačar and Team VLAB’s Jonas Vingegaard. So far, 14 stages and nearly 1,300 miles have been completed and Jonas is trailing his rival Tadej by just one minute and fourteen seconds. Stage 15 is tomorrow, and Team VLAB wants to know if Jonas has a chance to make up time and win the Tour. To help determine race strategy, Team VLAB wants you to analyze historical trends.

PHOTO: CHRIS AULD/VELO

Part 2: Incorporating this year’s data

Team VLAB has handed you data with rankings though Stage 13 (remember, Stage 14 of 21 is tomorrow). The team is interested in knowing how Jonas’ speed this year compares to previous years.

Read in the dataset.

tdf_2024_df <- read_csv("https://raw.githubusercontent.com/How-to-Learn-to-Code/Rclass-DataScience/main/data/wrangling-files/tdf_2024_stage13_data.csv")
  1. Try using rbind(), bind_rows() , and merge() to join the new data with the rider history data from above. Which one(s), if any, work? Why do you think they do (or do not) work?
  2. If you need to make any modifications to the new data to enable a join, do so and then join the datasets. Make these changes in R, NOT Excel or a text editor. Doing so in R ensures that your changes are reproducible and that the original data is not destroyed or modified.
  3. Use this new dataset to add this year’s average speed to the same plots you made in Part 1.
  4. Write this new dataset to a new comma-separated file. See if you can write the file using one of these functions: write.csv(), write_csv(), and write.table(). Use the help pages for each of the functions to explore the difference between each. Which one will you probably use the most in your own analyses?
    1. Verify you can view this file using the file explorer on OnDemand, and download to your computer. If the file does not look like you expected, review the help pages for the function you used.
  5. Review your code from Parts 1 and 2. See if you can use pipes to connect as many steps as possible. As you do this, make sure you add comments to your code and consider if using too many pipes sacrifices code readability.

Part 3: Stage History

The stage history dataset has data from every stage of the Tour de France. For the below questions, use your data wrangling skills to find the answer.

  1. Use rename() to tidy the column names.
  2. Check for missing values in each column. Are all missing values coded the same? If not, figure out a way to fix this–you may want to use Google!
  3. Which city has had the most visits from the Tour de France? (This means you must consider both start and end cities)
  4. Who has won the most stages?
  5. Who has worn the yellow jersey for the most days in a single tour? What about the polka dot jersey? Green jersey?
  6. Identify the top 10 wearers of the yellow jersey, polka dot jersey, and green jersey. Make a bar plot with the name of the rider on the x-axis and number of stages/days worn on the y-axis. See if you can color the bars so that the colors match the jersey (i.e. the plot for yellow jersey has yellow bars).
    1. See if you can get these all plotted on the same plot. It might involve some pivoting!