Practicing on Real-World Data

Project Management, Day 2

Authors

Lorrie He, Matthew Sutcliffe

Final class!

In the previous class, you completed the following steps:

  1. Created a GitHub account.

  2. Linked your GitHub account with your Longleaf account.

  3. Created and pushed your first repository to GitHub.

Today, now that you’ve been through the entire course, we want to reinforce some of the skills that you’ve learned by having you apply your knowledge to several existing datasets and go through the wrangling, analysis, and visualization process as independently as you can. By the end of this lesson, you should have a GitHub repo containing scripts for wrangling and analysis for at least one of these datasets that you can show to others (or at least have as a reference for your future self).

For each of these datasets (or as many as time allows), we’d like you to do some basic analysis, including the following steps:

  1. Create a directory for the analysis of this dataset, using best practices discussed in the previous class.

  2. Load in the data.

  3. Perform some initial data exploration (What are the rows? What are the columns? How many samples does this dataset have? etc.)

  4. Identify at least 1 research question that you could try to answer with this dataset.

  5. Format the data in a way that allows for this analysis.

  6. Visualize the data in helpful ways to answer your question.

  7. Make your code reproducible by using GitHub.

To help with your analyses, we have included a project template document outlining these steps in data/project-day-2-files/example-analyses that you are welcome to use; the locations of additional files are specified below.
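
For example, steps 2 and 3 might look something like the sketch below (the file name is a placeholder; swap in one of the csvs described in the next section):

    library(tidyverse)

    # Placeholder file name -- replace with one of the csvs described below
    my_data <- read_csv("data/project-day-2-files/datasets/your_dataset.csv")

    # Initial exploration: dimensions, column types, and a quick look at the rows
    dim(my_data)       # how many rows and columns?
    glimpse(my_data)   # column names and types
    head(my_data)      # first few rows
    summary(my_data)   # basic per-column summaries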

Datasets

The CSV files are available in data/project-day-2-files/datasets. We have provided links to the publicly available data for download, as well as some related publications that may give you more information about the data.

These datasets were chosen because we thought they represented a good variety of the kinds of wrangling and visualization challenges you might encounter with your own data in the future. We have highlighted a few specific R skills that some of these datasets are meant to challenge you on.

Note

The datasets are presented roughly in order of increasing wrangling complexity. Though you should have most of the basic skills you need to wrangle and analyze these datasets, we have specifically provided some code to aid you in certain wrangling tasks for some of the later datasets: see the “Analysis ideas and hints” callout for each dataset.

For each dataset, we have provided a description of the research question/data collection methods, metadata, and the first few lines of each dataset. Read through these descriptions and work on the dataset you find most interesting. Feel free to work with a partner!

But I thought this was HTLTCode, not HTLTScience!

To get the most out of today’s activities, we highly encourage you to practice going through the whole data analysis workflow as independently as you can, using your notes and practicing your Googling skills to develop your own research questions and analysis ideas for these datasets.

That said, we have prepared some example research questions and figures for each dataset to get everyone started, so you can decide if you’d rather focus more on coding or science-ing for each dataset. These research questions are split into “easier” and “harder” coding challenges, with “harder” challenges generally involving slightly more wrangling. All datasets have a pre-filled project template, in the same location as the blank template document, that explains the code used to produce the example figures.

Thus, based on your comfort level, you can decide how much of the data analysis workflow you want to try independently today. You can:

  1. Just recreate our figures,
  2. Develop your own analysis and figures to answer our research questions, or
  3. Develop your own research questions, analysis, and figures.

Avida digital evolution dataset

Related publications1

Skills to practice: working with multiple files, pivoting

Background: The following data was generated using Avida-ED, an online educational application that allows one to study the dynamics of evolutionary processes. Digital, asexually-reproducing organisms known as “Avidians” can be placed into something akin to a virtual Petri dish to evolve in, and one can manipulate parameters such as mutation rate, resource availability, and dish size to study how those factors affect the evolution of the population.

Your friend needs your help to analyze their Avida-ED data. They designed a series of experiments around a mutant Avidian that gets an energy bonus when the sugar “nanose” is present and a wildtype Avidian that does not. They wanted to see how competition and resource availability affect the population dynamics of these Avidians.

Your friend grew the wildtype alone, the mutant alone, or both populations together (competition) in each of the following 3 environments:

  • Minimal (no additional sugars present)

  • Selective (only nanose present)

  • Rich (nanose and additional sugars present)

Avida experiment setup

Help your friend get started with some of the exploratory data analysis. What relationships can you find between genotype, competition, and resource availability?

Only the data for avida_wildtype.csv is shown:

update condition minimal_avg.fitness minimal_avg.offspring.cost minimal_avg.energy.acq.rate minimal_pop.size selective_avg.fitness selective_avg.offspring.cost selective_avg.energy.acq.rate selective_pop.size rich_avg.fitness rich_avg.offspring.cost rich_avg.energy.acq.rate rich_pop.size
0 wildtype-only 0.3587786 131 47 1 0.2486772 189 47 1 0.3587786 131 47 1
1 wildtype-only 0.3587786 131 47 1 0.2486772 189 47 1 0.3587786 131 47 1
2 wildtype-only 0.3587786 131 47 1 0.2486772 189 47 1 0.3587786 131 47 1
3 wildtype-only 0.3587786 131 47 1 0.2486772 189 47 1 0.3587786 131 47 1
4 wildtype-only 0.3587786 131 47 2 0.2486772 189 47 1 0.3587786 131 47 2
5 wildtype-only 0.3587786 131 47 2 0.2486772 189 47 1 0.3587786 131 47 2

Column | Type | Description | Values
update | integer | Time elapsed | Ranges 0-300
condition | character | Testing condition (single population or competition) | Depends on the csv: wildtype-only, mutant-only, or competition
media_avg.fitness | numeric | Average individual reproductive success in the specified media | Ranges 0-1
media_avg.offspring.cost | numeric | Average individual reproductive cost in the specified media |
media_avg.energy.acq.rate | numeric | Average individual rate of energy acquisition from the environment in the specified media |
media_pop.size | integer | Population size in the specified media | Ranges 0-900

All population measurements were determined for minimal, selective, and rich media.

  • Which files have the pieces of data you need? Do you need to mix-and-match anything?

  • The data could be “tidier”…
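
For the hints above, here is a minimal wrangling sketch. Only avida_wildtype.csv is named in this lesson; the other two file names are guesses, so adjust them to whatever is actually in the datasets folder, and note that the sketch assumes all files share the layout shown above.

    library(tidyverse)

    # Only avida_wildtype.csv is confirmed; the other two names are assumptions
    files <- c("data/project-day-2-files/datasets/avida_wildtype.csv",
               "data/project-day-2-files/datasets/avida_mutant.csv",       # assumed name
               "data/project-day-2-files/datasets/avida_competition.csv")  # assumed name

    # Read and stack the files; each already carries its own condition column
    avida <- map_dfr(files, read_csv)

    # Tidy: split column names like "minimal_avg.fitness" into a media column
    # ("minimal") and a measurement column ("avg.fitness"), so that one row is
    # one update x condition x media combination
    avida_long <- avida |>
      pivot_longer(
        cols = -c(update, condition),
        names_to = c("media", "measure"),
        names_sep = "_",
        values_to = "value"
      ) |>
      pivot_wider(names_from = measure, values_from = value)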

Use these questions to develop your own or answer them as-is. There is no “single” or “correct” way to answer these questions, and you can refine or broaden the scope of these questions as needed.

Bolded questions have an accompanying figure and code.


Easier

  • How does population size change over time for the wildtype (or mutant) in different media types?
  • What is the relationship between fitness and offspring cost in competition conditions in different media types?

Harder

  • How does the change in fitness over time compare between different treatment conditions and media types?
  • Which media type allows for maximal performance of the wildtype? The mutant? Competition conditions?

Easier: How does population size change over time for the wildtype (or mutant) in different media types?

Harder: How does the change in fitness over time compare between different treatment conditions and media types?
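
If you want a head start on the “Easier” figure, here is one possible sketch (this is not the template’s code, just a starting point), using the avida_long data frame from the wrangling sketch above:

    # Population size over time for the wildtype in each media type
    avida_long |>
      filter(condition == "wildtype-only") |>
      ggplot(aes(x = update, y = pop.size, color = media)) +
      geom_line() +
      labs(x = "Update (time elapsed)", y = "Population size", color = "Media")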

Western Africa Ebola public health dataset

Data source

Skills to practice: working with dates, pivoting

Background: The Western African Ebola virus epidemic of 2013-2016 was the most severe outbreak of Ebola virus disease in history. It caused major disruptions and loss of life, mainly in the republics of Guinea, Liberia, and Sierra Leone.

How might you depict the dynamics of this outbreak?

Country Date Cumulative.no..of.confirmed..probable.and.suspected.cases Cumulative.no..of.confirmed..probable.and.suspected.deaths
Guinea 2014-08-29 648 430
Nigeria 2014-08-29 19 7
Sierra Leone 2014-08-29 1026 422
Liberia 2014-08-29 1378 694
Sierra Leone 2014-09-05 1261 491
Nigeria 2014-09-05 22 8

Column | Type | Description | Values
Country | character | Country of report |
Date | character | Date of report | YYYY-MM-DD
Cumulative.no..of.confirmed..probable.and.suspected.cases | numeric | Cumulative number of cases as of this date |
Cumulative.no..of.confirmed..probable.and.suspected.deaths | numeric | Cumulative number of deaths as of this date |

  • This dataset contains data for other countries besides the three named above. For simplicity, you may want to focus on only those three regions.

  • The data could be “tidier”…

  • You can use format() to extract specific parts of a date object: e.g. if x is a date object with the format %Y-%m-%d, you can get the year with format(x, "%Y").
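
A minimal wrangling sketch for these hints (the file name is a placeholder; the column names are taken from the preview above):

    library(tidyverse)

    # Placeholder file name -- swap in the actual csv from data/project-day-2-files/datasets
    ebola <- read_csv("data/project-day-2-files/datasets/ebola.csv")

    ebola_tidy <- ebola |>
      # Shorter names for the two cumulative-count columns
      rename(cases  = Cumulative.no..of.confirmed..probable.and.suspected.cases,
             deaths = Cumulative.no..of.confirmed..probable.and.suspected.deaths) |>
      # Focus on the three countries named above
      filter(Country %in% c("Guinea", "Liberia", "Sierra Leone")) |>
      # Convert the character dates and pull out the year
      mutate(Date = as.Date(Date),
             year = format(Date, "%Y")) |>
      # Make the data longer: one row per country x date x outcome
      pivot_longer(c(cases, deaths), names_to = "outcome", values_to = "count")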

Use these questions to develop your own or answer them as-is. There is no “single” or “correct” way to answer these questions, and you can refine or broaden the scope of these questions as needed.

Bolded questions have an accompanying figure and code.


Easier

  • How many cases and deaths in total were recorded by each country from 2014-2016?
  • For a specific year, how did the number of cases and deaths change over time for each country?

Harder

  • By country, how did the average number of cases and deaths change each year?
  • Are there any seasonal patterns in the average cases and deaths?

Easier: How many cases and deaths in total were recorded by each country from 2014-2016?

Harder: By country, how did the average number of cases and deaths change each year?
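
As one possible starting point for the “Easier” question (not the code behind the example figure), you could treat each country’s total as its final cumulative count, using the ebola_tidy data frame from the sketch above:

    # Final cumulative cases and deaths per country
    ebola_tidy |>
      group_by(Country, outcome) |>
      summarize(total = max(count, na.rm = TRUE), .groups = "drop") |>
      ggplot(aes(x = Country, y = total, fill = outcome)) +
      geom_col(position = "dodge") +
      labs(y = "Cumulative count, 2014-2016", fill = "")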

Heat exposure in Phoenix, Arizona ecological dataset

Data source | Related publication

Skills to practice: parsing strings, dealing with NA values

Background: Exposure to extreme heat is of growing concern with increasing urbanization and ongoing climate change. Though most current knowledge about heat-health risks is gathered and applied at the neighborhood level, less is known about individual experiences of heat, which can vary due to differences in access to cooling resources and activity patterns.

To further investigate, the Central Arizona-Phoenix Long-Term Ecological Research Program (CAP-LTER) recruited participants from 5 Phoenix-area neighborhoods to wear air temperature sensors that recorded their individually-experienced temperatures (IETs) as they went about their daily activities.

What relationships can you find between individual activity, temperature, and neighborhood?

Subject.ID period temperature
1A Sat, 8pm-12am 25.14447
1P Sat, 8pm-12am 26.77962
1E Sat, 8pm-12am 27.21875
1B Sat, 8pm-12am 26.79175
2N Sat, 8pm-12am NA
3A Sat, 8pm-12am NA

Column | Type | Description | Values
Subject.ID | character | Subject identifier; the number (1-5) corresponds to the neighborhood | 1 = Coffelt, 2 = Encanto-Palmcroft, 3 = Garfield, 4 = Thunderhill, 5 = Power Ranch
period | character | 4-hour measurement period | weekday, time range (e.g. Sat, 8pm-12am)
temperature | numeric | 4-hour average of IET during the specified period | degrees Celsius

  • We recommend using functions in the stringr package from the tidyverse to parse the strings, particularly str_sub() and str_split_fixed().

    • e.g. If you had a vector x containing 2-character codes such as “1A”, you could extract the first character with str_sub(x, 1, 1).

    • e.g. If you had a vector x containing information in the format first-second-third, you could extract that first chunk with str_split_fixed(x, "-", n=3)[,1].

  • R has some “counting” functions such as length() and n(): how could you get unique counts? Is there a single function that does that?
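
A minimal parsing sketch for these hints (the file name is a placeholder; the column names are taken from the preview above):

    library(tidyverse)  # loads stringr, dplyr, ggplot2, etc.

    # Placeholder file name -- swap in the actual csv from data/project-day-2-files/datasets
    heat <- read_csv("data/project-day-2-files/datasets/heat_exposure.csv")

    heat_tidy <- heat |>
      mutate(
        # The first character of Subject.ID (e.g. "1A") is the neighborhood number (1-5)
        neighborhood = str_sub(Subject.ID, 1, 1),
        # Split "Sat, 8pm-12am" into a weekday and a time range
        weekday    = str_split_fixed(period, ", ", n = 2)[, 1],
        time_range = str_split_fixed(period, ", ", n = 2)[, 2]
      )

    # Missing sensor readings are NA: either drop them with filter(!is.na(temperature)),
    # or keep them and pass na.rm = TRUE when averaging, e.g. mean(temperature, na.rm = TRUE)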

Use these questions to develop your own or answer them as-is. There is no “single” or “correct” way to answer these questions, and you can refine or broaden the scope of these questions as needed.

Bolded questions have an accompanying figure and code.


Easier

  • How many participants were there for each neighborhood in the study?
  • On average, which 4-hour measurement period is the warmest? The coolest?

Harder

  • What was the daily average temperature for each neighborhood during the study period?
  • For a specific neighborhood on a specific day, what were all of the individual temperatures experienced for each 4-hour measurement period?

Easier: How many participants were there for each neighborhood in the study?

Harder: What was the daily average temperature for each neighborhood during the study period?
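
As one possible starting point for the “Easier” question (not the code behind the example figure), using the heat_tidy data frame from the sketch above:

    # Count unique participants per neighborhood (n_distinct() gives unique counts)
    heat_tidy |>
      group_by(neighborhood) |>
      summarize(participants = n_distinct(Subject.ID)) |>
      ggplot(aes(x = neighborhood, y = participants)) +
      geom_col() +
      labs(x = "Neighborhood (1 = Coffelt ... 5 = Power Ranch)",
           y = "Number of participants")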


  1. In particular, see Lenski et al. (2003) and Smith et al. (2016).