- The library is already synchronized with the lockfile.
Data Wrangling with Real Experimental Data
Data Wrangling, Day 2
Objectives of Data Wrangling: Class 2
- Be able to apply the objectives covered in Data Wrangling: Class 1 to a new dataset
Case Study
Last class, we had introduced a dataset/experiment that we would work through. Let’s remind ourselves of some of the details:
- We have proportions of cell types across samples
- There controls made of mostly pure cell types (
fractions
) and experimental samples (whole
) - There are
.csv
files for both the cell type proportion data and the sample phenotypes
There are a few things to think about when wrangling/exploring data:
- What do you want to know about this data?
- What kind of visuals would you want to make?
- How does the data need to be formatted to get those visuals?
- What are some expected features of our data?
Take a moment to talk among yourselves about this/any ideas you had since last class!
Getting familiar with the data
Proportions Data
# Load the data. The sample IDs were stored as the first row, so lets make those the row.names
<- read.table("https://raw.githubusercontent.com/How-to-Learn-to-Code/Rclass-DataScience/main/data/wrangling-files/cellProportions.csv",
cell_props row.names = 1, header = TRUE, sep = ",")
head(cell_props)
Cardiomyocytes Fibroblast Endothelial.Cells Macrophage
whole_2 0.652 0.0886 0.06700 0.1761
fraction_13 0.824 0.0370 0.06387 0.0692
fraction_12 0.895 0.0213 0.04436 0.0390
fraction_19 0.000 0.9983 0.00167 0.0000
fraction_18 0.000 1.0000 0.00000 0.0000
whole_16 0.820 0.0208 0.08889 0.0501
Pericytes.SMC
whole_2 0.01672
fraction_13 0.00558
fraction_12 0.00000
fraction_19 0.00000
fraction_18 0.00000
whole_16 0.02058
Our data fits the tidy
style, since each row is a sample (observation) and each column is a different cell type (variable).
When assessing data, it’s good to consider what features you’d expect from a given data set. This helps you know if something has gone wrong before you’ve gotten your hands on it.
We’re looking at the proportion of cell types in each sample, which should sum up to 1. Checking that the values in each row add to 1 would help confirm that we have what we’re expecting:
rowSums(cell_props)
whole_2 fraction_13 fraction_12 fraction_19 fraction_18 whole_16
1 1 1 1 1 1
whole_15 whole_12 fraction_14 whole_8 whole_14 whole_3
1 1 1 1 1 1
fraction_9 fraction_11 whole_6 fraction_16 fraction_8 whole_4
1 1 1 1 1 1
fraction_5 fraction_1 whole_10 fraction_10 whole_7 whole_5
1 1 1 1 1 1
whole_9 fraction_6 fraction_3 fraction_17 whole_1 whole_13
1 1 1 1 1 1
fraction_7 fraction_4 whole_11 fraction_20 fraction_2 fraction_15
1 1 1 1 1 1
Raw RNA-seq matrices should go up to 100,000s. So if you only see small numbers in the data, it’s likely been manipulated in some way.
Phenotype data
We also have the phenotypes for the samples in a separate file:
<- read.table("https://raw.githubusercontent.com/How-to-Learn-to-Code/Rclass-DataScience/main/data/wrangling-files/cellPhenotypes.csv",
cell_phenos row.names = 1,sep = ",",header = TRUE)
str(cell_phenos)
'data.frame': 36 obs. of 3 variables:
$ type : chr "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
$ genotype : chr NA NA NA NA ...
$ treatment: chr NA NA NA NA ...
”
Planning the analysis
We want to know:
- If the controls look as we’d expect
- What group differences there are
To get at the question about controls, we’d need to check cell_phenos
to see which samples are from the control or experimental groups. After, we’ll plot the proportions.
flowchart LR A[Data frame of\n cell proportions] --> C(Merged proportions\n and phenotypes) B[Data frame of\n sample phenotypes] --> C C --> D{Steps to get data\nformatted for plotting} D --> E[Plot of control samples] D --> F[Plot of experiment samples]
Manipulating data frames
Summarizing and subsetting
Let’s get more context on what’s in the data. table
is a convenient way to summarize columns and lists:
# What unique values and how many of each are in the "genotype" field
table(cell_phenos$genotype)
cmAKO WT
8 8
# Table can also compare two variables. useNA need to be added to include cells with NAs
table(cell_phenos$type, cell_phenos$genotype, useNA = "ifany")
cmAKO WT <NA>
purified_cardiomyocytes 0 0 9
purified_endothelial_cells 0 0 3
purified_fibroblasts 0 0 8
whole_tissue 8 8 0
Seems that the purified cell types list NA for their genotype and that there are three types. Also, we have 8 knock-outs and 8 are wild type.
Combining and reordering
Data frames can be combined in a bunch of ways, but no matter the method it is essential that the order of samples match. R has two built-in methods, binds (cbind
and rbind
) and merge
.
Binds slap two data frames together. cbind
adds columns, rbind
adds rows. Binds don’t consider the order of the data sets, so there’s a risk of things being out of order.
merge
is similar to cbind
, but matches the data sets based on a common column.
cbind
# bind the rownames to see if they match
cbind(rownames(cell_phenos), rownames(cell_props)) |> head()
[,1] [,2]
[1,] "fraction_1" "whole_2"
[2,] "fraction_2" "fraction_13"
[3,] "fraction_3" "fraction_12"
[4,] "fraction_4" "fraction_19"
[5,] "fraction_5" "fraction_18"
[6,] "fraction_6" "whole_16"
Above, you can see that cbind
would mismatch the samples. Always be careful when using cbind
! It has no guardrails for ordering!
# Reorder one to match the other
# This uses the cell_phenos rownames as a list to specify the order of indices
<- cell_props[rownames(cell_phenos),]
cell_props
# They should all be TRUE now
all(rownames(cell_phenos) == rownames(cell_props))
[1] TRUE
# Now we can merge them
<- cbind(cell_phenos, cell_props)
data_bind
head(data_bind)
type genotype treatment Cardiomyocytes Fibroblast
fraction_1 purified_cardiomyocytes <NA> <NA> 0.911 0.0160
fraction_2 purified_cardiomyocytes <NA> <NA> 0.941 0.0118
fraction_3 purified_cardiomyocytes <NA> <NA> 0.898 0.0165
fraction_4 purified_cardiomyocytes <NA> <NA> 0.869 0.0508
fraction_5 purified_cardiomyocytes <NA> <NA> 0.946 0.0129
fraction_6 purified_fibroblasts <NA> <NA> 0.000 0.8976
Endothelial.Cells Macrophage Pericytes.SMC
fraction_1 0.0398 0.0328 0.00000
fraction_2 0.0266 0.0202 0.00000
fraction_3 0.0446 0.0339 0.00720
fraction_4 0.0235 0.0524 0.00434
fraction_5 0.0215 0.0199 0.00000
fraction_6 0.0138 0.0184 0.07029
merge
# Specify row.names as the feature to merge by
<- merge(cell_phenos, cell_props, by = "row.names")
data_merge
head(data_merge)
Row.names type genotype treatment Cardiomyocytes
1 fraction_1 purified_cardiomyocytes <NA> <NA> 0.911
2 fraction_10 purified_fibroblasts <NA> <NA> 0.000
3 fraction_11 purified_cardiomyocytes <NA> <NA> 0.772
4 fraction_12 purified_cardiomyocytes <NA> <NA> 0.895
5 fraction_13 purified_cardiomyocytes <NA> <NA> 0.824
6 fraction_14 purified_cardiomyocytes <NA> <NA> 0.928
Fibroblast Endothelial.Cells Macrophage Pericytes.SMC
1 0.01600 0.0398 0.03277 0.000000
2 0.99385 0.0050 0.00114 0.000000
3 0.06261 0.0599 0.09814 0.007635
4 0.02127 0.0444 0.03896 0.000000
5 0.03702 0.0639 0.06917 0.005581
6 0.00861 0.0416 0.02198 0.000179
While this won’t always be the case with merge
vs. bind
, its better to use merge
in this scenario, since it helps keep your script interpretable.
If you continue with programming, you’ll need to share your code or return to code you wrote months ago. Writing easy-to-understand scripts gives you less headache later!
Preparing for different visualizations
At this point, we should ask ourselves a few questions:
- What am I trying to see about the data?
- What kind of plot helps us see that?
Take a minute to talk as a group about how you would visualize the data!
What am I trying to see about the data?
Our samples have data on the proportions of many cell types. I’d want to easily compare all of these cell types at once, with samples/groups side-by-side.
What kind of plot do we want?
Pie charts are often used to visualize percents/proportions, but its difficult to see differences between two pie charts. A stacked bar plot would be a better fit, since we’re trying to compare different sample groups.
What format does my data need to be to make said plot?
This stacked bar plot would have:
Samples on the X-axis
Cell-type proportions on the Y-axis
Colors for each cell type in each bar
For ggplot
to make this our data needs to have a column for each term, but the data is spread across many columns. To solve this, we first need to understand the concepts of wide and long data.
Pivoting wide and long
Data is often formatted as wide or long. Our data is in a wide format, which has a single row for each sample and a column for each variable. When wide data is pivoted into a long format columns are condensed together.
It’s easiest to understand how pivoting works in visuals:
As a reminder, we want to make a column of proportions values (val) and a column specifying cell types (key).
library(tidyverse)
# cell types are specified with cols = and name the new column with names_to
# values originally in those columns are going to move to a new values column, which we can name with values_to =
<- pivot_longer(data_merge,
data_long cols = c(Cardiomyocytes, Fibroblast, Endothelial.Cells, Macrophage, Pericytes.SMC),
names_to = "cell.type", values_to = "proportion")
str(data_long)
tibble [180 × 6] (S3: tbl_df/tbl/data.frame)
$ Row.names : 'AsIs' chr [1:180] "fraction_1" "fraction_1" "fraction_1" "fraction_1" ...
$ type : chr [1:180] "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
$ genotype : chr [1:180] NA NA NA NA ...
$ treatment : chr [1:180] NA NA NA NA ...
$ cell.type : chr [1:180] "Cardiomyocytes" "Fibroblast" "Endothelial.Cells" "Macrophage" ...
$ proportion: num [1:180] 0.9115 0.016 0.0398 0.0328 0 ...
We have a couple of changes:
There are two new columns,
cell.type
andproportion
We have A LOT more rows than we did originally
The sample IDs were coerced to a column “Row.names” that is an ‘AsIs’ character. We’ll need to correct that before we plot the data
Wrangling for plotting
Pure cell-type fraction controls
With our data in this format, we can make a lot of cool plots. Lets start with the bar plot we had planned.
|>
data_long mutate(id = as.character(Row.names)) |> # fix the AsIs type
ggplot(aes(x = id, y = proportion, fill = cell.type))+
geom_bar(position="fill", stat="identity")
mutate
is a great way to modify specific parts of your data or make new columns!
It worked, but it looks… less than pleasing. Lets remind ourselves of what we wanted to see in the plot: groups side-by-side.
I’d like to start by making a plot just for the controls for now. filter
from the dplyr
package will help separate the groups. Also, I’ll make aesthetic changes to make it easier to compare groups and nicer to look at.
|>
data_long filter(type != "whole_tissue") |>
mutate(id = as.character(Row.names)) |>
ggplot(aes(x = id, y = proportion, fill = cell.type))+
geom_bar(position="fill", stat="identity", color = "black", width = 1) +
facet_grid(cols=vars(type), scales = "free") +
scale_fill_manual(values = c("#66C2A5","#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854")) +
theme_minimal() +
theme(
axis.title.x = element_blank(),
legend.title = element_blank(),
legend.position = "bottom"
+
) guides(x = guide_axis(angle = 45)) +
labs(title = "Cell type proportions in purified control samples",
y = "Cell Type Proportion")
This looks good! We can see what we expected of our control samples. Each of the fractions are made up of a single cell type. Let’s move onto the experimental samples.
Experimental Samples
There are two things we should consider before we visualize differences between our experimental groups:
- It would be easier to compare shifts in specific cell types if we break up the stacked bar chart so that the cell types are spread across the x-axis.
- In our last plot, we compared samples across a single phenotypic factor:
type
. This time, it’s more complicated because we want to we want to compare bothgenotype
andtreatment
.
|>
data_long filter(type == "whole_tissue") |>
ggplot(aes(x = cell.type, y = proportion)) +
geom_bar(stat = "summary", fun = mean, width = 0.9, color = "black") +
facet_grid(genotype ~ treatment, scales = "free") +
theme_minimal() +
theme(
axis.title.x = element_blank(),
legend.title = element_blank(),
legend.position = "bottom"
+
) labs(y = "Estimated Proportion") +
scale_fill_manual(values = c("#66C2A5","#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854")) +
guides(x = guide_axis(angle = 45))
We just made three major changes:
cell.type
is on the x-axis, not sampleid
s- We’re plotting the mean of each cell type across many samples in each group.
geom_bar
can do this automatically withstat = "summary", fun = mean,
- We’re showing four plots at once by having
facet_grid
contrast them withgenotype ~ treatment
However, it may still be tough to compare across the groups. Also, only showing the mean masks any variation within groups. Lets make two more major changes to fix that:
- Put all of the groups into a single plot
- Add dots for each sample onto each bar
And to make it easier to read, let’s reorder the X-axis by most to least abundant cell types.
Reorder cell types
We can take advantage of factors
to reorder, since ggplot
references the order of factors when plotting.
# Find the most-to-least abundant cell types
<- data_long |>
cell.type.order filter(type == "whole_tissue") |>
group_by(cell.type) |> # Manipulate the data within cell-type groups
mutate(mean = mean(proportion)) |> # make a new column that is the mean of the proportions
arrange(desc(mean)) |> # arrange by mean proportion
pull(cell.type) |> # pull out the cell type column as a list
unique() # remove duplicated values
cell.type.order
[1] "Cardiomyocytes" "Macrophage" "Fibroblast"
[4] "Endothelial.Cells" "Pericytes.SMC"
Put all groups on a single plot
If we combine genotype
and treatment
into a single variable, we can condense down to a single plot. While we’re at it, we can apply cell.type.order
to make the data_long$cell.type
into a factor-level column:
<- data_long |>
data_long mutate(cell.type = factor(cell.type, levels = cell.type.order),
Genotype_Treatment = factor(paste(genotype, "-", treatment), levels = c("WT - Sham", "WT - MI", "cmAKO - Sham", "cmAKO - MI")))
Plot
# Generate boxplot
|>
data_long filter(type == "whole_tissue") |>
ggplot(aes(x = cell.type, y = proportion, fill = Genotype_Treatment)) +
geom_bar(stat = "summary", fun = mean, width = 0.9, color = "black",
position = position_dodge(0.9)) +
geom_jitter(inherit.aes = T,
position = position_dodge(0.9),
size = 2, alpha = 0.3) +
labs(y = "Estimated Proportion",
fill = "Treatment") +
theme(
axis.title.x = element_blank(),
legend.title = element_blank(),
legend.position = "bottom"
+
) scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#FDBF6F", "#FF7F00")) +
guides(x = guide_axis(angle = 45))