io.Rmd
The first step when accessing a file – whether reading or writing – is correctly pointing to the file on the filesystem. To do so, you must specify the path. For an example of a path run the following:
getwd() ## [1] "/Users/runner/runners/2.263.0/work/rclass/rclass/vignettes"
This stands for “get working directory.” Your working directory is very important – this is where R will look for files. The working directory will For example:
file.exists("madeupfile.txt") ## [1] FALSE
This function looks for a file called “madeupfile.txt” in the working directory, and returns FALSE
indicating there is no such file. The working directory will depend on where R
is called from. At any time you can change the working directory with the setwd
function. However, R can easily look outside the current working directory by specifying a path. A typical path looks like the following:
/path/to/some/folder/file.extension
The problem is the exact formatting of paths differ from operating system to operating system. Rather than simply typing “/path/to/some/folder/file.extension” using the file.path
function will make your code more portable (meaning it will play nice when running on other operating systems):
file.path("/", "path", "to", "some", "folder", "file.extension") ## [1] "//path/to/some/folder/file.extension"
Specifying paths creates problems for reproducible analysis. Generally speaking, you do not want your analysis to depend on the current working directory. You also do not want the analysis to depend on users having files in specific locations. Later in the course we will give you the skills to circumvent the issue by wrapping your analysis in a package. For now, just keep in mind no one will be able to run your analysis if the paths are specific to your computer! The best option is the package route, but another option is to create a variable that contains the path to the analysis directory.
To illustrate we will learn about a family of functions that manipulate the file system. Conveniently, they follow the format file.*
or dir.*
(“dir” is referencing a directory, or folder). We will use the file sytem functions to create a new directory containing subdirectories and files.
# Step one is to specify the "analysis dir" -- ad -- here, we place the dir # in the temporary dir created by R. This isn't important. (This variable # is where someone recreating your analysis would specify the path) ad <- file.path(tempdir(), "myanalysis") # First check for the directory -- you always want to check so that you # do not overwrite someone's files (there is also a file.exists) dir.exists(ad) ## [1] FALSE dir.create(ad) dir.create(file.path(ad, "sub")) file.create(file.path(ad, "f1.txt")) ## [1] TRUE file.create(file.path(ad, "sub", "f2.txt")) ## [1] TRUE # List the files in the given directory. recursive = TRUE indicates to # look in all subdirectories and list those files as well. You'll notice # the paths to the files in subdirectories specify the path from the given # path list.files(ad, recursive = TRUE) ## [1] "f1.txt" "sub/f2.txt"
Now imagine you have a folder containing all of the scripts and files for your analysis. You should write your scripts such that all of the file paths are rooted by a path to the analysis directory, such that someone else could change that one variable and run your code. There are many other functions for manipulating the file system – most of which we will not cover – but the following shows how to rename, move, and then delete files.
file.rename(from = file.path(ad, "f1.txt"), to = file.path(ad, "f5.txt")) ## [1] TRUE list.files(ad, recursive = TRUE) ## [1] "f5.txt" "sub/f2.txt" file.remove(file.path(ad, "f5.txt")) ## [1] TRUE list.files(ad, recursive = TRUE) ## [1] "sub/f2.txt" # Finally, remove the folder we created above (and all its contents!). # Careful here, this is a dangerous function. unlink(ad, recursive = TRUE)
Manipulating the file system in R can be really helpful. Rather than manually renaming, moving, or deleting files you can have R do it for you. When you get into dealing with hundreds or thousands of files this becomes really helpful!
Before we talk about the higher level functions to read in data files, we will briefly discuss how R interacts with files. This topic is very detailed and is intended for illustration – you will not likely use these functions. The first step is to establish a connection to a file – file being generic here. We will only discuss simple text files, but the connection can be to a website, compressed file, etc. First, we will create a file and establish a connection.
xmplFile <- file.path(tempdir(), "example.txt") file.create(xmplFile) ## [1] TRUE con1 <- file(xmplFile) con1 ## A connection with ## description "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//RtmpGwbP7f/example.txt" ## class "file" ## mode "r" ## text "text" ## opened "closed" ## can read "yes" ## can write "yes"
The connection is created, but not opened. We will now open the connection, read and write some lines, then close the connection and delete the file. When opening a connection you must specify to read, write, or append. Opening a file to write (or to read & write) will delete all data in the file. To preserve the existing content you have to open as append (or read & append). You can run ?open
to read more about the different opening modes.
open(con1, open = "w+") # the mode 'w+' here means write and read readLines(con1) ## character(0) writeLines(text = "hello", con = con1) readLines(con1) ## [1] "hello" writeLines(text = c("this", "is", "some", "more", "text"), con = con1) readLines(con1) ## [1] "this" "is" "some" "more" "text" close(con1) readLines(xmplFile) ## [1] "hello" "this" "is" "some" "more" "text" file.remove(xmplFile) ## [1] TRUE
Lastly, notice two things: (1) the connection stores its current position. After writing and reading “hello”, the next call to readLines
does not read “hello” because the read position already passed “hello.” (2) readLines
can be called directly on a path, and the funciton will handle the connection for you. Similar to the read position, we see the connection also stores the write position because “hello” was not overwritten by the second call to writeLines
. The read and write positions can be queried using the seek
function.
Please note: these functions are dangerous, because calling the wrong function can easily delete the file contents. You should be able to how to extract lines from a file, but mistakes are costly and using these functions would require you to parse the lines yourself. Luckily, R has many wrapper functions that manage the connections, reading, and parsing for you.
A csv is a comma separated values file, which allows data to be saved in a table structured format. Excel files can often be saved as csv files. R has a function to read these files called read.csv
. We will use this function to read in data from a csv file.
Consider the following file
# Here is a couple lines of comments explaining
# the file contents.
"id ","val "
1,10
2,5
3,15
,
We can create this file using what we learned above.
csvFile <- file.path(tempdir(), "csvFile.csv") writeLines(text = c("# Here is a couple lines of comments explaining", "# the file contents.", "\"id \",\"val \"", "1,10", "2,5", "3,15", ","), con = csvFile)
Now we will use the read.csv
function to read the data into a data.frame object.
read.csv(csvFile) ## X..Here.is.a.couple.lines.of.comments.explaining ## # the file contents. ## id val ## 1 10 ## 2 5 ## 3 15 ##
The data is a mess due to the comment lines at the top of the file. We have three options: (1) edit the file by hand – a very poor choice, (2) skip those lines, or (3) tell the function how to look for comments. The following shows how to both skip and designate comments:
read.csv(csvFile, skip = 2) ## id. val.. ## 1 1 10 ## 2 2 5 ## 3 3 15 ## 4 NA NA read.csv(csvFile, comment.char = "#") ## id. val.. ## 1 1 10 ## 2 2 5 ## 3 3 15 ## 4 NA NA
However, we are not done. R does not like spaces in column names, and you can see the spaces have been replaced by periods. Additionally, you can see two rows of NA
at the bottom of the table. To remove the spaces from the names we will introduce the regular expression functions.
Regular expressions are a way of finding text. They can be as simple as find any letter, or as complicated as find an explicit phrase, but only when it occurs after another phrase, and replace part of the first phrase with a third phrase. R has its own flavor of regular expressions (regex’s), but will also accept Perl regex. Here, we will look at a few cases using the grep
, grepl
, sub
, and gsub
functions.
First consider how to find regular expressions using the grep
and grepl
functions:
s1 <- c("a", "ab", "b", "ac", "c") grep(pattern = "a", x = s1) ## [1] 1 2 4 grepl(pattern = "b", x = s1) ## [1] FALSE TRUE TRUE FALSE FALSE
We see grep
returns the index of strings in the vector that contained the pattern, whereas grepl
returns a vector indicating whether the pattern matched each string in the vector. grep
will also return the strings that matched by indicating value = TRUE
:
grep(pattern = "c", x = s1, value = TRUE) ## [1] "ac" "c"
The above examples have very simple patterns. ?regex
gives a good starting point for how to make more general (“regular”!) patterns. For example, suppose we want to find every string of the format: capitalletter, lowercase letter. Here we can use [:A-Z:]
to indicate any uppercase letter, and [:A-Z:]
to indicate any lowercase letter.
s2 <- c("ab", "Aba", "aBa", "Ab", "BB", "Ba", "Cb") grep("[:A-Z:][:a-z:]", s2, value = TRUE) ## [1] "Aba" "aBa" "Ab" "Ba" "Cb"
This is close, but “Aba” breaks the desired format. It contains the format, but has too many letters. We can modify the regular expression with ^
to indicate the start of the string and $
to indicate the end of the string.
grep("^[:A-Z:][:a-z:]$", s2, value = TRUE) ## [1] "Ab" "Ba" "Cb"
Now we get exactly the desired output. Again, ?regex
provides a lot of information to get you started. It is (always!) helpful to read the documentation first. Now that we know how to find patterns, lets look at how to modify strings with the substitute functions.
sub(pattern = "Bob", replacement = "Tom", x = "Bob and Sally") ## [1] "Tom and Sally"
Here we replace “Bob” with “Tom” in the string. What if “Bob” occurs twice in the string?
sub(pattern = "Bob", replacement = "Tom", x = "Bob, Bob, and Sally") ## [1] "Tom, Bob, and Sally"
The sub
function only finds the first instance of the pattern, then stops looking. If we want to replace all entries of the pattern we need the gsub
function.
gsub(pattern = "Bob", replacement = "Tom", x = "Bob, Bob, and Sally") ## [1] "Tom, Tom, and Sally"
Now we can fix the column names. Recall we access the column names with the colnames
function. First store the file to a data.frame called dat
.
dat <- read.csv(csvFile, comment.char = "#") colnames(dat) ## [1] "id." "val.." colnames(dat) <- gsub("[:.:]", "", colnames(dat)) colnames(dat) ## [1] "id" "val"
Now we just need to exclude the NA
columns. There are MANY ways to do this. We will go over three. (1) If we know how many rows you should have you can read only those lines using the nrows
parameter:
read.csv(csvFile, comment.char = "#", nrows = 3) ## id. val.. ## 1 1 10 ## 2 2 5 ## 3 3 15
This is often not ideal, because we want to write generalized code when possible. What happens if we get an input file with a different number of rows? This approach would only work well if the input should always have exactly the same number of rows. (2) we can use the na.omit
function. This is a powerful function, but, (depeding on your goals) you need to be careful about the input data structure. For a data.frame object, na.omit
will remove rows with NA
in ANY column. For example:
na.omit(dat) ## id val ## 1 1 10 ## 2 2 5 ## 3 3 15
Consider if we had a slightly different data.frame, where we only wanted to remove NA
values in the id
column:
copyOfDat <- dat copyOfDat[5, 1] <- 4 copyOfDat ## id val ## 1 1 10 ## 2 2 5 ## 3 3 15 ## 4 NA NA ## 5 4 NA na.omit(copyOfDat) ## id val ## 1 1 10 ## 2 2 5 ## 3 3 15
In the above example, na.omit
removed row 5 – this may not be the desired result. (3) we can find which rows have NA
in them. This becomes tedious if we care about removing rows with any NA
values, but if we only care about missing values in the id
column, then this approach works best.
# First find the rows with valid ids keep <- !is.na(copyOfDat$id) copyOfDat[keep, ] ## id val ## 1 1 10 ## 2 2 5 ## 3 3 15 ## 5 4 NA
Recall from our discussions about data structures and subsetting that we can subset using a logical vector. Supplying a logical vector to the i
and j
positions will only keep the rows or columns, respectivley, where the index is TRUE
.
Not all data comes in as comma separated file – not to worry. The read.csv
function is a specialized version of the read.table
function. read.table
reads arbitrary text files and can handle different separators (eg. tab-delimited .txt files) by specifying the sep
argument. For example:
# Create a text string with our data -- this is just a shortcut rather # than writing out a file txt <- "1,2,3\n\"a\",\"4\",5" # cat interprets the text and prints it to the console -- here it recognizes # the special '\n' newline character cat(txt) ## 1,2,3 ## "a","4",5 # The read.table/read.csv functions can take a 'text' argument rather than # a path read.csv(text = txt) ## X1 X2 X3 ## 1 a 4 5 read.table(text = txt, sep = ",") ## V1 V2 V3 ## 1 1 2 3 ## 2 a 4 5
We can see the two function calls are similar, but not identical. The issue here is with the “header.” R needs to know if the file contains a header, to be the column names, or not. By default, read.table
does not expect a header and read.csv
does. We can control this behavior with the header
argument. This is an excellent example of why reading the documentation is important. Take a look at ?read.table
– it will show the default header
values for the two functions.
read.csv(text = txt, header = FALSE) ## V1 V2 V3 ## 1 1 2 3 ## 2 a 4 5
By default read.table
expects a tab-delimited file, but any separator can be used. Consider the following examples:
read.table(text = "a\tb\nb\ta") ## V1 V2 ## 1 a b ## 2 b a read.table(text = "a;b\nb;a") ## V1 ## 1 a;b ## 2 b;a read.table(text = "a;b\nb;a", sep = ";") ## V1 V2 ## 1 a b ## 2 b a
Notice the first column of txt
has seemingly mixed data types; ie. column 1 contains a number and a letter. When loading data we need to pay close attention to data types. We will use the str
function to keep track of data types for each column. First consider:
str(read.table(text = txt, sep = ",")) ## 'data.frame': 2 obs. of 3 variables: ## $ V1: chr "1" "a" ## $ V2: int 2 4 ## $ V3: int 3 5
We see the first column was actually imported as a factor. From the discussion on coercing data types, we may have expected the first column to import as a character. When data is loaded by read.table
(or read.csv
) each column is processed with the type.convert
function. type.convert
looks at the contents of the column, and chooses the lowest order data type possible. However, this function defaults to treating strings as factors. We can alter this behavior by passing stringsAsFactors = FALSE
to read.table
(this argument gets passed along to type.convert
).
str(read.table(text = txt, sep = ",", stringsAsFactors = FALSE)) ## 'data.frame': 2 obs. of 3 variables: ## $ V1: chr "1" "a" ## $ V2: int 2 4 ## $ V3: int 3 5
Similarly, we can pass as.is = TRUE
. The as.is
paramter allows finer control. Like stringsAsFactors
it accepts TRUE/FALSE
, but it can also accept a vector of integers specifying which columns to leave “as.is” (meaning they are not converted to factors). For example:
str(read.table(text = txt, sep = ",", as.is = 1:3)) ## 'data.frame': 2 obs. of 3 variables: ## $ V1: chr "1" "a" ## $ V2: int 2 4 ## $ V3: int 3 5 str(read.table(text = txt, sep = ",", as.is = 2:3)) ## 'data.frame': 2 obs. of 3 variables: ## $ V1: Factor w/ 2 levels "1","a": 1 2 ## $ V2: int 2 4 ## $ V3: int 3 5 str(read.table(text = sub("4", "b", txt), sep = ",", as.is = 2:3)) ## 'data.frame': 2 obs. of 3 variables: ## $ V1: Factor w/ 2 levels "1","a": 1 2 ## $ V2: chr "2" "b" ## $ V3: int 3 5
Notice the 4 in txt
is quoted, but R still recognizes "4"
as coercible to numeric, and does so because the rest of the column is numeric. In the last example above we sub out the 4 for the letter “b”, but we only leave columns 2 and 3 “as is” resulting in a mix of factor and character columns in the result.
Suppose you do not want to bother with files, but want to save R objects directly. This can be especially helpful when you have complex data structures not easily represented by a flat file. Reading and writing R objects is also faster and takes less diskspace. R provides two modes to save objects. The first mode saves a single unnamed object (saveRDS
/readRDS
), and the second mode can save many named objects (save
/load
). We encourage you to read the documentation for these functions and experiment yourself.
These exercises are to help you solidify and expand on the information given above. We intentionally added some concepts that were not covered above, and hope that you will take a few minutes to think through what is happening and how R is interpreting the code.
What if the “a” value in txt
was meant to indicate a missing value, and you actually want to V1
to read in as numeric. Read the type.convert
documentation and use what you learn to tell read.table
that “a” means the value is missing.
Your collaborator from Europe sends you the following file: "12,34\t15,01\t"
. Use read.table
to correctly load the file with numeric values.
R does not provide a native solution for reading Excel files – wah, wah. Luckily, there are many packages available for doing so. We will discuss packages more later in the class. For now, simply run the following code which will install and load the readxl
package.
install.packages("readxl")
library(readxl)
You will now have access to the read_xls
function. You will use this function to load in some data, and then you will use what you’ve learned so far to create a cleaned object that you could use for further analysis.
# First, download the file copied from the FBI database and take a look in Excel.
xlsLink <- "https://raw.githubusercontent.com/How-to-Learn-to-Code/rclass/master/inst/other/crime.xls"
download.file(xlsLink, "crime.xls")
crime <- read_xls("crime.xls")
Don’t be intimidated by how the object looks. Running class(crime)
shows that it is a new data structure that inherits from data.frame. You can treat it exactly like a data.frame, or coerce it back to a simple data.frame if you prefer (crime <- as.data.frame(crime)). Now you need to use what you’ve learned so far to clean up the file, such that it matches the cleaned version we created. The cleaned version can be found here:
csvLink <- "https://raw.githubusercontent.com/How-to-Learn-to-Code/rclass/master/inst/other/crime.csv"
cleanedDF <- read.csv(csvLink)