Lesson 4 - File IO and string manipulation

Learning objectives: Students will be able to load text files into Python objects and manipulate file names and strings.

Specific skills:

Introduction

Every program has an input and an output. In science, the input is usually your raw data; the output can be anything from processed data and statistical tests to model predictions or figures for a paper or presentation. In any case, loading your input data is one of the first tasks that your code will have to perform.

You have already seen how to load data into a pandas.DataFrame. However, not all data will be formatted this way. This lesson will teach you how to deal with text files, but the tools learned here will be applicable to a variety of different file types that you may come across in your research.

Motivating Example

You are analyzing data, and your collaborator just sent you several files containing lists of genes that are upregulated in specific cell types. You might be interested in using these gene lists to determine how similar cells in your data set are to the cell types discovered by your collaborator. You have all the files in a single directory, but now you need to figure out how to import all of the gene lists into Python.

How do you go about reading the data from all of these files? We will learn about the tools for reading data in this lesson.

Setup

We will be using data in external files for this lesson, so we will need a way to access these. In Colab, we can do this by cloning the data from GitHub. You can also access your Google Drive from Colab, by mounting your Drive every time you open your notebook.

We clone from GitHub with git, a command-line program that we run through a shell called bash. Bash is the standard command-line language on UNIX-like operating systems, including macOS and Linux (Google Colab runs your notebook on a Linux server). We can run shell commands in our notebook by starting a line with an exclamation point (!). Lines that start with a percent sign (%) are notebook "magic" commands, some of which mirror shell commands.

!git clone https://github.com/How-to-Learn-to-Code/python-class.git
Cloning into 'python-class'...
remote: Enumerating objects: 977, done.
remote: Counting objects: 100% (401/401), done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 977 (delta 231), reused 304 (delta 158), pack-reused 576 (from 1)
Receiving objects: 100% (977/977), 32.50 MiB | 67.92 MiB/s, done.
Resolving deltas: 100% (450/450), done.

Now we have access to all the files in the GitHub repository. Next, we want to change the folder where our code will execute. This is called the working directory. We can change the working directory using cd. In a notebook we use the %cd form, because a directory change made with !cd does not carry over to later commands.

%cd python-class/Lesson_4_FileIO/
/home/runner/work/python-class/python-class/Lesson_4_FileIO/python-class/Lesson_4_FileIO

If we want to check that we are in the correct working directory, we can tell bash to print our current working directory using the command pwd.

!pwd
/home/runner/work/python-class/python-class/Lesson_4_FileIO/python-class/Lesson_4_FileIO
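The same check can also be made from Python itself. The short sketch below uses the standard-library os module; it is an alternative to the bash commands above, not a replacement for them.

```python
import os

# os.getcwd() returns the current working directory as a string,
# the same information that pwd prints.
cwd = os.getcwd()
print(cwd)

# os.chdir() changes the working directory from Python. The path '.'
# means "the current directory", so this call leaves us where we are.
os.chdir('.')
print(os.getcwd() == cwd)
```

To move somewhere else, pass a real path to os.chdir(), for example the lesson folder used above.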

We can also see all the files in our current directory, using the “list” command ls.

!ls
HW_seq.txt  Lesson_4_student.ipynb  dna.txt    three_seq.txt
Lesson_4.ipynb  data            gene_sets

Files

filepath = 'dna.txt'
my_file = open(filepath)

File Objects

  • printing a file object shows a description of the object, not the contents of the file
print(my_file)
<_io.TextIOWrapper name='dna.txt' mode='r' encoding='UTF-8'>

Reading a file object

file_contents = my_file.read()
print(file_contents)
ATATCGCGAA

“Exhausting” a file

my_file = open(filepath)
print(my_file.read())
ATATCGCGAA
print(my_file.read())

No output is displayed because the file object has already been read through to the end.

The file must be opened again to read from the beginning.

Storing the contents in a variable (file_contents) lets us reuse the file data without worrying about this.

print(file_contents)
ATATCGCGAA
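Besides reopening the file, an open file object can be rewound with its seek() method. The sketch below writes a small throwaway file first so that it runs on its own; the file name example_dna.txt is made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name and its location are assumptions for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example_dna.txt')
with open(path, 'w') as handle:
    handle.write('ATATCGCGAA\n')

my_file = open(path)
first_read = my_file.read()    # reads everything; the position is now at the end
second_read = my_file.read()   # nothing is left, so this is an empty string
my_file.seek(0)                # move the position back to the start of the file
third_read = my_file.read()    # the full contents again
my_file.close()

print(repr(first_read))
print(repr(second_read))
print(repr(third_read))
```

In most analysis code it is simpler to read once and keep the result in a variable, as shown above.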

Working with the file

dna_length = len(file_contents)
print('sequence is ' + file_contents + ' and the length is ' + str(dna_length))
sequence is ATATCGCGAA
 and the length is 11

The output looks strange and the length is off by one because of a hidden newline ('\n') character at the end of the file contents.

The text in the file ends with a newline character, so it looks like two lines with the second one blank.
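One way to make hidden characters visible is the built-in repr() function, which shows a string the way you would type it in code. In this sketch the string is written out by hand to match what read() returned for dna.txt.

```python
# The same text that read() returned for dna.txt, typed in by hand.
file_contents = 'ATATCGCGAA\n'

shown = repr(file_contents)
print(shown)                 # the newline is displayed as \n
print(len(file_contents))    # 10 nucleotides + 1 newline character = 11
```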

Stripping

my_dna_strip = file_contents.strip('\n')
print('sequence is ' + my_dna_strip + ' and the length is ' + str(len(my_dna_strip)))
sequence is ATATCGCGAA and the length is 10

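.strip() removes leading and trailing characters from a string. Given an argument, it removes any of those characters from both ends; with no argument, it removes whitespace (spaces, tabs, and newlines).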

new_dna = my_dna_strip.strip('A')
print(new_dna)
TATCGCG
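A few variations on stripping are sketched below. The example string raw is made up; lstrip() and rstrip() are the one-sided versions of strip().

```python
raw = '  ATATCGCGAA \n'   # made-up string with whitespace on both ends

both = raw.strip()           # no argument: whitespace removed from both ends
right = raw.rstrip('\n')     # only the trailing newline is removed
left = raw.lstrip()          # only the leading whitespace is removed

print(repr(both))
print(repr(right))
print(repr(left))

# The argument is treated as a set of characters, not as a word:
# any run of A or T is removed from each end.
trimmed = 'ATATCGCGAA'.strip('AT')
print(trimmed)
```

Because the argument is a set of characters, strip('AT') and strip('TA') behave the same way.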

Closing files

It’s good programming practice to close files once you have finished reading from them. Your operating system limits how many files a program can keep open at once.

my_file2 = open('three_seq.txt')
file_contents2 = my_file2.read()
my_file2.close()

Context management

Writing code to close your file every time you open it is a pain. Fortunately, there is a clean way to deal with this. The Python with statement works with a context manager, an object that is responsible for setting up and cleaning up the context in which the file is being manipulated. In practice, this means that the file will automatically be closed once you exit the with block.

This is how it works:

with open('three_seq.txt') as file_handle:
    file_contents2 = file_handle.read()
file_contents2
'ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG\nCTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC\nAGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG'
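To confirm that the file really is closed after the with block, you can inspect the closed attribute of the file object. This sketch creates its own small file (example_seq.txt, a made-up name) so that it does not depend on the lesson data.

```python
import os
import tempfile

# Throwaway file with made-up contents, so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example_seq.txt')
with open(path, 'w') as handle:
    handle.write('ATCAG\nCTGCT\n')

with open(path) as file_handle:
    contents = file_handle.read()
    open_inside = not file_handle.closed   # still open inside the block

closed_after = file_handle.closed          # closed once the block ends
print(open_inside, closed_after)
```

The variable contents is still available after the block; only the file handle is closed.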

Reading many lines

Oftentimes the text file we are using will be organized line by line. In this case, rather than reading the whole thing at once, it makes sense to read each line into a different element of a list.

with open('three_seq.txt') as handle:
    seqs = handle.read()
    seq_list = seqs.splitlines()
print(seq_list)
['ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG', 'CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC', 'AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG']

You can also chain the commands to fit on a single line:

with open('three_seq.txt') as handle:
    seq_list2 = handle.read().splitlines()
print(seq_list2)
['ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG', 'CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC', 'AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG']
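Another common pattern is to loop over the file handle directly, which yields one line at a time. The sketch below uses a small made-up file in place of three_seq.txt; each line still carries its trailing newline, so it is removed with rstrip().

```python
import os
import tempfile

# Made-up three-line file standing in for three_seq.txt.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example_three_seq.txt')
with open(path, 'w') as handle:
    handle.write('ATCAG\nCTGCT\nAGAAG\n')

seqs = []
with open(path) as handle:
    for line in handle:                  # one line per iteration, newline included
        seqs.append(line.rstrip('\n'))   # drop the trailing newline
print(seqs)
```

This avoids holding the whole file as one string, which can matter for very large files.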

Exercises

How would you check that seq_list and seq_list2 are identical?

# your code goes here

Write a function to read a text file into a list of lines. Test your code using the examples below.

# feel free to change your function name to something more descriptive!
def file_reader(path):
    
    return 
file_reader('dna.txt')[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 file_reader('dna.txt')[0]

TypeError: 'NoneType' object is not subscriptable
my_file_test = file_reader('dna.txt')[0]
assert my_file_test == 'ATATCGCGAA'
seq_list_test = file_reader('three_seq.txt')
assert seq_list_test == seq_list
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 my_file_test = file_reader('dna.txt')[0]
      2 assert my_file_test == 'ATATCGCGAA'
      3 seq_list_test = file_reader('three_seq.txt')

TypeError: 'NoneType' object is not subscriptable

Manipulating strings

There are a few useful things we can do with strings.

First, we can find and replace text using the replace() method of a string. This can be useful for stripping file extensions:

'some_file.txt'.replace('.txt', '')
'some_file'

The first argument to replace() is the pattern we are searching for in the string, and the second argument is what we want to replace it with.

'this is a STRING'.replace('STRING', 'pizza')
'this is a pizza'
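One caveat: replace() substitutes every occurrence of the pattern, not just the one at the end of the string. When the goal is specifically to remove a file extension, the standard-library function os.path.splitext() is a possible alternative. The file name tricky below is made up to show the difference.

```python
import os

# replace() substitutes every occurrence, wherever it appears.
tricky = 'my.txt_notes.txt'              # made-up file name
replaced = tricky.replace('.txt', '')
print(replaced)

# os.path.splitext() splits off only the final extension.
name = 'HALLMARK_APOPTOSIS.v7.5.1.grp'
stem, ext = os.path.splitext(name)
print(stem)
print(ext)
```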

There are several other useful tools for working with strings. For example, upper() and lower() can be used to make all characters in a string upper- or lower-case. swapcase() switches all upper- and lower-case letters.

print('lower2upper'.upper())
print('UPPER2LOWER'.lower())
print('cAmElCaSe'.swapcase())
LOWER2UPPER
upper2lower
CaMeLcAsE

Strings can be concatenated using the addition operator:

'str' + 'ing'
'string'

Variables can be inserted into strings by first converting them into strings and then concatenating them:

my_name = 'NAME'
my_age = None
# fails because my_age is not a string
print('My name is ' + my_name + ' and my age is ' + my_age)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 4
      2 my_age = None
      3 # fails because my_age is not a string
----> 4 print('My name is ' + my_name + ' and my age is ' + my_age)

TypeError: can only concatenate str (not "NoneType") to str
print('My name is ' + my_name + ' and my age is ' + str(my_age))
My name is NAME and my age is None

As of Python 3.6, the easiest way to format strings is using f-strings:

print(f'My name is {my_name} and my age is {my_age}')
My name is NAME and my age is None
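f-strings can also hold expressions and control how numbers are displayed. As a small sketch, the code below computes the GC fraction of the sequence from dna.txt (typed in by hand here) and formats it in two ways.

```python
seq = 'ATATCGCGAA'
gc_fraction = (seq.count('G') + seq.count('C')) / len(seq)

summary = f'{seq} has {len(seq)} bases'     # expressions are allowed inside {}
rounded = f'GC fraction: {gc_fraction:.2f}'  # two decimal places
percent = f'GC percent: {gc_fraction:.1%}'   # formatted as a percentage

print(summary)
print(rounded)
print(percent)
```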

Sometimes it is useful to check if a certain pattern is present within a string. Here is one way to do this:

print('hi' in 'this is a test')
print('ahoy' in 'this is a test')
True
False
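Note that in matches anywhere inside the string: the first check is True because 'hi' sits inside the word 'this'. When the position of the match matters, the string methods startswith(), endswith(), and find() may be a better fit, as sketched below.

```python
text = 'this is a test'

inside = 'hi' in text                 # True: 'hi' is inside 'this'
starts = text.startswith('this')      # True
ends = text.endswith('.txt')          # False
where = text.find('hi')               # index where the first match starts
missing = text.find('ahoy')           # -1 means "not found"

print(inside, starts, ends, where, missing)
```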

Exercises

Convert all nucleotides in ‘dna.txt’ to lowercase.

# your code here

Use f-strings to append a poly-A tail to each sequence in seq_list (the sequences read from ‘three_seq.txt’).

# your code here

Reading multiple files

To go from reading a single file to reading multiple files, use a for loop.

First, we need to get all the file names in a directory. We can do this using the os library.

import os
data_dir = 'gene_sets'
files = os.listdir(data_dir)
files
['HALLMARK_INFLAMMATORY_RESPONSE.v7.5.1.grp',
 'HALLMARK_APOPTOSIS.v7.5.1.grp',
 'HALLMARK_G2M_CHECKPOINT.v7.5.1.grp',
 'HALLMARK_TNFA_SIGNALING_VIA_NFKB.v7.5.1.grp',
 'HALLMARK_PI3K_AKT_MTOR_SIGNALING.v7.5.1.grp']

We want to read the hallmark gene sets. However, a data directory can also contain ‘junk’ files that we do not want to read. This often happens when a program stores temporary files alongside your data; Excel is one example.

We will keep only the files with the extension we want.

gene_set_files = []
for file in files:
    if '.grp' in file:
        gene_set_files.append(file)
gene_set_files
['HALLMARK_INFLAMMATORY_RESPONSE.v7.5.1.grp',
 'HALLMARK_APOPTOSIS.v7.5.1.grp',
 'HALLMARK_G2M_CHECKPOINT.v7.5.1.grp',
 'HALLMARK_TNFA_SIGNALING_VIA_NFKB.v7.5.1.grp',
 'HALLMARK_PI3K_AKT_MTOR_SIGNALING.v7.5.1.grp']

To do this, we had to initialize an empty list and use an if statement nested within a for loop, appending to the list only when a file name included the extension we want to read.

This is a lot of boilerplate. However, there is a more compact way: list comprehensions.

gene_set_files = [f for f in files if ('.grp' in f)]
gene_set_files
['HALLMARK_INFLAMMATORY_RESPONSE.v7.5.1.grp',
 'HALLMARK_APOPTOSIS.v7.5.1.grp',
 'HALLMARK_G2M_CHECKPOINT.v7.5.1.grp',
 'HALLMARK_TNFA_SIGNALING_VIA_NFKB.v7.5.1.grp',
 'HALLMARK_PI3K_AKT_MTOR_SIGNALING.v7.5.1.grp']

List comprehensions are just shorthand for a for loop that populates a list. You can use them to shorten straightforward for loops, like the one above.
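A note on the filter used above: the test '.grp' in f is True whenever '.grp' appears anywhere in the file name. If you want to match the extension only, str.endswith() is a stricter option. The file names in this sketch are made up, apart from the first one.

```python
# Made-up directory listing; only the first name is a real gene set file.
files = [
    'HALLMARK_APOPTOSIS.v7.5.1.grp',
    'HALLMARK_APOPTOSIS.v7.5.1.grp.bak',
    'notes.txt',
]

contains = [f for f in files if '.grp' in f]          # matches anywhere in the name
ends = [f for f in files if f.endswith('.grp')]       # matches only at the end

print(contains)
print(ends)
```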

Here are some more examples of list comprehensions:

# string manipulation
[f'This is a {string}' for string in ['mouse', 'moose', 'pipette']]
['This is a mouse', 'This is a moose', 'This is a pipette']
# arithmetic
numbers = [1, 2, 3, 4, 5]
mean = sum(numbers)/len(numbers)
[number - mean for number in numbers]
[-2.0, -1.0, 0.0, 1.0, 2.0]

When in doubt, you can always write a list comprehension as a full for loop.
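As a quick illustration of that equivalence, the sketch below writes the same filter-and-transform step both ways and checks that the results agree. The sequences are toy values.

```python
seqs = ['ATCAG', 'CT', 'AGAAGG']   # toy sequences

# List comprehension: lengths of the sequences longer than 3 bases.
lengths = [len(s) for s in seqs if len(s) > 3]

# The same logic written as a full for loop.
lengths_loop = []
for s in seqs:
    if len(s) > 3:
        lengths_loop.append(len(s))

print(lengths)
print(lengths == lengths_loop)
```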

Now, we will use a list comprehension to read all the gene sets into a list, using the file_reader() function that you wrote in the first section of the lesson. (Until file_reader() is completed, it returns None, which is why None is printed below.)

gene_sets = [file_reader(f'{data_dir}/{f}') for f in gene_set_files]
print(gene_sets[0])
None

Here is what the code might look like without using a list comprehension:

gene_sets = []  # initialize empty list of output
for file in gene_set_files:
    fname = f'{data_dir}/{file}'  # f-string to get full path to file
    with open(fname) as handle:
        tmp = handle.read().splitlines()  # temporary variable containing text
    gene_sets.append(tmp)  # add tmp to output list
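A variation on this loop is sketched below. It builds each path with os.path.join() instead of an f-string, and it sorts the file names, since os.listdir() returns them in arbitrary order. The directory and the two tiny .grp files are created on the fly with made-up contents, so the sketch does not depend on the lesson data.

```python
import os
import tempfile

# Build a throwaway directory with two tiny, made-up gene set files.
data_dir = tempfile.mkdtemp()
toy_sets = {
    'SET_A.grp': 'SET_A\nurl_placeholder\nGENE1\nGENE2\n',
    'SET_B.grp': 'SET_B\nurl_placeholder\nGENE3\n',
}
for name, text in toy_sets.items():
    with open(os.path.join(data_dir, name), 'w') as handle:
        handle.write(text)

gene_sets = []
for name in sorted(os.listdir(data_dir)):    # sorted() gives a fixed order
    fname = os.path.join(data_dir, name)     # joins directory and file name
    with open(fname) as handle:
        gene_sets.append(handle.read().splitlines())

print(gene_sets)
```

os.path.join() inserts the correct path separator for the operating system, which can help if the same notebook is later run outside Colab.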

Each gene list contains the gene set name and a URL as its first two elements. The rest of the list consists of the genes in the set. We want to filter the lists so that they contain only the genes. We can use a list comprehension to keep the third through the last element of each gene list (index 2 onward, since Python counts from 0):

gene_set_names = [gene_list[0] for gene_list in gene_sets]  # extract names
gene_sets = [gene_list[2:] for gene_list in gene_sets]  # keep only the genes
gene_set_names
['HALLMARK_INFLAMMATORY_RESPONSE',
 'HALLMARK_APOPTOSIS',
 'HALLMARK_G2M_CHECKPOINT',
 'HALLMARK_TNFA_SIGNALING_VIA_NFKB',
 'HALLMARK_PI3K_AKT_MTOR_SIGNALING']
print(gene_sets)
[['ABCA1', 'ABI1', 'ACVR1B', 'ACVR2A', 'ADGRE1', 'ADM', 'ADORA2B', 'ADRM1', 'AHR', 'APLNR', 'AQP9', 'ATP2A2', 'ATP2B1', 'ATP2C1', 'AXL', 'BDKRB1', 'BEST1', 'BST2', 'BTG2', 'C3AR1', 'C5AR1', 'CALCRL', 'CCL17', 'CCL2', 'CCL20', 'CCL22', 'CCL24', 'CCL5', 'CCL7', 'CCR7', 'CCRL2', 'CD14', 'CD40', 'CD48', 'CD55', 'CD69', 'CD70', 'CD82', 'CDKN1A', 'CHST2', 'CLEC5A', 'CMKLR1', 'CSF1', 'CSF3', 'CSF3R', 'CX3CL1', 'CXCL10', 'CXCL11', 'CXCL6', 'CXCL8', 'CXCL9', 'CXCR6', 'CYBB', 'DCBLD2', 'EBI3', 'EDN1', 'EIF2AK2', 'EMP3', 'EREG', 'F3', 'FFAR2', 'FPR1', 'FZD5', 'GABBR1', 'GCH1', 'GNA15', 'GNAI3', 'GP1BA', 'GPC3', 'GPR132', 'GPR183', 'HAS2', 'HBEGF', 'HIF1A', 'HPN', 'HRH1', 'ICAM1', 'ICAM4', 'ICOSLG', 'IFITM1', 'IFNAR1', 'IFNGR2', 'IL10', 'IL10RA', 'IL12B', 'IL15', 'IL15RA', 'IL18', 'IL18R1', 'IL18RAP', 'IL1A', 'IL1B', 'IL1R1', 'IL2RB', 'IL4R', 'IL6', 'IL7R', 'INHBA', 'IRAK2', 'IRF1', 'IRF7', 'ITGA5', 'ITGB3', 'ITGB8', 'KCNA3', 'KCNJ2', 'KCNMB2', 'KIF1B', 'KLF6', 'LAMP3', 'LCK', 'LCP2', 'LDLR', 'LIF', 'LPAR1', 'LTA', 'LY6E', 'LYN', 'MARCO', 'MEFV', 'MEP1A', 'MET', 'MMP14', 'MSR1', 'MXD1', 'MYC', 'NAMPT', 'NDP', 'NFKB1', 'NFKBIA', 'NLRP3', 'NMI', 'NMUR1', 'NOD2', 'NPFFR2', 'OLR1', 'OPRK1', 'OSM', 'OSMR', 'P2RX4', 'P2RX7', 'P2RY2', 'PCDH7', 'PDE4B', 'PDPN', 'PIK3R5', 'PLAUR', 'PROK2', 'PSEN1', 'PTAFR', 'PTGER2', 'PTGER4', 'PTGIR', 'PTPRE', 'PVR', 'RAF1', 'RASGRP1', 'RELA', 'RGS1', 'RGS16', 'RHOG', 'RIPK2', 'RNF144B', 'ROS1', 'RTP4', 'SCARF1', 'SCN1B', 'SELE', 'SELENOS', 'SELL', 'SEMA4D', 'SERPINE1', 'SGMS2', 'SLAMF1', 'SLC11A2', 'SLC1A2', 'SLC28A2', 'SLC31A1', 'SLC31A2', 'SLC4A4', 'SLC7A1', 'SLC7A2', 'SPHK1', 'SRI', 'STAB1', 'TACR1', 'TACR3', 'TAPBP', 'TIMP1', 'TLR1', 'TLR2', 'TLR3', 'TNFAIP6', 'TNFRSF1B', 'TNFRSF9', 'TNFSF10', 'TNFSF15', 'TNFSF9', 'TPBG', 'VIP'], ['ADD1', 'AIFM3', 'ANKH', 'ANXA1', 'APP', 'ATF3', 'AVPR1A', 'BAX', 'BCAP31', 'BCL10', 'BCL2L1', 'BCL2L10', 'BCL2L11', 'BCL2L2', 'BGN', 'BID', 'BIK', 'BIRC3', 'BMF', 'BMP2', 'BNIP3L', 'BRCA1', 'BTG2', 'BTG3', 'CASP1', 
'CASP2', 'CASP3', 'CASP4', 'CASP6', 'CASP7', 'CASP8', 'CASP9', 'CAV1', 'CCNA1', 'CCND1', 'CCND2', 'CD14', 'CD2', 'CD38', 'CD44', 'CD69', 'CDC25B', 'CDK2', 'CDKN1A', 'CDKN1B', 'CFLAR', 'CLU', 'CREBBP', 'CTH', 'CTNNB1', 'CYLD', 'DAP', 'DAP3', 'DCN', 'DDIT3', 'DFFA', 'DIABLO', 'DNAJA1', 'DNAJC3', 'DNM1L', 'DPYD', 'EBP', 'EGR3', 'EMP1', 'ENO2', 'ERBB2', 'ERBB3', 'EREG', 'ETF1', 'F2', 'F2R', 'FAS', 'FASLG', 'FDXR', 'FEZ1', 'GADD45A', 'GADD45B', 'GCH1', 'GNA15', 'GPX1', 'GPX3', 'GPX4', 'GSN', 'GSR', 'GSTM1', 'GUCY2D', 'H1-0', 'HGF', 'HMGB2', 'HMOX1', 'HSPB1', 'IER3', 'IFITM3', 'IFNB1', 'IFNGR1', 'IGF2R', 'IGFBP6', 'IL18', 'IL1A', 'IL1B', 'IL6', 'IRF1', 'ISG20', 'JUN', 'KRT18', 'LEF1', 'LGALS3', 'LMNA', 'LUM', 'MADD', 'MCL1', 'MGMT', 'MMP2', 'NEDD9', 'NEFH', 'PAK1', 'PDCD4', 'PDGFRB', 'PEA15', 'PLAT', 'PLCB2', 'PLPPR4', 'PMAIP1', 'PPP2R5B', 'PPP3R1', 'PPT1', 'PRF1', 'PSEN1', 'PSEN2', 'PTK2', 'RARA', 'RELA', 'RETSAT', 'RHOB', 'RHOT2', 'RNASEL', 'ROCK1', 'SAT1', 'SATB1', 'SC5D', 'SLC20A1', 'SMAD7', 'SOD1', 'SOD2', 'SPTAN1', 'SQSTM1', 'TAP1', 'TGFB2', 'TGFBR3', 'TIMP1', 'TIMP2', 'TIMP3', 'TNF', 'TNFRSF12A', 'TNFSF10', 'TOP2A', 'TSPO', 'TXNIP', 'VDAC2', 'WEE1', 'XIAP'], ['ABL1', 'AMD1', 'ARID4A', 'ATF5', 'ATRX', 'AURKA', 'AURKB', 'BARD1', 'BCL3', 'BIRC5', 'BRCA2', 'BUB1', 'BUB3', 'CASP8AP2', 'CBX1', 'CCNA2', 'CCNB2', 'CCND1', 'CCNF', 'CCNT1', 'CDC20', 'CDC25A', 'CDC25B', 'CDC27', 'CDC45', 'CDC6', 'CDC7', 'CDK1', 'CDK4', 'CDKN1B', 'CDKN2C', 'CDKN3', 'CENPA', 'CENPE', 'CENPF', 'CHAF1A', 'CHEK1', 'CHMP1A', 'CKS1B', 'CKS2', 'CTCF', 'CUL1', 'CUL3', 'CUL4A', 'CUL5', 'DBF4', 'DDX39A', 'DKC1', 'DMD', 'DR1', 'DTYMK', 'E2F1', 'E2F2', 'E2F3', 'E2F4', 'EFNA5', 'EGF', 'ESPL1', 'EWSR1', 'EXO1', 'EZH2', 'FANCC', 'FBXO5', 'FOXN3', 'G3BP1', 'GINS2', 'GSPT1', 'H2AX', 'H2AZ1', 'H2AZ2', 'H2BC12', 'HIF1A', 'HIRA', 'HMGA1', 'HMGB3', 'HMGN2', 'HMMR', 'HNRNPD', 'HNRNPU', 'HOXC10', 'HSPA8', 'HUS1', 'ILF3', 'INCENP', 'JPT1', 'KATNA1', 'KIF11', 'KIF15', 'KIF20B', 'KIF22', 'KIF23', 'KIF2C', 'KIF4A', 
'KIF5B', 'KMT5A', 'KNL1', 'KPNA2', 'KPNB1', 'LBR', 'LIG3', 'LMNB1', 'MAD2L1', 'MAP3K20', 'MAPK14', 'MARCKS', 'MCM2', 'MCM3', 'MCM5', 'MCM6', 'MEIS1', 'MEIS2', 'MKI67', 'MNAT1', 'MT2A', 'MTF2', 'MYBL2', 'MYC', 'NASP', 'NCL', 'NDC80', 'NEK2', 'NOLC1', 'NOTCH2', 'NSD2', 'NUMA1', 'NUP50', 'NUP98', 'NUSAP1', 'ODC1', 'ODF2', 'ORC5', 'ORC6', 'PAFAH1B1', 'PBK', 'PDS5B', 'PLK1', 'PLK4', 'PML', 'POLA2', 'POLE', 'POLQ', 'PRC1', 'PRIM2', 'PRMT5', 'PRPF4B', 'PTTG1', 'PTTG3P', 'PURA', 'RACGAP1', 'RAD21', 'RAD23B', 'RAD54L', 'RASAL2', 'RBL1', 'RBM14', 'RPA2', 'RPS6KA5', 'SAP30', 'SFPQ', 'SLC12A2', 'SLC38A1', 'SLC7A1', 'SLC7A5', 'SMAD3', 'SMARCC1', 'SMC1A', 'SMC2', 'SMC4', 'SNRPD1', 'SQLE', 'SRSF1', 'SRSF10', 'SRSF2', 'SS18', 'STAG1', 'STIL', 'STMN1', 'SUV39H1', 'SYNCRIP', 'TACC3', 'TENT4A', 'TFDP1', 'TGFB1', 'TLE3', 'TMPO', 'TNPO2', 'TOP1', 'TOP2A', 'TPX2', 'TRA2B', 'TRAIP', 'TROAP', 'TTK', 'UBE2C', 'UBE2S', 'UCK2', 'UPF1', 'WRN', 'XPO1', 'YTHDC1'], ['ABCA1', 'ACKR3', 'AREG', 'ATF3', 'ATP2B1', 'B4GALT1', 'B4GALT5', 'BCL2A1', 'BCL3', 'BCL6', 'BHLHE40', 'BIRC2', 'BIRC3', 'BMP2', 'BTG1', 'BTG2', 'BTG3', 'CCL2', 'CCL20', 'CCL4', 'CCL5', 'CCN1', 'CCND1', 'CCNL1', 'CCRL2', 'CD44', 'CD69', 'CD80', 'CD83', 'CDKN1A', 'CEBPB', 'CEBPD', 'CFLAR', 'CLCF1', 'CSF1', 'CSF2', 'CXCL1', 'CXCL10', 'CXCL11', 'CXCL2', 'CXCL3', 'CXCL6', 'DDX58', 'DENND5A', 'DNAJB4', 'DRAM1', 'DUSP1', 'DUSP2', 'DUSP4', 'DUSP5', 'EDN1', 'EFNA1', 'EGR1', 'EGR2', 'EGR3', 'EHD1', 'EIF1', 'ETS2', 'F2RL1', 'F3', 'FJX1', 'FOS', 'FOSB', 'FOSL1', 'FOSL2', 'FUT4', 'G0S2', 'GADD45A', 'GADD45B', 'GCH1', 'GEM', 'GFPT2', 'GPR183', 'HBEGF', 'HES1', 'ICAM1', 'ICOSLG', 'ID2', 'IER2', 'IER3', 'IER5', 'IFIH1', 'IFIT2', 'IFNGR2', 'IL12B', 'IL15RA', 'IL18', 'IL1A', 'IL1B', 'IL23A', 'IL6', 'IL6ST', 'IL7R', 'INHBA', 'IRF1', 'IRS2', 'JAG1', 'JUN', 'JUNB', 'KDM6B', 'KLF10', 'KLF2', 'KLF4', 'KLF6', 'KLF9', 'KYNU', 'LAMB3', 'LDLR', 'LIF', 'LITAF', 'MAFF', 'MAP2K3', 'MAP3K8', 'MARCKS', 'MCL1', 'MSC', 'MXD1', 'MYC', 'NAMPT', 'NFAT5', 'NFE2L2', 
'NFIL3', 'NFKB1', 'NFKB2', 'NFKBIA', 'NFKBIE', 'NINJ1', 'NR4A1', 'NR4A2', 'NR4A3', 'OLR1', 'PANX1', 'PDE4B', 'PDLIM5', 'PER1', 'PFKFB3', 'PHLDA1', 'PHLDA2', 'PLAU', 'PLAUR', 'PLEK', 'PLK2', 'PLPP3', 'PMEPA1', 'PNRC1', 'PPP1R15A', 'PTGER4', 'PTGS2', 'PTPRE', 'PTX3', 'RCAN1', 'REL', 'RELA', 'RELB', 'RHOB', 'RIPK2', 'RNF19B', 'SAT1', 'SDC4', 'SERPINB2', 'SERPINB8', 'SERPINE1', 'SGK1', 'SIK1', 'SLC16A6', 'SLC2A3', 'SLC2A6', 'SMAD3', 'SNN', 'SOCS3', 'SOD2', 'SPHK1', 'SPSB1', 'SQSTM1', 'STAT5A', 'TANK', 'TAP1', 'TGIF1', 'TIPARP', 'TLR2', 'TNC', 'TNF', 'TNFAIP2', 'TNFAIP3', 'TNFAIP6', 'TNFAIP8', 'TNFRSF9', 'TNFSF9', 'TNIP1', 'TNIP2', 'TRAF1', 'TRIB1', 'TRIP10', 'TSC22D1', 'TUBB2A', 'VEGFA', 'YRDC', 'ZBTB10', 'ZC3H12A', 'ZFP36'], ['ACACA', 'ACTR2', 'ACTR3', 'ADCY2', 'AKT1', 'AKT1S1', 'AP2M1', 'ARF1', 'ARHGDIA', 'ARPC3', 'ATF1', 'CAB39', 'CAB39L', 'CALR', 'CAMK4', 'CDK1', 'CDK2', 'CDK4', 'CDKN1A', 'CDKN1B', 'CFL1', 'CLTC', 'CSNK2B', 'CXCR4', 'DAPP1', 'DDIT3', 'DUSP3', 'E2F1', 'ECSIT', 'EGFR', 'EIF4E', 'FASLG', 'FGF17', 'FGF22', 'FGF6', 'GNA14', 'GNGT1', 'GRB2', 'GRK2', 'GSK3B', 'HRAS', 'HSP90B1', 'IL2RG', 'IL4', 'IRAK4', 'ITPR2', 'LCK', 'MAP2K3', 'MAP2K6', 'MAP3K7', 'MAPK1', 'MAPK10', 'MAPK8', 'MAPK9', 'MAPKAP1', 'MKNK1', 'MKNK2', 'MYD88', 'NCK1', 'NFKBIB', 'NGF', 'NOD1', 'PAK4', 'PDK1', 'PFN1', 'PIK3R3', 'PIKFYVE', 'PIN1', 'PITX2', 'PLA2G12A', 'PLCB1', 'PLCG1', 'PPP1CA', 'PPP2R1B', 'PRKAA2', 'PRKAG1', 'PRKAR2A', 'PRKCB', 'PTEN', 'PTPN11', 'RAC1', 'RAF1', 'RALB', 'RIPK1', 'RIT1', 'RPS6KA1', 'RPS6KA3', 'RPTOR', 'SFN', 'SLA', 'SLC2A1', 'SMAD2', 'SQSTM1', 'STAT2', 'TBK1', 'THEM4', 'TIAM1', 'TNFRSF1A', 'TRAF2', 'TRIB3', 'TSC2', 'UBE2D3', 'UBE2N', 'VAV3', 'YWHAB']]

Exercises

Use string manipulation and a for loop or list comprehension to get rid of the version numbers and file extensions in gene_set_files. The output should be identical to gene_set_names.
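
As a hint, here is a minimal sketch on a single, made-up file name. The name `HALLMARK_APOPTOSIS.v7.5.1.grp` is only an assumption about what an entry of gene_set_files might look like, so check your own file names before relying on this pattern. Applying the idea to every file is left for the exercise.

```python
# Hypothetical file name; the real entries of gene_set_files may follow a different pattern.
example_file = "HALLMARK_APOPTOSIS.v7.5.1.grp"

# split(".") breaks the string at every period; the first piece is the gene set name.
example_name = example_file.split(".")[0]
print(example_name)
```

This only works when the part of the name you want to keep contains no periods. If your file names include a directory, strip it first (for example with `os.path.basename`).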

# your code here

We don’t actually need to have ‘HALLMARK_’ before every gene set name. Use string manipulation and a for loop or list comprehension to remove it.
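
One possible approach, sketched on a single made-up name (`HALLMARK_APOPTOSIS` is used here purely for illustration), is to check for the prefix and slice it off. Extending this to the whole list is left for the exercise.

```python
prefix = "HALLMARK_"
example_name = "HALLMARK_APOPTOSIS"  # hypothetical gene set name

# Only slice off the prefix if the name really starts with it.
if example_name.startswith(prefix):
    short_name = example_name[len(prefix):]
else:
    short_name = example_name
print(short_name)
```

In Python 3.9 and later, `example_name.removeprefix(prefix)` does the same thing in one call.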

# your code here

Visualization: pulling it all together

Now that we’ve read the gene sets, we can use them to score how strongly each cell expresses the genes in each set. Then, we can visualize these scores in different cell populations.

Note: this section will use some packages and code that you have not encountered. That’s okay; you don’t have to understand all of it. The goal is simply to show how you might use these files as part of a larger project.
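
To make the idea of a gene set score concrete, here is a simplified sketch in plain Python. The cells, genes, and expression values are all made up, and the score is just the mean expression of the set's genes in each cell. This is an assumption chosen for illustration rather than the exact method scanpy uses; as far as we understand, `sc.tl.score_genes` additionally subtracts the average expression of a reference set of genes.

```python
from statistics import mean

# Toy expression values (made up): one dictionary of gene -> expression per cell.
expression = {
    "cell_1": {"IL6": 5.0, "TNF": 3.0, "MKI67": 0.0},
    "cell_2": {"IL6": 0.0, "TNF": 1.0, "MKI67": 4.0},
}

# Hypothetical gene sets; 'CCL2' is deliberately missing from the toy data.
toy_gene_sets = {"INFLAMMATION": ["IL6", "TNF", "CCL2"], "PROLIFERATION": ["MKI67"]}


def simple_score(cell_expression, genes):
    """Mean expression of the genes in the set that were measured in this cell."""
    present = [gene for gene in genes if gene in cell_expression]
    if not present:
        return float("nan")  # no gene from the set was measured
    return mean(cell_expression[gene] for gene in present)


# Build a nested dictionary: cell -> gene set name -> score.
scores = {
    cell: {name: simple_score(values, genes) for name, genes in toy_gene_sets.items()}
    for cell, values in expression.items()
}
print(scores)
```

Here `cell_1` gets a high INFLAMMATION score and `cell_2` a high PROLIFERATION score, which is the kind of contrast the plots in this section are meant to reveal.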

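import scanpy as sc

# Load a small, preprocessed example data set of immune cells (PBMCs)
data = sc.datasets.pbmc3k_processed()
# Score every cell for each gene set; each score is stored in data.obs under the gene set name
for gene_set, gene_set_name in zip(gene_sets, gene_set_names):
    sc.tl.score_genes(data, gene_set, score_name=gene_set_name)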
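# Compute a PCA, then color the cells by cluster label and by each gene set score
sc.tl.pca(data)
# vmin/vmax clip each color scale at the 1.5th and 97.5th percentiles
sc.pl.pca(data, color=['louvain'] + gene_set_names, ncols=2,
          vmin='p1.5', vmax='p97.5')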

The gene sets helped us visualize which cellular processes are active in different immune cell types. For example, we find that NK and dendritic cells have the highest inflammatory response scores, while CD4 T cells have the highest expression of G2M checkpoint genes.