Lesson 2 - Data structures, Control flows, and Python Ecosystems

Open In Colab

Open In Colab

Learning objectives:

Students will be able to use data structures and control flows to make algorithms.

Specific coding skills:

  • Understanding basic data structures (list, dictionary, tuple, set)
  • Using if/else statements
  • Using loops (while/for)
  • Understanding python ecosystem and how packages work

Introduction

Data structures are basically just that - they are structures which can hold some data together. In other words, they are used to store a collection of related data. These are particularly helpful when working with experimental data sets. There are four built-in data structures in Python - list, tuple, dictionary and set.

Data structures

There are four built-in data structures in python: lists,tuples, sets, and dictionary. In this lesson, we will learn about each data structures.

Lists

A list is a data structure that holds an ordered collection of items i.e. you can store a sequence of items in a list. This is easy to imagine if you can think of a shopping list where you have a list of items to buy, except that you probably have each item on a separate line in your shopping list whereas in Python you put commas in between them.

Here are important properties of lists:

  • Lists are ordered – Lists remember the order of items inserted.
  • Accessed by index – Items in a list can be accessed using an index.
  • Lists can contain any sort of object – It can be numbers, strings, tuples and even other lists.
  • Lists are changeable (mutable) – You can change a list in-place, add new items, and delete or update existing items.

Let’s create a list containing five different genes. In Python, a list is created by placing elements inside square brackets [], separated by commas.

gene_list = ['EGFR', 'KRAS', 'MYC', 'RB', 'TP53']
print(gene_list)
['EGFR', 'KRAS', 'MYC', 'RB', 'TP53']

List operations

Now let’s learn what we can do with lists.

Accessing list elements: indexing/slicing

We can access list items using index opreator. In python, indices start at 0.

indexing

Let’s try to access the second gene in our gene_list.

gene_list[1]
'KRAS'

Python also has ‘negative indices’ which can be very convinient when we want to access last i-th item. Last item in the list can be accessed with index of -1. Let’s try to access the second to last item in gene_list.

gene_list[-2]
'RB'

We can also access a range of items in al list using the slicing operator :. Let’s try to access the first three items in gene_list. Note that the start index is inclusive, but end index is exclusive. We will see later when this can be useful.

# first three items
print(gene_list[0:3])
# beginning index (0) can be ommitteed
print(gene_list[:3])
['EGFR', 'KRAS', 'MYC']
['EGFR', 'KRAS', 'MYC']

What should we do if we want to access the last three items?

# last three items

List methods

Now, we are interested in a new cancer gene, PTEN, and want to add this to our gene list. How should we do this? There are a few ways to do this, and one way is using a list method called, append.

gene_list.append('PTEN')
gene_list
['EGFR', 'KRAS', 'MYC', 'RB', 'TP53', 'PTEN']

Oop, we were supposed add BRAF instead of PTEN! What should we do? Since lists are mutable, we can repalce PTEN with BRAF.

gene_list[-1]
'PTEN'
gene_list[-1] = 'BRAF'
gene_list
['EGFR', 'KRAS', 'MYC', 'RB', 'TP53', 'BRAF']

There are other useful list methods that we can use to alterate and describe lists.

Some useful functions and list methods

  • list.append(x): Add an item x to the end of the list.

  • list.remove(x): Remove the first item from the list whose value is equal to x.

  • list.count(x): Return the number of times x appears in the list.

  • reversed(l): Reverse the order of items in list l and return the new list.

  • sorted(l): Sort the order of values in list l (default is in ascending order) and return the new list.

  • len(l): Return the length of list l

Tuples

Tuples are ordered collection of values. Tuples are similar to lists with one major difference. They are are immutable, meaning that we cannot change, add, or remove items after they are created. We an create a tuple by placing comma-seperated values inside ().

gene_tuple = ('EGFR', 'KRAS', 'MYC', 'RB', 'TP53')

Similarly to lists, we can use indexing and slicing to access items

print(gene_tuple[2])  # access third item
print(gene_tuple[-2:])  # access last two items
MYC
('RB', 'TP53')

Can we replace KRAS with NRAS for tuples?

gene_tuple[1] = 'NRAS'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 gene_tuple[1] = 'NRAS'

TypeError: 'tuple' object does not support item assignment

This raises error because tuples are immutable.

Sets

Python set is an unordered collection of unique items. They are commonly used for computing mathematical operations such as union, intersection, difference, and symmetric difference.

set operations

The important properties of Python sets are as follows:

  • Sets are unordered – Items stored in a set aren’t kept in any particular order.
  • Set items are unique – Duplicate items are not allowed.
  • Sets are unindexed – You cannot access set items by referring to an index.
  • Sets are changeable (mutable) – They can be changed in place, can grow and shrink on demand.

Sets can be created by placing items comma-seperated values inside {}

We have upregulated genes in tumor tissues compared to normal tissues from two patients. We would like to know if there is a shared upregulated genes.

patient1 = {'ABCC1', 'BRCA1', 'BRCA2', 'HER2'}
patient2 = {'BRCA1', 'HER2', 'ERCC1'}
# set intersection
print(patient1.intersection(patient2))  # use intersection method
print(patient1 & patient2)  # use & operator
{'BRCA1', 'HER2'}
{'BRCA1', 'HER2'}

Dictionaries

A dictionary is like an address-book where you can find the address or contact details of a person by knowing only his/her name i.e. we associate keys (name) with values (details). Note that the key must be unique just like you cannot find out the correct information if you have two persons with the exact same name.

Note that you can use only immutable objects (like strings) for the keys of a dictionary but you can use either immutable or mutable objects for the values of the dictionary.

Pairs of keys and values are specified in a dictionary by using the notation d = {key1 : value1, key2 : value2 }. Notice that the key-value pairs are separated by a colon and the pairs are separated themselves by commas and all this is enclosed in a pair of curly braces.

The following example of a dictionary might be useful if you wanted to keep track of ages of patients in a clinical trial.

agesDict = {'Karen P.' : 53, 'Jessica M.': 47, 'David G.' : 45, 'Susan K.' : 57, 'Eric O.' : 50}
print(agesDict)
{'Karen P.': 53, 'Jessica M.': 47, 'David G.': 45, 'Susan K.': 57, 'Eric O.': 50}

We can access a person’s age (value) using his/her name (key). Let’s find out Eric O.’s age.

agesDict['Eric O.']
50

A new patient is enrolled into the clinical trial. Her name is Hannah H. and her age is 39. We can add a new item to the dictoary.

agesDict['Hannah H.'] = 39
print(agesDict)
{'Karen P.': 53, 'Jessica M.': 47, 'David G.': 45, 'Susan K.': 57, 'Eric O.': 50, 'Hannah H.': 39}

Data Structures Summary

Now we have learned the four basic python data structures: list, tuple, set, and dictionary. We have jsut touched the surface of these data structures. To learn more about these data structures and how to use them, please refer to the references!

Exercises Part 1

The following exercises will help you better understand data structures.

  1. Make a list containing the following numbers 1, 4, 25, 7, 9, 12, 15, 16, and 21. Name the list num_list
# Q1
  1. Find the following information about the list you made in Q1: length, minimum, and maximum. You may need to google to find functions that can help you.
# Q2
# length
# minimum
# maximum
  1. Make a dictionary that describes the price of five medications. Name the dictionary med_dict >* Lisinopril: 23.07 >* Gabapentin: 86.27 >* Sildenafil: 169.94 >* Amoxicillin: 17.76 >* Prednisone: 13.81
# Q4
  1. Use med_dict to calculate how much it will cost if a patien tis treated with Lisinopril and Prednisone.
# Q5

Control flows

If…else statment

Decision making is required when we want to execute a code only if a certain condition is satisfied.

The if…elif…else statement is used in Python for decision making. We can use these statments to execute a block of code only when the condition is true. The if…elif…else statement follows this syntax. Note, elif is abbreviation for else if.

if condition1:
    statment1
elif condition2:
    statement2
else:
    statement3

ifelse syntax

Let’s think of a dose-finding clinical trial. We first treat three patients with dose x. iIf no patients shows toxic side effects, we increase the dose. If one patient shows toxicity, we treat another three patients to learn more. If more than one patients show toxicity, we stop at that dose. Let’s make this into python code. You can change the value of n_toxic to see how the script works.

n_toxic = 2
if n_toxic == 0:
    print("increase dose")
elif n_toxic == 1:
    print("treat another three patients")
else:
    print("stop")
stop

Loops

There are two types of loops in python: while loops and for loops. Loops are useful when we want to performe the same task repetitively.

While loop

A while loop is used when you want to perform a task indefinitely, until a particular condition is met. For instance, we want to enroll new patients to a clinical trial until we have 30 patients.

n_patients = 1
while n_patients <= 30:
    print("enrolled patient", n_patients)
    n_patients += 1
enrolled patient 1
enrolled patient 2
enrolled patient 3
enrolled patient 4
enrolled patient 5
enrolled patient 6
enrolled patient 7
enrolled patient 8
enrolled patient 9
enrolled patient 10
enrolled patient 11
enrolled patient 12
enrolled patient 13
enrolled patient 14
enrolled patient 15
enrolled patient 16
enrolled patient 17
enrolled patient 18
enrolled patient 19
enrolled patient 20
enrolled patient 21
enrolled patient 22
enrolled patient 23
enrolled patient 24
enrolled patient 25
enrolled patient 26
enrolled patient 27
enrolled patient 28
enrolled patient 29
enrolled patient 30

What do you think will hapen if we use while True?

for loop

For loops are used for iterating over a sequence of objects i.e. go through each item in a sequence. Iterable of items include lists, tuples, dictonaries, sets, and strings. For loop has the following syntax:

for var in iterable:
    statement

Let’s look at an example of a for loop.

for i in [1, 2, 3, 4, 5]:
    print(i)
1
2
3
4
5

We can also use the range function to do the same thing. range function takes three parameters: start(default 0), end, and steps (default 1). Like slicing, the start is inclusive but the end is exclusive.

for i in range(0, 6):
    print(i)
0
1
2
3
4
5
# 0 can be omitted.
for i in range(6):
    print(i)
0
1
2
3
4
5

Exercise Part 2

Before going into exercise, we need to learn a handy operation +=.

a = 0
a = a + 1

is equivalent to

a = 0
a += 1
  1. The most exciting part about control flows is that they can be nested to make more complex algorithms. Let’s look at the complementary DNA sequences that we discussed in lesson 1. Solve this problem using for and if.

Create two new variables, comp_oligo1 and comp_oligo2, that are the complementary DNA sequences of oligo1 and oligo2 (hint: A <-> T and G <-> C)

oligo1 = 'GCGCTCAAT'
oligo2 = 'TACTAGGCA'
# backbone
comp_oligo1 = ''
for nuc in oligo1:
    if nuc == #something:
        comp_oligo1 += #something
    ...
  Cell In[32], line 4
    if nuc == #something:
              ^
SyntaxError: invalid syntax
  1. Let’s go back to the dose-finding clinical trials. We have a list of doses that we want to test. dose_list = [1, 2, 3, 5, 8, 13]. We want to increase the dose until we hit the maximal tolerated dose (MTD). For simplicity, we will increase dose when there is less than two patients out of three patients with toxicity and stop otherwise. The last dose before at least two patients have toxicity is declared MTD. We will look into the future and assume that we know how many patients will have toxicity at each dose tox_list = [0, 0, 1, 1, 2, 2]. Find the MTD using while.
dose_list = [1, 2, 3, 5, 8, 13]
tox_list = [0, 0, 1, 1, 2, 2]

Hopefully, you can see how importing data from and exporting data to csv or other delimiter separated files will be helpful for research. Now it’s time to practice your skills with built-in data structures and data frames!

Sources and References

https://www.programiz.com/python-programming/list

https://python.swaroopch.com/data_structures.html

https://docs.python.org/3/tutorial/datastructures.html

https://www.learnbyexample.org/python-tuple/