Loops¶

Objective¶

  • Explain what for loops are normally used for.
  • Trace the execution of a simple (unnested) loop and correctly state the values of variables in each iteration.
  • Write for loops that use the Accumulator pattern to aggregate values.

A for loop executes commands once for each value in a collection.¶

  • Doing calculations on the values in a list one by one is as painful as working with pressure_001, pressure_002, etc.
  • A for loop tells Python to execute some statements once for each value in a list, a character string, or some other collection.
  • "for each thing in this group, do these operations"
In [1]:
for number in [2, 3, 5]:
    print(number)
2
3
5

The first line of the for loop must end with a colon, and the body must be indented.

A for loop is made up of a collection, a loop variable, and a body.¶

  • The collection, [2, 3, 5], is what the loop is being run on.
  • The body, print(number), specifies what to do for each value in the collection.
  • The loop variable, number, is what changes for each iteration of the loop.
    • The "current thing".

Loop variables can be called anything.¶

  • As with all variables, loop variables are:
    • Created on demand.
    • Meaningless: their names can be anything at all.
for kitten in [2, 3, 5]:
    print(kitten)

Use range to iterate over a sequence of numbers.¶

  • The built-in function range produces a sequence of numbers.
    • Not a list: the numbers are produced on demand to make looping over large ranges more efficient.
  • range(N) is the numbers 0..N-1
    • Exactly the legal indices of a list or character string of length N
In [2]:
print('a range is not a list: range(0, 3)')
for number in range(0, 3):
    print(number)
a range is not a list: range(0, 3)
0
1
2

Exercise¶

  1. Fill in the blanks in each of the programs below to produce the indicated result.
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)
  1. Reorder and properly indent the lines of code below so that they print a list with the cumulative sum of data. The result should be [1, 3, 5, 10].
cumulative.append(total)
for number in data:
cumulative = []
total = total + number
total = 0
print(cumulative)
data = [1,2,2,5]

Conditionals¶

Use if statements to control whether or not a block of code is executed.¶

  • An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
  • Structure is similar to a for statement:
    • First line opens with if and ends with a colon
    • Body containing one or more statements is indented (usually by 4 spaces)
In [3]:
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')

mass = 2.07
if mass > 3.0:
    print (mass, 'is large')
3.54 is large

Conditionals are often used inside loops.¶

  • Not much point using a conditional when we know the value (as above).
  • But useful when we have a collection to process.
In [4]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
3.54 is large
9.22 is large

Use else to execute a block of code when an if condition is not true.¶

  • else can be used following an if.
  • Allows us to specify an alternative to execute when the if branch isn't taken.
In [5]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
3.54 is large
2.07 is small
9.22 is large
1.86 is small
1.71 is small

Use elif to specify additional tests.¶

  • May want to provide several alternative choices, each with its own test.
  • Use elif (short for "else if") and a condition to specify these.
  • Always associated with an if.
  • Must come before the else (which is the "catch all").
In [6]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
3.54 is large
2.07 is small
9.22 is HUGE
1.86 is small
1.71 is small

Conditions are tested once, in order.¶

  • Python steps through the branches of the conditional in order, testing each in turn.
  • So ordering matters.
In [7]:
grade = 85
if grade >= 90:
    print('grade is A')
elif grade >= 80:
    print('grade is B')
elif grade >= 70:
    print('grade is C')
grade is B
  • Often use conditionals in a loop to "evolve" the values of variables.
In [8]:
velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
print('final velocity:', velocity)
0 : 10.0
moving too slow
1 : 20.0
moving too slow
2 : 30.0
moving too fast
3 : 25.0
moving too fast
4 : 20.0
moving too slow
final velocity: 30.0

Exercise¶

  1. What does this program print?
pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)
  1. Modify this program so that it only processes files with fewer than 50 records.
import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    ____:
        print(filename, len(contents))
  1. Modify this program so that it finds the largest and smallest values in the list no matter what the range of values originally is.
values = [...some test data...]
smallest, largest = None, None
for v in values:
    if ____:
        smallest, largest = v, v
    ____:
        smallest = min(____, v)
        largest = max(____, v)
print(smallest, largest)

Takeaway¶

  • Use if statements to control whether or not a block of code is executed.
  • Conditionals are often used inside loops.
  • Use else to execute a block of code when an if condition is not true.
  • Use elif to specify additional tests.
  • Conditions are tested once, in order.
  • Create a table showing variables' values to trace a program's execution.

Looping over datasets¶

Objective¶

  • Be able to read and write globbing expressions that match sets of files.
  • Use glob to create lists of files.
  • Write for loops to perform operations on files given their names in a list.

Use a for loop to process files given a list of their names.¶

  • A filename is a character string.
  • And lists can contain character strings.
In [9]:
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165876
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64

Use glob.glob to find sets of files whose names match a pattern.¶

  • In Unix, the term "globbing" means "matching a set of files with a pattern".
  • The most common patterns are:
    • * meaning "match zero or more characters"
    • ? meaning "match exactly one character"
  • Python's standard library contains the glob module to provide pattern matching functionality
  • The glob module contains a function also called glob to match file patterns
  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
  • Result is a (possibly empty) list of character strings.
In [11]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_all.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

Use glob and for to process batches of files.¶

  • Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
In [12]:
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_oceania.csv 10039.59564
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_asia.csv 331.0

Exercise¶

  1. Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

    1. data/gapminder_gdp_africa.csv
    2. data/gapminder_gdp_americas.csv
    3. data/gapminder_gdp_asia.csv
  1. Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart. Pandas will raise an error if it encounters non-numeric columns in a dataframe computation so you may need to either filter out those columns or tell pandas to ignore them.

Solution!¶

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`), 
    # and then remove the `.csv` extension from that string.
    # NOTE: the pathlib module covered in the next callout also offers
    # convenient abstractions for working with filesystem paths and could solve this as well:
    # from pathlib import Path
    # region = Path(filename).stem.split('_')[-1]
    region = filename.split('_')[-1][:-4]
    # extract the years from the columns of the dataframe 
    headings = dataframe.columns[1:]
    years = headings.str.split('_').str.get(1)
    # pandas raises errors when it encounters non-numeric columns in a dataframe computation
    # but we can tell pandas to ignore them with the `numeric_only` parameter
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
    # NOTE: another way of doing this selects just the columns with gdp in their name using the filter method
    # dataframe.filter(like="gdp").mean().plot(ax=ax, label=region)
# set the title and labels
ax.set_title('GDP Per Capita for Regions Over Time')
ax.set_xticks(range(len(years)))
ax.set_xticklabels(years)
ax.set_xlabel('Year')
plt.tight_layout()
plt.legend()
plt.show()

Dealing with File Paths¶

The pathlib module provides useful abstractions for file and path manipulation like returning the name of a file without the file extension. This is very useful when looping over files and directories. In the example below, we create a Path object and inspect its attributes.

In [13]:
from pathlib import Path

p = Path("data/gapminder_gdp_africa.csv")
print(p.parent)
print(p.stem)
print(p.suffix)
data
gapminder_gdp_africa
.csv

Takeaway¶

  • Use a for loop to process files given a list of their names.
  • Use glob.glob to find sets of files whose names match a pattern.
  • Use glob and for to process batches of files.

Continue¶

  1. Lists
  2. Loops
  3. Functions