Plotting¶
Objectives¶
- Create a time series plot showing a single data set.
- Create a scatter plot showing the relationship between two data sets.
matplotlib is the most widely used scientific plotting library in Python.¶
- We commonly use a sub-library called matplotlib.pyplot.
- The Jupyter Notebook will render plots inline by default.
In [6]:
import matplotlib.pyplot as plt
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
plt.show()
Plot data directly from a Pandas dataframe.¶
- We can also plot Pandas dataframes.
- Before plotting, we convert the column headings from a string to an integer data type, since they represent numerical values. We use str.replace() to remove the gdpPercap_ prefix and then astype(int) to convert the series of string values (['1952', '1957', ..., '2007']) to a series of integers: [1952, 1957, ..., 2007].
In [7]:
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
# Extract year from last 4 characters of each column name
# The current column names are structured as 'gdpPercap_(year)',
# so we want to keep the (year) part only for clarity when plotting GDP vs. years
# To do this we use replace(), which removes from the string the characters stated in the argument
# This method works on strings, so we use replace() from Pandas Series.str vectorized string functions
years = data.columns.str.replace('gdpPercap_', '')
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)
data.loc['Australia'].plot()
plt.show()
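The renaming step can be tried without the Gapminder file. The following is a minimal sketch on a made-up two-country table: the name toy and its values are hypothetical, and regex=False is added only to make explicit that the prefix is treated as literal text.

```python
import pandas as pd

# Hypothetical table that mimics the 'gdpPercap_(year)' column naming;
# the numbers are made up for illustration.
toy = pd.DataFrame(
    {"gdpPercap_1952": [100.0, 200.0], "gdpPercap_1957": [110.0, 220.0]},
    index=pd.Index(["A", "B"], name="country"),
)

# Strip the literal prefix, leaving each year as a string.
years = toy.columns.str.replace("gdpPercap_", "", regex=False)
print(years.tolist())

# Convert the year strings to integers and use them as the column labels.
toy.columns = years.astype(int)
print(toy.columns.tolist())
```

With integer column labels, the X axis of a later plot is treated as numeric rather than as a set of text categories.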
Select and transform data, then plot it.¶
- By default, DataFrame.plot plots with the rows as the X axis.
- We can transpose the data in order to plot multiple series.
In [8]:
data.T.plot()
plt.ylabel('GDP per capita')
Out[8]:
Text(0, 0.5, 'GDP per capita')
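To see what the transpose changes, here is a small sketch that draws the same made-up table both ways, side by side. The table toy and the axes names are hypothetical; passing ax= to DataFrame.plot is used here only to place the two plots in one figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical table: rows are countries, columns are years (made-up values).
toy = pd.DataFrame(
    {1952: [100.0, 200.0], 1957: [110.0, 220.0], 1962: [125.0, 240.0]},
    index=pd.Index(["A", "B"], name="country"),
)

fig, (ax_rows, ax_years) = plt.subplots(1, 2, figsize=(8, 3))

# Untransposed: countries along the X axis, one line per year column.
toy.plot(ax=ax_rows, title="Rows on the X axis")

# Transposed: years along the X axis, one line per country.
toy.T.plot(ax=ax_years, title="Transposed")
ax_years.set_ylabel("GDP per capita")

print(len(ax_rows.get_lines()), len(ax_years.get_lines()))
```

The transposed version is usually the one wanted for time series, since each country becomes its own line over the years.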
Many styles of plot are available.¶
In [19]:
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')
Out[19]:
Text(0, 0.5, 'GDP per capita')
In [13]:
plt.style.available
Out[13]:
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'petroff10', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
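Note that plt.style.use changes the style for every later plot in the session. If a style should apply to one figure only, one option is the plt.style.context context manager. The sketch below uses made-up bar values and falls back to the 'default' style if 'ggplot' is not installed.

```python
import matplotlib.pyplot as plt

# Remember one global setting so we can confirm it is restored afterwards.
facecolor_before = plt.rcParams["axes.facecolor"]

# Use 'ggplot' only if this matplotlib installation provides it.
style_name = "ggplot" if "ggplot" in plt.style.available else "default"

# The style applies only to figures created inside the 'with' block.
with plt.style.context(style_name):
    fig, ax = plt.subplots()
    ax.bar(["A", "B", "C"], [3, 5, 2])  # made-up values
    ax.set_ylabel("Value")

facecolor_after = plt.rcParams["axes.facecolor"]
print(facecolor_before == facecolor_after)
```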
Data can also be plotted by calling the matplotlib plot function directly.¶
- The command is plt.plot(x, y).
- The color and format of markers can also be specified as an additional optional argument, e.g., b- is a blue line, g-- is a green dashed line.
In [20]:
years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--')
plt.show()
Out[20]:
[<matplotlib.lines.Line2D at 0x11f2eca50>]
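A format string combines a color letter with a line style or marker code. The sketch below, on made-up values, draws three variants and reads the resulting settings back from the line objects. The variable names are hypothetical.

```python
import matplotlib.pyplot as plt

years = [1952, 1957, 1962, 1967]
values = [10.0, 12.0, 15.0, 19.0]  # made-up values

fig, ax = plt.subplots()

# 'b-'  : blue, solid line
(solid_blue,) = ax.plot(years, values, "b-")
# 'g--' : green, dashed line
(dashed_green,) = ax.plot(years, [v * 2 for v in values], "g--")
# 'ro'  : red circle markers, no connecting line
(red_circles,) = ax.plot(years, [v * 3 for v in values], "ro")

print(solid_blue.get_linestyle(), dashed_green.get_linestyle(), red_circles.get_marker())
```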
Can plot many sets of data together¶
In [21]:
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']
# Plot with differently-colored lines.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')
# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')
Out[21]:
Text(0, 0.5, 'GDP per capita ($)')
Adding a Legend¶
This can be done in matplotlib in two stages:
- Provide a label for each dataset in the figure,
- Instruct matplotlib to create the legend.
In [23]:
plt.plot(years, gdp_australia, label='Australia')
plt.plot(years, gdp_nz, label='New Zealand')
plt.legend()
plt.show()
- By default matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, this can be done with the loc= argument, e.g., to place the legend in the upper left corner of the plot, specify loc='upper left'.
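The two stages and the loc= argument can be checked together on made-up data. In this sketch the labels "Country A" and "Country B" are hypothetical, and the legend entries are read back from the legend object to confirm they match the labels passed to plot.

```python
import matplotlib.pyplot as plt

years = [1952, 1957, 1962]

fig, ax = plt.subplots()

# Stage 1: give each data set a label when plotting it (made-up values).
ax.plot(years, [10, 12, 15], label="Country A")
ax.plot(years, [8, 9, 11], label="Country B")

# Stage 2: ask matplotlib to build the legend, here pinned to the upper left.
legend = ax.legend(loc="upper left")

labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```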
Exercises¶
- Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)
- Modify the example in the notes to create a scatter plot showing the relationship between the minimum and maximum GDP per capita among the countries in Asia for each year in the data set. What relationship do you see (if any)?
- This short program creates a plot showing the correlation between GDP and life expectancy for 2007, normalizing marker size by population. Using online help and other resources, explain what each argument to plot does.
data_all = pd.read_csv('data/gapminder_all.csv', index_col='country')
data_all.plot(kind='scatter', x='gdpPercap_2007', y='lifeExp_2007', s=data_all['pop_2007']/1e6)
When using dataframes, data is often generated and plotted to screen in one line.
In addition to using plt.savefig, we can save a reference to the current figure
in a local variable (with plt.gcf) and call the savefig method on
that variable to save the figure to file.
data.plot(kind='bar')
fig = plt.gcf() # get current figure
fig.savefig('my_figure.png')
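The same pattern can be run end to end without the Gapminder data. This sketch plots a made-up table and writes the figure into a temporary directory so nothing in the working folder is overwritten. The dpi and bbox_inches arguments are optional extras, not something the lesson requires.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical one-column table (made-up values).
toy = pd.DataFrame({"value": [1.0, 3.0, 2.0]}, index=["a", "b", "c"])

toy.plot(kind="bar")
fig = plt.gcf()  # get the figure that the plot call just created

# Save into a temporary directory; the file extension selects the format.
out_path = os.path.join(tempfile.mkdtemp(), "my_figure.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")

print(os.path.getsize(out_path) > 0)
```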
Making your plots accessible¶
Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots.
- Always make sure your text is large enough to read. Use the fontsize parameter in xlabel, ylabel, title, and legend, and tick_params with labelsize to increase the text size of the numbers on your axes.
- Similarly, you should make your graph elements easy to see. Use s to increase the size of your scatterplot markers and linewidth to increase the sizes of your plot lines.
- Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the linestyle parameter lets you use different types of lines. For scatterplots, marker lets you change the shape of your points. If you're unsure about your colors, you can use Coblis or Color Oracle to simulate what your plots would look like to those with colorblindness.
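As a rough illustration of these suggestions, the sketch below applies larger text, thicker lines, and distinct line styles and markers to two made-up series. The specific sizes (16, 18, 14, and a line width of 3) are arbitrary choices for the example, not recommended values.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

# Distinguish the series by line style and marker shape, not only by color.
plt.plot(x, [1, 2, 3, 4], linestyle="-", marker="o", linewidth=3, label="Series A")
plt.plot(x, [4, 3, 2, 1], linestyle="--", marker="s", linewidth=3, label="Series B")

# Enlarge every piece of text on the figure.
plt.xlabel("X value", fontsize=16)
plt.ylabel("Y value", fontsize=16)
plt.title("Readable plot", fontsize=18)
plt.tick_params(labelsize=14)  # numbers along both axes
plt.legend(fontsize=14)

ax = plt.gca()  # current axes, kept for inspection
print(ax.xaxis.label.get_fontsize())
```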
Takeaway¶
- matplotlib is the most widely used scientific plotting library in Python.
- Plot data directly from a Pandas dataframe.
- Select and transform data, then plot it.
- Many styles of plot are available: see the Python Graph Gallery for more options.
- Can plot many sets of data together.