Week 12 Tutorial: Complexity and Performance

POP77001 Computer Programming for Social Scientists

Benchmarking in R

  • In the lecture we used the system.time() function to analyse function performance.
  • While conveniently built-in, its main drawback is that it is rather coarse.
  • Although useful for detecting large performance gaps, it often doesn’t capture more subtle differences.
  • The reason is that it runs the code only once and reports the result in seconds.
  • Here we will use the microbenchmark package and its identically named function to time function calls.
  • Remember to print out the results of microbenchmark(); otherwise, the times of the individual runs are returned (see the short example after the output below).

library("microbenchmark")
# Here we run the same function call 1000 times
# and time how long each run takes
microbenchmark::microbenchmark(
  mean(rnorm(n = 1000)),
  times = 1000
)
Unit: microseconds
                  expr    min      lq     mean  median      uq     max neval
 mean(rnorm(n = 1000)) 50.274 51.4335 53.18632 52.1055 54.1065 104.676  1000
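For example, if we assign the result instead of printing it, the returned object holds the times of the individual runs (in nanoseconds), which print() or summary() aggregate. A minimal sketch:

res <- microbenchmark::microbenchmark(mean(rnorm(n = 1000)), times = 100)
# Individual run times in nanoseconds
head(res$time)
# The same aggregated table that printing the object shows
summary(res)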

Exercise: Compare Performance in R

  • Consider the data frame with 50 different variables created below.
  • We want to know the mean of each of those variables.
  • There are two principal ways of calculating them:
    • Using the apply() function.
    • Using the built-in colMeans() function.
  • Apply each of those functions to calculate the means.
  • Benchmark the running time using both system.time() and the microbenchmark package.
  • What do you find? (One possible setup is sketched after the data frame creation below.)

set.seed(1234)
# Here we create a data frame of 1000 observations of 50 variables
# where each variable is a random draw from a normal distribution with mean
# drawn from a uniform distribution between 0 and 10 and standard deviation 1
df <- data.frame(mapply(
  function(x) cbind(rnorm(n = 1000, mean = x, sd = 1)),
  runif(n = 50, min = 0, max = 10)
))
dim(df)
[1] 1000   50
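One possible setup for the comparison (a sketch; assumes the df created above):

# Column means via apply(): loops over columns at the R level
system.time(apply(df, 2, mean))
# Column means via colMeans(): vectorised and implemented in C
system.time(colMeans(df))
# microbenchmark offers finer resolution for the same comparison
microbenchmark::microbenchmark(
  apply(df, 2, mean),
  colMeans(df),
  times = 100
)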

Benchmarking in Python

  • It is possible to measure the timing of operations in Python with the built-in time module.
  • But this requires recording the time before and after a call and then taking the difference.
  • Python’s built-in timeit module provides a better alternative, as it does this automatically and more.
  • It behaves similarly to microbenchmark in R in that it averages over many runs.
  • It is also available in IPython (and, as a result, in Jupyter) as a magic command that can be called with %timeit.
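A minimal sketch of both approaches using only the standard library (exact timings will vary by machine):

import time
import timeit

# Manual timing with `time`: one coarse measurement in seconds
start = time.time()
sum(i * i for i in range(100000))
print(time.time() - start)

# `timeit` runs the statement many times (here 100) and returns the total time
print(timeit.timeit("sum(i * i for i in range(100000))", number=100))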

Switching kernels in Jupyter

  • In order to continue with the Python part of the exercises you need to switch your kernel.
  • Go to Kernel, then Change kernel, and pick Python from the drop-down menu.

import random
import numpy as np
import pandas as pd
# Random numbers in Python can be generated either using
# the built-in `random` module or the external `numpy`
# module (which underlies a lot of `pandas` operations)
random.gauss(mu = 0, sigma = 1)
0.07450223926264972
# Unlike random.gauss(), numpy returns an array instead of a single float
np.random.randn(1)
array([0.10102651])

# Let's start our benchmarking experiments from looking
# at random number generation in Python.
# First let's draw a sample of 1M using both the built-in `random` module
# and `numpy`'s methods
N = 1000000
# We can use the `for _` expression to indicate that the returned value is discarded
%timeit [random.gauss(mu = 0, sigma = 1) for _ in range(N)]
# `numpy` is an order of magnitude faster than the built-in module
%timeit np.random.normal(size = N)
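Note that the %timeit magic works only in IPython/Jupyter. Outside of it, the timeit module can time the same calls; a sketch, assuming the N and imports defined above:

import timeit
# Total time of 10 repetitions of each approach, in seconds
print(timeit.timeit("[random.gauss(0, 1) for _ in range(N)]", globals=globals(), number=10))
print(timeit.timeit("np.random.normal(size=N)", globals=globals(), number=10))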

Exercise: Compare Performance in Python

  • Now let’s replicate the calculation of some summary statistics in a pandas DataFrame.
  • As in the case of R, there are two principal ways of doing this:
    • First, iterating over the columns of the data set with a list comprehension and applying some function to each column (e.g. mean() from the statistics module).
    • Alternatively, applying one of the built-in statistical summary methods (check Week 10 for the list).
  • Apply each of those approaches to the data frame below.
  • How do these two approaches compare? (One possible setup is sketched after the data frame creation below.)

from statistics import mean
# Setting a seed using `numpy` is slightly more involved than with the `random` module (or R)
# We first need to create a random number generator object, which we can then use
# to generate random draws from distributions that are consistent across re-runs
rng = np.random.default_rng(1234)

# Here we are, essentially, replicating the data frame creation from the R example above:
# each variable is a random draw from a normal distribution with mean
# drawn from a uniform distribution between 0 and 10 and standard deviation 1
df2 = pd.DataFrame(np.concatenate([
    rng.normal(loc = x, scale = 1, size = (1000, 1))
    for x
    in rng.uniform(low = 0, high = 10, size = 50)
], axis = 1))
df2.shape
(1000, 50)
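One possible setup for the comparison (a sketch; assumes df2 from above):

# Iterating over columns with a list comprehension
%timeit [mean(df2[col]) for col in df2.columns]
# Using pandas' built-in summary method
%timeit df2.mean()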

Next

  • Final project: Due by 23:59 on Friday, 13th December (submission on Blackboard)