Week 12: Complexity and Performance

POP77001 Computer Programming for Social Scientists

Tom Paskhalis

Overview

  • Algorithms
  • Computational complexity
  • Big-O notation
  • Code optimisation
  • Benchmarking

Complexity

Algorithms

[Flowchart: Calculate Median — input array (a); sort a; calculate length (n); find midpoint (m); if the remainder of dividing n by 2 equals 1, return element m of a; otherwise return the mean of elements m and m+1 of a.]

Algorithm

  • Finite list of well-defined instructions that take input and produce output.
  • Consists of a sequence of simple steps that start from input, follow some control flow and have a stopping rule.
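
As an illustration, the flowchart above translates into a few lines of R (calc_median is a name chosen here; R’s built-in median() does the same job):

calc_median <- function(a) {
  a <- sort(a)                    # Sort a
  n <- length(a)                  # Calculate length (n)
  m <- ceiling(n / 2)             # Find midpoint (m)
  if (n %% 2 == 1) {              # Does the remainder of dividing n by 2 equal 1?
    return(a[m])                  # Yes: return element m of a
  }
  return((a[m] + a[m + 1]) / 2)   # No: return mean of elements m and m+1 of a
}
calc_median(c(3, 1, 2))     # 2
calc_median(c(4, 1, 3, 2))  # 2.5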

Complexity

  • Conceptual complexity
    • Structural sophistication of a program
  • Computational complexity
    • Resources (time/space) required to finish a program
  • Often there is some trade-off between the two
  • Reducing computational complexity often comes at the cost of increased conceptual complexity

How Long Does It Take to Run?

Linear search using exhaustive enumeration.

In R:

linear_search <- function(v, x) {
  n <- length(v)
  for (i in seq_len(n)) {
    if (v[i] == x) {
      return(TRUE)
    }
  }
  return(FALSE)
}

In Python:

def linear_search(v, x):
    n = len(v)
    for i in range(n):
        if v[i] == x:
            return True
    return False
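
Both implementations behave identically; for example, calling the R version with arbitrary test values:

linear_search(c(3, 1, 4, 1, 5), 4)  # TRUE
linear_search(c(3, 1, 4, 1, 5), 9)  # FALSE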

Benchmarking

Benchmarking in R

# Best case (running time is independent of the length of vector)
system.time(linear_search(1:1e6, 1))
   user  system elapsed 
  0.003   0.000   0.003 
# Average case (middle element for vector of length 1M)
system.time(linear_search(1:1e6, 5e5))
   user  system elapsed 
   0.02    0.00    0.02 
# Worst case (last element for vector of length 1M)
system.time(linear_search(1:1e6, 1e6))
   user  system elapsed 
  0.037   0.000   0.036 
# Even worse case (last element for vector of length 1B)
system.time(linear_search(1:1e9, 1e9))
   user  system elapsed 
 37.812   0.003  37.822 

Limitations of Benchmarking

  • Depends on many factors:
    • Computer hardware
    • Input size
    • Programming language used
  • Which of the benchmarking cases is the most useful?
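
Single runs of system.time() are also noisy; repeating each measurement gives more stable estimates. A minimal sketch using the microbenchmark package (assuming it is installed):

# Repeated timings average out machine noise
library(microbenchmark)
microbenchmark(
  best    = linear_search(1:1e5, 1),
  average = linear_search(1:1e5, 5e4),
  worst   = linear_search(1:1e5, 1e5),
  times   = 20
)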

More General Approach

Worst-case Scenario

Anything that can go wrong will go wrong.

Edward Murphy

  • In defensive design it is often helpful to think about the worst-case scenario.
  • This sets an upper bound on the execution time of a program.
  • However, this still depends on the input size.

Number of Steps

  • A useful heuristic is the number of steps that a program takes.
linear_search <- function(v, x) {
  n <- length(v) # 2 steps (call to length() function and assignment `<-`)
  # n steps + 1 step (for seq_len(n) call)
  for (i in seq_len(n)) {
    # 3n steps (n steps for `[`(v, i), another n steps for `==` and n calls to `if`)
    if (v[i] == x) {
      return(TRUE) # 1 step
    }
  }
  return(FALSE) # 1 step
}
  • In total: \(4n + 4\) steps.
  • If the input vector \(v\) has length \(1000\) (\(n = 1000\)), this function will execute roughly \(4004\) steps.

Calculating Number of Steps

  • Consider the previous example: \(4n + 4\).
  • As \(n\) grows larger, the extra \(4\) steps can be ignored.
  • Multiplicative constants can certainly make a difference within a given implementation.
  • But when comparing algorithms, the difference between \(2n\) and \(4n\) is usually negligible relative to differences in growth rates.
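
A quick numerical check of how the additive constant washes out:

n <- c(10, 1e3, 1e6)
(4 * n + 4) / (4 * n)  # 1.1, 1.001, 1.000001: the extra 4 steps quickly become irrelevant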

Alternative Algorithms

# Binary search for sorted sequences
binary_search <- function(v, x) {
  low <- 1
  high <- length(v)
  while (low <= high) {
    # Calculate mid-point (similar to median)
    m <- (low + high) %/% 2
    if (v[m] < x) {
      low <- m + 1
    } else if (v[m] > x) {
      high <- m - 1
    } else {
      return(TRUE)
    }
  }
  return(FALSE)
}
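
A quick sanity check (note that binary search presumes the input vector is sorted):

binary_search(1:10, 7)        # TRUE
binary_search(1:10, 11)       # FALSE
binary_search(c(9, 2, 4), 9)  # FALSE: wrong answer, as the input is not sorted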

Comparing Algorithms: Benchmarking

system.time(linear_search(1:1e6, 1e6))
   user  system elapsed 
  0.039   0.000   0.038 
system.time(binary_search(1:1e6, 1e6))
   user  system elapsed 
  0.006   0.000   0.006 
system.time(1e6 %in% 1:1e6)
   user  system elapsed 
  0.003   0.003   0.006 

Comparing Algorithms: N Steps

linear_search <- function(v, x) {
  n <- length(v)
  for (i in seq_len(n)) {
    if (v[i] == x) {
      print(paste0("Number of iterations: ", as.character(i)))
      return(i) # return(TRUE)
    }
  }
  return(i) # return(FALSE)
}
binary_search <- function(v, x) {
  low <- 1
  high <- length(v)
  iters <- 1
  while (low <= high) {
    m <- (low + high) %/% 2
    if (v[m] < x) {
      low <- m + 1
    } else if (v[m] > x) {
      high <- m - 1
    } else {
      print(paste0("Number of iterations: ", as.character(iters)))
      return(iters) # return(TRUE)
    }
    iters <- iters + 1
  }
  return(iters) # return(FALSE)
}

Comparing Algorithms: N Steps

ls_1e3 <- linear_search(1:1e3, 1e3)
[1] "Number of iterations: 1000"
ls_1e6 <- linear_search(1:1e6, 1e6)
[1] "Number of iterations: 1000000"
bs_1e3 <- binary_search(1:1e3, 1e3)
[1] "Number of iterations: 10"
bs_1e6 <- binary_search(1:1e6, 1e6)
[1] "Number of iterations: 20"

Big-O Notation

  • Performance on small inputs and in best-case scenarios is usually of limited interest.
  • What matters is the worst-case performance on progressively larger inputs.
  • In other words, upper bound (or order of growth).
  • Big O (Order of growth) is an asymptotic notation for describing such growth.
  • For example, in the case of linear search, \(O(n)\): running time increases linearly in the size of the input.
  • What matters is the growth rate of the highest-order term.
  • All additive and multiplicative constants can be ignored.
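
Applied to the earlier step count: \(4n + 4 \in O(n)\), since \(4n + 4 \leq 8n\) for all \(n \geq 1\); only the linear term determines the order of growth.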

Common Complexity Cases

Varieties of Big-O

Big-O notation     Running time
\(O(1)\)           constant
\(O(\log n)\)      logarithmic
\(O(n)\)           linear
\(O(n \log n)\)    log-linear
\(O(n^c)\)         polynomial
\(O(c^n)\)         exponential
\(O(n!)\)          factorial

Constant Complexity: \(O(1)\)

  • Running time of a program is bounded by a constant, which is independent of the input size
get_len <- function(v) {
    # Internally, length just returns the 'length' attribute of an R object
    n <- length(v)
    return(n)
}
system.time(get_len(1:1e3))
   user  system elapsed 
      0       0       0 
system.time(get_len(1:1e6))
   user  system elapsed 
  0.002   0.000   0.001 
system.time(get_len(1:1e9))
   user  system elapsed 
      0       0       0 

Logarithmic Complexity: \(O(\log n)\)

  • E.g. binary search is \(O(\log n)\): running time increases as the logarithm of the input size
  • The base of the logarithm is irrelevant, as it can easily be re-arranged, \(\log_2(n) = \log_2(10) \times \log_{10}(n)\), and constants are ignored
  • For base \(2\), each time the input size doubles the algorithm performs one additional step, since \(\log_2(2n) = \log_2(n) + 1\)
bs_10 <- binary_search(1:10, 10)
[1] "Number of iterations: 4"
bs_20 <- binary_search(1:20, 20)
[1] "Number of iterations: 5"
bs_100 <- binary_search(1:100, 100)
[1] "Number of iterations: 7"
bs_200 <- binary_search(1:200, 200)
[1] "Number of iterations: 8"
bs_1000 <- binary_search(1:1000, 1000)
[1] "Number of iterations: 10"

Linear Complexity: \(O(n)\)

  • For linear search, \(O(n)\): running time increases as a linear function of the input size.
  • Generally, iteration over all elements of a sequence is \(O(n)\).
  • But it doesn’t have to be a loop (e.g. a recursive factorial implementation, sketched below).
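
A recursive factorial makes \(n\) function calls without any loop, so it is still linear in \(n\):

# Recursive factorial: no loop, but n calls, hence O(n)
fact <- function(n) {
  if (n <= 1) return(1)
  n * fact(n - 1)
}
fact(5)  # 120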
ls_10 <- linear_search(1:10, 10)
[1] "Number of iterations: 10"
ls_20 <- linear_search(1:20, 20)
[1] "Number of iterations: 20"
ls_100 <- linear_search(1:100, 100)
[1] "Number of iterations: 100"
ls_200 <- linear_search(1:200, 200)
[1] "Number of iterations: 200"
ls_1000 <- linear_search(1:1000, 1000)
[1] "Number of iterations: 1000"

Comparing \(O(1)\), \(O(\log n)\), \(O(n)\)

x <- c(10, 20, 100, 200, 1000)
constant_complexity <- rep(10, length(x))
log_complexity <- c(bs_10, bs_20, bs_100, bs_200, bs_1000)
linear_complexity <- c(ls_10, ls_20, ls_100, ls_200, ls_1000)

plot(x, linear_complexity, type = "n", main = "O(1) vs O(log(n)) vs O(n)", xlab = "Input size", ylab = "Time")
lines(x, constant_complexity, type = "l", col = "black", lty = 1, lwd = 3)
lines(x, log_complexity, type = "l", col = "red", lty = 2, lwd = 3)
lines(x, linear_complexity, type = "l", col = "blue", lty = 3, lwd = 3)
legend("topleft", legend = c("Constant(10)", "Logarithmic", "Linear"),
       col = c("black", "red", "blue"), lty = 1:3, cex = 1.5)

Log-Linear Complexity: \(O(n \log n)\)

  • A more complicated complexity case.
  • The bound involves a product of two terms.
  • An important case, as many practical problems are solved in log-linear time.
  • Many sorting algorithms (e.g. merge sort, timsort, built-in sorted() in Python).
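
To make this concrete, a minimal merge sort sketch: the recursion is \(\log_2 n\) levels deep, with an \(O(n)\) merge at each level, giving \(O(n \log n)\) overall:

merge_sort <- function(v) {
  n <- length(v)
  if (n <= 1) return(v)
  m <- n %/% 2
  left <- merge_sort(v[1:m])         # log2(n) levels of recursion...
  right <- merge_sort(v[(m + 1):n])
  # ...with an O(n) merge at each level
  merged <- numeric(n)
  i <- 1
  j <- 1
  for (k in seq_len(n)) {
    if (i <= length(left) && (j > length(right) || left[i] <= right[j])) {
      merged[k] <- left[i]
      i <- i + 1
    } else {
      merged[k] <- right[j]
      j <- j + 1
    }
  }
  merged
}
merge_sort(c(5, 3, 8, 1, 9, 2))  # 1 2 3 5 8 9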

Polynomial Complexity: \(O(n^c)\)

  • Most common case is quadratic complexity: \(O(n^2)\)
  • Nested loops typically result in polynomial complexity.
# Check whether any element of one vector occurs in another
# (NB: the function returns as soon as the first element of v1 is
# matched in v2; with disjoint inputs, as in the demos below, the
# nested loop performs the full n * m comparisons, the worst case)
is_subset <- function(v1, v2) {
  n <- length(v1)
  m <- length(v2)
  iters <- 1
  for (i in seq_len(n)) {
    matched <- FALSE
    for (j in seq_len(m)) {
      iters <- iters + 1
      if (v1[i] == v2[j]) {
        matched <- TRUE
        break
      }
    }
    if (isTRUE(matched)) {
      print(paste0("Number of iterations: ", as.character(iters)))
      return(iters) # return(TRUE)
    }
  }
  print(paste0("Number of iterations: ", as.character(iters)))
  return(iters) # return(FALSE)
}

Polynomial Complexity

# For simplicity of analysis let's assume that
# lengths of 2 input vectors are about the same
is_10 <- is_subset(11:21, 1:10)
[1] "Number of iterations: 111"
is_20 <- is_subset(21:41, 1:20)
[1] "Number of iterations: 421"
is_100 <- is_subset(101:201, 1:100)
[1] "Number of iterations: 10101"
is_200 <- is_subset(201:401, 1:200)
[1] "Number of iterations: 40201"
is_1000 <- is_subset(1001:2001, 1:1000)
[1] "Number of iterations: 1001001"

Comparing \(O(n)\), \(O(n \log n)\), \(O(n^2)\)

log_linear_complexity <- x * log(x)
quadratic_complexity <- c(is_10, is_20, is_100, is_200, is_1000)

plot(x, quadratic_complexity, type = "n", main = "O(n) vs O(nlog(n)) vs O(n^2)", xlab = "Input size", ylab = "Time")
lines(x, linear_complexity, type = "l", col = "black", lty = 1, lwd = 3)
lines(x, log_linear_complexity, type = "l", col = "red", lty = 2, lwd = 3)
lines(x, quadratic_complexity, type = "l", col = "blue", lty = 3, lwd = 3)
legend("topleft", legend = c("Linear", "Log-Linear", "Quadratic"),
       col = c("black", "red", "blue"), lty = 1:3, cex = 1.5)

Exponential and Factorial Complexity: \(O(c^n)\) and \(O(n!)\)

  • Many real-life problems have only exponential- or factorial-time exact solutions
  • E.g. the travelling salesman problem (given a list of points and the distances between them, what is the shortest path through all of them?)
  • However, general solutions with such complexity are usually impractical
  • For these problems, either approximate solutions are used
  • Or the problem is solved for specific cases

Comparing \(O(n^2)\), \(O(10^n)\), \(O(n!)\)

x <- 1:25
quadratic_complexity <- x ^ 2
exponential_complexity <- 10 ^ x
factorial_complexity <- factorial(x)

plot(x, factorial_complexity, type = "n", main = "O(n^2) vs O(10^n) vs O(n!)", xlab = "Input size", ylab = "Time")
lines(x, quadratic_complexity, type = "l", col = "black", lty = 1, lwd = 3)
lines(x, exponential_complexity, type = "l", col = "red", lty = 2, lwd = 3)
lines(x, factorial_complexity, type = "l", col = "blue", lty = 3, lwd = 3)
legend("topleft", legend = c("Quadratic", "Exponential", "Factorial"),
       col = c("black", "red", "blue"), lty = 1:3, cex = 1.5)

Time Complexity in Practice

  • Check for loops and recursion.
  • Think about function calls: what is the complexity of the underlying implementations? (See the sketch below.)
  • The complexity cases presented here are not exhaustive and are often ‘idealised’.
  • Complexity can take any form (e.g. one of the sort() methods in R is \(O(n^{4/3})\)).
  • Running time can be a function of more than one input (e.g. the is_subset() function above).
  • Big-O ignores constants (multiplicative and additive), but they matter within a given implementation.
  • There is a trade-off between conceptual and computational complexity (more efficient solutions are often harder to read and debug).
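
As the second point above suggests, function calls can hide complexity. A sketch (count_dupes is a hypothetical function written for illustration):

# Hidden complexity: each `%in%` call scans a growing prefix of v,
# so this innocent-looking single loop is O(n^2), not O(n)
count_dupes <- function(v) {
  count <- 0
  for (i in seq_along(v)) {
    if (v[i] %in% v[seq_len(i - 1)]) count <- count + 1
  }
  count
}
count_dupes(c(1, 2, 1, 3, 2))  # 2
# The hash-based built-in does the same in a single pass:
sum(duplicated(c(1, 2, 1, 3, 2)))  # 2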

Code Optimisation Tradeoff

[xkcd comic on the code optimisation trade-off]

Next

  • Tutorial: Benchmarking, analysis of function complexity and performance
  • Final project: Due by 23:59 on Friday, 13th December (submission on Blackboard)