Week 5: Data Wrangling¶

Python for Social Data Science¶

Tom Paskhalis¶

Overview¶

  • Numerical analysis in Python
  • Tabular data
  • Pandas object types
  • Working with data frames in pandas
  • Data input and output

Numerical analysis in Python¶

  • Unlike some other programming languages (Julia, R, MATLAB), Python provides only bare-bones built-in functionality for numerical analysis.

  • E.g. no built-in matrix/array object type, limited mathematical and statistical functions
In [1]:
# Representing 3x3 matrix with list
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
In [2]:
# Subsetting 2nd row, 3rd element
mat[1][2]
Out[2]:
6
In [3]:
# Naturally, this representation breaks down rather quickly:
# multiplying a list by 2 repeats it instead of doubling each element
mat * 2
Out[3]:
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
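
For comparison, a minimal sketch of what elementwise multiplication takes with plain lists. The name `doubled` is introduced here purely for illustration.

```python
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Doubling every element requires an explicit nested list comprehension:
# the outer loop walks over rows, the inner loop over values in each row
doubled = [[value * 2 for value in row] for row in mat]
print(doubled)
```

This works, but every arithmetic operation has to be spelled out by hand in the same way.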

NumPy - numerical analysis in Python¶

  • The NumPy (Numerical Python) package provides the basis of numerical computing in Python:
    • multidimensional array
    • mathematical functions for arrays
    • array data I/O
    • linear algebra, RNG, FFT, ...
In [4]:
# Using 'as' lets us refer to the module by a shorter alias
# instead of typing its full name each time
import numpy as np
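
A brief sketch of a few of the capabilities listed above (mathematical functions, linear algebra, random number generation). The variable names and the seed value are illustrative choices, not part of the original slides.

```python
import numpy as np

# Mathematical function applied elementwise to an array
values = np.array([1.0, 4.0, 9.0])
roots = np.sqrt(values)

# Linear algebra: matrix product of a 2x2 matrix with the identity matrix
m = np.array([[1, 2],
              [3, 4]])
product = m @ np.eye(2)

# Random number generation with a fixed seed for reproducibility
rng = np.random.default_rng(seed=42)
draws = rng.normal(size=3)

print(roots)
print(product)
print(draws.shape)
```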

NumPy array¶

  • The multidimensional (N-dimensional) array object (aka ndarray) is the principal container for numerical datasets in Python.
  • It is the backbone of data frames, operating behind the scenes.
In [5]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
In [6]:
# Subsetting 2nd row, 3rd element
# (arr[1, 2] is the equivalent, more idiomatic NumPy form)
arr[1][2]
Out[6]:
6
In [7]:
arr * 2
Out[7]:
array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

Working with arrays¶

In [8]:
# Object type
type(arr)
Out[8]:
numpy.ndarray
In [9]:
# Array dimensionality
arr.ndim
Out[9]:
2
In [10]:
# Array shape (number of elements along each dimension)
arr.shape
Out[10]:
(3, 3)
In [11]:
# Calculating summary statistics on array
# axis indicates the dimension to aggregate over
# note that every list within a list is a row;
# axis = 0 averages down the rows, giving one mean per column
arr.mean(axis = 0)
Out[11]:
array([4., 5., 6.])
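
To make the role of `axis` concrete, here is a short sketch that contrasts the two axes of the same array. The names `col_means`, `row_means` and `overall` are introduced only for this example.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# axis=0 collapses the rows: one mean per column
col_means = arr.mean(axis=0)

# axis=1 collapses the columns: one mean per row
row_means = arr.mean(axis=1)

# Without axis, the mean is taken over all elements
overall = arr.mean()

print(col_means)
print(row_means)
print(overall)
```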

Array indexing and slicing¶

Source: Python for Data Analysis
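
A minimal sketch of common indexing and slicing patterns on the 3x3 array used earlier. The variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Single element: 2nd row, 3rd column
element = arr[1, 2]

# Entire first row
first_row = arr[0]

# Entire second column (all rows, column index 1)
second_col = arr[:, 1]

# Sub-array: first two rows, last two columns
block = arr[:2, 1:]

# Boolean indexing: all elements greater than 5, returned as a 1-D array
large = arr[arr > 5]

print(element, first_row, second_col, block, large, sep="\n")
```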

Tidy data¶

  • Tidy data is a specific subset of rectangular data, where:
    • Each variable is in a column
    • Each observation is in a row
    • Each value is in a cell

Source: R for Data Science
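
As an illustration of the tidy-data rules above, the sketch below reshapes a small "wide" table into tidy form with pandas. The data are made up for this example, and the column names (`country`, `year`, `value`) are assumptions rather than anything from the original slides.

```python
import pandas as pd

# Hypothetical "wide" table: the variable 'year' is spread across
# column names instead of being stored in its own column
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# melt() gathers the year columns so that each row is one
# country-year observation and each variable has its own column
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")

print(tidy)
```

After reshaping, each of the four country-year observations occupies its own row.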

Data preparation¶