Week 3: Probability Distributions

POP88162 Introduction to Quantitative Research Methods

Tom Paskhalis

Department of Political Science, Trinity College Dublin

So Far

Code in R can be distributed as R Scripts and R Markdown.
Everything in R is an object, the references to which are established through assignment.
Vectors are the core data structure in R.
They hold only data of the same type and are used to represent individual variables.
A vector can be one of three main types (character, numeric, logical).
Lists can store data of different types (for example, multiple individual vectors).

Topics for Today

Probability functions in R
Cumulative distribution function
Quantiles
Random number generation

Normal Distribution

Standard Normal Distribution

How to Find an Arbitrary Value?

Statistical Tables

Areas under the Normal Curve
Z	.00	.01	.02	.03	.04	.05	.06	.07	.08	.09
-3.1	0.0010	0.0009	0.0009	0.0009	0.0008	0.0008	0.0008	0.0008	0.0007	0.0007
-3.0	0.0013	0.0013	0.0013	0.0012	0.0012	0.0011	0.0011	0.0011	0.0010	0.0010
-2.9	0.0019	0.0018	0.0018	0.0017	0.0016	0.0016	0.0015	0.0015	0.0014	0.0014
-2.8	0.0026	0.0025	0.0024	0.0023	0.0023	0.0022	0.0021	0.0021	0.0020	0.0019
-2.7	0.0035	0.0034	0.0033	0.0032	0.0031	0.0030	0.0029	0.0028	0.0027	0.0026
-2.6	0.0047	0.0045	0.0044	0.0043	0.0041	0.0040	0.0039	0.0038	0.0037	0.0036
-2.5	0.0062	0.0060	0.0059	0.0057	0.0055	0.0054	0.0052	0.0051	0.0049	0.0048
-2.4	0.0082	0.0080	0.0078	0.0075	0.0073	0.0071	0.0069	0.0068	0.0066	0.0064
-2.3	0.0107	0.0104	0.0102	0.0099	0.0096	0.0094	0.0091	0.0089	0.0087	0.0084
-2.2	0.0139	0.0136	0.0132	0.0129	0.0125	0.0122	0.0119	0.0116	0.0113	0.0110
-2.1	0.0179	0.0174	0.0170	0.0166	0.0162	0.0158	0.0154	0.0150	0.0146	0.0143
-2.0	0.0228	0.0222	0.0217	0.0212	0.0207	0.0202	0.0197	0.0192	0.0188	0.0183
-1.9	0.0287	0.0281	0.0274	0.0268	0.0262	0.0256	0.0250	0.0244	0.0239	0.0233
-1.8	0.0359	0.0351	0.0344	0.0336	0.0329	0.0322	0.0314	0.0307	0.0301	0.0294
-1.7	0.0446	0.0436	0.0427	0.0418	0.0409	0.0401	0.0392	0.0384	0.0375	0.0367
-1.6	0.0548	0.0537	0.0526	0.0516	0.0505	0.0495	0.0485	0.0475	0.0465	0.0455
-1.5	0.0668	0.0655	0.0643	0.0630	0.0618	0.0606	0.0594	0.0582	0.0571	0.0559
-1.4	0.0808	0.0793	0.0778	0.0764	0.0749	0.0735	0.0721	0.0708	0.0694	0.0681
-1.3	0.0968	0.0951	0.0934	0.0918	0.0901	0.0885	0.0869	0.0853	0.0838	0.0823
-1.2	0.1151	0.1131	0.1112	0.1093	0.1075	0.1056	0.1038	0.1020	0.1003	0.0985

Built-in Distributions

Instead of relying on statistical tables, we can use built-in R functions.
R provides functions for working with all common probability distributions.
We can calculate critical values and probabilities.
Conducting many statistical tests requires calculation of probabilities for continuous random variables.

Four Key Quantities

The four key quantities for a probability distribution are:
- Density or point probability
- Cumulative probability
- Quantiles
- Pseudo-random numbers
R has the tools for calculating those for a number of distributions!
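
As a minimal sketch, the four quantities can be computed for the standard normal distribution with the d, p, q and r function families that R provides. The object names below are illustrative only.

```r
# Four key quantities for the standard normal distribution
dens <- dnorm(0)    # density: height of the curve at x = 0
cum  <- pnorm(0)    # cumulative probability: P(X <= 0)
quan <- qnorm(0.5)  # quantile: value below which 50% of the distribution lies
draw <- rnorm(5)    # five pseudo-random draws (different on every run)
```

Swapping the suffix (for example `pois` or `unif` instead of `norm`) gives the same four quantities for another distribution.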

Density

For discrete random variables (discrete distributions) we can calculate the point probability (the probability that a random variable takes a specific value).
For continuous random variables the probability of any single value is zero, so we work with the probability density (the height of the curve); probabilities correspond to areas under the curve over a specific interval.
In R, functions that calculate the density or point probability start with d:

dnorm(x, mean = 0, sd = 1)
dchisq(x, df)
dexp(x)
dpois(x, lambda)
dt(x, df, ncp)
dunif(x)
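
The difference between a point probability and a density can be illustrated with a short sketch. The Poisson parameter (`lambda = 3`) and the interval (-1.96 to 1.96) are arbitrary choices made for illustration, and `integrate()` is used here only to show that probabilities are areas under the density curve.

```r
# Discrete case: dpois() returns an actual probability,
# here P(X = 2) for a Poisson distribution with lambda = 3
p_two <- dpois(2, lambda = 3)

# The same value from the Poisson formula lambda^x * exp(-lambda) / x!
p_two_manual <- 3^2 * exp(-3) / factorial(2)

# Continuous case: dnorm() returns the height of the curve, not a probability
height <- dnorm(0)

# A probability is the area under the density curve over an interval
area <- integrate(dnorm, lower = -1.96, upper = 1.96)$value
```

Here `p_two` and `p_two_manual` agree, and `area` is approximately 0.95.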

Density for Normal Distribution

x <- seq(-5, 5, 0.1)
plot(x, dnorm(x), type = "l")

Cumulative Distribution Function (CDF)

The CDF gives the probability of getting a value of \(x\) or less in a given distribution.
While density functions in R are used more for visualisations, CDFs are often used for calculating actual probabilities.
Corresponding R functions start with p (for probability):

pnorm(q, mean = 0, sd = 1)
pchisq(q, df)
pexp(q)
ppois(q, lambda)
pt(q, df, ncp)
punif(q)
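
Because the CDF returns the area to the left of a point, the probability of an interval can be obtained as a difference of two CDF values. The sketch below assumes illustrative parameters (`mean = 100`, `sd = 15`) that do not come from the lecture.

```r
# P(-1 <= X <= 1) for the standard normal distribution
p_within_one_sd <- pnorm(1) - pnorm(-1)

# The same logic for a normal distribution with assumed
# parameters mean = 100 and sd = 15: P(85 <= X <= 115)
p_between <- pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
```

Both intervals span one standard deviation either side of the mean, so the two probabilities coincide (roughly 0.68).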

How to Find an Arbitrary Value?

2 Ways

We note that the standard normal distribution (\(\mu = 0, \sigma = 1\)) is symmetric around \(0\).
- Thus, the area under the curve to the right of a point is the same as the area to the left of its mirror image on the other side of \(0\)
We note that the total area under the curve is \(1\).
- So we can calculate the area to the left of a given point and subtract it from \(1\) to get the area to its right

First Way

Let’s now calculate it in R using the pnorm() function.

pnorm(-1.33)

[1] 0.09175914

Note that calling pnorm() on the original point gives the area to the left of it, not to the right.

pnorm(1.33)

[1] 0.9082409

Second Way

Let’s now calculate it in R by subtracting the pnorm() result from \(1\).

1 - pnorm(1.33)

[1] 0.09175914

Note that both methods (as we would expect) produce identical results.

pnorm(-1.33)

[1] 0.09175914

R provides a 3rd way (with a helpful function argument).

pnorm(1.33, lower.tail = FALSE)

[1] 0.09175914

Finding an Arbitrary Value

Back to Statistical Table

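Analogous to the first way: the entry in row -1.3, column .03 is 0.0918, which matches pnorm(-1.33) rounded to four decimal places.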

Areas under the Normal Curve
Z	.00	.01	.02	.03	.04	.05	.06	.07	.08	.09
-3.1	0.0010	0.0009	0.0009	0.0009	0.0008	0.0008	0.0008	0.0008	0.0007	0.0007
-3.0	0.0013	0.0013	0.0013	0.0012	0.0012	0.0011	0.0011	0.0011	0.0010	0.0010
-2.9	0.0019	0.0018	0.0018	0.0017	0.0016	0.0016	0.0015	0.0015	0.0014	0.0014
-2.8	0.0026	0.0025	0.0024	0.0023	0.0023	0.0022	0.0021	0.0021	0.0020	0.0019
-2.7	0.0035	0.0034	0.0033	0.0032	0.0031	0.0030	0.0029	0.0028	0.0027	0.0026
-2.6	0.0047	0.0045	0.0044	0.0043	0.0041	0.0040	0.0039	0.0038	0.0037	0.0036
-2.5	0.0062	0.0060	0.0059	0.0057	0.0055	0.0054	0.0052	0.0051	0.0049	0.0048
-2.4	0.0082	0.0080	0.0078	0.0075	0.0073	0.0071	0.0069	0.0068	0.0066	0.0064
-2.3	0.0107	0.0104	0.0102	0.0099	0.0096	0.0094	0.0091	0.0089	0.0087	0.0084
-2.2	0.0139	0.0136	0.0132	0.0129	0.0125	0.0122	0.0119	0.0116	0.0113	0.0110
-2.1	0.0179	0.0174	0.0170	0.0166	0.0162	0.0158	0.0154	0.0150	0.0146	0.0143
-2.0	0.0228	0.0222	0.0217	0.0212	0.0207	0.0202	0.0197	0.0192	0.0188	0.0183
-1.9	0.0287	0.0281	0.0274	0.0268	0.0262	0.0256	0.0250	0.0244	0.0239	0.0233
-1.8	0.0359	0.0351	0.0344	0.0336	0.0329	0.0322	0.0314	0.0307	0.0301	0.0294
-1.7	0.0446	0.0436	0.0427	0.0418	0.0409	0.0401	0.0392	0.0384	0.0375	0.0367
-1.6	0.0548	0.0537	0.0526	0.0516	0.0505	0.0495	0.0485	0.0475	0.0465	0.0455
-1.5	0.0668	0.0655	0.0643	0.0630	0.0618	0.0606	0.0594	0.0582	0.0571	0.0559
-1.4	0.0808	0.0793	0.0778	0.0764	0.0749	0.0735	0.0721	0.0708	0.0694	0.0681
-1.3	0.0968	0.0951	0.0934	0.0918	0.0901	0.0885	0.0869	0.0853	0.0838	0.0823
-1.2	0.1151	0.1131	0.1112	0.1093	0.1075	0.1056	0.1038	0.1020	0.1003	0.0985
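
One way to check that the table and pnorm() agree is to rebuild a row of the table in R. The sketch below recomputes the row for Z = -1.3; the object names are illustrative.

```r
# z values for the row -1.3: -1.30, -1.31, ..., -1.39
z_row <- -1.3 - seq(0, 0.09, by = 0.01)

# Areas to the left of each z value, rounded to 4 decimals as in the table
row_values <- round(pnorm(z_row), 4)
```

The fourth element of `row_values` corresponds to column .03, that is, to Z = -1.33.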

Quantiles

We already discussed one quantile: the median (\(50\%\) quantile).
The quantile function can also be viewed as the inverse of the CDF: we know the probability (percentage) and want to find out which value corresponds to it.
Besides \(50\%\), other common values are \(25\%\) and \(75\%\) (quartiles).
Percentiles break up the distribution into 100 groups (\(1\%\), \(2\%\), …, \(100\%\)).
Quantiles do not have to be whole percentages (e.g. \(2.5\%\) quantile).
Functions that calculate quantiles in R start with q:

qnorm(p, mean = 0, sd = 1)
qchisq(p, df)
qexp(p)
qpois(p, lambda)
qt(p, df, ncp)
qunif(p)
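
The inverse relationship between the quantile function and the CDF can be checked directly. This is a small sketch; the probability 0.9 is an arbitrary choice.

```r
# Quartiles of the standard normal distribution (25%, 50%, 75%)
quartiles <- qnorm(c(0.25, 0.5, 0.75))

# qnorm() undoes pnorm(): starting from a probability,
# finding its quantile and feeding it to the CDF returns the probability
p <- 0.9
round_trip <- pnorm(qnorm(p))
```

Because the standard normal distribution is symmetric, the first and third quartiles differ only in sign.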

Quantiles Examples

# median of the standard normal distribution
qnorm(0.5)

[1] 0

# 95% of such distribution falls between -1.96 * sigma
qnorm(0.025)

[1] -1.959964

# and +1.96 * sigma
qnorm(0.975)

[1] 1.959964
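
The same approach works for other levels. As a sketch, the two-sided cut-off points for a central 99% interval (a level chosen here purely for illustration) can be found as follows.

```r
# Central share of the distribution we want to cover (assumed for illustration)
conf_level <- 0.99
alpha <- 1 - conf_level

# Upper cut-off: leaves alpha / 2 in the right tail
z_crit <- qnorm(1 - alpha / 2)

# Check: the area between -z_crit and +z_crit equals conf_level
coverage <- pnorm(z_crit) - pnorm(-z_crit)
```

Here `z_crit` is roughly 2.58, compared with 1.96 for the 95% case above.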

Random Numbers

Generating random numbers on a deterministic computer sounds like an oxymoron.
What we can in fact generate are pseudo-random numbers, which for practical purposes behave as if they were random.
Unless told otherwise, the generator takes its starting point from computer parameters that are almost random (time, load, process identifier).
In R, functions for random number generation (RNG) start with r:

rnorm(n, mean = 0, sd = 1)
rchisq(n, df)
rexp(n)
rpois(n, lambda)
rt(n, df, ncp)
runif(n)

RNG for Normal Distribution

# Draw 10 pseudo-random numbers from standard normal distribution
rnorm(10)

 [1]  0.05270152  1.53800577 -0.07023770 -0.23166383  1.06150143 -0.85755747
 [7] -2.78842719 -1.00184274  0.69095688  1.58075134

rnorm(10)

 [1]  0.256239843  1.158994256 -0.136316800  1.529497541  0.808011988
 [6]  0.251033913  1.552438351  0.981181616  0.005545037 -1.833525246

# Change the parameters (mean and sd) of a normal distribution
rnorm(10, mean = 7, sd = 5)

 [1] 14.737266  6.623805  3.248738  6.787346 12.444408  6.822736  5.675426
 [8]  4.188826  8.197629  6.690349
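
With only 10 draws the sample can look quite different from the distribution it came from. As a rough sketch, drawing a much larger sample (10,000 here, an arbitrary choice) should give a sample mean and standard deviation close to the parameters we set.

```r
# Draw a large sample from a normal distribution with mean = 7 and sd = 5
draws <- rnorm(10000, mean = 7, sd = 5)

# Sample statistics should be close to (but not exactly) 7 and 5
sample_mean <- mean(draws)
sample_sd <- sd(draws)
```

The exact values change from run to run because no seed is fixed.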

Making Random Numbers Replicable

# set.seed() function sets the initial state
# Makes random number generation replicable
set.seed(2023)
rnorm(10)

 [1] -0.08378436 -0.98294375 -1.87506732 -0.18614466 -0.63348570  1.09079746
 [7] -0.91372727  1.00163971 -0.39926660 -0.46812305

set.seed(2023)
rnorm(10)

 [1] -0.08378436 -0.98294375 -1.87506732 -0.18614466 -0.63348570  1.09079746
 [7] -0.91372727  1.00163971 -0.39926660 -0.46812305
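
Rather than comparing the printed output by eye, the two draws can be compared programmatically. This is a minimal sketch using identical(); the object names are illustrative.

```r
# Same seed, same sequence of pseudo-random numbers
set.seed(2023)
first_draw <- rnorm(10)
set.seed(2023)
second_draw <- rnorm(10)
same_seed <- identical(first_draw, second_draw)

# Without resetting the seed the generator simply continues,
# so the next draw differs from the first one
third_draw <- rnorm(10)
differs <- !identical(first_draw, third_draw)
```

Both `same_seed` and `differs` are TRUE, which shows that the seed has to be set immediately before the code that should be replicable.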

Next

Tutorial:
- Working with distributions, sampling
Next week:
- Data Frames
