Comparison Error Rate
- What will be the p-value cutoff if you would like to control the Per-Comparison Error Rate
(i.e., significance level) under 0.05? How many genes will be identified as differentially expressed
using this cutoff?
- Descriptive study
- Exploratory study
- Inferential study
- Predictive study
- Causal study
- Mechanistic study
Problem 3 (20 pts)
We will use a data set babies.txt to demonstrate the use of the permutation test. You may use the
following R commands to read the data set into R:
babies <- read.table("babies.txt", header=TRUE)
bwt.nonsmoke <- subset(babies, smoke==0)$bwt
bwt.smoke <- subset(babies, smoke==1)$bwt
The two R objects bwt.nonsmoke and bwt.smoke contain the birth weights of the babies of
nonsmoking and smoking mothers, respectively.
- We will draw a sample of size 10 from each group and compute the observed difference in
mean birth weight:
n <- 10
set.seed(1)
nonsmokers <- sample(bwt.nonsmoke , n)
smokers <- sample(bwt.smoke , n)
diff <- mean(smokers) - mean(nonsmokers)
The question is whether this observed difference is statistically significant. We do not want to
rely on the assumptions needed for the t test, so instead we will use permutations. We will
reshuffle the data and recompute the difference in means. We can create one permuted sample
with the following code:
dat <- c(smokers, nonsmokers)
shuffle <- sample(dat)
smokers_star <- shuffle[1:n]
nonsmokers_star <- shuffle[(n+1):(2*n)]
diff_star <- mean(smokers_star)-mean(nonsmokers_star)
Problem 2 (15 pts)
Using the NCI60 data, give an example of each of the following data analysis tasks (please refer to the
Leek and Peng, Science (2015) article). Note: this is an open question; different people may give
different examples for the same task.
The last value is one observation from the null distribution we will construct. Set the seed at
1, and then repeat the permutation 1,000 times to create a null distribution. What is the
permutation-derived p-value for our observation diff?
- Repeat the above exercise, but instead of the difference in means, consider the difference in
medians.
diff_med <- median(smokers) - median(nonsmokers)
What is the permutation-based p-value?
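The permutation procedure described in this problem can be sketched as a small reusable function. This is only an illustration on simulated data, not on babies.txt: the vectors x and y, the helper name perm_pvalue, and the simulated means and standard deviation are all assumptions made for the example. It reports a two-sided p-value with an add-one correction, which is one common convention among several.

```r
# Hypothetical helper: permutation p-value for a difference in a summary statistic.
perm_pvalue <- function(x, y, stat = mean, B = 1000) {
  n <- length(x)
  obs <- stat(x) - stat(y)            # observed difference
  pooled <- c(x, y)
  null <- replicate(B, {
    shuffled <- sample(pooled)        # random relabelling of the pooled data
    stat(shuffled[1:n]) - stat(shuffled[-(1:n)])
  })
  # Two-sided p-value; the +1 terms keep the estimate away from exactly zero.
  (sum(abs(null) >= abs(obs)) + 1) / (B + 1)
}

set.seed(1)
x <- rnorm(10, mean = 110, sd = 18)   # simulated stand-in for one group
y <- rnorm(10, mean = 120, sd = 18)   # simulated stand-in for the other group
p_mean   <- perm_pvalue(x, y, stat = mean)
p_median <- perm_pvalue(x, y, stat = median)
```

Passing the statistic as an argument lets the same loop serve both the mean and the median version of the exercise.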
Problem 4 (10 pts)
We can use the same data set babies.txt to demonstrate the use of the t test. For a sample with
sample size n = 10 like in Problem 3, we cannot rely on the Central Limit Theorem, and we need to
check whether the data are approximately Gaussian under each condition. The following R code
can be used to check the Gaussian assumption:
qqnorm(bwt.nonsmoke)
qqline(bwt.nonsmoke,col=2)
qqnorm(bwt.smoke)
qqline(bwt.smoke,col=2)
- Please display the Q-Q plots and argue whether the Gaussian assumption is reasonable for this
data set.
- Perform the t test using the following R code. Compare the resulting p-value with what you
obtained from the permutation test in Problem 3. Do you reach the same conclusion?
result <- t.test(nonsmokers, smokers)
result$p.value
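When reading the output, it helps to know that t.test with two samples performs Welch's unequal-variance test by default (var.equal = FALSE). The sketch below uses simulated vectors a and b, which are assumptions for illustration and not the babies.txt samples, to show the components of the returned object and how the pooled-variance version is requested.

```r
set.seed(1)
a <- rnorm(10, mean = 120, sd = 18)       # simulated, not from babies.txt
b <- rnorm(10, mean = 110, sd = 18)       # simulated, not from babies.txt

welch  <- t.test(a, b)                    # default: Welch, unequal variances
pooled <- t.test(a, b, var.equal = TRUE)  # classical pooled-variance t test

welch$statistic    # t statistic
welch$parameter    # degrees of freedom (generally non-integer for Welch)
welch$p.value      # two-sided p-value
pooled$parameter   # n1 + n2 - 2 degrees of freedom
pooled$p.value
```

With equal group sizes the two versions share the same t statistic and differ only in their degrees of freedom.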
Problem 5 (10 pts)
A test for cystic fibrosis has an accuracy of 99%. Specifically, we mean that:
P(+|D) = 0.99
and
P(−|ND) = 0.99,
where + and − stand for positive and negative test results respectively, and D and ND represent
the existence and nonexistence of the disease respectively.
The cystic fibrosis rate in the general population is 1 in 3,900, that is, approximately
P(D) = 0.00025.
If we select a random person and they test positive, what is the probability that the person has the
disease?
Hint: Use Bayes' theorem.
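Bayes' theorem for a diagnostic test combines P(+|D), P(−|ND), and P(D), using P(+|ND) = 1 − P(−|ND). A minimal sketch follows; the function name posterior_positive and its argument names are hypothetical, and the numbers in the call are illustrative rather than the ones in this problem.

```r
# Hypothetical helper: P(D | +) from sensitivity, specificity and prevalence.
posterior_positive <- function(sens, spec, prev) {
  true_pos  <- sens * prev               # P(+ | D)  * P(D)
  false_pos <- (1 - spec) * (1 - prev)   # P(+ | ND) * P(ND)
  true_pos / (true_pos + false_pos)      # Bayes' theorem
}

# Illustrative values only (not those of the problem):
posterior_positive(sens = 0.9, spec = 0.9, prev = 0.5)   # 0.9
```

The denominator is the total probability of a positive result, so a low prevalence makes the false-positive term dominate.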
Problem 6 (15 pts)
Suppose that you have an i.i.d. sample X1, ..., Xn from a Gaussian distribution with unknown mean
µ and known variance σ². Use a prior distribution for µ that is Gaussian with mean η and variance
τ².
- Argue that the Gaussian prior is a conjugate prior for the mean of a Gaussian distribution with known variance.
- Derive the posterior distribution of µ given the sample and the prior.
- Show that the posterior mean of µ is a weighted sum of the sample average and η.
Note: Because the posterior is Gaussian, its mean coincides with its mode, so the posterior mean you
derive in this problem is also the “maximum a posteriori” (MAP) estimator of µ.
Hint: Please follow my derivation for the Bayesian estimator of σ² in the t-test setting.
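As a starting point, the posterior is proportional to the likelihood times the prior, and completing the square in µ gives a Gaussian form. The LaTeX sketch below writes out that setup together with the standard result it leads to. The symbols \bar{X} (the sample average), \mu_n, and \tau_n^2 are notation introduced here for illustration and do not come from the assignment.

```latex
\begin{align*}
p(\mu \mid X_1,\dots,X_n)
  &\propto \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2\right\}
           \exp\!\left\{-\frac{(\mu-\eta)^2}{2\tau^2}\right\} \\
  &\propto \exp\!\left\{-\frac{(\mu-\mu_n)^2}{2\tau_n^2}\right\},
\end{align*}
\[
\frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2},
\qquad
\mu_n = \frac{\tfrac{n}{\sigma^2}\,\bar{X} + \tfrac{1}{\tau^2}\,\eta}
             {\tfrac{n}{\sigma^2} + \tfrac{1}{\tau^2}} .
\]
```

The weights on \bar{X} and η are the precisions of the data and of the prior, so the sample average dominates as n grows.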
Problem 7 (10 pts)
This problem is to help you understand what we mean by a random sample, and why we say that
the sample mean is a random variable.
Use the following R code to simulate a sample with sample size n = 10 and calculate the sample
mean:
n <- 10
x <- rnorm(n)
mean(x)
Repeat the simulation one million times and plot the distribution of the one million sample means.
Now increase n to 100 and repeat the above simulation. How does the distribution of the one million
sample means change?
Now further increase n to 1000 and repeat the above simulation. How does the distribution of the
one million sample means change?
What do we conclude about the distribution of the sample mean?
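The repeated simulation can be organised with replicate. The sketch below is one possible layout, not a required solution: the helper name sim_means is hypothetical, and B is set to 10,000 rather than one million purely to keep the run short. It compares the empirical standard deviation of the sample means with the theoretical value 1/sqrt(n) for standard normal data.

```r
set.seed(1)
B <- 10000  # assumption: fewer replicates than the one million in the problem, for speed

# Hypothetical helper: B sample means, each computed from n standard normal draws.
sim_means <- function(n, B) replicate(B, mean(rnorm(n)))

for (n in c(10, 100, 1000)) {
  means <- sim_means(n, B)
  cat(sprintf("n = %4d  sd of sample means = %.4f  1/sqrt(n) = %.4f\n",
              n, sd(means), 1 / sqrt(n)))
}

# Histogram of the simulated sample means for one value of n.
hist(sim_means(10, B), breaks = 50, main = "Sample means, n = 10")
```

Raising B toward one million only smooths the histogram; the spread is governed by n.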