Skip to content

Sampling terminology is incorrect #99

@skolenik

Description

@skolenik

I am a sampling statistician. Your terminology is not compatible with any of the existing sampling books.

What you call complete random sampling is hard to name, but the term you came up with is not used in sampling work. What you call simple random sampling is called Poisson sampling. The definition of simple random sampling is that all samples of size n from the finite population of size N have equal probabilities of selection. Equal probabilities of selection of individual units is not a sign of SRS; when I taught sampling to grad students, an exercise I used to have in the sampling section of the survey statistics class was to came up with multiple designs that are equal probability but not SRS (1 for C -- you need to know at least one to understand what SRS is vs. what it isn't, 3 designs for B, 5 different designs for A).

I understand that you are deep in development with a lot of parts that have already been documented, exposed, put on the website, published, etc. Still... this is fundamentals.

A standard introductory reference on sampling is Lohr (3rd edition 2022). The next level is probably Kish (1965). The graduate level textbooks are Valliant Dever Kreuter (2018) accompanied by library(PracTools), and Tille 2006 accompanied by library(sampling) written by his student.

A technical problem that you have is that the unequal probability sampling you use is not a proper probability proportional to size (PPS) sample: you can simulate and see that probabilities of selection are not maintained. The problem is with the underlying base::sample(); proper procedures are implemented in library(sampling) by Alina Matei and Yves Tille, e.g. sampling::UPmaxentropy().

# example from Tille (2006)
set.seed(25)
# sum of probabilities == expected sample size == 3L
uneq_p = c(0.07,0.17,0.41,0.61,0.83,0.91)
sim_uneq_p_base <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_base)) {
  this <- sample.int(n=length(uneq_p),size=sum(uneq_p),prob=uneq_p)
  sim_uneq_p_base[k,this] <- 1
}
colSums(sim_uneq_p_base)/nrow(sim_uneq_p_base)
uneq_p
# done right
sim_uneq_p_done_right <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_done_right)) {
   sim_uneq_p_done_right[k,] <- sampling::UPmaxentropy(uneq_p)
}
colSums(sim_uneq_p_done_right)/nrow(sim_uneq_p_done_right)
uneq_p

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions