On a scale of 0 to 10, how much does the average citizen of the Republic of
Elbonia trust the president?
You’re conducting a survey to find out, and you’ve calculated that in order
to get the precision you want, you’re going to need a sample of 100
statistically independent individuals. Now you have to decide how to do it.
You could stand in the central square of the capital city and survey the
next 100 people who walk by. But these opinions won’t be independent:
probably politics in the capital isn’t representative of politics in
Elbonia as a whole.
So you consider travelling to 100 different locations in the country and
asking one Elbonian at each. But apart from anything else, this is far
too expensive for you to do.
Maybe a compromise would be OK. You could go to 10 locations and ask… 20
people at each? 30? How many would you need in order to match the
precision of 100 independent individuals — to have an “effective
sample size” of 100?
The answer turns out to be closely connected to a quantity I’ve written
about many times before: magnitude.
Let me explain…
The general situation is that we have a large population of individuals (in
this case, Elbonians), and with each individual there is associated a real number
(in this case, their level of trust in the president). So we have a probability
distribution, and we’re interested in discovering some statistic $\theta$ of it
(in this case, the mean, but it might instead be the median
or the variance or the 90th percentile). We do this by taking some sample
of individuals, and then doing something with the sampled data to
produce an estimate of $\theta$.
The “something” we do with the sampled data is called an estimator.
So, an estimator is a real-valued function on the set of possible sample
data. For instance, if you’re trying to estimate the mean of the
population, and we denote the sample data by $X_1, \ldots, X_n$, then the
obvious estimator for the population mean would be just the sample mean,
$\frac{1}{n}(X_1 + \cdots + X_n)$.
But it’s important to realize that the best estimator for a given statistic
of the population (such as the mean) needn’t be that same statistic applied
to the sample. For example, suppose we wish to know the mean mass of
men from Mali. Unfortunately, we’ve only weighed three men from Mali, and
two of them are brothers. You could use the sample mean
$$\tfrac{1}{3}(X_1 + X_2 + X_3)$$
as your estimator, but since body mass is somewhat genetic, that would give
undue importance to one particular family. At the opposite extreme, you
could use
$$\tfrac{1}{4} X_1 + \tfrac{1}{4} X_2 + \tfrac{1}{2} X_3$$
(where $X_3$ is the mass of the non-brother). But that would be going too
far, as it gives the non-brother as much importance as the two brothers put
together. Probably the best answer is somewhere in between. Exactly
where in between depends on the correlation between masses of brothers,
which is a quantity we might reasonably estimate from data gathered elsewhere
in the world.
(There’s a deliberate echo here of something I wrote about before:
in what proportions should we sow
poppies, Polish wheat and Persian wheat in order to maximize
biological diversity? The similarity is no coincidence.)
There are several qualities we might seek in an estimator. I’ll focus on two.
High precision  The precision of an estimator is the
reciprocal of its variance. To make sense of this, you have to realize
that estimators are random variables too! An estimator with high
precision, or low variance, is not much changed by the effects of
randomness. It will give more or less the same answer if you run it
again and again.
For instance, suppose we’ve decided to do the Elbonian survey by asking
30 people in each of the 5 biggest cities and 20 people from each of 3
chosen villages, then taking some specific weighted mean of the resulting
data. If that’s a high-precision estimator, it will give more or
less the same final answer no matter which specific Elbonians happen to
have been stopped by the pollsters.
Unbiased An estimator of some statistic is unbiased if its expected value is
equal to that statistic for the population.
For example, suppose we’re trying to estimate the variance of some
distribution. If our sample consists of a measly two individuals, then the
variance of the sample is likely to be much less than the variance of the
population. After all, with only two individuals observed, we’ve barely
begun to glimpse the full variation of the population as a whole. It can
actually be shown
that with a sample size of two, the expected value of the sample variance
is half the population variance. So the sample variance is a biased
estimator of the population variance, but twice the sample variance is an
unbiased estimator of it.
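This is easy to check by simulation. Here’s a quick sketch of my own, assuming a standard normal population (so the population variance is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples of size two from a standard normal population
# (population variance 1) and compute each sample's variance,
# dividing by n rather than n - 1.
pairs = rng.standard_normal((100_000, 2))
sample_variances = pairs.var(axis=1)  # ddof=0: the plain sample variance

# With a sample size of two, the expected sample variance is half
# the population variance, so doubling it removes the bias.
print(sample_variances.mean())      # ≈ 0.5
print(2 * sample_variances.mean())  # ≈ 1.0
```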
(Being unbiased is perhaps a less crucial property of an estimator than
it might at first appear. Suppose the boss of a chain of pizza takeaways
wants to know the average size of pizzas ordered. “Size” could be measured
by diameter — what you order by — or area — what you eat.
But since the relationship between diameter and area is quadratic rather
than linear, an unbiased estimator of one will be a biased estimator of the
other.)
No matter what statistic you’re trying to estimate, you can talk about
the “effective sample size” of an estimator. But for simplicity, I’ll only
talk about estimating the mean.
Here’s a loose definition:
The effective sample size of an estimator of the population mean is
the number $n_{\mathrm{eff}}$ with the property that our estimator has the same
precision (or variance) as the estimator got by sampling
$n_{\mathrm{eff}}$ independent individuals and taking their mean.
Let’s unpack that.
Suppose we choose $n$ individuals at random from the population (with
replacement, if you care). So we have independent, identically distributed
random variables $X_1, \ldots, X_n$. As above, we take the sample mean
$\frac{1}{n}(X_1 + \cdots + X_n)$
as our estimator of the population mean. Since variance is additive for
independent random variables, the variance of this estimator is $\sigma^2/n$,
where $\sigma^2$ is the population variance. The precision of the
estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample
size increases, the precision of your estimate increases too.
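You can see this numerically. A small sketch, using a made-up population with mean $5$ and variance $4$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # sample size; the population below has variance sigma^2 = 4

# Run many independent surveys of n individuals and record each
# survey's sample mean.
samples = rng.normal(loc=5.0, scale=2.0, size=(50_000, n))
sample_means = samples.mean(axis=1)

# Variance of the sample mean is sigma^2 / n; its precision is n / sigma^2.
print(sample_means.var())      # ≈ 4 / 100 = 0.04
print(1 / sample_means.var())  # ≈ 100 / 4 = 25
```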
Now, suppose we have some other estimator $T$ of the population
mean. It’s a random variable, so it has a variance $\mathrm{Var}(T)$. The
effective sample size of the estimator is the number $n_{\mathrm{eff}}$ satisfying
$$\mathrm{Var}(T) = \frac{\sigma^2}{n_{\mathrm{eff}}}.$$
This doesn’t entirely make sense, as the unique number $n_{\mathrm{eff}}$ satisfying
this equation needn’t be an integer, so we can’t sensibly talk about a
sample of size $n_{\mathrm{eff}}$. Nevertheless, we can absolutely rigorously
define the effective sample size of our estimator as
$$n_{\mathrm{eff}} = \frac{\sigma^2}{\mathrm{Var}(T)}.$$
And that’s the definition. Differently put, the effective sample size is the
precision of the estimator divided by the precision of a single observation.
Trivial examples  If $T$ is the mean value of $n$
uncorrelated individuals, then the effective sample size is $n$. If
$T$ is the mean value of $n$ extremely highly correlated
individuals, then the variance of the estimator is little less than the
variance of a single individual, so the effective sample size is little
more than $1$.
Now, suppose our pollsters have come back from their trips to various parts
of Elbonia. Together, they’ve asked $n$ individuals how much they trust the
president. We want to take that data and use it to estimate the population
mean — that is, the mean level of trust in the president across
Elbonia — in as precise a way as possible.
We’re going to restrict ourselves to unbiased estimators, so that the
expected value of the estimator is the population mean. We’re also going
to consider only linear estimators: those of the form
$$T = \lambda_1 Y_1 + \cdots + \lambda_n Y_n,$$
where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$ individuals surveyed.
What choice of unbiased linear estimator maximizes the effective sample size?
To answer this, we need to recall some basic statistical notions…
Correlation and covariance
Variance is a quadratic form, and covariance is the corresponding bilinear
form. That is, take two random variables $Y$ and $Z$, with respective
means $\mu_Y$ and $\mu_Z$. Then their covariance is
$$\mathrm{Cov}(Y, Z) = E\bigl((Y - \mu_Y)(Z - \mu_Z)\bigr).$$
This is bilinear in $Y$ and $Z$, and $\mathrm{Cov}(Y, Y) = \mathrm{Var}(Y)$.
$\mathrm{Cov}(Y, Z)$ is bounded above and below by $\pm \sigma_Y \sigma_Z$, the
product of the standard deviations. It’s natural to normalize, dividing
through by $\sigma_Y \sigma_Z$ to obtain a number between $-1$ and $1$.
This gives the correlation coefficient
$$\mathrm{corr}(Y, Z) = \frac{\mathrm{Cov}(Y, Z)}{\sigma_Y \sigma_Z}.$$
Alternatively, we can first scale $Y$ and $Z$ to have variance $1$, then
take the covariance, and this also gives the correlation:
$$\mathrm{corr}(Y, Z) = \mathrm{Cov}\Bigl(\frac{Y}{\sigma_Y}, \frac{Z}{\sigma_Z}\Bigr).$$
Now suppose we have $n$ random variables, $Y_1, \ldots, Y_n$. The
correlation matrix is the $n \times n$ matrix $R$ whose $(i, j)$-entry
is $\mathrm{corr}(Y_i, Y_j)$. Correlation matrices have some easily-proved properties:
The entries are all in $[-1, 1]$.
The diagonal entries are all $1$.
The matrix is symmetric.
The matrix is positive semidefinite. That’s because the corresponding
quadratic form is $\lambda \mapsto \mathrm{Var}\bigl(\sum_i \lambda_i Y_i/\sigma_{Y_i}\bigr)$, and variances are nonnegative.
And actually, it’s not so hard to prove that any matrix with these
properties is the correlation matrix of some sequence of random variables.
In what follows, for simplicity, I’ll quietly assume that the correlation
matrices we encounter are strictly positive definite. This only amounts to
assuming that no linear combination of the $Y_i$s has variance zero —
in other words, that there are no exact linear relationships between the
random variables involved.
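All four properties are easy to verify numerically on a correlation matrix built from arbitrary data (the data here is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invent some correlated data: 1000 observations of 4 variables.
data = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
R = np.corrcoef(data, rowvar=False)  # the 4x4 correlation matrix

assert np.all(np.abs(R) <= 1 + 1e-12)          # entries in [-1, 1]
assert np.allclose(np.diag(R), 1)              # diagonal entries are 1
assert np.allclose(R, R.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(R) > -1e-10)  # positive semidefinite
```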
Back to the main question
Here’s where we got to. We surveyed $n$ individuals from our population,
giving identically distributed but not necessarily independent random
variables $Y_1, \ldots, Y_n$. Some of them will be correlated because of
how the sample was gathered: two respondents from the same town are likely
to give similar answers.
We’re trying to use this data to estimate the population mean in as precise
a way as possible. Specifically, we’re looking for numbers
$\lambda_1, \ldots, \lambda_n$ such that the linear estimator
$\lambda_1 Y_1 + \cdots + \lambda_n Y_n$ is unbiased and has the
maximum possible effective sample size.
The effective sample size was defined as $\sigma^2/\mathrm{Var}(T)$, where
$\sigma^2$ is the variance of the distribution we’re drawing
from. Now we need to work out the variance in the denominator.
Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$. I said a
moment ago that $\lambda \mapsto \mathrm{Var}\bigl(\sum_i \lambda_i Y_i\bigr)$ is the
quadratic form corresponding to the bilinear form represented by the
covariance matrix. Since each $Y_i$ has variance $\sigma^2$, the
covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence
$$\mathrm{Var}\Bigl(\sum_i \lambda_i Y_i\Bigr) = \sigma^2 \lambda^T R \lambda,$$
where $(-)^T$ denotes a transpose and $\lambda = (\lambda_1, \ldots, \lambda_n)^T$.
So, the effective sample size of our estimator is
$$\frac{\sigma^2}{\sigma^2 \lambda^T R \lambda} = \frac{1}{\lambda^T R \lambda}.$$
We also wanted our estimator to be unbiased. Its expected value is
$$E\Bigl(\sum_i \lambda_i Y_i\Bigr) = \Bigl(\sum_i \lambda_i\Bigr)\mu,$$
where $\mu$ is the population mean. So, we need $\sum_i \lambda_i = 1$.
Putting this together, the maximum possible effective sample size among all
unbiased linear estimators is
$$\sup\Bigl\{\frac{1}{\lambda^T R \lambda} : \sum_i \lambda_i = 1\Bigr\}.$$
Which $\lambda$ achieves this maximum, and what is the maximum
possible effective sample size? That’s easy, and in fact it’s something
that’s appeared many times at this blog before…
The magnitude of a matrix
The magnitude $|A|$ of an invertible $n \times n$ matrix $A$ is the sum of
all $n^2$ entries of $A^{-1}$. To calculate it, you don’t need to go as
far as inverting $A$. It’s much easier to find the unique column vector $w$
satisfying $A w = (1, \ldots, 1)^T$
(the weighting of $A$), then calculate $|A| = \sum_i w_i$. This sum is the
magnitude of $A$, since $w_i$ is the $i$th row-sum of $A^{-1}$.
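In code, the two recipes give the same number. A minimal sketch (the particular $3 \times 3$ matrix is just an arbitrary positive definite example):

```python
import numpy as np

def magnitude(A):
    """Magnitude of an invertible matrix: the sum of the entries of A^{-1},
    computed via the weighting w, the unique solution of A w = (1, ..., 1)^T."""
    w = np.linalg.solve(A, np.ones(len(A)))
    return w.sum()

A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

# Same answer as inverting A and summing all its entries.
assert np.isclose(magnitude(A), np.linalg.inv(A).sum())
print(magnitude(A))  # ≈ 2.0 for this matrix
```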
Most of what I’ve written about magnitude
has been in the situation where we start with a finite metric space
$X = \{x_1, \ldots, x_n\}$, and we use the matrix $Z$ with entries
$Z_{ij} = e^{-d(x_i, x_j)}$. This turns out to give interesting information about
$X$. In the metric situation, the entries of the matrix are between
$0$ and $1$. Often $Z$ is positive definite (e.g. when $X \subseteq \mathbb{R}^N$), as correlation matrices are.
When $A$ is positive definite, there’s a third way to describe the magnitude:
$$|A| = \sup_{y \neq 0} \frac{\bigl(\sum_i y_i\bigr)^2}{y^T A y}.$$
The supremum is attained just when $y$ is a scalar multiple of the weighting $w$, and the proof is a simple
application of the Cauchy–Schwarz inequality.
But that supremum is exactly the expression we had for the maximum effective
sample size: when $\sum_i \lambda_i = 1$, we have
$1/(\lambda^T R \lambda) = \bigl(\sum_i \lambda_i\bigr)^2/(\lambda^T R \lambda)$,
and scaling $\lambda$ doesn’t change the latter expression. So:
The maximum possible value of $n_{\mathrm{eff}}$ is $|R|$.
Or more wordily:
The maximum effective sample size of an unbiased linear estimator of the
mean is the magnitude of the sample correlation matrix.
Or wordily but approximately:
Effective sample size $\approx$ magnitude of correlation matrix.
Moreover, we know how to attain that maximum. It’s attained if and only if
our estimator is
$$\frac{\sum_i w_i Y_i}{\sum_i w_i},$$
where $w$ is the weighting of the correlation matrix $R$.
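Here’s how that looks in code. The correlation matrix below is made up for illustration: the first two variables are strongly correlated, so the optimal estimator downweights them relative to the third.

```python
import numpy as np

def optimal_coefficients(R):
    """Coefficients of the maximum-precision unbiased linear estimator:
    the weighting of R, normalised so the coefficients sum to 1."""
    w = np.linalg.solve(R, np.ones(len(R)))
    return w / w.sum()

def effective_sample_size(lam, R):
    """1 / (lambda^T R lambda), for coefficients lambda summing to 1."""
    return 1 / (lam @ R @ lam)

R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

lam = optimal_coefficients(R)
equal = np.full(3, 1 / 3)  # the plain sample mean, for comparison

# The optimum attains the magnitude of R and beats equal weighting.
assert np.isclose(effective_sample_size(lam, R), np.linalg.inv(R).sum())
print(effective_sample_size(equal, R))  # ≈ 1.73
print(effective_sample_size(lam, R))    # ≈ 1.86, the magnitude of R
```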
I’m not too sure where this “result” — observation, really —
comes from. I learned it from the statistician Paul
Blackwell at Sheffield, who, like
me, had been reading this paper:
Andrew Solow and Stephen Polasky, Measuring biological diversity.
Environmental and Ecological Statistics 1 (1994), 95–103.
In turn, Solow and Polasky refer to this:
Morris Eaton, A group action on covariances with applications to the
comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong
(eds.), Stochastic inequalities: Papers from the AMS-IMS-SIAM Joint Summer
Research Conference held in Seattle, Washington, July 1991, Institute of
Mathematical Statistics Lecture Notes — Monograph Series, Volume 22.
But the result is so simple that I’d imagine it’s much older. I’ve been
wondering whether it’s essentially the Gauss–Markov theorem in disguise. First I
thought it was, then I thought it wasn’t. Does anyone know?
The surprising behaviour of effective sample size
You might expect the effective size of a sample of individuals to be at
most . It’s not.
You might expect the effective sample size to go down as the correlations
within the sample go up. It doesn’t.
This behaviour appears in even the simplest nontrivial example:
Example  Suppose our sample consists of just two individuals.
Call the sampled values $Y_1$ and $Y_2$, write $\rho$ for their correlation,
and write the correlation matrix as
$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
Then the maximum-precision unbiased linear estimator is
$\frac{1}{2}(Y_1 + Y_2)$, and its effective sample size is
$$|R| = \frac{2}{1 + \rho}.$$
As the correlation $\rho$ between the two variables increases from $0$ to
$1$, the effective sample size decreases from $2$ to $1$, as you’d expect.
But when $\rho < 0$, the effective sample size is greater than 2. In
fact, as $\rho \to -1$, the effective sample size tends to $\infty$.
That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing
$\mu$ for the population mean, we have $Y_2 - \mu \approx -(Y_1 - \mu)$, and so
$\frac{1}{2}(Y_1 + Y_2)$ is a very good estimator
of $\mu$. In the extreme, when $\rho = -1$, it’s an exact estimator of $\mu$
— it’s infinitely precise.
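Numerically, computing the effective sample size as the magnitude of the $2 \times 2$ correlation matrix (the sum of the entries of its inverse):

```python
import numpy as np

def ess_two(rho):
    """Effective sample size for two correlated observations:
    the magnitude of [[1, rho], [rho, 1]], which equals 2 / (1 + rho)."""
    R = np.array([[1.0, rho], [rho, 1.0]])
    return np.linalg.inv(R).sum()

print(ess_two(0.0))   # 2.0: two uncorrelated observations
print(ess_two(0.9))   # ≈ 1.05: the two observations are nearly redundant
print(ess_two(-0.9))  # ≈ 20: far more than the actual sample size of 2
```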
The fact that the effective sample size can be greater than the actual
sample size seems to be very well known. For instance, there’s a whole page about it
in the documentation for Q, which is apparently “analysis software for
market research”.
What’s interesting is that this doesn’t only occur when
some of the variables are negatively correlated. It can also happen when
all the correlations are nonnegative, as in the following example from the
paper by Eaton cited above.
Example  Consider the correlation matrix
$$R = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix},$$
where $-1/\sqrt{2} < \rho < 1/\sqrt{2}$. This is positive
definite, so it’s the correlation matrix of some random variables $Y_1, Y_2, Y_3$.
A routine computation shows that
$$n_{\mathrm{eff}} = |R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$$
As we’ve shown, this is the effective sample size of a maximum-precision
unbiased linear estimator of the mean.
When $\rho = 0$, it’s $3$, as you’d
expect: the variables are uncorrelated. As $\rho$ increases, $n_{\mathrm{eff}}$
decreases, again as you’d expect: more correlation between the variables
leads to a smaller effective sample size. This behaviour continues until
$\rho = 1/2$, where $n_{\mathrm{eff}} = 2$.
But then something strange happens. As $\rho$ increases from $1/2$ to
$1/\sqrt{2}$, the effective sample size increases from $2$ to $\infty$.
Increasing the correlation increases the effective sample size. For
instance, when $\rho = 0.7$, we have $n_{\mathrm{eff}} = 10$ — the
maximum-precision estimator is as precise as if we’d chosen $10$
independent individuals. For that value of $\rho$, the maximum-precision
estimator turns out to be
$$\tfrac{3}{2} Y_1 - 2 Y_2 + \tfrac{3}{2} Y_3.$$
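To double-check the numbers in this example (taking the correlation matrix to be the tridiagonal one with $\rho$ in the off-diagonal positions):

```python
import numpy as np

def eaton_R(rho):
    """The correlation matrix of the example: 1s on the diagonal,
    rho between neighbouring variables, 0 between the outer pair."""
    return np.array([[1.0, rho, 0.0],
                     [rho, 1.0, rho],
                     [0.0, rho, 1.0]])

def magnitude(A):
    w = np.linalg.solve(A, np.ones(len(A)))  # the weighting of A
    return w.sum()

# At rho = 0.7 the effective sample size is 10, attained by the
# coefficients w / |R| = (3/2, -2, 3/2): note the negative middle weight,
# even though all the correlations are nonnegative.
w = np.linalg.solve(eaton_R(0.7), np.ones(3))
print(magnitude(eaton_R(0.7)))  # ≈ 10
print(w / w.sum())              # ≈ [ 1.5 -2.   1.5]
```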
These examples may seem counterintuitive, but Eaton cautions us
to beware of our feeble intuitions:
These examples show that our rather vague intuitive feeling that
“positive correlation tends to decrease information content in an
experiment” is very far from the truth, even for rather simple normal
experiments with three observations.
This is very like the fact that a metric space with $n$ points can have
magnitude (“effective number of points”) greater than $n$, even if the
associated matrix is positive definite.
Anyone with any statistical knowledge who’s still reading will easily have
picked up on the fact that I’m a total amateur. If that’s you, I’d love to
hear your comments!