On a scale of 0 to 10, how much does the average citizen of the Republic of Elbonia trust the president?

You’re conducting a survey to find out, and you’ve calculated that in order to get the precision you want, you’re going to need a sample of 100 statistically independent individuals. Now you have to decide how to do this.

You could stand in the central square of the capital city and survey the next 100 people who walk by. But these opinions won’t be independent: probably politics in the capital isn’t representative of politics in Elbonia as a whole.

So you consider travelling to 100 different locations in the country and asking one Elbonian at each. But apart from anything else, this is far too expensive for you to do.

Maybe a compromise would be OK. You could go to 10 locations and ask… 20 people at each? 30? How many would you need in order to match the precision of 100 independent individuals — to have an “effective sample size” of 100?

The answer turns out to be closely connected to a quantity I’ve written about many times before: magnitude. Let me explain…

The general situation is that we have a large population of individuals (in
this case, Elbonians), and with each there is associated a real number
(in this case, their level of trust in the president). So we have a probability
distribution, and we’re interested in discovering some statistic $\theta$
(in this case, the mean, but it might instead be the median
or the variance or the 90th percentile). We do this by taking some sample
of $n$ individuals, and then doing *something* with the sampled data to
produce an estimate of $\theta$.

The “something” we do with the sampled data is called an **estimator**.
So, an estimator is a real-valued function on the set of possible sample
data. For instance, if you’re trying to estimate the mean of the
population, and we denote the sample data by $Y_1, \ldots, Y_n$, then the
obvious estimator for the population mean would be just the sample mean,

$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n.$$

But it’s important to realize that the best estimator for a given statistic
of the population (such as the mean) needn’t be that same statistic applied
to the sample. For example, suppose we wish to know the mean mass of
men from Mali. Unfortunately, we’ve only weighed three men from Mali, and
two of them are brothers. You *could* use

$$\frac{1}{3} Y_1 + \frac{1}{3} Y_2 + \frac{1}{3} Y_3$$

as your estimator, but since body mass is somewhat genetic, that would give undue importance to one particular family. At the opposite extreme, you could use

$$\frac{1}{2} Y_1 + \frac{1}{4} Y_2 + \frac{1}{4} Y_3$$

(where $Y_1$ is the mass of the non-brother). But that would be going too
far, as it gives the non-brother as much importance as the two brothers put
together. Probably the best answer is somewhere in between. Exactly
*where* in between depends on the correlation between masses of brothers,
which is a quantity we might reasonably estimate from data gathered elsewhere
in the world.
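As a small numerical sketch of this trade-off, here is the variance of the two candidate estimators under an assumed correlation between the brothers’ masses. The value of the correlation (0.5) and the zero correlation between unrelated men are illustrative assumptions, not figures from the text.

```python
import numpy as np

# Hypothetical setup: Y1 is the non-brother; Y2 and Y3 are the brothers,
# whose masses have an assumed correlation rho. All three have the same
# variance, so we work in units of the population variance.
rho = 0.5  # assumed brother-brother correlation (illustrative)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, rho],
              [0.0, rho, 1.0]])

def estimator_variance(a, R):
    """Variance of the estimator sum(a_i * Y_i), in units of the population variance."""
    a = np.asarray(a, dtype=float)
    return float(a @ R @ a)

print(estimator_variance([1/3, 1/3, 1/3], R))  # equal weights
print(estimator_variance([1/2, 1/4, 1/4], R))  # non-brother up-weighted
```

Under this assumed correlation, the second weighting has the smaller variance, and the best weighting lies somewhere between the two extremes.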

(There’s a deliberate echo here of something I wrote previously: in what proportions should we sow poppies, Polish wheat and Persian wheat in order to maximize biological diversity? The similarity is no coincidence.)

There are several qualities we might seek in an estimator. I’ll focus on two.

*High precision.* The **precision** of an estimator is the reciprocal of its variance. To make sense of this, you have to realize that estimators are random variables too! An estimator with high precision, or low variance, is not much changed by the effects of randomness. It will give more or less the same answer if you run it multiple times.

For instance, suppose we’ve decided to do the Elbonian survey by asking 30 people in each of the 5 biggest cities and 20 people from each of 3 chosen villages, then taking some specific weighted mean of the resulting data. If that’s a high-precision estimator, it will give more or less the same final answer no matter which specific Elbonians happen to have been stopped by the pollsters.

*Unbiased.* An estimator of some statistic is **unbiased** if its expected value is equal to that statistic for the population.

For example, suppose we’re trying to estimate the variance of some distribution. If our sample consists of a measly two individuals, then the variance of the sample is likely to be much less than the variance of the population. After all, with only two individuals observed, we’ve barely begun to glimpse the full variation of the population as a whole. It can actually be shown that with a sample size of two, the expected value of the sample variance is half the population variance. So the sample variance is a biased estimator of the population variance, but twice the sample variance is an unbiased estimator.
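The factor-of-two claim is easy to check by simulation. A minimal sketch, with an arbitrary choice of distribution and seed:

```python
import numpy as np

# For samples of size two, the expected value of the (uncorrected) sample
# variance is half the population variance, so doubling it removes the bias.
rng = np.random.default_rng(0)
pop_var = 4.0  # population is N(0, 4); the value 4.0 is an arbitrary choice
samples = rng.normal(0.0, np.sqrt(pop_var), size=(200_000, 2))

sample_var = samples.var(axis=1)  # ddof=0: mean squared deviation in each pair
print(sample_var.mean())          # close to pop_var / 2 = 2.0
print((2 * sample_var).mean())    # close to pop_var = 4.0
```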

(Being unbiased is perhaps a less crucial property of an estimator than it might at first appear. Suppose the boss of a chain of pizza takeaways wants to know the average size of pizzas ordered. “Size” could be measured by diameter — what you order by — or area — what you eat. But since the relationship between diameter and area is quadratic rather than linear, an unbiased estimator of one will be a biased estimator of the other.)

No matter what statistic you’re trying to estimate, you can talk about the “effective sample size” of an estimator. But for simplicity, I’ll only talk about estimating the mean.

Here’s a loose definition:

The **effective sample size** of an estimator of the population mean is the number $n_{\mathrm{eff}}$ with the property that our estimator has the same precision (or variance) as the estimator got by sampling $n_{\mathrm{eff}}$ independent individuals.

Let’s unpack that.

Suppose we choose $n$ individuals at random from the population (with replacement, if you care). So we have independent, identically distributed random variables $Y_1, \ldots, Y_n$. As above, we take the sample mean

$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n$$

as our estimator of the population mean. Since variance is additive for
*independent* random variables, the variance of this estimator is

$$n \cdot \mathrm{Var}\Bigl( \frac{1}{n} Y_1 \Bigr) = n \cdot \frac{1}{n^2} \mathrm{Var}(Y_1) = \frac{\sigma^2}{n}$$

where $\sigma^2$ is the population variance. The precision of the estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample size $n$ increases, the precision of your estimate increases too.
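A quick numerical check of this: the mean of $n$ independent draws has variance $\sigma^2/n$. The distribution and seed below are arbitrary illustrative choices.

```python
import numpy as np

# The sample mean of n independent draws has variance sigma^2 / n,
# i.e. precision n / sigma^2.
rng = np.random.default_rng(1)
sigma2, n = 9.0, 25
means = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n)).mean(axis=1)

print(means.var())           # close to sigma2 / n = 0.36
print(sigma2 / means.var())  # effective sample size of the sample mean: close to n
```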

Now, suppose we have some other estimator $\hat{\mu}$ of the population mean. It’s a random variable, so it has a variance $\mathrm{Var}(\hat{\mu})$. The effective sample size of the estimator $\hat{\mu}$ is the number $n_{\mathrm{eff}}$ satisfying

$$\sigma^2/n_{\mathrm{eff}} = \mathrm{Var}(\hat{\mu}).$$

This doesn’t entirely make sense, as the unique number $n_{\mathrm{eff}}$ satisfying
this equation needn’t be an integer, so we can’t sensibly talk about a
sample of size $n_{\mathrm{eff}}$. Nevertheless, we can absolutely rigorously
define the **effective sample size** of our estimator $\hat{\mu}$ as

$$n_{\mathrm{eff}} = \sigma^2/\mathrm{Var}(\hat{\mu}).$$

And that’s the definition. Differently put,

$$\text{effective sample size} = \text{precision} \times \text{population variance}.$$
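The definition connects back to the opening survey question. For one cluster of $n$ respondents, all pairwise correlated with the same correlation $\rho$, a standard design-effect calculation (not derived in this post) gives the sample mean variance $\sigma^2(1 + (n-1)\rho)/n$, hence the effective sample size below. A minimal sketch under that assumption:

```python
# Design-effect formula for an equicorrelated cluster (standard fact,
# assumed here rather than derived in the post):
#     Var(sample mean) = sigma^2 (1 + (n - 1) rho) / n
#     n_eff            = n / (1 + (n - 1) rho)
def n_eff_equicorrelated(n, rho):
    return n / (1 + (n - 1) * rho)

print(n_eff_equicorrelated(100, 0.0))  # 100.0: independent respondents
print(n_eff_equicorrelated(100, 0.1))  # about 9.2: one big cluster is costly
print(n_eff_equicorrelated(20, 0.1))   # a village of 20 acts like about 6.9 independents
```

Even a modest within-cluster correlation drastically shrinks the effective sample size, which is why surveying one central square is a bad idea.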

**Trivial examples.** If $\hat{\mu}$ is the mean value of $n$ uncorrelated individuals, then the effective sample size is $n$. If $\hat{\mu}$ is the mean value of $n$ extremely highly correlated individuals, then the variance of the estimator is little less than the variance of a single individual, so the effective sample size is little more than $1$.

Now, suppose our pollsters have come back from their trips to various parts of Elbonia. Together, they’ve asked $n$ individuals how much they trust the president. We want to take that data and use it to estimate the population mean — that is, the mean level of trust in the president across Elbonia — in as precise a way as possible.

We’re going to restrict ourselves to unbiased estimators, so that the
expected value of the estimator is the population mean. We’re also going
to consider only **linear estimators**: those of the form

$$a_1 Y_1 + \cdots + a_n Y_n$$

where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$ Elbonians surveyed.

**Question:** What choice of unbiased linear estimator maximizes the effective sample size?

To answer this, we need to recall some basic statistical notions…

### Correlation and covariance

Variance is a quadratic form, and covariance is the corresponding bilinear
form. That is, take two random variables $X$ and $Y$, with respective
means $\mu_X$ and $\mu_Y$. Then their **covariance** is

$$\mathrm{Cov}(X, Y) = E\bigl((X - \mu_X)(Y - \mu_Y)\bigr).$$

This is bilinear in $X$ and $Y$, and $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.

$\mathrm{Cov}(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$, the
product of the standard deviations. It’s natural to normalize, dividing
through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$.
This gives the **correlation coefficient**

$$\rho_{X, Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1].$$

Alternatively, we can first scale $X$ and $Y$ to have variance $1$, then take the covariance, and this also gives the correlation:

$$\rho_{X, Y} = \mathrm{Cov}(X/\sigma_X, Y/\sigma_Y).$$

Now suppose we have $n$ random variables, $Y_1, \ldots, Y_n$. The
**correlation matrix** $R$ is the $n \times n$ matrix whose $(i, j)$-entry
is $\rho_{Y_i, Y_j}$. Correlation matrices have some easily-proved properties:

- The entries are all in $[-1, 1]$.

- The diagonal entries are all $1$.

- The matrix is symmetric.

- The matrix is positive semidefinite. That’s because the corresponding quadratic form is $(a_1, \ldots, a_n) \mapsto \mathrm{Var}\bigl(\sum a_i Y_i/\sigma_i\bigr)$, and variances are nonnegative.

And actually, it’s not so hard to prove that *any* matrix with these
properties is the correlation matrix of some sequence of random variables.

In what follows, for simplicity, I’ll quietly assume that the correlation
matrices we encounter are *strictly* positive definite. This only amounts to
assuming that no linear combination of the $Y_i$s has variance zero —
in other words, that there are no *exact* linear relationships between the
random variables involved.

### Back to the main question

Here’s where we got to. We surveyed $n$ individuals from our population,
giving $n$ identically distributed *but not necessarily independent* random
variables $Y_1, \ldots, Y_n$. Some of them will be correlated because of
geographical clustering.

We’re trying to use this data to estimate the population mean in as precise a way as possible. Specifically, we’re looking for numbers $a_1, \ldots, a_n$ such that the linear estimator $\sum a_i Y_i$ is unbiased and has the maximum possible effective sample size.

The effective sample size was defined as $n_{\mathrm{eff}} = \sigma^2/\mathrm{Var}\bigl(\sum a_i Y_i\bigr)$, where $\sigma^2$ is the variance of the distribution we’re drawing from. Now we need to work out the variance in the denominator.

Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$. I said a moment ago that $(a_1, \ldots, a_n) \mapsto \mathrm{Var}\bigl(\sum a_i Y_i\bigr)$ is the quadratic form corresponding to the bilinear form represented by the covariance matrix. Since each $Y_i$ has variance $\sigma^2$, the covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence

$$\mathrm{Var}(a_1 Y_1 + \cdots + a_n Y_n) = \sigma^2 \cdot a^\ast R a$$

where $\ast$ denotes a transpose and $a = (a_1, \ldots, a_n)$.

So, the effective sample size of our estimator is

$$1/a^\ast R a.$$

We also wanted our estimator to be unbiased. Its expected value is

$$E(a_1 Y_1 + \cdots + a_n Y_n) = (a_1 + \cdots + a_n) \mu$$

where $\mu$ is the population mean. So, we need $\sum a_i = 1$.

Putting this together, the maximum possible effective sample size among all unbiased linear estimators is

$$\sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$$

Which $a \in \mathbb{R}^n$ achieves this maximum, and what *is* the maximum
possible effective sample size? That’s easy, and in fact it’s something
that’s appeared many times at this blog before…

### The magnitude of a matrix

The **magnitude** $|R|$ of an invertible $n \times n$ matrix $R$ is the sum of
all $n^2$ entries of $R^{-1}$. To calculate it, you don’t need to go as
far as inverting $R$. It’s much easier to find the unique column vector
$w$ satisfying

$$R w = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$

(the **weighting** of $R$), then calculate $\sum_i w_i$. This sum is the
magnitude of $R$, since $w_i$ is the $i$th row-sum of $R^{-1}$.

Most of what I’ve written about magnitude has been in the situation where we start with a finite metric space $X = \{x_1, \ldots, x_n\}$, and we use the matrix $Z$ with entries $Z_{ij} = \exp(-d(x_i, x_j))$. This turns out to give interesting information about $X$. In the metric situation, the entries of the matrix $Z$ are between $0$ and $1$. Often $Z$ is positive definite (e.g. when $X \subset \mathbb{R}^n$), as correlation matrices are.

When $R$ is positive definite, there’s a third way to describe the magnitude:

$$|R| = \sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$$

The supremum is attained just when $a = w/|R|$, and the proof is a simple application of the Cauchy–Schwarz inequality.

But that supremum is exactly the expression we had for maximum effective sample size! So:

The maximum possible value of $n_{\mathrm{eff}}$ is $|R|$.

Or more wordily:

The maximum effective sample size of an unbiased linear estimator of the mean is the magnitude of the sample correlation matrix.

Or wordily but approximately:

Effective sample size $=$ magnitude of correlation matrix.

Moreover, we know how to attain that maximum. It’s attained if and only if our estimator is

$$\frac{1}{|R|} (w_1 Y_1 + \cdots + w_n Y_n)$$

where $w = (w_1, \ldots, w_n)$ is the weighting of the correlation matrix.
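A simulation sketch ties all of this together: draw correlated samples, then compare the plain sample mean against the magnitude-optimal estimator $\frac{1}{|R|}\sum_i w_i Y_i$. The particular correlation matrix, mean, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 5.0, 1.0
R = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.6],
              [0.6, 0.6, 1.0]])

w = np.linalg.solve(R, np.ones(3))  # weighting: R w = (1, 1, 1)
magR = float(w.sum())               # magnitude |R|, the best possible n_eff
a_opt = w / magR                    # optimal unbiased weights (they sum to 1)

Y = rng.multivariate_normal([mu] * 3, sigma2 * R, size=200_000)
naive = Y.mean(axis=1)              # equal-weights estimator
best = Y @ a_opt                    # magnitude-optimal estimator

# Empirical effective sample sizes: sigma^2 / Var(estimator)
print(sigma2 / naive.var(), sigma2 / best.var(), magR)
```

Both estimators are unbiased, but the optimal one squeezes a noticeably larger effective sample size out of the same three correlated observations.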

I’m not too sure where this “result” — observation, really — comes from. I learned it from the statistician Paul Blackwell at Sheffield, who, like me, had been reading this paper:

Andrew Solow and Stephen Polasky, Measuring biological diversity. *Environmental and Ecological Statistics* 1 (1994), 95–103.

In turn, Solow and Polasky refer to this:

Morris Eaton, A group action on covariances with applications to the comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong (eds.), *Stochastic Inequalities: Papers from the AMS–IMS–SIAM Joint Summer Research Conference held in Seattle, Washington, July 1991*, Institute of Mathematical Statistics Lecture Notes — Monograph Series, Volume 22, 1992.

But the result is so simple that I’d imagine it’s much older. I’ve been wondering whether it’s essentially the Gauss–Markov theorem; I thought it was, then I thought it wasn’t. Does anyone know?

### The surprising behaviour of effective sample size

You might expect the effective size of a sample of $n$ individuals to be at most $n$. It’s not.

You might expect the effective sample size to go down as the correlations within the sample go up. It doesn’t.

This behaviour appears in even the simplest nontrivial example:

**Example.** Suppose our sample consists of just two individuals. Call the sampled values $Y_1$ and $Y_2$, and write the correlation matrix as

$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Then the maximum-precision unbiased linear estimator is $\frac{1}{2}(Y_1 + Y_2)$, and its effective sample size is

$$|R| = \frac{2}{1 + \rho}.$$

As the correlation $\rho$ between the two variables increases from $0$ to $1$, the effective sample size decreases from $2$ to $1$, as you’d expect.

But when $\rho < 0$, the effective sample size is *greater* than 2. In fact, as $\rho \to -1$, the effective sample size tends to $\infty$. That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing $Y_1 = \mu + \varepsilon_1$ and $Y_2 = \mu + \varepsilon_2$, we have $\varepsilon_1 \approx -\varepsilon_2$, and so $\frac{1}{2}(Y_1 + Y_2)$ is a very good estimator of $\mu$. In the extreme, when $\rho = -1$, it’s an *exact* estimator of $\mu$ — it’s infinitely precise.
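The two-individual formula is easy to verify numerically: compute the magnitude of the $2 \times 2$ correlation matrix via its weighting and compare with $2/(1 + \rho)$.

```python
import numpy as np

# Magnitude of the 2x2 correlation matrix, checked against the closed form
# 2 / (1 + rho); it exceeds 2 exactly when rho < 0.
def magnitude_2x2(rho):
    R = np.array([[1.0, rho], [rho, 1.0]])
    return float(np.linalg.solve(R, np.ones(2)).sum())

for rho in (0.5, 0.0, -0.5, -0.9):
    print(rho, magnitude_2x2(rho), 2 / (1 + rho))
```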

The fact that the effective sample size can be greater than the actual sample size seems to be very well known. For instance, there’s a whole page about it in the documentation for Q, which is apparently “analysis software for market research”.

What’s interesting is that this doesn’t only occur when some of the variables are negatively correlated. It can also happen when all the correlations are nonnegative, as in the following example from the paper by Eaton cited above.

**Example**  Consider the correlation matrix
$$R = \begin{pmatrix} 1 & 0 & \rho \\ 0 & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}$$
where $0 \leq \rho < \sqrt{2}/2 = 0.707\ldots$. This is positive definite, so it’s the correlation matrix of some random variables $Y_1, Y_2, Y_3$.

A routine computation shows that
$$|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$$
As we’ve shown, this is the effective sample size of a maximum-precision unbiased linear estimator of the mean.
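That routine computation is easy to sanity-check numerically. Here’s a sketch, again computing the effective sample size $|R|$ as the sum of the entries of $R^{-1}$:

```python
import numpy as np

def R3(rho):
    """The 3x3 correlation matrix from Eaton's example."""
    return np.array([[1.0, 0.0, rho],
                     [0.0, 1.0, rho],
                     [rho, rho, 1.0]])

def ess(R):
    """Effective sample size: sum of the entries of R^{-1}."""
    return float(np.sum(np.linalg.inv(R)))

def closed_form(rho):
    return (3 - 4 * rho) / (1 - 2 * rho**2)

# The two agree across the allowed range 0 <= rho < sqrt(2)/2:
for rho in [0.0, 0.3, 0.5, 0.7]:
    assert abs(ess(R3(rho)) - closed_form(rho)) < 1e-9
```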

When $\rho = 0$, it’s $3$, as you’d expect: the variables are uncorrelated. As $\rho$ increases, $|R|$ decreases, again as you’d expect: more correlation between the variables leads to a smaller effective sample size. This behaviour continues until $\rho = 1/2$, where $|R| = 2$.

But then something strange happens. As $\rho$ increases from $1/2$ to $\sqrt{2}/2$, the effective sample size increases from $2$ to $\infty$.

*Increasing the correlation increases the effective sample size.* For instance, when $\rho = 0.7$, we have $|R| = 10$ — the maximum-precision estimator is as precise as if we’d chosen $10$ independent individuals. For that value of $\rho$, the maximum-precision estimator turns out to be
$$\frac{3}{2} Y_1 + \frac{3}{2} Y_2 - 2 Y_3.$$
Go figure!
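For the curious, those weights $\frac{3}{2}, \frac{3}{2}, -2$ can be recovered directly: the maximum-precision unbiased linear estimator has weight vector $R^{-1}\mathbf{1}$, normalized to sum to $1$ (the usual generalized-least-squares recipe, consistent with the examples above). A NumPy check at $\rho = 0.7$:

```python
import numpy as np

rho = 0.7
R = np.array([[1.0, 0.0, rho],
              [0.0, 1.0, rho],
              [rho, rho, 1.0]])

# Solve R w_raw = 1; then 1^T R^{-1} 1 = w_raw.sum() is the
# effective sample size, and w_raw / w_raw.sum() gives the weights
# of the maximum-precision unbiased linear estimator.
ones = np.ones(3)
w_raw = np.linalg.solve(R, ones)
ess = w_raw.sum()
w = w_raw / ess

print(ess)  # ≈ 10
print(w)    # ≈ [1.5, 1.5, -2]
```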

These examples may seem counterintuitive, but Eaton cautions us to beware of our feeble intuitions:

> These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.

This is very like the fact that a metric space with $n$ points can have magnitude (“effective number of points”) greater than $n$, even if the associated matrix $Z$ is positive definite.

Anyone with any statistical knowledge who’s still reading will easily have picked up on the fact that I’m a total amateur. If that’s you, I’d love to hear your comments!