The Farey Mean?

I had never heard of the Farey mean, but here it is, brought to you by @howie_hua.

When you add two fractions, you of course remember that you should never just add the numerators and add the denominators across. The resulting fraction will not be the sum of the two addends. But if you do add across (under certain conditions which I’ll show below), the result will be a fraction between the two “addend” fractions. So, you can use the add-across method to find a fraction between two other fractions.

For example, \(\mathtt{\frac{1}{2}+\frac{4}{3}\rightarrow\frac{5}{5}}\). The first “addend” is definitely less than 1, and the second definitely greater than 1. The Farey mean here is exactly 1 (or \(\mathtt{\frac{5}{5}}\)), which is between the two “addend” fractions.

Why Does It Work?

Since this site is fast becoming all linear algebra all the time, let’s throw some linear algebra at this. What we want to show is that, given \(\mathtt{\frac{a}{b}<\frac{c}{d}}\) (we'll go with this assumption for now), \[\mathtt{\frac{a}{b}<\frac{a+c}{b+d}<\frac{c}{d}}\]

for certain positive integer values of \(\mathtt{a,b,c,}\) and \(\mathtt{d}\). I would probably do better to make those inequality signs less-than-or-equal-tos, but let’s stick with this for the present. We’ll start by representing the fraction \(\mathtt{\frac{a}{b}}\) as the vector \(\scriptsize\begin{bmatrix}\mathtt{b}\\\mathtt{a}\end{bmatrix}\) along with the fraction \(\mathtt{\frac{c}{d}}\) as the vector \(\scriptsize\begin{bmatrix}\mathtt{d}\\\mathtt{c}\end{bmatrix}\).

We’re looking specifically at the slopes or angles here (which is why we can represent a fraction as a vector in the first place), so we’ve made \(\scriptsize\begin{bmatrix}\mathtt{d}\\\mathtt{c}\end{bmatrix}\) have a greater slope to keep in line with our assumption above that \(\mathtt{\frac{a}{b}<\frac{c}{d}}\).

The fraction \(\mathtt{\frac{a+c}{b+d}}\) is the same as the vector \(\scriptsize\begin{bmatrix}\mathtt{b+d}\\\mathtt{a+c}\end{bmatrix}\). And since this vector is the diagonal of the vector parallelogram, it will of course have a greater slope than \(\mathtt{\frac{a}{b}}\) but less than \(\mathtt{\frac{c}{d}}\). You can keep going forever—just take one of the side vectors and use the diagonal vector as the other side. So long as you’re making parallelograms, you’ll get a new diagonal shallower than the two side vectors, and the result will be a fraction between the other two.

Incidentally, our assumption at the beginning that \(\mathtt{\frac{a}{b}<\frac{c}{d}}\) doesn't really matter to this picture. If we make \(\mathtt{\frac{c}{d}}\) less than \(\mathtt{\frac{a}{b}}\), our picture simply flips. The diagonal vector still has to be located between the two side vectors.

What Doesn’t Work?

The linear algebra picture of this concept also tells us where this method fails to find a fraction between the two addend fractions. When the two “addend” fractions are equivalent, \(\mathtt{c}\) and \(\mathtt{d}\) are multiples of \(\mathtt{a}\) and \(\mathtt{b}\), respectively, or vice versa. In that case, the resulting fraction looks like this.

The slopes or angles for both addends and for the result are the same, producing a Farey mean that is equal to both fractions.

Cosine Similarity and Correlation

I wrote a lesson not too long ago that started with a Would You Rather? survey activity. For our purposes here, we can pretend that each question had a Likert scale from 1–10 attached to it, though in reality, the lesson was about categorical data.

At any rate, here are the questions—edited a bit. Feel free to rate your answers on the scales provided. Careful! Once you click, you lock in your answer.

Would you rather . . .

  1. be able to fly (1) or be able to read minds (10)?

  1. go way back in time (1) or go way into the future (10)?

  1. be able to talk to animals (1) or speak all languages (10)?

  1. watch only historical movies (1) or sci-fi movies (10) for the rest of your life?

  1. be just a veterinarian (1) or just a musician (10)?

Finally, one last question that is not a would-you-rather. Once you’ve answered this and the rest of the questions, you can press the I'm Finished! button to submit your responses.

  1. Rate your fear of heights from (1) not at all afraid to (10) very afraid.

Check out the results so far.

Are Your Responses Correlated?

Next in the lesson, I move on to asking whether you think some of the survey responses are correlated. For example, if you scored “low” on the veterinarian-or-musician scale—meaning you would strongly prefer to be a veterinarian over a musician—would that indicate that you probably also scored “low” on Question (c) about talking to animals or speaking all the languages? In other words, are those two scores correlated? What about choosing the ability to fly and your fear of heights? Are those correlated? How could we measure this using a lot of responses from a lot of different people?

An ingenious way of looking at this question is by using cosine similarity from linear algebra. (We looked at the cosine of the angle between two vectors here and here.)

For example, suppose you really would rather have the ability to fly and you have almost no fear of heights. So, you answered Question (a) with a 1 and Question (f) with, say, a 2. Another person has no desire to fly and a terrible fear of heights, so they answer Question (a) with an 8 and Question (f) with a (10). From this description, we would probably guess that the two quantities wish-for-flight and fear-of-heights are strongly correlated. But we’ve also now got the vectors (1, 2) and (8, 10) to show us this correlation.

See that tiny angle between the vectors on the left? The cosine of tiny angles (as we saw) is close to 1, which indicates a strong correlation. On the right, you see the opposite idea. One person really wants to fly but is totally afraid of heights (1, 10) and another almost couldn’t care less about flying (or at least would really rather read minds) but has a low fear of heights (8, 2). The cosine of the close-to-90°-angle between these vectors will be close to 0, indicating a weak correlation between responses to our flight and heights questions.

But That’s Not the Ingenious Part

That’s pretty cool, but it is not, in fact, how we measure correlation. The first difficulty we encounter happens after adding more people to the survey, giving us several angles to deal with—not impossible, but pretty messy for a hundred or a thousand responses. The second, more important, difficulty is that the graph on the right above doesn’t show a weak correlation; it shows a strong negative correlation. Given just the two response pairs to work from in that graph, we would have to conclude that a strong fear of heights would make you more likely to want the ability to fly (or vice versa) rather than less likely. But the “weakest” the cosine can measure in this kind of setup is 0.

The solution to the first difficulty is to take all the x-components of the responses and make one giant vector out of them. Then do the same to the y-components. Now we’ve got just two vectors to compare! For our data on the left, the vectors (1, 2) and (8, 10) become (1, 8) and (2, 10). The vectors on the right—(1, 10) and (8, 2)—become (1, 8) and (10, 2).

The solution to the second difficulty—no negative correlations—we can achieve by centering the data. Let’s take our new vectors for the right, uncorrelated, graph: (1, 8) and (10, 2). Add the components in each vector and divide by the number of components (2) to get an average. Then subtract the average from each component. So, our new centered vectors are

(1 – ((1 + 8) ÷ 2), 8 – ((1 + 8) ÷ 2)) and (10 – ((10 + 2) ÷ 2), 2 – ((10 + 2) ÷ 2))

Or (–3.5, 3.5) and (4, –4). It’s probably not to tough to see that a vector in the 2nd quadrant and a vector in the 4th quadrant are heading in opposite directions. And these vectors now form a close-to-180° angle, and the cosine of 180° is –1 which is the actual lowest correlation we can get, indicating a strong negative correlation.

And That’s Correlation

To summarize, the way to determine correlation linear-algebra style is to determine the cosine of the centered x- and y-vectors of the data. That formula is \[\mathtt{\frac{(x-\overline{x}) \cdot (y-\overline{y})}{|x-\overline{x}||y-\overline{y}|} = cos(θ)}\]

Which is just another way of writing the more common version of the r-value correlation.


The Formula for Combinations

And now, finally, let’s get to the formula for combinations. The math in my last post got a little tricky toward the end, with the strange exclamation mark notation floating around. So let’s recap permutations without that notation.

4
× 3
× 3 × 2
× 3 × 2 × 1
÷ (1 × 2 × 3)
÷ (1 × 2)
÷ 1

You should see that, to traverse a tree diagram, we multiply by the tree branches, cumulatively, to move right, and then divide by those branches—again, cumulatively—to move left. The formula for the number of permutations of 2 cards chose from 4, \(\mathtt{\frac{4!}{(4-2)!}}\), tells us to multiply all the way to the right, to get 24 in the numerator and then divide two steps to the left (divide by \(\mathtt{(4-2)!}\) or 2) to get 12 permutations of 2 cards chosen from 4.

Combinations

An important point about the above is that the number of permutations of \(\mathtt{r}\) cards chosen from \(\mathtt{n}\) cards, \(\mathtt{_{n}P_r}\), is a subset of the number of permutations of \(\mathtt{n}\) cards, \(\mathtt{n!}\) The tree diagram shows \(\mathtt{n!}\) and contained within it are \(\mathtt{_{n}P_r}\).

Combinations of \(\mathtt{r}\) items chosen from \(\mathtt{n}\), denoted as \(\mathtt{_{n}C_r}\), are a further subset. That is, \(\mathtt{_{n}C_r}\) are a subset of \(\mathtt{_{n}P_r}\). In our example of 2 cards chosen from 4, \(\mathtt{_{n}P_r}\) represents the first two columns of the tree diagram combined. In those columns, we have, for example, the permutations JQ and QJ. But these two permutations represent just one combination. The same goes for the other pairs in those columns. Thus, we can see that to get the number of combinations of 2 cards chosen from 4, we take \(\mathtt{_{n}P_r}\) and divide by 2. So, \[\mathtt{\frac{4!}{(4-2)!}\div 2=\frac{4!}{(4-2)!\cdot2}}\]

What about combinations of 3 cards chosen from 4? That’s the first 3 columns combined. Now the repeats are, for example, JQK, JKQ, QJK, QKJ, KJQ, KQJ. Which is 6. Noticing the pattern? For \(\mathtt{_{4}C_2}\), we divide \(\mathtt{_{4}P_2}\) further by 2! For \(\mathtt{_{4}C_3}\), we divide \(\mathtt{_{4}P_3}\) further by 3! We’re dividing (further) by \(\mathtt{r!}\)

When you think about it, this makes sense. We need to collapse every permutation of \(\mathtt{r}\) cards down to 1 combination. So we divide by \(\mathtt{r!}\) Here, finally then, is the formula for combinations: \[\mathtt{_{n}C_r=\frac{n!}{(n-r)!r!}}\]


And Now for the Legal Formula

So, did you come up with a working rule to describe the pattern we looked at last time? Here’s what I came up with:

As we saw last time, the “root” of the tree diagram (the first column) shows \(\mathtt{_{4}P_1}\), which is the number of permutations of 1 card chosen from 4. The first and second columns combined show \(\mathtt{_{4}P_2}\), the number of permutations of 2 cards chosen from 4. So, to determine \(\mathtt{_{n}P_r}\), according to this pattern, we start with \(\mathtt{n}\) and then multiply \(\mathtt{(n-1)(n-2)}\) and so on until we reach \(\mathtt{n-(r-1)}\).

The number of permutations of, say, 3 items chosen from 5, then, would be \[\mathtt{_{5}P_3=5\cdot (5-1)(5-2)=60}\]

This is a nice rule that works every time for permutations of \(\mathtt{r}\) things chosen from \(\mathtt{n}\) things. It can even be represented a little more ‘mathily’ as \[\mathtt{_{n}P_r=\prod_{k=0}^{r-1}(n-k)}\]

So let’s move on to the “legal” formula for \(\mathtt{_{n}P_r}\). A quick sidebar on notation, though, which we’ll need in a moment.

When we count the number of permutations at the end of a tree diagram, what we get is actually \(\mathtt{_{n}P_n}\). In our example, that’s \(\mathtt{_{4}P_4}\). The way we write this amount is with an exclamation mark: \(\mathtt{n!}\), or, in our case, \(\mathtt{4!}\) What \(\mathtt{4!}\) means is \(\mathtt{4\times(4-1)\times(4-2)\times(4-3)}\) according to our rule above, or just \(\mathtt{4\times3\times2\times1}\). And \(\mathtt{3!}\) is \(\mathtt{3\times(3-1)\times(3-2)}\), or just \(\mathtt{3\times2\times1}\).

In general, we can say that \(\mathtt{n!=n\times(n-1)!}\) So, for example, \(\mathtt{4!=4\times3!}\) etc. And since this means that \(\mathtt{1!=1\times(1-1)!}\), that means that \(\mathtt{0!=1}\).

So, for the tree diagram, \(\mathtt{_{4}P_4}\) means multiplying all the way to the right by \(\mathtt{n!}\). But if we’re interested in the number of arrangements of \(\mathtt{r}\) cards chosen from \(\mathtt{n}\) cards, then we need to come back to the left by \(\mathtt{(n-r)!}\) And since moving right is multiplying, moving left is dividing.

4
4 × 3
4 × 3 × 2
4 × 3 × 2 × 1
÷ (4 – 1)!
÷ (4 – 2)!
÷ (4 – 3)!
   ÷ (4 – 4)!

The division we need is not immediately obvious, but if you study the tree diagram above, I think it’ll make sense. This gives us, finally, the “legal” formula for the number of permutations of \(\mathtt{r}\) items from \(\mathtt{n}\) items: \[\mathtt{_{n}P_r=\frac{n!}{(n-r)!}}\]


A New Formula for Permutations?

Last time, we saw that combinations are a subset of permutations, and we wondered what the relationship between the two is. Before we get there, though, let’s look at another possible relationship—one we only hinted at last time. And to examine this relationship, we’ll use a tree diagram.

Tree Diagram

This tree diagram shows the number of permutations of the 4 cards J, Q, K, A—the number of ways we can arrange the 4 cards. The topmost branch shows the result JQKA. And you can see all 24 results from our list last time here in the tree diagram.

4
4 × 3
4 × 3 × 2
4 × 3 × 2 × 1
÷ 3!
÷ 2!
÷ 1!
   ÷ 0!

Here’s where, normally, people would talk about the multiplication 4 × 3 × 2 × 1 and tell you that another way to write that is with an exclamation mark: 4! But that’s skipping over something important.

And that something important is this: Notice that the first column of the tree diagram—the root of the tree—shows 4 items. This is the number of different permutations you can make of just 1 card, chosen from 4 different cards. And the first and second columns combined show the number of permutations you can make of 2 cards, chosen from 4 cards (JQ, JK, JA, etc.).

And so on. You might think that to go from “permutations of 1 card chosen from 4″ to “permutations of 2 cards chosen from 4″ you would multiply by 2. But of course that’s not right (and the tree diagram tells us so). You actually multiply 4 by 4 – 1. And to go from “permutations of 1 card chosen from 4″ to “permutations of 3 cards chosen from 4″ you multiply 4 • (4 – 1) • (4 – 2).

We’re on the verge of being able to describe the relationship, which I’ll put in question form (and mix in some notation to):

What is the relationship between the number of permutations of \(\mathtt{n}\) things, \(\mathtt{P(n)}\), and the number of permutations of \(\mathtt{r}\) things chosen from \(\mathtt{n}\) things, \(\mathtt{_{n}P_r}\)?

We can see from our example above that \(\mathtt{P(4)=24}\). That is, the number of permutations of 4 things is 24. But we also noticed these three results: $$\begin{aligned}_{\mathtt{4}}\mathtt{P}_{\mathtt{1}}&= \mathtt{\,\,4}\cdot \mathtt{1} \\ _{\mathtt{4}}\mathtt{P}_{\mathtt{2}}&= (\mathtt{4}\cdot \mathtt{1})(\mathtt{4}-\mathtt{1}) \\ _{\mathtt{4}}\mathtt{P}_{\mathtt{3}}&= (\mathtt{4}\cdot \mathtt{1})(\mathtt{4}-\mathtt{1})(\mathtt{4}-\mathtt{2})\end{aligned}$$

A New Formula?

Study the pattern above and see if you can write a rule that will get you the correct result for any \(\mathtt{_{n}P_r}\). Check your results here (for example, for \(\mathtt{_{16}P_{12}}\), you can just enter 16P12 and press Enter).

The rule you write, if you get it right, won’t be an algorithm. But it’ll work every time! This is the step we always skip when teaching about permutations! The next step is to think hard about why it works. We’ll get to the “legal” formula for permutations next time.

Permutations & Combinations

I have now been blogging for 16 years, and my very first post (long gone) was on combinations and permutations. So, it’s fun to come back to the idea now. In 2004, my experience with the two concepts was limited to how textbooks often used the awkward “care about order” (permutations) or “don’t care about order” (combinations) language to introduce the ideas. So, that’s what I wrote about then. Now I want to talk about how the two concepts are related.

What They Are

When you count permutations, you count how many different ways you can sequentially arrange some things. When you count combinations, you count how many ways you can have some things. So, given 2 cards, there are 2 different ways you can sequentially arrange 2 cards, but given 2 cards, there’s just one way to have 2 cards.

Right off the bat, the language is weird, and it’s hard to see why combinations should ever be a thing (there’s always just 1 way to have a set of things). But combinations make better sense when you are not choosing from all the elements you are given.

So, for example, how many permutations and combinations can I make of 2 cards, chosen from a total of 3 cards?

Now having the two categories of permutation and combination makes a little more sense. There are 6 permutations of 2 cards chosen from 3 cards and there are 3 combinations of 2 cards chosen from 3 cards. That is, there are 6 different ways to sequentially arrange 2 cards chosen from 3 and just 3 different ways to have 2 cards chosen from 3. And you can see, by the way, that the combinations are a subset of the permutations.

In fact, let’s do an example with 4 cards to show the actual relationship between permutations and combinations. Here we’ll just use letters to save space. The permutations of JQKA if we choose 3 cards are:

JQK, JKQ, QJK, QKJ, KJQ, KQJ
JQA, JAQ, QJA, QAJ, AJQ, AQJ
QKA, QAK, KQA, KAQ, AQK, AKQ
KAJ, KJA, JKA, JAK, AJK, AKJ

That’s 24 permutations. For combinations, we get JQK, JQA, QKA, KAJ. That’s 4 combinations. What’s the relationship? We’ll come back on that next time.

Rotating Coordinate Systems

I‘ve just started with Six Not-So-Easy Pieces, based on Feynman’s famous lectures, and already there’s some decently juicy stuff. In the beginning, Feynman discusses the symmetry of physical laws—that is, the invariance of physical laws under certain transformations (like rotations):

If we build a piece of equipment in some place and watch it operate, and nearby we build the same kind of apparatus but put it up on an angle, will it operate in the same way?

He goes on to explain that, of course, a grandfather clock will not operate in the same way under specific rotations. Assuming the invariance of physical laws under rotations, this change in operation tells us something interesting: that the operation of the clock is dependent on something outside of the “system” that is the clock itself.

The theorem is then false in the case of the pendulum clock, unless we include the earth, which is pulling on the pendulum. Therefore we can make a prediction about pendulum clocks if we believe in the symmetry of physical law for rotation: something else is involved in the operation of a pendulum clock besides the machinery of the clock, something outside it that we should look for.

Rotation Coordinates

Feynman then proceeds with a brief mathematical analysis of forces under rotations. A somewhat confusing prelude to this is a presentation that involves expressing the coordinates of a rotated system in terms of the original system. He uses the diagram below to derive those coordinates (except for the blue highlighting, which I use to show what (x’, y’) looks like in Moe’s system). What we want is to express (x’, y’) in terms of x and y—to describe Moe’s point P in terms of Joe’s point P.

“We first drop perpendiculars from P to all four axes and draw AB perpendicular to PQ.”

The first confusion that is not dealt with (because Feynman makes the assumption that his audience is advanced students) is what angles in the diagram are congruent to θ shown. And here again we see the value of the easy-to-forget art of eyeballing and common sense in geometric reasoning.

The y’ axis is displaced just as much as the x’ axis by rotation, and “displaced just as much by rotation” is a perfectly good definition of angle congruence that we tend to forget after hundreds of hours of deriving work. The same reasoning applies to the rotational displacement from AP to AB. If we imagine rotating AP to AB, we see that we are starting perpendicular to the x-axis and ending perpendicular to the x’-axis. The y-to-y’ rotation does the same thing, so the displacement angle must be the same. So let’s put in those new thetas, only one of which we’ll need.

Inspection of the figure shows that x’ can be written as the sum of two lengths along the x’-axis, and y’ as the difference of two lengths along AB.

Here is x’ as the sum of two lengths (red and orange): \[\mathtt{x’=OA\cdot\color{red}{\frac{OC}{OA}}+AP\cdot\color{orange}{\frac{BP}{AP}}\quad\rightarrow\quad x\cdot\color{red}{cos\,θ}+y\cdot\color{orange}{sin\,θ}}\]

And here is y’ as the difference of two lengths (green – purple): \[\mathtt{y’=AP\cdot\color{green}{\frac{AB}{AP}}-OA\cdot\color{purple}{\frac{AC}{OA}}\quad\rightarrow\quad y\cdot\color{green}{cos\,θ}-x\cdot\color{purple}{sin\,θ}}\]

So, if Joe describes the location of point P to Moe, and the rotational displacement between Moe and Joe’s systems is known (and it is known that the two systems share an origin), Moe can use the manipulations above to determine the location of point P in his system.

Another, exactly equal, way of saying this—the way we said it when we talked about rotation matrices—is that, if we represent point P in Joe’s system as a position vector (x, y), then Moe’s point P vector is \[\small{\begin{bmatrix}\mathtt{x’}\\\mathtt{y’}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,\,cos\,θ} & \mathtt{sin\,θ}\\\mathtt{-sin\,θ} & \mathtt{cos\,θ}\end{bmatrix}\begin{bmatrix}\mathtt{x}\\\mathtt{y}\end{bmatrix}=\mathtt{x}\begin{bmatrix}\mathtt{\,\,\,\,\,cos\,θ}\\\mathtt{-sin\,θ}\end{bmatrix}+\mathtt{y}\begin{bmatrix}\mathtt{sin\,θ}\\\mathtt{cos\,θ}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,\,\,x\cdot \color{red}{cos\,θ}\,\,\,+y\cdot \color{orange}{sin\,θ}}\\\mathtt{-(x\cdot \color{purple}{sin\,θ})+y\cdot \color{green}{cos\,θ}}\end{bmatrix}}\]

The first rotation matrix above actually describes a clockwise rotation, which is both different from what we discussed at the link above (our final matrix there was for counterclockwise rotations) and unexpected, since we know that Moe’s system is a counterclockwise rotation of Joe’s system.

The resolution to that unexpectedness can again be found after a little eyeballing. The position vector for point P in Joe’s system is at a steep angle, whereas in Moe’s system, it is at a shallow angle. Only a clockwise rotation will change the coordinates in the appropriate way.

Projections

A projection is like the shadow of a vector, say \(\mathtt{u}\), on another vector, say \(\mathtt{v}\), if light rays were coming in across \(\mathtt{u}\) and perpendicular to \(\mathtt{v}\). For the vectors at the right, imagine a light source above and to the left of the illustration, perpendicular to the vector \(\mathtt{v}\). The projection, which we’ll call \(\mathtt{p}\), will be a vector that will extend from the point shown to where the end of the shadow of \(\mathtt{u}\) touches \(\mathtt{v}\).

If you take a moment maybe to read that description twice (because it’s kind of dense), you may be able to tell that the vector \(\mathtt{p}\) that we’re looking for will be some scalar multiple of vector \(\mathtt{v}\), since it will lie exactly on top of \(\mathtt{v}\). In fact, given the picture, the projection vector \(\mathtt{p}\) will have the opposite sign as \(\mathtt{v}\) and will have a scale factor pretty close to 0, since the projection vector looks like it will be much smaller than \(\mathtt{v}\).

That information is shown at the right. Helpfully, the vector \(\mathtt{u-p}\) is perpendicular to \(\mathtt{v}\), so we know that \(\mathtt{\left(u-p\right)\cdot v=0}\). And, using the Distributive Property, we get \(\mathtt{u\cdot v-p\cdot v=0}\).

Since we know that our sought-after vector \(\mathtt{p}\) will be some scalar multiple of \(\mathtt{v}\), we can substitute, say, \(\mathtt{cv}\) for \(\mathtt{p}\) in the above to get \(\mathtt{u\cdot v-cv\cdot v=0}\). And a property of dot product multiplication allows us to rewrite that as \(\mathtt{u\cdot v-c\left(v\cdot v\right)=0}\).

This means that \(\mathtt{u\cdot v=c\left(v\cdot v\right)}\), which means that the scale factor \(\mathtt{c}\) that we’re after—the factor we can multiply by \(\mathtt{v}\) to produce \(\mathtt{p}\)—is \[\mathtt{c=\frac{u\cdot v}{v\cdot v}\quad\rightarrow\quad p=\left(\frac{u\cdot v}{v\cdot v}\right)v}\]

The Law of Total Probability

Where has this been all my life? The Law of Total Probability is really cool, and it seems accessible enough to be presented in high school, where it would be very useful as well, I think, although I’ve never seen it there. For example, from the book Causal Inference in Statistics we get this nice problem (in addition to the quote below): “Suppose we roll two dice, and we want to know the probability that the second roll (R2) is higher than the first (R1).” The Law of Total Probability can make answering this much more straightforward.

To understand this ‘law,’ we should start by understanding two simple things about probability. First, for any two mutually exclusive events (the events can’t happen together), the probability of \(\mathtt{A}\) or \(\mathtt{B}\) is the sum of the probability of \(\mathtt{A}\) and the probability of \(\mathtt{B}\):

\(\mathtt{P(A\text{ or }B)\,\,\,\,=\,\,\,\,\,\,P(A)\,\,\,\,+\,\,\,\,P(B)}\)

Second, thinking now of two mutually exclusive events “A and B” and “A and not-B”, we can write the probability of \(\mathtt{A}\) this way, since if \(\mathtt{A}\) is true, then either “A and B” or “A and not-B” must be true:

\(\mathtt{P(A)=P(A,B)\,\,+\,\,P(A,\text{not-}B)}\)

In different situations, however, \(\mathtt{B}\) could take on many different values—for example, the six possible values of one die roll, \(\mathtt{B_1}\)–\(\mathtt{B_6}\)—even while we’re considering just one value for the mututally exclusive event \(\mathtt{A}\)—for example, rolling a 4. The Law of Total Probability tells us that

\(\mathtt{P(A)=P(A,B_1)+\cdots+P(A,B_n)}\).

If we pull a random card from a standard deck, the probability that the card is a Jack [\(\mathtt{P(J)}\)] will be equal to the probability that it’s a Jack and a spade [\(\mathtt{P(J,C_S)}\)], plus the probability that it’s a Jack and a heart [\(\mathtt{P(J,C_H)}\)], plus the probability that it’s a Jack and a club [\(\mathtt{P(J,C_C)}\)], plus the probability that it’s a Jack and a diamond [\(\mathtt{P(J,C_D)}\)].

Now with Conditional Probabilities

Where this gets good is when we throw conditional probabilities into the mix. We can make use of the fact that \(\mathtt{P(A,B)=P(A|B)P(B)}\), where \(\mathtt{P(A|B)}\) means “the probability of A given B.” For example, the probability of randomly pulling a Jack, given that you pulled spades, is \(\mathtt{\frac{1}{13}}\), and the probability of randomly pulling a spade is \(\mathtt{\frac{1}{4}}\). Thus, the probability of pulling the Jack of spades is \(\mathtt{\frac{1}{13}\cdot \frac{1}{4}=\frac{1}{52}}\). We can, therefore, rewrite the Law of Total Probability this way:

\(\mathtt{P(A)=P(A|B_1)P(B_1)+\cdots+P(A|B_n)P(B_n)}\)

And now we’re ready to determine the probability given in the opening paragraph, \(\mathtt{P(R2>R1)}\), the probability that a second die roll is greater than the first die roll: \[\mathtt{P(R2>R1)=P(R2>R1|R1=1)P(R1=1)+\cdots P(R2>R1|R1=6)P(R1=6)}\]

The final result is \(\mathtt{\frac{5}{6}\cdot \frac{1}{6}+\frac{4}{6}\cdot \frac{1}{6}+\frac{3}{6}\cdot \frac{1}{6}+\frac{2}{6}\cdot \frac{1}{6}+\frac{1}{6}\cdot \frac{1}{6}+\frac{0}{6}\cdot \frac{1}{6}=\frac{5}{12}}\).

Monty Hall and the Colliders

Reading through Judea Pearl’s Book of Why, along with Causal Inference in Statistics, co-authored by Pearl, I came across a nifty, new-to-me explanation of the famous Monty Hall Problem.

To make it clearer, we can start with a toy mathematical problem, \(\mathtt{a+b=c}\), which I model with the causal diagram below. A causal diagram is, of course, a little inappropriate for modeling a mathematical equation, but it’s also one that students may first use implicitly to think about operations and equations (and, without proper instruction, it’s one adults may use all their lives).

Here, we will think of the variables \(\mathtt{a}\) and \(\mathtt{b}\) as independent. Changing the number we substitute for \(\mathtt{a}\) does not affect our choice for \(\mathtt{b}\) and vice versa. However, \(\mathtt{a}\) and \(\mathtt{c}\) are dependent, and \(\mathtt{b}\) and \(\mathtt{c}\) are dependent as well. Increasing or decreasing \(\mathtt{a}\) or \(\mathtt{b}\) alone (or together) will have an effect on \(\mathtt{c}\).

But once we fix \(\mathtt{c}\), or “condition on \(\mathtt{c}\)” as Pearl would write, then \(\mathtt{a}\) and \(\mathtt{b}\) become dependent variables. That’s at once clear as a bell and pretty wacky. If we fix \(\mathtt{c}\) at 10, then changing \(\mathtt{a}\) will change \(\mathtt{b}\) and vice versa. But \(\mathtt{a}\) and \(\mathtt{b}\) were entirely independent prior to knowing what \(\mathtt{c}\) was. Afterwards, they’re dependent on each other. The diagram that represents this situation (above) is what Pearl calls a “collider.”

Pearl et. al also use a more everyday example:

Suppose a certain college gives scholarships to two types of students: those with unusual musical talents and those with extraordinary grade point averages. Ordinarily, musical talent and scholastic achievement are independent traits, so, in the population at large, finding a person with musical talent tells us nothing about that person’s grades. However, discovering that a person is on a scholarship changes things; knowing that the person lacks musical talent then tells us immediately that he is likely to have high grade point average. Thus, two variables that are marginally independent become dependent upon learning the value of a third variable (scholarship) that is a common effect of the first two.

For the Monty Hall problem, the collider model looks identical. And the correct model helps us see two things (forgiving some [I hope minor] mathematical sloppiness).

First, the model helps us see that Monty opening the door does not change the probability that you have made the correct initial choice (which is \(\mathtt{\frac{1}{3}}\)). It’s even accurate to say that the probability that the car is behind any of the three doors hasn’t changed from \(\mathtt{\frac{1}{3}}\). The arrows are pointing the wrong way for those things to be possibilities. In and of itself, the correct model should prevent us from upgrading the probability from \(\mathtt{\frac{1}{3}}\) to \(\mathtt{\frac{1}{2}}\) after the freebie goat door is opened.

Second, when Monty opens a door to reveal a goat (thus fixing the value of “door Monty opens”), now changing your choice of door changes the probability that the car is behind that door, since these two variables are now dependent.

Thus, since the probability of being correct must change when I change my door selection, it must change from \(\mathtt{\frac{1}{3}}\) to something else. And since the other door is the only other option, and all the probabilities in the situation must add to \(\mathtt{1}\), switching must change the probability to \(\mathtt{\frac{2}{3}}\).