The Law of Total Probability

Where has this been all my life? The Law of Total Probability is really cool, and it seems accessible enough to be presented in high school, where I think it would be very useful as well, although I’ve never seen it there. For example, from the book Causal Inference in Statistics we get this nice problem (in addition to the quote further below): “Suppose we roll two dice, and we want to know the probability that the second roll (R2) is higher than the first (R1).” The Law of Total Probability can make answering this much more straightforward.

To understand this ‘law,’ we should start by understanding two simple things about probability. First, for any two mutually exclusive events (the events can’t happen together), the probability of $$\mathtt{A}$$ or $$\mathtt{B}$$ is the sum of the probability of $$\mathtt{A}$$ and the probability of $$\mathtt{B}$$:

$$\mathtt{P(A\text{ or }B)\,\,\,\,=\,\,\,\,\,\,P(A)\,\,\,\,+\,\,\,\,P(B)}$$

Second, thinking now of two mutually exclusive events “A and B” and “A and not-B”, we can write the probability of $$\mathtt{A}$$ this way, since if $$\mathtt{A}$$ is true, then either “A and B” or “A and not-B” must be true:

$$\mathtt{P(A)=P(A,B)\,\,+\,\,P(A,\text{not-}B)}$$

In different situations, however, $$\mathtt{B}$$ could take on many different values—for example, the six possible values of one die roll, $$\mathtt{B_1}$$–$$\mathtt{B_6}$$—even while we’re considering just one value for the mutually exclusive event $$\mathtt{A}$$—for example, rolling a 4. The Law of Total Probability tells us that

$$\mathtt{P(A)=P(A,B_1)+\cdots+P(A,B_n)}$$.

If we pull a random card from a standard deck, the probability that the card is a Jack [$$\mathtt{P(J)}$$] will be equal to the probability that it’s a Jack and a spade [$$\mathtt{P(J,C_S)}$$], plus the probability that it’s a Jack and a heart [$$\mathtt{P(J,C_H)}$$], plus the probability that it’s a Jack and a club [$$\mathtt{P(J,C_C)}$$], plus the probability that it’s a Jack and a diamond [$$\mathtt{P(J,C_D)}$$].
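This four-way split can be checked by brute force over a deck. Here is a quick sketch (not from the original post):

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "clubs", "diamonds"]
deck = [(r, s) for r in ranks for s in suits]

# P(J) computed directly.
p_jack = Fraction(sum(1 for r, s in deck if r == "J"), len(deck))

# P(J) computed as a sum over the four mutually exclusive suit events:
# P(J, spade) + P(J, heart) + P(J, club) + P(J, diamond).
p_jack_total = sum(
    Fraction(sum(1 for r, s in deck if r == "J" and s == suit), len(deck))
    for suit in suits
)

print(p_jack, p_jack_total)  # both 1/13
```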

Now with Conditional Probabilities

Where this gets good is when we throw conditional probabilities into the mix. We can make use of the fact that $$\mathtt{P(A,B)=P(A|B)P(B)}$$, where $$\mathtt{P(A|B)}$$ means “the probability of A given B.” For example, the probability of randomly pulling a Jack, given that you pulled spades, is $$\mathtt{\frac{1}{13}}$$, and the probability of randomly pulling a spade is $$\mathtt{\frac{1}{4}}$$. Thus, the probability of pulling the Jack of spades is $$\mathtt{\frac{1}{13}\cdot \frac{1}{4}=\frac{1}{52}}$$. We can, therefore, rewrite the Law of Total Probability this way:

$$\mathtt{P(A)=P(A|B_1)P(B_1)+\cdots+P(A|B_n)P(B_n)}$$

And now we’re ready to determine the probability given in the opening paragraph, $$\mathtt{P(R2>R1)}$$, the probability that a second die roll is greater than the first die roll: $\mathtt{P(R2>R1)=P(R2>R1|R1=1)P(R1=1)+\cdots+P(R2>R1|R1=6)P(R1=6)}$

The final result is $$\mathtt{\frac{5}{6}\cdot \frac{1}{6}+\frac{4}{6}\cdot \frac{1}{6}+\frac{3}{6}\cdot \frac{1}{6}+\frac{2}{6}\cdot \frac{1}{6}+\frac{1}{6}\cdot \frac{1}{6}+\frac{0}{6}\cdot \frac{1}{6}=\frac{5}{12}}$$.
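A short enumeration of all 36 outcomes confirms the computation (a sketch, not part of the original post):

```python
from fractions import Fraction

# Enumerate all 36 equally likely (r1, r2) outcomes of two die rolls.
outcomes = [(r1, r2) for r1 in range(1, 7) for r2 in range(1, 7)]
p_direct = Fraction(sum(1 for r1, r2 in outcomes if r2 > r1), len(outcomes))

# Law of Total Probability: condition on the value of the first roll.
p_total = sum(
    Fraction(sum(1 for r2 in range(1, 7) if r2 > r1), 6) * Fraction(1, 6)
    for r1 in range(1, 7)
)

print(p_direct, p_total)  # both 5/12
```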

Monty Hall and the Colliders

Reading through Judea Pearl’s Book of Why, along with Causal Inference in Statistics, co-authored by Pearl, I came across a nifty, new-to-me explanation of the famous Monty Hall Problem.

To make it clearer, we can start with a toy mathematical problem, $$\mathtt{a+b=c}$$, which I model with the causal diagram below. A causal diagram is, of course, a little inappropriate for modeling a mathematical equation, but it’s also one that students may first use implicitly to think about operations and equations (and, without proper instruction, it’s one adults may use all their lives). Here, we will think of the variables $$\mathtt{a}$$ and $$\mathtt{b}$$ as independent. Changing the number we substitute for $$\mathtt{a}$$ does not affect our choice for $$\mathtt{b}$$ and vice versa. However, $$\mathtt{a}$$ and $$\mathtt{c}$$ are dependent, and $$\mathtt{b}$$ and $$\mathtt{c}$$ are dependent as well. Increasing or decreasing $$\mathtt{a}$$ or $$\mathtt{b}$$ alone (or together) will have an effect on $$\mathtt{c}$$.

But once we fix $$\mathtt{c}$$, or “condition on $$\mathtt{c}$$” as Pearl would write, then $$\mathtt{a}$$ and $$\mathtt{b}$$ become dependent variables. That’s at once clear as a bell and pretty wacky. If we fix $$\mathtt{c}$$ at 10, then changing $$\mathtt{a}$$ will change $$\mathtt{b}$$ and vice versa. But $$\mathtt{a}$$ and $$\mathtt{b}$$ were entirely independent prior to knowing what $$\mathtt{c}$$ was. Afterwards, they’re dependent on each other. The diagram that represents this situation (above) is what Pearl calls a “collider.”

Pearl et al. also use a more everyday example:

Suppose a certain college gives scholarships to two types of students: those with unusual musical talents and those with extraordinary grade point averages. Ordinarily, musical talent and scholastic achievement are independent traits, so, in the population at large, finding a person with musical talent tells us nothing about that person’s grades. However, discovering that a person is on a scholarship changes things; knowing that the person lacks musical talent then tells us immediately that he is likely to have high grade point average. Thus, two variables that are marginally independent become dependent upon learning the value of a third variable (scholarship) that is a common effect of the first two.

For the Monty Hall problem, the collider model looks identical. And the correct model helps us see two things (forgiving some [I hope minor] mathematical sloppiness).

First, the model helps us see that Monty opening the door does not change the probability that you have made the correct initial choice (which is $$\mathtt{\frac{1}{3}}$$). It’s even accurate to say that the probability that the car is behind any of the three doors hasn’t changed from $$\mathtt{\frac{1}{3}}$$. The arrows are pointing the wrong way for those things to be possibilities. In and of itself, the correct model should prevent us from upgrading the probability from $$\mathtt{\frac{1}{3}}$$ to $$\mathtt{\frac{1}{2}}$$ after the freebie goat door is opened.

Second, when Monty opens a door to reveal a goat (thus fixing the value of “door Monty opens”), now changing your choice of door changes the probability that the car is behind that door, since these two variables are now dependent.

Thus, since the probability of being correct must change when I change my door selection, it must change from $$\mathtt{\frac{1}{3}}$$ to something else. And since the other door is the only other option, and all the probabilities in the situation must add to $$\mathtt{1}$$, switching must change the probability to $$\mathtt{\frac{2}{3}}$$.
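A quick simulation backs this up (a sketch of my own, not from the original post):

```python
import random

def play(switch, rng):
    """Play one round of Monty Hall; return True if the final pick wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a goat door that is neither your pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
stay_wins = sum(play(False, rng) for _ in range(n)) / n
switch_wins = sum(play(True, rng) for _ in range(n)) / n
print(stay_wins, switch_wins)  # close to 1/3 and 2/3
```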

K-Means Clustering

K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t) with 2 components each (an x-component and a y-component, i.e., 2D data).

Determining Least Distances

To start, we select a $$\mathtt{k}$$, a number of clusters that we want, remembering that we know right now how many clusters there are but in most situations we would not. We choose $$\mathtt{k=2}$$. Then we place the two cluster points at random locations—here I’ve put $$\mathtt{\color{blue}{k_1}}$$ at $$\mathtt{(2,7)}$$ and $$\mathtt{\color{red}{k_2}}$$ at $$\mathtt{(4,2)}$$.

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean Theoremish Euclidean distance of $$\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}$$. The cluster that we assign to each point is given by the closest center to that point. You can run the code below to print out the 8 distances.
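The post’s snippet isn’t included here; below is a minimal stand-in that prints the 8 distances. The point coordinates are taken from the discussion that follows (A at (5, 6); B, C, D read off the component means):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Points (assumed from the post) and the two randomly placed centers.
points = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
centers = {"k1": (2, 7), "k2": (4, 2)}

distances = {}
for name, p in points.items():
    for cname, c in centers.items():
        distances[(name, cname)] = dist(p, c)
        print(f"{name} to {cname}: {dist(p, c):.2f}")
```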

Going by shorter distances, our first result, then, is to group point A in cluster $$\mathtt{\color{blue}{k_1}}$$, because the first number in that pair is smaller, and points B, C, and D in cluster $$\mathtt{\color{red}{k_2}}$$, because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points in the cluster to which it is currently assigned. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster $$\mathtt{\color{red}{k_2}}$$, the points B, C, and D have a mean of ($$\mathtt{\frac{2 + 2 + 6}{3}}$$, $$\mathtt{\frac{4 + 3 + 5}{3}}$$), or ($$\mathtt{\frac{10}{3}}$$, $$\mathtt{4}$$). And since $$\mathtt{\color{blue}{k_1}}$$ has just one point, A, it will move to smack dab on top of that point, at (5, 6). Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster $$\mathtt{\color{blue}{k_1}}$$ and B and C belong to cluster $$\mathtt{\color{red}{k_2}}$$.
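Collecting the assign-then-update steps into a loop gives a minimal k-means sketch (my own code, not the post’s; point coordinates as above):

```python
from math import dist

def kmeans(points, centers, iters=10):
    # Repeat: assign each point to its nearest center, then move each
    # center to the componentwise mean of its assigned points.
    clusters = {}
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in clusters.items()
        ]
    return centers, clusters

points = [(5, 6), (2, 4), (2, 3), (6, 5)]  # A, B, C, D
centers, clusters = kmeans(points, [(2, 7), (4, 2)])
print(centers)  # k1 lands between A and D; k2 between B and C
```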

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in the cluster). Further calculations won’t change these assignments, so the k-means algorithm is done when it stops changing things drastically (or at all).

Linear Algebra Exercises I

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working on a piece of paper and then checking your answers—one at a time at first, and then going a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer. My answers appear below each exercise.

Convert each equation of a line to a vector equation in parametric form, $$\mathtt{l(k)=p+kv}$$, where $$\mathtt{p}$$ and $$\mathtt{v}$$ are 2D vectors and $$\mathtt{k}$$ is a scalar variable.

1. $$\mathtt{y=x+3}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}$$

2. $$\mathtt{y=2x+3}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}\mathtt{k}$$

3. $$\mathtt{y=-2x}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}$$,    or just $$\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}$$

4. $$\mathtt{y=x}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}$$,    or just $$\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}$$

5. $$\mathtt{y=10}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{10}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix}\mathtt{k}$$

6. $$\mathtt{x+y=1}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-1}\end{bmatrix}\mathtt{k}$$

7. $$\mathtt{2x+3y=9}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-2}\end{bmatrix}\mathtt{k}$$

8. $$\mathtt{x=5}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{5}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}\mathtt{k}$$

9. $$\mathtt{-3x-\frac{1}{4}y=15}$$

$$\mathtt{l(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-60}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-12}\end{bmatrix}\mathtt{k}$$

10. Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

For slope-intercept form, $$\mathtt{y=mx+b}$$, the conversion $$\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,$$ works for any line that can be written in slope-intercept form. (Vertical lines, which can’t be written that way, need a vertical direction vector instead.)

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of $$\mathtt{0}$$, and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.
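The algorithm from the last exercise is short enough to code directly (a sketch; the function name is mine, not from the post):

```python
def slope_intercept_to_parametric(m, b):
    """Convert y = mx + b to l(k) = p + k*v, with p = (0, b) and v = (1, m)."""
    return (0, b), (1, m)

p, v = slope_intercept_to_parametric(2, 3)  # y = 2x + 3
print(p, v)  # (0, 3) (1, 2)
```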

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

1. $$\mathtt{l_{1}(k)=}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k}$$

slope → $$\mathtt{-\frac{1}{2}}$$, y-intercept → $$\mathtt{\frac{5}{2}}$$, x-intercept → $$\mathtt{5}$$

2. $$\mathtt{l_{2}(k)=}\begin{bmatrix}\mathtt{-3}\\\mathtt{-5}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix}\mathtt{k}$$

slope → undefined, y-intercept → none, x-intercept → $$\mathtt{-3}$$

3. $$\mathtt{l_{3}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-1}\end{bmatrix}+\begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix}\mathtt{k}$$

slope → $$\mathtt{1}$$, y-intercept → $$\mathtt{-1}$$, x-intercept → $$\mathtt{1}$$

4. $$\mathtt{l_{4}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-9}\end{bmatrix}+\begin{bmatrix}\mathtt{7}\\\mathtt{0}\end{bmatrix}\mathtt{k}$$

slope → $$\mathtt{0}$$, y-intercept → $$\mathtt{-9}$$, x-intercept → none

5. $$\mathtt{l_{5}(k)=}\begin{bmatrix}\mathtt{-5}\\\mathtt{\,\,\,\,1}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}\mathtt{k}$$

slope → $$\mathtt{-\frac{4}{3}}$$, y-intercept → $$\mathtt{-\frac{17}{3}}$$, x-intercept → $$\mathtt{-\frac{17}{4}}$$

And just two more. Determine whether each point is on the line mentioned above.

1. Is $$\mathtt{(-11,3)}$$ on the line $$\mathtt{l_1}$$?

No: $$\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}$$.
No value of $$\mathtt{k}$$ solves both $$\mathtt{2k+1=-11}$$ and $$\mathtt{-k+2=3}$$.

2. Is $$\mathtt{(5,5)}$$ on the line $$\mathtt{l_3}$$?

No. Since the line has a slope of $$\mathtt{1}$$, it would contain $$\mathtt{(5,5)}$$ if the line passed through the origin, but it does not.
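Both checks can be automated by solving $$\mathtt{p+kv=point}$$ componentwise and seeing whether one value of $$\mathtt{k}$$ works for every component (a sketch; the helper name is mine, and exact comparison is fine for the integer data here):

```python
def on_line(p, v, point):
    """Return True if point lies on the line l(k) = p + k*v."""
    ks = []
    for pi, vi, ti in zip(p, v, point):
        if vi == 0:
            if ti != pi:       # this component can never match
                return False
        else:
            ks.append((ti - pi) / vi)
    # The same k must solve every component equation.
    return all(k == ks[0] for k in ks)

print(on_line((1, 2), (2, -1), (-11, 3)))  # False: k = -6 vs. k = -1
print(on_line((0, -1), (2, 2), (5, 5)))    # False
print(on_line((1, 2), (2, -1), (3, 1)))    # True, with k = 1
```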

More with Least Squares

One very cool thing about our formula for the least squares regression line, $$\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}$$, is that it is the same no matter whether we have one independent variable (univariate) or many independent variables (multivariate).

Consider these data, showing the selling prices of some grandfather clocks at auction. The first scatter plot shows the age of the clock in years on the x-axis (100–200), and the second shows the number of bidders on the x-axis (0–20). Price (in pounds or dollars) is on the y-axis on each plot (500–2500).

| Age (years) | Bidders | Price (\$) |
|---|---|---|
| 127 | 13 | 1235 |
| 115 | 12 | 1080 |
| 127 | 7 | 845 |
| 150 | 9 | 1522 |
| 156 | 6 | 1047 |
| 182 | 11 | 1979 |
| 156 | 12 | 1822 |
| 132 | 10 | 1253 |
| 137 | 9 | 1297 |
| 113 | 9 | 946 |
| 137 | 15 | 1713 |
| 117 | 11 | 1024 |
| 137 | 8 | 1147 |
| 153 | 6 | 1092 |
| 117 | 13 | 1152 |
| 126 | 10 | 1336 |
| 170 | 14 | 2131 |
| 182 | 8 | 1550 |
| 162 | 11 | 1884 |
| 184 | 10 | 2041 |
| 143 | 6 | 854 |
| 159 | 9 | 1483 |
| 108 | 14 | 1055 |
| 175 | 8 | 1545 |
| 108 | 6 | 729 |
| 179 | 9 | 1792 |
| 111 | 15 | 1175 |
| 187 | 8 | 1593 |
| 111 | 7 | 785 |
| 115 | 7 | 744 |
| 194 | 5 | 1356 |
| 168 | 7 | 1262 |

The first regression line, for the price of a clock as a function of its age, is approximately $$\mathtt{10.5x-192}$$. The second regression line, for the price of a clock as a function of the number of bidders at auction, is approximately $$\mathtt{55x+806}$$. As mentioned above, each of these univariate least squares regression lines can be calculated with the formula $$\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}$$.

Combining both age and number of bidders together, we can calculate, using the same formula, a multivariate least squares regression equation. This of course is no longer a line. In the case of two input variables as we have here, our line becomes a plane.
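Here is a sketch of all three calculations with NumPy, using the data from the table above (the `fit` helper is mine, just for illustration):

```python
import numpy as np

# Grandfather clock auction data from the table above: age, bidders, price.
data = np.array([
    [127, 13, 1235], [115, 12, 1080], [127,  7,  845], [150,  9, 1522],
    [156,  6, 1047], [182, 11, 1979], [156, 12, 1822], [132, 10, 1253],
    [137,  9, 1297], [113,  9,  946], [137, 15, 1713], [117, 11, 1024],
    [137,  8, 1147], [153,  6, 1092], [117, 13, 1152], [126, 10, 1336],
    [170, 14, 2131], [182,  8, 1550], [162, 11, 1884], [184, 10, 2041],
    [143,  6,  854], [159,  9, 1483], [108, 14, 1055], [175,  8, 1545],
    [108,  6,  729], [179,  9, 1792], [111, 15, 1175], [187,  8, 1593],
    [111,  7,  785], [115,  7,  744], [194,  5, 1356], [168,  7, 1262],
], dtype=float)
y = data[:, 2]

def fit(*columns):
    # Least squares via (X^T X)^{-1} X^T y, with X padded by a column of 1s.
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.inv(X.T @ X) @ X.T @ y

b1, m_age = fit(data[:, 0])                     # price vs. age
b2, m_bid = fit(data[:, 1])                     # price vs. bidders
b3, c_age, c_bid = fit(data[:, 0], data[:, 1])  # price vs. both
print(f"price ~ {m_age:.1f}(age) {b1:+.0f}")
print(f"price ~ {m_bid:.1f}(bidders) {b2:+.0f}")
print(f"price ~ {c_age:.2f}(age) + {c_bid:.2f}(bidders) {b3:+.2f}")
```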

Our final regression equation becomes $$\mathtt{12.74x_{1}+85.82x_{2}-1336.72}$$, with $$\mathtt{x_1}$$ representing the age of a clock and $$\mathtt{x_2}$$ representing the number of bidders.

Linear Algebra Connections

There are so many connections within and applications of linear algebra—I can only imagine that this will be a series of posts, to the extent that I continue writing about the subject. Here are a few connections that I’ve come across in my reading recently.

Compound Interest and Matrix Powers

We can multiply a matrix by itself $$\mathtt{n}$$ times. The result is the matrix to the power $$\mathtt{n}$$. We can use this when setting up a compound interest situation. For example, suppose we have three accounts, which each have a different interest rate compounded annually—say, 5%, 3%, and 2%. Without linear algebra, the amount in the first account can be modeled by the equation $\mathtt{A(t)=p \cdot 1.05^t}$ where $$\mathtt{p}$$ represents the starting amount in the account, and $$\mathtt{t}$$ represents the time in years. With linear algebra, we can group all of the account interest rates into a matrix. The first year for each account would look like this: $\mathtt{A(1)}=\begin{bmatrix}\mathtt{1.05}&\mathtt{0}&\mathtt{0}\\\mathtt{0}&\mathtt{1.03}&\mathtt{0}\\\mathtt{0}&\mathtt{0}&\mathtt{1.02}\end{bmatrix}^\mathtt{1}\begin{bmatrix}\mathtt{p_1}\\\mathtt{p_2}\\\mathtt{p_3}\end{bmatrix}=\begin{bmatrix}\mathtt{1.05p_1}\\\mathtt{1.03p_2}\\\mathtt{1.02p_3}\end{bmatrix}$

For years beyond the first year, all we have to do is raise the matrix to the appropriate power. Since it’s diagonal, squaring it, cubing it, etc., will square, cube, etc., each entry. This makes the computation a little more organized—and more straightforward for programming. (A matrix has to be square, $$\mathtt{m \times m}$$, in order to raise it to a power.) Below we calculate the amount in each account after 100 years.
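A sketch of that calculation with NumPy (the starting amounts of 1000 in each account are an assumption for illustration):

```python
import numpy as np

# Annual growth factors for the three accounts on the diagonal.
rates = np.diag([1.05, 1.03, 1.02])
p = np.array([1000.0, 1000.0, 1000.0])  # assumed starting amounts

# A(100) = R^100 p; powering a diagonal matrix just powers each entry.
amounts = np.linalg.matrix_power(rates, 100) @ p
print(amounts.round(2))
```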

Centroids and Areas

We have seen that the determinant can be thought of as the area of the parallelogram formed by two vectors. We can use this fact to determine the area of a complex shape like the one shown below. Since determinants are signed areas, as we move around the shape counterclockwise, calculating the determinant of each vector pair (and multiplying each determinant by one half, so that we get the area of each triangle), we get the total area of the shape.$\frac{1}{2}\left(\begin{vmatrix}\mathtt{6}&\mathtt{6}\\\mathtt{0}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{6}&\mathtt{3}\\\mathtt{4}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{3}\\\mathtt{4}&\mathtt{6}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{-2}\\\mathtt{6}&\mathtt{\,\,\,\,6}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{-2}\\\mathtt{\,\,\,\,6}&\mathtt{\,\,\,\,3}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{0}\\\mathtt{\,\,\,\,3}&\mathtt{3}\end{vmatrix}\right)=\mathtt{36}\text{ units}\mathtt{^2}$

That’s pretty hand-wavy, but it’s something that you can probably figure out with a little experimentation.
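A short function makes the recipe concrete: this is the shoelace formula. The vertex list below is read off the determinant pairs above, with one vertex placed at the origin (an assumption on my part):

```python
def polygon_area(vertices):
    # Half the sum of determinants of consecutive position-vector pairs,
    # traversed counterclockwise (the shoelace formula).
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2

# Vertices inferred from the determinants above, starting at the origin.
shape = [(0, 0), (6, 0), (6, 4), (3, 4), (3, 6), (-2, 6), (-2, 3), (0, 3)]
print(polygon_area(shape))  # 36.0
```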

Another counterclockwise-moving calculation (though this one can be clockwise without changing the answer) is the calculation of the centroid of a closed shape. All that is required here is to calculate the sum of the position vectors of the vertices of the figure and then divide by the number of vertices (here, five vertices, one of which is at the origin).   $\mathtt{\frac{1}{5}}\left(\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{3}\\\mathtt{4}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{6}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}+\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,3}\end{bmatrix}\right)=\begin{bmatrix}\mathtt{-\frac{1}{5}}\\\mathtt{\,\,\,\,\frac{17}{5}}\end{bmatrix}$
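As a sketch in Python (assuming, as in the area example, that one vertex of the five-vertex figure sits at the origin):

```python
def vertex_centroid(vertices):
    """Mean of the vertex position vectors, computed componentwise."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)

# The five vertices of the figure, taking the fifth to be the origin.
shape = [(0, 0), (3, 4), (0, 6), (-3, 4), (-1, 3)]
print(vertex_centroid(shape))  # (-0.2, 3.4)
```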

Here again, it’s just magic, but you can figure it out with a little play. In each case—for both complex areas and centroids—we assign a point to be the origin and go from there.

Least Squares with Linear Algebra

In brief, linear regression is about finding the line of best fit to a data set. If you’re looking for the linear algebra way of doing this, you will most likely find it searching for the term least squares.

In the basic scenario, you’ve got some two-dimensional data, $$\mathtt{(x, y)}$$ coordinates, and you want to find the equation for a straight line that is as close as possible to each point. One such data set is shown below.

| x | y |
|---|---|
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |

So, let’s imagine that we haven’t found this line yet. We know what we are looking for is a line of the form $$\mathtt{y=mx+b}$$. In linear algebra terms, we want the vector $$\mathtt{y}$$ (11 rows, 1 column) to equal the vector $$\mathtt{x}$$ (11 rows, 1 column) scaled by an unknown slope $$\mathtt{m}$$, plus an intercept vector $$\mathtt{b}$$ (11 rows, 1 column, with every entry equal to the unknown intercept).

The first problem with this that we have to fix is that $$\mathtt{x}$$ the 11 × 1 vector needs to be $$\mathtt{X}$$ the 11 × 2 matrix so that we get our equations right. So we’ll pad $$\mathtt{x}$$ with some 1s and then we’ll be able to call it $$\mathtt{X}$$. (We can pad on either side, left or right, just remembering to interchange $$\mathtt{b}$$ and $$\mathtt{m}$$ to keep the equations straight.) The second problem we can fix is that we don’t need a separate intercept vector and a separate slope vector. We can combine things so that we form an equivalent matrix equation that means the same thing as $$\mathtt{y=mx+b}$$. We need the equation to look like this: $\begin{bmatrix}\mathtt{8.04}\\\mathtt{6.95}\\\mathtt{7.58}\\\mathtt{8.81}\\\mathtt{8.33}\\\mathtt{9.96}\\\mathtt{7.24}\\\mathtt{4.26}\\\mathtt{10.84}\\\mathtt{4.82}\\\mathtt{5.68}\end{bmatrix} = \begin{bmatrix}\mathtt{1}&\mathtt{10}\\\mathtt{1}&\mathtt{8}\\\mathtt{1}&\mathtt{13}\\\mathtt{1}&\mathtt{9}\\\mathtt{1}&\mathtt{11}\\\mathtt{1}&\mathtt{14}\\\mathtt{1}&\mathtt{6}\\\mathtt{1}&\mathtt{4}\\\mathtt{1}&\mathtt{12}\\\mathtt{1}&\mathtt{7}\\\mathtt{1}&\mathtt{5}\end{bmatrix}\begin{bmatrix}\mathtt{b}\\\mathtt{m}\end{bmatrix}$

Multiply the matrix and vector on the right side of that equation, and you get, for the first equation, $$\mathtt{8.04=b+(m)(10)}$$. That’s equivalent to $$\mathtt{y=mx+b}$$ for the first $$\mathtt{(x, y)}$$ data point $$\mathtt{(10, 8.04)}$$. The matrix and vector setup ensures that all the equations for all the points are of the correct form.

In linear algebra terms, we have rewritten the equation to be $$\mathtt{y=Xv}$$, where $$\mathtt{y}$$ is an 11 × 1 vector, $$\mathtt{X}$$ is an 11 × 2 matrix (padded with some 1s), and $$\mathtt{v}$$ is a 2 × 1 vector which contains the unknown slope $$\mathtt{m}$$ of the best fit line and the unknown intercept $$\mathtt{b}$$.

It’s an Approximation

At this point in the explanation, it’s important to realize that we will make another shift. The first was from the real data to the matrix algebra setup. We shift again below, away from that setup, per se, and toward just finding out what that unknown $$\mathtt{v}$$ is.

As an analogy, the system shown below features a 3D vector (2, 0, –2) which does not live in the plane formed by the two column vectors of the matrix. Thus, there is no solution $$\mathtt{(j, k)}$$. $\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,1}\\\mathtt{1}&\mathtt{-3}\\\mathtt{1}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{j}\\\mathtt{k}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}\longleftarrow\text{no solutions}$

Similarly, the columns of our $$\mathtt{X}$$ matrix form a plane in 11-dimensional space. In order for $$\mathtt{v}$$ to be a solution to our original matrix equation above, $$\mathtt{y}$$ has to live in this plane too. But we already know that it doesn’t. If it did, the points would lie along some line. It’s true that $$\mathtt{y}$$ is an 11-dimensional vector, just like each column of $$\mathtt{X}$$, but $$\mathtt{y}$$ doesn’t live in the same plane. The closest approximation we can get to $$\mathtt{y}$$ in the plane of $$\mathtt{X}$$ is $$\mathtt{Xv=p}$$, where $$\mathtt{p}$$ is the projection of $$\mathtt{y}$$ onto the plane (another thing we’ll have to come back to). The vector $$\mathtt{q}$$ connecting $$\mathtt{y}$$ with its projection $$\mathtt{p}$$ is perpendicular to $$\mathtt{p}$$ (and, thus, to the plane).

Now we can write some more equations. For example, we know we want $$\mathtt{Xv=p}$$, but it’s also true that $$\mathtt{p+q=y}$$. And we can do some manipulation to show that all the column vectors of $$\mathtt{X}$$ are perpendicular to $$\mathtt{q}$$. Assuming for a second that our $$\mathtt{X}$$ is a 2 × 2 matrix, we pad $$\mathtt{X}$$ with a row of zeros to make it three-dimensional, transpose it (so that multiplication is defined), and multiply it by $$\mathtt{q}$$. The same basic idea applies to our original $$\mathtt{X}$$ (11 × 2) matrix. $\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\\\mathtt{0}&\mathtt{0}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}$ $\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{q_3}\end{bmatrix}=\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}$ This gives us a key equation: $$\mathtt{X^{T}q=0}$$.

Since $$\mathtt{q=y-p}$$, substituting for $$\mathtt{q}$$ gets us $$\mathtt{X^{T}(y-p)=0}$$. Then since $$\mathtt{Xv=p}$$, substituting for $$\mathtt{p}$$ brings us to $$\mathtt{X^{T}(y-Xv)=0}$$. Distribute to get $$\mathtt{X^{T}y-X^{T}Xv=0}$$, which means $$\mathtt{X^{T}y=X^{T}Xv}$$. Multiply both sides on the left by the inverse of $$\mathtt{X^{T}X}$$, and the vector $$\mathtt{v}$$ that we’re after, then, is $\mathtt{v=(X^{T}X)^{-1}X^{T}y}$

That formula gives us the slope and intercept of our best fit line. Below is one way the least squares can be calculated with a little Python. The calculation I used for the above interactive to get the best fit line is way more complicated. If I had known about the linear algebra way when I made it, I would have definitely gone with that instead.
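The Python snippet isn’t embedded here; a reconstruction, using the data from the table above:

```python
import numpy as np

# The (x, y) data from the table above.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

X = np.column_stack([np.ones(len(x)), x])  # pad x with 1s on the left
v = np.linalg.inv(X.T @ X) @ X.T @ y       # v = (X^T X)^{-1} X^T y
b, m = v
print(f"y = {m:.2f}x + {b:.2f}")  # y = 0.50x + 3.00
```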

The Singular Value Decomposition

We’ve now seen the eigenvalue decomposition of a linear transformation (in the form of a matrix). We can think of what we did in that decomposition as breaking up the original transformation into three transformations. If we multiply the rightmost matrix by any vector, and then multiply the middle matrix by that product, and then multiply the leftmost matrix on the right-hand side by that product, we would see that starting vector be transformed three times. That process would be equivalent to multiplying the starting vector by the original matrix. We can also say that, in that original transformation matrix, which we’ll call $$\mathtt{A}$$, we mapped a set of orthogonal vectors, or vectors at right angles to each other, (1, 0) and (0, 1), onto a set of non-orthogonal vectors (0, –2) and (1, –3). We don’t have to multiply each vector by the transformation matrix one at a time. We can multiply the set of vectors, as a matrix, by the transformation matrix, like so.

$$\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{0}\\\mathtt{0}&\mathtt{1}\end{bmatrix}=\begin{bmatrix}\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}&\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}$$

Of course, we just multiplied the original matrix by the identity matrix, so it spit out the original matrix again. But the above interpretation is different, though it gives the same results.

Okay, great, but we humans seem to love our right angles. So, this question arises about linear transformations: could we use any matrix to map some pair of orthogonal vectors (vectors at right angles to each other) to a different set of orthogonal vectors? That is, could we transform our way from one orthogonal “system” to another with any single transformation matrix?

Could we find a pair of orthogonal vectors which, after undergoing our transformation $$\mathtt{A}$$, were mapped to a different pair of orthogonal vectors? If we could, that would mean that if we multiplied $$\mathtt{A}$$ by a set of orthogonal vectors $$\mathtt{V}$$ (i.e., transformed the set of vectors $$\mathtt{V}$$ by the action of $$\mathtt{A}$$), it would be equivalent to just starting with a different set of orthogonal vectors already in position (we’ll call these orthogonal vectors $$\mathtt{U}$$) and just scaling them by a scaling matrix $$\mathtt{\Sigma}$$. In notation, we’ll write this hypothesis as $\mathtt{AV=U \Sigma}$

The word hypothesis is important here. We’re not really writing down something we know is equivalent—that is, we don’t really know if there is a $$\mathtt{V}$$, $$\mathtt{U}$$, and $$\mathtt{\Sigma}$$ which will make this equivalence true. Students (and I) have a hard time not seeing the equals sign as meaning “I know that it is true that these are equivalent.” But that’s not what it means here, and it’s good to get used to that flexibility. What it means here is that we are supposing for the time being that these two products are equivalent. If some contradiction falls out of our algebra (or we get some kind of infinity), we’ll know that the equivalence fails—at least insofar as we want just one matrix to pop out for each unknown matrix.

Let’s first see how this equivalence appears before we dig into figuring out what $$\mathtt{V}$$, $$\mathtt{U}$$, and $$\mathtt{\Sigma}$$ are. That is, let me show you that the matrices $$\mathtt{V}$$, $$\mathtt{U}$$, and $$\mathtt{\Sigma}$$ are possible before we look at what they are. On the left is our original transformation matrix acting on a set of orthogonal vectors $$\mathtt{V}$$ (purple and red), so this transformation shows $$\mathtt{AV}$$. (You can see that the original (1, 0) and (0, 1) vectors go to where they’re supposed to.)

$\mathtt{AV=U \Sigma}$

On the right is a set of two different orthogonal vectors $$\mathtt{\Sigma}$$ (purple and red) which start out aligned to the horizontal and vertical grid lines and then are rotated and reflected (which is a part of scaling) by a matrix $$\mathtt{U}$$. And the two transformations are equivalent! (In the demo they are very close.) We can squeeze an orthogonal transformation out of that weird matrix we saw in the eigenvalue decomposition, which took a square and rotated and stretched it into a funky parallelogram.

I note that, above, I called $$\mathtt{\Sigma}$$ a scaling matrix, but here I’m using it as just a set of orthogonal vectors. Luckily for us, both things are true. It just depends on how you look at them. A square matrix like the ones we are using can represent either a pair of 2-dimensional vectors or a linear transformation. We get to decide how to interpret the matrix in any given situation.
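Before digging into the algebra, we can check the hypothesis numerically. Here is a small sketch in Python with NumPy (my own addition, not part of the original post), using the matrix $$\mathtt{A}$$ from the worked example below:

```python
import numpy as np

# The transformation matrix used in this post's worked example.
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])

# NumPy's SVD hands back U, the singular values s, and V-transpose,
# such that A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A)
V = Vt.T
Sigma = np.diag(s)

# The hypothesis AV = U.Sigma holds: A acting on the orthogonal
# columns of V equals the orthogonal columns of U scaled by Sigma.
print(np.allclose(A @ V, U @ Sigma))   # True
```

So a suitable $$\mathtt{V}$$, $$\mathtt{U}$$, and $$\mathtt{\Sigma}$$ do exist for this matrix; the rest of the post is about finding them by hand.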

Getting $$\mathtt{V}$$

To begin to know what these matrices are, we can start by writing the equation above like this. $\mathtt{A=U \Sigma V^T}$ We can do this because $$\mathtt{V}$$ is orthogonal, and I briefly mentioned last time that the transpose of an orthogonal matrix is the same as its inverse. So, in effect, we multiplied both sides of the equation on the right by $$\mathtt{V^{-1}}$$ (that is, by $$\mathtt{V^T}$$), which removed the $$\mathtt{V}$$ from the left-hand side.

Okay, now we pull a little transpose magic, but let’s walk through it. Start by multiplying the expression on each side, on the left, by its transpose. (We’ll circle back to grounding all this some other time.) So, we have $\mathtt{A^TA=(U\Sigma V^T)^T(U\Sigma V^T)}$ We multiplied $$\mathtt{A}$$ by its transpose by multiplying to the left of $$\mathtt{A}$$, and we multiplied $$\mathtt{U\Sigma V^T}$$ by its transpose by multiplying to the left of $$\mathtt{U\Sigma V^T}$$. The product $$\mathtt{A^TA}$$ is simple: $\begin{bmatrix}\mathtt{0}&\mathtt{-2}\\\mathtt{1}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}$ But let’s take some time to simplify $$\mathtt{(U\Sigma V^T)^T(U\Sigma V^T)}$$. As an example of what to do when we have the transpose of a product of matrices, consider these products.

$\left(\begin{bmatrix}\mathtt{1}&\mathtt{3}\\\mathtt{2}&\mathtt{4}\end{bmatrix}\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{5}&\mathtt{7}\end{bmatrix}\right)^\mathtt{T} \longrightarrow$
$\begin{bmatrix}\mathtt{19}&\mathtt{27}\\\mathtt{28}&\mathtt{40}\end{bmatrix}^\mathtt{T}$
$\left(\begin{bmatrix}\mathtt{4}&\mathtt{5}\\\mathtt{6}&\mathtt{7}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{2}\\\mathtt{3}&\mathtt{4}\end{bmatrix}\right)^\mathtt{T} \longrightarrow$
$\begin{bmatrix}\mathtt{19}&\mathtt{28}\\\mathtt{27}&\mathtt{40}\end{bmatrix}\,\,\,$

Each matrix product on the left is equal to the expression on its right. Test them out for yourself. But the expressions on the right are equal to each other, which means the products on the left are equal to each other. This example shows that $$\mathtt{(AB)^{T}=B^{T}A^{T}}$$. The transpose of a product is equal to the product of the transposes, multiplied in reverse order. Again, we’ll ground this later, but this suggests, correctly, that we can rewrite the transposed product above, $$\mathtt{(U\Sigma V^T)^T}$$, as $$\mathtt{V\Sigma^{T}U^{T}}$$, considering that $$\mathtt{(V^{T})^{T}=V}$$. Multiplying that by the remaining part of the right-hand side, $$\mathtt{U\Sigma V^T}$$, we get $$\mathtt{V\Sigma^{T}U^{T}U\Sigma V^T}$$. Since $$\mathtt{U}$$ is orthogonal, its transpose is its inverse, so the $$\mathtt{U}$$ terms cancel, leaving us with $$\mathtt{V\Sigma^{T}\Sigma V^T}$$. The transpose of a scaling matrix, $$\mathtt{\Sigma}$$ in this case, is itself (try it out!), so the middle can be written more simply as $$\mathtt{\Sigma^{2}}$$. So, finally (pretty far from finally, actually), we have $\mathtt{A^{T}A=V\Sigma^{2}V^T}$
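The reverse-order rule is easy to sanity-check numerically. A quick sketch in Python with NumPy (my addition), using the two example matrices above:

```python
import numpy as np

# The two matrices from the transpose example above.
P = np.array([[1, 3], [2, 4]])
Q = np.array([[4, 6], [5, 7]])

# The transpose of a product equals the product of the transposes
# in reverse order: (PQ)^T = Q^T P^T.
print(np.array_equal((P @ Q).T, Q.T @ P.T))   # True
```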

And this is an eigenvalue decomposition for $$\mathtt{A^{T}A}$$! We’ve got a scaling matrix as the lunchmeat, sandwiched by a matrix and its inverse (which is the same as the transpose in this case). So, now we can figure out $$\mathtt{V}$$ and $$\mathtt{\Sigma}$$ by doing the eigenvalue decomposition like we did previously. Here it is: $\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}=\begin{bmatrix}\mathtt{\color{purple}{-\frac{521}{991}}}&\mathtt{\color{red}{-\frac{3725}{4379}}}\\\mathtt{\color{purple}{-\frac{3725}{4379}}}&\mathtt{\,\,\,\,\color{red}{\frac{521}{991}}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{4510}{329}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{658}{2255}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}$

The purple and red vectors are our starting vectors from the animation on the left above: (–0.53, –0.85) and (–0.85, 0.53). When we ask which orthogonal starting vectors we can pick such that matrix $$\mathtt{A}$$ will transform them into a different pair of orthogonal vectors, matrix $$\mathtt{V}$$—the purple and red vectors above—is our answer. The square roots of the diagonal values of $$\mathtt{\Sigma^2}$$ above (remember, this matrix was squared) are called the singular values of $$\mathtt{A}$$. To me, getting $$\mathtt{V}$$ is the coolest part of this decomposition. The rest, below, is just gravy.
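The same eigenvalue decomposition can be done in a couple of lines with NumPy (my own sketch, not the post’s). For a symmetric matrix like $$\mathtt{A^{T}A}$$, `eigh` returns real eigenvalues and orthonormal eigenvectors:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
AtA = A.T @ A                  # the [[4, 6], [6, 10]] matrix above

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# with the orthonormal eigenvectors as the columns of V.
eigvals, V = np.linalg.eigh(AtA)

# The singular values of A are the square roots of these eigenvalues.
singular_values = np.sqrt(eigvals)[::-1]   # largest first
print(singular_values.round(2))            # approximately [3.7, 0.54]
```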

And Now for $$\mathtt{U}$$ and $$\mathtt{\Sigma}$$

In this case, we multiply both sides of the equation $$\mathtt{A=U\Sigma V^T}$$ on the right by the transpose of $$\mathtt{A}$$ and get an eigenvalue decomposition of $$\mathtt{AA^T}$$. $\,\,\mathtt{AA^T=(U\Sigma V^T)(U\Sigma V^T)^T=U\Sigma V^{T}V\Sigma^{T}U^T=U\Sigma^{2}U^T}$

$\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{-3}\\\mathtt{-3}&\mathtt{\,\,\,\,13}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\color{purple}{\frac{4510}{329}}}&\mathtt{\color{red}{0}}\\\mathtt{\color{purple}{0}}&\mathtt{\color{red}{\frac{658}{2255}}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}$

Now the square roots of the purple and red diagonal values give us our starting vectors from the animation on the right above: (3.7, 0) and (0, 0.54). You’ll notice, of course, that $$\mathtt{\Sigma^2}$$ is the same as above, so we really found it earlier. This step is only to pin down what $$\mathtt{U}$$ is. At any rate, we are done, and we can write down the full singular value decomposition (SVD) of the original matrix $$\mathtt{A}$$. $\quad\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{1655}{447}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{894}{1655}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}$
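As a final numerical check (again, a NumPy sketch of my own), multiplying the three factors back together should recover $$\mathtt{A}$$:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
U, s, Vt = np.linalg.svd(A)

# The three SVD factors multiply back to A (up to floating-point
# error): A = U Sigma V^T.
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True
```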

We’re finding an orthogonal-to-orthogonal transformation in a hurricane, so it shouldn’t be surprising to get weird numbers. One thing the SVD makes clear (the eigenvalue decomposition does this too) is that linear transformations can be described as combinations of rotations and scalings (the latter of which include reflections), and that’s it.

Inverses and Transposes

This will now be my 22nd post on linear algebra, and I hope it’ll be noticeable, looking at all of them together so far, that we haven’t talked about systems of equations. And there’s a good reason for that: they suck you down an ugly hole of mindless calculation, meaning-challenged tedium, and fruitless, pointless backward thinking. Systems are awesome and important, and I’m sure someone can make a strong argument for introducing them early, but, for me, they can wait.

The inverses of matrices are pretty interesting. The inverse of a 2 × 2 matrix is the matrix that, when multiplied by the original matrix, gives the identity matrix as the product. The inverse is given by the middle matrix below: $\begin{bmatrix}\mathtt{a} & \mathtt{b}\\\mathtt{c} & \mathtt{d}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{d}{ad-bc}} & \mathtt{-\frac{b}{ad-bc}}\\\mathtt{-\frac{c}{ad-bc}} & \mathtt{\,\,\,\,\frac{a}{ad-bc}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}$

There is an easy-to-follow derivation of this formula here if you’re interested—one that requires only some high school algebra.
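Here is the formula as a short Python function (my own illustration; the helper name `inverse_2x2` is made up for this sketch):

```python
import numpy as np

def inverse_2x2(m):
    """Invert a 2x2 matrix using the ad - bc formula above."""
    (a, b), (c, d) = m
    det = a * d - b * c        # must be nonzero for the inverse to exist
    return np.array([[d, -b], [-c, a]]) / det

M = np.array([[1.0, 3.0],
              [-2.0, 4.0]])

# Multiplying a matrix by its inverse gives the identity matrix.
print(np.allclose(M @ inverse_2x2(M), np.eye(2)))   # True
```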

You can notice, given this setup, that the identity matrix is its own inverse. Are there any others that fit this description? One thought: we need $$\mathtt{ad-bc}$$ to be equal to 1, and we also want $$\mathtt{a=d}$$. The values on the other diagonal, $$\mathtt{b}$$ and $$\mathtt{c}$$, should both be 0, since they have to be equal to their opposites.

In that case, $$\begin{bmatrix}\mathtt{-1} & \mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0} & \mathtt{-1}\end{bmatrix}$$ would be its own inverse too. What does that mean?

For the identity matrix, it means that if we apply the do-nothing transformation to the home-base matrix, we stay on home base, with the “x” vector pointed to (1, 0) and the “y” vector pointed to (0, 1). The matrix above represents a reflection across the origin. So, applying a reflection across the origin to it should return us to home base.

Wouldn’t a reflection across the x-axis be an inverse of itself, then? Yes! The criteria above for an inverse need to be amended a little to allow for negatives effectively canceling each other out to make positives. $\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}$
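Both self-inverse reflections are easy to confirm numerically (a NumPy sketch of mine): applying either one twice lands back on the identity.

```python
import numpy as np

# Reflection across the x-axis, and reflection across the origin:
# each is its own inverse, since applying it twice does nothing.
for F in (np.diag([1.0, -1.0]), np.diag([-1.0, -1.0])):
    print(np.array_equal(F @ F, np.eye(2)))   # prints True twice
```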

The inverse of a matrix is represented with a superscript $$\mathtt{-1}$$. So, the inverse of the matrix $$\mathtt{A}$$ is written as $$\mathtt{A^{-1}}$$.

The Transpose

The transpose of a matrix is the matrix you get when you take the rows of the matrix and turn them into columns instead. The transpose of a matrix is represented with a superscript $$\mathtt{T}$$. So: $\begin{bmatrix}\mathtt{\,\,\,\,1} & \mathtt{3}\\\mathtt{-2} & \mathtt{4}\end{bmatrix}^{\mathtt{T}}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{-2}\\\mathtt{3} & \mathtt{\,\,\,\,4}\end{bmatrix}$

If a 2 × 2 matrix $$\mathtt{V}$$ is orthogonal—meaning that its column vectors are perpendicular and each column vector has a length of 1—then its transpose is the same as its inverse, or $$\mathtt{V^{-1}=V^T}$$.

Eigenvectors for Reflections

Having just finished a post on eigenvalue decomposition, involving eigenvalues and eigenvectors, I was flipping through this nifty little textbook, where I saw a description of geometric reflections using eigenvectors, and I thought, “Oh my gosh, of course!”

We have seen reflections here, yet back then we had to find something called the foot of a point to figure out the reflection. But we can construct a reflection matrix (same as a scaling matrix) using only knowledge about eigenvectors.

When we talked about eigenvectors before—those vectors which do not change direction under a transformation (except if they reverse direction)—we were looking for them. But in a reflection, we already know them. In any reflection across a line, we already know that the vector that matches the line of reflection will not change direction, and the vector perpendicular to the line of reflection will only be scaled by $$\mathtt{-1}$$. So, both the vector which describes the line of reflection and the vector perpendicular to that line are eigenvectors. So let’s say we want to reflect point $$\mathtt{C}$$ across the line described by the vector $$\mathtt{\alpha(2, -1)}$$.

Our eigenvector matrix will be $\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}$ Here we happen to know the second eigenvector, but if we didn’t, we could simply use a vector perpendicular to the first.

Our eigenvalue matrix will be $\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}$ since we will keep the first eigenvector—the line of reflection—fixed (eigenvalue of 1) and just flip the second eigenvector (eigenvalue of –1).

As a penultimate step, we calculate the inverse of the eigenvector matrix (we’ll get into inverses fairly soon), and then finally multiply all those matrices together (right to left), as we saw with the eigenvalue decomposition, to get the reflection matrix. $\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{2}{5}}&\mathtt{-\frac{1}{5}}\\\mathtt{-\frac{1}{5}}&\mathtt{-\frac{2}{5}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{\,\,\,\,\frac{3}{5}}&\mathtt{-\frac{4}{5}}\\\mathtt{-\frac{4}{5}}&\mathtt{-\frac{3}{5}}\end{bmatrix}$

The final matrix on the right side is the reflection matrix for reflections across the line represented by the vector $$\mathtt{\alpha(2, -1)}$$, or essentially all lines with a slope of $$\mathtt{-\frac{1}{2}}$$.
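The whole construction can be sketched in a few lines of Python with NumPy (my addition, mirroring the triple product above):

```python
import numpy as np

# Eigenvector matrix: the line-of-reflection vector (2, -1) and a
# perpendicular vector (-1, -2) as columns.
E = np.array([[2.0, -1.0],
              [-1.0, -2.0]])

# Eigenvalue matrix: keep the line fixed (1), flip the perpendicular (-1).
D = np.diag([1.0, -1.0])

# Reflection matrix = E D E^{-1}, multiplied right to left.
R = E @ D @ np.linalg.inv(E)
print(R)   # approximately [[0.6, -0.8], [-0.8, -0.6]]
```

Reflecting the point (2, 1) across the line, for instance, is then just `R @ np.array([2.0, 1.0])`, which gives (0.4, –2.2).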

It’s worth playing around with building these matrices and using them to find reflections. A lot of what’s here is pretty obvious, but reflection matrices can come in handy in some non-obvious ways too. And note that the sign of the perpendicular vector doesn’t change the final matrix: flipping it cancels out in the triple product, so either perpendicular direction works.