K-Means Clustering

K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t) with 2 components each (an x-component and y-component, i.e., 2D data).

Determining Least Distances

To start, we select a \(\mathtt{k}\), a number of clusters that we want, remembering that we know right now how many clusters there are but in most situations we would not. We choose \(\mathtt{k=2}\). Then we place the two cluster points at random locations—here I’ve put \(\mathtt{\color{blue}{k_1}}\) at \(\mathtt{(2,7)}\) and \(\mathtt{\color{red}{k_2}}\) at \(\mathtt{(4,2)}\).

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean Theoremish Euclidean distance of \(\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}\). The cluster that we assign to each point is given by the closest center to that point. You can run the code below to print out the 8 distances.
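The interactive code isn't reproduced here, but a minimal Python sketch is below. The post only gives A's coordinates and the mean of B, C, and D explicitly, so the individual coordinates used here for B, C, and D are assumptions consistent with the calculations later in the post.

```python
from math import sqrt

# Assumed coordinates, consistent with the means used later in the post.
points  = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
centers = {"k1": (2, 7), "k2": (4, 2)}

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Print all 8 point-to-center distances.
for name, p in points.items():
    for cname, c in centers.items():
        print(f"{name} to {cname}: {distance(p, c):.2f}")
```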

Going by shorter distances, our first result, then, is to group point A in cluster \(\mathtt{\color{blue}{k_1}}\), because the first number in that pair is smaller, and points B, C, and D in cluster \(\mathtt{\color{red}{k_2}}\), because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points in the cluster to which it is currently assigned. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster \(\mathtt{\color{red}{k_2}}\), the points B, C, and D have a mean of (\(\mathtt{\frac{2 + 2 + 6}{3}}\), \(\mathtt{\frac{4 + 3 + 5}{3}}\)), or (\(\mathtt{\frac{10}{3}}\), \(\mathtt{4}\)). And since \(\mathtt{\color{blue}{k_1}}\) has just one point, A, it will move to smack dab on top of that point, at (5, 6).

Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster \(\mathtt{\color{blue}{k_1}}\) and B and C belong to cluster \(\mathtt{\color{red}{k_2}}\).

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in each cluster). Further calculations won’t change these assignments, so we’re done: in general, the k-means algorithm stops when the assignments (and thus the centers) stop changing, or stop changing by much.
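Here is a minimal sketch of the whole loop, using the same (partly assumed) point coordinates and starting centers as above. It just runs the assignment and update steps a fixed number of times; it isn't meant as an optimized implementation.

```python
from math import sqrt

def kmeans(points, centers, iterations=10):
    """A bare-bones k-means loop: assign each point to its nearest
    center, move each center to the mean of its points, repeat."""
    for _ in range(iterations):
        # Assignment step: index of the nearest center for each point.
        clusters = [min(range(len(centers)),
                        key=lambda i: sqrt((p[0] - centers[i][0]) ** 2 +
                                           (p[1] - centers[i][1]) ** 2))
                    for p in points]
        # Update step: move each center to the mean of its assigned points.
        for i in range(len(centers)):
            members = [p for p, c in zip(points, clusters) if c == i]
            if members:  # leave a center alone if no points were assigned to it
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return clusters, centers

points = [(5, 6), (2, 4), (2, 3), (6, 5)]   # A, B, C, D (assumed coordinates)
print(kmeans(points, [(2, 7), (4, 2)]))     # clusters [0, 1, 1, 0]; centers (5.5, 5.5) and (2.0, 3.5)
```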

Sum and Product Loops

It’s something of a truism that mathematical symbolism is difficult. There are some situations, though, where the symbolism is not just difficult but also annoying and ridiculous. Symbolism likely saved a lot of time when people were still mostly writing ideas out by hand, so back then even the annoying and ridiculous could not be righteously pointed at and mocked. Nowadays, though, it is almost certainly more difficult to set some statements in LaTeX than it is to type them out in words, and, if the text is intended to teach students, more difficult to unpack the symbols than it is to understand the words.

Examples of symbols that are justly symbolized, even today, are \(\mathtt{\sum}\) and \(\mathtt{\prod}\), representing a sum and a product, respectively. More specifically, these symbols represent loops—an addition loop or a multiplication loop.

So, for example, take this expression on the left side of the equals sign, which represents the loop sum on the right of the equals sign: \(\mathtt{\sum_{n=1}^{5}n=1+2+3+4+5}\). The expression on the left just means (a) start a counter at 1, (b) count up to 5 by 1s, (c) let n = each number you count, then (d) add all the n’s one by one in a loop.

How about this one? \[\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\]

This one means (a) start a counter at 0, (b) count up to 4 by 1s, (c) let n = each number you count, then (d) add all the 2n’s one by one in a loop.

For products, we just swap out the symbol. Here is the corresponding product for the first loop: \(\mathtt{\prod_{n=1}^{5}n=1\times2\times3\times4\times5}\). And here’s one for the second loop: \[\mathtt{\prod_{n=0}^{4}2n=0\times2\times4\times6\times8}\]
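If it helps to see the loops literally, here is what those four expressions look like as actual loops in Python.

```python
# The summation symbol, written as the addition loops it describes.
total = 0
for n in range(1, 6):      # n = 1, 2, 3, 4, 5
    total += n             # add each n
print(total)               # 15

total = 0
for n in range(0, 5):      # n = 0, 1, 2, 3, 4
    total += 2 * n         # add each 2n
print(total)               # 20

# The product symbol, written as multiplication loops.
product = 1
for n in range(1, 6):
    product *= n           # multiply by each n
print(product)             # 120

product = 1
for n in range(0, 5):
    product *= 2 * n       # the first factor is 0, so the whole product is 0
print(product)             # 0
```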

Loops and Linear Algebra

You’ll often see the summation loop in linear algebra contexts, because it is an equivalent way to write a dot product, for example. The sum \(\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\) above can be written as shown below, which looks like more work to write—and is—but when we’re dealing mostly with variables, the savings in writing effort is more evident. \[\quad\,\,\,\begin{bmatrix}\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\end{bmatrix}\cdot \begin{bmatrix}\mathtt{0}\\\mathtt{1}\\\mathtt{2}\\\mathtt{3}\\\mathtt{4}\end{bmatrix}\mathtt{=2\cdot0+2\cdot1+2\cdot2\ldots}\]

The loop sum \(\mathtt{\sum_{i}a_{i}x_{i}+b}\), where \(\mathtt{i}\) is an index pointing to a component of vector \(\mathtt{a}\) and vector \(\mathtt{x}\), can be written more simply as \(\mathtt{a\cdot x+b}\), as long as the context is clear that \(\mathtt{a}\) and \(\mathtt{x}\) are vectors.
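As a quick check of that equivalence, here is the loop sum next to NumPy's dot product, using the two vectors from the example above.

```python
import numpy as np

a = np.array([2, 2, 2, 2, 2])
x = np.array([0, 1, 2, 3, 4])

# The loop sum, written out as a loop over the index n...
loop_sum = sum(a[n] * x[n] for n in range(5))

# ...and the same thing as a dot product.
print(loop_sum, np.dot(a, x))   # 20 20
```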

Linear Algebra Exercises I

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working the exercises on a piece of paper and then checking your answers: one at a time at first, and then going several at a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer, which is given below each exercise.

Convert each equation of a line to a vector equation in parametric form, \(\mathtt{l(k)=p+kv}\), where \(\mathtt{p}\) and \(\mathtt{v}\) are 2D vectors and \(\mathtt{k}\) is a scalar variable.

  1. \(\mathtt{y=x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  2. \(\mathtt{y=2x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

  3. \(\mathtt{y=-2x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  4. \(\mathtt{y=x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  5. \(\mathtt{y=10}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{10}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

  6. \(\mathtt{x+y=1}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

  7. \(\mathtt{2x+3y=9}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  8. \(\mathtt{x=5}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{5}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  9. \(\mathtt{-3x-\frac{1}{4}y=15}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-60}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-12}\end{bmatrix}\mathtt{k}\)

  10. Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

For slope-intercept form, \(\mathtt{y=mx+b}\), the conversion \(\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,\) works for any line that can be written in that form (that is, any non-vertical line).

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of \(\mathtt{0}\), and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

  1. \(\mathtt{l_{1}(k)=}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{1}{2}}\), y-intercept → \(\mathtt{\frac{5}{2}}\), x-intercept → \(\mathtt{5}\)

  2. \(\mathtt{l_{2}(k)=}\begin{bmatrix}\mathtt{-3}\\\mathtt{-5}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix}\mathtt{k}\)

slope → undefined, y-intercept → none, x-intercept → \(\mathtt{-3}\)

  3. \(\mathtt{l_{3}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-1}\end{bmatrix}+\begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{1}\), y-intercept → \(\mathtt{-1}\), x-intercept → \(\mathtt{1}\)

  4. \(\mathtt{l_{4}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-9}\end{bmatrix}+\begin{bmatrix}\mathtt{7}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{0}\), y-intercept → \(\mathtt{-9}\), x-intercept → none

  5. \(\mathtt{l_{5}(k)=}\begin{bmatrix}\mathtt{-5}\\\mathtt{\,\,\,\,1}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{4}{3}}\), y-intercept → \(\mathtt{-\frac{17}{3}}\), x-intercept → \(\mathtt{-\frac{17}{4}}\)

And just two more. Determine if each point is on the line mentioned from above.

  1. Is \(\mathtt{(-11,3)}\) on the line \(\mathtt{l_1}\)?

No: \(\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}\).
No value of \(\mathtt{k}\) solves both \(\mathtt{2k+1=-11}\) and \(\mathtt{-k+2=3}\).

  2. Is \(\mathtt{(5,5)}\) on the line \(\mathtt{l_3}\)?

No. Since the line has a slope of \(\mathtt{1}\), it would contain \(\mathtt{(5,5)}\) if the line passed through the origin, but it does not.
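If you would like to check answers like these with a little code, here is one possible sketch: it solves each component equation for \(\mathtt{k}\) and sees whether the values agree. The function name on_line is just something made up for illustration.

```python
from fractions import Fraction

def on_line(p, v, point):
    """Check whether `point` lies on the line l(k) = p + k*v by solving
    each component for k and seeing whether the two k's agree."""
    ks = []
    for p_i, v_i, c_i in zip(p, v, point):
        if v_i == 0:
            # This component never changes, so it must already match.
            if p_i != c_i:
                return False
        else:
            ks.append(Fraction(c_i - p_i, v_i))
    return len(set(ks)) <= 1

print(on_line((1, 2), (2, -1), (-11, 3)))   # False (first question above)
print(on_line((0, -1), (2, 2), (5, 5)))     # False (second question above)
```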

More with Least Squares

One very cool thing about our formula for the least squares regression line, \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\), is that it is the same no matter whether we have one independent variable (univariate) or many independent variables (multivariate).

Consider these data, showing the selling prices of some grandfather clocks at auction. The first scatter plot shows the age of the clock in years on the x-axis (100–200), and the second shows the number of bidders on the x-axis (0–20). Price (in pounds or dollars) is on the y-axis on each plot (500–2500).

| Age (years) | Bidders | Price ($) |
|---|---|---|
| 127 | 13 | 1235 |
| 115 | 12 | 1080 |
| 127 | 7 | 845 |
| 150 | 9 | 1522 |
| 156 | 6 | 1047 |
| 182 | 11 | 1979 |
| 156 | 12 | 1822 |
| 132 | 10 | 1253 |
| 137 | 9 | 1297 |
| 113 | 9 | 946 |
| 137 | 15 | 1713 |
| 117 | 11 | 1024 |
| 137 | 8 | 1147 |
| 153 | 6 | 1092 |
| 117 | 13 | 1152 |
| 126 | 10 | 1336 |
| 170 | 14 | 2131 |
| 182 | 8 | 1550 |
| 162 | 11 | 1884 |
| 184 | 10 | 2041 |
| 143 | 6 | 854 |
| 159 | 9 | 1483 |
| 108 | 14 | 1055 |
| 175 | 8 | 1545 |
| 108 | 6 | 729 |
| 179 | 9 | 1792 |
| 111 | 15 | 1175 |
| 187 | 8 | 1593 |
| 111 | 7 | 785 |
| 115 | 7 | 744 |
| 194 | 5 | 1356 |
| 168 | 7 | 1262 |

You can see in the code sketch below that the first regression line, for the price of a clock as a function of its age, is approximately \(\mathtt{10.5x-192}\). The second regression line, for the price of a clock as a function of the number of bidders at auction, is approximately \(\mathtt{55x+806}\). As mentioned above, each of these univariate least squares regression lines can be calculated with the formula \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\).

Combining both age and number of bidders together, we can calculate, using the same formula, a multivariate least squares regression equation. This of course is no longer a line. In the case of two input variables as we have here, our line becomes a plane.

Our final regression equation becomes \(\mathtt{12.74x_{1}+85.82x_{2}-1336.72}\), with \(\mathtt{x_1}\) representing the age of a clock and \(\mathtt{x_2}\) representing the number of bidders.
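Here is a sketch of how those three regressions might be computed with NumPy, using the data in the table above and the normal-equation formula. The helper function least_squares is made up for this sketch, not taken from the original notebook, and the printed values are rounded.

```python
import numpy as np

# Data from the table above: age (years), number of bidders, price.
age     = np.array([127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137,
                    117, 137, 153, 117, 126, 170, 182, 162, 184, 143, 159,
                    108, 175, 108, 179, 111, 187, 111, 115, 194, 168])
bidders = np.array([13, 12, 7, 9, 6, 11, 12, 10, 9, 9, 15, 11, 8, 6, 13,
                    10, 14, 8, 11, 10, 6, 9, 14, 8, 6, 9, 15, 8, 7, 7, 5, 7])
price   = np.array([1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297,
                    946, 1713, 1024, 1147, 1092, 1152, 1336, 2131, 1550,
                    1884, 2041, 854, 1483, 1055, 1545, 729, 1792, 1175,
                    1593, 785, 744, 1356, 1262])

def least_squares(X, y):
    """v = (X^T X)^-1 X^T y, with a column of 1s added for the intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.inv(X.T @ X) @ X.T @ y

print(least_squares(age, price))       # intercept, slope: roughly -192 and 10.5
print(least_squares(bidders, price))   # roughly 806 and 55
print(least_squares(np.column_stack([age, bidders]), price))
                                       # roughly -1336.72, 12.74, 85.82
```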

Linear Algebra Connections

There are so many connections within and applications of linear algebra—I can only imagine that this will be a series of posts, to the extent that I continue writing about the subject. Here are a few connections that I’ve come across in my reading recently.

Compound Interest and Matrix Powers

We can multiply a matrix by itself \(\mathtt{n}\) times. The result is the matrix to the power \(\mathtt{n}\). We can use this when setting up a compound interest situation. For example, suppose we have three accounts, which each have a different interest rate compounded annually—say, 5%, 3%, and 2%. Without linear algebra, the amount in the first account can be modeled by the equation \[\mathtt{A(t)=p \cdot 1.05^t}\] where \(\mathtt{p}\) represents the starting amount in the account, and \(\mathtt{t}\) represents the time in years. With linear algebra, we can group all of the account interest rates into a matrix. The first year for each account would look like this: \[\mathtt{A(1)}=\begin{bmatrix}\mathtt{1.05}&\mathtt{0}&\mathtt{0}\\\mathtt{0}&\mathtt{1.03}&\mathtt{0}\\\mathtt{0}&\mathtt{0}&\mathtt{1.02}\end{bmatrix}^\mathtt{1}\begin{bmatrix}\mathtt{p_1}\\\mathtt{p_2}\\\mathtt{p_3}\end{bmatrix}=\begin{bmatrix}\mathtt{1.05p_1}\\\mathtt{1.03p_2}\\\mathtt{1.02p_3}\end{bmatrix}\]

For years beyond the first year, all we have to do is raise the matrix to the appropriate power. Since it’s diagonal, squaring it, cubing it, etc., will square, cube, etc., each entry. This computation can be a little more organized—and more straightforward for programming. A matrix has to be square (\(\mathtt{m \times m})\) in order to raise it to a power. Below we calculate the amount in each account after 100 years.
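The original post runs this calculation in a notebook; as a stand-in, here is a sketch with NumPy. The starting amount of $1000 in each account is made up for the example.

```python
import numpy as np

# Annual growth factors for the three accounts, on the diagonal.
rates = np.diag([1.05, 1.03, 1.02])
p = np.array([1000, 1000, 1000])   # assumed starting amounts

# Amount in each account after 100 years: raise the (diagonal) matrix
# to the 100th power and multiply by the vector of starting amounts.
amounts = np.linalg.matrix_power(rates, 100) @ p
print(amounts.round(2))   # roughly [131501.26, 19218.63, 7244.65]
```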

Centroids and Areas

We have seen that the determinant can be thought about as the area of the parallelogram formed by two vectors. We can use this fact to determine the area of a complex shape like the one shown below.

Since determinants are signed areas, as we move around the shape counterclockwise, calculating the determinant of each vector pair (and multiplying each determinant by one half so we just get each triangle), we get the total area of the shape.\[\frac{1}{2}\left(\begin{vmatrix}\mathtt{6}&\mathtt{6}\\\mathtt{0}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{6}&\mathtt{3}\\\mathtt{4}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{3}\\\mathtt{4}&\mathtt{6}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{-2}\\\mathtt{6}&\mathtt{\,\,\,\,6}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{-2}\\\mathtt{\,\,\,\,6}&\mathtt{\,\,\,\,3}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{0}\\\mathtt{\,\,\,\,3}&\mathtt{3}\end{vmatrix}\right)=\mathtt{36}\text{ units}\mathtt{^2}\]

That’s pretty hand-wavy, but it’s something that you can probably figure out with a little experimentation.
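One way to do that experimentation is in code. Below is a sketch that walks the vertices counterclockwise and adds up half of each determinant. The vertex list is read off from the column vectors in the calculation above, with one vertex placed at the origin (the convention noted a couple of paragraphs down).

```python
import numpy as np

# Vertices in counterclockwise order, read from the determinants above,
# with one vertex assumed to sit at the origin.
vertices = [(0, 0), (6, 0), (6, 4), (3, 4), (3, 6), (-2, 6), (-2, 3), (0, 3)]

area = 0.0
for i in range(len(vertices)):
    v1 = vertices[i]
    v2 = vertices[(i + 1) % len(vertices)]   # wrap around to the start
    # Half the determinant of each consecutive pair of position vectors.
    area += 0.5 * np.linalg.det(np.column_stack([v1, v2]))

print(round(area, 2))   # 36.0
```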

Another counterclockwise-moving calculation (though this one can be clockwise without changing the answer) is the calculation of the centroid of a closed shape. All that is required here is to calculate the sum of the position vectors of the vertices of the figure and then divide by the number of vertices (five in the example below, counting the vertex placed at the origin).

\[\mathtt{\frac{1}{5}}\left(\begin{bmatrix}\mathtt{3}\\\mathtt{4}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{6}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}+\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,3}\end{bmatrix}\right)=\begin{bmatrix}\mathtt{-\frac{1}{5}}\\\mathtt{\,\,\,\,\frac{17}{5}}\end{bmatrix}\]

Here again, it’s just magic, but you can figure it out with a little play. In each case—for both complex areas and centroids—we assign a point to be the origin and go from there.
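And the centroid calculation as a short sketch, using the same convention of placing one vertex at the origin:

```python
import numpy as np

# Position vectors of the vertices, with the fifth vertex at the origin.
vertices = np.array([[3, 4], [0, 6], [-3, 4], [-1, 3], [0, 0]])

# Sum the position vectors and divide by the number of vertices.
centroid = vertices.sum(axis=0) / len(vertices)
print(centroid)   # [-0.2  3.4], i.e., (-1/5, 17/5)
```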

Least Squares with Linear Algebra

In brief, linear regression is about finding the line of best fit to a data set. If you’re looking for the linear algebra way of doing this, you will most likely find it searching for the term least squares.

In the basic scenario, you’ve got some two-dimensional data, \(\mathtt{(x, y)}\) coordinates, and you want to find the equation for a straight line that is as close as possible to each point. Such a scenario is given by the data in the table below. (In the interactive version of this post, the line of best fit is already graphed, and you can change the data in the table or move the points around to see how the line changes.)

| x | y |
|---|---|
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |
So, let’s imagine that we haven’t found this line yet. We know what we are looking for is a line of the form \(\mathtt{y=mx+b}\). In linear algebra terms, we want the vector \(\mathtt{y}\) (11 rows, 1 column) to equal the vector \(\mathtt{x}\) (11 rows, 1 column) scaled by a slope \(\mathtt{m}\), plus an intercept vector \(\mathtt{b}\) (11 rows, 1 column, with the intercept repeated in every entry).

The first problem with this that we have to fix is that \(\mathtt{x}\) the 11 × 1 vector needs to be \(\mathtt{X}\) the 11 × 2 matrix so that we get our equations right. So we’ll pad \(\mathtt{x}\) with some 1s and then we’ll be able to call it \(\mathtt{X}\). (We can pad on either side, left or right, just remembering to interchange \(\mathtt{b}\) and \(\mathtt{m}\) to keep the equations straight.) The second problem we can fix is that we don’t need a separate intercept vector and a separate slope vector. We can combine things so that we form an equivalent matrix equation that means the same thing as \(\mathtt{y=mx+b}\). We need the equation to look like this: \[\begin{bmatrix}\mathtt{8.04}\\\mathtt{6.95}\\\mathtt{7.58}\\\mathtt{8.81}\\\mathtt{8.33}\\\mathtt{9.96}\\\mathtt{7.24}\\\mathtt{4.26}\\\mathtt{10.84}\\\mathtt{4.82}\\\mathtt{5.68}\end{bmatrix} = \begin{bmatrix}\mathtt{1}&\mathtt{10}\\\mathtt{1}&\mathtt{8}\\\mathtt{1}&\mathtt{13}\\\mathtt{1}&\mathtt{9}\\\mathtt{1}&\mathtt{11}\\\mathtt{1}&\mathtt{14}\\\mathtt{1}&\mathtt{6}\\\mathtt{1}&\mathtt{4}\\\mathtt{1}&\mathtt{12}\\\mathtt{1}&\mathtt{7}\\\mathtt{1}&\mathtt{5}\end{bmatrix}\begin{bmatrix}\mathtt{b}\\\mathtt{m}\end{bmatrix}\]

Multiply the matrix and vector on the right side of that equation, and you get, for the first equation, \(\mathtt{8.04=b+(m)(10)}\). That’s equivalent to \(\mathtt{y=mx+b}\) for the first \(\mathtt{(x, y)}\) data point \(\mathtt{(10, 8.04)}\). The matrix and vector setup ensures that all the equations for all the points are of the correct form.

In linear algebra terms, we have rewritten the equation to be \(\mathtt{y=Xv}\), where \(\mathtt{y}\) is an 11 × 1 vector, \(\mathtt{X}\) is an 11 × 2 matrix (padded with some 1s), and \(\mathtt{v}\) is a 2 × 1 vector which contains the unknown slope \(\mathtt{m}\) of the best fit line and the unknown intercept \(\mathtt{b}\).

It’s an Approximation

At this point in the explanation, it’s important to realize that we will make another shift. The first was from the real data to the matrix algebra setup. We shift again below, away from that setup, per se, and toward just finding out what that unknown \(\mathtt{v}\) is.

As an analogy, the system shown below features a 3d vector (2, 0, –2) which does not live in the same plane as the one formed by the two other column vectors of the matrix. Thus, there is no solution \(\mathtt{(j, k)}\). \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,1}\\\mathtt{1}&\mathtt{-3}\\\mathtt{1}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{j}\\\mathtt{k}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}\longleftarrow\text{no solutions}\]

Similarly, the columns of our \(\mathtt{X}\) matrix form a plane in 11-dimensional space. In order for \(\mathtt{v}\) to be a solution to our original matrix equation above, \(\mathtt{y}\) has to live in this plane too. But we already know that it doesn’t. If it did, the points would lie along some line. It’s true that \(\mathtt{y}\) is an 11-dimensional vector, just like each column of \(\mathtt{X}\), but \(\mathtt{y}\) doesn’t live in the same plane.

The closest approximation we can get to \(\mathtt{y}\) in the plane of \(\mathtt{X}\) is \(\mathtt{Xv=p}\), where \(\mathtt{p}\) is the projection of \(\mathtt{y}\) onto the plane (another thing we’ll have to come back to). The vector \(\mathtt{q}\) connecting \(\mathtt{y}\) with its projection \(\mathtt{p}\) is perpendicular to the plane (and, thus, to \(\mathtt{p}\) and to every other vector in the plane).

Now we can write some more equations. For example, we know we want \(\mathtt{Xv=p}\), but it’s also true that \(\mathtt{p+q=y}\). And we can do some manipulation to show that all the column vectors of \(\mathtt{X}\) are perpendicular to \(\mathtt{q}\). Assuming for a second that our \(\mathtt{X}\) is a 2 × 2 matrix, we embed \(\mathtt{X}\) in a 3 × 2 matrix by adding a row of zeros, transpose it (so that multiplication with \(\mathtt{q}\) is defined), and multiply it by \(\mathtt{q}\). The same basic idea applies to our original \(\mathtt{X}\) (11 × 2) matrix. \[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\\\mathtt{0}&\mathtt{0}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\]\[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{q_3}\end{bmatrix}=\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}\] This gives us a key equation: \(\mathtt{X^{T}q=0}\).

Since \(\mathtt{q=y-p}\), substituting for \(\mathtt{q}\) gets us \(\mathtt{X^{T}(y-p)=0}\). Then since \(\mathtt{Xv=p}\), substituting for \(\mathtt{p}\) brings us to \(\mathtt{X^{T}(y-Xv)=0}\). Distribute to get \(\mathtt{X^{T}y-X^{T}Xv=0}\), which means \(\mathtt{X^{T}y=X^{T}Xv}\). Multiply both sides on the left by the inverse of \(\mathtt{X^{T}X}\), and the vector \(\mathtt{v}\) that we’re after, then, is \[\mathtt{v=(X^{T}X)^{-1}X^{T}y}\]

That formula gives us the slope and intercept of our best fit line. Below is one way the least squares can be calculated with a little Python. The calculation I used for the interactive version of this post to get the best fit line is way more complicated. If I had known about the linear algebra way when I made it, I would have definitely gone with that instead.
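The original snippet isn't reproduced here, but a sketch along the same lines, using the data from the table above and the formula we just derived, might look like this.

```python
import numpy as np

# The (x, y) data from the table above.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Pad x with a column of 1s to form the 11 x 2 matrix X.
X = np.column_stack([np.ones(len(x)), x])

# v = (X^T X)^-1 X^T y gives [intercept, slope].
v = np.linalg.inv(X.T @ X) @ X.T @ y
print(v)   # roughly [3.0, 0.5], i.e., y = 0.5x + 3
```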

The Singular Value Decomposition

We’ve now seen the eigenvalue decomposition of a linear transformation (in the form of a matrix). We can think of what we did in that decomposition as breaking up the original transformation into three transformations. If we multiply the rightmost matrix by any vector, and then multiply the middle matrix by that product, and then multiply the leftmost matrix on the right-hand side by that product, we would see that starting vector be transformed three times. That process would be equivalent to multiplying the starting vector by the original matrix.

We can also say that, in that original transformation matrix, which we’ll call \(\mathtt{A}\), we mapped a set of orthogonal vectors, or vectors at right angles to each other, (1, 0) and (0, 1), onto a set of non-orthogonal vectors (0, –2) and (1, –3). We don’t have to multiply each vector by the transformation matrix one at a time. We can multiply the set of vectors, as a matrix, by the transformation matrix, like so.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{0}\\\mathtt{0}&\mathtt{1}\end{bmatrix}=\begin{bmatrix}\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}&\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\)

Of course, we just multiplied the original matrix by the identity matrix, so it spit out the original matrix again. But the above interpretation is different, though it gives the same results.

Okay, great, but we humans seem to love our right angles. So, this question arises about linear transformations: could we use any matrix to map some pair of orthogonal vectors (vectors at right angles to each other) to a different set of orthogonal vectors? That is, could we transform our way from one orthogonal “system” to another with any single transformation matrix?

Could we find a pair of orthogonal vectors which, after undergoing our transformation \(\mathtt{A}\), were mapped to a different pair of orthogonal vectors? If we could, that would mean that if we multiplied \(\mathtt{A}\) by a set of orthogonal vectors \(\mathtt{V}\) (i.e., transformed the set of vectors \(\mathtt{V}\) by the action of \(\mathtt{A}\)), it would be equivalent to just starting with a different set of orthogonal vectors already in position (we’ll call these orthogonal vectors \(\mathtt{U}\)) and just scaling them by a scaling matrix \(\mathtt{\Sigma}\). In notation, we’ll write this hypothesis as \[\mathtt{AV=U \Sigma}\]

The word hypothesis is important here. We’re not really writing down something we know is equivalent—that is, we don’t really know if there is a \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) which will make this equivalence true. Students (and I) have a hard time not seeing the equals sign as meaning “I know that it is true that these are equivalent.” But that’s not what it means here, and it’s good to get used to that flexibility. What it means here is that we are supposing for the time being that these two products are equivalent. If some contradiction falls out of our algebra (or we get some kind of infinity), we’ll know that the equivalence fails—at least insofar as we want just one matrix to pop out for each unknown matrix.

Let’s first see how this equivalence appears before we dig into figuring out what \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are. That is, let me show you that the matrices \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are possible before we look at what they are. On the left is our original transformation matrix acting on a set of orthogonal vectors \(\mathtt{V}\) (purple and red), so this transformation shows \(\mathtt{AV}\). (You can see that the original (1, 0) and (0, 1) vectors go to where they’re supposed to.)

\[\mathtt{AV\quad\quad\quad\quad =\quad\quad\quad\quad U \Sigma}\]

On the right is a set of two different orthogonal vectors \(\mathtt{\Sigma}\) (purple and red) which start out aligned to the horizontal and vertical grid lines and are then rotated and reflected (reflection being a kind of scaling) by a matrix \(\mathtt{U}\). And the two transformations are equivalent! (In the demo they are very close.) We can squeeze an orthogonal-to-orthogonal transformation out of that weird matrix we saw in the eigenvalue decomposition, which took a square and rotated and stretched it into a funky parallelogram.

I note that, above, I called \(\mathtt{\Sigma}\) a scaling matrix, but here I’m using it as just a set of orthogonal vectors. Luckily for us, both things are true. It just depends on how you look at them. A square matrix like the ones we are using represents either a pair of 2-dimensional vectors or a linear transformation. We get to decide how to interpret the matrix in any given situation.

Getting \(\mathtt{V}\)

To begin to know what these matrices are, we can start by writing the equation above like this. \[\mathtt{A=U \Sigma V^T}\] We can do this because \(\mathtt{V}\) is orthogonal, and I briefly mentioned last time that the transpose of an orthogonal matrix is the same as its inverse. So, in effect, we multiplied both sides of the equation by \(\mathtt{V^{-1}}\), which removed the V from the left-hand side.

Okay, now we pull a little transpose magic, but let’s walk through it. Start by multiplying the expression on each side, on the left, by its transpose. (We’ll circle back to grounding all this some other time.) So, we have \[\mathtt{A^TA=(U\Sigma V^T)^T(U\Sigma V^T)}\] We multiplied \(\mathtt{A}\) by its transpose by multiplying to the left of \(\mathtt{A}\), and we multiplied \(\mathtt{U\Sigma V^T}\) by its transpose by multiplying to the left of \(\mathtt{U\Sigma V^T}\). The product \(\mathtt{A^TA}\) is simple: \[\begin{bmatrix}\mathtt{0}&\mathtt{-2}\\\mathtt{1}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}\] But let’s take some time to simplify \(\mathtt{(U\Sigma V^T)^T(U\Sigma V^T)}\). As an example of what to do when we have the transpose of a product of matrices, consider these products.

\[\left(\begin{bmatrix}\mathtt{1}&\mathtt{3}\\\mathtt{2}&\mathtt{4}\end{bmatrix}\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{5}&\mathtt{7}\end{bmatrix}\right)^\mathtt{T} \longrightarrow \]
\[\begin{bmatrix}\mathtt{19}&\mathtt{27}\\\mathtt{28}&\mathtt{40}\end{bmatrix}^\mathtt{T}\]
\[\begin{bmatrix}\mathtt{4}&\mathtt{5}\\\mathtt{6}&\mathtt{7}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{2}\\\mathtt{3}&\mathtt{4}\end{bmatrix} \longrightarrow \]
\[\begin{bmatrix}\mathtt{19}&\mathtt{28}\\\mathtt{27}&\mathtt{40}\end{bmatrix}\,\,\,\]

Each matrix product on the left is equal to the expression on its right. Test them out for yourself. But the expressions on the right are equal to each other, which means the products on the left are equal to each other. This example shows that \(\mathtt{(AB)^{T}=B^{T}A^{T}}\). The transpose of a product is equal to the product of the transposes, multiplied in reverse order. Again, we’ll ground this later, but this suggests, correctly, that we can rewrite the transposed product above, \(\mathtt{(U\Sigma V^T)^T}\), as \(\mathtt{V\Sigma^{T}U^{T}}\), considering that \(\mathtt{(V^{T})^{T}=V}\). Multiplying that by the remaining part of the right-hand side, \(\mathtt{U\Sigma V^T}\), we get \(\mathtt{V\Sigma^{T}U^{T}U\Sigma V^T}\). Since \(\mathtt{U}\) is orthogonal, its transpose is its inverse, so the \(\mathtt{U}\) terms cancel, leaving us with \(\mathtt{V\Sigma^{T}\Sigma V^T}\). The transpose of a scaling matrix, \(\mathtt{\Sigma}\) in this case, is itself (try it out!), so the middle can be written more simply as \(\mathtt{\Sigma^{2}}\). So, finally (pretty far from finally, actually), we have \[\mathtt{A^{T}A=V\Sigma^{2}V^T}\]

And this is an eigenvalue decomposition for \(\mathtt{A^{T}A}\)! We’ve got a scaling matrix as the lunchmeat, sandwiched by a matrix and its inverse (which is the same as the transpose in this case). So, now we can figure out \(\mathtt{V}\) and \(\mathtt{\Sigma}\) by doing the eigenvalue decomposition like we did previously. Here it is: \[\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}=\begin{bmatrix}\mathtt{\color{purple}{-\frac{521}{991}}}&\mathtt{\color{red}{-\frac{3725}{4379}}}\\\mathtt{\color{purple}{-\frac{3725}{4379}}}&\mathtt{\,\,\,\,\color{red}{\frac{521}{991}}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{4510}{329}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{658}{2255}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]

The purple and red vectors are our starting vectors from the animation on the left above: (–0.53, –0.85) and (–0.85, 0.53). When we ask which orthogonal starting vectors we can pick such that matrix \(\mathtt{A}\) will transform them to a different pair of orthogonal vectors, matrix \(\mathtt{V}\)—the red and purple vectors above—is our answer. The square roots of the diagonal values of the middle matrix above (which is \(\mathtt{\Sigma^2}\), remember) are called the singular values of \(\mathtt{A}\). To me, getting \(\mathtt{V}\) is the coolest part of this decomposition. The rest, below, is just gravy.

And Now for \(\mathtt{U}\) and \(\mathtt{\Sigma}\)

In this case, we multiply both sides of the equation \(\mathtt{A=U\Sigma V^T}\) on the right by the transpose of \(\mathtt{A}\) and get an eigenvalue decomposition of \(\mathtt{AA^T}\). \[\,\,\mathtt{AA^T=(U\Sigma V^T)(U\Sigma V^T)^T=U\Sigma V^{T}V\Sigma^{T}U^T=U\Sigma^{2}U^T}\]

\[\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{-3}\\\mathtt{-3}&\mathtt{\,\,\,\,13}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\color{purple}{\frac{4510}{329}}}&\mathtt{\color{red}{0}}\\\mathtt{\color{purple}{0}}&\mathtt{\color{red}{\frac{658}{2255}}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\]

Now the square roots of the purple and red diagonal entries give our starting vectors from the animation on the right above: (3.7, 0) and (0, 0.54). You’ll notice, of course, that \(\mathtt{\Sigma^2}\) is the same as above, so we really found it earlier. This step is only to pin down what \(\mathtt{U}\) is. At any rate, we are done, and we can write down the full singular value decomposition (SVD) of the original matrix \(\mathtt{A}\). \[\quad\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{1655}{447}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{894}{1655}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]

We’re finding an orthogonal-to-orthogonal transformation in a hurricane, so it shouldn’t be surprising to get weird numbers. One thing the SVD makes clear (the eigenvalue decomposition does this too) is that linear transformations can be described as combinations of rotations and scalings (the latter of which include reflections) and that’s it.
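If you want to check the numbers, NumPy's np.linalg.svd computes the decomposition directly. The signs of the columns of \(\mathtt{U}\) and \(\mathtt{V}\) may differ from the choices above, but the singular values and the reconstructed matrix will match.

```python
import numpy as np

A = np.array([[0, 1], [-2, -3]])

# U, the singular values, and V transpose.
U, s, Vt = np.linalg.svd(A)
print(s.round(3))                      # [3.702 0.54] -- the singular values
print((U @ np.diag(s) @ Vt).round(6))  # reconstructs A (up to floating-point error)
```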


Inverses and Transposes

This will now be my 22nd post on linear algebra, and I hope it’ll be noticeable, looking at all of them together so far, that we haven’t talked about systems of equations. And there’s a good reason for that: because they suck you down an ugly hole of mindless calculation, meaning-challenged tedium, and fruitless, pointless backward thinking. Systems are awesome and important, and I’m sure someone can make a strong argument for introducing them early, but, for me, they can wait.

The inverses of matrices are pretty interesting. The inverse of a 2 × 2 matrix is the matrix that, when multiplied by the original matrix, gives the identity matrix as the product. The inverse is given by the middle matrix below: \[\begin{bmatrix}\mathtt{a} & \mathtt{b}\\\mathtt{c} & \mathtt{d}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{d}{ad-bc}} & \mathtt{-\frac{b}{ad-bc}}\\\mathtt{-\frac{c}{ad-bc}} & \mathtt{\,\,\,\,\frac{a}{ad-bc}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

There is an easy-to-follow derivation of this formula here if you’re interested—one that requires only some high school algebra.
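As a quick sketch, here is the formula turned into a small function and checked against an arbitrary matrix. The function name inverse_2x2 is just for illustration.

```python
import numpy as np

def inverse_2x2(a, b, c, d):
    """Inverse of [[a, b], [c, d]] from the formula above."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("The matrix has no inverse (ad - bc = 0).")
    return np.array([[d, -b], [-c, a]]) / det

M = np.array([[1, 3], [-2, 4]])
print(inverse_2x2(1, 3, -2, 4) @ M)   # the identity matrix (within rounding)
```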

You can notice, given this setup, that the identity matrix is its own inverse. Are there any others that fit this description? One thought: we need \(\mathtt{ad-bc}\) to be equal to 1, and we also want \(\mathtt{a=d}\). The values on the other diagonal, \(\mathtt{b}\) and \(\mathtt{c}\), should both be 0, since they have to be equal to their opposites.

In that case, \(\begin{bmatrix}\mathtt{-1} & \mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0} & \mathtt{-1}\end{bmatrix}\) would be its own inverse too. What does that mean?

For the identity matrix, it means that if we apply the do-nothing transformation to the home-base matrix, we stay on home base, with the “x” vector pointed to (1, 0) and the “y” vector pointed to (0, 1). The matrix above represents a reflection across the origin. So, applying a reflection across the origin to it should return us to home base.

Wouldn’t a reflection across the x-axis be an inverse of itself, then? Yes! The criteria above for an inverse need to be amended a little to allow for negatives effectively canceling each other out to make positives. \[\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

The inverse of a matrix is represented with a superscript \(\mathtt{-1}\). So, the inverse of the matrix \(\mathtt{A}\) is written as \(\mathtt{A^{-1}}\).

The Transpose

The transpose of a matrix is the matrix you get when you take the rows of the matrix and turn them into columns instead. The transpose of a matrix is represented with a superscript \(\mathtt{T}\). So: \[\begin{bmatrix}\mathtt{\,\,\,\,1} & \mathtt{3}\\\mathtt{-2} & \mathtt{4}\end{bmatrix}^{\mathtt{T}}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{-2}\\\mathtt{3} & \mathtt{\,\,\,\,4}\end{bmatrix}\]

If a 2 × 2 matrix \(\mathtt{V}\) is orthogonal—meaning that its column vectors are perpendicular and each column vector has a length of 1—then its transpose is the same as its inverse, or \(\mathtt{V^{-1}=V^T}\).
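A quick numerical check of that fact, using a rotation matrix (which is orthogonal) as the example:

```python
import numpy as np

theta = 0.7  # any rotation angle gives an orthogonal matrix
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# For an orthogonal matrix, the transpose and the inverse agree.
print(np.allclose(V.T, np.linalg.inv(V)))   # True
print(np.round(V.T @ V, 10))                # the identity matrix
```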

Eigenvectors for Reflections

Having just finished a post on eigenvalue decomposition, involving eigenvalues and eigenvectors, I was flipping through this nifty little textbook, where I saw a description of geometric reflections using eigenvectors, and I thought, “Oh my gosh, of course!”

We have seen reflections here, yet back then we had to find something called the foot of a point to figure out the reflection. But we can construct a reflection matrix (same as a scaling matrix) using only knowledge about eigenvectors.

When we talked about eigenvectors before—those vectors which do not change direction under a transformation (except if they reverse direction)—we were looking for them. But in a reflection, we already know them. In any reflection across a line, we already know that the vector that matches the line of reflection will not change direction, and the vector perpendicular to the line of reflection will only be scaled by \(\mathtt{-1}\). So, both the vector which describes the line of reflection and the vector perpendicular to that line are eigenvectors.

So let’s say we want to reflect point \(\mathtt{C}\) across the line described by the vector \(\mathtt{\alpha(2, -1)}\).

Our eigenvector matrix will be \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\] Here we see the second eigenvector, but we can simply use the vector perpendicular to the first if we don’t know the second one.

Our eigenvalue matrix will be \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\] since we will keep the first eigenvector—the line of reflection—fixed (eigenvalue of 1) and just flip the second eigenvector (eigenvalue of –1).

As a penultimate step, we calculate the inverse of the eigenvector matrix (we’ll get into inverses fairly soon), and then finally multiply all those matrices together (right to left), as we saw with the eigenvalue decomposition, to get the reflection matrix. \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{2}{5}}&\mathtt{-\frac{1}{5}}\\\mathtt{-\frac{1}{5}}&\mathtt{-\frac{2}{5}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{\,\,\,\,\frac{3}{5}}&\mathtt{-\frac{4}{5}}\\\mathtt{-\frac{4}{5}}&\mathtt{-\frac{3}{5}}\end{bmatrix}\]

The final matrix on the right side is the reflection matrix for reflections across the line represented by the vector \(\mathtt{\alpha(2, -1)}\), that is, the line through the origin with a slope of \(\mathtt{-\frac{1}{2}}\). (To reflect across a parallel line that doesn’t pass through the origin, you would translate first, reflect, and then translate back.)

It’s worth playing around with building these matrices and using them to find reflections. A lot of what’s here is pretty obvious, but reflection matrices can come in handy in some non-obvious ways too. One nice property: the scale and sign of the eigenvectors you choose don’t matter; flipping the perpendicular vector to the other side of the line produces the same reflection matrix.
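Here is a sketch of the whole construction in code, using the same eigenvector and eigenvalue matrices as above. The point being reflected, (3, 1), is made up for the example.

```python
import numpy as np

# Eigenvector matrix: the line of reflection and a perpendicular vector.
P = np.array([[2, -1], [-1, -2]])
# Eigenvalue matrix: keep the line fixed (1), flip the perpendicular vector (-1).
D = np.diag([1, -1])

# Reflection matrix: P D P^-1.
R = P @ D @ np.linalg.inv(P)
print(R)        # [[ 0.6 -0.8] [-0.8 -0.6]], matching the matrix above

# Reflect the point (3, 1) across the line through the origin with slope -1/2.
C = np.array([3, 1])
print(R @ C)    # [ 1. -3.]
```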

Eigenvalue Decomposition

I wrote about eigenvalues and eigenvectors a while back, here. In this post, I’ll show how determining the eigenvalues and eigenvectors of a matrix (2 by 2 in this case) is pretty much all of the work of what’s called eigenvalue decomposition. We’ll start with this matrix, which represents a linear transformation: \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\]

You can see the action of this matrix at the right (sort of). It sends the (1, 0) vector to (0, –2) and the (0, 1) vector to (1, –3).

The eigenvectors of this transformation are any nonzero vectors that do not change their direction during this transformation, but only scale up or down (or stay the same) by a factor of \(\mathtt{\lambda}\) as a result of the transformation. So,

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}=\lambda \begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\)

Using our calculations from the previous post linked above, we calculate the eigenvalues to be \(\mathtt{\lambda_1=-2}\) and \(\mathtt{\lambda_2=-1}\). And the corresponding eigenvectors are of the form \(\mathtt{(r, -2r)}\) and \(\mathtt{(-r, r)}\), respectively.

The red vector (representing the eigenvector \(\mathtt{(-r, r)}\)) at right starts at \(\mathtt{(-1, 1)}\). It is scaled by the eigenvalue of \(\mathtt{-1}\) during the transformation—meaning it simply turns in the opposite direction and its magnitude doesn’t change. Any vector of the form \(\mathtt{(-r, r)}\) will behave this way during this transformation.

The purple vector (representing the eigenvector \(\mathtt{(r, -2r)}\)) starts at \(\mathtt{(-1, 2)}\). It is scaled by the eigenvalue of \(\mathtt{-2}\) during the transformation—meaning it turns in the opposite direction and is scaled by a factor of \(\mathtt{2}\). Any vector of the form \(\mathtt{(r, -2r)}\) will behave this way during this transformation.

And Now for the Decomposition

We can now use the equation above and plug in each eigenvalue and its corresponding eigenvector to create two matrix equations.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}=\mathtt{-2}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}\) \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}=\mathtt{-1}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}\]

We can combine the items on the left side of each equation and the items on the right side of each equation into one matrix equation.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\)

This leaves us with [original matrix][eigenvector matrix] = [eigenvector matrix][eigenvalue matrix]. Finally, we multiply both sides on the right by the inverse of the eigenvector matrix, in order to remove it from the left side of the equation. We can’t remove it from the right side, because matrix multiplication is not commutative. That leaves us with the final decomposition (hat tip to Math the Beautiful for some of the ideas in this post): \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-1}\end{bmatrix}\]

Multiplying these three matrices together, or combining the transformations represented by the matrices as we showed here, will result in the original matrix.
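As a check, NumPy's np.linalg.eig returns the eigenvalues and (unit-length, scaled) eigenvectors, and multiplying the three matrices back together recovers the original matrix.

```python
import numpy as np

A = np.array([[0, 1], [-2, -3]])

# Eigenvalues -1 and -2 (order may vary), and columns of P that are
# scaled versions of the eigenvectors (-1, 1) and (-1, 2).
eigenvalues, P = np.linalg.eig(A)
print(eigenvalues)
print(P @ np.diag(eigenvalues) @ np.linalg.inv(P))   # reconstructs A
```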