Linear Algebra Exercises I

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working on a piece of paper and then checking your answers—one at a time at first, and then going a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer. My answers appear below each exercise.

Convert each equation of a line to a vector equation in parametric form, \(\mathtt{l(k)=p+kv}\), where \(\mathtt{p}\) and \(\mathtt{v}\) are 2D vectors and \(\mathtt{k}\) is a scalar variable.

  1. \(\mathtt{y=x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  2. \(\mathtt{y=2x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

  3. \(\mathtt{y=-2x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  4. \(\mathtt{y=x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  5. \(\mathtt{y=10}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{10}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

  6. \(\mathtt{x+y=1}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

  7. \(\mathtt{2x+3y=9}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  8. \(\mathtt{x=5}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{5}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  9. \(\mathtt{-3x-\frac{1}{4}y=15}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-60}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-12}\end{bmatrix}\mathtt{k}\)

  10. Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

For slope-intercept form, \(\mathtt{y=mx+b}\), the conversion \(\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,\) works for any line that can be written in slope-intercept form. (Vertical lines, like the one in Exercise 8, can’t be written that way, so they have to be handled separately.) A short code sketch of this conversion follows the note below.

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of \(\mathtt{0}\), and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.
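Since the last exercise asks for an algorithm, here is a minimal sketch of it in Python. The function name and the tuple return format are just my own choices.

```python
# A minimal sketch of the conversion algorithm: y = mx + b  ->  l(k) = p + kv.
# The function name and the (p, v) tuple format are just illustrative choices.

def slope_intercept_to_parametric(m, b):
    """Return (p, v): a point p on the line and a direction vector v."""
    p = (0, b)   # the y-intercept is always on the line
    v = (1, m)   # moving 1 unit in x moves m units in y
    return p, v

print(slope_intercept_to_parametric(2, 3))   # ((0, 3), (1, 2)), matching Exercise 2
```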

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

  1. \(\mathtt{l_{1}(k)=}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{1}{2}}\), y-intercept → \(\mathtt{\frac{5}{2}}\), x-intercept → \(\mathtt{5}\)

  2. \(\mathtt{l_{2}(k)=}\begin{bmatrix}\mathtt{-3}\\\mathtt{-5}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix}\mathtt{k}\)

slope → undefined, y-intercept → none, x-intercept → \(\mathtt{-3}\)

  3. \(\mathtt{l_{3}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-1}\end{bmatrix}+\begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{1}\), y-intercept → \(\mathtt{-1}\), x-intercept → \(\mathtt{1}\)

  4. \(\mathtt{l_{4}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-9}\end{bmatrix}+\begin{bmatrix}\mathtt{7}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{0}\), y-intercept → \(\mathtt{-9}\), x-intercept → none

  5. \(\mathtt{l_{5}(k)=}\begin{bmatrix}\mathtt{-5}\\\mathtt{\,\,\,\,1}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{4}{3}}\), y-intercept → \(\mathtt{-\frac{17}{3}}\), x-intercept → \(\mathtt{-\frac{17}{4}}\)

And just two more. Determine whether each point is on the line mentioned above.

  1. Is \(\mathtt{(-11,3)}\) on the line \(\mathtt{l_1}\)?

No: \(\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}\) has no solution.
The first component requires \(\mathtt{2k+1=-11}\) (so \(\mathtt{k=-6}\)), while the second requires \(\mathtt{-k+2=3}\) (so \(\mathtt{k=-1}\)); no single value of \(\mathtt{k}\) solves both.

  2. Is \(\mathtt{(5,5)}\) on the line \(\mathtt{l_3}\)?

No. Since the line has a slope of \(\mathtt{1}\), it would contain \(\mathtt{(5,5)}\) if the line passed through the origin, but it does not.

More with Least Squares

One very cool thing about our formula for the least squares regression line, \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\), is that it is the same no matter whether we have one independent variable (univariate) or many independent variables (multivariate).

Consider these data, showing the selling prices of some grandfather clocks at auction. The first scatter plot shows the age of the clock in years on the x-axis (100–200), and the second shows the number of bidders on the x-axis (0–20). Price (in pounds or dollars) is on the y-axis on each plot (500–2500).

Age (years) | Bidders | Price ($)
--- | --- | ---
127 | 13 | 1235
115 | 12 | 1080
127 | 7 | 845
150 | 9 | 1522
156 | 6 | 1047
182 | 11 | 1979
156 | 12 | 1822
132 | 10 | 1253
137 | 9 | 1297
113 | 9 | 946
137 | 15 | 1713
117 | 11 | 1024
137 | 8 | 1147
153 | 6 | 1092
117 | 13 | 1152
126 | 10 | 1336
170 | 14 | 2131
182 | 8 | 1550
162 | 11 | 1884
184 | 10 | 2041
143 | 6 | 854
159 | 9 | 1483
108 | 14 | 1055
175 | 8 | 1545
108 | 6 | 729
179 | 9 | 1792
111 | 15 | 1175
187 | 8 | 1593
111 | 7 | 785
115 | 7 | 744
194 | 5 | 1356
168 | 7 | 1262

You can see in the notebook below that the first regression line, for the price of a clock as a function of its age, is approximately \(\mathtt{10.5x-192}\). The second regression line, for the price of a clock as a function of the number of bidders at auction, is approximately \(\mathtt{55x+806}\). As mentioned above, each of these univariate least squares regression lines can be calculated with the formula \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\).

Combining both age and number of bidders together, we can calculate, using the same formula, a multivariate least squares regression equation. This of course is no longer a line. In the case of two input variables as we have here, our line becomes a plane.

Our final regression equation becomes \(\mathtt{12.74x_{1}+85.82x_{2}-1336.72}\), with \(\mathtt{x_1}\) representing the age of a clock and \(\mathtt{x_2}\) representing the number of bidders.
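If you’d like to reproduce those coefficients, here is a sketch of the formula applied with NumPy. The data arrays are copied from the table above; the variable names are my own.

```python
import numpy as np

# Data from the table above: age (years), number of bidders, price.
age = np.array([127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
                137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
                108, 179, 111, 187, 111, 115, 194, 168], dtype=float)
bidders = np.array([13, 12, 7, 9, 6, 11, 12, 10, 9, 9, 15, 11, 8, 6, 13, 10,
                    14, 8, 11, 10, 6, 9, 14, 8, 6, 9, 15, 8, 7, 7, 5, 7],
                   dtype=float)
price = np.array([1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946,
                  1713, 1024, 1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041,
                  854, 1483, 1055, 1545, 729, 1792, 1175, 1593, 785, 744,
                  1356, 1262], dtype=float)

def least_squares(X, y):
    # The normal-equation formula from the post: (X^T X)^{-1} X^T y.
    return np.linalg.inv(X.T @ X) @ X.T @ y

ones = np.ones_like(age)

# Univariate: price as a function of age (intercept first, then slope).
print(least_squares(np.column_stack([ones, age]), price))       # roughly [-192, 10.5]

# Univariate: price as a function of the number of bidders.
print(least_squares(np.column_stack([ones, bidders]), price))   # roughly [806, 55]

# Multivariate: price as a function of age and bidders.
print(least_squares(np.column_stack([ones, age, bidders]), price))
# roughly [-1336.72, 12.74, 85.82]
```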

Post-Hoc Confidences

A smart defense of any argument for less teacher-directed instruction in mathematics classrooms is to point to the logical connectedness of mathematics as a body of knowledge and suggest that students are capable of crossing many if not all of the logical bridges between propositions themselves, or with minimal guidance.

Such connectedness, it can be suggested, makes mathematics somewhat different from other school subjects. For example, given a student’s conceptual understanding of a fraction as a part-to-whole ratio (which can include his or her ability to represent a fraction with a visual or physical model), it seems to follow logically that he or she can then add two fractions and get the correct sum, so long as the student knows (intuitively or more formally) that addition is about combining values linearly. It doesn’t matter how many prerequisites there are for adding fractions. The suggestion is that, once those prerequisites have been met, it is a matter of merely crossing a logical bridge to adding fractions (mostly) correctly.

By way of contrast, a student can’t really induce what happened after, say, the bombing of Pearl Harbor. They have to be informed about it directly. The effects can certainly be narrowed down using common sense reasoning and other domain-specific knowledge. But, ultimately, what happened happened and there is no reason to suspect that, in general, students can make their way through a study of history mostly blindfolded, relying only on logic and common sense.

The example of history brings up an interesting point (to me, anyway) about the example of mathematics, though. Historical consequences from historical causes can be dubbed “inevitable” only after the fact. How can we be sure it is not the same when learning anything, including mathematics? Once you know, conceptually as it were, what adding fractions is, of course it seems to be a purely logical consequence of what fractions are fundamentally. But is this seeming inevitability available to the novice, the learner who is aware of what fractions are but hasn’t ever thought about adding them? It is with the average novice, after all, that the feeling of logical inevitability has to lie. It is not enough for educated adults to think of something as ‘logical’ after they already know it.

Bertrand Russell argues, in a 1907 essay, that even in mathematics we don’t proceed from premises to conclusions, but rather the other way around.

We tend to believe the premises because we can see that their consequences are true, instead of believing the consequences because we know the premises to be true. But the inferring of premises from consequences is the essence of induction [abduction]; thus the method in investigating the principles of mathematics is really an inductive method, and is substantially the same as the method of discovering in any other science.

So, how can we decide whether some bridge in reasoning is available to and crossable by the average novice? I hope it’s clear that we can’t just figure it out via anecdotes and armchair reasoning. Our intuitions can’t be trusted with this question. And our opinions one way or the other on the matter are not helpful, no matter what they are.

Linear Algebra Connections

There are so many connections within and applications of linear algebra—I can only imagine that this will be a series of posts, to the extent that I continue writing about the subject. Here are a few connections that I’ve come across in my reading recently.

Compound Interest and Matrix Powers

We can multiply a matrix by itself \(\mathtt{n}\) times. The result is the matrix to the power \(\mathtt{n}\). We can use this when setting up a compound interest situation. For example, suppose we have three accounts, which each have a different interest rate compounded annually—say, 5%, 3%, and 2%. Without linear algebra, the amount in the first account can be modeled by the equation \[\mathtt{A(t)=p \cdot 1.05^t}\] where \(\mathtt{p}\) represents the starting amount in the account, and \(\mathtt{t}\) represents the time in years. With linear algebra, we can group all of the account interest rates into a matrix. The first year for each account would look like this: \[\mathtt{A(1)}=\begin{bmatrix}\mathtt{1.05}&\mathtt{0}&\mathtt{0}\\\mathtt{0}&\mathtt{1.03}&\mathtt{0}\\\mathtt{0}&\mathtt{0}&\mathtt{1.02}\end{bmatrix}^\mathtt{1}\begin{bmatrix}\mathtt{p_1}\\\mathtt{p_2}\\\mathtt{p_3}\end{bmatrix}=\begin{bmatrix}\mathtt{1.05p_1}\\\mathtt{1.03p_2}\\\mathtt{1.02p_3}\end{bmatrix}\]

For years beyond the first, all we have to do is raise the matrix to the appropriate power. Since the matrix is diagonal, squaring it, cubing it, and so on just squares, cubes, and so on each diagonal entry. This keeps the computation a little more organized—and makes it more straightforward to program. (A matrix has to be square, \(\mathtt{m \times m}\), in order to raise it to a power.) Below we calculate the amount in each account after 100 years.
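Here is a sketch of that calculation with NumPy. The post leaves the starting amounts \(\mathtt{p_1, p_2, p_3}\) generic, so the $1,000 in each account below is just an assumption for illustration.

```python
import numpy as np

# Growth matrix for the three accounts (5%, 3%, and 2%, compounded annually).
G = np.diag([1.05, 1.03, 1.02])

# Starting amounts p1, p2, p3 -- the post leaves these generic, so $1000 each
# is just an assumed example.
p = np.array([1000.0, 1000.0, 1000.0])

# Raise the (square, diagonal) matrix to the 100th power and apply it to p.
amounts = np.linalg.matrix_power(G, 100) @ p
print(amounts)   # roughly [131500, 19220, 7245]
```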

Centroids and Areas

We have seen that the determinant can be thought about as the area of the parallelogram formed by two vectors. We can use this fact to determine the area of a complex shape like the one shown below.

Since determinants are signed areas, as we move around the shape counterclockwise, calculating the determinant of each vector pair (and multiplying each determinant by one half so we just get each triangle), we get the total area of the shape.\[\frac{1}{2}\left(\begin{vmatrix}\mathtt{6}&\mathtt{6}\\\mathtt{0}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{6}&\mathtt{3}\\\mathtt{4}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{3}\\\mathtt{4}&\mathtt{6}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{-2}\\\mathtt{6}&\mathtt{\,\,\,\,6}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{-2}\\\mathtt{\,\,\,\,6}&\mathtt{\,\,\,\,3}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{0}\\\mathtt{\,\,\,\,3}&\mathtt{3}\end{vmatrix}\right)=\mathtt{36}\text{ units}\mathtt{^2}\]

That’s pretty hand-wavy, but it’s something that you can probably figure out with a little experimentation.
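Here is a minimal sketch of that determinant sum in Python, assuming the vertices used in the sum above, with one vertex placed at the origin.

```python
import numpy as np

# Vertices of the shape, listed counterclockwise, with one vertex at the origin.
# (These are the vectors that appear, in pairs, in the determinants above.)
verts = [(0, 0), (6, 0), (6, 4), (3, 4), (3, 6), (-2, 6), (-2, 3), (0, 3)]

area = 0.0
for i in range(len(verts)):
    v1 = verts[i]
    v2 = verts[(i + 1) % len(verts)]   # wrap around back to the start
    # Each pair of vectors contributes half of a signed parallelogram area.
    area += 0.5 * np.linalg.det(np.column_stack([v1, v2]))

print(area)   # 36.0 (up to floating-point rounding)
```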

Another counterclockwise-moving calculation (though this one can be clockwise without changing the answer) is the calculation of the centroid of a closed shape (more precisely, the centroid of its vertices). All that is required here is to calculate the sum of the position vectors of the vertices of the figure and then divide by the number of vertices.

\[\mathtt{\frac{1}{5}}\left(\begin{bmatrix}\mathtt{3}\\\mathtt{4}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{6}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}+\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,3}\end{bmatrix}\right)=\begin{bmatrix}\mathtt{-\frac{1}{5}}\\\mathtt{\,\,\,\,\frac{17}{5}}\end{bmatrix}\]

Here again, it’s just magic, but you can figure it out with a little play. In each case—for both complex areas and centroids—we assign a point to be the origin and go from there.
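Here is the centroid calculation as a short Python sketch, again assuming that the origin has been assigned to one of the figure’s five vertices.

```python
import numpy as np

# The five vertices (position vectors), including the one placed at the origin.
verts = np.array([(0, 0), (3, 4), (0, 6), (-3, 4), (-1, 3)], dtype=float)

# Sum the position vectors and divide by the number of vertices.
centroid = verts.sum(axis=0) / len(verts)
print(centroid)   # [-0.2  3.4], i.e., (-1/5, 17/5)
```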

Least Squares with Linear Algebra

In brief, linear regression is about finding the line of best fit to a data set. If you’re looking for the linear algebra way of doing this, you will most likely find it searching for the term least squares.

In the basic scenario, you’ve got some two-dimensional data, \(\mathtt{(x, y)}\) coordinates, and you want to find the equation for a straight line that is as close as possible to each point. Such a scenario is shown below, though here the line of best fit has already been found and its equation displayed.

x | y
--- | ---
10 | 8.04
8 | 6.95
13 | 7.58
9 | 8.81
11 | 8.33
14 | 9.96
6 | 7.24
4 | 4.26
12 | 10.84
7 | 4.82
5 | 5.68

So, let’s imagine that we haven’t found this line yet. We know what we are looking for is a line of the form \(\mathtt{y=mx+b}\). In linear algebra terms, we want the vector \(\mathtt{y}\) (11 rows, 1 column) to equal the vector \(\mathtt{x}\) (11 rows, 1 column) scaled by an unknown slope \(\mathtt{m}\), plus an intercept vector \(\mathtt{b}\) (11 rows, 1 column, with every entry equal to the unknown intercept).

The first problem we have to fix is that the 11 × 1 vector \(\mathtt{x}\) needs to become the 11 × 2 matrix \(\mathtt{X}\) so that we get our equations right. So we’ll pad \(\mathtt{x}\) with some 1s and then we’ll be able to call it \(\mathtt{X}\). (We can pad on either side, left or right, just remembering to interchange \(\mathtt{b}\) and \(\mathtt{m}\) to keep the equations straight.) The second problem we can fix is that we don’t need a separate intercept vector and a separate slope. We can combine things so that we form an equivalent matrix equation that means the same thing as \(\mathtt{y=mx+b}\). We need the equation to look like this: \[\begin{bmatrix}\mathtt{8.04}\\\mathtt{6.95}\\\mathtt{7.58}\\\mathtt{8.81}\\\mathtt{8.33}\\\mathtt{9.96}\\\mathtt{7.24}\\\mathtt{4.26}\\\mathtt{10.84}\\\mathtt{4.82}\\\mathtt{5.68}\end{bmatrix} = \begin{bmatrix}\mathtt{1}&\mathtt{10}\\\mathtt{1}&\mathtt{8}\\\mathtt{1}&\mathtt{13}\\\mathtt{1}&\mathtt{9}\\\mathtt{1}&\mathtt{11}\\\mathtt{1}&\mathtt{14}\\\mathtt{1}&\mathtt{6}\\\mathtt{1}&\mathtt{4}\\\mathtt{1}&\mathtt{12}\\\mathtt{1}&\mathtt{7}\\\mathtt{1}&\mathtt{5}\end{bmatrix}\begin{bmatrix}\mathtt{b}\\\mathtt{m}\end{bmatrix}\]

Multiply the matrix and vector on the right side of that equation, and you get, for the first equation, \(\mathtt{8.04=b+(m)(10)}\). That’s equivalent to \(\mathtt{y=mx+b}\) for the first \(\mathtt{(x, y)}\) data point \(\mathtt{(10, 8.04)}\). The matrix and vector setup ensures that all the equations for all the points are of the correct form.

In linear algebra terms, we have rewritten the equation to be \(\mathtt{y=Xv}\), where \(\mathtt{y}\) is an 11 × 1 vector, \(\mathtt{X}\) is an 11 × 2 matrix (padded with some 1s), and \(\mathtt{v}\) is a 2 × 1 vector which contains the unknown slope \(\mathtt{m}\) of the best fit line and the unknown intercept \(\mathtt{b}\).

It’s an Approximation

At this point in the explanation, it’s important to realize that we will make another shift. The first shift was from the real data to the matrix algebra setup. Below, we shift again, away from that setup and toward just finding out what the unknown \(\mathtt{v}\) is.

As an analogy, the system shown below features a 3d vector (2, 0, –2) which does not live in the same plane as the one formed by the two other column vectors of the matrix. Thus, there is no solution \(\mathtt{(j, k)}\). \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,1}\\\mathtt{1}&\mathtt{-3}\\\mathtt{1}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{j}\\\mathtt{k}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}\longleftarrow\text{no solutions}\]

Similarly, the columns of our \(\mathtt{X}\) matrix form a plane in 11-dimensional space. In order for \(\mathtt{v}\) to be a solution to our original matrix equation above, \(\mathtt{y}\) has to live in this plane too. But we already know that it doesn’t. If it did, the points would lie along some line. It’s true that \(\mathtt{y}\) is an 11-dimensional vector, just like each column of \(\mathtt{X}\), but \(\mathtt{y}\) doesn’t live in the same plane.

The closest approximation we can get to \(\mathtt{y}\) in the plane of \(\mathtt{X}\) is \(\mathtt{Xv=p}\), where \(\mathtt{p}\) is the projection of \(\mathtt{y}\) onto the plane (another thing we’ll have to come back to). The vector \(\mathtt{q}\) connecting \(\mathtt{y}\) with its projection \(\mathtt{p}\) is perpendicular to \(\mathtt{p}\) (and, thus, to the plane).

Now we can write some more equations. For example, we know we want \(\mathtt{Xv=p}\), but it’s also true that \(\mathtt{p+q=y}\). And we can do some manipulation to show that all the column vectors of \(\mathtt{X}\) are perpendicular to \(\mathtt{q}\). Assuming for a second that our \(\mathtt{X}\) is a 2 × 2 matrix, we pad \(\mathtt{X}\) with a row of zeros to make it 3D, transpose it (so that multiplication by \(\mathtt{q}\) is defined), and multiply it by \(\mathtt{q}\). The same basic idea applies to our original \(\mathtt{X}\) (11 × 2) matrix. \[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\\\mathtt{0}&\mathtt{0}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\]\[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{q_3}\end{bmatrix}=\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}\] This gives us a key equation: \(\mathtt{X^{T}q=0}\).

Since \(\mathtt{q=y-p}\), substituting for \(\mathtt{q}\) gets us \(\mathtt{X^{T}(y-p)=0}\). Then since \(\mathtt{Xv=p}\), substituting for \(\mathtt{p}\) brings us to \(\mathtt{X^{T}(y-Xv)=0}\). Distribute to get \(\mathtt{X^{T}y-X^{T}Xv=0}\), which means \(\mathtt{X^{T}y=X^{T}Xv}\). Multiply both sides on the left by the inverse of \(\mathtt{X^{T}X}\), and the vector \(\mathtt{v}\) that we’re after, then, is \[\mathtt{v=(X^{T}X)^{-1}X^{T}y}\]

That formula gives us the slope and intercept of our best fit line. Below is one way the least squares can be calculated with a little Python. The calculation I used for the above interactive to get the best fit line is way more complicated. If I had known about the linear algebra way when I made it, I would have definitely gone with that instead.
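Since the notebook isn’t reproduced here, below is a sketch of how that calculation might look in Python with NumPy, using the data from the table above.

```python
import numpy as np

# The (x, y) data from the table above.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Pad x with a column of 1s to form the 11 x 2 matrix X.
X = np.column_stack([np.ones_like(x), x])

# v = (X^T X)^{-1} X^T y gives the intercept b and slope m.
v = np.linalg.inv(X.T @ X) @ X.T @ y
b, m = v
print(m, b)   # roughly 0.5 and 3.0 for these data

# np.linalg.lstsq(X, y, rcond=None)[0] gives the same answer with better
# numerical behavior, but the explicit formula mirrors the derivation above.
```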

The Singular Value Decomposition

We’ve now seen the eigenvalue decomposition of a linear transformation (in the form of a matrix). We can think of what we did in that decomposition as breaking up the original transformation into three transformations. If we multiply the rightmost matrix by any vector, and then multiply the middle matrix by that product, and then multiply the leftmost matrix on the right-hand side by that product, we would see that starting vector be transformed three times. That process would be equivalent to multiplying the starting vector by the original matrix.

We can also say that, in that original transformation matrix, which we’ll call \(\mathtt{A}\), we mapped a set of orthogonal vectors, or vectors at right angles to each other, (1, 0) and (0, 1), onto a set of non-orthogonal vectors (0, –2) and (1, –3). We don’t have to multiply each vector by the transformation matrix one at a time. We can multiply the set of vectors, as a matrix, by the transformation matrix, like so.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{0}\\\mathtt{0}&\mathtt{1}\end{bmatrix}=\begin{bmatrix}\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}&\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\)

Of course, we just multiplied the original matrix by the identity matrix, so it spit out the original matrix again. But the above interpretation is different, though it gives the same results.

Okay, great, but we humans seem to love our right angles. So, this question arises about linear transformations: could we use any matrix to map some pair of orthogonal vectors (vectors at right angles to each other) to a different set of orthogonal vectors? That is, could we transform our way from one orthogonal “system” to another with any single transformation matrix?

Could we find a pair of orthogonal vectors which, after undergoing our transformation \(\mathtt{A}\), were mapped to a different pair of orthogonal vectors? If we could, that would mean that if we multiplied \(\mathtt{A}\) by a set of orthogonal vectors \(\mathtt{V}\) (i.e., transformed the set of vectors \(\mathtt{V}\) by the action of \(\mathtt{A}\)), it would be equivalent to just starting with a different set of orthogonal vectors already in position (we’ll call these orthogonal vectors \(\mathtt{U}\)) and just scaling them by a scaling matrix \(\mathtt{\Sigma}\). In notation, we’ll write this hypothesis as \[\mathtt{AV=U \Sigma}\]

The word hypothesis is important here. We’re not really writing down something we know is equivalent—that is, we don’t really know if there is a \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) which will make this equivalence true. Students (and I) have a hard time not seeing the equals sign as meaning “I know that it is true that these are equivalent.” But that’s not what it means here, and it’s good to get used to that flexibility. What it means here is that we are supposing for the time being that these two products are equivalent. If some contradiction falls out of our algebra (or we get some kind of infinity), we’ll know that the equivalence fails—at least insofar as we want just one matrix to pop out for each unknown matrix.

Let’s first see how this equivalence appears before we dig into figuring out what \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are. That is, let me show you that the matrices \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are possible before we look at what they are. On the left is our original transformation matrix acting on a set of orthogonal vectors \(\mathtt{V}\) (purple and red), so this transformation shows \(\mathtt{AV}\). (You can see that the original (1, 0) and (0, 1) vectors go to where they’re supposed to.)

\[\mathtt{AV = U \Sigma}\]

On the right is a set of two different orthogonal vectors \(\mathtt{\Sigma}\) (purple and red) which start out aligned to the horizontal and vertical grid lines and then are rotated and reflected (which is a part of scaling) by a matrix \(\mathtt{U}\). And the two transformations are equivalent! (In the demo they are very close.) We can squeeze an orthogonal-to-orthogonal transformation out of that weird matrix we saw in the eigenvalue decomposition, the one that took a square and rotated and stretched it into a funky parallelogram.

I note that, above, I called \(\mathtt{\Sigma}\) a scaling matrix, but here I’m using it as just a set of orthogonal vectors. Luckily for us, both things are true. It just depends on how you look at them. A square matrix like the ones we are using can represent either a pair of 2-dimensional vectors or a linear transformation. We get to decide how to interpret the matrix in any given situation.

Getting \(\mathtt{V}\)

To begin to know what these matrices are, we can start by writing the equation above like this. \[\mathtt{A=U \Sigma V^T}\] We can do this because \(\mathtt{V}\) is orthogonal, and I briefly mentioned last time that the transpose of an orthogonal matrix is the same as its inverse. So, in effect, we multiplied both sides of the equation on the right by \(\mathtt{V^{-1}}\) (that is, by \(\mathtt{V^T}\)), which removed the \(\mathtt{V}\) from the left-hand side.

Okay, now we pull a little transpose magic, but let’s walk through it. Start by multiplying the expression on each side, on the left, by its transpose. (We’ll circle back to grounding all this some other time.) So, we have \[\mathtt{A^TA=(U\Sigma V^T)^T(U\Sigma V^T)}\] We multiplied \(\mathtt{A}\) by its transpose by multiplying to the left of \(\mathtt{A}\), and we multiplied \(\mathtt{U\Sigma V^T}\) by its transpose by multiplying to the left of \(\mathtt{U\Sigma V^T}\). The product \(\mathtt{A^TA}\) is simple: \[\begin{bmatrix}\mathtt{0}&\mathtt{-2}\\\mathtt{1}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}\] But let’s take some time to simplify \(\mathtt{(U\Sigma V^T)^T(U\Sigma V^T)}\). As an example of what to do when we have the transpose of a product of matrices, consider these products.

\[\left(\begin{bmatrix}\mathtt{1}&\mathtt{3}\\\mathtt{2}&\mathtt{4}\end{bmatrix}\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{5}&\mathtt{7}\end{bmatrix}\right)^\mathtt{T} = \begin{bmatrix}\mathtt{19}&\mathtt{27}\\\mathtt{28}&\mathtt{40}\end{bmatrix}^\mathtt{T}\]
\[\begin{bmatrix}\mathtt{4}&\mathtt{5}\\\mathtt{6}&\mathtt{7}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{2}\\\mathtt{3}&\mathtt{4}\end{bmatrix} = \begin{bmatrix}\mathtt{19}&\mathtt{28}\\\mathtt{27}&\mathtt{40}\end{bmatrix}\]

Each matrix product on the left is equal to the expression on its right. Test them out for yourself. But the expressions on the right are equal to each other, which means the products on the left are equal to each other. This example shows that \(\mathtt{(AB)^{T}=B^{T}A^{T}}\). The transpose of a product is equal to the product of the transposes, multiplied in reverse order. Again, we’ll ground this later, but this suggests, correctly, that we can rewrite the transposed product above, \(\mathtt{(U\Sigma V^T)^T}\), as \(\mathtt{V\Sigma^{T}U^{T}}\), considering that \(\mathtt{(V^{T})^{T}=V}\). Multiplying that by the remaining part of the right-hand side, \(\mathtt{U\Sigma V^T}\), we get \(\mathtt{V\Sigma^{T}U^{T}U\Sigma V^T}\). Since \(\mathtt{U}\) is orthogonal, its transpose is its inverse, so the \(\mathtt{U}\) terms cancel, leaving us with \(\mathtt{V\Sigma^{T}\Sigma V^T}\). The transpose of a scaling matrix, \(\mathtt{\Sigma}\) in this case, is itself (try it out!), so the middle can be written more simply as \(\mathtt{\Sigma^{2}}\). So, finally (pretty far from finally, actually), we have \[\mathtt{A^{T}A=V\Sigma^{2}V^T}\]

And this is an eigenvalue decomposition for \(\mathtt{A^{T}A}\)! We’ve got a scaling matrix as the lunchmeat, sandwiched by a matrix and its inverse (which is the same as the transpose in this case). So, now we can figure out \(\mathtt{V}\) and \(\mathtt{\Sigma}\) by doing the eigenvalue decomposition like we did previously. Here it is: \[\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}=\begin{bmatrix}\mathtt{\color{purple}{-\frac{521}{991}}}&\mathtt{\color{red}{-\frac{3725}{4379}}}\\\mathtt{\color{purple}{-\frac{3725}{4379}}}&\mathtt{\,\,\,\,\color{red}{\frac{521}{991}}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{4510}{329}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{658}{2255}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]

The purple and red vectors are our starting vectors from the animation on the left above: (–0.53, –0.85) and (–0.85, 0.53). When we ask which orthogonal starting vectors we can pick so that matrix \(\mathtt{A}\) will transform them into a different pair of orthogonal vectors, matrix \(\mathtt{V}\)—the purple and red vectors above—is our answer. The square roots of the diagonal values of the middle matrix above (remember, it is \(\mathtt{\Sigma^2}\), the scaling matrix squared) are called the singular values of \(\mathtt{A}\). To me, getting \(\mathtt{V}\) is the coolest part of this decomposition. The rest, below, is just gravy.
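If you’d like to check those numbers, here is a quick sketch with NumPy; the ordering and signs of the eigenvectors may come out differently, but the values should match.

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

# Eigenvalue decomposition of A^T A gives Sigma^2 (eigenvalues) and V (eigenvectors).
eigvals, V = np.linalg.eigh(A.T @ A)   # eigh: symmetric matrices, ascending order
print(eigvals)            # roughly [0.29, 13.71]
print(np.sqrt(eigvals))   # the singular values, roughly [0.54, 3.70]
print(V)                  # columns are the eigenvectors (signs/order may differ)
```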

And Now for \(\mathtt{U}\) and \(\mathtt{\Sigma}\)

In this case, we multiply both sides of the equation \(\mathtt{A=U\Sigma V^T}\) on the right by the transpose of \(\mathtt{A}\) and get an eigenvalue decomposition of \(\mathtt{AA^T}\). \[\,\,\mathtt{AA^T=(U\Sigma V^T)(U\Sigma V^T)^T=U\Sigma V^{T}V\Sigma^{T}U^T=U\Sigma^{2}U^T}\]

\[\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{-3}\\\mathtt{-3}&\mathtt{\,\,\,\,13}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\color{purple}{\frac{4510}{329}}}&\mathtt{\color{red}{0}}\\\mathtt{\color{purple}{0}}&\mathtt{\color{red}{\frac{658}{2255}}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\]

Now the square roots of the entries of the purple and red columns are our starting vectors from the animation on the right above: (3.7, 0) and (0, 0.54). You’ll notice, of course, that \(\mathtt{\Sigma^2}\) is the same as above, so we really found it earlier. This step is only to pin down what \(\mathtt{U}\) is. At any rate, we are done, and we can write down the full singular value decomposition (SVD) of the original matrix \(\mathtt{A}\). \[\quad\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{1655}{447}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{894}{1655}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]
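NumPy will also produce the whole decomposition in one call. Here is a quick sketch to check the singular values and the reconstruction (again, the signs of the singular vectors may differ from the matrices above).

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

U, s, Vt = np.linalg.svd(A)
print(s)                      # roughly [3.70, 0.54] -- the singular values
print(U @ np.diag(s) @ Vt)    # reconstructs A (up to floating-point rounding)
```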

We’re finding an orthogonal-to-orthogonal transformation in a hurricane, so it shouldn’t be surprising to get weird numbers. One thing the SVD makes clear (the eigenvalue decomposition does this too) is that linear transformations can be described as combinations of rotations and scalings (the latter of which include reflections) and that’s it.


Inverses and Transposes

This will now be my 22nd post on linear algebra, and I hope it’ll be noticeable, looking at all of them together so far, that we haven’t talked about systems of equations. And there’s a good reason for that: they suck you down an ugly hole of mindless calculation, meaning-challenged tedium, and fruitless, pointless backward thinking. Systems are awesome and important, and I’m sure someone can make a strong argument for introducing them early, but, for me, they can wait.

The inverses of matrices are pretty interesting. The inverse of a 2 × 2 matrix is the matrix that, when multiplied by the original matrix, gives the identity matrix as the product. The inverse is given by the middle matrix below: \[\begin{bmatrix}\mathtt{a} & \mathtt{b}\\\mathtt{c} & \mathtt{d}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{d}{ad-bc}} & \mathtt{-\frac{b}{ad-bc}}\\\mathtt{-\frac{c}{ad-bc}} & \mathtt{\,\,\,\,\frac{a}{ad-bc}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

There is an easy-to-follow derivation of this formula here if you’re interested—one that requires only some high school algebra.
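Here is a minimal sketch of the formula in Python, checked against NumPy’s built-in inverse; the matrix used is just an arbitrary invertible example.

```python
import numpy as np

def inverse_2x2(M):
    # The formula above: swap a and d, negate b and c, divide by the determinant.
    (a, b), (c, d) = M
    det = a * d - b * c          # assumes det != 0, i.e., the matrix is invertible
    return np.array([[d, -b], [-c, a]]) / det

M = np.array([[1., 3.],
              [-2., 4.]])
print(inverse_2x2(M))
print(np.linalg.inv(M))          # should agree
print(M @ inverse_2x2(M))        # the identity (up to floating-point rounding)
```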

You can notice, given this setup, that the identity matrix is its own inverse. Are there any others that fit this description? One thought: we need \(\mathtt{ad-bc}\) to be equal to 1, and we also want \(\mathtt{a=d}\). The values on the other diagonal, \(\mathtt{b}\) and \(\mathtt{c}\), should both be 0, since they have to be equal to their opposites.

In that case, \(\begin{bmatrix}\mathtt{-1} & \mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0} & \mathtt{-1}\end{bmatrix}\) would be its own inverse too. What does that mean?

For the identity matrix, it means that if we apply the do-nothing transformation to the home-base matrix, we stay on home base, with the “x” vector pointed to (1, 0) and the “y” vector pointed to (0, 1). The matrix above represents a reflection across the origin. So, applying a reflection across the origin to it (that is, reflecting across the origin twice) should return us to home base.

Wouldn’t a reflection across the x-axis be an inverse of itself, then? Yes! The criteria above for an inverse need to be amended a little to allow for negatives effectively canceling each other out to make positives. \[\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

The inverse of a matrix is represented with a superscript \(\mathtt{-1}\). So, the inverse of the matrix \(\mathtt{A}\) is written as \(\mathtt{A^{-1}}\).

The Transpose

The transpose of a matrix is the matrix you get when you take the rows of the matrix and turn them into columns instead. The transpose of a matrix is represented with a superscript \(\mathtt{T}\). So: \[\begin{bmatrix}\mathtt{\,\,\,\,1} & \mathtt{3}\\\mathtt{-2} & \mathtt{4}\end{bmatrix}^{\mathtt{T}}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{-2}\\\mathtt{3} & \mathtt{\,\,\,\,4}\end{bmatrix}\]

If a 2 × 2 matrix \(\mathtt{V}\) is orthogonal—meaning that its column vectors are perpendicular and each column vector has a length of 1—then its transpose is the same as its inverse, or \(\mathtt{V^{-1}=V^T}\).
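A quick sketch checking both of those claims; the orthogonal matrix below is just an example (a rotation by 45 degrees).

```python
import numpy as np

M = np.array([[1., 3.],
              [-2., 4.]])
print(M.T)                 # rows become columns

# An orthogonal matrix: perpendicular columns, each of length 1
# (this one is a rotation by 45 degrees).
c = np.cos(np.pi / 4)
V = np.array([[c, -c],
              [c,  c]])
print(V.T @ V)                               # the identity matrix
print(np.allclose(V.T, np.linalg.inv(V)))    # True: the transpose is the inverse
```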

Eigenvectors for Reflections

Having just finished a post on eigenvalue decomposition, involving eigenvalues and eigenvectors, I was flipping through this nifty little textbook, where I saw a description of geometric reflections using eigenvectors, and I thought, “Oh my gosh, of course!”

We have seen reflections here before, though back then we had to find something called the foot of a point to figure out the reflection. But we can construct a reflection matrix (a kind of scaling matrix) using only knowledge about eigenvectors.

When we talked about eigenvectors before—those vectors which do not change direction under a transformation (except if they reverse direction)—we were looking for them. But in a reflection, we already know them. In any reflection across a line, we already know that the vector that matches the line of reflection will not change direction, and the vector perpendicular to the line of reflection will only be scaled by \(\mathtt{-1}\). So, both the vector which describes the line of reflection and the vector perpendicular to that line are eigenvectors.

So let’s say we want to reflect point \(\mathtt{C}\) across the line described by the vector \(\mathtt{\alpha(2, -1)}\).

Our eigenvector matrix will be \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\] The first column is the eigenvector along the line of reflection; the second column is the second eigenvector, and if we don’t already know it, we can simply use any vector perpendicular to the first.

Our eigenvalue matrix will be \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\] since we will keep the first eigenvector—the line of reflection—fixed (eigenvalue of 1) and just flip the second eigenvector (eigenvalue of –1).

As a penultimate step, we calculate the inverse of the eigenvector matrix (we’ll get into inverses fairly soon), and then finally multiply all those matrices together (right to left), as we saw with the eigenvalue decomposition, to get the reflection matrix. \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{2}{5}}&\mathtt{-\frac{1}{5}}\\\mathtt{-\frac{1}{5}}&\mathtt{-\frac{2}{5}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{\,\,\,\,\frac{3}{5}}&\mathtt{-\frac{4}{5}}\\\mathtt{-\frac{4}{5}}&\mathtt{-\frac{3}{5}}\end{bmatrix}\]

The final matrix on the right side is the reflection matrix for reflections across the line through the origin represented by the vector \(\mathtt{\alpha(2, -1)}\): the line with a slope of \(\mathtt{-\frac{1}{2}}\) that passes through the origin. (To reflect across a parallel line that doesn’t pass through the origin, you would translate the line to the origin, reflect, and translate back.)

It’s worth playing around with building these matrices and using them to find reflections. A lot of what’s here is pretty obvious, but reflection matrices can come in handy in some non-obvious ways too. The perpendicular vector at the very least has to be on the same side of the line as the point you are reflecting.
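Here is a sketch of that construction in Python, with a check that it produces the matrix above. The point \(\mathtt{C=(1, 3)}\) is just a made-up example, since the post doesn’t pin down \(\mathtt{C}\)’s coordinates.

```python
import numpy as np

# Eigenvector matrix: the line of reflection (2, -1) and a perpendicular vector (-1, -2).
P = np.array([[2., -1.],
              [-1., -2.]])

# Eigenvalue matrix: keep the line fixed (1), flip the perpendicular vector (-1).
D = np.diag([1., -1.])

R = P @ D @ np.linalg.inv(P)
print(R)        # [[ 0.6 -0.8]
                #  [-0.8 -0.6]], i.e., the reflection matrix above

# Reflect a point across the line through the origin with direction (2, -1).
C = np.array([1., 3.])   # an assumed example point
print(R @ C)    # (-1.8, -2.6)
```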

Eigenvalue Decomposition

I wrote about eigenvalues and eigenvectors a while back, here. In this post, I’ll show how determining the eigenvalues and eigenvectors of a matrix (2 by 2 in this case) is pretty much all of the work of what’s called eigenvalue decomposition. We’ll start with this matrix, which represents a linear transformation: \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\]

You can see the action of this matrix at the right (sort of). It sends the (1, 0) vector to (0, –2) and the (0, 1) vector to (1, –3).

The eigenvectors of this transformation are any nonzero vectors that do not change their direction during this transformation, but only scale up or down (or stay the same) by a factor of \(\mathtt{\lambda}\) as a result of the transformation. So,

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}=\lambda \begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\)

Using our calculations from the previous post linked above, we calculate the eigenvalues to be \(\mathtt{\lambda_1=-2}\) and \(\mathtt{\lambda_2=-1}\). And the corresponding eigenvectors are of the form \(\mathtt{(r, -2r)}\) and \(\mathtt{(-r, r)}\), respectively.

The red vector (representing the eigenvector \(\mathtt{(-r, r)}\)) at right starts at \(\mathtt{(-1, 1)}\). It is scaled by the eigenvalue of \(\mathtt{-1}\) during the transformation—meaning it simply turns in the opposite direction and its magnitude doesn’t change. Any vector of the form \(\mathtt{(-r, r)}\) will behave this way during this transformation.

The purple vector (representing the eigenvector \(\mathtt{(r, -2r)}\)) starts at \(\mathtt{(-1, 2)}\). It is scaled by the eigenvalue of \(\mathtt{-2}\) during the transformation—meaning it turns in the opposite direction and is scaled by a factor of \(\mathtt{2}\). Any vector of the form \(\mathtt{(r, -2r)}\) will behave this way during this transformation.

And Now for the Decomposition

We can now use the equation above and plug in each eigenvalue and its corresponding eigenvector to create two matrix equations.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}=\mathtt{-2}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}\) \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}=\mathtt{-1}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}\]

We can combine the items on the left side of each equation and the items on the right side of each equation into one matrix equation.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\)

This leaves us with [original matrix][eigenvector matrix] = [eigenvector matrix][eigenvalue matrix]. Finally, we multiply both sides on the right by the inverse of the eigenvector matrix, in order to remove it from the left-hand side of the equation. We can’t remove it from the right-hand side, because matrix multiplication is not commutative. That leaves us with the final decomposition (hat tip to Math the Beautiful for some of the ideas in this post): \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-1}\end{bmatrix}\]

Multiplying these three matrices together, or combining the transformations represented by the matrices as we showed here, will result in the original matrix.
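Here is a quick sketch checking that claim with NumPy.

```python
import numpy as np

P = np.array([[-1., -1.],     # eigenvector matrix
              [ 2.,  1.]])
D = np.diag([-2., -1.])       # eigenvalue matrix

# Multiplying the three matrices recovers the original transformation.
print(P @ D @ np.linalg.inv(P))   # [[ 0.  1.]
                                  #  [-2. -3.]] (up to floating-point rounding)
```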

Subtractive Knowledge

I was intrigued by a pedagogical insight offered by the example below, from the introductory class of a course called Computational Linear Algebra. The setup is that the graph (diagram) at the bottom of the image represents a Markov model, and it shows the probabilities of moving from one stage of a disease to another in a year.

So, if a patient is asymptomatic, there is a 7% (0.07) probability of moving from asymptomatic to symptomatic, a 90% chance of staying asymptomatic (indicated by a curved arrow), and so on. This information is also encoded in the stochastic matrix shown.

Here’s the problem: given a group of people in which 85% are asymptomatic, 10% are symptomatic, 5% have AIDS, and of course 0% are deceased, what percent will be in each health state in a year? Putting yourself in the mind of a student right now, take a moment to try to answer the question and, importantly, reflect on your thinking at this stage, even if that thinking involves having no clue how to proceed.

I hope that, given this setup, you’ll be somewhat surprised to learn the following: If a high school student (or even middle school student) knows a little bit about probability and how to multiply and add, they should be able to answer this question.

Why? Well, if 85 out of every 100 in a group is asymptomatic, and there is a 90% probability of remaining asymptomatic in a year, then (0.85)(0.9) = 76.5% of the group is predicted to be asymptomatic in a year. The symptomatic group has two products that must be added: 10% of the group is symptomatic, and there is a 93% probability of remaining that way, so that’s (0.93)(0.1). But this group also takes on 7% of the 85% that were asymptomatic but moved to symptomatic. So, the total is (0.93)(0.1) + (0.07)(0.85) = 15.25%. The AIDS group percent is the sum of three products, for a total of 6.45%, and the Death group percent is a sum of four products, for a total of 1.8%.

Probability, multiplication, and addition are all you need to know. No doubt, knowing something about matrix-vector multiplication (and transposes), as we have discussed, can be helpful, but it does not seem to be necessary in this case.
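For the curious, here is the matrix-vector version of the calculation above as a NumPy sketch. Only a few of the transition probabilities (0.90, 0.07, 0.93) are quoted in the text, so the remaining entries in the matrix below are assumptions, chosen to be consistent with the totals worked out above.

```python
import numpy as np

# Rows: current state (asymptomatic, symptomatic, AIDS, death).
# Columns: state one year later. Only 0.90, 0.07, and 0.93 are quoted in the
# text above; the remaining entries are assumed, consistent with the totals.
T = np.array([[0.90, 0.07, 0.02, 0.01],
              [0.00, 0.93, 0.05, 0.02],
              [0.00, 0.00, 0.85, 0.15],
              [0.00, 0.00, 0.00, 1.00]])

group = np.array([0.85, 0.10, 0.05, 0.00])   # the group's current distribution

print(T.T @ group)   # [0.765, 0.1525, 0.0645, 0.018] -- 76.5%, 15.25%, 6.45%, 1.8%
```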

Bamboozled

I think it’s reasonable to suspect that many knowledgeable students—and knowledgeable adults—would be bamboozled by the highfalutin language here into believing that they cannot solve this problem, when in fact they can. If that’s true, then why is that the case?

Knowledge is domain specific, of course (and very context specific), and that would seem to be the best explanation of students’ hypothesized difficulties. That is, given the cues (both verbal and visual) that this problem involves knowledge of linear algebra, Markov models, and/or stochastic matrices, anyone without that knowledge would naturally assume that they don’t have what is required to solve the problem and give up. And even if they suspected that some simple probability theory, multiplication, and addition were all that they needed, being bombarded by even a handful of foreign mathematical terms would greatly reduce their confidence in this suspicion.

Perhaps, then, the reason we are looking for—the reason students don’t believe they can solve problems when in fact they can—has to do with students’ attitudes, not their knowledge. And situations like these during instruction are enough to convince many that knowledge is overrated. The solution to this psychological reticence is, for many people, to encourage students to be fearless, to have problem-solving orientations and growth mindsets. After all, it’s clear that more knowledge would be helpful, but it’s not necessary in many cases. We’ll teach knowledge, sure, but we can do even better if we spend time encouraging soft skills along the way. Do we want students to give up every time they face a situation in life that they were not explicitly taught to deal with?

Knowledge Is Not Just Additive

The problem with this view is that it construes knowledge as only additive. That is, it is thought, knowledge only works to give its owner things to think about and think with. So, in the above example, students already have all the knowledge things to work with: probability, multiplication, and addition. Anything else would only serve to bring in more things to think about, which would be superfluous.

But this isn’t the only way knowledge works. It can also be subtractive—that is, knowing something can tell you that it is irrelevant to the current problem. Not knowing it means that you can’t know about its relevance (and situations like the above will easily bias you to giving superficial information a high degree of relevance). So, students cannot know with high confidence that matrices are essentially irrelevant to the problem above if they don’t know what matrices are. But even knowing nothing about matrices, knowing that, computationally, linear algebra is fundamentally about multiplying and adding things may be enough. Taking that perspective can allow you to ignore the superficial setup of the problem. But that’s still knowledge.

A better interpretation of students’ difficulties with the above is that, in fact, they do need more knowledge to solve the problem. The knowledge they need is subtractive; it will help them ignore superficial irrelevant details to get at the marrow of the problem.

Knowledge is obviously additive, but it is much more subtly subtractive too, helping to clear away facts that are irrelevant to a given situation. Subtractive knowledge is like the myelin sheath around some nerve cells in the brain. It acts as an insulator for thinking—making it faster, more efficient, and, as we have seen, more effective.