Least Squares with Linear Algebra

In brief, linear regression is about finding the line of best fit to a data set. If you’re looking for the linear algebra way of doing this, you will most likely find it searching for the term least squares.

In the basic scenario, you’ve got some two-dimensional data, \(\mathtt{(x, y)}\) coordinates, and you want to find the equation for a straight line that is as close as possible to each point. Such a scenario is shown below, though here the line has already been graphed and the equation for the line of best fit displayed. (But feel free to change the data in the table or move the points around to see how the line of best fit changes.)

| x | y |
| --- | --- |
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |

So, let’s imagine that we haven’t found this line yet. We know what we are looking for is a line of the form \(\mathtt{y=mx+b}\). In linear algebra terms, we want the vector \(\mathtt{y}\) (11 rows, 1 column) to equal the vector \(\mathtt{x}\) (11 rows, 1 column) times an unknown slope \(\mathtt{m}\) plus an intercept vector \(\mathtt{b}\) (11 rows, 1 column, with the same intercept repeated in every row).

The first problem we have to fix is that \(\mathtt{x}\), the 11 × 1 vector, needs to become \(\mathtt{X}\), an 11 × 2 matrix, so that we get our equations right. So we’ll pad \(\mathtt{x}\) with some 1s and then we’ll be able to call it \(\mathtt{X}\). (We can pad on either side, left or right, just remembering to interchange \(\mathtt{b}\) and \(\mathtt{m}\) to keep the equations straight.) The second problem we can fix is that we don’t need a separate intercept vector and a separate slope: we can combine the two unknowns into one vector and form an equivalent matrix equation that means the same thing as \(\mathtt{y=mx+b}\). We need the equation to look like this: \[\begin{bmatrix}\mathtt{8.04}\\\mathtt{6.95}\\\mathtt{7.58}\\\mathtt{8.81}\\\mathtt{8.33}\\\mathtt{9.96}\\\mathtt{7.24}\\\mathtt{4.26}\\\mathtt{10.84}\\\mathtt{4.82}\\\mathtt{5.68}\end{bmatrix} = \begin{bmatrix}\mathtt{1}&\mathtt{10}\\\mathtt{1}&\mathtt{8}\\\mathtt{1}&\mathtt{13}\\\mathtt{1}&\mathtt{9}\\\mathtt{1}&\mathtt{11}\\\mathtt{1}&\mathtt{14}\\\mathtt{1}&\mathtt{6}\\\mathtt{1}&\mathtt{4}\\\mathtt{1}&\mathtt{12}\\\mathtt{1}&\mathtt{7}\\\mathtt{1}&\mathtt{5}\end{bmatrix}\begin{bmatrix}\mathtt{b}\\\mathtt{m}\end{bmatrix}\]

Multiply the matrix and vector on the right side of that equation, and you get, for the first equation, \(\mathtt{8.04=b+(m)(10)}\). That’s equivalent to \(\mathtt{y=mx+b}\) for the first \(\mathtt{(x, y)}\) data point \(\mathtt{(10, 8.04)}\). The matrix and vector setup ensures that all the equations for all the points are of the correct form.

In linear algebra terms, we have rewritten the equation to be \(\mathtt{y=Xv}\), where \(\mathtt{y}\) is an 11 × 1 vector, \(\mathtt{X}\) is an 11 × 2 matrix (padded with some 1s), and \(\mathtt{v}\) is a 2 × 1 vector which contains the unknown slope \(\mathtt{m}\) of the best fit line and the unknown intercept \(\mathtt{b}\).

It’s an Approximation

At this point in the explanation, it’s important to realize that we will make another shift. The first was from the real data to the matrix algebra setup. We shift again below, away from that setup, per se, and toward just finding out what that unknown \(\mathtt{v}\) is.

As an analogy, the system shown below features a 3D vector (2, 0, –2) which does not lie in the plane spanned by the two column vectors of the matrix. Thus, there is no solution \(\mathtt{(j, k)}\). \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,1}\\\mathtt{1}&\mathtt{-3}\\\mathtt{1}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{j}\\\mathtt{k}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}\longleftarrow\text{no solutions}\]

Similarly, the columns of our \(\mathtt{X}\) matrix form a plane in 11-dimensional space. In order for \(\mathtt{v}\) to be a solution to our original matrix equation above, \(\mathtt{y}\) has to live in this plane too. But we already know that it doesn’t. If it did, the points would lie along some line. It’s true that \(\mathtt{y}\) is an 11-dimensional vector, just like each column of \(\mathtt{X}\), but \(\mathtt{y}\) doesn’t live in the same plane.

The closest approximation we can get to \(\mathtt{y}\) in the plane of \(\mathtt{X}\) is \(\mathtt{Xv=p}\), where \(\mathtt{p}\) is the projection of \(\mathtt{y}\) onto the plane (another thing we’ll have to come back to). The vector \(\mathtt{q}\) connecting \(\mathtt{y}\) with its projection \(\mathtt{p}\) is perpendicular to \(\mathtt{p}\) (and, thus, to the plane).

Now we can write some more equations. For example, we know we want \(\mathtt{Xv=p}\), but it’s also true that \(\mathtt{p+q=y}\). And we can do some manipulation to show that all the column vectors of \(\mathtt{X}\) are perpendicular to \(\mathtt{q}\). Assuming for a second that our \(\mathtt{X}\) is a 2 × 2 matrix, we embed \(\mathtt{X}\) in three dimensions by appending a row of zeros, transpose it (so that the multiplication is defined), and multiply it by \(\mathtt{q}\), which points straight out of the plane. The same basic idea applies to our original \(\mathtt{X}\) (11 × 2) matrix. \[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\\\mathtt{0}&\mathtt{0}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\]\[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{q_3}\end{bmatrix}=\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}\] This gives us a key equation: \(\mathtt{X^{T}q=0}\).

Since \(\mathtt{q=y-p}\), substituting for \(\mathtt{q}\) gets us \(\mathtt{X^{T}(y-p)=0}\). Then since \(\mathtt{Xv=p}\), substituting for \(\mathtt{p}\) brings us to \(\mathtt{X^{T}(y-Xv)=0}\). Distribute to get \(\mathtt{X^{T}y-X^{T}Xv=0}\), which means \(\mathtt{X^{T}y=X^{T}Xv}\). Multiply both sides on the left by the inverse of \(\mathtt{X^{T}X}\), and the vector \(\mathtt{v}\) that we’re after, then, is \[\mathtt{v=(X^{T}X)^{-1}X^{T}y}\]

That formula gives us the slope and intercept of our best fit line. Below is one way the least-squares solution can be calculated with a little Python. The calculation I used for the above interactive to get the best fit line is way more complicated. If I had known about the linear algebra way when I made it, I would have definitely gone with that instead.
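A minimal sketch of that calculation, using NumPy and the data from the table above, might look like this (the variable names are mine):

```python
import numpy as np

# The (x, y) data from the table above.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# Pad x with a column of 1s to form the 11 x 2 matrix X.
X = np.column_stack([np.ones_like(x), x])

# v = (X^T X)^(-1) X^T y, where v = [b, m].
v = np.linalg.inv(X.T @ X) @ X.T @ y
b, m = v
print(f"y = {m:.3f}x + {b:.3f}")  # roughly y = 0.500x + 3.000
```

In practice, `np.linalg.lstsq(X, y, rcond=None)` computes the same \(\mathtt{v}\) without explicitly forming the inverse, which is the more numerically stable route.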

The Singular Value Decomposition

We’ve now seen the eigenvalue decomposition of a linear transformation (in the form of a matrix). We can think of what we did in that decomposition as breaking the original transformation into three transformations. If we multiply the rightmost matrix by any vector, then multiply the middle matrix by that product, and then multiply the leftmost matrix on the right-hand side by that result, we see the starting vector transformed three times. That process is equivalent to multiplying the starting vector by the original matrix.

We can also say that, in that original transformation matrix, which we’ll call \(\mathtt{A}\), we mapped a set of orthogonal vectors, or vectors at right angles to each other, (1, 0) and (0, 1), onto a set of non-orthogonal vectors (0, –2) and (1, –3). We don’t have to multiply each vector by the transformation matrix one at a time. We can multiply the set of vectors, as a matrix, by the transformation matrix, like so.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{0}\\\mathtt{0}&\mathtt{1}\end{bmatrix}=\begin{bmatrix}\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}&\mathtt{0}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}+\mathtt{1}\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-3}\end{bmatrix}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\)

Of course, we just multiplied the original matrix by the identity matrix, so it spit out the original matrix again. But the above interpretation is different, though it gives the same results.

Okay, great, but we humans seem to love our right angles. So, this question arises about linear transformations: could we use any matrix to map some pair of orthogonal vectors (vectors at right angles to each other) to a different set of orthogonal vectors? That is, could we transform our way from one orthogonal “system” to another with any single transformation matrix?

Could we find a pair of orthogonal vectors which, after undergoing our transformation \(\mathtt{A}\), were mapped to a different pair of orthogonal vectors? If we could, that would mean that if we multiplied \(\mathtt{A}\) by a set of orthogonal vectors \(\mathtt{V}\) (i.e., transformed the set of vectors \(\mathtt{V}\) by the action of \(\mathtt{A}\)), it would be equivalent to just starting with a different set of orthogonal vectors already in position (we’ll call these orthogonal vectors \(\mathtt{U}\)) and just scaling them by a scaling matrix \(\mathtt{\Sigma}\). In notation, we’ll write this hypothesis as \[\mathtt{AV=U \Sigma}\]

The word hypothesis is important here. We’re not really writing down something we know is equivalent—that is, we don’t really know if there is a \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) which will make this equivalence true. Students (and I) have a hard time not seeing the equals sign as meaning “I know that it is true that these are equivalent.” But that’s not what it means here, and it’s good to get used to that flexibility. What it means here is that we are supposing for the time being that these two products are equivalent. If some contradiction falls out of our algebra (or we get some kind of infinity), we’ll know that the equivalence fails—at least insofar as we want just one matrix to pop out for each unknown matrix.

Let’s first see how this equivalence appears before we dig into figuring out what \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are. That is, let me show you that the matrices \(\mathtt{V}\), \(\mathtt{U}\), and \(\mathtt{\Sigma}\) are possible before we look at what they are. On the left is our original transformation matrix acting on a set of orthogonal vectors \(\mathtt{V}\) (purple and red), so this transformation shows \(\mathtt{AV}\). (You can see that the original (1, 0) and (0, 1) vectors go to where they’re supposed to.)

\[\mathtt{AV\quad\quad\quad\quad =\quad\quad\quad\quad U \Sigma}\]

On the right is a set of two different orthogonal vectors \(\mathtt{\Sigma}\) (purple and red) which start out aligned to the horizontal and vertical grid lines and are then rotated and reflected (reflection being a kind of scaling) by a matrix \(\mathtt{U}\). And the two transformations are equivalent! (In the demo they are very close.) So we can squeeze an orthogonal-to-orthogonal transformation out of that weird matrix we saw in the eigenvalue decomposition, the one that took a square and rotated and stretched it into a funky parallelogram.

I note that, above, I called \(\mathtt{\Sigma}\) a scaling matrix, but here I’m using it as just a set of orthogonal vectors. Luckily for us, both things are true. It just depends on how you look at them. A square matrix like the ones we are using can represent either a pair of 2-dimensional vectors or a linear transformation. We get to decide how to interpret the matrix in any given situation.

Getting \(\mathtt{V}\)

To begin to know what these matrices are, we can start by writing the equation above like this. \[\mathtt{A=U \Sigma V^T}\] We can do this because \(\mathtt{V}\) is orthogonal, and I briefly mentioned last time that the transpose of an orthogonal matrix is the same as its inverse. So, in effect, we multiplied both sides of the equation on the right by \(\mathtt{V^{-1}}\) (that is, by \(\mathtt{V^T}\)), which removed the \(\mathtt{V}\) from the left-hand side.

Okay, now we pull a little transpose magic, but let’s walk through it. Start by multiplying the expression on each side, on the left, by its transpose. (We’ll circle back to grounding all this some other time.) So, we have \[\mathtt{A^TA=(U\Sigma V^T)^T(U\Sigma V^T)}\] We multiplied \(\mathtt{A}\) by its transpose by multiplying to the left of \(\mathtt{A}\), and we multiplied \(\mathtt{U\Sigma V^T}\) by its transpose by multiplying to the left of \(\mathtt{U\Sigma V^T}\). The product \(\mathtt{A^TA}\) is simple: \[\begin{bmatrix}\mathtt{0}&\mathtt{-2}\\\mathtt{1}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}\] But let’s take some time to simplify \(\mathtt{(U\Sigma V^T)^T(U\Sigma V^T)}\). As an example of what to do when we have the transpose of a product of matrices, consider these products.

\[\left(\begin{bmatrix}\mathtt{1}&\mathtt{3}\\\mathtt{2}&\mathtt{4}\end{bmatrix}\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{5}&\mathtt{7}\end{bmatrix}\right)^\mathtt{T} = \begin{bmatrix}\mathtt{19}&\mathtt{27}\\\mathtt{28}&\mathtt{40}\end{bmatrix}^\mathtt{T}\]
\[\begin{bmatrix}\mathtt{4}&\mathtt{5}\\\mathtt{6}&\mathtt{7}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{2}\\\mathtt{3}&\mathtt{4}\end{bmatrix} = \begin{bmatrix}\mathtt{19}&\mathtt{28}\\\mathtt{27}&\mathtt{40}\end{bmatrix}\]

Each matrix product on the left is equal to the expression on its right. Test them out for yourself. But the expressions on the right are equal to each other, which means the products on the left are equal to each other. This example shows that \(\mathtt{(AB)^{T}=B^{T}A^{T}}\). The transpose of a product is equal to the product of the transposes, multiplied in reverse order. Again, we’ll ground this later, but this suggests, correctly, that we can rewrite the transposed product above, \(\mathtt{(U\Sigma V^T)^T}\), as \(\mathtt{V\Sigma^{T}U^{T}}\), considering that \(\mathtt{(V^{T})^{T}=V}\). Multiplying that by the remaining part of the right-hand side, \(\mathtt{U\Sigma V^T}\), we get \(\mathtt{V\Sigma^{T}U^{T}U\Sigma V^T}\). Since \(\mathtt{U}\) is orthogonal, its transpose is its inverse, so the \(\mathtt{U}\) terms cancel, leaving us with \(\mathtt{V\Sigma^{T}\Sigma V^T}\). The transpose of a scaling matrix, \(\mathtt{\Sigma}\) in this case, is itself (try it out!), so the middle can be written more simply as \(\mathtt{\Sigma^{2}}\). So, finally (pretty far from finally, actually), we have \[\mathtt{A^{T}A=V\Sigma^{2}V^T}\]

And this is an eigenvalue decomposition for \(\mathtt{A^{T}A}\)! We’ve got a scaling matrix as the lunchmeat, sandwiched by a matrix and its inverse (which is the same as the transpose in this case). So, now we can figure out \(\mathtt{V}\) and \(\mathtt{\Sigma}\) by doing the eigenvalue decomposition like we did previously. Here it is: \[\begin{bmatrix}\mathtt{4}&\mathtt{6}\\\mathtt{6}&\mathtt{10}\end{bmatrix}=\begin{bmatrix}\mathtt{\color{purple}{-\frac{521}{991}}}&\mathtt{\color{red}{-\frac{3725}{4379}}}\\\mathtt{\color{purple}{-\frac{3725}{4379}}}&\mathtt{\,\,\,\,\color{red}{\frac{521}{991}}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{4510}{329}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{658}{2255}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]

The purple and red vectors are our starting vectors from the animation on the left above: (–0.53, –0.85) and (–0.85, 0.53). When we ask what orthogonal starting vectors can we pick such that matrix \(\mathtt{A}\) will transform them to a different pair of orthogonal vectors, matrix \(\mathtt{V}\)—the red and purple vectors above—is our answer. The square roots of the diagonal values of \(\mathtt{\Sigma}\) above (remember, this matrix was squared) are called the singular values of \(\mathtt{A}\). To me, getting \(\mathtt{V}\) is the coolest part of this decomposition. The rest, below, is just gravy.
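If you’d rather check those numbers than wrestle with the exotic fractions, a quick NumPy sketch does the same eigenvalue decomposition of \(\mathtt{A^{T}A}\) (note that np.linalg.eigh returns the eigenvalues in ascending order, and its eigenvectors may differ from the ones above by a sign):

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

# Eigenvalue decomposition of A^T A: eigenvalues are Sigma^2, eigenvectors are V.
evals, V = np.linalg.eigh(A.T @ A)

print(evals)           # approx [0.292, 13.708]
print(np.sqrt(evals))  # the singular values of A, approx [0.540, 3.702]
print(V)               # columns are the eigenvectors (up to sign)
```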

And Now for \(\mathtt{U}\) and \(\mathtt{\Sigma}\)

In this case, we multiply both sides of the equation \(\mathtt{A=U\Sigma V^T}\) on the right by the transpose of \(\mathtt{A}\) and get an eigenvalue decomposition of \(\mathtt{AA^T}\). \[\,\,\mathtt{AA^T=(U\Sigma V^T)(U\Sigma V^T)^T=U\Sigma V^{T}V\Sigma^{T}U^T=U\Sigma^{2}U^T}\]

\[\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{-3}\\\mathtt{-3}&\mathtt{\,\,\,\,13}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\color{purple}{\frac{4510}{329}}}&\mathtt{\color{red}{0}}\\\mathtt{\color{purple}{0}}&\mathtt{\color{red}{\frac{658}{2255}}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\]

Now the square roots of the purple and red vectors are our starting vectors from the animation on the right above: (3.7, 0) and (0, 0.54). You’ll notice, of course, that \(\mathtt{\Sigma^2}\) is the same as above, so we really found it earlier. This step is only to pin down what \(\mathtt{U}\) is. At any rate, we are done, and we can write down the full singular value decomposition (SVD) of the original matrix \(\mathtt{A}\). \[\quad\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-\frac{400}{1741}}&\mathtt{\frac{764}{785}}\\\mathtt{\,\,\,\,\frac{764}{785}}&\mathtt{\frac{400}{1741}}\end{bmatrix}\begin{bmatrix}\mathtt{\frac{1655}{447}}&\mathtt{0}\\\mathtt{0}&\mathtt{\frac{894}{1655}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{521}{991}}&\mathtt{-\frac{3725}{4379}}\\\mathtt{-\frac{3725}{4379}}&\mathtt{\,\,\,\,\frac{521}{991}}\end{bmatrix}\]
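NumPy will also do the whole decomposition in one call. Here is a sketch for comparison (np.linalg.svd lists the singular values in descending order, and its sign choices for \(\mathtt{U}\) and \(\mathtt{V^T}\) may differ from the matrices above):

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

U, s, Vt = np.linalg.svd(A)

print(s)                    # approx [3.702, 0.540]
print(U)                    # columns of U (up to sign)
print(Vt)                   # rows of V^T (up to sign)
print(U @ np.diag(s) @ Vt)  # multiplies back out to A
```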

We’re finding an orthogonal-to-orthogonal transformation in a hurricane, so it shouldn’t be surprising to get weird numbers. One thing the SVD makes clear (the eigenvalue decomposition does this too) is that linear transformations can be described as combinations of rotations and scalings (the latter of which include reflections) and that’s it.


Inverses and Transposes

This will now be my 22nd post on linear algebra, and I hope it’ll be noticeable, looking at all of them together so far, that we haven’t talked about systems of equations. And there’s a good reason for that: because they suck you down an ugly hole of mindless calculation, meaning-challenged tedium, and fruitless, pointless backward thinking. Systems are awesome and important, and I’m sure someone can make a strong argument for introducing them early, but, for me, they can wait.

The inverses of matrices are pretty interesting. The inverse of a 2 × 2 matrix is the matrix that, when multiplied by the original matrix, gives the identity matrix as the product. The inverse is given by the middle matrix below: \[\begin{bmatrix}\mathtt{a} & \mathtt{b}\\\mathtt{c} & \mathtt{d}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{d}{ad-bc}} & \mathtt{-\frac{b}{ad-bc}}\\\mathtt{-\frac{c}{ad-bc}} & \mathtt{\,\,\,\,\frac{a}{ad-bc}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]
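Here is a small sketch checking that formula against NumPy’s built-in inverse, assuming \(\mathtt{ad-bc \neq 0}\) (the values of a, b, c, and d are just example numbers):

```python
import numpy as np

a, b, c, d = 1., 3., -2., 4.
det = a * d - b * c  # must be nonzero for the inverse to exist

M = np.array([[a, b],
              [c, d]])
M_inv = np.array([[d, -b],
                  [-c, a]]) / det

print(M @ M_inv)         # the identity matrix (up to floating-point error)
print(np.linalg.inv(M))  # matches M_inv
```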

There is an easy-to-follow derivation of this formula here if you’re interested—one that requires only some high school algebra.

You can notice, given this setup, that the identity matrix is its own inverse. Are there any others that fit this description? One thought: we need \(\mathtt{ad-bc}\) to be equal to 1, and we also want \(\mathtt{a=d}\). The values on the other diagonal, \(\mathtt{b}\) and \(\mathtt{c}\), should both be 0, since they have to be equal to their opposites.

In that case, \(\begin{bmatrix}\mathtt{-1} & \mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0} & \mathtt{-1}\end{bmatrix}\) would be its own inverse too. What does that mean?

For the identity matrix, it means that if we apply the do-nothing transformation to the home-base matrix, we stay on home base, with the “x” vector pointed to (1, 0) and the “y” vector pointed to (0, 1). The matrix above represents a reflection across the origin. So, applying the reflection across the origin a second time should return us to home base.

Wouldn’t a reflection across the x-axis be an inverse of itself, then? Yes! The criteria above for an inverse need to be amended a little to allow for negatives effectively canceling each other out to make positives. \[\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{1} & \mathtt{\,\,\,\,0}\\\mathtt{0} & \mathtt{-1}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

The inverse of a matrix is represented with a superscript \(\mathtt{-1}\). So, the inverse of the matrix \(\mathtt{A}\) is written as \(\mathtt{A^{-1}}\).

The Transpose

The transpose of a matrix is the matrix you get when you take the rows of the matrix and turn them into columns instead. The transpose of a matrix is represented with a superscript \(\mathtt{T}\). So: \[\begin{bmatrix}\mathtt{\,\,\,\,1} & \mathtt{3}\\\mathtt{-2} & \mathtt{4}\end{bmatrix}^{\mathtt{T}}\mathtt{=}\begin{bmatrix}\mathtt{1} & \mathtt{-2}\\\mathtt{3} & \mathtt{\,\,\,\,4}\end{bmatrix}\]

If a 2 × 2 matrix \(\mathtt{V}\) is orthogonal—meaning that its column vectors are perpendicular and each column vector has a length of 1—then its transpose is the same as its inverse, or \(\mathtt{V^{-1}=V^T}\).
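A rotation matrix is a convenient orthogonal matrix to test this with (any angle works):

```python
import numpy as np

theta = 0.7  # an arbitrary angle; every rotation matrix is orthogonal
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(V.T, np.linalg.inv(V)))  # True: the transpose is the inverse
print(V.T @ V)                             # the identity matrix (up to floating-point error)
```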

Eigenvectors for Reflections

Having just finished a post on eigenvalue decomposition, involving eigenvalues and eigenvectors, I was flipping through this nifty little textbook, where I saw a description of geometric reflections using eigenvectors, and I thought, “Oh my gosh, of course!”

We have seen reflections here, yet back then we had to find something called the foot of a point to figure out the reflection. But we can construct a reflection matrix (same as a scaling matrix) using only knowledge about eigenvectors.

When we talked about eigenvectors before—those vectors which do not change direction under a transformation (except if they reverse direction)—we were looking for them. But in a reflection, we already know them. In any reflection across a line, we already know that the vector that matches the line of reflection will not change direction, and the vector perpendicular to the line of reflection will only be scaled by \(\mathtt{-1}\). So, both the vector which describes the line of reflection and the vector perpendicular to that line are eigenvectors.

So let’s say we want to reflect point \(\mathtt{C}\) across the line described by the vector \(\mathtt{\alpha(2, -1)}\).

Our eigenvector matrix will be \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\] Here we see the second eigenvector, but we can simply use the vector perpendicular to the first if we don’t know the second one.

Our eigenvalue matrix will be \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\] since we will keep the first eigenvector—the line of reflection—fixed (eigenvalue of 1) and just flip the second eigenvector (eigenvalue of –1).

As a penultimate step, we calculate the inverse of the eigenvector matrix (we’ll get into inverses fairly soon), and then finally multiply all those matrices together (right to left), as we saw with the eigenvalue decomposition, to get the reflection matrix. \[\begin{bmatrix}\mathtt{\,\,\,\,2}&\mathtt{-1}\\\mathtt{-1}&\mathtt{-2}\end{bmatrix}\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,0}\\\mathtt{0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,\frac{2}{5}}&\mathtt{-\frac{1}{5}}\\\mathtt{-\frac{1}{5}}&\mathtt{-\frac{2}{5}}\end{bmatrix}\mathtt{=}\begin{bmatrix}\mathtt{\,\,\,\,\frac{3}{5}}&\mathtt{-\frac{4}{5}}\\\mathtt{-\frac{4}{5}}&\mathtt{-\frac{3}{5}}\end{bmatrix}\]

The final matrix on the right side is the reflection matrix for reflections across the line represented by the vector \(\mathtt{\alpha(2, -1)}\), that is, the line through the origin with slope \(\mathtt{-\frac{1}{2}}\).
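Here is a sketch of that construction in NumPy, multiplying the eigenvector matrix, the eigenvalue matrix, and the inverse together as above (the point \(\mathtt{C}\) is just an example, since no coordinates were given for it):

```python
import numpy as np

# Columns of P are the eigenvectors: (2, -1) along the line of reflection,
# (-1, -2) perpendicular to it.
P = np.array([[2., -1.],
              [-1., -2.]])
D = np.diag([1., -1.])  # eigenvalues: keep the line fixed, flip the perpendicular

R = P @ D @ np.linalg.inv(P)
print(R)  # [[ 0.6, -0.8], [-0.8, -0.6]], i.e., the matrix of fifths above

# Reflect an example point C across the line through the origin with slope -1/2.
C = np.array([3., 1.])
print(R @ C)  # the reflected point
```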

It’s worth playing around with building these matrices and using them to find reflections. A lot of what’s here is pretty obvious, but reflection matrices can come in handy in some non-obvious ways too. The perpendicular vector at the very least has to be on the same side of the line as the point you are reflecting.

Eigenvalue Decomposition

I wrote about eigenvalues and eigenvectors a while back, here. In this post, I’ll show how determining the eigenvalues and eigenvectors of a matrix (2 by 2 in this case) is pretty much all of the work of what’s called eigenvalue decomposition. We’ll start with this matrix, which represents a linear transformation: \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\]

You can see the action of this matrix at the right (sort of). It sends the (1, 0) vector to (0, –2) and the (0, 1) vector to (1, –3).

The eigenvectors of this transformation are any nonzero vectors that do not change their direction during this transformation, but only scale up or down (or stay the same) by a factor of \(\mathtt{\lambda}\) as a result of the transformation. So,

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}=\lambda \begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\)

Using our calculations from the previous post linked above, we calculate the eigenvalues to be \(\mathtt{\lambda_1=-2}\) and \(\mathtt{\lambda_2=-1}\). And the corresponding eigenvectors are of the form \(\mathtt{(r, -2r)}\) and \(\mathtt{(-r, r)}\), respectively.

The red vector (representing the eigenvector \(\mathtt{(-r, r)}\)) at right starts at \(\mathtt{(-1, 1)}\). It is scaled by the eigenvalue of \(\mathtt{-1}\) during the transformation—meaning it simply turns in the opposite direction and its magnitude doesn’t change. Any vector of the form \(\mathtt{(-r, r)}\) will behave this way during this transformation.

The purple vector (representing the eigenvector \(\mathtt{(r, -2r)}\)) starts at \(\mathtt{(-1, 2)}\). It is scaled by the eigenvalue of \(\mathtt{-2}\) during the transformation—meaning it turns in the opposite direction and is scaled by a factor of \(\mathtt{2}\). Any vector of the form \(\mathtt{(r, -2r)}\) will behave this way during this transformation.

And Now for the Decomposition

We can now use the equation above and plug in each eigenvalue and its corresponding eigenvector to create two matrix equations.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}=\mathtt{-2}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}\) \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}=\mathtt{-1}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,1}\end{bmatrix}\]

We can combine the items on the left side of each equation and the items on the right side of each equation into one matrix equation.

\(\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\)

This leaves us with [original matrix][eigenvector matrix] = [eigenvector matrix][eigenvalue matrix]. Finally, we multiply both sides on the right by the inverse of the eigenvector matrix, in order to remove it from the left side of the equation. We can’t remove it from the right side, because matrix multiplication is not commutative. That leaves us with the final decomposition (hat tip to Math the Beautiful for some of the ideas in this post): \[\begin{bmatrix}\mathtt{\,\,\,\,0}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-3}\end{bmatrix}=\begin{bmatrix}\mathtt{-1}&\mathtt{-1}\\\mathtt{\,\,\,\,2}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{-2}&\mathtt{\,\,\,\,0}\\\mathtt{\,\,\,\,0}&\mathtt{-1}\end{bmatrix}\begin{bmatrix}\mathtt{\,\,\,\,1}&\mathtt{\,\,\,\,1}\\\mathtt{-2}&\mathtt{-1}\end{bmatrix}\]

Multiplying these three matrices together, or combining the transformations represented by the matrices as we showed here, will result in the original matrix.
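A quick NumPy check of the whole thing (np.linalg.eig may order the eigenvalues differently and scales its eigenvectors to unit length, but the reconstruction is the same):

```python
import numpy as np

A = np.array([[0., 1.],
              [-2., -3.]])

# Eigenvector and eigenvalue matrices from above.
P = np.array([[-1., -1.],
              [2., 1.]])
D = np.diag([-2., -1.])

print(P @ D @ np.linalg.inv(P))  # multiplies back out to A

# NumPy's own eigendecomposition, for comparison.
evals, evecs = np.linalg.eig(A)
print(evals)  # -2 and -1, in some order
print(evecs)  # columns proportional to (r, -2r) and (-r, r), in some order
```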

Subtractive Knowledge

I was intrigued by a pedagogical insight offered by the example below, from the introductory class of a course called Computational Linear Algebra. The setup is that the graph (diagram) at the bottom of the image represents a Markov model, and it shows the probabilities of moving from one stage of a disease to another in a year.

So, if a patient is asymptomatic, there is a 7% (0.07) probability of moving from asymptomatic to symptomatic, a 90% chance of staying at asymptomatic (indicated by a curved arrow), and so on. This information is also encoded in the stochastic matrix shown.

Here’s the problem: Given a group of people in which 85% are asymptomatic, 10% are symptomatic, 5% have AIDS, and of course 0% are deceased, what percent will be in each health state in a year? Putting yourself in the mind of a student right now, take a moment to try to answer the question and, importantly, reflect on your thinking at this stage, even if that thinking involves having no clue how to proceed.

I hope that, given this setup, you’ll be somewhat surprised to learn the following: If a high school student (or even middle school student) knows a little bit about probability and how to multiply and add, they should be able to answer this question.

Why? Well, if 85 out of every 100 people in the group are asymptomatic, and there is a 90% probability of remaining asymptomatic in a year, then (0.85)(0.9) = 76.5% of the group is predicted to be asymptomatic in a year. The symptomatic group has two products that must be added: 10% of the group is symptomatic, and there is a 93% probability of remaining that way, so that’s (0.93)(0.1). But this group also takes on 7% of the 85% that were asymptomatic but moved to symptomatic. So, the total is (0.93)(0.1) + (0.07)(0.85) = 15.25%. The AIDS group percent is the sum of three products, for a total of 6.45%, and the Death group percent is a sum of four products, for a total of 1.8%.
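Here is the same arithmetic written as a vector-matrix product in NumPy. Only some of the transition probabilities are stated directly above (0.90, 0.07, 0.93), so the remaining entries below are assumptions chosen to be consistent with the 6.45% and 1.8% totals:

```python
import numpy as np

# Rows are the current state, columns the state a year later:
# [asymptomatic, symptomatic, AIDS, death]. Entries not stated in the
# text (0.02, 0.01, 0.05, 0.85, 0.15) are assumed values.
T = np.array([
    [0.90, 0.07, 0.02, 0.01],
    [0.00, 0.93, 0.05, 0.02],
    [0.00, 0.00, 0.85, 0.15],
    [0.00, 0.00, 0.00, 1.00],
])

group = np.array([0.85, 0.10, 0.05, 0.00])  # today's distribution

print(group @ T)  # [0.765, 0.1525, 0.0645, 0.018]
```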

Probability, multiplication, and addition are all you need to know. No doubt, knowing something about matrix-vector multiplication, as we have discussed, (and transposes) can be helpful, but it does not seem to be necessary in this case.

Bamboozled

I think it’s reasonable to suspect that many knowledgeable students—and knowledgeable adults—would be bamboozled by the highfalutin language here into believing that they cannot solve this problem, when in fact they can. If that’s true, then why is that the case?

Knowledge is domain specific, of course, (and very context specific) and that would seem to be the best explanation of students’ hypothesized difficulties. That is, given the cues (both verbal and visual) that this problem involves knowledge of linear algebra, Markov models, and/or stochastic matrices, anyone without that knowledge would naturally assume that they don’t have what is required to solve the problem and give up. And even if they suspected that some simple probability theory, multiplication, and addition were all that they needed, being bombarded by even a handful of foreign mathematical terms would greatly reduce their confidence in this suspicion.

Perhaps, then, the reason we are looking for—the reason students don’t believe they can solve problems when in fact they can—has to do with students’ attitudes, not their knowledge. And situations like these during instruction are enough to convince many that knowledge is overrated. The solution to this psychological reticence is, for many people, to encourage students to be fearless, to have problem-solving orientations and growth mindsets. After all, it’s clear that more knowledge would be helpful, but it’s not necessary in many cases. We’ll teach knowledge, sure, but we can do even better if we spend time encouraging soft skills along the way. Do we want students to give up every time they face a situation in life that they were not explicitly taught to deal with?

Knowledge Is Not Just Additive

The problem with this view is that it construes knowledge as only additive. That is, it is thought, knowledge only works to give its owner things to think about and think with. So, in the above example, students already have all the knowledge things to work with: probability, multiplication, and addition. Anything else would only serve to bring in more things to think about, which would be superfluous.

But this isn’t the only way knowledge works. It can also be subtractive—that is, knowing something can tell you that it is irrelevant to the current problem. Not knowing it means that you can’t know about its relevance (and situations like the above will easily bias you to giving superficial information a high degree of relevance). So, students cannot know with high confidence that matrices are essentially irrelevant to the problem above if they don’t know what matrices are. But even knowing nothing about matrices, knowing that, computationally, linear algebra is fundamentally about multiplying and adding things may be enough. Taking that perspective can allow you to ignore the superficial setup of the problem. But that’s still knowledge.

A better interpretation of students’ difficulties with the above is that, in fact, they do need more knowledge to solve the problem. The knowledge they need is subtractive; it will help them ignore superficial irrelevant details to get at the marrow of the problem.

Knowledge is obviously additive, but it is much more subtly subtractive too, helping to clear away facts that are irrelevant to a given situation. Subtractive knowledge is like the myelin sheaths around some nerve cells in the brain. It acts as an insulator for thinking—making it faster, more efficient, and, as we have seen, more effective.

The Gricean Maxims

When we converse with one another, we implicitly obey a principle of cooperation, according to language philosopher Paul Grice’s theory of conversational implicature.

This ‘cooperative principle’ has four maxims, which although stated as commands are intended to be descriptions of specific rules that we follow—and expect others will follow—in conversation:

  • quality:    Be truthful.
  • quantity:  Don’t say more or less than is required.
  • relation:  Be relevant.
  • manner:    Be clear and orderly.

I was drawn recently to these maxims (and to Grice’s theory) because they rather closely resemble four principles of instructional explanation that I have been toying with off and on for a long time now: precision, clarity, order, and cohesion.

In fact, there is a fairly snug one-to-one correspondence among our respective principles, a relationship which is encouraging to me precisely because it is coincidental. Here they are in an order corresponding to the above:

  • precision:  Instruction should be accurate.
  • cohesion: Group related ideas.
  • clarity:     Instruction should be understandable and present to its audience.
  • order:       Instruction should be sequenced appropriately.

Both sets of principles likely seem dumbfoundingly obvious, but that’s the point. As principles (or maxims), they are footholds on the perimeters of complex ideas—in Grice’s case, the implicit contexts that make up the study of pragmatics; in my case (insert obligatory note that I am not comparing myself with Paul Grice), the explicit “texts” that comprise the content of our teaching and learning.

The All-Consuming Clarity Principle

Frameworks like these can be more than just armchair abstractions; they are helpful scaffolds for thinking about the work we do. Understanding a topic up and down the curriculum, for example, can help us represent it more accurately in instruction. We can think about work in this area as related specifically to the precision principle and, in some sense, as separate from (though connected to) work in other areas, such as topic sequencing (order), explicitly building connections (cohesion), and motivation (clarity).

But principle frameworks can also lift us to some height above this work, where we can find new and useful perspectives. For instance, simply having these principles, plural, in front of us can help us see—I would like to persuade you to see—that “clarity,” or in Grice’s terminology, “relevance,” is the only one we really talk about anymore, and that this is bizarre given that it’s just one aspect of education.

The work of negotiating the accuracy, sequencing, and connectedness of instruction drawn from our shared knowledge has been largely outsourced to publishers and technology startups and Federal agencies, and goes mostly unquestioned by the “delivery agents” in the system, whose role is one of a go-between, tasked with trying to sell a “product” in the classroom to student “customers.”

Making Parallelepipeds

We have talked about the cross product (here and here), so let’s move on to doing something marginally interesting with it: we’ll make a rectangular prism, or the more general parallelepiped.

A parallelepiped is shown at the right. It is defined by three vectors: u, v, and w. The cross product vector \(\mathtt{v \wedge w}\) is perpendicular to both v and w, and its magnitude, \(\mathtt{||v \wedge w||}\), is equal to the area of the parallelogram formed by the vectors v and w (something I didn’t mention in the previous two posts).

The perpendicular height of the skewed prism, or parallelepiped, is given by \(\mathtt{||u||cos(θ)}\).

The volume of the parallelepiped can thus be written as the area of the base times the height, or \(\mathtt{V = (||v \wedge w||)(||u||\text{cos}(θ))}\).

We can write \(\mathtt{\text{cos}(θ)}\) in this case as \[\mathtt{\text{cos}(θ) = \frac{(v \wedge w) \cdot u}{(||v \wedge w||)(||u||)}}\]

Which means that, after simplifying the volume equation, we’re left with \(\mathtt{V = (v \wedge w) \cdot u}\): the dot product of the vector perpendicular to the base and the slant edge vector \(\mathtt{u}\) of the prism. The result is a scalar value, of course, for the volume of the parallelepiped, and, because it is a dot product, it is a signed value. We can get negative volumes, which doesn’t mean the volume is negative but tells us something about the orientation of the parallelepiped.
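Here is that computation with NumPy’s cross and dot products, using three example vectors (not taken from the figure):

```python
import numpy as np

# Three edge vectors of a parallelepiped (example values).
u = np.array([1., 2., 3.])
v = np.array([4., 0., 0.])
w = np.array([0., 5., 0.])

# Signed volume: (v ^ w) . u
volume = np.dot(np.cross(v, w), u)
print(volume)  # 60.0: cross(v, w) = (0, 0, 20), dotted with u gives 60
```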

Creating Some Parallelepipeds

Creating prisms and skewed prisms can be done in Geogebra 3D, but here I’ll show how to create these figures from scratch using Three.js. Click and drag on the window to rotate the scene below. Right click and drag to pan left, right, up, or down. Scroll to zoom in and out.

Click on the Pencil icon above in the lovely Trinket window and navigate to the parallelepiped.js tab to see the code that makes the cubes. You can see that vectors are used to create the vertices (position vectors, so just points). Each face is composed of 2 triangles: (0, 1, 2) means to create a face from the 0th, 1st, and 2nd vertices from the vertices list. Make some copies of the box in the code and play around!

To determine the volume of each cube: \[\left(\begin{bmatrix}\mathtt{1}\\\mathtt{0}\\\mathtt{0}\end{bmatrix} \wedge \begin{bmatrix}\mathtt{0}\\\mathtt{1}\\\mathtt{0}\end{bmatrix}\right) \cdot \begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{1}\end{bmatrix} \mathtt{= 1}\]

Spooky Action at a Distance

I really like this recent post, called Tell Me More, Tell Me More, by math teacher Dani Quinn. The content is an excellent analysis of expert blindness in math teaching. The form, though, is worth seeing as well—it is a traditional educational syllogism, which Quinn helpfully commandeers to arrive at a non-traditional conclusion, that instructional effects have instructional causes, on the right:

| The Traditional Argument | An Alternative Argument |
| --- | --- |
| There is a problem in how we teach: We typically spoon-feed students procedures for answering questions that will be on some kind of test. | “There is a problem in how we teach: We typically show pupils only the classic forms of a problem or a procedure.” |
| This is why students can’t generalize to non-routine problems: we got in the way of their thinking and didn’t allow them to take ownership and creatively explore material on their own. | “This is why they then can’t generalise: we didn’t show them anything non-standard or, if we did, it was in an exercise when they were floundering on their own with the least support.” |

Problematically for education debates, each of these premises and conclusions taken individually is true. That is, they exist. At our (collective) weakest, we do sometimes spoon-feed kids procedures to get them through tests. We do cover only a narrow range of situations—what Engelmann refers to as the problem of stipulation. And we can be, regrettably in either case, systematically unassertive or overbearing.

Solving equations provides a nice example of the instructional effects of both spoon-feeding and stipulation. Remember how to solve equations? Inverse operations. That was the way to do equations. If you have something like \(\mathtt{2x + 5 = 15}\), the table shows how it goes.

| Equation | Step |
| --- | --- |
| \(\mathtt{2x + 5 \color{red}{- 5} = 15 \color{red}{- 5}}\) | Subtract \(\mathtt{5}\) from both sides of the equation to get \(\mathtt{2x = 10}\). |
| \(\mathtt{2x \color{red}{\div 2} = 10 \color{red}{\div 2}}\) | Divide both sides of the equation by 2. |
| \(\mathtt{x = 5}\) | You have solved the equation. |

Do that a couple dozen times and maybe around 50% of the class freezes when they encounter \(\mathtt{22 = 4x + 6}\), with the variable on the right side, or, even worse, \(\mathtt{22 = 6 + 4x}\).

That’s spoon-feeding and stipulation: do it this one way and do it over and over—and, crucially, doing that summarizes most of the instruction around solving equations.

Of course, the lack of prior knowledge exacerbates the negative instructional effects of stipulation and spoon-feeding. But we’ll set that aside for the moment.

The Connection Between Premises and Conclusion

The traditional and alternative arguments above are easily (and often) confused, though, until you include the premise that I have omitted in the middle for each. These help make sense of the conclusions derived in each argument.

| The Traditional Argument | An Alternative Argument |
| --- | --- |
| There is a problem in how we teach: We typically spoon-feed students procedures for answering questions that will be on some kind of test. | “There is a problem in how we teach: We typically show pupils only the classic forms of a problem or a procedure.” |
| Students’ success in schooling is determined mostly by internal factors, like creativity, motivation, and self-awareness. | Students’ success in schooling is determined mostly by external factors, like amount of instruction, socioeconomic status, and curricula. |
| This is why students can’t generalize to non-routine problems: we got in the way of their thinking and didn’t allow them to take ownership and creatively explore material on their own. | “This is why they then can’t generalise: we didn’t show them anything non-standard or, if we did, it was in an exercise when they were floundering on their own with the least support.” |

In short, the argument on the left tends to diagnose pedagogical illnesses and their concomitant instructional effects as people problems; the alternative sees them as situation problems. The solutions generated by each argument are divergent in just this way: the traditional one looks to pull the levers that mostly benefit personal, internal attributes that contribute to learning; the alternative messes mostly with external inputs.

It’s Not the Spoon-Feeding, It’s What’s on the Spoon

I am and have always been more attracted to the alternative argument than the traditional one. Probably for a very simple reason: my role in education doesn’t involve pulling personal levers. Being close to the problem almost certainly changes your view of it—not necessarily for the better. But, roles aside, it’s also the case that the traditional view is simply more widespread, and informed by the positive version of what is called the Fundamental Attribution Error:

We are frequently blind to the power of situations. In a famous article, Stanford psychologist Lee Ross surveyed dozens of studies in psychology and noted that people have a systematic tendency to ignore the situational forces that shape other people’s behavior. He called this deep-rooted tendency the “Fundamental Attribution Error.” The error lies in our inclination to attribute people’s behavior to the way they are rather than to the situation they are in.

What you get with the traditional view is, to me, a kind of spooky action at a distance—a phrase attributed to Einstein, in remarks about the counterintuitive consequences of quantum physics. Adopting this view forces one to connect positive instructional effects (e.g., thinking flexibly when solving equations) with something internal, ethereal and often poorly defined, like creativity. We might as well attribute success to rabbit’s feet or lucky underwear or horoscopes!

Teaching and Learning Coevolved?

Just a few pages into David Didau and Nick Rose’s new book What Every Teacher Needs to Know About Psychology, and I’ve already come across what is, for me, a new thought—that teaching ability and learning ability coevolved:

Strauss, Ziv, and Stein (2002) . . . point to the fact that the ability to teach arises spontaneously at an early age without any apparent instruction and that it is common to all human cultures as evidence that it is an innate ability. Essentially, they suggest that despite its complexity, teaching is a natural cognition that evolved alongside our ability to learn.

Or perhaps this is, even for me, an old thought, but just unpopular enough—and for long enough—to seem like a brand new thought. Perhaps after years of exposure to the characterization of teaching as an anti-natural object—a smoky, rusty gearbox of torture techniques designed to break students’ wills and control their behavior—I have simply come to accept that it is true, and have forgotten that I had done so.

Strauss et al., however, provide some evidence in their research that it is not true. Very young children engage in teaching behavior before formal schooling by relying on a naturally developing ability to understand the minds of others, known as theory of mind (ToM).

Kruger and Tomasello (1996) postulated that defining teaching in terms of its intention—to cause learning, suggests that teaching is linked to theory of mind, i.e., that teaching relies on the human ability to understand the other’s mind. Olson and Bruner (1996) also identified theoretical links between theory of mind and teaching. They suggested that teaching is possible only when a lack of knowledge can be recognized and that the goal of teaching then is to enhance the learner’s knowledge. Thus, a theory of mind definition of teaching should refer to both the intentionality involved in teaching and the knowledge component, as follows: teaching is an intentional activity that is pursued in order to increase the knowledge (or understanding) of another who lacks knowledge, has partial knowledge or possesses a false belief.

The Experiment

One hundred children were separated into 50 pairs—25 pairs with a mean age of 3.5 and 25 with a mean age of 5.5. Twenty-five of the 50 children in each age group served as test subjects (teachers); the other 25 were learners. The teachers completed three groups of tasks before teaching, the first of which (1) involved two classic false-belief tasks. If you are not familiar with these kinds of tasks, the video at right should serve as a delightfully creepy precis—from what appears to be the late 70s, when every single instructional video on Earth was made. The second and third groups of tasks probed participants’ understanding that (2) a knowledge gap between teacher and learner must exist for “teaching” to occur and (3) a false belief about this knowledge gap is possible.

Finally, children participated in the teaching task by teaching the learners how to play a board game. The teacher-children were, naturally, taught how to play the game prior to their own teaching, and they were allowed to play the game with the experimenter until they demonstrated some proficiency. The teacher-learner pair was then left alone, “with no further encouragement or instructions.”

The Results

Consistent with the results from prior false-belief studies, there were significant differences between the 3- and 5-year-olds in Tasks (1) and (3) above, both of which relied on false-belief mechanisms. In Task (3), when participants were told, for example, that a teacher thought a child knew how to read when in fact he didn’t, 3-year-olds were much more likely to say that the teacher would still teach the child. Five-year-olds, on the other hand, were more likely to recognize the teacher’s false belief and say that he or she would not teach the child.

Intriguingly, however, the development of a theory of mind does not seem necessary to either recognizing the need for a special type of discourse called “teaching” or to teaching ability itself—only to a refinement of teaching strategies. Task (2), in which participants were asked, for instance, whether a teacher would teach someone who knew something or someone who didn’t, showed no significant differences between 3- and 5-year-olds in the study. But the groups were significantly different in the strategies they employed during teaching.

Three-year-olds have some understanding of teaching. They understand that in order to determine the need for teaching as well as the target learner, there is a need to recognize a difference in knowledge between (at least) two people . . . Recognition of the learner’s lack of knowledge seems to be a necessary prerequisite for any attempt to teach. Thus, 3-year-olds who identify a peer who doesn’t know [how] to play a game will attempt to teach the peer. However, they will differ from 5-year-olds in their teaching strategies, reflecting the further change in ToM and understanding of teaching that occurs between the ages of 3 and 5 years.

Coevolution of Teaching and Learning

The study here dealt with the innateness of teaching ability and sensibilities but not with whether teaching and learning coevolved, which it mentions at the beginning and then leaves behind.

It is an interesting question, however. Discussions in education are increasingly focused on “how students learn,” and it seems to be widely accepted that teaching should adjust itself to what we discover about this. But if teaching is as natural a human faculty as learning—and coevolved alongside it—then this may be only half the story. How students (naturally) learn might be caused, in part, by how teachers (naturally) teach, and vice versa. And learners perhaps should be asked to adjust to what we learn about how we teach as much as the other way around.

Those seem like new thoughts to me. But they’re probably not.