I just wanted to pause briefly to showcase how some of the linear transformations we have been looking into can be represented in computerese (or at least one version of computerese). You can click on the pencil icon and then on the matrix_transform.js file in the trinket below and look for the word matrix. Change the numbers in those lines to check the effects on the transformations. You can get some fairly wild stuff.

By the way, trinket is an incredibly beautiful product if you like tinkering with all kinds of code. Grab a free account and share your work!

For this demo, I stuck with simple transformations centered at the origin of a coordinate system (so to speak). As you can imagine, there are much more elaborate things you can do when you combine transformations and move the center point around.

So, we did rotations with matrices. Now what about reflections? The basic reflections—of the identity matrix, say—aren’t worth mentioning at the moment. The more puzzling reflections—those about a line that is not horizontal or vertical—are worth looking at.

We’ll save the more complicated way to do this for another time. The simpler way involves something called the foot of the point. Back when we were working out the distance of a point to a line, we were naturally thinking about the perpendicular distance of that point from the line. The place where that perpendicular from the point meets the line is called the foot of the point.

This point is also the midpoint of \(\mathtt{\overline{rr’}}\), the line segment connecting the point \(\mathtt{r}\) with its reflection across the line (the line itself is the perpendicular bisector of that segment). So, if we can get the foot of the point we are reflecting, we can get the reflected point.

Determining the Foot of the Point

Let’s start with a different diagram. The line shown here can be represented by the following vector equation: \[\mathtt{p +\, α}\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix}\] What is the ordered pair for point r’, the reflection of point r across the line?

Let’s start by finding the location of q, the foot of the point. Since we know p (it’s [0, 4]), and we know that the line is described by the vector \(\mathtt{(2α, α)}\), what we need to know is the scalar that scales us from p to q. We’ll call that scalar t.

To get at the scalar t, we can equate two cosine equations. The equation on the left shows the cosine of β that we learned when we looked at the dot product. And the equation on the right shows the cosine of β as the simple adjacent over hypotenuse ratio: \[\mathtt{\text{cos(β)} = \frac{(q\,-\,p) \cdot (r\,-\,p)}{|q\,-\,p||r\,-\,p|} \quad\quad\quad\text{cos(β)} = \frac{|t(q\,-\,p)|}{|r\,-\,p|}}\]

When we set the two right-hand expressions equal to each other and solve for t, we get the scalar t. (Here q – p stands for the line’s direction vector [2, 1], and r – p is the vector [5, –1].) \[\mathtt{t = \frac{(q\,-\,p) \cdot (r\,-\,p)}{|q\,-\,p|^2}} \,\,\longrightarrow\,\, \mathtt{t =} \frac{\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix} \cdot \begin{bmatrix}\mathtt{\,\,\,\,5}\\\mathtt{-1}\end{bmatrix}}{5} \,\,\longrightarrow\,\,\mathtt{t = 1.8}\]

Using the equation for the line at the start of this section, we see that we can set \(\mathtt{α}\) equal to t to determine the location of point q. So, point q is at \[\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix} \mathtt{+\,\,\, 1.8}\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix} = \begin{bmatrix}\mathtt{3.6}\\\mathtt{5.8}\end{bmatrix}\]
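If you would rather let the computer do the arithmetic, here is a minimal Python sketch of the same calculation. The function name `foot_of_point` is my own invention, and the coordinates for r, (5, 3), are an assumption worked out from p = (0, 4) and r – p = (5, –1) above.

```python
def foot_of_point(p, d, r):
    """Return (t, q): the scalar t and the foot q of point r on the line p + t*d."""
    # Vector from the known point on the line to the point we are dropping.
    rp = (r[0] - p[0], r[1] - p[1])
    # t = d . (r - p) / |d|^2, the same ratio worked out in the text.
    t = (d[0] * rp[0] + d[1] * rp[1]) / (d[0] ** 2 + d[1] ** 2)
    # Walk t copies of the direction vector away from p to land on the foot.
    q = (p[0] + t * d[0], p[1] + t * d[1])
    return t, q


# p = (0, 4), direction (2, 1), and r = (5, 3) so that r - p = (5, -1).
t, q = foot_of_point((0, 4), (2, 1), (5, 3))
print(t, q)
```

The printed values should agree with t = 1.8 and q = (3.6, 5.8), give or take floating-point rounding.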

The Midpoint and the Reflection

Now that we have found the location of point q, we can treat it as the midpoint of \(\mathtt{\overline{rr’}}\), or the line segment connecting the point \(\mathtt{r}\) with its reflection across the line.

This is yet another thing we haven’t covered, but the midpoint between \(\mathtt{r}\) and \(\mathtt{r’}\) is \(\mathtt{q = \frac{1}{2}(r + r’)}\). Multiplying both sides by 2 and subtracting r gives the equation for r’, the reflection of r across the given line, once we have figured out the foot of the point, q: \[\mathtt{r’ = 2q\,-\,r}\]
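Putting the foot and the midpoint idea together, a reflection across a line can be sketched in a few lines of Python. This is only an illustration: `reflect_across_line` is a made-up name, and r = (5, 3) is again an assumption derived from p = (0, 4) and r – p = (5, –1).

```python
def reflect_across_line(p, d, r):
    """Reflect point r across the line p + t*d using the foot of the point."""
    rp = (r[0] - p[0], r[1] - p[1])
    # Scalar that takes us from p to the foot of the point.
    t = (d[0] * rp[0] + d[1] * rp[1]) / (d[0] ** 2 + d[1] ** 2)
    q = (p[0] + t * d[0], p[1] + t * d[1])
    # The foot q is the midpoint of r and r', so r' = 2q - r.
    return (2 * q[0] - r[0], 2 * q[1] - r[1])


r_prime = reflect_across_line((0, 4), (2, 1), (5, 3))
print(r_prime)
```

For the example above, this should land near (2.2, 8.6). A quick sanity check is that the segment from r to r’ is perpendicular to the direction vector (2, 1).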

I have to say, this makes reflections seem like a lot of work anyway.

We can do all kinds of weird scalings with matrices, which we saw first here. For example, stretch the ‘horizontal’ vector (1, 0) to, say, (2, 0) and then stretch and move the ‘vertical’ vector (0, 1) to, say, (–3, 5). Our transformation matrix, then, is \[\begin{bmatrix}\mathtt{2} & \mathtt{-3}\\\mathtt{0} & \mathtt{\,\,\,\,5}\end{bmatrix}\]

The vector representing point A in this case clearly changed directions as a result of the transformation, in addition to getting stretched. However, a question that doesn’t seem worth asking now but will later is whether there are any vectors that don’t change direction as a result of the transformation, either staying the same or just getting scaled. That is, are there vectors (\(\mathtt{r_1, r_2}\)), such that (using lambda, \(\mathtt{\lambda}\), as a constant to be cool again): \[\begin{bmatrix}\mathtt{2} & \mathtt{-3}\\\mathtt{0} & \mathtt{\,\,\,\,5}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix} = \mathtt{\lambda}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\]

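A good guess would be that any ‘horizontal’ vector would not change direction, since the original (1, 0) was only scaled to (2, 0). Anyway, remembering that the identity matrix represents the do-nothing transformation, we can also write the above equation like this: \[\begin{bmatrix}\mathtt{2} & \mathtt{-3}\\\mathtt{0} & \mathtt{\,\,\,\,5}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix} = \mathtt{\lambda}\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix} = \begin{bmatrix}\mathtt{\lambda} & \mathtt{0}\\\mathtt{0} & \mathtt{\lambda}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\]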

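And although we haven’t yet talked about the idea that you can combine transformation matrices (add and subtract them), let me just say now that you can do this. So, we can manipulate the sides of the equation above (the far left and far right) and rewrite using the Distributive Property in reverse to get: \[\begin{bmatrix}\mathtt{2\,-\,\lambda} & \mathtt{-3}\\\mathtt{0} & \mathtt{5\,-\,\lambda}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix} = \mathtt{0}\]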

The vector (\(\mathtt{r_1}, \mathtt{r_2}\)) could, of course, always be the zero vector. But we ignore that solution and assume that it represents some non-zero vector. Given this assumption, the transformation matrix that has the lambdas subtracted from integers above must have a determinant of 0. We haven’t talked about that last point yet either, but it should make some sense even now. If a transformation matrix takes a non-zero vector (a one-dimensional ray, so to speak) to zero, no positive areas will survive. If you take the side of a square and reduce one of its dimensions to zero, it becomes a one-dimensional object with no area.

Getting the Eigenvalues and Eigenvectors

Moving on, we know how to calculate the determinant, and we know that the determinant must be 0. The off-diagonal product, \(\mathtt{(-3)(0)}\), contributes nothing, so \(\mathtt{(2 – \lambda)(5 – \lambda) = 0}\). The solutions here are \(\mathtt{\lambda = 2}\) and \(\mathtt{\lambda = 5}\). These two numbers are the eigenvalues. To get the eigenvectors, plug each of the eigenvalues into that transformation matrix above and solve for the vector. For \(\mathtt{\lambda = 2}\), that looks like this: \[\begin{bmatrix}\mathtt{2\,-\,2} & \mathtt{-3}\\\mathtt{0} & \mathtt{5\,-\,2}\end{bmatrix}\begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix} = \mathtt{0}\]

We have to kind of fudge a solution to that system of equations, but in the end we wind up with the result that one of the eigenvectors will be any vector of the form \(\mathtt{(c, 0)}\), where c represents any number. This confirms our earlier intuition that one of the vectors that will not change directions will be any ‘horizontal’ vector. The eigenvalue tells us that any vector of this form will be stretched by a factor of 2 in the transformation.

A similar process with the eigenvalue of 5 results in an eigenvector of the form \(\mathtt{(c, -c)}\). Any vector of this form will not change its direction as a result of the transformation, but will be scaled by a factor of 5.
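To double-check both eigenpairs without any fudging, here is a small Python sketch. It assumes the transformation matrix has columns (2, 0) and (–3, 5), as described above, and the helper names are mine. Instead of factoring by hand, it solves the characteristic equation with the quadratic formula, written in terms of the trace and the determinant.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from lambda^2 - trace*lambda + det = 0."""
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("this matrix has no real eigenvalues")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)


def apply(matrix, v):
    """Multiply a 2x2 matrix (stored as two rows) by the vector v."""
    (a, b), (c, d) = matrix
    return (a * v[0] + b * v[1], c * v[0] + d * v[1])


A = ((2, -3), (0, 5))
print(eigenvalues_2x2(2, -3, 0, 5))
print(apply(A, (1, 0)))   # a 'horizontal' vector, should only be stretched
print(apply(A, (1, -1)))  # a vector of the form (c, -c)
```

If the reasoning above is right, (1, 0) should come back as (2, 0) and (1, –1) should come back as (5, –5), each one a scaled copy of the vector that went in.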

Check out and play with this interactive to watch how the transformation matrix works and to watch how the eigenvectors appear in the transformation. Be sure to check out the video linked at the top too!

My grandfather used to tell me a story about a young boy who was stuck in traffic with his family for hours because an 18-wheeler had got itself pinned under an overpass bridge ahead of them. The huge truck was wedged in so strongly and strangely that a flock of engineers had descended on the scene. They argued back and forth about their favorite physical and mathematical models that would unpin the trapped vehicle and release the miles-long stream of cars idling behind it on the freeway. This bickering went on for hours—until the boy got out of his car, walked up to the group of engineers, and shouted, “Why don’t you just let the air out the tires!”

It’s a nice story, precisely because it’s so rare and noticeable. We don’t notice unbroken strings of solved problems from experts, because that’s what we expect of experts and, for the most part, what we get from them. We notice when they fail. And, because these failures are more noticeable than the far more boring and numerous successes, we fall prey to availability bias, and assume that expert failure occurs with much more regularity than it actually does. (In turn, we start to think that it’s maybe a good idea to keep students naive and, therefore, creative and open-minded rather than have them study things that other people have already figured out.) As Tom Nichols writes in The Death of Expertise:

At the root of all this is an inability among laypeople to understand that experts being wrong on occasion about certain issues is not the same thing as experts being wrong consistently on everything. The fact of the matter is that experts are more often right than wrong, especially on essential matters of fact. And yet the public constantly searches for the loopholes in expert knowledge that will allow them to disregard all expert advice they don’t like.

A 2008 study which put this folk notion of expert inflexibility to the test compared chess experts and novices, and measured the famous Einstellung effect in both groups across three experiments.

In the first experiment, the experts were given the board on the left and were instructed to find the shortest solution. The board on the left is designed to activate a motif familiar to chess experts (and thus activate Einstellung)—the smothered mate motif—which can be carried out using 5 moves. A shorter solution (3 moves) also exists, however.

If the experts failed to find the three-move solution, they were then given the board on the right. This board can be solved by the shorter three-move solution but not by the Einstellung motif of the smothered mate. The group of novices in the experiment were all given this second board (the one on the right) featuring the three-move mate solution without the Einstellung motif as well.

Findings

If knowledge corrupts insight, as it were, then the experts would, by and large, be fixated by the smothered mate sequence and miss the three-move solution. And this is indeed what happened—sort of. What the researchers found was that level of expertise correlated strongly with the results. Grandmasters (those with the highest levels of chess expertise) were not taken in by the Einstellung motif at all. Every one of them found the optimal three-move solution. However, experts with lower ratings, such as International Masters, Masters, and Candidate Masters, all experienced the Einstellung effect, with 50%, 18%, and 0%, respectively, finding the shorter solution on the first board, even though all of them found the optimal solution when it was presented on the second board, in the absence of the smothered mate motif.

The novices’ performance showed a positive correlation with rating also. Sixty-three percent of the highest rated (Class A) players in the novices group found the optimal solution on the right-hand board, while 13% of Class B players and 0% of Class C players found the three-move solution. Thus, the Einstellung effect made International Masters perform like Class A players, Masters perform like Class B players, and Candidate Masters perform like Class C players.

Experiment 2 replicated the above finding in a slightly more naturalistic setting, and Experiment 3 did so with strategic Einstellungs instead of tactical ones.

Knowledge Is Essential for Cognitive Flexibility

While this study shows that Einstellung effects are powerful and observable in expert performance, it also demonstrates that the notion that expertise causes cognitive inflexibility is probably wrong.

The failure of the ordinary experts to find a better solution when they had already found a good one supports the view that experts can be vulnerable to inflexible thought patterns. But the performance of the super experts shows that ‘experts are inflexible’ would be the wrong conclusion to draw from this failure. The Einstellung effect is very powerful—the problem solving capability of our ordinary experts was reduced by about three SDs when a well-known solution was apparent to them. But the super experts, at least with the range of difficulty of problems used here, were less susceptible to the effect. Greater expertise led to greater flexibility, not less.

Knowledge, and the expertise inevitably linked to it, were also responsible for both forms of expert flexibility demonstrated in the experiments. The optimal solution was more likely to be noticed immediately, even before the nominally more familiar solution, among some super experts. Hence, expertise helped super experts avoid an Einstellung situation in the first place because they immediately found the optimal solution. Even when experts did not find the optimal solution immediately, expertise and knowledge were positively associated with the probability of finding the optimal solution after the non-optimal solution had been generated first. Finally, when knowledge discrepancy was minimized, as in the third experiment, super experts had sufficient resources to outperform their slightly weaker colleagues. In all three instances, knowledge was inextricably and positively related to expert flexibility. . . .

The training required to produce experts should not be seen as a source of potential problems but as a way to acquire the skill to deal effectively and flexibly with all the situations that can arise in the domain. Creativity is a consequence of expertise rather than expertise being a hindrance to creativity. To produce something novel and useful it is necessary first to master the previous knowledge in the domain. More knowledge empowers creativity rather than hurting it (e.g., Kulkarni & Simon, 1988; Simonton, 1997; Weisberg, 1993, 1999).

Okay, now let’s move stuff around with linear algebra. We’ll eventually do rotations, reflections, and maybe translations too, while mixing that up with stretchings and skewings and other things that matrices can do for us.

We learned here that a matrix gives us information about two arrows: the x-axis arrow and the y-axis arrow. What we really mean is that a 2 × 2 matrix represents a transformation of 2D space. This transformation is given by 2 column vectors, the 2 columns of the matrix. The identity matrix, as we saw previously, represents the do-nothing transformation: \[\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

Another way to look at this matrix is that it tells us about the 2D space we’re looking at and how to interpret ANY vector in that space. So, what does the vector (1, 2) mean here? It means take 1 of the (1, 0) vectors and add 2 of the (0, 1) vectors.

But what if we reflect the entire coordinate plane across the y-axis? That’s a new system, and it’s a system given by where the blue and orange vectors would be under that reflection: \[\begin{bmatrix}\mathtt{-1} & \mathtt{0}\\\mathtt{\,\,\,\,0} & \mathtt{1}\end{bmatrix}\]

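In that new system, we can guess where the vector (1, 2) will end up. It will just be reflected across the y-axis. But matrix-vector multiplication allows us to figure that out by just multiplying the vector and the matrix: \[\begin{bmatrix}\mathtt{-1} & \mathtt{0}\\\mathtt{\,\,\,\,0} & \mathtt{1}\end{bmatrix}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix} = \mathtt{1}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,0}\end{bmatrix} + \mathtt{2}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix} = \begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix}\]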

This opens up a ton of possibilities for specifying different kinds of transformations. And it makes it pretty straightforward to specify transformations and play with them—just set the two column vectors of your matrix and see what happens! We can rotate and reflect the column vectors and scale them up together or separately.
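Here is one way to play with that idea in Python. It is a bare-bones sketch, and the choice to store a matrix as a pair of column vectors is my own, made so the code reads the way the paragraphs above do: take some of the first column and some of the second column, and add.

```python
def transform(columns, v):
    """Apply a 2x2 matrix, given as its two column vectors, to the vector v."""
    (c1x, c1y), (c2x, c2y) = columns
    # v[0] copies of the first column plus v[1] copies of the second column.
    return (v[0] * c1x + v[1] * c2x, v[0] * c1y + v[1] * c2y)


identity = ((1, 0), (0, 1))    # the do-nothing transformation
reflect_y = ((-1, 0), (0, 1))  # flip the 'horizontal' column vector only

print(transform(identity, (1, 2)))
print(transform(reflect_y, (1, 2)))
```

Swap in any two column vectors you like for `reflect_y` and see where (1, 2) goes.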

Rotations

Let’s start with rotations. And we’ll throw in some scaling too, just to make it more interesting. The image shows a coordinate system that has been rotated –135°, by rotating our column vectors from the identity matrix by that degree. The coordinate system has also been dilated by a factor of 0.5. This results in \(\mathtt{\triangle{ABC}}\) rotated –135° and scaled down by a half as shown.

What matrix represents this new rotated and scaled down system? The rotation of the first column vector, (1, 0), can be represented as (\(\mathtt{cos\,θ, sin\,θ}\)). And the second column vector, which is (0, 1) before the rotation, is perpendicular to the first column vector, so we just flip the components and make one of them the opposite of what it originally was: (\(\mathtt{-sin\,θ, cos\,θ}\)). So, a general rotation matrix looks like the matrix on the left. The rotation matrix for a –135° rotation is on the right: \[\begin{bmatrix}\mathtt{cos \,θ} & \mathtt{-sin\,θ}\\\mathtt{sin\,θ} & \mathtt{\,\,\,\,\,cos\,θ}\end{bmatrix}\quad\quad\begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{2}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{2}}\\\mathtt{-\frac{\sqrt{2}}{2}} & \mathtt{-\frac{\sqrt{2}}{2}}\end{bmatrix}\]

You can eyeball that the rotation matrix is correct by interpreting the columns of the matrix as the new positions of the horizontal vector and vertical vector, respectively (the new coordinates they are pointing to). A –135° rotation is a clockwise rotation of 90° + 45°.

Now for the scaling, or dilation by a factor of 0.5. This is accomplished by the matrix on the left, which, when multiplied by the rotation matrix on the right, will give us the one combo transformation matrix: \[\begin{bmatrix}\mathtt{\frac{1}{2}} & \mathtt{0}\\\mathtt{0} & \mathtt{\frac{1}{2}}\end{bmatrix}\begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{2}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{2}}\\\mathtt{-\frac{\sqrt{2}}{2}} & \mathtt{-\frac{\sqrt{2}}{2}}\end{bmatrix} = \begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{4}}\\\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{-\frac{\sqrt{2}}{4}}\end{bmatrix}\]

The result is another 2 × 2 matrix, with two column vectors. The calculations below show how we find those two new column vectors: \[\mathtt{-\frac{\sqrt{2}}{2}}\begin{bmatrix}\mathtt{\frac{1}{2}}\\\mathtt{0}\end{bmatrix} + -\frac{\sqrt{2}}{2}\begin{bmatrix}\mathtt{0}\\\mathtt{\frac{1}{2}}\end{bmatrix} = \begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{4}}\\\mathtt{-\frac{\sqrt{2}}{4}}\end{bmatrix}\quad\quad\mathtt{\frac{\sqrt{2}}{2}}\begin{bmatrix}\mathtt{\frac{1}{2}}\\\mathtt{0}\end{bmatrix} + -\frac{\sqrt{2}}{2}\begin{bmatrix}\mathtt{0}\\\mathtt{\frac{1}{2}}\end{bmatrix} = \begin{bmatrix}\mathtt{\,\,\,\,\frac{\sqrt{2}}{4}}\\\mathtt{-\frac{\sqrt{2}}{4}}\end{bmatrix}\]
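The same combination can be checked numerically. The sketch below is only an illustration with helper names of my own choosing. It stores each matrix as a pair of rows and uses the usual row-times-column rule for the product, which should give the same entries as the column-by-column calculation above.

```python
import math


def rotation_matrix(degrees):
    """Rows of the rotation matrix for an angle given in degrees."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def matmul(m, n):
    """Product of two 2x2 matrices stored as tuples of rows."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


scale = ((0.5, 0.0), (0.0, 0.5))
combo = matmul(scale, rotation_matrix(-135))
for row in combo:
    print([round(x, 4) for x in row])
```

Every entry should have a size of about 0.3536, which is \(\mathtt{\frac{\sqrt{2}}{4}}\), with the signs matching the combo matrix above.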

Now for the Point of Rotation

We’ve got just one problem left. Our transformation matrix, let’s call it \(\mathtt{A}\), is perfect, but we don’t rotate around the origin. So, we have to do some adding to get our final expression. To rotate, for example, point B around point C, we don’t use point B’s position vector from the origin—we rewrite this vector as though point C were the origin. So, point B has a position vector of B – C = (1, 0) in the point C–centered system. Once we’re done rotating this new position vector for point B, we have to add the position vector for C back to the result. So, we get: \[\mathtt{B’} = \begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{4}}\\\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{-\frac{\sqrt{2}}{4}}\end{bmatrix}\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix} + \begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix} = \begin{bmatrix}\mathtt{2\,-\,\frac{\sqrt{2}}{4}}\\\mathtt{2\,-\,\frac{\sqrt{2}}{4}}\end{bmatrix}\]

Which gives us a result, for point B’, of approximately (1.65, 1.65). We can do the calculation for point A as well: \[\,\,\,\,\,\mathtt{A’} = \begin{bmatrix}\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{4}}\\\mathtt{-\frac{\sqrt{2}}{4}} & \mathtt{-\frac{\sqrt{2}}{4}}\end{bmatrix}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,2}\end{bmatrix} + \begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix} = \begin{bmatrix}\mathtt{2\,+\,\frac{3\sqrt{2}}{4}}\\\mathtt{2\,-\,\frac{\sqrt{2}}{4}}\end{bmatrix}\]

This puts A’ at about (3.06, 1.65). Looks right! By the way, the determinant is \(\mathtt{\frac{1}{4}}\) (go calculate that for yourself). This is no surprise, of course, since a dilation by a factor of 0.5 will scale areas down to one fourth of their original size. The rotation has no effect on the determinant, because rotations do not affect areas.

Our general formula, then, for a rotation through \(\mathtt{θ}\) of some point \(\mathtt{x}\) (as represented by a position vector) about some point \(\mathtt{r}\) (also represented by a position vector) is: \[\mathtt{x’} = \begin{bmatrix}\mathtt{cos\,θ} & \mathtt{-sin\,θ}\\\mathtt{sin\,θ} & \mathtt{\,\,\,\,\,cos\,θ}\end{bmatrix}\begin{bmatrix}\mathtt{x_1\,-\,r_1}\\\mathtt{x_2\,-\,r_2}\end{bmatrix} + \begin{bmatrix}\mathtt{r_1}\\\mathtt{r_2}\end{bmatrix}\]
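As a rough sketch of that formula in Python: the function name is hypothetical, and the optional `scale` argument is my addition so that the 0.5 dilation from the example can be included (the general formula above has no scaling in it). The coordinates B = (3, 2), A = (1, 4), and C = (2, 2) are assumptions worked backward from B – C = (1, 0), A – C = (–1, 2), and the (2, 2) that gets added back.

```python
import math


def rotate_about(x, r, degrees, scale=1.0):
    """Rotate point x about point r by the given angle, then scale about r."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    # Rewrite x as though r were the origin.
    dx, dy = x[0] - r[0], x[1] - r[1]
    # Apply the rotation matrix, scale, and then add r back.
    return (scale * (c * dx - s * dy) + r[0], scale * (s * dx + c * dy) + r[1])


C = (2, 2)
print(rotate_about((3, 2), C, -135, scale=0.5))  # point B
print(rotate_about((1, 4), C, -135, scale=0.5))  # point A
```

The two results should come out near (1.65, 1.65) and (3.06, 1.65), matching B’ and A’ above.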

I want to get to moving stuff around using vectors and matrices, but I’ll stop for a second and touch on the determinant, since linear algebra seems to think it’s important. And, to be honest, it is kind of interesting.

The determinant is the area of the parallelogram created by two vectors. Two vectors will always create a parallelogram like the one shown below, unless they are just scaled versions of each other—but we’ll get to that.

The two vectors shown here are \(\color{blue}{\mathtt{u} = \begin{bmatrix}\mathtt{u_1}\\\mathtt{u_2}\end{bmatrix}}\) and \(\color{red}{\mathtt{v} = \begin{bmatrix}\mathtt{v_1}\\\mathtt{v_2}\end{bmatrix}}\).

We can determine the area of the parallelogram by first determining the area of the large rectangle and then subtracting the triangle areas. Note, by the way, that there are two pairs of two congruent triangles.

So, the area of the large rectangle is \(\mathtt{(u_1 + -v_1)(u_2 + v_2)}\). The negative is interesting. We need it because we want to use positive values when calculating the area of the rectangle. If you play around with different pairs of vectors and different rectangles, you will notice that one of the vector components will always have to be negative in the area calculation, if a parallelogram is formed.

The two large congruent right triangles have a combined area of \(\mathtt{u_{1}u_{2}}\). And the two smaller congruent right triangles have a combined area of \(\mathtt{-v_{1}v_{2}}\). Thus, distributing and subtracting, we get \[\mathtt{u_{1}u_{2} + u_{1}v_{2} – v_{1}u_{2} – v_{1}v_{2} – u_{1}u_{2} – (-v_{1}v_{2})}\]

Then, after simplifying, we have \(\mathtt{u_{1}v_{2} – u_{2}v_{1}}\). If the two vectors u and v represented a linear transformation and were written as column vectors in a matrix, then we could say that there is a determinant of the matrix and show the determinant of the matrix in the way it is usually presented: \[\begin{vmatrix}\mathtt{u_1} & \mathtt{v_1}\\\mathtt{u_2} & \mathtt{v_2}\end{vmatrix} = \mathtt{u_{1}v_{2} – u_{2}v_{1}}\]

One thing to note is that this is a signed area. The sign records a change in orientation that we won’t go into at the moment. In fact, describing the determinant as an area is a little misleading. When you look at transformations, the determinant tells you the scale factor of the change in area. A determinant of 1 would mean that areas did not change, etc. Also, if we have vectors that are simply scaled versions of one another—the components of one vector are scaled versions of the other—then the determinant will be zero, which is pretty much what we want, since the area will be zero. Let’s use lambda (\(\mathtt{\lambda}\)) as our scalar to be cool. \[\,\,\,\,\,\,\quad\,\,\,\,\,\begin{vmatrix}\mathtt{u_1} & \mathtt{\lambda u_1}\\\mathtt{u_2} & \mathtt{\lambda u_2}\end{vmatrix} = \mathtt{\lambda u_{1}u_{2} – \lambda u_{1}u_{2} = 0}\]
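The formula is short enough that a one-line function covers it. Here is a small Python sketch (the name `det2` and the sample vectors are just for illustration) that shows the sign flip, the zero case, and the scale-factor reading of the determinant.

```python
def det2(u, v):
    """Determinant of the 2x2 matrix whose columns are the vectors u and v."""
    return u[0] * v[1] - u[1] * v[0]


print(det2((1, 0), (0, 1)))   # the identity: areas are unchanged
print(det2((0, 1), (1, 0)))   # same two arrows swapped: the sign flips
print(det2((2, 1), (6, 3)))   # (6, 3) is 3 * (2, 1): no parallelogram, no area
print(det2((2, 0), (-3, 5)))  # areas get scaled by this factor
```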

So, we’ve jumped around a bit in what is turning into an introduction to linear algebra. The posts here, here, here, here, here, and here show the ground we’ve covered so far—although, saying it that way implies that we’ve moved along continuous patches of ground, which is certainly not true. We skipped over adding and scaling vectors and have focused on concepts which have close analogs to current high school algebra and geometry topics.

Now we’ll jump to the concept of a matrix. A matrix gives you information about two arrows: the x-axis arrow, if you will, and the y-axis arrow. The matrix below, for example, tells you that you are in the familiar xy coordinate plane, with the x arrow, or x vector, extending from the origin to (1, 0) and the y arrow, or y vector, going from the origin to (0, 1). \[\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\]

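This is a kind of home-base matrix, and it is called the identity matrix. If we multiply a vector by this matrix, we’ll always get back the vector we put in. The equation below shows how this matrix-vector multiplication is done with the identity matrix and the vector (1, 2), as shown at the right. \[\begin{bmatrix}\mathtt{1} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix} = \mathtt{1}\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix} + \mathtt{2}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix} = \begin{bmatrix}\mathtt{1 + 0}\\\mathtt{0 + 2}\end{bmatrix}\]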

As you can see on the far right of the equation, the result is (1 + 0, 0 + 2), or (1, 2), the vector we started with.

A Linear Transformation

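Now let’s take the vector at (1, 2) and map it to (0, 2). We’re looking for a matrix that can accomplish this, a transformation of the coordinate system that will map (1, 2) to (0, 2). If we shrink the horizontal vector to (0, 0) and keep the vertical vector the same, that would seem to do the trick. \[\begin{bmatrix}\mathtt{0} & \mathtt{0}\\\mathtt{0} & \mathtt{1}\end{bmatrix}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix} = \mathtt{1}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix} + \mathtt{2}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix} = \begin{bmatrix}\mathtt{0}\\\mathtt{2}\end{bmatrix}\]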

And it does! This matrix is called a projection matrix, and it takes any vector and shmooshes it onto the y-axis. We could do the same for any vector and the x-axis by starting from the identity matrix again, zeroing out the second column, and keeping the first column the same.

You can try out all kinds of different numbers to see their effects. You can do rotations, reflections, and scalings, among other things. The transformation shown at right, for example, where the two column vectors are taken to (1, 1) and (–1, 1), respectively, maps the vector (1, 2) to the vector (–1, 3).

You may notice, by the way, that what we did with the matrix above was to first rotate the column vectors by 45° and then scale them up by a factor of \(\mathtt{\sqrt{2}}\). Each of these two transformations has its own matrix. \[\begin{bmatrix}\mathtt{\frac{\sqrt{2}}{\,\,2}} & \mathtt{\frac{-\sqrt{2}}{\,\,2}}\\\mathtt{\frac{\sqrt{2}}{\,\,2}} & \mathtt{\,\,\,\,\frac{\sqrt{2}}{2}}\end{bmatrix} \leftarrow \textrm{Rotate by 45}^\circ \textrm{.} \quad \quad \begin{bmatrix}\mathtt{\sqrt{2}} & \mathtt{0}\\\mathtt{0} & \mathtt{\sqrt{2}}\end{bmatrix} \leftarrow \textrm{Scale up by }\sqrt{2}\textrm{.}\]

Then, we can combine these matrices by multiplying them to produce the transformation matrix we needed. Each column of the right-hand matrix tells us how much of each column of the left-hand matrix to add together, and each of those sums is one column vector of the resulting matrix. We’ll look at that more in the future.
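For a preview of how that combination plays out, here is a Python sketch that builds the product one column at a time. Matrices are stored as pairs of column vectors, and the names `apply` and `compose` are my own. The idea it illustrates is that each column of the product is the left-hand matrix applied to a column of the right-hand matrix.

```python
import math


def apply(columns, v):
    """Apply a 2x2 matrix, stored as two column vectors, to the vector v."""
    (ax, ay), (bx, by) = columns
    return (v[0] * ax + v[1] * bx, v[0] * ay + v[1] * by)


def compose(left, right):
    """Columns of the product: apply `left` to each column of `right`."""
    return (apply(left, right[0]), apply(left, right[1]))


h = math.sqrt(2) / 2
rotate_45 = ((h, h), (-h, h))  # columns (cos 45, sin 45) and (-sin 45, cos 45)
scale_up = ((math.sqrt(2), 0.0), (0.0, math.sqrt(2)))

combined = compose(scale_up, rotate_45)
print(combined)
print(apply(combined, (1, 2)))
```

Up to floating-point rounding, the combined columns should be (1, 1) and (–1, 1), and the vector (1, 2) should land on (–1, 3), just as in the transformation above.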

I’d almost always prefer to solve a problem using what I already know, if that can be done, rather than learn something new in order to solve it. After that, I’m happy to see how the new learning relates to what I already know. That’s what I’ll do here. There is a way to use the dot product efficiently to determine the distance of a point to a line, but we already know enough to get at it another way, so let’s start there.

So, suppose we know this information about the diagram at the right: \[\mathtt{p=}\begin{bmatrix}\mathtt{4}\\\mathtt{2}\end{bmatrix}, \,\,\,\mathtt{x=}\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix}, \,\,\,\mathtt{r=}\begin{bmatrix}\mathtt{-1}\\\mathtt{-3}\end{bmatrix}\] And we want to know the distance \(\mathtt{r}\) is from the line.

An equation for the distance of \(\mathtt{r}\) to the line, then—a symbolic way to identify this distance—might be given in words as follows: go to point \(\mathtt{p}\), then scale to some point on the line. From that point, scale to some point on the vector that is perpendicular to the line until you get to point \(\mathtt{r}\). In symbols, that could be written as: \[\begin{bmatrix}\mathtt{4}\\\mathtt{2}\end{bmatrix}\mathtt{+\,\,\,\, j}\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix}\mathtt{+\,\,\,\,k}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,\,2}\end{bmatrix}\mathtt{\,\,=\,\,}\begin{bmatrix}\mathtt{-1}\\\mathtt{-3}\end{bmatrix}\] With the vector and scalar names, we could write this as \(\mathtt{p + j(p – x) + ka = r}\). The distance to the line depends on our figuring out what \(\mathtt{k}\) is. Once we have that, then the distance is just \(\mathtt{\sqrt{(ka_1)^2 + (ka_2)^2}}\).

We can subtract vectors from both sides of an equation just like we do with scalar values. Subtracting the vector (4, 2) from both sides, we get an equation which can be rewritten as a system of two equations \[\mathtt{j}\begin{bmatrix}\mathtt{2}\\\mathtt{1}\end{bmatrix}\mathtt{+\,\,\,\,k}\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,\,2}\end{bmatrix}\mathtt{\,\,=\,\,}\begin{bmatrix}\mathtt{-5}\\\mathtt{-5}\end{bmatrix} \rightarrow \left\{\begin{aligned}\mathtt{2j – k = -5} \\ \mathtt{j + 2k = -5}\end{aligned}\right.\]

Solving that system gives us \(\mathtt{j = -3}\) and \(\mathtt{k = -1}\). So, the distance of \(\mathtt{r}\) to the line is \(\mathtt{\sqrt{5}.}\)
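If you want to check that solution, here is a minimal Python sketch. Note that it uses Cramer's rule to solve the system, which is a technique we have not covered here and is not necessarily how you would solve it by hand. The function name and argument order are my own.

```python
import math


def solve_2x2(col1, col2, rhs):
    """Solve j*col1 + k*col2 = rhs for (j, k) using Cramer's rule."""
    det = col1[0] * col2[1] - col1[1] * col2[0]
    if det == 0:
        raise ValueError("the two column vectors are parallel")
    j = (rhs[0] * col2[1] - rhs[1] * col2[0]) / det
    k = (col1[0] * rhs[1] - col1[1] * rhs[0]) / det
    return j, k


a = (-1, 2)  # the perpendicular vector from the text
j, k = solve_2x2((2, 1), a, (-5, -5))
# The distance is the length of k * a.
distance = math.hypot(k * a[0], k * a[1])
print(j, k, distance)
```

The distance printed should be the square root of 5, or roughly 2.236.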

Can We Get to the Dot Product?

Maybe we can get to the dot product. I’m not sure at this point. But there are some interesting things to point out about what we’ve already done. First, we can see that the vector \(\mathtt{j(p-x)}\) is a scaling of vector \(\mathtt{(p-x)}\) along the line, which, when added to \(\mathtt{p}\), brings us to the right point on the line where some scaling of the perpendicular \(\mathtt{a}\) can intersect to give us the distance. The scalar \(\mathtt{j=-3}\) tells us to reverse the vector (2, 1) and stretch it by a factor of 3. Adding to \(\mathtt{p}\) means that all of that happens starting at point \(\mathtt{p}\).

Then the scalar \(\mathtt{k=-1}\) reverses the direction of \(\mathtt{a}\) to take us to \(\mathtt{r}\).

We can then use this diagram to at least show how the dot product gets us there. We modify it a little to include the parts we will need and talk about.

Okay, here we go. Let’s consider the dot product \(\mathtt{-a \cdot (r – p)}\). We know that since \(\mathtt{-a}\) and \(\mathtt{x-p}\) are perpendicular, their dot product is 0, but this is \(\mathtt{r-p}\), not \(\mathtt{x-p}\). So, \(\mathtt{-a \cdot (r – p)}\) will likely have some nonzero value. Their dot product is this \[\mathtt{-a \cdot (r – p) = |-a||r-p|\textrm{cos}(θ)}\] We got this by rearranging the formula we saw here.

We also know, however, that we can use the cosine of the same angle in representing the distance, d: \[\mathtt{d=|r-p|\textrm{cos}(θ)}\]

Putting those two equations together, we get \(\mathtt{d = \frac{a \cdot (r – p)}{|a|}}\).

We can forget about the negative in front of \(\mathtt{a}\). But you may want to play around with it to convince yourself of that. A nice feature of determining the distance this way is that the distance is signed. It is negative below the line and positive above it.
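Here is the signed-distance formula as a short Python sketch, using p = (4, 2), r = (–1, –3), and the perpendicular a = (–1, 2) from earlier. The function name and the second test point, (4, 5), are made up for illustration. I picked (4, 5) because it sits directly above p, and so above the line.

```python
import math


def signed_distance(a, p, r):
    """Signed distance from r to the line through p that is perpendicular to a."""
    rp = (r[0] - p[0], r[1] - p[1])
    # d = a . (r - p) / |a|
    return (a[0] * rp[0] + a[1] * rp[1]) / math.hypot(a[0], a[1])


a = (-1, 2)
p = (4, 2)
print(signed_distance(a, p, (-1, -3)))  # r is below the line
print(signed_distance(a, p, (4, 5)))    # a point above the line
```

The first value should be negative with a size of about 2.236, the square root of 5 found earlier, and the second should be positive.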

At the heart of many calls to improve education is the taken-for-granted notion that because the world is now changing so rapidly, it is better for schools to focus on producing innovative and critical thinkers and ‘not just’ knowledgable students. The common instructional approach deployed, at all scales, to produce this effect—whether it is inquiry learning or personalized learning—is to remove or dramatically lessen the influence of knowledgable others.

But important research on learning strategies in the wild shows that, at the very least, different intuitions are possible here. Researchers discovered—much to their surprise—that, in a rapidly changing environment, copying the effective behaviors of knowledgable others (social learning) could be a much more effective learning strategy than learning directly from the environment (asocial learning). This result held even when social learning was “noisy” and asocial learning was noise free.

The team has gone on to further investigate and apply their findings to other animal studies, and a book, Darwin’s Unfinished Symphony, was released just last year, detailing their work.

Social Learning Strategies Tournament

The method used for this research was a tournament in which the researchers designed a computer simulation environment and entrants to the tournament (104 in all) designed ‘agents’ that competed to survive in the generated environment by learning behaviors and applying them to receive payoffs for those behaviors. Each agent had three possible moves it could play: Observe, Innovate, or Exploit. The first two of these moves—Observe and Innovate—were learning moves, which allowed the agent to acquire new behaviors (or not in some cases), and the third move, Exploit, allowed agents to apply their acquired behaviors to receive a payoff (or not, depending on the environment and the behavior). As was mentioned above, Observe moves were “noisy,” whereas Innovate moves were noise free:

Innovate represented asocial learning, that is, individual learning stemming solely through direct interaction with the environment, for example, through trial and error. An Innovate move always returned accurate information about the payoff of a randomly selected behavior previously unknown to the agent. Observe represented any form of social learning or copying through which an agent could acquire a behavior performed by another individual, whether by observation of or interaction with that individual. An Observe move returned noisy information about the behavior and payoff currently being demonstrated in the population by one or more other agents playing Exploit. Playing Observe could return no behavior if none was demonstrated or if a behavior that was already in the agent’s repertoire is observed and always occurred with error, such that the wrong behavior or wrong payoff could be acquired. The probabilities of these errors occurring and the number of agents observed were parameters we varied.

Some Key Findings

It was not effective to play a lot of learning moves. But when learning moves were played, agents which relied almost exclusively on Observe outperformed the rest, and an increase in copying was strongly positively correlated with higher payoffs. When the winning agent (called DISCOUNTMACHINE) was modified to learn only through Innovate moves, it placed last.

Even when learning by copying was made noisier—the probability and size of copying errors increased—agents which relied on it heavily still did best.

Finally, agents that combined asocial and social learning in more balanced ways performed worse than those that opted for social learning most of the time (winning agents used social learning at least 95% of the time).

Why Copying Is Effective

It must be underscored, again, that, in more naturalistic environments there is a cost to asocial learning that copying does not have. Learning by observation is safer than learning by interacting directly with the environment, alone. But in this simulation, that cost was erased. And social learning (copying) STILL outperformed innovation, even when social learning was noisy (Observe “failed to introduce new behavior into an agent’s repertoire in 53% of all the Observe moves in the first tournament phase, overwhelmingly because agents observed behaviors they already knew”).

So, why was copying effective? The researchers boiled it down to being surrounded by rational agents, which I choose to rephrase as “knowledgable adults”:

Social learning proved advantageous because other agents were rational in demonstrating the behavior in their repertoire with the highest payoff, thereby making adaptive information available for others to copy. This is confirmed by modified simulations wherein social learners could not benefit from this filtering process and in which social learning performed poorly. Under any random payoff distribution, if one observes an agent using the best of several behaviors that it knows about, then the expected payoff of this behavior is much higher than the average payoff of all behaviors, which is the expected return for innovating. Previous theory has proposed that individuals should critically evaluate which form of learning to adopt in order to ensure that social learning is only used adaptively, but a conclusion from our tournament is that this may not be necessary. Provided the copied individuals themselves have selected the best behavior to perform from at least two possible options, social learning will be adaptive.
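The statistical point in that quote (the best of several known behaviors beats a random one, on average) can be illustrated with a toy simulation. To be clear, this is not the tournament's code or anything close to it: the exponential payoff distribution, the function name, and the parameters are all my own assumptions, and there is no noise or environmental change in it.

```python
import random

def average_payoffs(known_behaviors=3, trials=20000, seed=0):
    """Compare the average payoff of a randomly drawn behavior (Innovate-like)
    with the best of several known behaviors (what an observer would copy)."""
    rng = random.Random(seed)
    innovate_total = 0.0
    observe_total = 0.0
    for _ in range(trials):
        # Innovating: the payoff of one behavior drawn at random.
        innovate_total += rng.expovariate(1.0)
        # Observing: the demonstrator knows several behaviors and performs the best one.
        observe_total += max(rng.expovariate(1.0) for _ in range(known_behaviors))
    return innovate_total / trials, observe_total / trials

innovate_avg, observe_avg = average_payoffs()
```

In this toy setup, `observe_avg` comes out noticeably higher than `innovate_avg`, which is the filtering effect the researchers describe.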

Any takeaways for education from this will be stretches. The research was a computer simulation, after all. But, whatever. My takeaway from all this is that, as long as there are knowledgable adults around, we should encourage students to learn directly from them. A milder takeaway (or maybe stronger, depending on your point of view): regardless of how adept you feel yourself to be in your social world, social worlds are not intuitive. What seems to make sense to you as a strong connection between ideas A and B (in this case, changing world → promote innovation) will not necessarily be effective just because a lot of people believe it and it makes intuitive sense. The way to change that is not to stop making those arguments, because few people do. The way to change it is to stop forwarding those kinds of arguments along when they are made. That way, the behavior won’t be copied. : )

Coda

I should add, by way of the quote below from Darwin’s Unfinished Symphony, that, although copying was a more successful strategy than innovating, it was not, by itself, the reason for success. What made the difference was better, more efficient, more accurate copying behaviors:

The tournament teaches us that natural selection will tend to favor those individuals who exhibit more efficient, more strategic, and higher-fidelity (i.e., more accurate) copying over others who either display less efficient or exact copying, or are reliant on asocial learning.

The dot product is helpful in finding the distance of a point to a line. The dot product, as we mentioned here, is the sum of the element-wise products of the vector components. Given two vectors \(\mathtt{v}\) and \(\mathtt{w}\), their dot product is \[\begin{bmatrix}\mathtt{v_1}\\\mathtt{v_2}\end{bmatrix} \cdot \begin{bmatrix}\mathtt{w_1}\\\mathtt{w_2}\end{bmatrix}\mathtt{= v_1w_1 + v_2w_2}\]
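In code, that definition is nearly a one-liner. Here is a small Python sketch; the function name `dot` and the sample vectors are just my choices for illustration.

```python
def dot(v, w):
    """Dot product: multiply matching components, then add the products."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of components")
    return sum(vi * wi for vi, wi in zip(v, w))

example = dot((1, 2), (3, 4))        # 1*3 + 2*4
perpendicular = dot((2, 0), (0, 2))  # a horizontal and a vertical vector
```

Here `example` is 11 and `perpendicular` is 0, matching the idea that perpendicular vectors have a dot product of 0.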

The result of this computation is not another vector, but just a number, a scalar quantity. And, given that the dot product of two perpendicular vectors is 0, it would be nice if the dot product were related to cosine in some way, since the cosine of 90° is also 0. So let’s take a look at some vector pairs and their dot products and think about any patterns we see.

Well, so, the dot products have the same signs as the cosines. That’s a start. And in all but one case shown, we can divide the dot product by 4 to get the cosine. What makes the 45° case different?

Each of the vectors shown, with the exception of the vector (2, 2), has a length, a magnitude, of 2. To determine the magnitude, or length, of a vector, you treat the components of the vector as the legs of a right triangle and the vector itself as the hypotenuse. So, \[|\begin{bmatrix}\mathtt{-1}\\\mathtt{\sqrt{3}}\end{bmatrix}|=\sqrt{(-1)^2+(\sqrt{3})^2}=2\]

But the length of (2, 2) is \(\mathtt{\sqrt{8}}\). If we were to give that vector a length of 2, without changing the angle between v and w, then the vector would become (\(\mathtt{\sqrt{2}, \sqrt{2}}\)). And, lo, the dot product would become \(\mathtt{2\sqrt{2}}\), which, when divided by 4, would yield the cosine.

The 4 that we divide by isn’t random. It’s the product of the lengths of the vectors. If we leave the 45° angled vectors alone, the product of their lengths is \(\mathtt{2\sqrt{8}}\), and their dot product is 4. Dividing that dot product by the product of the lengths does indeed yield the correct cosine: \(\mathtt{\frac{4}{2\sqrt{8}} = \frac{\sqrt{2}}{2}}\). So, we have an initial conjecture that the dot product of two vectors v and w relates to cosine like this: \[\mathtt{\frac{v \cdot w}{|v||w|} = \textrm{cos}(θ)}\]
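Here is a quick Python sketch for testing the conjecture on a couple of the vectors mentioned above. The helper names are my own, and I am assuming the second vector in each pair is the horizontal vector (2, 0), which is consistent with the dot products described but is not stated outright.

```python
import math

def dot(v, w):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(vi * wi for vi, wi in zip(v, w))

def magnitude(v):
    """Length of a vector, by the Pythagorean Theorem."""
    return math.sqrt(dot(v, v))

def cosine_between(v, w):
    """Conjectured cosine of the angle between v and w: (v . w) / (|v||w|)."""
    return dot(v, w) / (magnitude(v) * magnitude(w))

w = (2, 0)  # assumed horizontal vector of length 2

cos_45 = cosine_between((2, 2), w)                # the 45-degree case
cos_120 = cosine_between((-1, math.sqrt(3)), w)   # the vector of length 2 from the magnitude example
```

Comparing `cos_45` with `math.cos(math.radians(45))`, and `cos_120` with `math.cos(math.radians(120))`, shows agreement up to floating-point rounding.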

Perpendicular vectors will still have a dot product of 0 with this formula, so that’s good. And we can scale the vectors however we want and the cosine should remain the same—as it should be—though it may take a little manipulation to see that that’s true. But we are still left with the puzzle of proving this conjecture, more or less, or at least demonstrating to our satisfaction that the result is general.
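For the scaling claim, here is one way the manipulation might go. Suppose we scale v by a positive number s and w by a positive number t (these two labels are mine). Then \[\mathtt{\frac{(sv) \cdot (tw)}{|sv||tw|} = \frac{st(v \cdot w)}{st|v||w|} = \frac{v \cdot w}{|v||w|}}\] The scale factors come out of the dot product and out of the lengths in the same way, so they cancel. With a negative scale factor, the vector flips direction and the cosine changes sign, which is also what we would expect.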

Although the derivation doesn’t go beyond the Pythagorean Theorem, really, it gets a little symbol heavy, so let’s start with something simpler. We can write the cosine of θ at the right as \[\mathtt{\textrm{cos}(θ)=\frac{|w|}{|v|}}\] If we think of w here as truly horizontal, its length is simply \(\mathtt{v_1}\), the length of the horizontal component of v. Combining this fact with the length of v, we can rewrite the cosine equation above as \[\mathtt{\textrm{cos}(θ)=\frac{v_1}{\sqrt{v_{1}^2+v_{2}^2}}}\]

Since w is horizontal (has a second component of 0), the dot product \(\mathtt{v \cdot w}\) becomes simply \(\mathtt{v_{1}^2}\). Dividing this by the product of the lengths of the vectors v and w (where the length of w is just \(\mathtt{v_1}\)), we get this equation for cosine: \[\mathtt{\,\,\,\,\,\textrm{cos}(θ)=\frac{v_{1}^2}{(v_1)(\sqrt{v_{1}^2+v_{2}^2})}}\] And that’s clearly equal to the above. So, while it is by no means definitive, we can have a little more confidence at this point that we have the right equation for cosine using the dot product. We can get more formal and sure about it later. Next time we’ll look at how it can help us determine the distance from a point to a line.