The Farey Mean?

I had never heard of the Farey mean, but here it is, brought to you by @howie_hua.

When you add two fractions, you of course remember that you should never just add the numerators and add the denominators across. The resulting fraction will not be the sum of the two addends. But if you do add across (under certain conditions which I’ll show below), the result will be a fraction between the two “addend” fractions. So, you can use the add-across method to find a fraction between two other fractions.

For example, \(\mathtt{\frac{1}{2}+\frac{4}{3}\rightarrow\frac{5}{5}}\). The first “addend” is definitely less than 1, and the second definitely greater than 1. The Farey mean here is exactly 1 (or \(\mathtt{\frac{5}{5}}\)), which is between the two “addend” fractions.
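
To play with this numerically, here's a minimal Python sketch (the function name farey_mean is mine, just for illustration):

```python
from fractions import Fraction

def farey_mean(a, b, c, d):
    """'Add across': the Farey mean (mediant) of a/b and c/d."""
    return Fraction(a + c, b + d)

m = farey_mean(1, 2, 4, 3)                   # 1/2 "+" 4/3 -> 5/5 = 1
print(m)                                     # 1
print(Fraction(1, 2) < m < Fraction(4, 3))   # True: it lands between the two fractions
```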

Why Does It Work?

Since this site is fast becoming all linear algebra all the time, let’s throw some linear algebra at this. What we want to show is that, given \(\mathtt{\frac{a}{b}<\frac{c}{d}}\) (we'll go with this assumption for now), \[\mathtt{\frac{a}{b}<\frac{a+c}{b+d}<\frac{c}{d}}\]

for certain positive integer values of \(\mathtt{a,b,c,}\) and \(\mathtt{d}\). I would probably do better to make those inequality signs less-than-or-equal-tos, but let’s stick with this for the present. We’ll start by representing the fraction \(\mathtt{\frac{a}{b}}\) as the vector \(\scriptsize\begin{bmatrix}\mathtt{b}\\\mathtt{a}\end{bmatrix}\) along with the fraction \(\mathtt{\frac{c}{d}}\) as the vector \(\scriptsize\begin{bmatrix}\mathtt{d}\\\mathtt{c}\end{bmatrix}\).

We’re looking specifically at the slopes or angles here (which is why we can represent a fraction as a vector in the first place), so we’ve made \(\scriptsize\begin{bmatrix}\mathtt{d}\\\mathtt{c}\end{bmatrix}\) have a greater slope to keep in line with our assumption above that \(\mathtt{\frac{a}{b}<\frac{c}{d}}\).

The fraction \(\mathtt{\frac{a+c}{b+d}}\) is the same as the vector \(\scriptsize\begin{bmatrix}\mathtt{b+d}\\\mathtt{a+c}\end{bmatrix}\). And since this vector is the diagonal of the vector parallelogram, it will of course have a greater slope than \(\mathtt{\frac{a}{b}}\) but less than \(\mathtt{\frac{c}{d}}\). You can keep going forever—just take one of the side vectors and use the diagonal vector as the other side. So long as you’re making parallelograms, each new diagonal’s slope will fall between the slopes of its two side vectors, so the result will be a fraction between the other two.

Incidentally, our assumption at the beginning that \(\mathtt{\frac{a}{b}<\frac{c}{d}}\) doesn't really matter to this picture. If we make \(\mathtt{\frac{c}{d}}\) less than \(\mathtt{\frac{a}{b}}\), our picture simply flips. The diagonal vector still has to be located between the two side vectors.

What Doesn’t Work?

The linear algebra picture of this concept also tells us where this method fails to find a fraction between the two addend fractions. When the two “addend” fractions are equivalent, \(\mathtt{c}\) and \(\mathtt{d}\) are multiples of \(\mathtt{a}\) and \(\mathtt{b}\), respectively, or vice versa. In that case, the resulting fraction looks like this.

The slopes or angles for both addends and for the result are the same, producing a Farey mean that is equal to both fractions.

Cosine Similarity and Correlation

I wrote a lesson not too long ago that started with a Would You Rather? survey activity. For our purposes here, we can pretend that each question had a Likert scale from 1–10 attached to it, though in reality, the lesson was about categorical data.

At any rate, here are the questions—edited a bit. Feel free to rate your answers on the scales provided. Careful! Once you click, you lock in your answer.

Would you rather . . .

  (a) be able to fly (1) or be able to read minds (10)?

  (b) go way back in time (1) or go way into the future (10)?

  (c) be able to talk to animals (1) or speak all languages (10)?

  (d) watch only historical movies (1) or sci-fi movies (10) for the rest of your life?

  (e) be just a veterinarian (1) or just a musician (10)?

Finally, one last question that is not a would-you-rather. Once you’ve answered this and the rest of the questions, you can press the I'm Finished! button to submit your responses.

  (f) Rate your fear of heights from (1) not at all afraid to (10) very afraid.

Check out the results so far.

Are Your Responses Correlated?

Next in the lesson, I move on to asking whether you think some of the survey responses are correlated. For example, if you scored “low” on the veterinarian-or-musician scale—meaning you would strongly prefer to be a veterinarian over a musician—would that indicate that you probably also scored “low” on Question (c) about talking to animals or speaking all the languages? In other words, are those two scores correlated? What about choosing the ability to fly and your fear of heights? Are those correlated? How could we measure this using a lot of responses from a lot of different people?

An ingenious way of looking at this question is by using cosine similarity from linear algebra. (We looked at the cosine of the angle between two vectors here and here.)

For example, suppose you really would rather have the ability to fly and you have almost no fear of heights. So, you answered Question (a) with a 1 and Question (f) with, say, a 2. Another person has no desire to fly and a terrible fear of heights, so they answer Question (a) with an 8 and Question (f) with a 10. From this description, we would probably guess that the two quantities wish-for-flight and fear-of-heights are strongly correlated. But we’ve also now got the vectors (1, 2) and (8, 10) to show us this correlation.

See that tiny angle between the vectors on the left? The cosine of tiny angles (as we saw) is close to 1, which indicates a strong correlation. On the right, you see the opposite idea. One person really wants to fly but is totally afraid of heights (1, 10) and another almost couldn’t care less about flying (or at least would really rather read minds) but has a low fear of heights (8, 2). The cosine of the close-to-90°-angle between these vectors will be close to 0, indicating a weak correlation between responses to our flight and heights questions.

But That’s Not the Ingenious Part

That’s pretty cool, but it is not, in fact, how we measure correlation. The first difficulty we encounter happens after adding more people to the survey, giving us several angles to deal with—not impossible, but pretty messy for a hundred or a thousand responses. The second, more important, difficulty is that the graph on the right above doesn’t show a weak correlation; it shows a strong negative correlation. Given just the two response pairs to work from in that graph, we would have to conclude that a strong fear of heights would make you more likely to want the ability to fly (or vice versa) rather than less likely. But the “weakest” the cosine can measure in this kind of setup is 0.

The solution to the first difficulty is to take all the x-components of the responses and make one giant vector out of them. Then do the same to the y-components. Now we’ve got just two vectors to compare! For our data on the left, the vectors (1, 2) and (8, 10) become (1, 8) and (2, 10). The vectors on the right—(1, 10) and (8, 2)—become (1, 8) and (10, 2).

The solution to the second difficulty—no negative correlations—we can achieve by centering the data. Let’s take our new vectors for the right-hand graph: (1, 8) and (10, 2). Add the components in each vector and divide by the number of components (2) to get an average. Then subtract the average from each component. So, our new centered vectors are

(1 – ((1 + 8) ÷ 2), 8 – ((1 + 8) ÷ 2)) and (10 – ((10 + 2) ÷ 2), 2 – ((10 + 2) ÷ 2))

Or (–3.5, 3.5) and (4, –4). It’s probably not too tough to see that a vector in the 2nd quadrant and a vector in the 4th quadrant are heading in opposite directions. In fact, these two centered vectors point in exactly opposite directions, forming a 180° angle, and the cosine of 180° is –1—the actual lowest value we can get, indicating a strong negative correlation.

And That’s Correlation

To summarize, the way to determine correlation linear-algebra style is to determine the cosine of the centered x- and y-vectors of the data. That formula is \[\mathtt{\frac{(x-\overline{x}) \cdot (y-\overline{y})}{|x-\overline{x}||y-\overline{y}|} = cos(θ)}\]

This is just another way of writing the more common formula for the r-value correlation.
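
Here’s a quick numpy check of that equivalence, using the two response pairs from the right-hand graph above (with only two data points, r comes out to exactly –1, which is the point of that example):

```python
import numpy as np

# Responses from the right-hand graph: (1, 10) and (8, 2),
# regrouped into an x-vector (flight) and a y-vector (heights)
x = np.array([1.0, 8.0])
y = np.array([10.0, 2.0])

xc = x - x.mean()   # center the vectors
yc = y - y.mean()

# Cosine of the angle between the centered vectors
r = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r)                          # -1.0
print(np.corrcoef(x, y)[0, 1])    # same value from the usual r formula
```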


Rotating Coordinate Systems

I’ve just started with Six Not-So-Easy Pieces, based on Feynman’s famous lectures, and already there’s some decently juicy stuff. In the beginning, Feynman discusses the symmetry of physical laws—that is, the invariance of physical laws under certain transformations (like rotations):

If we build a piece of equipment in some place and watch it operate, and nearby we build the same kind of apparatus but put it up on an angle, will it operate in the same way?

He goes on to explain that, of course, a grandfather clock will not operate in the same way under specific rotations. Assuming the invariance of physical laws under rotations, this change in operation tells us something interesting: that the operation of the clock is dependent on something outside of the “system” that is the clock itself.

The theorem is then false in the case of the pendulum clock, unless we include the earth, which is pulling on the pendulum. Therefore we can make a prediction about pendulum clocks if we believe in the symmetry of physical law for rotation: something else is involved in the operation of a pendulum clock besides the machinery of the clock, something outside it that we should look for.

Rotation Coordinates

Feynman then proceeds with a brief mathematical analysis of forces under rotations. A somewhat confusing prelude to this is a presentation that involves expressing the coordinates of a rotated system in terms of the original system. He uses the diagram below to derive those coordinates (except for the blue highlighting, which I use to show what (x’, y’) looks like in Moe’s system). What we want is to express (x’, y’) in terms of x and y—to describe Moe’s point P in terms of Joe’s point P.

“We first drop perpendiculars from P to all four axes and draw AB perpendicular to PQ.”

The first confusion that is not dealt with (because Feynman assumes his audience is advanced students) is which angles in the diagram are congruent to the θ shown. And here again we see the value of the easy-to-forget art of eyeballing and common sense in geometric reasoning.

The y’ axis is displaced just as much as the x’ axis by rotation, and “displaced just as much by rotation” is a perfectly good definition of angle congruence that we tend to forget after hundreds of hours of deriving work. The same reasoning applies to the rotational displacement from AP to AB. If we imagine rotating AP to AB, we see that we are starting perpendicular to the x-axis and ending perpendicular to the x’-axis. The y-to-y’ rotation does the same thing, so the displacement angle must be the same. So let’s put in those new thetas, only one of which we’ll need.

Inspection of the figure shows that x’ can be written as the sum of two lengths along the x’-axis, and y’ as the difference of two lengths along AB.

Here is x’ as the sum of two lengths (red and orange): \[\mathtt{x’=OA\cdot\color{red}{\frac{OC}{OA}}+AP\cdot\color{orange}{\frac{BP}{AP}}\quad\rightarrow\quad x\cdot\color{red}{cos\,θ}+y\cdot\color{orange}{sin\,θ}}\]

And here is y’ as the difference of two lengths (green – purple): \[\mathtt{y’=AP\cdot\color{green}{\frac{AB}{AP}}-OA\cdot\color{purple}{\frac{AC}{OA}}\quad\rightarrow\quad y\cdot\color{green}{cos\,θ}-x\cdot\color{purple}{sin\,θ}}\]

So, if Joe describes the location of point P to Moe, and the rotational displacement between Moe and Joe’s systems is known (and it is known that the two systems share an origin), Moe can use the manipulations above to determine the location of point P in his system.

Another, exactly equal, way of saying this—the way we said it when we talked about rotation matrices—is that, if we represent point P in Joe’s system as a position vector (x, y), then Moe’s point P vector is \[\small{\begin{bmatrix}\mathtt{x’}\\\mathtt{y’}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,\,cos\,θ} & \mathtt{sin\,θ}\\\mathtt{-sin\,θ} & \mathtt{cos\,θ}\end{bmatrix}\begin{bmatrix}\mathtt{x}\\\mathtt{y}\end{bmatrix}=\mathtt{x}\begin{bmatrix}\mathtt{\,\,\,\,\,cos\,θ}\\\mathtt{-sin\,θ}\end{bmatrix}+\mathtt{y}\begin{bmatrix}\mathtt{sin\,θ}\\\mathtt{cos\,θ}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,\,\,x\cdot \color{red}{cos\,θ}\,\,\,+y\cdot \color{orange}{sin\,θ}}\\\mathtt{-(x\cdot \color{purple}{sin\,θ})+y\cdot \color{green}{cos\,θ}}\end{bmatrix}}\]

The first rotation matrix above actually describes a clockwise rotation, which is both different from what we discussed at the link above (our final matrix there was for counterclockwise rotations) and unexpected, since we know that Moe’s system is a counterclockwise rotation of Joe’s system.

The resolution to that unexpectedness can again be found after a little eyeballing. The position vector for point P in Joe’s system is at a steep angle, whereas in Moe’s system, it is at a shallow angle. Only a clockwise rotation will change the coordinates in the appropriate way.
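
As a sketch of that transformation in code (the sample point and angle here are made up just to exercise the formula):

```python
import numpy as np

def joe_to_moe(x, y, theta):
    """Express Joe's point (x, y) in Moe's system, rotated by theta radians."""
    R = np.array([[ np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    return R @ np.array([x, y])

# x' = x cos θ + y sin θ,  y' = y cos θ - x sin θ
print(joe_to_moe(1.0, 2.0, np.radians(30)))   # a hypothetical point, rotated 30°
```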

Projections

A projection is like the shadow of a vector, say \(\mathtt{u}\), on another vector, say \(\mathtt{v}\), if light rays were coming in across \(\mathtt{u}\) and perpendicular to \(\mathtt{v}\). For the vectors at the right, imagine a light source above and to the left of the illustration, perpendicular to the vector \(\mathtt{v}\). The projection, which we’ll call \(\mathtt{p}\), will be a vector that will extend from the point shown to where the end of the shadow of \(\mathtt{u}\) touches \(\mathtt{v}\).

If you take a moment maybe to read that description twice (because it’s kind of dense), you may be able to tell that the vector \(\mathtt{p}\) that we’re looking for will be some scalar multiple of vector \(\mathtt{v}\), since it will lie exactly on top of \(\mathtt{v}\). In fact, given the picture, the projection vector \(\mathtt{p}\) will have the opposite sign from \(\mathtt{v}\) and will have a scale factor pretty close to 0, since the projection vector looks like it will be much smaller than \(\mathtt{v}\).

That information is shown at the right. Helpfully, the vector \(\mathtt{u-p}\) is perpendicular to \(\mathtt{v}\), so we know that \(\mathtt{\left(u-p\right)\cdot v=0}\). And, using the Distributive Property, we get \(\mathtt{u\cdot v-p\cdot v=0}\).

Since we know that our sought-after vector \(\mathtt{p}\) will be some scalar multiple of \(\mathtt{v}\), we can substitute, say, \(\mathtt{cv}\) for \(\mathtt{p}\) in the above to get \(\mathtt{u\cdot v-cv\cdot v=0}\). And a property of dot product multiplication allows us to rewrite that as \(\mathtt{u\cdot v-c\left(v\cdot v\right)=0}\).

This means that \(\mathtt{u\cdot v=c\left(v\cdot v\right)}\), which means that the scale factor \(\mathtt{c}\) that we’re after—the factor we can multiply by \(\mathtt{v}\) to produce \(\mathtt{p}\)—is \[\mathtt{c=\frac{u\cdot v}{v\cdot v}\quad\rightarrow\quad p=\left(\frac{u\cdot v}{v\cdot v}\right)v}\]
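
In code, with a couple of made-up vectors to check the perpendicularity property we used above:

```python
import numpy as np

def project(u, v):
    """Projection of u onto v: p = (u·v / v·v) v."""
    return (np.dot(u, v) / np.dot(v, v)) * v

u = np.array([2.0, 5.0])   # hypothetical vectors, just for illustration
v = np.array([4.0, 1.0])
p = project(u, v)

print(p)                   # some scalar multiple of v
print(np.dot(u - p, v))    # ~0, so u - p is perpendicular to v
```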

K-Means Clustering

K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t) with 2 components each (an x-component and y-component, i.e., 2D data).

Determining Least Distances

To start, we select a \(\mathtt{k}\), a number of clusters that we want, remembering that we know right now how many clusters there are but in most situations we would not. We choose \(\mathtt{k=2}\). Then we place the two cluster points at random locations—here I’ve put \(\mathtt{\color{blue}{k_1}}\) at \(\mathtt{(2,7)}\) and \(\mathtt{\color{red}{k_2}}\) at \(\mathtt{(4,2)}\).

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean Theoremish Euclidean distance of \(\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}\). The cluster that we assign to each point is given by the closest center to that point. You can run the code below to print out the 8 distances.
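
(The original interactive isn’t embedded here, so here’s a stand-in sketch. The point coordinates are the ones implied by the discussion that follows: A at (5, 6), B at (2, 4), C at (2, 3), and D at (6, 5).)

```python
import math

points  = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
centers = {"k1": (2, 7), "k2": (4, 2)}

def distance(p, q):
    # Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Each point's distance to each center, as a (to k1, to k2) pair
for name, p in points.items():
    print(name, (round(distance(p, centers["k1"]), 2),
                 round(distance(p, centers["k2"]), 2)))
```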

Going by shorter distances, our first result, then, is to group point A in cluster \(\mathtt{\color{blue}{k_1}}\), because the first number in that pair is smaller, and points B, C, and D in cluster \(\mathtt{\color{red}{k_2}}\), because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points in the cluster to which it is currently assigned. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster \(\mathtt{\color{red}{k_2}}\), the points B, C, and D have a mean of (\(\mathtt{\frac{2 + 2 + 6}{3}}\), \(\mathtt{\frac{4 + 3 + 5}{3}}\)), or (\(\mathtt{\frac{10}{3}}\), \(\mathtt{4}\)). And since \(\mathtt{\color{blue}{k_1}}\) has just one point, A, it will move to smack dab on top of that point, at (5, 6).

Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster \(\mathtt{\color{blue}{k_1}}\) and B and C belong to cluster \(\mathtt{\color{red}{k_2}}\).

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in the cluster). Further calculations won’t change these assignments, so the k-means algorithm is done when it stops changing things drastically (or at all).
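
Putting the whole loop together—a bare-bones sketch for these four points, not production code:

```python
import math

points  = [(5, 6), (2, 4), (2, 3), (6, 5)]   # A, B, C, D as above
centers = [(2, 7), (4, 2)]                   # initial k1 and k2

for _ in range(10):                          # a few rounds is plenty here
    # Assignment step: index of the nearest center for each point
    labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
              for p in points]
    # Update step: move each center to the mean of its assigned points
    new_centers = []
    for i, c in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if members:
            new_centers.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
        else:
            new_centers.append(c)            # keep an empty cluster's center put
    if new_centers == centers:               # nothing changed, so we're done
        break
    centers = new_centers

print(centers)   # centers land between A and D, and between B and C
print(labels)    # [0, 1, 1, 0]: A and D in one cluster, B and C in the other
```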

Sum and Product Loops

It’s something of a truism that mathematical symbolism is difficult. There are some situations, though, where the symbolism is not just difficult, but also annoying and ridiculous. It likely saved a lot of time when people were still mostly writing ideas out by hand, so back then even the annoying and ridiculous could not be righteously pointed at and mocked, but nowadays it is almost certainly more difficult to set some statements in LaTeX than it is to type them—and, if the text is intended to teach students, more difficult to unpack the former than it is to understand the latter.

Examples of symbols that are justly symbolized, even today, are \(\mathtt{\sum}\) and \(\mathtt{\prod}\), representing a sum and a product, respectively. More specifically, these symbols represent loops—an addition loop or a multiplication loop.

So, for example, take this expression on the left side of the equals sign, which represents the loop sum on the right of the equals sign: \(\mathtt{\sum_{n=1}^{5}n=1+2+3+4+5}\). The expression on the left just means (a) start a counter at 1, (b) count up to 5 by 1s, (c) let n = each number you count, then (d) add all the n’s one by one in a loop.

How about this one? \[\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\]

This one means (a) start a counter at 0, (b) count up to 4 by 1s, (c) let n = each number you count, then (d) add all the 2n’s one by one in a loop.

For products, we just swap out the symbol. Here is the corresponding product for the first loop: \(\mathtt{\prod_{n=1}^{5}n=1\times2\times3\times4\times5}\). And here’s one for the second loop: \[\mathtt{\prod_{n=0}^{4}2n=0\times2\times4\times6\times8}\]
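
These loops translate almost word for word into code:

```python
import math

# Sum loops: start the counter, count up by 1s, add each term
sum_n  = sum(n for n in range(1, 6))         # 1 + 2 + 3 + 4 + 5 = 15
sum_2n = sum(2 * n for n in range(0, 5))     # 0 + 2 + 4 + 6 + 8 = 20

# Product loops: same counters, but multiply instead of add
prod_n  = math.prod(n for n in range(1, 6))      # 1 × 2 × 3 × 4 × 5 = 120
prod_2n = math.prod(2 * n for n in range(0, 5))  # 0 × 2 × 4 × 6 × 8 = 0

print(sum_n, sum_2n, prod_n, prod_2n)
```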

Loops and Linear Algebra

You’ll often see the summation loop in linear algebra contexts, because it is an equivalent way to write a dot product, for example. The sum \(\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\) above can be written as shown below, which looks like more work to write—and is—but when we’re dealing mostly with variables, the savings in writing effort is more evident. \[\quad\,\,\,\begin{bmatrix}\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\end{bmatrix}\cdot \begin{bmatrix}\mathtt{0}\\\mathtt{1}\\\mathtt{2}\\\mathtt{3}\\\mathtt{4}\end{bmatrix}\mathtt{=2\cdot0+2\cdot1+2\cdot2\ldots}\]

The loop sum \(\mathtt{\sum_{i}a_{i}x_{i}+b}\), where \(\mathtt{i}\) is an index pointing to a component of vector \(\mathtt{a}\) and vector \(\mathtt{x}\), can be written more simply as \(\mathtt{a\cdot x+b}\), as long as the context is clear that \(\mathtt{a}\) and \(\mathtt{x}\) are vectors.
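
A tiny sketch of that equivalence, using the vectors from the example above and a made-up value for b:

```python
import numpy as np

a = np.array([2, 2, 2, 2, 2])
x = np.array([0, 1, 2, 3, 4])
b = 1   # hypothetical intercept, just for illustration

loop_sum = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b   # the Σ version
dot_form = np.dot(a, x) + b                                # the a · x + b version

print(loop_sum, dot_form)   # both 21
```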

Linear Algebra Exercises I

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working each exercise out on a piece of paper and then checking your answers—one at a time at first, and then going a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer. My answers can be uncovered by hovering under each exercise.

Convert each equation of a line to a vector equation in parametric form, \(\mathtt{l(k)=p+kv}\), where \(\mathtt{p}\) and \(\mathtt{v}\) are 2D vectors and \(\mathtt{k}\) is a scalar variable.

  1. \(\mathtt{y=x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{y=2x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{y=-2x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{y=x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{y=10}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{10}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{x+y=1}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{2x+3y=9}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{x=5}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{5}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  1. \(\mathtt{-3x-\frac{1}{4}y=15}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-60}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-12}\end{bmatrix}\mathtt{k}\)

  1. Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

For slope-intercept form, \(\mathtt{y=mx+b}\), the conversion \(\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,\) seems to work for most lines.

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of \(\mathtt{0}\), and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.
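
Before moving on, here is the conversion algorithm from the last exercise as a short function (the function name is mine; like the answer above, it only applies to lines that have a slope-intercept form in the first place):

```python
def slope_intercept_to_parametric(m, b):
    """Convert y = mx + b to l(k) = p + k*v."""
    p = (0, b)   # the y-intercept gets you onto the line
    v = (1, m)   # run 1, rise m reproduces the slope
    return p, v

print(slope_intercept_to_parametric(1, 3))    # ((0, 3), (1, 1)) for y = x + 3
print(slope_intercept_to_parametric(-2, 0))   # ((0, 0), (1, -2)) for y = -2x
```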

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

  1. \(\mathtt{l_{1}(k)=}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{1}{2}}\), y-intercept → \(\mathtt{\frac{5}{2}}\), x-intercept → \(\mathtt{5}\)

  1. \(\mathtt{l_{2}(k)=}\begin{bmatrix}\mathtt{-3}\\\mathtt{-5}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix}\mathtt{k}\)

slope → undefined, y-intercept → none, x-intercept → \(\mathtt{-3}\)

  1. \(\mathtt{l_{3}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-1}\end{bmatrix}+\begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{1}\), y-intercept → \(\mathtt{-1}\), x-intercept → \(\mathtt{1}\)

  1. \(\mathtt{l_{4}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-9}\end{bmatrix}+\begin{bmatrix}\mathtt{7}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{0}\), y-intercept → \(\mathtt{-9}\), x-intercept → none

  1. \(\mathtt{l_{5}(k)=}\begin{bmatrix}\mathtt{-5}\\\mathtt{\,\,\,\,1}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{4}{3}}\), y-intercept → \(\mathtt{-\frac{17}{3}}\), x-intercept → \(\mathtt{-\frac{17}{4}}\)

And just two more. Determine if each point is on the line mentioned from above.

  1. Is \(\mathtt{(-11,3)}\) on the line \(\mathtt{l_1}\)?

No: \(\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}\).
No value of \(\mathtt{k}\) solves both \(\mathtt{2k+1=-11}\) and \(\mathtt{-k+2=3}\).

  1. Is \(\mathtt{(5,5)}\) on the line \(\mathtt{l_3}\)?

No. Since the line has a slope of \(\mathtt{1}\), it would contain \(\mathtt{(5,5)}\) if the line passed through the origin, but it does not.
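
And a sketch of that last check as code—solve each component of p + kv = (point) for k and see whether the k’s agree:

```python
from fractions import Fraction

def on_line(p, v, point):
    """Is `point` on the line l(k) = p + k*v?"""
    ks = []
    for p_i, v_i, t_i in zip(p, v, point):
        if v_i == 0:                 # this component never changes with k
            if p_i != t_i:
                return False
        else:
            ks.append(Fraction(t_i - p_i, v_i))
    return all(k == ks[0] for k in ks)

print(on_line((1, 2), (2, -1), (-11, 3)))   # False: k = -6 vs. k = -1
print(on_line((0, -1), (2, 2), (5, 5)))     # False: k = 5/2 vs. k = 3
```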

More with Least Squares

One very cool thing about our formula for the least squares regression line, \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\), is that it is the same no matter whether we have one independent variable (univariate) or many independent variables (multivariate).

Consider these data, showing the selling prices of some grandfather clocks at auction. The first scatter plot shows the age of the clock in years on the x-axis (100–200), and the second shows the number of bidders on the x-axis (0–20). Price (in pounds or dollars) is on the y-axis on each plot (500–2500).

| Age (years) | Bidders | Price ($) |
|---|---|---|
| 127 | 13 | 1235 |
| 115 | 12 | 1080 |
| 127 | 7 | 845 |
| 150 | 9 | 1522 |
| 156 | 6 | 1047 |
| 182 | 11 | 1979 |
| 156 | 12 | 1822 |
| 132 | 10 | 1253 |
| 137 | 9 | 1297 |
| 113 | 9 | 946 |
| 137 | 15 | 1713 |
| 117 | 11 | 1024 |
| 137 | 8 | 1147 |
| 153 | 6 | 1092 |
| 117 | 13 | 1152 |
| 126 | 10 | 1336 |
| 170 | 14 | 2131 |
| 182 | 8 | 1550 |
| 162 | 11 | 1884 |
| 184 | 10 | 2041 |
| 143 | 6 | 854 |
| 159 | 9 | 1483 |
| 108 | 14 | 1055 |
| 175 | 8 | 1545 |
| 108 | 6 | 729 |
| 179 | 9 | 1792 |
| 111 | 15 | 1175 |
| 187 | 8 | 1593 |
| 111 | 7 | 785 |
| 115 | 7 | 744 |
| 194 | 5 | 1356 |
| 168 | 7 | 1262 |

You can see in the notebook below that the first regression line, for the price of a clock as a function of its age, is approximately \(\mathtt{10.5x-192}\). The second regression line, for the price of a clock as a function of the number of bidders at auction, is approximately \(\mathtt{55x+806}\). As mentioned above, each of these univariate least squares regression lines can be calculated with the formula \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\).

Combining both age and number of bidders together, we can calculate, using the same formula, a multivariate least squares regression equation. This of course is no longer a line. In the case of two input variables as we have here, our line becomes a plane.

Our final regression equation becomes \(\mathtt{12.74x_{1}+85.82x_{2}-1336.72}\), with \(\mathtt{x_1}\) representing the age of a clock and \(\mathtt{x_2}\) representing the number of bidders.
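
The notebook isn’t embedded here, but a minimal numpy version of all three calculations, applying the formula above to the table data, might look like this:

```python
import numpy as np

# Grandfather clock data from the table above
ages = np.array([127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117,
                 137, 153, 117, 126, 170, 182, 162, 184, 143, 159, 108, 175,
                 108, 179, 111, 187, 111, 115, 194, 168], dtype=float)
bidders = np.array([13, 12, 7, 9, 6, 11, 12, 10, 9, 9, 15, 11, 8, 6, 13, 10,
                    14, 8, 11, 10, 6, 9, 14, 8, 6, 9, 15, 8, 7, 7, 5, 7], dtype=float)
prices = np.array([1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946,
                   1713, 1024, 1147, 1092, 1152, 1336, 2131, 1550, 1884, 2041,
                   854, 1483, 1055, 1545, 729, 1792, 1175, 1593, 785, 744,
                   1356, 1262], dtype=float)

def least_squares(X, y):
    # v = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

ones = np.ones_like(prices)

print(least_squares(np.column_stack([ones, ages]), prices))            # ~[-192, 10.5]
print(least_squares(np.column_stack([ones, bidders]), prices))         # ~[806, 55]
print(least_squares(np.column_stack([ones, ages, bidders]), prices))   # ~[-1336.72, 12.74, 85.82]
```

(np.linalg.lstsq would be more numerically robust than explicitly inverting, but the version above mirrors the formula.)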

Linear Algebra Connections

There are so many connections within and applications of linear algebra—I can only imagine that this will be a series of posts, to the extent that I continue writing about the subject. Here are a few connections that I’ve come across in my reading recently.

Compound Interest and Matrix Powers

We can multiply a matrix by itself \(\mathtt{n}\) times. The result is the matrix to the power \(\mathtt{n}\). We can use this when setting up a compound interest situation. For example, suppose we have three accounts, which each have a different interest rate compounded annually—say, 5%, 3%, and 2%. Without linear algebra, the amount in the first account can be modeled by the equation \[\mathtt{A(t)=p \cdot 1.05^t}\] where \(\mathtt{p}\) represents the starting amount in the account, and \(\mathtt{t}\) represents the time in years. With linear algebra, we can group all of the account interest rates into a matrix. The first year for each account would look like this: \[\mathtt{A(1)}=\begin{bmatrix}\mathtt{1.05}&\mathtt{0}&\mathtt{0}\\\mathtt{0}&\mathtt{1.03}&\mathtt{0}\\\mathtt{0}&\mathtt{0}&\mathtt{1.02}\end{bmatrix}^\mathtt{1}\begin{bmatrix}\mathtt{p_1}\\\mathtt{p_2}\\\mathtt{p_3}\end{bmatrix}=\begin{bmatrix}\mathtt{1.05p_1}\\\mathtt{1.03p_2}\\\mathtt{1.02p_3}\end{bmatrix}\]

For years beyond the first, all we have to do is raise the matrix to the appropriate power. Since the matrix is diagonal, squaring it, cubing it, and so on just squares, cubes, etc., each entry. Grouping the accounts this way keeps the computation organized—and makes it more straightforward to program. (A matrix has to be square, \(\mathtt{m \times m}\), in order to raise it to a power.) Below we calculate the amount in each account after 100 years.
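
Here’s a quick numpy sketch of that calculation, with made-up starting amounts of 1000 in each account:

```python
import numpy as np

# Annual growth factors for the three accounts, as a diagonal matrix
A = np.diag([1.05, 1.03, 1.02])

# Hypothetical starting amounts p1, p2, p3
p = np.array([1000.0, 1000.0, 1000.0])

# Amount in each account after 100 years: A^100 p
print(np.linalg.matrix_power(A, 100) @ p)   # each entry is p_i * rate_i**100
```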

Centroids and Areas

We have seen that the determinant can be thought about as the area of the parallelogram formed by two vectors. We can use this fact to determine the area of a complex shape like the one shown below.

Since determinants are signed areas, as we move around the shape counterclockwise, calculating the determinant of each vector pair (and multiplying each determinant by one half so we just get each triangle), we get the total area of the shape.\[\frac{1}{2}\left(\begin{vmatrix}\mathtt{6}&\mathtt{6}\\\mathtt{0}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{6}&\mathtt{3}\\\mathtt{4}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{3}\\\mathtt{4}&\mathtt{6}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{-2}\\\mathtt{6}&\mathtt{\,\,\,\,6}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{-2}\\\mathtt{\,\,\,\,6}&\mathtt{\,\,\,\,3}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{0}\\\mathtt{\,\,\,\,3}&\mathtt{3}\end{vmatrix}\right)=\mathtt{36}\text{ units}\mathtt{^2}\]

That’s pretty hand-wavy, but it’s something that you can probably figure out with a little experimentation.
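
Here is the same area calculation as a short loop over the column-vector pairs used above:

```python
# Position vectors of the shape's vertices, read off the determinant columns above
vectors = [(6, 0), (6, 4), (3, 4), (3, 6), (-2, 6), (-2, 3), (0, 3)]

area = 0.0
for (x1, y1), (x2, y2) in zip(vectors, vectors[1:]):
    # Half the 2x2 determinant = signed area of each triangle
    area += 0.5 * (x1 * y2 - x2 * y1)

print(area)   # 36.0 square units
```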

Another counterclockwise-moving calculation (though this one can be clockwise without changing the answer) is the calculation of the centroid of a closed shape—strictly speaking, the centroid of its vertices. All that is required here is to calculate the sum of the position vectors of the vertices of the figure and then divide by the number of vertices.

\[\mathtt{\frac{1}{5}}\left(\begin{bmatrix}\mathtt{3}\\\mathtt{4}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{6}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}+\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,3}\end{bmatrix}\right)=\begin{bmatrix}\mathtt{-\frac{1}{5}}\\\mathtt{\,\,\,\,\frac{17}{5}}\end{bmatrix}\]

Here again, it’s just magic, but you can figure it out with a little play. In each case—for both complex areas and centroids—we assign one vertex of the figure to be the origin and go from there. (That’s why the sum above shows only four vectors but is divided by 5: the fifth vertex is the origin, which adds nothing to the sum.)
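
And the centroid calculation, with the origin included as the fifth vertex:

```python
import numpy as np

# Vertices of the figure, with one vertex placed at the origin
vertices = np.array([(0, 0), (3, 4), (0, 6), (-3, 4), (-1, 3)], dtype=float)

print(vertices.mean(axis=0))   # [-0.2  3.4], i.e., (-1/5, 17/5)
```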

Least Squares with Linear Algebra

In brief, linear regression is about finding the line of best fit to a data set. If you’re looking for the linear algebra way of doing this, you will most likely find it searching for the term least squares.

In the basic scenario, you’ve got some two-dimensional data, \(\mathtt{(x, y)}\) coordinates, and you want to find the equation for a straight line that is as close as possible to each point. Such a scenario is shown below, though here the line has already been graphed and the equation for the line of best fit displayed. (But feel free to change the data in the table or move the points around to see how the line of best fit changes.)

| x | y |
|---|---|
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |

So, let’s imagine that we haven’t found this line yet. We know what we are looking for is a line of the form \(\mathtt{y=mx+b}\). In linear algebra terms, we want the vector \(\mathtt{y}\) (11 rows, 1 column) to equal the vector \(\mathtt{x}\) (11 rows, 1 column) times a scalar slope \(\mathtt{m}\), plus an intercept vector \(\mathtt{b}\) (11 rows, 1 column, with the intercept repeated in every entry).

The first problem with this that we have to fix is that \(\mathtt{x}\) the 11 × 1 vector needs to be \(\mathtt{X}\) the 11 × 2 matrix so that we get our equations right. So we’ll pad \(\mathtt{x}\) with some 1s and then we’ll be able to call it \(\mathtt{X}\). (We can pad on either side, left or right, just remembering to interchange \(\mathtt{b}\) and \(\mathtt{m}\) to keep the equations straight.) The second problem we can fix is that we don’t need a separate intercept vector and a separate slope vector. We can combine things so that we form an equivalent matrix equation that means the same thing as \(\mathtt{y=mx+b}\). We need the equation to look like this: \[\begin{bmatrix}\mathtt{8.04}\\\mathtt{6.95}\\\mathtt{7.58}\\\mathtt{8.81}\\\mathtt{8.33}\\\mathtt{9.96}\\\mathtt{7.24}\\\mathtt{4.26}\\\mathtt{10.84}\\\mathtt{4.82}\\\mathtt{5.68}\end{bmatrix} = \begin{bmatrix}\mathtt{1}&\mathtt{10}\\\mathtt{1}&\mathtt{8}\\\mathtt{1}&\mathtt{13}\\\mathtt{1}&\mathtt{9}\\\mathtt{1}&\mathtt{11}\\\mathtt{1}&\mathtt{14}\\\mathtt{1}&\mathtt{6}\\\mathtt{1}&\mathtt{4}\\\mathtt{1}&\mathtt{12}\\\mathtt{1}&\mathtt{7}\\\mathtt{1}&\mathtt{5}\end{bmatrix}\begin{bmatrix}\mathtt{b}\\\mathtt{m}\end{bmatrix}\]

Multiply the matrix and vector on the right side of that equation, and you get, for the first equation, \(\mathtt{8.04=b+(m)(10)}\). That’s equivalent to \(\mathtt{y=mx+b}\) for the first \(\mathtt{(x, y)}\) data point \(\mathtt{(10, 8.04)}\). The matrix and vector setup ensures that all the equations for all the points are of the correct form.

In linear algebra terms, we have rewritten the equation to be \(\mathtt{y=Xv}\), where \(\mathtt{y}\) is an 11 × 1 vector, \(\mathtt{X}\) is an 11 × 2 matrix (padded with some 1s), and \(\mathtt{v}\) is a 2 × 1 vector which contains the unknown slope \(\mathtt{m}\) of the best fit line and the unknown intercept \(\mathtt{b}\).

It’s an Approximation

At this point in the explanation, it’s important to realize that we will make another shift. The first was from the real data to the matrix algebra setup. We shift again below, away from that setup, per se, and toward just finding out what that unknown \(\mathtt{v}\) is.

As an analogy, the system shown below features a 3D vector (2, 0, –2) which does not live in the plane spanned by the two column vectors of the matrix. Thus, there is no solution \(\mathtt{(j, k)}\). \[\begin{bmatrix}\mathtt{1}&\mathtt{\,\,\,\,1}\\\mathtt{1}&\mathtt{-3}\\\mathtt{1}&\mathtt{\,\,\,\,1}\end{bmatrix}\begin{bmatrix}\mathtt{j}\\\mathtt{k}\end{bmatrix}=\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{\,\,\,\,0}\\\mathtt{-2}\end{bmatrix}\longleftarrow\text{no solutions}\]

Similarly, the columns of our \(\mathtt{X}\) matrix form a plane in 11-dimensional space. In order for \(\mathtt{v}\) to be a solution to our original matrix equation above, \(\mathtt{y}\) has to live in this plane too. But we already know that it doesn’t. If it did, the points would lie along some line. It’s true that \(\mathtt{y}\) is an 11-dimensional vector, just like each column of \(\mathtt{X}\), but \(\mathtt{y}\) doesn’t live in the same plane.

The closest approximation we can get to \(\mathtt{y}\) in the plane of \(\mathtt{X}\) is \(\mathtt{Xv=p}\), where \(\mathtt{p}\) is the projection of \(\mathtt{y}\) onto the plane (another thing we’ll have to come back to). The vector \(\mathtt{q}\) connecting \(\mathtt{y}\) with its projection \(\mathtt{p}\) is perpendicular to \(\mathtt{p}\) (and, thus, to the plane).

Now we can write some more equations. For example, we know we want \(\mathtt{Xv=p}\), but it’s also true that \(\mathtt{p+q=y}\). And we can do some manipulation to show that all the column vectors of \(\mathtt{X}\) are perpendicular to \(\mathtt{q}\). Assuming for a second that our \(\mathtt{X}\) is a 2 × 2 matrix, we embed it in three dimensions by appending a row of zeros, transpose it (so that the multiplication is defined), and multiply it by \(\mathtt{q}\). The same basic idea applies to our original \(\mathtt{X}\) (11 × 2) matrix. \[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{12}}\\\mathtt{x_{21}}&\mathtt{x_{22}}\\\mathtt{0}&\mathtt{0}\end{bmatrix}\rightarrow\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\]\[\begin{bmatrix}\mathtt{x_{11}}&\mathtt{x_{21}}&\mathtt{0}\\\mathtt{x_{12}}&\mathtt{x_{22}}&\mathtt{0}\end{bmatrix}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\\\mathtt{q_3}\end{bmatrix}=\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}\] This gives us a key equation: \(\mathtt{X^{T}q=0}\).

Since \(\mathtt{q=y-p}\), substituting for \(\mathtt{q}\) gets us \(\mathtt{X^{T}(y-p)=0}\). Then since \(\mathtt{Xv=p}\), substituting for \(\mathtt{p}\) brings us to \(\mathtt{X^{T}(y-Xv)=0}\). Distribute to get \(\mathtt{X^{T}y-X^{T}Xv=0}\), which means \(\mathtt{X^{T}y=X^{T}Xv}\). Multiply both sides on the left by the inverse of \(\mathtt{X^{T}X}\), and the vector \(\mathtt{v}\) that we’re after, then, is \[\mathtt{v=(X^{T}X)^{-1}X^{T}y}\]

That formula gives us the slope and intercept of our best fit line. Below is one way the least squares can be calculated with a little Python. The calculation I used for the above interactive to get the best fit line is way more complicated. If I had known about the linear algebra way when I made it, I would have definitely gone with that instead.
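
That Python isn’t embedded here, but a minimal version using the formula we just derived, on the data from the table above, might look like this:

```python
import numpy as np

# Data from the table above
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Pad x with 1s so that X v = y encodes y = b + m x for every data point
X = np.column_stack([np.ones_like(x), x])

# v = (X^T X)^{-1} X^T y
b, m = np.linalg.inv(X.T @ X) @ X.T @ y
print(m, b)   # roughly 0.5 and 3.0, so the best fit line is about y = 0.5x + 3
```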