Entia Successiva

The term ‘entia successiva’ means ‘successive entities.’ And, as you may guess, it is a term one might come across in a philosophy class, in particular when discussing metaphysical questions about personhood. For instance, is a person a single thing throughout its entire life or a succession of different things—an ‘ens successivum’? Though there is no right answer to this question, becoming familiar with the latter perspective can, I think, help people to be more skeptical and knowledgeable consumers of education research.

Richard Taylor provides an example of a symphony (in here) that is, depending on your perspective, both a successive and a permanent entity:

Let us imagine [a symphony orchestra] and give it a name—say, the Boston Symphony. One might write a history of this orchestra, beginning with its birth one hundred years ago, chronicling its many tours and triumphs and the fame of some of its musical directors, and so on. But are we talking about one orchestra?

In one sense we are, but in another sense we are not. The orchestra persists through time, is incorporated, receives gifts and funding, holds property, has a bank account, returns always to the same city, and rehearses, year in and year out, in the same hall. Yet its membership constantly changes, so that no member of fifty years ago is still a member today. So in that sense it is an entirely different orchestra. We are in this sense not talking about one orchestra, but many. There is a succession of orchestras going under the same name. Each, in [Roderick] Chisholm’s apt phrase, does duty for what we are calling the Boston Symphony.

The Boston Symphony is thus an ens successivum.

People are entia successiva, too. Or, at least, their bodies are. Just about every cell in your body has been replaced over the past 10 years or so. So, if you're a 40-year-old Boston Symphony like me, almost all of your musicians and directors have been swapped out since you were a 30-year-old symphony. People still call you the Boston Symphony, of course (because you still are), but an almost entirely different set of parts is doing duty for "you" under the banner of this name. You are, in a sense, an almost completely different person—one who is, incidentally, made up of at least as many bacterial cells as human ones.

What’s worse (if you think of the above as bad news), the fact of evolution by natural selection tells us that humanity itself is an ens successivum. If you could line up your ancestors—your mother or father, his or her mother or father, and so on—it would be a very short trip down this line before you reached a person with whom you could not communicate at all, save through gestures. Between 30 and 40 people in would be a person who had almost no real knowledge about the physical universe. And somewhere past the ten thousandth person in your row of ancestors, there’s a good chance you would no longer be looking at a member of our species.

The ‘Successive’ Perspective

Needless to say, seeing people as entia successiva does not come naturally to anyone. Nor should it, ever. We couldn’t go about our daily lives seeing things this way. But the general invisibility of this ‘successiveness’ is not due to its only being operational at the very macro or very micro levels. It can be seen at the psychological level too. The trouble is, our brains are so good at constructing singular narratives out of even absolute gibberish that we sometimes have to place people in unnatural or extreme situations to get a good look at how much we can delude ourselves.

An Air Force doctor’s experience investigating the blackouts of pilots in centrifuge training provides a nice example (from here). It’s definitely worth quoting at length:

Over time, he has found striking similarities to the same sorts of things reported by patients who lost consciousness on operating tables, in car crashes, and after returning from other nonbreathing states. The tunnel, the white light, friends and family coming to greet you, memories zooming around—the pilots experienced all of this. In addition, the centrifuge was pretty good at creating out-of-body experiences. Pilots would float over themselves, or hover nearby, looking on as their heads lurched and waggled about . . . the near-death and out-of-body phenomena are both actually the subjective experience of a brain owner watching as his brain tries desperately to figure out what is happening and to orient itself amid its systems going haywire due to oxygen deprivation. Without the ability to map out its borders, the brain often places consciousness outside the head, in a field, swimming in a lake, fighting a dragon—whatever it can connect together as the walls crumble. What the deoxygenated pilots don’t experience is a smeared mess of random images and thoughts. Even as the brain is dying, it refuses to stop generating a narrative . . . Narrative is so important to survival that it is literally the last thing you give up before becoming a sack of meat.

You’ll note, I hope, that not only does the report above disclose how our very mental lives are entia successiva—thoughts and emotions that arise and pass away—but the report assumes this perspective in its own narrative. That’s because the report is written from a scientific point of view. And from that vantage point, people are assumed (correctly) to have parts that “do duty” for them and may even be at odds with each other, as they were with the pilots (a perception part fighting against a powerful narrative-generating part). The unit of analysis in the report is not an entire pilot, but the various mechanisms of her mind. Positing these parts makes room for functional explanations like the one we see.

An un-scientific analysis, on the other hand, is entirely possible. But it would stop at the pilot. He or she is, after all, an indivisible, permanent entity. There is nothing else “doing duty” for him or her, so there are really only two choices: the experience was an illusion or it was real. End of analysis. Interpret it as an illusion and you don’t really have much to say; interpret it as real, and you can make a lot of money.

Entia Permanentia

Good scientific research in education will adopt an entia successiva perspective about the people it studies. This does not guarantee that its conclusions are correct. But it makes it more likely that, over time, it will get to the bottom of things.

This is not to say that an alternative perspective is without scientific merit. If we want to know how to improve the performance of the Boston Symphony, we can make some headway with ‘entia permanentia’—seeing the symphony as a whole stable unit rather than a collection of successive parts. We could increase its funding, perhaps try to make sure “it” is treated as well as other symphonies around the world. We could try to change the music, maybe include some movie scores instead of that stuffy old classical music. That would make it more exciting for audiences (and more inclusive), which is certainly one interpretation of “improvement.” But to whatever extent improvement means improving the functioning of the parts of the symphony—the musicians, the director, etc.—we can do nothing, because with entia permanentia these tiny creatures do not exist. Even raising the question about improving the parts would be beyond the scope of our imagination.

Further, seeing students as entia permanentia rather than entia successiva stops us from being appropriately skeptical about both ‘scientific’ and ‘un-scientific’ ideas. Do students learn best when matched to their learning style? What parts of their neurophysiology and psychology could possibly make something like that true? Why would it have evolved, if it did? In what other aspects of our lives might this present itself? Adopting the entia successiva perspective would have slowed the adoption of this myth (even if it were not a myth) to a crawl and would have eventually killed it. Instead, entia permanentia, a person-level analysis, holds sway: students benefit from learning-style matching because we see them respond differently to different representations. End of analysis.

A different but similar perspective on this, from a recurring theme in the book Switch:

In a pioneering study of organizational change, described in the book The Critical Path to Corporate Renewal, researchers divided the change efforts they’d studied into three groups: the most successful (the top third), the average (the middle third), and the least successful (the bottom third). They found that, across the spectrum, almost everyone set goals: 89 percent of the top third and 86 percent of the bottom third . . . But the more successful change transformations were more likely to set behavioral goals: 89 percent of the top third versus only 33 percent of the bottom third.

Why do “behavioral” goals work when just “goals” don’t? Behavioral goals are, after all, telling you what to do, forcing you to behave in a certain way. Do you like to be told what to do? Probably not.

But the “you” that responds to behavioral goals isn’t the same “you” whose in-the-moment “likes” are important. You are more than just one solid indivisible self. You are many selves, and the self that can start checking stuff off the to-do list is often pulling the other selves behind it. And when it does, you get to think that “you” are determined, “you” take initiative, “you” have willpower. But in truth, your environment—both immediate and distant, both internal and external—has simply made it possible for that determined self to take the lead. Behavioral goals often create this exact environment.

The Law of Total Probability

Where has this been all my life? The Law of Total Probability is really cool, and it seems accessible enough to be presented in high school, where it would be very useful as well, I think, although I’ve never seen it there. For example, from the book Causal Inference in Statistics we get this nice problem (in addition to the quote below): “Suppose we roll two dice, and we want to know the probability that the second roll (R2) is higher than the first (R1).” The Law of Total Probability can make answering this much more straightforward.

To understand this ‘law,’ we should start by understanding two simple things about probability. First, for any two mutually exclusive events (the events can’t happen together), the probability of \(\mathtt{A}\) or \(\mathtt{B}\) is the sum of the probability of \(\mathtt{A}\) and the probability of \(\mathtt{B}\):

\(\mathtt{P(A\text{ or }B) = P(A) + P(B)}\)

Second, thinking now of two mutually exclusive events “A and B” and “A and not-B”, we can write the probability of \(\mathtt{A}\) this way, since if \(\mathtt{A}\) is true, then either “A and B” or “A and not-B” must be true:

\(\mathtt{P(A) = P(A,B) + P(A,\text{not-}B)}\)

In different situations, however, \(\mathtt{B}\) could take on many different values—for example, the six possible values of one die roll, \(\mathtt{B_1}\)–\(\mathtt{B_6}\)—even while we’re considering just one value for the event \(\mathtt{A}\)—for example, rolling a 4. The Law of Total Probability tells us that

\(\mathtt{P(A)=P(A,B_1)+\cdots+P(A,B_n)}\).

If we pull a random card from a standard deck, the probability that the card is a Jack [\(\mathtt{P(J)}\)] will be equal to the probability that it’s a Jack and a spade [\(\mathtt{P(J,C_S)}\)], plus the probability that it’s a Jack and a heart [\(\mathtt{P(J,C_H)}\)], plus the probability that it’s a Jack and a club [\(\mathtt{P(J,C_C)}\)], plus the probability that it’s a Jack and a diamond [\(\mathtt{P(J,C_D)}\)].

Now with Conditional Probabilities

Where this gets good is when we throw conditional probabilities into the mix. We can make use of the fact that \(\mathtt{P(A,B)=P(A|B)P(B)}\), where \(\mathtt{P(A|B)}\) means “the probability of A given B.” For example, the probability of randomly pulling a Jack, given that you pulled spades, is \(\mathtt{\frac{1}{13}}\), and the probability of randomly pulling a spade is \(\mathtt{\frac{1}{4}}\). Thus, the probability of pulling the Jack of spades is \(\mathtt{\frac{1}{13}\cdot \frac{1}{4}=\frac{1}{52}}\). We can, therefore, rewrite the Law of Total Probability this way:

\(\mathtt{P(A)=P(A|B_1)P(B_1)+\cdots+P(A|B_n)P(B_n)}\)

And now we’re ready to determine the probability given in the opening paragraph, \(\mathtt{P(R2>R1)}\), the probability that a second die roll is greater than the first die roll: \[\mathtt{P(R2>R1)=P(R2>R1|R1=1)P(R1=1)+\cdots+P(R2>R1|R1=6)P(R1=6)}\]

The final result is \(\mathtt{\frac{5}{6}\cdot \frac{1}{6}+\frac{4}{6}\cdot \frac{1}{6}+\frac{3}{6}\cdot \frac{1}{6}+\frac{2}{6}\cdot \frac{1}{6}+\frac{1}{6}\cdot \frac{1}{6}+\frac{0}{6}\cdot \frac{1}{6}=\frac{5}{12}}\).
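As a check, here is a minimal Python sketch that computes \(\mathtt{P(R2>R1)}\) both with the Law of Total Probability and by enumerating all 36 equally likely outcomes:

```python
from fractions import Fraction

# Law of Total Probability: sum over the six values of R1 of
# P(R2 > R1 | R1 = b) * P(R1 = b)
p_total = sum(Fraction(6 - b, 6) * Fraction(1, 6) for b in range(1, 7))

# Brute-force check: enumerate all 36 equally likely (R1, R2) pairs
favorable = sum(1 for r1 in range(1, 7) for r2 in range(1, 7) if r2 > r1)
p_brute = Fraction(favorable, 36)

print(p_total, p_brute)  # 5/12 5/12
```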

Monty Hall and the Colliders

Reading through Judea Pearl’s Book of Why, along with Causal Inference in Statistics, co-authored by Pearl, I came across a nifty, new-to-me explanation of the famous Monty Hall Problem.

To make it clearer, we can start with a toy mathematical problem, \(\mathtt{a+b=c}\), which I model with the causal diagram below. A causal diagram is, of course, a little inappropriate for modeling a mathematical equation, but it’s also one that students may first use implicitly to think about operations and equations (and, without proper instruction, it’s one adults may use all their lives).

Here, we will think of the variables \(\mathtt{a}\) and \(\mathtt{b}\) as independent. Changing the number we substitute for \(\mathtt{a}\) does not affect our choice for \(\mathtt{b}\) and vice versa. However, \(\mathtt{a}\) and \(\mathtt{c}\) are dependent, and \(\mathtt{b}\) and \(\mathtt{c}\) are dependent as well. Increasing or decreasing \(\mathtt{a}\) or \(\mathtt{b}\) alone (or together) will have an effect on \(\mathtt{c}\).

But once we fix \(\mathtt{c}\), or “condition on \(\mathtt{c}\)” as Pearl would write, then \(\mathtt{a}\) and \(\mathtt{b}\) become dependent. That’s at once clear as a bell and pretty wacky. If we fix \(\mathtt{c}\) at 10, then changing \(\mathtt{a}\) will change \(\mathtt{b}\) and vice versa. But \(\mathtt{a}\) and \(\mathtt{b}\) were entirely independent prior to knowing what \(\mathtt{c}\) was. Afterwards, they’re dependent on each other. The diagram that represents this situation (above) is what Pearl calls a “collider.”

Pearl et al. also use a more everyday example:

Suppose a certain college gives scholarships to two types of students: those with unusual musical talents and those with extraordinary grade point averages. Ordinarily, musical talent and scholastic achievement are independent traits, so, in the population at large, finding a person with musical talent tells us nothing about that person’s grades. However, discovering that a person is on a scholarship changes things; knowing that the person lacks musical talent then tells us immediately that he is likely to have high grade point average. Thus, two variables that are marginally independent become dependent upon learning the value of a third variable (scholarship) that is a common effect of the first two.

For the Monty Hall problem, the collider model looks identical. And the correct model helps us see two things (forgiving some [I hope minor] mathematical sloppiness).

First, the model helps us see that Monty opening a door does not change the probability that your initial choice was correct (which is \(\mathtt{\frac{1}{3}}\)). The arrows point from your choice and from the car’s location toward the door Monty opens, not the other way around, so his action can’t reach back and alter that probability. In and of itself, the correct model should prevent us from upgrading the probability from \(\mathtt{\frac{1}{3}}\) to \(\mathtt{\frac{1}{2}}\) after the freebie goat door is opened.

Second, when Monty opens a door to reveal a goat (thus fixing the value of “door Monty opens”), now changing your choice of door changes the probability that the car is behind that door, since these two variables are now dependent.

Thus, since the probability of being correct must change when I change my door selection, it must change from \(\mathtt{\frac{1}{3}}\) to something else. And since the other door is the only other option, and all the probabilities in the situation must add to \(\mathtt{1}\), switching must change the probability to \(\mathtt{\frac{2}{3}}\).
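A quick simulation bears this out: staying wins about a third of the time and switching wins about two-thirds. Here is a minimal Python sketch:

```python
import random

def monty_hall(trials=100_000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)        # door hiding the car
        choice = random.randrange(3)     # contestant's initial pick
        # Monty opens a door that is neither the pick nor the car.
        # (When he has two goat doors to choose from, which one he opens
        # doesn't affect the stay/switch win rates.)
        opened = next(d for d in range(3) if d != choice and d != car)
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == car)
        switch_wins += (switched == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.333, 0.667)
```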

Provided vs Generated Examples


The results reported in this research (below) about the value of provided examples versus generated examples are a bit surprising. To get a sense of why that’s the case, start with this definition of the concept availability heuristic used in the study—a term from the social psychology literature:

Availability heuristic: the tendency to estimate the likelihood that an event will occur by how easily instances of it come to mind.

All participants first read this definition, along with the definitions of nine other social psychology concepts, in a textbook passage. Participants then completed two blocks of practice trials in one of three groups: (1) subjects in the provided examples group read two different examples, drawn from an undergraduate psychology textbook, of each of the 10 concepts (two practice blocks, so four examples total for each concept), (2) subjects in the generated examples group created their own examples for each concept (four generated examples total for each concept), and (3) subjects in the combination group were provided with an example and then created their own example of each concept (two provided and two generated examples total for each concept).

The researchers—Amanda Zamary and Katharine Rawson at Kent State University in Ohio—made the following predictions, with regard to both student performance and the efficiency of the instructional treatments:

We predicted that long-term learning would be greater following generated examples compared to provided examples. Concerning efficiency, we predicted that less time would be spent studying provided examples compared to generating examples . . . [and] long-term learning would be greater after a combination of provided and generated examples compared to either technique alone. Concerning efficiency, our prediction was that less time would be spent when students study provided examples and generate examples compared to just generating examples.

Achievement Results

All participants completed the same two self-paced tests two days later. The first assessment, an example classification test, asked subjects to classify each of 100 real-world examples into one of the 10 concept definition categories provided. Sixty of these 100 were new (Novel) to the provided-examples group, 80 of the 100 were new to the combination group, and of course all 100 were likely new to the generated-examples group. The second assessment, a definition-cued recall test, asked participants to type in the definition of each of the 10 concepts, given in random order. (The test order was varied among subjects.)


Given that participants in the provided-examples and combination groups had an advantage over participants in the generated-examples group on the classification task (they had seen between 20 and 40 of the examples previously), the researchers helpfully drew out results on just the 60 novel examples.

Subjects who were given only textbook-provided examples of the concepts outperformed other subjects on applying these concepts to classifying real-world examples. This difference was significant. No significant differences were found on the cued-recall test between the provided-examples and generated-examples groups.

Also, Students’ Time Is Valuable

Another measure of interest to the researchers in this study, as mentioned above, was the time used by the participants to read through or create the examples. What the authors say about efficiency is worth quoting, since it does not often seem to be taken as seriously as measures of raw achievement (emphasis mine):

Howe and Singer (1975) note that in practice, the challenge for educators and researchers is not to identify effective learning techniques when time is unlimited. Rather, the problem arises when trying to identify what is most effective when time is fixed. Indeed, long-term learning could easily be achieved if students had an unlimited amount of time and only a limited amount of information to learn (with the caveat that students spend their time employing useful encoding strategies). However, achieving long-term learning is difficult because students have a lot to learn within a limited amount of time (Rawson and Dunlosky 2011). Thus, long-term learning and efficiency are both important to consider when competitively evaluating the effectiveness of learning techniques.


With that in mind, and given the results above, it is noteworthy that the provided-examples group outperformed the generated-examples group on real-world examples after engaging in practice that took less than half as much time. The researchers divided subjects’ novel classification score by the amount of time they spent practicing and determined that the provided-examples group had an average gain of 5.7 points per minute of study, compared to 2.2 points per minute for the generated-examples group and 1.7 points per minute for the combination group.

For learning declarative concepts in a domain and then identifying those concepts in novel real-world situations, provided examples proved to be better than student-generated examples for both long-term learning and for instructional efficiency. The second experiment in the study replicated these findings.

Some Commentary

First, some familiarity with the research literature makes the above results not so surprising. The provided-examples group likely outperformed the other groups because participants in that group practiced with examples generated by experts. Becoming more expert in a domain does not necessarily involve becoming more isolated from other people and their interests. Such expertise is likely positively correlated with better identifying and collating examples within a domain that are conceptually interesting to students and more widely generalizable. I reported on two studies, for example, which showed that greater expertise was associated with a significantly greater number of conceptual explanations, as opposed to “product oriented” (answer-getting) explanations—and these conceptual explanations resulted in the superior performance of students receiving them.

Second, I am sympathetic to the efficiency argument, as laid out here by the study’s authors—that is, I agree that we should focus in education on “trying to identify what is most effective when time is fixed.” Problematically, however, a wide variety of instructional actions can be informed by decisions about what is and isn’t “fixed.” Time is not the only thing that can be fixed in one’s experience. The intuition that students should “own their own learning,” for example, which undergirds the idea in the first place that students should generate their own examples, may rest on the more fundamental notion that students themselves are “fixed” identities that adults must work around rather than try to alter. This notion is itself circumscribed by the research summarized above. So, it is worth having a conversation about what should and should not be considered “fixed” when it comes to learning.

K-Means Clustering

K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t), where each point has 2 components (an x-component and a y-component, i.e., 2D data).

Determining Least Distances

To start, we select a \(\mathtt{k}\), the number of clusters that we want, remembering that we know right now how many clusters there are but in most situations we would not. We choose \(\mathtt{k=2}\). Then we place the two cluster centers at random locations—here I’ve put \(\mathtt{\color{blue}{k_1}}\) at \(\mathtt{(2,7)}\) and \(\mathtt{\color{red}{k_2}}\) at \(\mathtt{(4,2)}\).

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean Theoremish Euclidean distance of \(\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}\). The cluster that we assign to each point is given by the closest center to that point. You can run the code below to print out the 8 distances.
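Here is a minimal Python sketch that prints those distances, assuming the coordinates implied by the calculations in this section: A at (5, 6) and B, C, D at (2, 4), (2, 3), (6, 5).

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Coordinates reconstructed from the calculations in this section
points = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
k1, k2 = (2, 7), (4, 2)  # the randomly placed cluster centers

# Print each point's distances to the two centers as a (to k1, to k2) pair
for name, p in points.items():
    print(name, round(dist(p, k1), 2), round(dist(p, k2), 2))
# A 3.16 4.12   B 3.0 2.83   C 4.0 2.24   D 4.47 3.61
```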

Going by shorter distances, our first result, then, is to group point A in cluster \(\mathtt{\color{blue}{k_1}}\), because the first number in that pair is smaller, and points B, C, and D in cluster \(\mathtt{\color{red}{k_2}}\), because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points in the cluster to which it is currently assigned. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster \(\mathtt{\color{red}{k_2}}\), the points B, C, and D have a mean of (\(\mathtt{\frac{2 + 2 + 6}{3}}\), \(\mathtt{\frac{4 + 3 + 5}{3}}\)), or (\(\mathtt{\frac{10}{3}}\), \(\mathtt{4}\)). And since \(\mathtt{\color{blue}{k_1}}\) has just one point, A, it will move to smack dab on top of that point, at (5, 6).

Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster \(\mathtt{\color{blue}{k_1}}\) and B and C belong to cluster \(\mathtt{\color{red}{k_2}}\).

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in the cluster). Further calculations won’t change these assignments, so the k-means algorithm is done when it stops changing things drastically (or at all).
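Putting the whole loop together, here is a compact sketch of the algorithm for our four points (coordinates as assumed above); in practice you’d reach for a library implementation such as scikit-learn’s KMeans rather than rolling your own.

```python
from math import dist
from statistics import mean

def k_means(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points
        centers = [
            (mean(x for x, _ in pts), mean(y for _, y in pts)) if pts else centers[i]
            for i, pts in clusters.items()
        ]
    return centers, clusters

points = [(5, 6), (2, 4), (2, 3), (6, 5)]             # A, B, C, D
centers, clusters = k_means(points, [(2, 7), (4, 2)])
print(centers)   # k1 ends up midway between A and D, k2 midway between B and C
print(clusters)
```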

Sum and Product Loops

It’s something of a truism that mathematical symbolism is difficult. There are some situations, though, where the symbolism is not just difficult but also annoying and ridiculous. Symbolism likely saved a lot of time when people were still mostly writing ideas out by hand, so back then even the annoying and ridiculous could not be righteously pointed at and mocked. Nowadays, however, it is almost certainly more difficult to set some statements in LaTeX than it is to write them out in words—and, if the text is intended to teach students, more difficult to unpack the former than it is to understand the latter.

Examples of symbols whose use is still justified, even today, are \(\mathtt{\sum}\) and \(\mathtt{\prod}\), representing a sum and a product, respectively. More specifically, these symbols represent loops—an addition loop or a multiplication loop.

So, for example, take this expression on the left side of the equals sign, which represents the loop sum on the right of the equals sign: \(\mathtt{\sum_{n=1}^{5}n=1+2+3+4+5}\). The expression on the left just means (a) start a counter at 1, (b) count up to 5 by 1s, (c) let n = each number you count, then (d) add all the n’s one by one in a loop.

How about this one? \[\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\]

This one means (a) start a counter at 0, (b) count up to 4 by 1s, (c) let n = each number you count, then (d) add all the 2n’s one by one in a loop.

For products, we just swap out the symbol. Here is the corresponding product for the first loop: \(\mathtt{\prod_{n=1}^{5}n=1\times2\times3\times4\times5}\). And here’s one for the second loop: \[\mathtt{\prod_{n=0}^{4}2n=0\times2\times4\times6\times8}\]
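Read as loops, these expressions translate directly into code. A small Python sketch of the two sums and two products above:

```python
# The summation symbol as an addition loop
sum_1 = sum(n for n in range(1, 6))        # 1 + 2 + 3 + 4 + 5 = 15
sum_2 = sum(2 * n for n in range(0, 5))    # 0 + 2 + 4 + 6 + 8 = 20

# The product symbol as a multiplication loop
prod_1 = 1
for n in range(1, 6):                      # 1 * 2 * 3 * 4 * 5 = 120
    prod_1 *= n

prod_2 = 1
for n in range(0, 5):                      # 0 * 2 * 4 * 6 * 8 = 0 (the n = 0 factor)
    prod_2 *= 2 * n

print(sum_1, sum_2, prod_1, prod_2)        # 15 20 120 0
```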

Loops and Linear Algebra

You’ll often see the summation loop in linear algebra contexts, because it is an equivalent way to write a dot product, for example. The sum \(\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\) above can be written as shown below, which looks like more work to write—and is—but when we’re dealing mostly with variables, the savings in writing effort is more evident. \[\quad\,\,\,\begin{bmatrix}\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\end{bmatrix}\cdot \begin{bmatrix}\mathtt{0}\\\mathtt{1}\\\mathtt{2}\\\mathtt{3}\\\mathtt{4}\end{bmatrix}\mathtt{=2\cdot0+2\cdot1+2\cdot2\ldots}\]

The loop sum \(\mathtt{\sum_{i}a_{i}x_{i}+b}\), where \(\mathtt{i}\) is an index pointing to a component of vector \(\mathtt{a}\) and vector \(\mathtt{x}\), can be written more simply as \(\mathtt{a\cdot x+b}\), as long as the context is clear that \(\mathtt{a}\) and \(\mathtt{x}\) are vectors.
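As a quick sketch of that equivalence (assuming NumPy is available), the loop sum and the dot-product form give the same number:

```python
import numpy as np

a = np.array([2, 2, 2, 2, 2])
x = np.array([0, 1, 2, 3, 4])
b = 3  # an arbitrary bias term, just for illustration

loop_sum = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b   # the summation-loop version
dot_form = a.dot(x) + b                                   # the a·x + b version

print(loop_sum, dot_form)  # 23 23
```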

Linear Algebra Exercises I

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working through them on a piece of paper and then checking your answers—one at a time at first, and then going a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer. My answers appear below each exercise.

Convert each equation of a line to a vector equation in parametric form, \(\mathtt{l(k)=p+kv}\), where \(\mathtt{p}\) and \(\mathtt{v}\) are 2D vectors and \(\mathtt{k}\) is a scalar variable.

  1. \(\mathtt{y=x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  2. \(\mathtt{y=2x+3}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

  3. \(\mathtt{y=-2x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  4. \(\mathtt{y=x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\),    or just \(\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  5. \(\mathtt{y=10}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{10}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

  6. \(\mathtt{x+y=1}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

  7. \(\mathtt{2x+3y=9}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{3}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

  8. \(\mathtt{x=5}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{5}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

  9. \(\mathtt{-3x-\frac{1}{4}y=15}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-60}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-12}\end{bmatrix}\mathtt{k}\)

  10. Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

For slope-intercept form, \(\mathtt{y=mx+b}\), the conversion \(\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,\) works for any line that can be written in that form. (Vertical lines, like the one in Exercise 8, have no slope-intercept form and need a vertical direction vector instead; a code sketch of the conversion appears after the note below.)

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of \(\mathtt{0}\), and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.
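For the conversion algorithm in the last exercise, here is one way it might be sketched in Python (the function names are just illustrative):

```python
def slope_intercept_to_parametric(m, b):
    """Return (p, v) for l(k) = p + k*v, given y = mx + b."""
    p = (0, b)   # intercept vector: a point on the line
    v = (1, m)   # direction vector: run 1, rise m
    return p, v

def evaluate(p, v, k):
    """Evaluate l(k) = p + k*v."""
    return (p[0] + k * v[0], p[1] + k * v[1])

p, v = slope_intercept_to_parametric(2, 3)   # y = 2x + 3, as in exercise 2
print(p, v)                   # (0, 3) (1, 2)
print(evaluate(p, v, 4))      # (4, 11), and indeed 2(4) + 3 = 11
```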

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

  1. \(\mathtt{l_{1}(k)=}\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{1}{2}}\), y-intercept → \(\mathtt{\frac{5}{2}}\), x-intercept → \(\mathtt{5}\)

  2. \(\mathtt{l_{2}(k)=}\begin{bmatrix}\mathtt{-3}\\\mathtt{-5}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{4}\end{bmatrix}\mathtt{k}\)

slope → undefined, y-intercept → none, x-intercept → \(\mathtt{-3}\)

  3. \(\mathtt{l_{3}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,0}\\\mathtt{-1}\end{bmatrix}+\begin{bmatrix}\mathtt{2}\\\mathtt{2}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{1}\), y-intercept → \(\mathtt{-1}\), x-intercept → \(\mathtt{1}\)

  4. \(\mathtt{l_{4}(k)=}\begin{bmatrix}\mathtt{\,\,\,\,3}\\\mathtt{-9}\end{bmatrix}+\begin{bmatrix}\mathtt{7}\\\mathtt{0}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{0}\), y-intercept → \(\mathtt{-9}\), x-intercept → none

  5. \(\mathtt{l_{5}(k)=}\begin{bmatrix}\mathtt{-5}\\\mathtt{\,\,\,\,1}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}\mathtt{k}\)

slope → \(\mathtt{-\frac{4}{3}}\), y-intercept → \(\mathtt{-\frac{17}{3}}\), x-intercept → \(\mathtt{-\frac{17}{4}}\)

And just two more. Determine whether each point is on the line mentioned above. (A code sketch for automating this check follows the answers.)

  1. Is \(\mathtt{(-11,3)}\) on the line \(\mathtt{l_1}\)?

No: \(\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}\).
No value of \(\mathtt{k}\) solves both \(\mathtt{2k+1=-11}\) and \(\mathtt{-k+2=3}\).

  2. Is \(\mathtt{(5,5)}\) on the line \(\mathtt{l_3}\)?

No. Since the line has a slope of \(\mathtt{1}\), it would contain \(\mathtt{(5,5)}\) if the line passed through the origin, but it does not.
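Here is the promised sketch for automating that check: solve each component of \(\mathtt{p+kv=q}\) for \(\mathtt{k}\) and see whether the components agree on a single value. (The function name and tolerance are just illustrative.)

```python
def on_line(p, v, q, tol=1e-9):
    """Is the point q on the line l(k) = p + k*v?"""
    ks = []
    for p_i, v_i, q_i in zip(p, v, q):
        if v_i == 0:
            # This component never changes, so it must already match
            if abs(q_i - p_i) > tol:
                return False
        else:
            ks.append((q_i - p_i) / v_i)
    # Every nonzero component must agree on the same k
    return all(abs(k - ks[0]) <= tol for k in ks) if ks else True

print(on_line((1, 2), (2, -1), (-11, 3)))  # False: k = -6 vs. k = -1
print(on_line((0, -1), (2, 2), (5, 5)))    # False: k = 2.5 vs. k = 3
print(on_line((0, -1), (2, 2), (5, 4)))    # True: both components give k = 2.5
```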

More with Least Squares

One very cool thing about our formula for the least squares regression line, \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\), is that it is the same no matter whether we have one independent variable (univariate) or many independent variables (multivariate).

Consider these data, showing the selling prices of some grandfather clocks at auction. The first scatter plot shows the age of the clock in years on the x-axis (100–200), and the second shows the number of bidders on the x-axis (0–20). Price (in pounds or dollars) is on the y-axis on each plot (500–2500).

| Age (years) | Bidders | Price ($) |
|---|---|---|
| 127 | 13 | 1235 |
| 115 | 12 | 1080 |
| 127 | 7 | 845 |
| 150 | 9 | 1522 |
| 156 | 6 | 1047 |
| 182 | 11 | 1979 |
| 156 | 12 | 1822 |
| 132 | 10 | 1253 |
| 137 | 9 | 1297 |
| 113 | 9 | 946 |
| 137 | 15 | 1713 |
| 117 | 11 | 1024 |
| 137 | 8 | 1147 |
| 153 | 6 | 1092 |
| 117 | 13 | 1152 |
| 126 | 10 | 1336 |
| 170 | 14 | 2131 |
| 182 | 8 | 1550 |
| 162 | 11 | 1884 |
| 184 | 10 | 2041 |
| 143 | 6 | 854 |
| 159 | 9 | 1483 |
| 108 | 14 | 1055 |
| 175 | 8 | 1545 |
| 108 | 6 | 729 |
| 179 | 9 | 1792 |
| 111 | 15 | 1175 |
| 187 | 8 | 1593 |
| 111 | 7 | 785 |
| 115 | 7 | 744 |
| 194 | 5 | 1356 |
| 168 | 7 | 1262 |

You can see in the notebook below that the first regression line, for the price of a clock as a function of its age, is approximately \(\mathtt{10.5x-192}\). The second regression line, for the price of a clock as a function of the number of bidders at auction, is approximately \(\mathtt{55x+806}\). As mentioned above, each of these univariate least squares regression lines can be calculated with the formula \(\mathtt{\left(X^{T}X\right)^{-1}X^{T}y}\).

Combining both age and number of bidders together, we can calculate, using the same formula, a multivariate least squares regression equation. This of course is no longer a line. In the case of two input variables as we have here, our line becomes a plane.

Our final regression equation becomes \(\mathtt{12.74x_{1}+85.82x_{2}-1336.72}\), with \(\mathtt{x_1}\) representing the age of a clock and \(\mathtt{x_2}\) representing the number of bidders.
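A NumPy sketch of those three calculations, standing in for the notebook, might look like the following; the coefficients come out close to the rounded values quoted above.

```python
import numpy as np

# Grandfather clock auction data from the table above
age     = np.array([127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117, 137, 153, 117, 126,
                    170, 182, 162, 184, 143, 159, 108, 175, 108, 179, 111, 187, 111, 115, 194, 168])
bidders = np.array([13, 12, 7, 9, 6, 11, 12, 10, 9, 9, 15, 11, 8, 6, 13, 10,
                    14, 8, 11, 10, 6, 9, 14, 8, 6, 9, 15, 8, 7, 7, 5, 7])
price   = np.array([1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024, 1147,
                    1092, 1152, 1336, 2131, 1550, 1884, 2041, 854, 1483, 1055, 1545, 729, 1792,
                    1175, 1593, 785, 744, 1356, 1262])

def least_squares(*columns, y):
    """Solve (X^T X)^-1 X^T y, with a column of 1s appended for the intercept."""
    X = np.column_stack(columns + (np.ones(len(y)),))
    return np.linalg.inv(X.T @ X) @ X.T @ y

print(least_squares(age, y=price))           # roughly [10.5, -192]
print(least_squares(bidders, y=price))       # roughly [55, 806]
print(least_squares(age, bidders, y=price))  # roughly [12.74, 85.82, -1336.72]
```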

Post-Hoc Confidences

A smart defense of any argument for less teacher-directed instruction in mathematics classrooms is to point to the logical connectedness of mathematics as a body of knowledge and suggest that students are capable of crossing many if not all of the logical bridges between propositions themselves, or with minimal guidance.

Such connectedness, it can be suggested, makes mathematics somewhat different from other school subjects. For example, given a student’s conceptual understanding of a fraction as a part-to-whole ratio, which can include his or her ability to represent a fraction with a visual or physical model, it seems to follow logically that he or she can then add two fractions and get the correct sum, so long as the student knows (intuitively or more formally) that addition is about combining values linearly. It doesn’t matter how many prerequisites there are for adding fractions. The suggestion is that once those prerequisites have been met, it is a matter of merely crossing a logical bridge to adding fractions (mostly) correctly.

By way of contrast, a student can’t really induce what happened after, say, the bombing of Pearl Harbor. They have to be informed about it directly. The effects can certainly be narrowed down using common sense reasoning and other domain-specific knowledge. But, ultimately, what happened happened and there is no reason to suspect that, in general, students can make their way through a study of history mostly blindfolded, relying only on logic and common sense.

The example of history brings up an interesting point (to me, anyway) about the example of mathematics, though. Historical consequences from historical causes can be dubbed “inevitable” only after the fact. How can we be sure it is not the same when learning anything, including mathematics? Once you know, conceptually as it were, what adding fractions is, of course it seems to be a purely logical consequence of what fractions are fundamentally. But is this seeming inevitability available to the novice, the learner who is aware of what fractions are but hasn’t ever thought about adding them? The average novice is, after all, where that feeling of logical inevitability has to lie. It is not enough for educated adults to think of something as ‘logical’ after they already know it.

Bertrand Russell argues, in a 1907 essay, that even in mathematics we don’t proceed from premises to conclusions, but rather the other way around.

We tend to believe the premises because we can see that their consequences are true, instead of believing the consequences because we know the premises to be true. But the inferring of premises from consequences is the essence of induction [abduction]; thus the method in investigating the principles of mathematics is really an inductive method, and is substantially the same as the method of discovering in any other science.

So, how can we decide whether some bridge in reasoning is available to and crossable by the average novice? I hope it’s clear that we can’t just figure it out via anecdotes and armchair reasoning. Our intuitions can’t be trusted with this question. And our opinions one way or the other on the matter are not helpful, no matter what they are.

Linear Algebra Connections

There are so many connections within and applications of linear algebra—I can only imagine that this will be a series of posts, to the extent that I continue writing about the subject. Here are a few connections that I’ve come across in my reading recently.

Compound Interest and Matrix Powers

We can multiply a matrix by itself \(\mathtt{n}\) times. The result is the matrix to the power \(\mathtt{n}\). We can use this when setting up a compound interest situation. For example, suppose we have three accounts, which each have a different interest rate compounded annually—say, 5%, 3%, and 2%. Without linear algebra, the amount in the first account can be modeled by the equation \[\mathtt{A(t)=p \cdot 1.05^t}\] where \(\mathtt{p}\) represents the starting amount in the account, and \(\mathtt{t}\) represents the time in years. With linear algebra, we can group all of the account interest rates into a matrix. The first year for each account would look like this: \[\mathtt{A(1)}=\begin{bmatrix}\mathtt{1.05}&\mathtt{0}&\mathtt{0}\\\mathtt{0}&\mathtt{1.03}&\mathtt{0}\\\mathtt{0}&\mathtt{0}&\mathtt{1.02}\end{bmatrix}^\mathtt{1}\begin{bmatrix}\mathtt{p_1}\\\mathtt{p_2}\\\mathtt{p_3}\end{bmatrix}=\begin{bmatrix}\mathtt{1.05p_1}\\\mathtt{1.03p_2}\\\mathtt{1.02p_3}\end{bmatrix}\]

For years beyond the first, all we have to do is raise the matrix to the appropriate power. Since the matrix is diagonal, squaring it, cubing it, etc., will square, cube, etc., each entry. Setting the computation up this way keeps it a little more organized—and makes it more straightforward to program. (A matrix has to be square, \(\mathtt{m \times m}\), in order to raise it to a power.) Below we calculate the amount in each account after 100 years.
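Here is a sketch of that calculation with NumPy, using hypothetical principals of $1,000 in each account (the starting amounts aren’t specified above):

```python
import numpy as np

# Annual growth factors for the three accounts, on the diagonal
growth = np.diag([1.05, 1.03, 1.02])

# Hypothetical principals (not given in the text): $1,000 in each account
p = np.array([1000.0, 1000.0, 1000.0])

# Amounts after 100 years: raise the diagonal matrix to the 100th power
amounts = np.linalg.matrix_power(growth, 100) @ p
print(amounts.round(2))  # roughly [131501.26, 19218.63, 7244.65]
```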

Centroids and Areas

We have seen that the determinant can be thought about as the area of the parallelogram formed by two vectors. We can use this fact to determine the area of a complex shape like the one shown below.

Since determinants are signed areas, as we move around the shape counterclockwise, calculating the determinant of each vector pair (and multiplying each determinant by one half so we just get each triangle), we get the total area of the shape.\[\frac{1}{2}\left(\begin{vmatrix}\mathtt{6}&\mathtt{6}\\\mathtt{0}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{6}&\mathtt{3}\\\mathtt{4}&\mathtt{4}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{3}\\\mathtt{4}&\mathtt{6}\end{vmatrix}+\begin{vmatrix}\mathtt{3}&\mathtt{-2}\\\mathtt{6}&\mathtt{\,\,\,\,6}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{-2}\\\mathtt{\,\,\,\,6}&\mathtt{\,\,\,\,3}\end{vmatrix}+\begin{vmatrix}\mathtt{-2}&\mathtt{0}\\\mathtt{\,\,\,\,3}&\mathtt{3}\end{vmatrix}\right)=\mathtt{36}\text{ units}\mathtt{^2}\]

That’s pretty hand-wavy, but it’s something that you can probably figure out with a little experimentation.

Another counterclockwise-moving calculation (though this one can be clockwise without changing the answer) is the calculation of the centroid of a closed shape. All that is required here is to calculate the sum of the position vectors of the vertices of the figure and then divide by the number of vertices. (Strictly speaking, this gives the centroid of the vertices, which is not in general the same as the centroid of the enclosed area. And since one of this figure’s five vertices sits at the origin, it contributes nothing to the sum, which is why only four vectors appear below.)

\[\mathtt{\frac{1}{5}}\left(\begin{bmatrix}\mathtt{3}\\\mathtt{4}\end{bmatrix}+\begin{bmatrix}\mathtt{0}\\\mathtt{6}\end{bmatrix}+\begin{bmatrix}\mathtt{-3}\\\mathtt{\,\,\,\,4}\end{bmatrix}+\begin{bmatrix}\mathtt{-1}\\\mathtt{\,\,\,\,3}\end{bmatrix}\right)=\begin{bmatrix}\mathtt{-\frac{1}{5}}\\\mathtt{\,\,\,\,\frac{17}{5}}\end{bmatrix}\]

Here again, it’s just magic, but you can figure it out with a little play. In each case—for both complex areas and centroids—we assign a point to be the origin and go from there.
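For a little of that play, here is a Python sketch of both calculations, with vertex coordinates read off the determinants and vectors above (and one vertex of each figure placed at the origin, as just described):

```python
def signed_area(vertices):
    """Half the sum of the 2x2 determinants of consecutive vertex pairs (the shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1   # determinant of the two position vectors
    return total / 2

def vertex_centroid(vertices):
    """Average of the position vectors of the vertices."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)

# Vertices read off the calculations above (one vertex of each figure at the origin)
area_shape = [(0, 0), (6, 0), (6, 4), (3, 4), (3, 6), (-2, 6), (-2, 3), (0, 3)]
centroid_shape = [(0, 0), (3, 4), (0, 6), (-3, 4), (-1, 3)]

print(signed_area(area_shape))          # 36.0 square units
print(vertex_centroid(centroid_shape))  # (-0.2, 3.4), i.e., (-1/5, 17/5)
```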