A projection is like the shadow of a vector, say \(\mathtt{u}\), on another vector, say \(\mathtt{v}\), if light rays were coming in across \(\mathtt{u}\) and perpendicular to \(\mathtt{v}\). For the vectors at the right, imagine a light source above and to the left of the illustration, perpendicular to the vector \(\mathtt{v}\). The projection, which we’ll call \(\mathtt{p}\), will be a vector that will extend from the point shown to where the end of the shadow of \(\mathtt{u}\) touches \(\mathtt{v}\).

If you take a moment to read that description twice (it’s kind of dense), you may be able to tell that the vector \(\mathtt{p}\) that we’re looking for will be some scalar multiple of vector \(\mathtt{v}\), since it will lie along the same line as \(\mathtt{v}\). In fact, given the picture, the projection vector \(\mathtt{p}\) will point in the opposite direction from \(\mathtt{v}\), so the scale factor will be negative, and it will be pretty close to 0, since the projection vector looks like it will be much shorter than \(\mathtt{v}\).

That information is shown at the right. Helpfully, the vector \(\mathtt{u-p}\) is perpendicular to \(\mathtt{v}\), so we know that \(\mathtt{\left(u-p\right)\cdot v=0}\). And, using the Distributive Property, we get \(\mathtt{u\cdot v-p\cdot v=0}\).

Since we know that our sought-after vector \(\mathtt{p}\) will be some scalar multiple of \(\mathtt{v}\), we can substitute, say, \(\mathtt{cv}\) for \(\mathtt{p}\) in the above to get \(\mathtt{u\cdot v-cv\cdot v=0}\). And a property of dot product multiplication allows us to rewrite that as \(\mathtt{u\cdot v-c\left(v\cdot v\right)=0}\).

This means that \(\mathtt{u\cdot v=c\left(v\cdot v\right)}\), which means that the scale factor \(\mathtt{c}\) that we’re after—the factor we can multiply by \(\mathtt{v}\) to produce \(\mathtt{p}\)—is \[\mathtt{c=\frac{u\cdot v}{v\cdot v}\quad\rightarrow\quad p=\left(\frac{u\cdot v}{v\cdot v}\right)v}\]
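To check the formula numerically, here is a minimal Python sketch. The vectors `u` and `v` are made-up values, not the ones in the illustration, and the helper names `dot` and `project` are my own. It computes \(\mathtt{c}\) and \(\mathtt{p}\), then confirms that \(\mathtt{u-p}\) is perpendicular to \(\mathtt{v}\).

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))


def project(u, v):
    """Return (c, p): the scale factor c = (u.v)/(v.v) and the projection p = c*v."""
    vv = dot(v, v)
    if vv == 0:
        raise ValueError("cannot project onto the zero vector")
    c = dot(u, v) / vv
    return c, [c * x for x in v]


# Made-up example vectors (an assumption, not taken from the illustration).
u = [-1.0, 2.0]
v = [4.0, 1.0]
c, p = project(u, v)

# The leftover part u - p should be perpendicular to v, so this dot product
# should be zero up to floating-point rounding.
residual = [ui - pi for ui, pi in zip(u, p)]
print(c, p, dot(residual, v))
```

With these particular numbers the scale factor comes out negative and small, which is the same kind of situation described for the picture above.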

I’ve started a writing project recently that I’m having a good time working on so far. I’ve called it Scala Math (and on Twitter here) for now, because its central focus is deconstructing concepts and procedures into steps, and la scala is Italian for ‘staircase’. You can see the word at work in ‘escalator’, ‘scale’, etc. Scala is also the name of a programming language. Here are some reasons for that I found online.

Most of the projects I’ve worked on over the past few years have also been ways for me to learn new software languages or libraries. For Geometry Theorems, it was d3. For Scala, it was React—as well as the beautiful, amazing database that a normal person can actually look at and edit and it’s still a database: Airtable.

How It Works: Learn

Every Scala has a display window—where images and videos are shown—and a steps window, where you find the text of the steps, or ‘parts’. These areas are divided by a brain, which I’ll talk about below. When you land on a Scala (this one is Solving Arithmetic Sequences), the first thing shown in the display window is an image presenting a quick snippet of what will be covered. The image shows an essential question at the top. The use-case for the snippet was a student wanting a quick reminder about something they are working on, perhaps for homework, without having to search online and wade through tons of stuff that sorta-kinda matches what they want but not really.

The remainder of the section shown at left (called ‘Learn’ mode) is a series of steps (in this case, six), explained with text, audio narration, and the accompanying images that you can see appearing when clicking on each step. The dot navigation at the top shows us that we are on the first screen of this Scala.

Each step card has a button to replay the step, which can be pressed at any time while the step is active, and a button (up arrow) to go to the preceding step.

How It Works: Reflect

As you can see at the end of the video above, there is a Reflection question which calls for a short or extended text response. This is where the audio input on my cell phone comes in handy. Students’ responses are, at the moment, compared to a few ‘correct’ responses that I have written and that others have contributed to. The response which has the highest numerical match on a scale from 0 to 100 is presented as your score, and the pre-written response is presented as a suggested answer.

How It Works: Try

After the Learn phase is the Try phase, which consists of example-problem pairs (usually; for a very few cases, so far, stepped-out problems only). Or, more specifically, stepped-out problems followed by not-stepped-out problems. These look a little different from what I typically see as example-problem pairs, where the example and the problem are set side by side. Here, the problem follows the example, and the example is not provided when solving the problem. The typical sequence is shown below.

For the Try and Test phases, it’s always multiple choice, although it’s in the plan to look at other response inputs. When students are logged in, they build up (not earn; see below) points for every question. Right now, it’s just 50 points for each, though that gets cut in half and rounded up to the nearest integer for every incorrect answer. For an item with 3 choices, the lowest point total possible is 13. For an item with 4 choices, the lowest is 7.
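The halving rule is easy to sketch in Python. This is only an illustration of the arithmetic described above, not the app’s actual code; the function names `try_points` and `lowest_points` are made up, and I’m assuming the worst case is a student trying every wrong choice before the right one.

```python
import math


def try_points(wrong_answers, start=50):
    """Points left on one Try item after a number of incorrect answers.

    Each incorrect answer halves the current value and rounds up.
    """
    points = start
    for _ in range(wrong_answers):
        points = math.ceil(points / 2)
    return points


def lowest_points(num_choices, start=50):
    """Worst case (assumed): every wrong choice is tried before the right one."""
    return try_points(num_choices - 1, start)


print(lowest_points(3))  # 13
print(lowest_points(4))  # 7
```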

On desktop, students can have the question read aloud via text-to-speech. As far as I know, that hasn’t yet come to mobile as a built-in feature, but I’ll keep my ears open for when it does.

How It Works: Test

Finally, there’s the Test phase. This is typically 4 to 6 questions that are of the same form as the ‘problems’ in the example-problem-pair Try phase. I’m just showing one such question in the video at the right.

When students are logged in, they can earn points by taking the test. The points are built up in both the Learn and Try phases. I have described how the points work for the Try phase above. The Learn phase is simpler: just clicking on a step builds up 100 points. At the moment, no points are tied to the Reflect question.

Once a student reaches the Test phase, the greatest number of points he or she can ‘bank’ is the number he or she has built up over the course of the Learn and Try phases. And the Test phase is fairly high stakes, in that each incorrect answer divides the total possible points to earn in half.

The stars shown on the score modal are awarded based on percent of total points earned. For the lesson shown in this post, the total that can be earned is 1700. So, approximately 560 points is 1 star (33%), 1130 points is 2 stars (66%), and 1360 points is 3 stars (80%).
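As a rough sketch of the star cutoffs, assuming they are applied as “at least this percent of the total” (the function name `stars` and that inclusive rule are my assumptions here, not a description of the real code):

```python
def stars(points, total=1700):
    """Number of stars for a score, using the 33% / 66% / 80% cutoffs."""
    earned = 0
    for cutoff in (33, 66, 80):
        # Compare in whole numbers to avoid floating-point rounding issues.
        if points * 100 >= cutoff * total:
            earned += 1
    return earned


for score in (500, 561, 1122, 1360, 1700):
    print(score, stars(score))
```

Under that assumption, the exact cutoffs for a 1700-point lesson work out to 561, 1122, and 1360 points, which is close to the rounded figures given above.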

Finally, to make sure this product connects knowledgeable people with students (whether they be parents or teachers or both) and guards against mindlessly pressing buttons to earn points, there is a final front-and-back activity, wherein students solve a different problem by listing the steps themselves and showing all their work.

Pay attention to your thought process and how you use expert knowledge as you answer the question below. How do you think very young students would think about it?

Here are some birds and here are some worms. How many more birds than worms are there?

Hudson (1983) found that, among a small group of first-grade children (mean age of 7.0), just 64% completed this type of task correctly. However, when the task was rephrased as follows, all of the students answered correctly.

Here are some birds and here are some worms. Suppose the birds all race over, and each one tries to get a worm. Will every bird get a worm? How many birds won’t get a worm?

This is consistent with adults’ intuitions about the two tasks as well.

Interpret the Results

Still, what can we say about these results? Is it the case that 100% of the students used “their knowledge of correspondence to determine exact numerical differences between disjoint sets”? That is how Hudson describes students’ unanimous success in the second task. The idea seems to be that the knowledge exists; it’s just that a certain magical turn of phrase unlocks and releases this otherwise submerged expertise.

But that expert knowledge is given in the second task: “each one tries to get a worm.” The question paints the picture of one-to-one correspondence, and gives away the procedure to use to determine the difference. So, “their knowledge” is a bit of a stretch, and “used their knowledge” is even more of a stretch, since the task not only sets up a structure but animates its moving parts as well (“suppose the birds all race over”).

Further, questions about whether or not students are using knowledge they possess raise questions about whether or not students are, in fact, determining “exact numerical differences between disjoint sets.” On the contrary, it can be argued that students are simply watching almost all of a movie in their heads (a mental simulation)—a movie for which we have provided the screenplay—and then telling us how it ends (spoiler: 2 birds don’t get a worm). The deeper equivalence between the solution “2” and the response “2” to the question “How many birds won’t get a worm?” is evident only to a knowledgeable onlooker.

Experiment 3

Hudson anticipates some of the skepticism on display above when he introduces the third and last experiment in the series.

It might be argued, success in the Won’t Get task does not require a deep level of mathematical understanding; the children could have obtained the exact numerical differences by mimicking by rote the actions described by the problem context . . . In order to determine more fully the level of children’s understanding of correspondences and numerical differences, a third experiment was carried out that permitted a detailed analysis of children’s strategies for establishing correspondences between disjoint sets.

The wording in the Numerical Differences task of this third experiment, however, did not change. The “won’t get” locutions were still used. Yet, in this experiment, when paying attention to students’ strategies, Hudson observed that most children did not mentally simulate in the way directly suggested by the wording (pairing up the items in a one-to-one correspondence).

This does not defeat the complaint above, though. The fact that a text does not effectively compel the use of a procedure does not mean that it is not the primary influence on correct answers. It still seems more likely than not that participants who failed the “how many more” task simply didn’t have stable, abstract, transferable notions about mathematical difference. And the reformulation represented by the “won’t get” task influenced students to provide a response that was correct.

But this was a correct response to a different question. As adults with expert knowledge, we see the logical and mathematical similarities between the “how many more” and “won’t get” situations, and thus we are easily fooled into believing that applying skills and knowledge in one task is equivalent to doing so in the other.

The term ‘entia successiva’ means ‘successive entities.’ And, as you may guess, it is a term one might come across in a philosophy class, in particular when discussing metaphysical questions about personhood. For instance, is a person a single thing throughout its entire life or a succession of different things—an ‘ens successivum’? Though there is no right answer to this question, becoming familiar with the latter perspective can, I think, help people to be more skeptical and knowledgeable consumers of education research.

Richard Taylor provides an example of a symphony (in here) that is, depending on your perspective, both a successive and a permanent entity:

Let us imagine [a symphony orchestra] and give it a name—say, the Boston Symphony. One might write a history of this orchestra, beginning with its birth one hundred years ago, chronicling its many tours and triumphs and the fame of some of its musical directors, and so on. But are we talking about one orchestra?

In one sense we are, but in another sense we are not. The orchestra persists through time, is incorporated, receives gifts and funding, holds property, has a bank account, returns always to the same city, and rehearses, year in and year out, in the same hall. Yet its membership constantly changes, so that no member of fifty years ago is still a member today. So in that sense it is an entirely different orchestra. We are in this sense not talking about one orchestra, but many. There is a succession of orchestras going under the same name. Each, in [Roderick] Chisholm’s apt phrase, does duty for what we are calling the Boston Symphony.

The Boston Symphony is thus an ens successivum.

People are entia successiva, too. Or, at least their bodies are. Just about every cell in your body has been replaced from only 10 years ago. So, if you’re a 40-year-old Boston Symphony like me, almost all of your musicians and directors have been swapped out from when you were a 30-year-old symphony. People still call you the Boston Symphony of course (because you still are), but an almost entirely different set of parts is doing duty for “you” under the banner of this name. You are, in a sense, an almost completely different person—one who is, incidentally, made up of at least as many bacterial cells as human ones.

What’s worse (if you think of the above as bad news), the fact of evolution by natural selection tells us that humanity itself is an ens successivum. If you could line up your ancestors—your mother or father, his or her mother or father, and so on—it would be a very short trip down this line before you reached a person with whom you could not communicate at all, save through gestures. Between 30 and 40 people in would be a person who had almost no real knowledge about the physical universe. And there’s a good chance that perhaps the four thousandth person in your row of ancestors would not even be human.

The ‘Successive’ Perspective

Needless to say, seeing people as entia successiva does not come naturally to anyone. Nor should it, ever. We couldn’t go about our daily lives seeing things this way. But the general invisibility of this ‘successiveness’ is not due to its only being operational at the very macro or very micro levels. It can be seen at the psychological level too. Trouble is, our brains are so good at constructing singular narratives out of even absolute gibberish, we sometimes have to place people in unnatural or extreme situations to get a good look at how much we can delude ourselves.

An Air Force doctor’s experiences investigating the blackouts of pilots in centrifuge training provide a nice example (from here). It’s definitely worth quoting at length:

Over time, he has found striking similarities to the same sorts of things reported by patients who lost consciousness on operating tables, in car crashes, and after returning from other nonbreathing states. The tunnel, the white light, friends and family coming to greet you, memories zooming around—the pilots experienced all of this. In addition, the centrifuge was pretty good at creating out-of-body experiences. Pilots would float over themselves, or hover nearby, looking on as their heads lurched and waggled about . . . the near-death and out-of-body phenomena are both actually the subjective experience of a brain owner watching as his brain tries desperately to figure out what is happening and to orient itself amid its systems going haywire due to oxygen deprivation. Without the ability to map out its borders, the brain often places consciousness outside the head, in a field, swimming in a lake, fighting a dragon—whatever it can connect together as the walls crumble. What the deoxygenated pilots don’t experience is a smeared mess of random images and thoughts. Even as the brain is dying, it refuses to stop generating a narrative . . . Narrative is so important to survival that it is literally the last thing you give up before becoming a sack of meat.

You’ll note, I hope, that not only does the report above disclose how our very mental lives are entia successiva—thoughts and emotions that arise and pass away—but the report assumes this perspective in its own narrative. That’s because the report is written from a scientific point of view. And from that vantage point, people are assumed (correctly) to have parts that “do duty” for them and may even be at odds with each other, as they were with the pilots (a perception part fighting against a powerful narrative-generating part). The unit of analysis in the report is not an entire pilot, but the various mechanisms of her mind. Allowing for these parts allows for functional explanations like the one we see.

An un-scientific analysis, on the other hand, is entirely possible. But it would stop at the pilot. He or she is, after all, an indivisible, permanent entity. There is nothing else “doing duty” for him, so there are really only two choices: his experience was an illusion or it was real. End of analysis. Interpret it as an illusion and you don’t really have much to say; interpret it as real, and you can make a lot of money.

Entia Permanentia

Good scientific research in education will adopt an entia successiva perspective about the people it studies. This does not guarantee that its conclusions are correct. But it makes it more likely that, over time, it will get to the bottom of things.

This is not to say that an alternative perspective is without scientific merit. If we want to know how to improve the performance of the Boston Symphony, we can make some headway with ‘entia permanentia’—seeing the symphony as a whole stable unit rather than a collection of successive parts. We could increase its funding, perhaps try to make sure “it” is treated as well as other symphonies around the world. We could try to change the music, maybe include some movie scores instead of that stuffy old classical music. That would make it more exciting for audiences (and more inclusive), which is certainly one interpretation of “improvement.” But to whatever extent improvement means improving the functioning of the parts of the symphony—the musicians, the director, etc.—we can do nothing, because with entia permanentia these tiny creatures do not exist. Even raising the question about improving the parts would be beyond the scope of our imagination.

Further, seeing students as entia permanentia rather than entia successiva stops us from being appropriately skeptical about both ‘scientific’ and ‘un-scientific’ ideas. Do students learn best when matched to their learning style? What parts of their neurophysiology and psychology could possibly make something like that true? Why would it have evolved, if it did? In what other aspects of our lives might this present itself? Adopting the entia successiva perspective would have slowed the adoption of this myth (even if it were not a myth) to a crawl and would have eventually killed it. Instead, entia permanentia, a person-level analysis, holds sway: students benefit from learning-style matching because we see them respond differently to different representations. End of analysis.

A different but similar perspective on this, from a recurring theme in the book Switch:

In a pioneering study of organizational change, described in the book The Critical Path to Corporate Renewal, researchers divided the change efforts they’d studied into three groups: the most successful (the top third), the average (the middle third), and the least successful (the bottom third). They found that, across the spectrum, almost everyone set goals: 89 percent of the top third and 86 percent of the bottom third . . . But the more successful change transformations were more likely to set behavioral goals: 89 percent of the top third versus only 33 percent of the bottom third.

Why do “behavioral” goals work when just “goals” don’t? Behavioral goals are, after all, telling you what to do, forcing you to behave in a certain way. Do you like to be told what to do? Probably not.

But the “you” that responds to behavioral goals isn’t the same “you” whose in-the-moment “likes” are important. You are more than just one solid indivisible self. You are many selves, and the self that can start checking stuff off the to-do list is often pulling the other selves behind it. And when it does, you get to think that “you” are determined, “you” take initiative, “you” have willpower. But in truth, your environment—both immediate and distant, both internal and external—has simply made it possible for that determined self to take the lead. Behavioral goals often create this exact environment.

Where has this been all my life? The Law of Total Probability is really cool, and it seems accessible enough to be presented in high school, where it would be very useful as well, I think, although I’ve never seen it there. For example, from the book Causal Inference in Statistics we get this nice problem (in addition to the quote below): “Suppose we roll two dice, and we want to know the probability that the second roll (R2) is higher than the first (R1).” The Law of Total Probability can make answering this much more straightforward.

To understand this ‘law,’ we should start by understanding two simple things about probability. First, for any two mutually exclusive events (the events can’t happen together), the probability of \(\mathtt{A}\) or \(\mathtt{B}\) is the sum of the probability of \(\mathtt{A}\) and the probability of \(\mathtt{B}\):

\(\mathtt{P(A\text{ or }B)\,\,\,\,=\,\,\,\,\,\,P(A)\,\,\,\,+\,\,\,\,P(B)}\)

Second, thinking now of two mutually exclusive events “A and B” and “A and not-B”, we can write the probability of \(\mathtt{A}\) this way, since if \(\mathtt{A}\) is true, then either “A and B” or “A and not-B” must be true:

\(\mathtt{P(A)=P(A\text{ and }B)+P(A\text{ and not-}B)}\)

In different situations, however, \(\mathtt{B}\) could take on many different values—for example, the six possible values of one die roll, \(\mathtt{B_1}\)–\(\mathtt{B_6}\)—even while we’re considering just one value for the mutually exclusive event \(\mathtt{A}\)—for example, rolling a 4. The Law of Total Probability tells us that

\(\mathtt{P(A)=P(A,B_1)+\cdots+P(A,B_n)}\).

If we pull a random card from a standard deck, the probability that the card is a Jack [\(\mathtt{P(J)}\)] will be equal to the probability that it’s a Jack and a spade [\(\mathtt{P(J,C_S)}\)], plus the probability that it’s a Jack and a heart [\(\mathtt{P(J,C_H)}\)], plus the probability that it’s a Jack and a club [\(\mathtt{P(J,C_C)}\)], plus the probability that it’s a Jack and a diamond [\(\mathtt{P(J,C_D)}\)].

Now with Conditional Probabilities

Where this gets good is when we throw conditional probabilities into the mix. We can make use of the fact that \(\mathtt{P(A,B)=P(A|B)P(B)}\), where \(\mathtt{P(A|B)}\) means “the probability of A given B.” For example, the probability of randomly pulling a Jack, given that you pulled a spade, is \(\mathtt{\frac{1}{13}}\), and the probability of randomly pulling a spade is \(\mathtt{\frac{1}{4}}\). Thus, the probability of pulling the Jack of spades is \(\mathtt{\frac{1}{13}\cdot \frac{1}{4}=\frac{1}{52}}\). We can, therefore, rewrite the Law of Total Probability this way:

\(\mathtt{P(A)=P(A|B_1)P(B_1)+\cdots+P(A|B_n)P(B_n)}\)

And now we’re ready to determine the probability given in the opening paragraph, \(\mathtt{P(R2>R1)}\), the probability that a second die roll is greater than the first die roll: \[\mathtt{P(R2>R1)=P(R2>R1|R1=1)P(R1=1)+\cdots+P(R2>R1|R1=6)P(R1=6)}\]

The final result is \(\mathtt{\frac{5}{6}\cdot \frac{1}{6}+\frac{4}{6}\cdot \frac{1}{6}+\frac{3}{6}\cdot \frac{1}{6}+\frac{2}{6}\cdot \frac{1}{6}+\frac{1}{6}\cdot \frac{1}{6}+\frac{0}{6}\cdot \frac{1}{6}=\frac{5}{12}}\).
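If you’d like to check that sum, here is a small Python sketch (the variable names are mine). It adds up the six conditional terms using exact fractions and then cross-checks the answer by simply counting all 36 equally likely outcomes.

```python
from fractions import Fraction

faces = range(1, 7)

# Law of Total Probability: sum over R1 of P(R2 > R1 | R1 = r) * P(R1 = r).
total = Fraction(0)
for r1 in faces:
    p_r1 = Fraction(1, 6)
    p_higher_given_r1 = Fraction(sum(1 for r2 in faces if r2 > r1), 6)
    total += p_higher_given_r1 * p_r1

# Cross-check by counting all 36 equally likely (R1, R2) outcomes directly.
favorable = sum(1 for r1 in faces for r2 in faces if r2 > r1)
direct = Fraction(favorable, 36)

print(total, direct)  # 5/12 5/12
```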

Reading through Judea Pearl’s Book of Why, along with Causal Inference in Statistics, co-authored by Pearl, I came across a nifty, new-to-me explanation of the famous Monty Hall Problem.

To make it clearer, we can start with a toy mathematical problem, \(\mathtt{a+b=c}\), which I model with the causal diagram below. A causal diagram is, of course, a little inappropriate for modeling a mathematical equation, but it’s also one that students may first use implicitly to think about operations and equations (and, without proper instruction, it’s one adults may use all their lives).

Here, we will think of the variables \(\mathtt{a}\) and \(\mathtt{b}\) as independent. Changing the number we substitute for \(\mathtt{a}\) does not affect our choice for \(\mathtt{b}\) and vice versa. However, \(\mathtt{a}\) and \(\mathtt{c}\) are dependent, and \(\mathtt{b}\) and \(\mathtt{c}\) are dependent as well. Increasing or decreasing \(\mathtt{a}\) or \(\mathtt{b}\) alone (or together) will have an effect on \(\mathtt{c}\).

But once we fix \(\mathtt{c}\), or “condition on \(\mathtt{c}\)” as Pearl would write, then \(\mathtt{a}\) and \(\mathtt{b}\) become dependent variables. That’s at once clear as a bell and pretty wacky. If we fix \(\mathtt{c}\) at 10, then changing \(\mathtt{a}\) will change \(\mathtt{b}\) and vice versa. But \(\mathtt{a}\) and \(\mathtt{b}\) were entirely independent prior to knowing what \(\mathtt{c}\) was. Afterwards, they’re dependent on each other. The diagram that represents this situation (above) is what Pearl calls a “collider.”
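A tiny Python sketch can make the before-and-after concrete. I’m assuming, just for illustration, that \(\mathtt{a}\) and \(\mathtt{b}\) are whole numbers from 0 to 10; nothing about that range comes from Pearl.

```python
values = range(11)  # hypothetical range of whole numbers for a and b

# All (a, b) pairs; c is determined by a + b = c.
pairs = [(a, b) for a in values for b in values]

# Before fixing c: knowing a = 3 leaves every value of b possible.
b_given_a = {b for a, b in pairs if a == 3}

# After fixing c = 10: knowing a = 3 pins b down to a single value.
b_given_a_and_c = {b for a, b in pairs if a == 3 and a + b == 10}

print(sorted(b_given_a))        # all of 0 through 10
print(sorted(b_given_a_and_c))  # [7]
```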

Pearl et al. also use a more everyday example:

Suppose a certain college gives scholarships to two types of students: those with unusual musical talents and those with extraordinary grade point averages. Ordinarily, musical talent and scholastic achievement are independent traits, so, in the population at large, finding a person with musical talent tells us nothing about that person’s grades. However, discovering that a person is on a scholarship changes things; knowing that the person lacks musical talent then tells us immediately that he is likely to have high grade point average. Thus, two variables that are marginally independent become dependent upon learning the value of a third variable (scholarship) that is a common effect of the first two.

For the Monty Hall problem, the collider model looks identical. And the correct model helps us see two things (forgiving some [I hope minor] mathematical sloppiness).

First, the model helps us see that Monty opening the door does not change the probability that you have made the correct initial choice (which is \(\mathtt{\frac{1}{3}}\)). It’s even accurate to say that the probability that the car is behind any of the three doors hasn’t changed from \(\mathtt{\frac{1}{3}}\). The arrows are pointing the wrong way for those things to be possibilities. In and of itself, the correct model should prevent us from upgrading the probability from \(\mathtt{\frac{1}{3}}\) to \(\mathtt{\frac{1}{2}}\) after the freebie goat door is opened.

Second, when Monty opens a door to reveal a goat (thus fixing the value of “door Monty opens”), now changing your choice of door changes the probability that the car is behind that door, since these two variables are now dependent.

Thus, since the probability of being correct must change when I change my door selection, it must change from \(\mathtt{\frac{1}{3}}\) to something else. And since the other door is the only other option, and all the probabilities in the situation must add to \(\mathtt{1}\), switching must change the probability to \(\mathtt{\frac{2}{3}}\).
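The \(\mathtt{\frac{1}{3}}\) and \(\mathtt{\frac{2}{3}}\) figures can be checked by listing every case. This Python sketch is not from Pearl; it assumes the car’s location and your first pick are independent and uniform, and that Monty chooses evenly between the two goat doors when he has a choice.

```python
from fractions import Fraction

doors = (0, 1, 2)
stay_wins = Fraction(0)
switch_wins = Fraction(0)

for car in doors:
    for pick in doors:
        p_case = Fraction(1, 9)  # car and first pick: independent and uniform
        # Monty may open any door that is neither your pick nor the car.
        options = [d for d in doors if d != pick and d != car]
        for opened in options:
            # Assumption: Monty chooses evenly when he has two options.
            p = p_case / len(options)
            # The door you would switch to is the one left over.
            switched = next(d for d in doors if d != pick and d != opened)
            if pick == car:
                stay_wins += p
            if switched == car:
                switch_wins += p

print(stay_wins, switch_wins)  # 1/3 2/3
```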

The results reported in this research (below) about the value of provided examples versus generated examples are a bit surprising. To get a sense of why that’s the case, start with this definition of the concept availability heuristic used in the study—a term from the social psychology literature:

Availability heuristic: the tendency to estimate the likelihood that an event will occur by how easily instances of it come to mind.

All participants first read this definition, along with the definitions of nine other social psychology concepts, in a textbook passage. Participants then completed two blocks of practice trials in one of three groups: (1) subjects in the provided examples group read two different examples, drawn from an undergraduate psychology textbook, of each of the 10 concepts (two practice blocks, so four examples total for each concept), (2) subjects in the generated examples group created their own examples for each concept (four generated examples total for each concept), and (3) subjects in the combination group were provided with an example and then created their own example of each concept (two provided and two generated examples total for each concept).

The researchers—Amanda Zamary and Katharine Rawson at Kent State University in Ohio—made the following predictions, with regard to both student performance and the efficiency of the instructional treatments:

We predicted that long-term learning would be greater following generated examples compared to provided examples. Concerning efficiency, we predicted that less time would be spent studying provided examples compared to generating examples . . . [and] long-term learning would be greater after a combination of provided and generated examples compared to either technique alone. Concerning efficiency, our prediction was that less time would be spent when students study provided examples and generate examples compared to just generating examples.

Achievement Results

All participants completed the same two self-paced tests two days later. The first assessment, an example classification test, asked subjects to classify each of 100 real-world examples into one of the 10 concept definition categories provided. Sixty of these 100 were new (Novel) to the provided-examples group, 80 of the 100 were new to the combination group, and of course all 100 were likely new to the generated-examples group. The second assessment, a definition-cued recall test, asked participants to type in the definition of each of the 10 concepts, given in random order. (The test order was varied among subjects.)

Given that participants in the provided-examples and combination groups had an advantage over participants in the generated-examples group on the classification task (they had seen between 20 and 40 of the examples previously), the researchers helpfully drew out results on just the 60 novel examples.

Subjects who were given only textbook-provided examples of the concepts outperformed other subjects on applying these concepts to classifying real-world examples. This difference was significant. No significant differences were found on the cued-recall test between the provided-examples and generated-examples groups.

Also, Students’ Time Is Valuable

Another measure of interest to the researchers in this study, as mentioned above, was the time used by the participants to read through or create the examples. What the authors say about efficiency is worth quoting, since it does not often seem to be taken as seriously as measures of raw achievement (emphasis mine):

Howe and Singer (1975) note that in practice, the challenge for educators and researchers is not to identify effective learning techniques when time is unlimited. Rather, the problem arises when trying to identify what is most effective when time is fixed. Indeed, long-term learning could easily be achieved if students had an unlimited amount of time and only a limited amount of information to learn (with the caveat that students spend their time employing useful encoding strategies). However, achieving long-term learning is difficult because students have a lot to learn within a limited amount of time (Rawson and Dunlosky 2011). Thus, long-term learning and efficiency are both important to consider when competitively evaluating the effectiveness of learning techniques.

With that in mind, and given the results above, it is noteworthy to learn that the provided-examples group outperformed the generated-examples group on real-world examples after engaging in practice that took less than half as much time. The researchers divided subjects’ novel classification score by the amount of time they spent practicing and determined that the provided-examples group had an average gain of 5.7 points per minute of study, compared to 2.2 points per minute for the generated-examples group and 1.7 points per minute for the combination group.

For learning declarative concepts in a domain and then identifying those concepts in novel real-world situations, provided examples proved to be better than student-generated examples for both long-term learning and for instructional efficiency. The second experiment in the study replicated these findings.

Some Commentary

First, some familiarity with the research literature makes the above results not so surprising. The provided-examples group likely outperformed the other groups because participants in that group practiced with examples generated by experts. Becoming more expert in a domain does not necessarily involve becoming more isolated from other people and their interests. Such expertise is likely positively correlated with better identifying and collating examples within a domain that are conceptually interesting to students and more widely generalizable. I reported on two studies, for example, which showed that greater expertise was associated with a significantly greater number of conceptual explanations, as opposed to “product oriented” (answer-getting) explanations—and these conceptual explanations resulted in the superior performance of students receiving them.

Second, I am sympathetic to the efficiency argument, as laid out here by the study’s authors. That is, I agree that we should focus in education on “trying to identify what is most effective when time is fixed.” Problematically, however, a wide variety of instructional actions can be informed by decisions about what is and isn’t “fixed.” Time is not the only thing that can be fixed in one’s experience. The intuition that students should “own their own learning,” for example, which undergirds the very idea that students should generate their own examples, may rest on the more fundamental notion that students themselves are “fixed” identities that adults must work around rather than try to alter. This notion is itself circumscribed by the research summarized above. So, it is worth having a conversation about what should and should not be considered “fixed” when it comes to learning.

K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t) with 2 components each (an x-component and y-component, i.e., 2D data).

Determining Least Distances

To start, we select a \(\mathtt{k}\), a number of clusters that we want, remembering that we know right now how many clusters there are but in most situations we would not. We choose \(\mathtt{k=2}\). Then we place the two cluster centers at random locations. Here I’ve put \(\mathtt{\color{blue}{k_1}}\) at \(\mathtt{(2,7)}\) and \(\mathtt{\color{red}{k_2}}\) at \(\mathtt{(4,2)}\).

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean Theoremish Euclidean distance of \(\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}\). The cluster that we assign to each point is given by the closest center to that point. You can run the code below to print out the 8 distances.
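Here is a minimal Python sketch of that distance step. The point coordinates are an assumption on my part: they are inferred from the mean calculation worked out further down (A at (5, 6); B, C, and D with x-values 2, 2, 6 and y-values 4, 3, 5), and the helper name `distance` is just illustrative.

```python
import math

# Assumed coordinates, inferred from the worked mean calculation in this post.
points = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
# Starting centers as chosen in the text.
centers = {"k1": (2, 7), "k2": (4, 2)}

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# One (distance to k1, distance to k2) pair per point: 8 distances in all.
distances = {
    name: tuple(distance(p, c) for c in centers.values())
    for name, p in points.items()
}

for name, (d1, d2) in distances.items():
    print(f"{name}: {d1:.3f}, {d2:.3f}")
# A: 3.162, 4.123
# B: 3.000, 2.828
# C: 4.000, 2.236
# D: 4.472, 3.606
```

Each printed pair is (distance to k1, distance to k2), so the smaller number in a pair tells you which center the point is closer to.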

Going by shorter distances, our first result, then, is to group point A in cluster \(\mathtt{\color{blue}{k_1}}\), because the first number in that pair is smaller, and points B, C, and D in cluster \(\mathtt{\color{red}{k_2}}\), because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points in the cluster to which it is currently assigned. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster \(\mathtt{\color{red}{k_2}}\), the points B, C, and D have a mean of (\(\mathtt{\frac{2 + 2 + 6}{3}}\), \(\mathtt{\frac{4 + 3 + 5}{3}}\)), or (\(\mathtt{\frac{10}{3}}\), \(\mathtt{4}\)). And since \(\mathtt{\color{blue}{k_1}}\) has just one point, A, it will move to smack dab on top of that point, at (5, 6).
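The move-to-the-mean step can be sketched like this, again assuming the point coordinates implied by the arithmetic above. I use `Fraction` only so that the result prints as 10/3 rather than a rounded decimal; `mean_point` is a made-up helper name.

```python
from fractions import Fraction

def mean_point(pts):
    """Mean of a list of 2D points, computed one component at a time."""
    n = len(pts)
    mean_x = Fraction(sum(x for x, _ in pts), n)
    mean_y = Fraction(sum(y for _, y in pts), n)
    return (mean_x, mean_y)

# Clusters after the first round of distance comparisons (assumed coordinates).
cluster_k1 = [(5, 6)]                  # A
cluster_k2 = [(2, 4), (2, 3), (6, 5)]  # B, C, D

new_k1 = mean_point(cluster_k1)
new_k2 = mean_point(cluster_k2)

print(f"k1 -> ({new_k1[0]}, {new_k1[1]})")  # k1 -> (5, 6)
print(f"k2 -> ({new_k2[0]}, {new_k2[1]})")  # k2 -> (10/3, 4)
```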

Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster \(\mathtt{\color{blue}{k_1}}\) and B and C belong to cluster \(\mathtt{\color{red}{k_2}}\).

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in the cluster). Further calculations won’t change these assignments, and that is the stopping signal: the k-means algorithm is done when another round stops changing things drastically (or at all).
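Putting the two steps together, here is one way the whole loop might look in Python. This is a sketch under the same assumed point coordinates, not a production implementation: it stops only when the assignments stop changing at all, it leaves an empty cluster's center where it is, and the function name `k_means` and the `max_rounds` cap are my own choices.

```python
import math

def k_means(points, centers, max_rounds=100):
    """Repeat assign-then-move until the assignments stop changing."""
    centers = list(centers)  # work on a copy so the caller's list is untouched
    assignments = None
    for _ in range(max_rounds):
        # Assign each point to the index of its nearest center.
        new_assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so we are done
        assignments = new_assignments
        # Move each center to the mean of the points assigned to it.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its old center
                centers[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignments, centers

points = [(5, 6), (2, 4), (2, 3), (6, 5)]  # A, B, C, D (assumed coordinates)
start = [(2, 7), (4, 2)]                   # k1, k2 from the text

assignments, final_centers = k_means(points, start)
print(assignments)    # [0, 1, 1, 0]
print(final_centers)  # [(5.5, 5.5), (2.0, 3.5)]
```

Index 0 stands for k1 and index 1 for k2, so the output says A and D end up with k1 and B and C with k2, with each center sitting directly between its two points.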

It’s something of a truism that mathematical symbolism is difficult. There are some situations, though, where the symbolism is not just difficult, but also annoying and ridiculous. It likely saved a lot of time when people were still mostly writing ideas out by hand, so back then even the annoying and ridiculous could not be righteously pointed at and mocked, but nowadays it is almost certainly more difficult to set some statements in LaTeX than it is to type them—and, if the text is intended to teach students, more difficult to unpack the former than it is to understand the latter.

Examples of symbols that are justly symbolized, even today, are \(\mathtt{\sum}\) and \(\mathtt{\prod}\), representing a sum and a product, respectively. More specifically, these symbols represent loops—an addition loop or a multiplication loop.

So, for example, take this expression on the left side of the equals sign, which represents the loop sum on the right of the equals sign: \(\mathtt{\sum_{n=1}^{5}n=1+2+3+4+5}\). The expression on the left just means (a) start a counter at 1, (b) count up to 5 by 1s, (c) let n = each number you count, then (d) add all the n’s one by one in a loop.

How about this one? \[\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\]

This one means (a) start a counter at 0, (b) count up to 4 by 1s, (c) let n = each number you count, then (d) add all the 2n’s one by one in a loop.
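Written as actual loops, the two sums might look like the Python sketch below. The variable names are just illustrative; note that `range` stops one short of its second argument.

```python
# Sum of n for n = 1 to 5.
first_sum = 0
for n in range(1, 6):   # n counts 1, 2, 3, 4, 5
    first_sum += n

# Sum of 2n for n = 0 to 4.
second_sum = 0
for n in range(0, 5):   # n counts 0, 1, 2, 3, 4
    second_sum += 2 * n

print(first_sum)   # 15
print(second_sum)  # 20
```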

For products, we just swap out the symbol. Here is the corresponding product for the first loop: \(\mathtt{\prod_{n=1}^{5}n=1\times2\times3\times4\times5}\). And here’s one for the second loop: \[\mathtt{\prod_{n=0}^{4}2n=0\times2\times4\times6\times8}\]
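The product loops can be sketched the same way. The one difference worth noticing is that the running total starts at 1 rather than 0, since starting at 0 would make every product 0. The final two lines compare against Python's built-in `math.prod` as a sanity check.

```python
import math

# Product of n for n = 1 to 5.
first_product = 1
for n in range(1, 6):
    first_product *= n

# Product of 2n for n = 0 to 4. The n = 0 term makes the whole product 0.
second_product = 1
for n in range(0, 5):
    second_product *= 2 * n

print(first_product)   # 120
print(second_product)  # 0

print(first_product == math.prod(range(1, 6)))                 # True
print(second_product == math.prod(2 * n for n in range(0, 5)))  # True
```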

Loops and Linear Algebra

You’ll often see the summation loop in linear algebra contexts, because it is an equivalent way to write a dot product, for example. The sum \(\mathtt{\sum_{n=0}^{4}2n=0+2+4+6+8}\) above can be written as shown below, which looks like more work to write—and is—but when we’re dealing mostly with variables, the savings in writing effort is more evident. \[\quad\,\,\,\begin{bmatrix}\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\\\mathtt{2}\end{bmatrix}\cdot \begin{bmatrix}\mathtt{0}\\\mathtt{1}\\\mathtt{2}\\\mathtt{3}\\\mathtt{4}\end{bmatrix}\mathtt{=2\cdot0+2\cdot1+2\cdot2\ldots}\]

The loop sum \(\mathtt{\sum_{i}a_{i}x_{i}+b}\), where \(\mathtt{i}\) is an index pointing to a component of vector \(\mathtt{a}\) and vector \(\mathtt{x}\), can be written more simply as \(\mathtt{a\cdot x+b}\), as long as the context is clear that \(\mathtt{a}\) and \(\mathtt{x}\) are vectors.
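To see that the loop and the dot product really are the same computation, here is a small Python sketch. The `dot` helper is hand-rolled for illustration, and the value of `b` is a made-up number, since the text leaves it as a variable.

```python
def dot(a, x):
    """Dot product: multiply matching components, then add the results in a loop."""
    return sum(a_i * x_i for a_i, x_i in zip(a, x))

a = [2, 2, 2, 2, 2]
x = [0, 1, 2, 3, 4]

loop_sum = sum(2 * n for n in range(0, 5))  # the summation from earlier
print(dot(a, x), loop_sum)  # 20 20

b = 3  # hypothetical value, for illustration only
affine = dot(a, x) + b      # the loop sum with b added once at the end
print(affine)  # 23
```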

We’ve done a fair amount of landscaping, as it were—running out a long way in the field of linear algebra to mark interesting points. Now it seems like a good time to turn back and start tending to each area a little more closely. An excellent way to do that—to make things more secure—is to practice.

The sets below can be completed after reading Lines the Linear Algebra Way. I would suggest working on a piece of paper and then checking your answers, one at a time at first and then going a stretch before checking. Try to complete each exercise before either checking back with the original post or looking at my answer. My answers can be uncovered by hovering under each exercise.

Convert each equation of a line to a vector equation in parametric form, \(\mathtt{l(k)=p+kv}\), where \(\mathtt{p}\) and \(\mathtt{v}\) are 2D vectors and \(\mathtt{k}\) is a scalar variable.

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\), or just \(\begin{bmatrix}\mathtt{\,\,\,\,1}\\\mathtt{-2}\end{bmatrix}\mathtt{k}\)

\(\mathtt{y=x}\)

\(\mathtt{l(k)=}\begin{bmatrix}\mathtt{0}\\\mathtt{0}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\), or just \(\begin{bmatrix}\mathtt{1}\\\mathtt{1}\end{bmatrix}\mathtt{k}\)

Devise an algorithm you could use to convert an equation for a line in slope-intercept form to a vector equation for the line in parametric form.

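For slope-intercept form, \(\mathtt{y=mx+b}\), the conversion \(\begin{bmatrix}\mathtt{0}\\\mathtt{b}\end{bmatrix}+\begin{bmatrix}\mathtt{1}\\\mathtt{m}\end{bmatrix}\mathtt{k}\,\) works for any line that can be written in slope-intercept form (that is, any non-vertical line), since setting \(\mathtt{x=k}\) gives \(\mathtt{y=b+mk}\).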

There are many ways to write the above vector equations. In particular, the intercept vector does not have to have a first component of \(\mathtt{0}\), and the slope vector does not have to be in simplest form. All that is required is for the intercept vector to get you to some point on the line and for the slope vector to correctly represent the slope of the line.
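As a sketch, the conversion above could be written in Python like this. The function names are my own, and the second line (with slope -2 and intercept 3) is a made-up example used only to check that points produced by the vector equation satisfy the original equation.

```python
def slope_intercept_to_parametric(m, b):
    """Convert y = mx + b to (p, v), where the line is l(k) = p + k*v."""
    p = (0, b)  # intercept vector: gets you to a point on the line
    v = (1, m)  # slope vector: run 1, rise m
    return p, v

def line_point(p, v, k):
    """Evaluate l(k) = p + k*v."""
    return (p[0] + k * v[0], p[1] + k * v[1])

# The line y = x from the exercise above.
p, v = slope_intercept_to_parametric(1, 0)
print(p, v)  # (0, 0) (1, 1)

# Hypothetical line y = -2x + 3: every generated point should satisfy it.
m, b = -2, 3
p2, v2 = slope_intercept_to_parametric(m, b)
checks = [line_point(p2, v2, k) for k in (-2, 0, 1, 5)]
all_on_line = all(y == m * x + b for x, y in checks)
print(all_on_line)  # True
```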

Now let’s go the other way. Identify the slope, y-intercept, and x-intercept of each line.

And just two more. Determine if each point is on the line mentioned above.

Is \(\mathtt{(-11,3)}\) on the line \(\mathtt{l_1}\)?

No: \(\begin{bmatrix}\mathtt{1}\\\mathtt{2}\end{bmatrix}+\begin{bmatrix}\mathtt{\,\,\,\,2}\\\mathtt{-1}\end{bmatrix}\mathtt{k=}\begin{bmatrix}\mathtt{-11}\\\mathtt{\,\,\,\,3}\end{bmatrix}\). No value of \(\mathtt{k}\) solves both \(\mathtt{2k+1=-11}\) and \(\mathtt{-k+2=3}\).

Is \(\mathtt{(5,5)}\) on the line \(\mathtt{l_3}\)?

No. Since the line has a slope of \(\mathtt{1}\), it would contain \(\mathtt{(5,5)}\) if the line passed through the origin, but it does not.
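The point-on-line check can also be sketched in Python: solve for \(\mathtt{k}\) in each component separately and see whether the two values agree. The line \(\mathtt{l_1}\) below is taken from the answer above; the function names are illustrative, and the sketch assumes both components of the slope vector are nonzero (otherwise the division fails).

```python
from fractions import Fraction

def solve_k_per_component(point, p, v):
    """Solve p + k*v = point for k, once using x and once using y.

    Assumes both components of v are nonzero.
    """
    k_x = Fraction(point[0] - p[0], v[0])
    k_y = Fraction(point[1] - p[1], v[1])
    return k_x, k_y

def is_on_line(point, p, v):
    """The point is on the line only if one k works for both components."""
    k_x, k_y = solve_k_per_component(point, p, v)
    return k_x == k_y

# l_1(k) = (1, 2) + k(2, -1), as written in the answer above.
p1, v1 = (1, 2), (2, -1)

k_x, k_y = solve_k_per_component((-11, 3), p1, v1)
print(k_x, k_y)                        # -6 -1
print(is_on_line((-11, 3), p1, v1))    # False

# For comparison, the point at k = -6 is (-11, 8), which is on the line.
print(is_on_line((-11, 8), p1, v1))    # True
```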