## Modulus and Hidden Symmetries


A really nice research paper, titled The Hidden Symmetries of the Multiplication Table, was posted over in the Math Ed Community yesterday. The key ideas in the article center around (a) the standard multiplication table (with a row of numbers at the top, a column of numbers down the left, and the products of those numbers in the body of the table) and (b) modulus. In particular, what patterns emerge in the standard multiplication table when products are colored by equivalence to $$\mathtt{n \bmod k}$$ as $$\mathtt{k}$$ is varied?

The little interactive tool below shows a large multiplication table (you can figure out the dimensions), which starts by coloring those products which are equivalent to $$\mathtt{0 \bmod 12}$$, meaning those products which, when divided by 12 give a remainder of zero (in other words, multiples of 12).


When you vary $$\mathtt{k}$$, you can see some other pretty cool patterns (broken up occasionally by the boring patterns produced by primes). Observing the patterns produced by varying the remainder, $$\mathtt{n}$$, is left as an exercise for the reader (and me).
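If you'd rather poke at the idea in code than in the widget, here is a minimal text-only sketch of the coloring rule. The function names `mod_table` and `render` are my own inventions for illustration (they are not from the paper or from the tool above), and `#` simply stands in for a colored cell.

```python
def mod_table(size, k, n=0):
    """Return a size x size grid of booleans: True where (row * col) % k == n."""
    return [
        [(row * col) % k == n for col in range(1, size + 1)]
        for row in range(1, size + 1)
    ]


def render(grid):
    """Draw the grid as text, with '#' for a colored product and '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


# A 12 x 12 table, marking products that are equivalent to 0 mod 12.
table = mod_table(12, 12)
print(render(table))
```

Changing `k` to a prime such as 7 marks only the rows and columns that are multiples of 7, which is the "boring" grid pattern mentioned above.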

Incidentally, I’ve wired up the “u” and “d” keys, for “up” and “down.” Just click in one of the boxes and press the “u” or “d” key to vary $$\mathtt{k}$$ or $$\mathtt{n}$$ without having to retype and press Return every time. And definitely go look at the paper linked above. They’ve got some other beautiful images and interesting questions.

Barka, Z. (2017). The Hidden Symmetries of the Multiplication Table. Journal of Humanistic Mathematics, 7(1), 189-203. DOI: 10.5642/jhummath.201701.15

## Retrieval Practice with Kindle: Feel the Learn

I use Amazon’s free Kindle Reader for all of my (online and offline) book reading, except for any book that I really want that just can’t be had digitally. Besides notes and highlights, the Reader has a nifty little Flashcards feature that works really well for retrieval practice. Here’s how I do retrieval practice with Kindle.

Step 1: Construct the Empty Flashcard Decks

Currently I’m working through Sarah Guido and Andreas Müller’s book Introduction to Machine Learning with Python. I skimmed the chapters before starting and decided that the authors’ breakdown by chapter was pretty good—not too long and not too short. So, I made a flashcard deck for each chapter in the book, as shown at the right. On your Kindle Reader, click on the stacked cards icon. Then click on the large + sign next to “Flashcards” to create and name each new deck.

Depending on your situation, you may not have a choice in how you break things down. But I think it’s good advice to set up the decks—however far in advance you want—before you start reading.

So, if I were assigned to read the first half of Chapter 2 for a class, I would create a flashcard deck for the first half of Chapter 2 before I started reading. And, although I didn’t set titles in this example, it’s probably a good idea to give the flashcard deck a title related to what it’s about (e.g., Supervised Learning).

Step 2: Read, Highlight, and Take Notes

You still need to read and comprehend the content. Retrieval practice adds to reading; it doesn’t replace it. So, I read and highlight and write notes like I normally would. I don’t worry at this point about the flashcards, about what is important or not. I just read for the pleasure of finding things out. I highlight things that strike me as especially interesting and write notes with questions, or comments I want to make on the text.

Read a section of the content represented by one flashcard deck. Since I divided my decks by chapter, I read the first chapter straight through, highlighting and making notes as I went.

The reading doesn’t have to be done in one sitting. The important thing is to just focus on reading one section before moving on to the next step.

Step 3: Create the Fronts for the Flashcards

Now, go through the content of your first section of reading and identify important concepts, items worth remembering, things you want to be able to produce. You’ll want to add these as prompts on your flashcards. You don’t necessarily have to write these all down in a list. You can enter a prompt on a flashcard, return to the text for another prompt, enter a prompt on another flashcard, and on and on.

Screenshot 1

Screenshot 2

Screenshot 3

When you have at least one prompt, click on the flashcard deck and then click on Add a Card (Screenshot 1) and enter the prompt.

Enter the prompt at the top (Screenshot 2). This will be the front of the flashcard you will see when testing yourself. Leave the back blank for the moment. Click Save and Add Another Card at the bottom right to repeat this with more prompts.

When you are finished entering one card or all the cards, click on Save at the top right. This will automatically take you to the testing mode (Screenshot 3), which you’ll want to ignore for a while. Click on the stacked cards icon to return to the text for more prompts. When you come back to the flashcards, your decks may have shifted, since the most recently edited deck will be at the top.

Importantly, though, Screenshot 3 is the screen you will see when you return and click on a deck. To add more cards from this screen, click on the + sign at the bottom right. When you are done entering the cards for a section, get ready for the retrieval practice challenge! This is where it gets good (for learning).

Step 4: Create the Backs for the Flashcards

Rather than simply enter the backs of the flashcards from the information in the book, I first fill out the backs by simply trying to retrieve what I can remember. For example, for the prompt, “Write the code for the Iris model, using K Nearest Neighbors,” I wrote something like this on the back of the card:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target)

There are a lot of omissions here and some errors, and I moved things around after I wrote them down, but I tried as hard as I could to remember the code. To make the back of the card right, I filled in the omissions and corrected the errors. As I went through this process with all the cards in a section, I edited the fronts and backs of the cards and even added new cards as the importance of some material presented itself more clearly.

Create the backs of the flashcards for a section by first trying as hard as you can to retrieve the information asked for in the prompt. Then, correct the information and fill in omissions. Repeat this for each card in the deck.

Step 5: Test Yourself and Feel the Learn

One thing you should notice when you do this is that it hurts. And it should. In my view, the prompts should not be easy to answer. Another prompt I have for a different chapter is “Explain how k-neighbors regression works for both 1 neighbor and multiple neighbors.” My expectations for my response are high—I want to answer completely with several details from the text, not just a mooshy general answer. I keep the number of cards per chapter fairly low (about 5 to 10 cards per 100 pages). But your goals for retaining information may be different.

But once you have a set of cards for a section, come back to them occasionally and complete a round of testing for the section. To test yourself, click on the deck and respond to the first prompt you see without looking at the answer. Try to be as complete (correct) as possible before looking at the correct response.

To view the correct response, click on the card. Then, click on the checkmark if you completely nailed the response. Anything short of that, I click on the red X.

For large decks, you may want to restudy those items you got incorrect. In that case, you can click on Study Incorrect to go back over just those cards you got wrong. There is also an option to shuffle the deck (at the bottom left), which you should make use of if the cards build on each other, making them too predictable.

## Gradient Descent

In the last post, we looked at the perceptron, a machine-learning algorithm that is able to “learn” the distinction between two linearly separable classes (categories of objects whose data points can be separated by a line or, with multi-dimensional categories, by a hyperplane). In this post, we’ll look at gradient descent, which is used to gradually reduce prediction error.

The data shown below resemble the apples and oranges data we used last time. There are two classes, or categories—in this case, the two categories are the setosa and versicolor species of the Iris flower. And each species has two dimensions (or features): sepal length and petal length. In our hypothetical apple-orange data, the two dimensions were weight and number of seeds.

Using just two dimensions allows us to plot each instance in the training data on a coordinate plane as above and draw some of the prediction lines as the program cycles through the data. For the above data, a solution is found after about 5–6 “epochs” (full passes through all $$\mathtt{n}$$ instances in the data). This solution is represented by the blue dashed line.

This process is a bit clunky. The coefficients, or weights $$\mathtt{w}$$, are updated using the learning rate $$\mathtt{\eta}$$ by $$\mathtt{\Delta w_{k} = \pm 2\eta x_k}$$ (for $$\mathtt{k = 1, 2}$$) and $$\mathtt{\Delta w_0 = \pm 2\eta}$$. This process, though it always converges to a solution so long as there is one, tends to jolt the prediction line back and forth a bit abruptly.

With gradient descent, we can gradually reduce the error in the prediction. Sometimes. We do this by making use of the sum of squared errors function—a quadratic function (parabola) that has a minimum:
$\mathtt{\frac{1}{2}\sum(y - wx)^2}$

This formula shows the target vector $$\mathtt{y}$$ (the collection of target values of 1 and -1) minus the input vector, which holds the linear combination of weights and dimensions for each object in the data, $$\mathtt{w^{T}x}$$ (abbreviated $$\mathtt{wx}$$ in the formula). The components of the difference vector are squared and summed, and the result is divided by 2, which gives us a scalar value that places us somewhere on the parabola. We don’t use this “cost” value except to keep track of the cost to see if it reduces over cycles.

Okay, so we don’t know what side of the parabola we’re on. In that case, we look at the opposite of the gradient of the curve with respect to the weights, or the opposite of the partial derivative of the curve with respect to the weights:
$\mathtt{-\frac{\partial}{\partial w}(y - wx)^2 = -\frac{\partial}{\partial u}(u^2)\frac{\partial}{\partial w}(y - wx) = -2u \cdot (-x) = -2(y - wx)(-x), \quad u = y - wx}$

Multiply this result by $$\mathtt{\frac{1}{2}}$$, plug that summation back in, and we get a negative gradient of $$\mathtt{\sum (y - wx)(x)}$$. Finally, multiply by the learning rate $$\mathtt{\eta}$$ to get the change to each weight: $\mathtt{\eta\sum (y - wx)(x)}$

An Example

Let’s take a look at a small example of gradient descent in action. I’ll use data for just 10 flowers in the Iris data set. All of these belong to the setosa species of the Iris flower. I’ll use a learning rate of $$\mathtt{\eta = 0.0001}$$.

| Sepal Length (cm) | Petal Length (cm) |
|---|---|
| 5.1 | 1.4 |
| 4.9 | 1.4 |
| 4.7 | 1.3 |
| 4.6 | 1.5 |
| 5 | 1.4 |
| 5.4 | 1.7 |
| 4.6 | 1.4 |
| 5 | 1.5 |
| 4.4 | 1.4 |
| 4.9 | 1.5 |

Because every flower here is setosa, all of these instances have target values of $$\mathtt{-1}$$, which don’t change as the data cycle through. Our $$\mathtt{y}$$ vector, then, is [$$\mathtt{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1}$$], and our starting weights are [0, 0, 0]. The first weight here is the “intercept” weight, or bias weight, which gets updated differently from the others.

Our input vector, $$\mathtt{w^{T}x}$$, is the combination sepal length × weight 1 + petal length × weight 2 for each object in the data. At the start, then, our input vector is a zero vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. The difference vector, $$\mathtt{y - w^{T}x}$$, is, in this case, equal to the y vector: just a collection of ten negative 1s.

The bias weight, $$\mathtt{w_0}$$, is updated by learning rate × the sum of the components of the difference vector, or $$\mathtt{\eta\sum(y - w^{T}x)}$$. This gives us $$\mathtt{w_0 = w_0 + 0.0001 \times -10 = -0.001}$$.

The $$\mathtt{w_1}$$ weight is updated as 0.0001(5.1 × -1 + 4.9 × -1 + 4.7 × -1 + 4.6 × -1 + 5 × -1 + 5.4 × -1 + 4.6 × -1 + 5 × -1 + 4.4 × -1 + 4.9 × -1) and the $$\mathtt{w_2}$$ weight is updated as 0.0001(1.4 × -1 + 1.4 × -1 + 1.3 × -1 + 1.5 × -1 + 1.4 × -1 + 1.7 × -1 + 1.4 × -1 + 1.5 × -1 + 1.4 × -1 + 1.5 × -1).

Through all that gobbledygook, our new weights are $$\mathtt{-0.00486}$$ and $$\mathtt{-0.00145}$$ with a bias weight of $$\mathtt{-0.001}$$. You can see below how different the gradient descent process is from the perceptron model. The gradient descent model in particular doesn’t have to stop when it finds a line that separates the categories, since the minimum may not be reached even when the program can classify the categories precisely.
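To double-check that arithmetic, here is a small plain-Python sketch of a single batch update on the ten rows above. The helper names (`net_input`, `cost`, `step`) are mine, and I am assuming the bias weight is included in the net input, which makes no difference on this first step because every weight starts at zero.

```python
# The ten setosa rows from the table above; every target is -1.
sepal = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
petal = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]
targets = [-1] * 10
eta = 0.0001  # learning rate


def net_input(w, s, p):
    """Bias weight plus the weighted sepal and petal lengths."""
    return w[0] + w[1] * s + w[2] * p


def cost(w):
    """Half the sum of squared errors for the weights w."""
    return 0.5 * sum(
        (t - net_input(w, s, p)) ** 2 for t, s, p in zip(targets, sepal, petal)
    )


def step(w):
    """One gradient descent update: each weight moves by eta * sum(error * x)."""
    errors = [t - net_input(w, s, p) for t, s, p in zip(targets, sepal, petal)]
    return [
        w[0] + eta * sum(errors),
        w[1] + eta * sum(e * s for e, s in zip(errors, sepal)),
        w[2] + eta * sum(e * p for e, p in zip(errors, petal)),
    ]


start = [0.0, 0.0, 0.0]
updated = step(start)
print([round(v, 5) for v in updated])  # bias, w1, w2
print(cost(start), cost(updated))      # the cost should go down
```

Calling `step` repeatedly on its own output gives the later epochs, and printing `cost` each time is one way to watch whether the error keeps shrinking.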

## The Perceptron

I’d like to write about something called the perceptron in this post, partly because I’m just learning about it, and writing helps me wrap my head around new-to-me things, and partly because, well, it’s interesting.

It works like this. Take two categories which are, in some quantifiable way, completely different from each other. In this example, let’s talk about apples and oranges as the categories. And we’ll give each of these categories just two dimensions—weight in grams and number of seeds. (I’ll fudge these numbers a bit to help with the example.)

Studying this information alone, you’re already prepared to know with 100% confidence whether an object is an apple or an orange given its weight and number of seeds. In fact, the number of seeds is unnecessary. If you have just the two objects to choose from, and you are given only an object’s weight, you have all the information you need to assign it to the apple category or the orange category.

But, you have to play along here. We want to train the computer to make the inference we just made above, given only a set of data with the weights and number of seeds of apples and oranges. Crucially, what the perceptron does is find a line between the two categories. You can easily see how to draw a line between the categories at the left (where I have plotted 100 random apples and oranges in their given ranges of weights and seeds), but the perceptron program allows a computer to learn where to draw a line between the categories, given only a set of data about the categories.

Training the Perceptron

The way we teach the computer how to draw this line is that we ask it to essentially draw a prediction line. Then we make it look at each instance (apple or orange, one at a time) and see whether the line correctly predicts the category of the object. If it does, we make no change to the line; if it doesn’t, we adjust the line.

The prediction line takes the standard form $$\mathtt{Ax + By + C = 0}$$. If $$\mathtt{Ax + By + C > 0}$$, we predict orange. If $$\mathtt{Ax + By + C \leq 0}$$, we predict apple. We can put the “or equal to” on either one of these inequalities, and we need the “or equal to” on one of them for this to be a truly binary decision.

Okay, so let’s draw a prediction line across the data above at $$\mathtt{y = 119}$$, say. In that case, $$\mathtt{A = 0, B = 1,}$$ and $$\mathtt{C = -119}$$. When we come across apple data points, they will all be categorized correctly (as below the line), but some of the orange data points will be categorized correctly and some incorrectly. On correct categorizations, we don’t want the line to update at all, and on incorrect categorizations, we want to change the line. Here’s how both of these things are accomplished:

Let’s take the point at (10, 109). This should be classified as orange, which we’ll quantify as $$\mathtt{1}$$. The prediction line, however, gives us $$\mathtt{0(10) + 1(109) + -119}$$, which is $$\mathtt{-10 \leq 0}$$. So, the perceptron mistakenly predicts this point to be an apple, which we’ll quantify as $$\mathtt{-1}$$. We want to adjust the line.

Subtract the prediction ($$\mathtt{-1}$$) from the actual (1) to get $$\mathtt{2}$$. Multiply this by a small learning rate of 0.1, so $$\mathtt{0.1 \times 2 = 0.2}$$. Finally, change $$\mathtt{A}$$, $$\mathtt{B}$$, and $$\mathtt{C}$$ like this:

$$\mathtt{A = A + 0.2 \times 10}$$

$$\mathtt{B = B + 0.2 \times 109}$$

$$\mathtt{C = C + 0.2}$$

So, in our example, $$\mathtt{A}$$ becomes $$\mathtt{0 + 0.2 \times 10}$$, or 2, $$\mathtt{B}$$ becomes $$\mathtt{1 + 0.2 \times 109}$$, or 22.8, and $$\mathtt{C}$$ becomes $$\mathtt{-119 + 0.2}$$, or $$\mathtt{-118.8}$$.

We have a new prediction line, which is $$\mathtt{2x + 22.8y + -118.8 = 0}$$. This line now makes the correct prediction for the orange at (10, 109), as you can see at the right with the blue line.

But this point is only used to adjust A, B, and C (called the “weights”), and then it’s on to the next point to see if the new prediction line succeeds in making a correct prediction or not. The weights change with each incorrect prediction, and these changing weights move the prediction line up (left) and down (right) and alter its slope as well.

Many Iterations

Suppose we next encounter a point at (10, 90), which should be an apple. Our perceptron, however, will make the prediction $$\mathtt{2(10) + 22.8(90) - 118.8 = 1953.2 > 0}$$, which is a prediction of orange, or 1. Subtract the prediction (1) from the actual ($$\mathtt{-1}$$) to get $$\mathtt{-2}$$, and multiply by the learning rate to get $$\mathtt{-0.2}$$. Our weights are adjusted as follows:

$$\mathtt{A = 2 + -0.2 \times 10 = 0}$$

$$\mathtt{B = 22.8 + -0.2 \times 90 = 4.8}$$

$$\mathtt{C = -118.8 + -0.2 = -119}$$

If you graph this new line, $$\mathtt{0x + 4.8y + -119 = 0}$$, you’ll notice that it’s between the other two, but it would still make the incorrect categorization for the apple at (10, 90). This is why the perceptron must cycle through the data several times in order to “train” itself into determining the correct line. If you can make sense of it, there is a proof that this algorithm always converges in finite time for data that can be separated by a line (with a sufficiently small learning rate).
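The two updates walked through above can be replayed in a few lines of Python. This is only a sketch of the rule as described in this post; the names `predict` and `update` are mine, and the weights are kept as an (A, B, C) tuple to match the notation used here.

```python
ETA = 0.1  # the learning rate from the worked example


def predict(weights, x, y):
    """Return 1 (orange) if Ax + By + C > 0, otherwise -1 (apple)."""
    a, b, c = weights
    return 1 if a * x + b * y + c > 0 else -1


def update(weights, x, y, actual):
    """Nudge A, B, and C by ETA * (actual - prediction); no change when correct."""
    a, b, c = weights
    delta = ETA * (actual - predict(weights, x, y))
    return (a + delta * x, b + delta * y, c + delta)


line0 = (0.0, 1.0, -119.0)            # the starting line y = 119
line1 = update(line0, 10, 109, 1)     # the misclassified orange at (10, 109)
line2 = update(line1, 10, 90, -1)     # the misclassified apple at (10, 90)
print(line1)
print(line2)
print(predict(line2, 10, 90))         # still 1 (orange), so more passes are needed
```

Apart from floating-point fuzz, `line1` and `line2` come out as the (2, 22.8, -118.8) and (0, 4.8, -119) lines worked out by hand.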

Other Notes

It’s worth mentioning that the perceptron finds a line between the categories, but there are an infinite number of lines available. Also, in this example, we have two dimensions, or features, but the perceptron works for as many dimensions as you please.

So, to get schmancy, for each object in the data, our prediction function takes a linear combination $$\mathtt{z}$$ of weights (the coefficients plus that C intercept weight, written $$\mathtt{w_0}$$ and paired with $$\mathtt{x_0 = 1}$$) and coordinates (features, or dimensions), $$\mathtt{z = w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = w^{T}x}$$, and outputs
$\theta(z) = \left\{\begin{array}{ll} \phantom{-}1, & \quad z > 0 \\ -1, & \quad z \leq 0 \end{array} \right.$

Even with multi-dimensional vectors, the weights are updated just as above, perhaps with a different learning rate, chosen at the beginning.
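Here is a minimal plain-Python sketch of that general rule, written by me for illustration (it is not Raschka's implementation). The three-feature data set is made up so that the classes are linearly separable, and the names `theta`, `net_input`, and `train`, along with the `max_epochs` cap, are my own assumptions.

```python
def theta(z):
    """The step function: 1 when z > 0, otherwise -1."""
    return 1 if z > 0 else -1


def net_input(w, x):
    """w[0] is the bias weight (x_0 is implicitly 1); the rest pair up with x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train(data, targets, eta=0.1, max_epochs=1000):
    """Cycle through the data until an entire pass produces no mistakes."""
    w = [0.0] * (len(data[0]) + 1)
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, target in zip(data, targets):
            delta = eta * (target - theta(net_input(w, x)))
            if delta != 0:
                mistakes += 1
                w[0] += delta
                for j, xj in enumerate(x, start=1):
                    w[j] += delta * xj
        if mistakes == 0:
            return w, epoch
    return w, max_epochs


# Made-up, linearly separable points with three features each.
points = [(1, 1, 0), (2, 1, 1), (1, 2, 0), (4, 4, 1), (5, 4, 0), (4, 5, 1)]
labels = [-1, -1, -1, 1, 1, 1]
weights, epochs = train(points, labels)
print(weights, epochs)
```

The only thing that changes with more dimensions is the length of the weight list; the update itself is the same one used for A, B, and C.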

Below is a Python implementation of the Perceptron from Sebastian Raschka, using a classic data set about irises (the flowers).

## Teach Me My Colors

In the box below, you can try your hand at teaching a program, a toy problem, to reliably identify the four colors red, blue, yellow, and green by name.

You don’t have a lot of flexibility, though. Ask the program to show you one of the four colors, and then provide it feedback as to its response—in that order. Then repeat. That’s all you’ve got. That and your time and endurance.

Of course, I’d love to leave the question about the meaning of “reliably identify the four colors” to the comments, but let’s say that the program knows the colors when it scores 3 perfect scores in a row—that is, if you cycle through the 4 colors three times in a row, and the program gets a 4 out of 4 all three times.

Just keep in mind that closing or refreshing the window wipes out any “learning.” Kind of like summer vacation. Or winter break. Or the weekend.

Death, Taxes, and the Mind

The teaching device above is a toy problem because it is designed to highlight what I believe to be the most salient feature of instruction—the fact that we don’t know a lot about our impact. Can you not imagine someone becoming frustrated with the “teaching” above, perhaps feverishly wondering what’s going on in the “mind” of the program? Ultimately, the one problem we all face in education is this unknown about students’ minds and about their learning—like the unknown of how the damn program above works, if it even does.

One can think of the collective activity of education as essentially the group of varied responses to this situation of fundamental ambiguity and ignorance. And similarly, there are a variety of ways to respond to the painful want of knowing solicited by this toy problem:

Seeing What You Want to See
Pareidolia is the name given to an occurrence where people perceive a pattern that isn’t there—like the famous “face” on Mars (just shadows, angles, and topography). This can happen when incessantly clicking on the teaching device above too. In fact, these kinds of pattern-generating hypotheses jumped up sporadically in my mind as I played with the program, and I wrote the program. For example, I noticed on more than one occasion that if I took a break from incessant clicking and came back, the program did better on that subsequent trial. And between sessions, I was at one point prepared to say with some confidence that the program simply learned a specific color faster than the others. There are a huge number of other, related superstitions that can arise. If you think they can only happen to technophobes and the elderly, you live in a bubble.

Constantly Shifting Strategies
It might be optimal to constantly change up what you’re doing with the teaching device, but trying to optimize the program’s performance over time is probably not why you do it. Frustration with a seeming lack of progress and following little mini-hypotheses about short-term improvements are more likely candidates. A colleague of mine used to characterize the general orientation to work in education as the “Wile E. Coyote approach”—constantly changing strategies rather than sticking with one and improving on it. The darkness is to blame.

Letting the Activity Judge You
This may be a bit out in left field, but it’s something I felt while doing the toy problem “teaching,” and it is certainly caused by the great unknown here—guilt. Did I remember to give feedback that last time? My gosh, when was the last time I gave it? Am I the only one who can’t figure this out, who is having such a hard time with this? (Okay, I didn’t experience that last one, but I can imagine someone experiencing it.) It seems we will happily choose even the distorted feel-bad projections of a hyperactive conscience over the irritating blankness of not knowing. Yet, while we might find some consolation in the truth that we’re too hard on ourselves, we also have the unhappy task of remembering that a thousand group hugs and high-fives are even less effective than a clinically diagnosable level of self-loathing at turning unknowns into knowns.

Conjecturing and Then Testing
This, of course, is the response to the unknown that we want. For the toy problem in particular, what strategies are possible? Can I exhaust them all? What knowledge can I acquaint myself with that will shine light on this task? How will I know if my strategy is working?

Here’s a plot I made of one of my runs through, using just one strategy. Each point represents a test of all 4 colors, and the score represents how many colors the program identified correctly.

Was the program improving? Yes. The mean for the first 60 trials was approximately 1.83 out of 4 correct, and the mean for the back 63 was approximately 2.14 out of 4. That’s a jump from about 46% to about 54%.

Is that the best that can be done? No. But that’s just another way the darkness gets ya—it makes it really hard to let go of hard-won footholds.

Knowing Stuff

Some knowledge about how the human mind works is analogous to knowing something about how programs work in the case of this toy problem. Such knowledge makes it harder to be bamboozled by easy-to-vary explanations. And in general such knowledge works like all knowledge does: it keeps you away, defeasibly, from dead-ends and wrong turns so that your cognitive energy is spent more productively.

Knowing something about code, for example, might instantly give you the idea to start looking for it in the source for this page. It’s just a right click away, practically. But even if you don’t want to “cheat,” you can notice that the program serves up answers even prior to any feedback, which, if you know something about code, would make you suspect that they might be generated randomly. Do they stay random, or do they converge based on feedback? And what hints does this provide about the possible functioning of the program? These better questions are generated by knowledge about typical behavior, not by having a vast amount of experience with all kinds of toy problem teaching devices.

How It Works

So, here’s how it works. The program contains 4 “registers,” or arrays, one for each of the 4 colors—blue, red, green, yellow. At the beginning of the training, each of those registers contains the exact same 4 items: the 4 different color names. So, each register looks like this at the beginning: [‘blue’, ‘red’, ‘green’, ‘yellow’].

Throughout the training, when you ask the program to show you a color, it chooses a random one from the register. This behavior never changes. It always selects a random color from the array. However, when you provide feedback, you change the array for that color. For example, if you ask the program to show you blue, and it shows you blue, and you select the “Yes” feedback from the dropdown, a “blue” choice is added to the register. So, if this happened on the very first trial, the “blue” register would change from [‘blue’, ‘red’, ‘green’, ‘yellow’] to [‘blue’, ‘red’, ‘green’, ‘yellow’, ‘blue’]. If, on the other hand, you ask for blue on the very first trial, and the program shows you green, and you select the “No” feedback from the dropdown, the 3 colors that are NOT green are added to the “blue” register. In that case, the “blue” register would change from [‘blue’, ‘red’, ‘green’, ‘yellow’] to [‘blue’, ‘red’, ‘green’, ‘yellow’, ‘blue’, ‘red’, ‘yellow’].

A little math work can reveal that positive feedback on the first trial moves the probability of randomly selecting the correct answer from 0.25 to 0.4. For negative feedback, there is still a strengthening of the probability, but it is much smaller: from 0.25 to about 0.29. These increases decrease over time, of course, as the registers fill up with color names. For positive feedback on the second trial, the probability would strengthen from 0.4 to 0.5. For negative feedback, approximately 0.29 to 0.3.
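Those probabilities are easy to check with a tiny model of one register. This is a sketch of the mechanism as described above, not the page's actual source; the names `give_feedback` and `probability` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

COLORS = ["blue", "red", "green", "yellow"]


def give_feedback(register, shown, correct):
    """'Yes' adds the shown color; 'No' adds the three colors that were not shown."""
    if correct:
        register.append(shown)
    else:
        register.extend(color for color in COLORS if color != shown)


def probability(register, color):
    """Chance that a random pick from the register is the given color."""
    return Fraction(register.count(color), len(register))


# Two rounds of positive feedback on the "blue" register.
yes_register = list(COLORS)
give_feedback(yes_register, "blue", True)
p_yes_1 = probability(yes_register, "blue")
give_feedback(yes_register, "blue", True)
p_yes_2 = probability(yes_register, "blue")

# Two rounds of negative feedback (the program showed green, then red).
no_register = list(COLORS)
give_feedback(no_register, "green", False)
p_no_1 = probability(no_register, "blue")
give_feedback(no_register, "red", False)
p_no_2 = probability(no_register, "blue")

print(p_yes_1, p_yes_2, p_no_1, p_no_2)
```

The fractions 2/5, 1/2, 2/7, and 3/10 are the 0.4, 0.5, roughly 0.29, and 0.3 quoted above.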

Thus, in some sense, you can do no harm here so long as your feedback matches the truth—i.e., you say no when the answer is incorrect and yes when it is correct. The probability of a correct answer from the program always gets stronger over time with appropriate feedback. Can you imagine an analogous conclusion being offered from education research? “Always provide feedback” seems to be the inescapable conclusion here.

But a limit analysis provides a different perspective. Given an infinite sequence of correct-answer-only trials $$\mathtt{C(t)}$$ and an infinite sequence of incorrect-answer-only trials $$\mathtt{I(t)}$$, we get these results:

$\mathtt{\lim_{t\to\infty} C(t) = \lim_{t\to\infty}\frac{t + 1}{t + 4} = 1, \qquad \lim_{t\to\infty} I(t) = \lim_{t\to\infty}\frac{t + 1}{3t + 4} = \frac{1}{3}}$

These results indicate that, over time, providing appropriate feedback only when the program makes a correct color identification strengthens the probability of correct answers from 0.25 to 1 (a perfect score), whereas the best that can be hoped for when providing feedback only when the program gives an incorrect answer is just a 1-in-3 shot at getting the correct answer. When both negative and positive feedback are given, I believe a similar analysis shows a limit of 0.5, assuming an equal number of both types of feedback.
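A quick numerical check of those limits, including the mixed case I only "believe" above: the sketch below assumes that after $$\mathtt{t}$$ rounds of "Yes" and $$\mathtt{t}$$ rounds of "No" the register holds $$\mathtt{2t + 1}$$ correct names out of $$\mathtt{4t + 4}$$. The function names are mine.

```python
from fractions import Fraction


def correct_only(t):
    """P(correct) after t rounds of 'Yes' feedback only."""
    return Fraction(t + 1, t + 4)


def incorrect_only(t):
    """P(correct) after t rounds of 'No' feedback only."""
    return Fraction(t + 1, 3 * t + 4)


def mixed(t):
    """P(correct) after t rounds of 'Yes' and t rounds of 'No' (assumed equal counts)."""
    return Fraction(2 * t + 1, 4 * t + 4)


for t in (1, 10, 100, 10_000, 1_000_000):
    print(t, float(correct_only(t)), float(incorrect_only(t)), float(mixed(t)))
```

As `t` grows, the three columns settle toward 1, 1/3, and 1/2 respectively.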

Of course, the real-world trials bear out this conclusion. The data graphed above are from my 123 trials giving both correct and incorrect feedback. Below are data from just 67 trials giving feedback only on correct answers. The program hits the benchmark of 3 perfect scores in a row at Trial 53, and, just for kicks, does it again 3 more times shortly thereafter.

Parallels

Of course, the toy problem here is not a student, and what is modeled as the program’s “cognitive architecture” is nowhere near as complex as a student’s, even with regard to the same basic task of identifying 4 colors. There are obviously a lot of differences.

Yet there are a few parallels as well. For example, behaviorally, we see progress followed by regress with both the program and, in general, with students. Perhaps our minds work in a probabilistic way similar to that of the program. Could it be helpful to think about improvements to learning as strengthening response probabilities? Relatedly, “practice” observably strengthens what we would call “knowledge” in the program just as it does, again in general, for students.

And, I think fascinatingly, we can create and reverse “misconceptions” in both students and in this toy problem. We can see how this operates on just one color in the program by first training it to falsely identify blue as ‘green’ (to a level we benchmarked earlier as mastery—3 perfect responses in a row). Then, we can switch and begin teaching it the correct correspondence. As we can now predict, reversing the misconception will take longer than instantiating it, even with the optimal strategy, because the program’s register will have a large amount of information in it—we will be fighting against that large denominator.
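To put a rough number on that "large denominator," here is a bookkeeping sketch under two assumptions of mine: a "Yes" adds whichever color was shown (so the program can be taught that blue is ‘green’), and only "Yes" feedback is counted, not the trials in between. The function name `reversal_count` is hypothetical.

```python
from fractions import Fraction


def reversal_count(m):
    """Count the 'Yes' confirmations of blue needed to undo m confirmations of green.

    After m confirmations of green, the register for "blue" holds 1 + m greens
    out of 4 + m names. We then add blues until P(blue) is at least as large
    as P(green) was at its peak.
    """
    peak = Fraction(1 + m, 4 + m)
    k = 0
    while Fraction(1 + k, 4 + m + k) < peak:
        k += 1
    return k


for m in (3, 10, 20):
    print(m, reversal_count(m))
```

Under these assumptions the count grows roughly with the square of `m`, so even a modestly entrenched misconception costs several times as many confirmations to undo.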