Cosine Similarity and Correlation

I wrote a lesson not too long ago that started with a Would You Rather? survey activity. For our purposes here, we can pretend that each question had a Likert scale from 1–10 attached to it, though in reality, the lesson was about categorical data.

At any rate, here are the questions—edited a bit. Feel free to rate your answers on the scales provided. Careful! Once you click, you lock in your answer.

Would you rather . . .

  (a) be able to fly (1) or be able to read minds (10)?

  (b) go way back in time (1) or go way into the future (10)?

  (c) be able to talk to animals (1) or speak all languages (10)?

  (d) watch only historical movies (1) or sci-fi movies (10) for the rest of your life?

  (e) be just a veterinarian (1) or just a musician (10)?

Finally, one last question that is not a would-you-rather. Once you’ve answered this and the rest of the questions, you can press the I'm Finished! button to submit your responses.

  (f) Rate your fear of heights from (1) not at all afraid to (10) very afraid.

Check out the results so far.

Are Your Responses Correlated?

Next in the lesson, I move on to asking whether you think some of the survey responses are correlated. For example, if you scored “low” on the veterinarian-or-musician scale—meaning you would strongly prefer to be a veterinarian over a musician—would that indicate that you probably also scored “low” on Question (c) about talking to animals or speaking all the languages? In other words, are those two scores correlated? What about choosing the ability to fly and your fear of heights? Are those correlated? How could we measure this using a lot of responses from a lot of different people?

An ingenious way of looking at this question is by using cosine similarity from linear algebra. (We have looked at the cosine of the angle between two vectors before.)

For example, suppose you really would rather have the ability to fly and you have almost no fear of heights. So, you answered Question (a) with a 1 and Question (f) with, say, a 2. Another person has no desire to fly and a terrible fear of heights, so they answer Question (a) with an 8 and Question (f) with a 10. From this description, we would probably guess that the two quantities wish-for-flight and fear-of-heights are strongly correlated. But we’ve also now got the vectors (1, 2) and (8, 10) to show us this correlation.

See that tiny angle between the vectors on the left? The cosine of tiny angles (as we saw) is close to 1, which indicates a strong correlation. On the right, you see the opposite idea. One person really wants to fly but is totally afraid of heights (1, 10) and another almost couldn’t care less about flying (or at least would really rather read minds) but has a low fear of heights (8, 2). The angle between these vectors is much wider (about 70°), so its cosine is much closer to 0 (about 0.34), indicating a weak correlation between responses to our flight and heights questions.
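To make the two angles concrete, here is a minimal Python sketch that computes the cosine for both pairs of response vectors. The function name `cosine_similarity` is my own choice for illustration, not something from the lesson.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Left graph: two people whose (flight, heights) responses agree.
left = cosine_similarity((1, 2), (8, 10))

# Right graph: two people whose responses pull in different directions.
right = cosine_similarity((1, 10), (8, 2))

for name, c in (("left", left), ("right", right)):
    angle = math.degrees(math.acos(c))
    print(f"{name}: cosine = {c:.3f}, angle = {angle:.1f} degrees")
```

The left pair gives a cosine of roughly 0.98 and the right pair roughly 0.34. Neither can drop below 0 here, because every response is a positive number.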

But That’s Not the Ingenious Part

That’s pretty cool, but it is not, in fact, how we measure correlation. The first difficulty we encounter happens after adding more people to the survey, giving us several angles to deal with—not impossible, but pretty messy for a hundred or a thousand responses. The second, more important, difficulty is that the graph on the right above doesn’t show a weak correlation; it shows a strong negative correlation. Given just the two response pairs to work from in that graph, we would have to conclude that a strong fear of heights would make you more likely to want the ability to fly (or vice versa) rather than less likely. But the “weakest” the cosine can measure in this kind of setup is 0.

The solution to the first difficulty is to take all the x-components of the responses and make one giant vector out of them. Then do the same to the y-components. Now we’ve got just two vectors to compare! For our data on the left, the vectors (1, 2) and (8, 10) become (1, 8) and (2, 10). The vectors on the right—(1, 10) and (8, 2)—become (1, 8) and (10, 2).
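The regrouping step can be sketched in a few lines of Python. The helper name `split_components` is hypothetical; the point is only that per-person pairs get transposed into one vector per question.

```python
# Each tuple is one person's (flight, heights) response pair.
left_responses = [(1, 2), (8, 10)]
right_responses = [(1, 10), (8, 2)]

def split_components(responses):
    """Regroup per-person pairs into one x-vector and one y-vector."""
    xs, ys = zip(*responses)
    return xs, ys

print(split_components(left_responses))   # ((1, 8), (2, 10))
print(split_components(right_responses))  # ((1, 8), (10, 2))
```

However many people respond, this always leaves exactly two vectors to compare; they just get longer.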

The solution to the second difficulty (no negative correlations) is to center the data. Let’s take our new vectors for the right-hand graph: (1, 8) and (10, 2). Add the components in each vector and divide by the number of components (2) to get an average. Then subtract the average from each component. So, our new centered vectors are

(1 – ((1 + 8) ÷ 2), 8 – ((1 + 8) ÷ 2)) and (10 – ((10 + 2) ÷ 2), 2 – ((10 + 2) ÷ 2))

Or (–3.5, 3.5) and (4, –4). It’s probably not too tough to see that a vector in the 2nd quadrant and a vector in the 4th quadrant are heading in opposite directions. In fact, these two vectors point in exactly opposite directions, forming a 180° angle, and the cosine of 180° is –1, which is the lowest correlation we can get, indicating a strong negative correlation.
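Here is a small Python sketch of the centering step applied to the right-hand data, followed by the cosine of the centered vectors. The helper names `center` and `cosine_similarity` are assumptions made for this example.

```python
import math

def center(v):
    """Subtract the mean of v from each of its components."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

x = center([1, 8])    # flight responses, centered
y = center([10, 2])   # heights responses, centered
cos_xy = cosine_similarity(x, y)

print(x, y)
print(round(cos_xy, 6))
```

With only two people in the data, the centered vectors always lie along the same line, so the result can only be 1 or –1. Values in between show up once there are three or more responses.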

And That’s Correlation

To summarize, the way to determine correlation linear-algebra style is to determine the cosine of the angle between the centered x- and y-vectors of the data. That formula is \[\mathtt{\frac{(x-\overline{x}) \cdot (y-\overline{y})}{|x-\overline{x}||y-\overline{y}|} = \cos(\theta)}\]

This is just another way of writing the more common version of the r-value correlation.
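As a rough check on that claim, the sketch below computes the cosine of the centered vectors and, separately, r from the sums-based computational formula often shown in statistics textbooks, so the two can be compared. Both function names are my own, and pooling all four response pairs from the two graphs into one data set is my choice for illustration, not something the lesson does.

```python
import math

def centered_cosine(x, y):
    """Cosine of the angle between the mean-centered versions of x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    dot = sum(a * b for a, b in zip(dx, dy))
    return dot / (math.sqrt(sum(a * a for a in dx)) *
                  math.sqrt(sum(b * b for b in dy)))

def pearson_r(x, y):
    """Correlation r via the sums-based computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Left-hand graph only: the two people who agree.
print(centered_cosine([1, 8], [2, 10]), pearson_r([1, 8], [2, 10]))

# All four people pooled: (1, 2), (8, 10), (1, 10), (8, 2).
flight = [1, 8, 1, 8]
heights = [2, 10, 10, 2]
print(centered_cosine(flight, heights), pearson_r(flight, heights))
```

The two functions agree on both data sets. For the pooled data, the agreeing pair and the disagreeing pair cancel out and the correlation comes out to 0.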

Published by

Josh Fisher

Instructional designer, software development in K-12 mathematics education.