How can we calculate cosine similarity in BigQuery ?

**BigQuery(BQ)** is very useful for data analysis or processing.

It is good for handling huge data.

It returns summary result in short time.

BQ has one useful data format "**array**".

When we consider array as vector, we may want **cosine similarity of vectors**.

So how can we get cosine similarity ?

So today I introduce about **"How to calculate cosine similarity of array in BigQuery"**.

## What is cosine similarity

Cosine similarity is a **measure of similarity between two vectors**.

Cosine `cos`

is one of the trigonometric functions.

It takes 1 in case of 0°, 0 in case of 90° and -1 in case of 180°.

So it means that **we can get an angle if we know value of cosine**.

And if an angle between 2 vectors is close to zero, it means 2 vectors are similar.

So the cosine is used as similarity.

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space.

Cosine similarity – Wikipedia

The formula of cosine similarity is below.

I remember that I used to learn it.

## Prepare data

Before calculate it, we should prepare data.

In order to make array data, we can use the table which we create in previous topic.

You can add `CREATE TABLE`

to SQL, and create table.

```
CREATE TABLE test.array_sample2
AS
SELECT
key,
ARRAY_AGG(val) as val_array
FROM test.array_sample
GROUP BY key
```

Result table

Row | key | val_array |
---|---|---|

1 | a | 1 |

2 | ||

3 | ||

2 | b | 2 |

4 | ||

5 | ||

3 | c | 3 |

2 | ||

-1 |

We use this table to calculate cosine similarity.

## How to calculate cosine similarity of array in BigQuery

In order to calculate cosine similarity of array in BigQuery, we should calculate product between elements of vectors.

SQL is below.

```
SELECT
t1.key AS key1,
t2.key AS key2,
(
SELECT
SUM(value1 * value2)/ SQRT(SUM(value1 * value1))/ SQRT(SUM(value2 * value2))
FROM
UNNEST(t1.val_array) AS value1
WITH
OFFSET
pos1
JOIN
UNNEST(t2.val_array) AS value2
WITH
OFFSET
pos2
ON
pos1 = pos2
) AS cosine_similarity
FROM
test.array_sample2 AS t1,
test.array_sample2 AS t2
ORDER BY
key1,
key2,
cosine_similarity
```

It decomposed array to each element bt `UNNEST(array)`

, and added order number by `WITH OFFSET pos`

.

Then use `pos`

as join key for multiplication of each element.

Result is below.

Result

Row | key1 | key2 | cosine_similarity |
---|---|---|---|

1 | a | a | 1 |

2 | a | b | 0.9960238411 |

3 | a | c | 0.2857142857 |

4 | b | a | 0.9960238411 |

5 | b | b | 1 |

6 | b | c | 0.3585685828 |

7 | c | a | 0.2857142857 |

8 | c | b | 0.3585685828 |

9 | c | c | 1 |

Vector a and b are similar. So similarity is high.

Vector c faces different direction. So similarity is low.

## Conclusion

Today I explained about **"How to calculate cosine similarity of array in BigQuery"**.

In order to calculate cosine similarity of array, we can check these following important pints.

- Decompose array by "UNNEST(array)"
- Add number to array element by "WITH OFFSET pos"
- Calculate cosine similarity by multiplication between each array element

Using `UNNEST(array)`

is little bit complicated.