Cosine similarity calculations can be used to check how similar two vectors are. Using ARM NEON intrinsics, we can dramatically improve the performance of this calculation while changing very little code.

The most simple way of doing this calculation, in C++, is by simply using `std::inner_product`

to calculate the dot products.

```
#include <cmath>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>
float dot(const std::vector<float>& a, const std::vector<float>& b) {
return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}
float norm(const std::vector<float>& a) {
return std::sqrt(dot(a, a));
}
float cos(const std::vector<float>& a, const std::vector<float>& b) {
return dot(a, b) / (norm(a) * norm(b));
}
int main() {
const std::vector<float> a(128, 1.0);
const std::vector<float> b(128, 1.0);
auto start = std::chrono::high_resolution_clock::now();
auto result = cos(a, b);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout << "took time " << duration << "us" << std::endl;
std::cout << "cosine similarity: " << result << std::endl;
}
```

Running the above code produces the following results:

```
took time 70us
cosine similarity: 1
```

Let’s see the speed we get when we run this calculation 1,000,000 times. We’ll make the following changes to our `main`

.

```
int main() {
const std::vector<std::vector<float>> vectors(1000000, std::vector<float>(128, 1.0));
float result = 0;
auto start = std::chrono::high_resolution_clock::now();
for (const auto& v : vectors) {
result += cos(v, v);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout << "took time " << duration << "us" << std::endl;
std::cout << "result: " << result << std::endl;
}
```

And the results for this test:

```
took time 33314621us
result: 1e+06
```

To run cosine similarity for 1,000,000 vectors, it takes about 30 seconds in total. What improvements might we see if we use NEON intrinsics?

The ARM documentation already includes an example of matrix multiplication that we can use as reference. Let’s first try and optimize the dot product, since that is currently used by both the cosine similarity calculation as well as the norm calculation. Before we start, make sure to include `arm_neon.h`

at the start of your script.

The important instrinsic to use in this scenario is `vfmaq_f32`

. Fused multiply-add will allow us to perform four multiplications and additions in a single instruction.

```
float dot(const std::vector<float>& a, const std::vector<float>& b) {
auto a_data = a.data();
auto b_data = b.data();
float32x4_t x, y, z = vmovq_n_f32(0.0);
for (int i = 0; i < 128; i += 4) {
x = vld1q_f32(&a_data[i]);
y = vld1q_f32(&b_data[i]);
z = vfmaq_f32(z, x, y);
}
return vaddvq_f32(z);
}
```

The code above loads the floats from the input vectors in groups of 4, performs fused multiply-add on them, and then sums up and returns the results in z. Let’s see what speed-up we get.

```
took time 5490190us
result: 1e+06
```

By simply replacing our dot product calculation with ARM NEON intrinsics, we’re able to reduce the runtime of the cosine similarity calculations to 1/6th of the original time.