
Java 17 GA: Simple benchmark with Vector API (Second Incubator)

A few years ago I was hoping that Java would get a chance to become, once again, an important contender in the machine learning field. I was hoping for interactivity, vectorization, and seamless integration with the external world (C/C++/Fortran). With the release of Java 17, the last two dreams are closer to reality than ever.

JEP 414: Vector API (Second Incubator) is something I awaited a lot, and I spent a few hours playing with it. Personally, I am really happy with the results, and I am strongly motivated to migrate much of my linear algebra stuff onto it. It looks really cool.

To make a long story short, I implemented a small set of microbenchmarks for two simple operations. The first operation is fillNaN; for the second test, we simply add the elements of a vector.

fillNaN

This is a common problem when working with large chunks of floating-point numbers: some of them are not numbers for various reasons: missing data, impossible operations, and so on. The pandas equivalent would be fillna. The whole idea is that, for a given vector, you want to replace all Double.NaN values with a given value to make arithmetic possible.

The following is a listing of the fillNaN benchmark.
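The embedded listing did not survive the repost, so here is a minimal sketch of the two fillNaN implementations as described in the text (class and method names are my assumptions, and the JMH harness annotations are omitted). It needs `--add-modules jdk.incubator.vector` on Java 17:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class FillNaNSketch {

    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Plain scalar loop: replace every NaN with the fill value.
    static void fillNaNArrays(double[] a, double fill) {
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }

    // Vector API version: test a whole lane for NaN at once and
    // blend in the fill value under the resulting mask.
    static void fillNaNVectorized(double[] a, double fill) {
        DoubleVector fillVector = DoubleVector.broadcast(SPECIES, fill);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            VectorMask<Double> isNaN = v.test(VectorOperators.IS_NAN);
            v.blend(fillVector, isNaN).intoArray(a, i);
        }
        // Scalar tail for the elements that do not fill a whole lane.
        for (; i < a.length; i++) {
            if (Double.isNaN(a[i])) {
                a[i] = fill;
            }
        }
    }

    public static void main(String[] args) {
        double[] a = {1.0, Double.NaN, 3.0, Double.NaN, 5.0};
        fillNaNVectorized(a, 0.0);
        System.out.println(java.util.Arrays.toString(a)); // [1.0, 0.0, 3.0, 0.0, 5.0]
    }
}
```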

As you can see, nothing fancy here. The `testFillNaNArrays` method iterates over the array and, if the current value is Double.NaN, replaces it with the fill value. Pretty straightforward. How about the results? The vectorized version should be faster.

Benchmark                                      Mode  Cnt   Score   Error   Units
VectorFillNaNBenchmark.testFillNaNArrays      thrpt   10   3.405 ± 0.149  ops/ms
VectorFillNaNBenchmark.testFillNaNVectorized  thrpt   10  41.930 ± 4.437  ops/ms
VectorFillNaNBenchmark.testFillNaNArrays       avgt   10   0.289 ± 0.002   ms/op
VectorFillNaNBenchmark.testFillNaNVectorized   avgt   10   0.023 ± 0.001   ms/op

But over 10 times faster? It is a pleasant surprise, though not a complete one. It is closely connected with auto-vectorization in Java: when it works, and for simple loops it does work, the JIT applies intrinsics and sometimes even emits SIMD instructions. But a call like Double.isNaN is not a simple thing, at least not for auto-vectorization. In the new Vector API this operation is vectorized and we go fast, even though we use masks, which are not the lightest things in this new API. So we get a speedup of roughly 12x, which looks amazing.

sum and sumNaN

For the second microbenchmark, we have the same operation in two flavors. The first, sum, is implemented over all elements, with no constraints. The second sum operation, which we call sumNaN, skips the potential non-numeric values and computes the sum of the remaining numbers. We do that to check two things. We want to know how explicit vectorization behaves compared to auto-vectorization (this is the normal sum, implemented as a simple loop that benefits from all possible optimizations). And we also want to compare another mask-based operation against auto-vectorized code. Let's see the benchmark:
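As with the first benchmark, the listing was lost in the repost; a sketch of the method bodies, under the same assumptions (names mine, JMH annotations omitted, `--add-modules jdk.incubator.vector` required), might look like this:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SumSketch {

    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Simple loop; a good candidate for auto-vectorization.
    static double sumArrays(double[] a) {
        double sum = 0;
        for (double v : a) {
            sum += v;
        }
        return sum;
    }

    // Vector API version: accumulate lane-wise, reduce at the end.
    static double sumVectorized(double[] a) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            acc = acc.add(DoubleVector.fromArray(SPECIES, a, i));
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Masked version: NaN lanes are blended to zero before accumulating.
    static double sumNaNVectorized(double[] a) {
        DoubleVector zero = DoubleVector.zero(SPECIES);
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector v = DoubleVector.fromArray(SPECIES, a, i);
            VectorMask<Double> isNaN = v.test(VectorOperators.IS_NAN);
            acc = acc.add(v.blend(zero, isNaN));
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail, skipping NaN values.
        for (; i < a.length; i++) {
            if (!Double.isNaN(a[i])) {
                sum += a[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumVectorized(new double[] {1.0, 2.0, 3.0})); // 6.0
        System.out.println(sumNaNVectorized(new double[] {1.0, 2.0, Double.NaN, 4.0})); // 7.0
    }
}
```

The scalar sumNan variant is just the obvious loop with a Double.isNaN check inside, so it is not repeated here.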


And, with no additional comments, the results:

Benchmark                                 Mode  Cnt   Score   Error   Units
VectorSumBenchmark.testSumArrays         thrpt   10   9.264 ± 1.591  ops/ms
VectorSumBenchmark.testSumVectorized     thrpt   10  12.222 ± 0.738  ops/ms
VectorSumBenchmark.testSumNanArrays      thrpt   10   2.692 ± 0.191  ops/ms
VectorSumBenchmark.testSumNanVectorized  thrpt   10  10.704 ± 0.428  ops/ms
VectorSumBenchmark.testSumArrays          avgt   10   0.120 ± 0.011   ms/op
VectorSumBenchmark.testSumVectorized      avgt   10   0.054 ± 0.011   ms/op
VectorSumBenchmark.testSumNanArrays       avgt   10   0.390 ± 0.018   ms/op
VectorSumBenchmark.testSumNanVectorized   avgt   10   0.068 ± 0.005   ms/op

We can see from those results that the unoptimized sumNan code on arrays is by far the slowest. This is expected. What I personally did not expect was that the vectorized version with masks (testSumNanVectorized) would perform better than the auto-vectorized version of the simple sum (testSumArrays). Really good job. Hats off!

Conclusions

For the sake of reproducibility, I ran the benchmarks on an Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz with 8 cores and 32GB RAM. This processor supports SIMD operations on 256-bit lanes, i.e. 4 doubles at a time. A better one runs faster, of course. But the absolute numbers are not important here. What is important is that you can now vectorize many things in Java directly, including complex operations with masks, and that this is, at least sometimes, faster than auto-vectorization. This is a really amazing job.


This post first appeared on Rapaio.
