Programming for the Larrabee/Xeon Phi
Back in the day, I worked on a little project called Larrabee, which later turned into the Intel Xeon Phi coprocessor. It was an ambitious and exciting platform: it paired 512-bit-wide vector instructions, letting it stream data much like GPU architectures do, with fully general-purpose x86 cores.
It turned out that getting performance out of this hardware was difficult. To reach the full potential of the hardware, you simply had to utilize the vector units; without that, it was like writing a single-threaded app on an 8-core system. Operating on a single SIMD lane just wasn't going to cut it, as a 2017 International Journal of Parallel Programming article observed:
“Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive”
Using the Xeon Phi Platform to Run Speculatively Parallelized Codes
The paper, and the host of others linked on the page as references, are a good read and give some hints as to why fixed-function GPUs have an advantage when it comes to raw streaming throughput. Hint: cache and data-flow behavior is as important as, if not more important than, utilizing vectorization in such architectures.