
Do optimizers try to use SSE and AVX?



Hello dear friends!

 

How do programmers make use of SSE and AVX?

 

A - I imagine it could only be through calls to library functions that are written in assembler.

 

B - Or the programmer could give a hint (a pragma) and write source code that strongly suggests how to compile it.

 

C - Or, if the compiler is smart enough (Visual C++ impressed me in the Pentium-1 era), it could analyze the source code, detect when loops are parallel enough, and produce a faster binary.

 

Example:

 

double s[], a[], b[]; int i;

for (i=0; i<max; i++) s[i] += a[i]*b[i];

 

could be compiled (with additional parity tests on i) as:

 

for (i=0; i<max; i+=2)

  {s[i] += a[i]*b[i]; s[i+1] += a[i+1]*b[i+1];}

 

if the target machine has SSE, and with i+=4 if the machine has 256-bit AVX.

 

So what does the programmer need to do, and how autonomous is the compiler?

Thank you!



Pretty much any of the ways you listed, in descending order of prominence/priority.  Notably, if optimizations of this nature are attempted at all, a profiler is run against the code first and the bottleneck identified.  Since it has to be a CPU-bound bottleneck (and not disk or API), the number of opportunities for doing what you're talking about is limited, and any hand-optimized code is just a small section of that.  Notably as well, adjusting the algorithm has much bigger impacts than looking for CPU instruction optimizations - replacing a crappy algorithm with a better one will always net better results.

 

The only other outstanding point is that a compiler will support a subset (but not all) of the CPU instructions that were available a reasonable time before the compiler was released.  Since it's only a subset, the hand-optimization route always remains a possible valid one. (I'm reminded of the last two I did, which were optimizable with simple x86 assembler instructions for a good speed increase.)


Thanks Glenn9999!

 

So, what should the programmer write so that the compiler uses SSE or AVX?

 

Usually, if it's up to the compiler, it just selects whatever it thinks is best.  You really aren't going to know *exactly* what it does.  If you get the option, you can compile for certain architectures; this is important because you need to target what is actually available to you.  Note that most compilers default to the most common instruction *sets* unless you tell them otherwise.  Like I indicated above, there's no real guarantee you'll get the "most optimized" use of any instruction set unless you provide a custom-written ASM function.  But then again, as I also indicated above, it's not a concern for 99% of applications, since it has to be a CPU bottleneck, and finding the *best* algorithm always beats optimizing a crappy one.  After all, you can't polish a turd no matter how hard you try.


In the cases I consider, it's entirely a matter of computation speed on double floats.

 

Did you (or someone else) check whether the compiler indeed uses the SSE or MMX instructions when the programmer writes something like

 

double s[], a[], b[]; int i;

for (i=0; i<max; i++) s[i] += a[i]*b[i];

 

or whether more explicit indications are necessary?

 

It's not just a matter of the compiler knowing an instruction set. Here the compiler must make some formal transformations, which are sometimes easy (as they are here), sometimes far less so.


 

Did you (or someone else) check whether the compiler indeed uses the SSE or MMX instructions when the programmer writes something like

 

To review what I wrote above, which answers this question completely: all you can hope for is that the compiler knows the proper instruction sets, so the possibility of using those instructions exists.  Other than that, you don't know what the compiler will actually *do*.  You can get assembler dumps from a number of compilers to find out what they did, but you can influence what the compiler does only very little.

 

To that end, your original option A is the only option if you want a guarantee that something happens.  The problem with that is as I detailed above too - you need to know, first, that it's a CPU bottleneck involving precisely the code at hand, and second, that your attempt is helping matters and not hurting them.  Truth be told, in 99.99% of instances it's not worth your time to mess with it, and that's what I was trying to relay up above.


To review what I wrote above, which answers this question completely: [...]

 

In three messages from you I didn't get the answer. I asked "how do programmers make use of SSE and AVX" and "what does the programmer need to do, and how autonomous is the compiler?", and got vague, off-topic recommendations at the level of a general-culture course. Reading the same material in a press article, I'd regret that the journalist had only reproduced some sentences he grasped during a PowerPoint presentation.

 

Meanwhile I found the sought information on various websites, for instance

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf

http://sci.tuomastonteri.fi/programming/sse

http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/

In case this is helpful to someone else:

 

The programmer can obviously include assembler in his source code. Example here

http://neilkemp.us/src/sse_tutorial/sse_tutorial.html

which has clear drawbacks, hence the desire for alternatives.

 

Most compilers define extra data types corresponding to the SSE or AVX packed data, plus intrinsic functions that operate on these types, overloading the usual operators for some of them. This needs a few more conditions, like data alignment and special keywords.

 

The most convenient case is when the compiler recognizes on its own that a portion of source code can run on SSE or AVX, because the source stays easier to read, and an existing source can be recompiled for more recent processors. With luck, present code could even be recompiled in the future when processors offer new possibilities.

Page 8 of http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf

calls this autonomous work by the compiler "automatic vectorization"; it isn't widespread, works more or less well, and can bring trouble.

 

Intel's ICC compiler does this, and was possibly the first to. Visual Studio 2012 and later try to do it, and GCC too. On portions of source code that can obviously run on SSE or AVX, their success rates vary a lot.

 

VS and GCC create only the binary for the intended target machine, without adding tests for whether the actual processor is adequate. Such a binary crashes if run on a processor lacking the instructions. Intel's ICC does add some tests; with the others it would have been the programmer's role, which is often neglected.

 

Not specific to compilation:

 

Having used 256-bit AVX instructions can make subsequent SSE code, or scalar code that runs on the SSE units (as it does on the Core), much slower.

 

Only the registers provide the data throughput needed to feed present compute hardware. The L1 cache suffices in some cases; the other caches and the main RAM never do, with existing processors. SSE and AVX make this worse. Compute loops must be deeply reorganized according to what the caches can deliver; compilers don't do that on their own, so it remains the programmer's duty.

