C++ SSE and SSE2 compiler settings, and their Floating Point effects.

I have been running C++ floating point tests using the Twilight Dragon Media (TDM-GCC) C++ compiler for 64 bit Windows 10.

https://jmeubank.github.io/tdm-gcc/

C++ starts by using the IEEE 754 representation for floats and doubles (their decimal and hexadecimal values, and their floating point arithmetic), if I have things right. See this at:

https://introcs.cs.princeton.edu/java/91float/

I have learned that when you compile in
A)
float a = 0.1F;


what gets stored is

0.100000001490116119384765625


whereas when you compile in

B)
double b = 0.1;


what gets stored is

0.1000000000000000055511151231257827021181583404541015625
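
A quick way to see these stored values for yourself, by requesting far more digits than the shortest printout (a small check, using iostream plus iomanip):

#include <iostream>
#include <iomanip>

int main() {
    float  a = 0.1F;
    double b = 0.1;
    // With enough requested digits, the exact stored values come out:
    std::cout << std::setprecision(60) << a << '\n';  // 0.100000001490116119384765625
    std::cout << std::setprecision(60) << b << '\n';  // 0.1000000000000000055511151231257827021181583404541015625
}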

However, TDM, being an example of GNU C++, has SSE and SSE2 compiler switches. I even think it has them on by default, without specifying. What exactly does enabling SSE or SSE2 actually do to C++ floating point like this, to make its representation and then its arithmetic results sensible? I have heard that it uses additional SSE/SSE2 bits in the CPU maths coprocessor to supplement 32 bit and 64 bit float and double, to help with these spurious data problems, at representation time as well as during further arithmetic with the usual operators.

+, -, *, /, %, ++n, --n, n++, n--, +=, -=, *=, /=, %=


What exactly do SSE, SSE2 and their descendant compiler switches actually do with my two examples to try and make things more accurate in base 10, or even base 16? I have heard that they supply more bit storage past the end of my two 32 bit and 64 bit examples. What do the stored results become, with 32 and 64 bits, or more? Are the stored results the same, or are different arithmetic algorithms used? How many SSE/SSE2 bits/registers are there for every 32 bit and 64 bit register in a modern 64 bit SSE-enabled CPU?
The floating-point format and semantics are defined by the IEEE-754 standard:
https://en.wikipedia.org/wiki/IEEE_754

On the "x86" architecture, this has traditionally been implemented by the "x87 FPU", originally a separate chip, but now integrated in the CPU on all modern processors. This is still available today, but has been supplemented by the SSE/2/3/4 and AVX/2 instruction set extensions. Those are so-called SIMD extensions, so they can process multiple values with a single instructions, potentially giving you a nice speed-up.

TTBOMK, one difference is that the "x87 FPU" has an internal precision of 80 bits (extended precision), whereas SSE math works with either 32-Bit (single) or 64-Bit (double) precision. But: Unless you enable the so-called "fast math" option in your compiler, the compiler has to enforce "strict" IEEE-754 semantics – which means that float values have to be truncated to 32-Bit and double values have to be truncated to 64-Bit after each operation! You probably won't see a difference between SSE and "x87".

With "fast math" on, the compiler does all sort of things that speed up floating-point operations, but IEEE-754 semantics are no longer strictly enforced. This means that, for example, intermediate results may be kept at a higher precision, if the hardware allows for it. You may see difference between SSE and "x87" then!

See also:
* https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffast-math
* https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fexcess-precision
* https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mfpmath-1

Note that option -Ofast implies -ffast-math, which in turn implies -fexcess-precision=fast!

In order to enable SSE math, you have to set option -mfpmath=sse plus option -march= with a CPU model that actually supports SSE. If you are compiling for the "x64" (AMD64) target, then SSE support is always implied.
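
One quick way to check which evaluation mode your build actually uses is the standard FLT_EVAL_METHOD macro from <cfloat>; a small sketch:

#include <cfloat>
#include <cstdio>

int main() {
    // 0 = expressions evaluated in the range/precision of their type (typical for -mfpmath=sse)
    // 2 = expressions evaluated in long double precision (typical for x87, -mfpmath=387)
    std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
}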
So, with GNU C++, if I want the source code assignment

A)
float a = 0.1F;

which is presently stored as

0.100000001490116119384765625

to stay a floating point type (or any type), but to be stored as either

0.1

or, remaining range accurate (truncating, and maybe overflowing or underflowing after that), to be stored as

0.100000000000000000000000000

for all float number types, as data or variable contents, and for doubles as well, for value storage and arithmetic result calculation, so that all such examples always provide:
float n = 0.1F*0.1F;
std::cout << std::endl << n << std::endl;

//0.01F, and be stored as  

0.01

or really

0.010000000000000000000000000

For all C++ float and double number values, and arithmetic operators from

+, -, *, /, %, ++n, --n, n++, n--, +=, -=, *=, /=, %=

Is there a switch, or a series of switches, that will do this? Is fast-math the one and only answer, the solution I am looking for to send IEEE 754 underflow and overflow away forever, for values that actively fit and work in C++ floats and doubles? I have had it in my head that SSE processing, with its extra bits and behaviours, was the answer to killing floating point underflow and overflow. What are the C++ settings for that?
So, with GNU C++, if I want the source code assignment
A)
float a = 0.1F;

which is presently stored as
0.100000001490116119384765625

Actually, the decimal value "0.1" is stored/encoded in binary form, using the IEEE-754 format:
https://i.imgur.com/cSSVDHD.png

It is stored as the closest possible single precision, i.e. float (32-Bit), IEEE-754 value to the number 0.1.

Note: It is not possible to store "0.1" in floating-point (IEEE-754) format with "100%" accuracy, because of the way floating-point numbers work! You always get the closest representable number to what you requested. There always will be a certain "error"; it's just that this error will be smaller with double than with float.
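
If you want to inspect that encoding yourself, here is a small sketch (the bit pattern in the comment assumes IEEE-754 single precision):

#include <cstdio>
#include <cstring>
#include <cstdint>

int main() {
    float value = 0.1f;
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to view the raw encoding
    // 0x3DCCCCCD: sign 0, biased exponent 123 (i.e. 2^-4), mantissa rounded up
    std::printf("0.1f is encoded as 0x%08X\n", (unsigned)bits);
}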

Of course, when you print a floating-point number to the screen, then you can decide how many decimal places should be printed out. So you can do something like this (limited to 8 decimal fractional places):
float value = 0.1f;
printf("%10.8f\n", value);

...which results in:
0.10000000
Note: It is not possible to store "0.1" in floating-point (IEEE-754) format with "100%" accuracy, because of the way floating-point numbers work!


Yes, I know that, at least in so far as they operate that way on their own. I also know that the decimal value "0.1" gets stored in binary form. The translation of that binary form no longer equals "0.1" perfectly, because IEEE 754 is the binary scheme used.

I have been led to believe that C++ compilers (GNU C++, like the one linked at the top of my discussion), certainly for 64 bit CPUs, supply SSE switches and fast-math switches, which can mean that IEEE 754 doesn't get held to precisely, both for decimal-to-binary storage translation and for arithmetic between two stored binary numbers to calculate a new number through the float and double arithmetic operators, like:


+, -, *, /, %, ++n, --n, n++, n--, +=, -=, *=, /=, %=


A partial variation away from IEEE 754 is possible, which means that by augmenting it with some extra SSE bits and some different added logic, within range, it is in fact possible to get 100% accuracy across the upper and lower range of a float or double value. Is there someone on these forums who can tell me what/where those switches are, to steer me in the right direction, to prevent floating point overflow and underflow via setting(s) from the outset, kindly?
SSE has bigger registers (128 Bit), but only to stuff multiple values into a single register, for reasons of SIMD. I don't think SSE supports anything other than single-precision (32-Bit) and double-precision (64-Bit) floating-point numbers. The old "x87" FPU used 80-bit floating-point numbers. But, as said before, if you use the float or double types in your program, you usually won't see a difference; you simply get IEEE-754 semantics. The only thing that "fast math" changes is that intermediate results may be held at a somewhat higher precision.

I think what you want is not possible with the standard floating-point types.

Maybe have a look at the GMP library, which allows for "arbitrary precision" arithmetic (incl. floating-point):
https://gmplib.org/

See also:
* https://gmplib.org/manual/Floating_002dpoint-Functions
* https://gmplib.org/manual/Rational-Number-Functions 👈 maybe rational numbers are more suitable!
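
For a taste of the rational route, a minimal sketch using GMP's C++ interface (gmpxx.h; link with -lgmpxx -lgmp):

#include <gmpxx.h>
#include <iostream>

int main() {
    mpq_class a(1, 10);              // exactly 1/10 -- no rounding anywhere
    mpq_class b = a * a;             // exactly 1/100
    std::cout << b << '\n';          // prints "1/100"
    std::cout << b.get_d() << '\n';  // convert to double only at the very end
}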
Indeed, back in the day the tenbyte was accessible in Visual Studio as a type for C++ or assembly code etc. I am not sure you can still get to them, or if they even still exist in some way that is exposed to the user.
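
With GCC on x86, that 80-bit "tenbyte" format is still reachable as long double (a sketch; size and format vary by platform and ABI):

#include <cstdio>
#include <cfloat>

int main() {
    std::printf("sizeof(long double) = %zu\n", sizeof(long double));  // often 12 or 16 due to padding
    std::printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);               // 64 for the x87 extended format
    std::printf("%.25Lf\n", 0.1L);  // one-tenth at 80-bit precision
}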
Let's do:
float A{}, B{}, C{};
float D = A * B + C;

What would it be with FPU (approximately)?
1. Load 32-bit A into 80-bit register
2. Multiply register with 32-bit B
3. Truncate register to 32-bit value
4. Add to register with 32-bit C
5. Truncate&store register into 32-bit D

How about FPU + fastmath?
1. Load 32-bit A into 80-bit register
2. Multiply register with 32-bit B
3. Add to register with 32-bit C
4. Truncate&store register into 32-bit D

The difference is that now we add C to the 80-bit result of A*B, rather than to the IEEE-754-dictated lower precision 32-bit value.

Plan B: Let's use the vector instruction (SSE) set:
1. Load 32-bit A into 32-bit SSE register (the register has 128 bits, for four 32-bit floats)
2. Load 32-bit B into other 32-bit SSE register
3. Multiply the two registers. The result is in one 32-bit part of the register (The multiply instruction did operate on all four parts)
4. Load 32-bit C into other 32-bit SSE register
5. Add the two registers. The result is in one 32-bit part of the register (The add instruction did operate on all four parts)
6. Store register into 32-bit D

Just like with plain FPU, all operations were with 32-bit values.


The benefit of SSE is elsewhere, in "auto-vectorization", which is not on by default.
Let's take:
float A[4]{};
float B[4]{};
float C[4]{};
float D[4]{};

for ( int i=0; i<4; ++i ) {
    D[i] = A[i] * B[i] + C[i];
}

Our pseudo-code FPU operation had 5 steps, so the body of the loop has 20 steps, plus the loop ops on top of that.

If the compiler is told to vectorize the loop (and if it can), then all it takes is those 6 steps (and no loop ops) to perform the operation for four floats simultaneously.
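
For illustration, here is roughly what the vectorized body boils down to, written by hand with SSE intrinsics from <xmmintrin.h> (the auto-vectorizer, e.g. with -O3 -ftree-vectorize, emits something equivalent):

#include <xmmintrin.h>

void madd4(const float* A, const float* B, const float* C, float* D) {
    __m128 a = _mm_loadu_ps(A);                  // load four floats at once
    __m128 b = _mm_loadu_ps(B);
    __m128 c = _mm_loadu_ps(C);
    __m128 d = _mm_add_ps(_mm_mul_ps(a, b), c);  // four 32-bit multiplies, then four 32-bit adds
    _mm_storeu_ps(D, d);                         // store four results
}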
The only thing that "fast math" changes is that intermediate results may be held at a somewhat higher precision.
This is broadly untrue, at least for the major three compilers. Perhaps most importantly, fast math options also allow the compiler to reorder and reassociate floating point operations in a way that is compatible with SIMD, but there are other effects too.

A partial variation away from IEEE 754 is possible, which means that by augmenting it with some extra SSE bits and some different added logic, within range, it is in fact possible to get 100% accuracy across the upper and lower range of a float or double value.
No matter how far one deviates from IEEE754, it is fundamentally impossible to "get 100% accuracy".

It is not possible to store one-tenth in a binary floating point number, because one-tenth cannot be written as a finite sum of powers of two (in binary it is the infinitely repeating fraction 0.000110011001100...). All number systems have the same problems for certain values.
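
You can see the consequence directly (a quick check, assuming IEEE-754 doubles):

#include <iostream>
#include <iomanip>

int main() {
    double p = 0.1 * 0.1;
    // Typically false: both sides carry their own rounding error, and they differ.
    std::cout << std::boolalpha << (p == 0.01) << '\n';
    std::cout << std::setprecision(20) << p << " vs " << 0.01 << '\n';
}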

Is there someone on these forums who can tell me what/where those switches are, to steer me in the right direction, to prevent floating point overflow and underflow via setting(s) from the outset, kindly?
There is fundamentally no way to prevent overflow or underflow in any finite representation. There will always be a number whose representation is too big or too small in magnitude, to fit in the allotted space.
The only thing that "fast math" changes is that intermediate results may be held at a somewhat higher precision.
This is broadly untrue, at least for the major three compilers. Perhaps most importantly, fast math options also allow the compiler to reorder and reassociate floating point operations in a way that is compatible with SIMD, but there are other effects too.

I meant it is the only thing "fast math" changes with respect to how much floating-point "precision" you get.

-ffast-math implies a number of options, yes. And -fexcess-precision=fast is one of them.

-fexcess-precision=fast means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and it is unpredictable when rounding to the types specified in the source code takes place.
On the x86, it has no effect if -mfpmath=sse or -mfpmath=sse+387 is specified

(it allows the compiler to keep the full 80 bit precision of the x87 FPU and only truncate the final result to 64 or 32 bits; but that doesn't work if SSE-FP is used, because SSE-FP is limited to 32 or 64 bit anyway)
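
A classic trick to force rounding-to-type at a specific point even when the x87 keeps excess precision is to go through a volatile store (a sketch; the compiler must spill the value to its 32-/64-bit memory format):

#include <cstdio>

double round_to_double(double x) {
    volatile double v = x;  // forces an 8-byte store and reload, discarding the excess bits
    return v;
}

int main() {
    double r = round_to_double(0.1 * 0.1);
    std::printf("%.17g\n", r);
}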
The default switches are conservative and cause the compiler to generate code that preserves as much precision as possible in intermediate results. Adding switches explicitly will at best not lose any precision; pretty much all of them trade precision for speed.

If you need to process exact fractional values, you should use either a Rational type that stores the numerator and denominator of a rational number, or a decimal floating point type.

Rational:
Pros:
* Can represent difficult values such as 1/3 and 1/10 exactly.
* Implementation is simple (see the sketch after these lists).
Cons:
* Does not lose precision at all through intermediate calculations, and it's very difficult to make it lose precision on purpose, so you end up stuck when you reach something like 2147483646/2147483647 and can't operate on the value anymore without causing an integer overflow.

Decimal floating point:
Pros:
* Can represent common values such as negative powers of 10 exactly. This makes it commonly used for example in monetary applications.
* Does lose insignificant digits during intermediate calculations, so overflows are not a problem.
Cons:
* Implementation is more complex.
* Still cannot represent certain fractions such as 1/3 or 1/7 exactly.
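
Here is a minimal sketch of the Rational idea from the list above (assumptions: C++17 for std::gcd, positive denominators, and no overflow guarding):

#include <iostream>
#include <numeric>  // std::gcd

struct Rational {
    long long num, den;
    Rational(long long n, long long d = 1) : num(n), den(d) { reduce(); }
    void reduce() {
        long long g = std::gcd(num, den);  // std::gcd works on absolute values
        if (g != 0) { num /= g; den /= g; }
    }
    Rational operator*(const Rational& o) const { return {num * o.num, den * o.den}; }
    Rational operator+(const Rational& o) const {
        return {num * o.den + o.num * den, den * o.den};
    }
};

std::ostream& operator<<(std::ostream& os, const Rational& r) {
    return os << r.num << '/' << r.den;
}

int main() {
    Rational tenth(1, 10);
    std::cout << tenth * tenth << '\n';  // prints 1/100 -- exact, unlike 0.1F*0.1F
}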