As you saw in the Intel optimization manual, the SQRTSS instruction isn’t pipelined. But you interpreted this a little too strongly when you said, “When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it’s like putting the handbrake on your processor.”

The SQRTSS instruction’s effects aren’t quite *that* severe on the Core 2 or its descendants. SQRTSS only uses a single execution unit (the divide/square-root unit) during its processing, and it’s *that unit* which isn’t pipelined. So any further instructions needing that unit (such as additional square root or divide instructions) will be blocked from proceeding until the SQRTSS is finished. And of course, the SQRTSS will delay any instructions which need its result until it is ready. However, other instructions which don’t depend on the SQRTSS result and don’t need the divide unit will still be free to proceed.

You also speculated about the implementation of SQRTSS: “In the case of ssqrt, the processor is probably doing the same thing internally that I’m doing in my “fast” function — taking an estimated reciprocal square root, improving it with Newton’s method, and then multiplying it by the input parameter. Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it’s all done.”

While this isn’t an unreasonable method, it’s not what the processor is actually doing. As I said, SQRTSS uses the same functional unit as the divide instructions do, but it uses it in a different way. It turns out that there are digit-by-digit algorithms for computing square root which are quite different from Newton’s method of successive approximations. They more closely resemble conventional long division, except modified to exploit the fact that the algorithm is trying to find a quotient which is also the divisor. Since these square-root algorithms are similar to division algorithms, they can make use of the same hardware with modest modifications, and control lines (set according to the instruction being executed) determine whether a division or a square root is performed.
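To make the "long division" resemblance concrete, here is a minimal sketch of a digit-by-digit square root in C. It is an assumption-laden illustration, not what the hardware does: it works on integers, produces one bit per step (radix 2), and the function name `isqrt_digit_by_digit` is made up for this example. Real dividers operate on floating-point significands and produce several bits per cycle.

```c
#include <stdint.h>

/* Bit-by-bit (radix-2) integer square root: returns floor(sqrt(n)).
   Illustrative only; hypothetical helper, not the hardware algorithm. */
uint32_t isqrt_digit_by_digit(uint32_t n)
{
    uint32_t root = 0;          /* result digits found so far (scaled) */
    uint32_t bit  = 1u << 30;   /* highest power of 4 that fits in 32 bits */

    /* Skip digit positions that are larger than n. */
    while (bit > n)
        bit >>= 2;

    while (bit != 0) {
        /* Trial subtraction, as in long division, except that the
           "divisor" (root + bit) is built from the result so far. */
        if (n >= root + bit) {
            n   -= root + bit;
            root = (root >> 1) + bit;   /* this result digit is 1 */
        } else {
            root >>= 1;                 /* this result digit is 0 */
        }
        bit >>= 2;
    }
    return root;
}
```

Each loop iteration depends on the remainder left by the previous one, which hints at why this family of algorithms is awkward to pipeline.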

Unfortunately, the digit-by-digit algorithms commonly used for hardware division and square root are very difficult to pipeline. Thus even today, the divide and square root instructions on Intel x86 processors will basically block that functional unit until the result is complete. Some later processors do a limited amount of overlapping, but not really true pipelining. The biggest way these operations have been sped up over the years has been to improve the divider to generate more quotient bits per cycle (in effect, making each “digit” larger).

rsqrt(y) gives this iterate: x += 0.5*(x - y*x*x*x)

– No division in the second scheme; maybe this is why rsqrt() is faster in both software and hardware designs.

1/y gives this iterate: x += x*(1-y*x)

– Maybe this is why sqrt() is an unpipelined instruction (two interleaved iterative loops).
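For reference, the two iterates quoted in this comment can be sketched in C as follows. The function names, the starting guess `x0`, and the iteration count are assumptions for illustration; in practice the starting guess would come from a table lookup or an estimate instruction, and it must be reasonably close to the answer for the iteration to converge.

```c
/* Newton iteration for 1/sqrt(y): x += 0.5*(x - y*x*x*x).
   Uses only multiplies and adds, no division. Hypothetical helper. */
double newton_rsqrt(double y, double x0, int iterations)
{
    double x = x0;
    for (int i = 0; i < iterations; ++i)
        x += 0.5 * (x - y * x * x * x);
    return x;
}

/* Newton iteration for 1/y: x += x*(1 - y*x).
   Also division-free. Hypothetical helper. */
double newton_recip(double y, double x0, int iterations)
{
    double x = x0;
    for (int i = 0; i < iterations; ++i)
        x += x * (1.0 - y * x);
    return x;
}
```

For example, `newton_rsqrt(4.0, 0.4, 6)` approaches 0.5 and `newton_recip(4.0, 0.2, 6)` approaches 0.25; the error roughly squares on each step once the guess is close.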

sqrt(x) = x*rsqrt(x) is definitely fast but suffers from a NaN problem for x==0.

sqrt(0) = 0*rsqrt(0) = 0*(1/sqrt(0)) = 0*(1/0) = 0*INF = NaN

Using the rcp instruction (_mm_rcp_ss), which computes an approximate 1/x, the square root could be computed as

sqrt(x) = rcp( rsqrt(x) )

This doesn’t produce a NaN:

sqrt(0) = rcp(rsqrt(0)) = 1/(1/sqrt(0)) = 1/INF = 0

It’s still very fast (not very accurate though!)
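The two variants discussed in this comment might look like this with SSE intrinsics. This is a minimal sketch: the function names `sqrt_via_mul` and `sqrt_via_rcp` are made up for illustration, and it assumes an x86 target with SSE enabled. Both `_mm_rsqrt_ss` and `_mm_rcp_ss` are approximations (documented relative error up to 1.5 * 2^-12 each), so neither result is a correctly rounded square root.

```c
#include <xmmintrin.h>

/* sqrt(x) ~= x * rsqrt(x).
   At x == 0 this evaluates 0 * INF, which is NaN. */
float sqrt_via_mul(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}

/* sqrt(x) ~= rcp(rsqrt(x)).
   At x == 0 this evaluates 1 / INF, which is 0.
   Two chained approximations, so less accurate than one. */
float sqrt_via_rcp(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(v)));
}
```

If more accuracy is needed, a Newton step on the rsqrt estimate can be added before the final operation, at the cost of a few more multiplies.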