It falls back to a scalar C loop when not zeroing, which seems disappointing. I'm also not sure whether the floating-point code handles `-0` correctly.
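
To make the `-0` worry concrete, here's a minimal sketch. It is not the code under discussion. It assumes IEEE 754 doubles and that the zeroing fast path is a `memset`; `fill_double` and `is_all_zero_bits` are hypothetical names. The point is that `-0.0 == 0.0` is true, but `-0.0` has its sign bit set, so a fast path guarded by `value == 0.0` would silently write `+0.0` instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true only when x is +0.0, i.e. every bit is zero.
 * Assumes a 64-bit IEEE 754 double. */
static int is_all_zero_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined way to inspect the bits */
    return bits == 0;
}

/* Hypothetical fill routine: take the memset fast path only when the
 * value's bit pattern is all zeros, otherwise use the scalar loop. */
static void fill_double(double *dst, size_t n, double value) {
    if (is_all_zero_bits(value)) {   /* NOT `value == 0.0`, which also matches -0.0 */
        memset(dst, 0, n * sizeof *dst);
        return;
    }
    for (size_t i = 0; i < n; i++) {
        dst[i] = value;              /* preserves the sign bit of -0.0 */
    }
}
```

With this guard, filling with `-0.0` goes through the scalar loop and keeps the sign bit, while `+0.0` still gets the `memset` path.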