Cleanups
~~~~~~~~
* Important: iropt: make sure XorV128 and XorV256 of identical
args get folded to zero
* add more iterations in test cases
* math_UNPCKxPS_128: use xIsH ? InterleaveHI32x4 : InterleaveLO32x
I think this is safe w.r.t. the backend
* math_UNPCKxPD_128: ditto
* math_UNPCKxPD_256: split into 128 bit chunks and use math_UNPCKxPD_128
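
To illustrate the UNPCKxPD items, here is a minimal sketch in plain C of the
lane arithmetic, not the VEX IR itself. The types V128/V256 and the functions
unpck_pd_128/unpck_pd_256 are hypothetical names made up for this note; the
real helpers build IR expressions and may take their operands in a different
order. It only shows that the 256-bit form can be expressed as two independent
128-bit unpacks, one per half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models of 128- and 256-bit vector values. */
typedef struct { uint64_t q[2]; } V128;   /* q[0] is the low 64-bit lane */
typedef struct { V128 half[2]; } V256;    /* half[0] is the low 128 bits */

/* Model of UNPCKLPD / UNPCKHPD on one 128-bit value: pick the low (or
   high) 64-bit lane of each source and interleave them, a first. */
static V128 unpck_pd_128(V128 a, V128 b, bool is_high)
{
    int i = is_high ? 1 : 0;
    V128 r = { { a.q[i], b.q[i] } };
    return r;
}

/* 256-bit form: the instruction works on each 128-bit half separately,
   so split, reuse the 128-bit helper twice, and reassemble. */
static V256 unpck_pd_256(V256 a, V256 b, bool is_high)
{
    V256 r;
    r.half[0] = unpck_pd_128(a.half[0], b.half[0], is_high);
    r.half[1] = unpck_pd_128(a.half[1], b.half[1], is_high);
    return r;
}

/* Small usage check with recognisable lane values. */
static void unpck_demo(void)
{
    V256 a = { { { { 0xA0, 0xA1 } }, { { 0xA2, 0xA3 } } } };
    V256 b = { { { { 0xB0, 0xB1 } }, { { 0xB2, 0xB3 } } } };
    V256 lo = unpck_pd_256(a, b, false);
    assert(lo.half[0].q[0] == 0xA0 && lo.half[0].q[1] == 0xB0);
    assert(lo.half[1].q[0] == 0xA2 && lo.half[1].q[1] == 0xB2);
}
```

The point of the split is that no data crosses the 128-bit boundary, so the
256-bit case needs no logic of its own.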

Known limitations
~~~~~~~~~~~~~~~~~
* for many (all?) of the vector shift-by-imm cases (pre-existing as
well as AVX), out-of-range shift amounts are not handled properly; I
think they only work because the host happens to have the same
semantics as the guest.
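
For reference, a rough sketch in plain C of what explicit handling of
out-of-range immediates could look like for one 32-bit lane. This is not VEX
code; shl32_lane, shr32_lane, sar32_lane and shift_demo are hypothetical names
used only here. It assumes the x86 rule for the packed 32-bit shifts: a count
above 31 gives zero for logical shifts and a sign fill for arithmetic right
shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Logical left shift of one 32-bit lane; counts above 31 yield zero.
   (A bare x << amt with amt >= 32 would be undefined behaviour in C.) */
static uint32_t shl32_lane(uint32_t x, unsigned amt)
{
    return amt > 31 ? 0u : x << amt;
}

/* Logical right shift of one 32-bit lane; counts above 31 yield zero. */
static uint32_t shr32_lane(uint32_t x, unsigned amt)
{
    return amt > 31 ? 0u : x >> amt;
}

/* Arithmetic right shift; counts above 31 behave like a shift by 31,
   i.e. the lane is filled with copies of the sign bit. */
static int32_t sar32_lane(int32_t x, unsigned amt)
{
    if (amt > 31)
        amt = 31;
    /* Done on unsigned values so the result does not depend on how the
       compiler implements >> for negative signed operands. */
    if (x < 0)
        return (int32_t)~(~(uint32_t)x >> amt);
    return (int32_t)((uint32_t)x >> amt);
}

/* Small usage check covering in-range and out-of-range counts. */
static void shift_demo(void)
{
    assert(shl32_lane(1u, 4) == 16u);
    assert(shl32_lane(1u, 32) == 0u);
    assert(sar32_lane(-8, 1) == -4);
    assert(sar32_lane(-8, 200) == -1);
}
```

Making the range check explicit like this would remove the dependence on host
behaviour, at the cost of an extra comparison per shift.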