Cleanups
~~~~~~~~

* Important: iropt: Make sure XorV128 and XorV256 of identical
  args gets folded to zero

* add more iteration in test cases

* math_UNPCKxPS_128: use xIsH ? InterleaveHI32x4 : InterleaveLO32x
  I think this is safe w.r.t. the backend

* math_UNPCKxPD_128: ditto

* math_UNPCKxPD_256: split into 128 bit chunks and use math_UNPCKxPD_128


Known limitations
~~~~~~~~~~~~~~~~~

* for many (all?) of the vector shift-by-imm cases (pre-existing as
  well as AVX), out of range shifts are not handled properly and only
  work I think because the host happens to have the same semantics.