This may (or may not) fix some performance issues on ARM, which generally deals extremely poorly with unaligned reads (see #7). As far as I can test, the performance impact on x86 is negligible, likely because the average number of reads per call hasn't changed much.
Another approach I tried was calling copy_nonoverlapping to copy byte-by-byte into a u64 buffer. This was significantly slower than the previous implementation, so I dropped it.
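For anyone curious, the dropped approach looked roughly like the sketch below. This is a minimal illustration rather than the code I actually benchmarked: the helper name `read_u64_via_copy` is made up for this comment, and it issues a single copy_nonoverlapping call for up to 8 bytes instead of a per-byte loop. The idea is that the copy goes through a byte pointer, so no unaligned u64 load is ever performed.

```rust
use std::ptr;

/// Hypothetical helper (not part of this PR): build a u64 from up to
/// 8 leading bytes of `bytes` without doing an unaligned u64 load.
/// Bytes beyond the input length are left as zero, and the result is
/// in native byte order.
fn read_u64_via_copy(bytes: &[u8]) -> u64 {
    let n = bytes.len().min(8);
    let mut buf: u64 = 0;
    // SAFETY: `bytes` is valid for `n` byte reads, `buf` is 8 bytes long
    // and `n <= 8`, and the regions cannot overlap because `buf` is a
    // fresh local variable.
    unsafe {
        ptr::copy_nonoverlapping(bytes.as_ptr(), &mut buf as *mut u64 as *mut u8, n);
    }
    buf
}
```

Calling it on a slice that starts at an odd offset, e.g. `read_u64_via_copy(&data[1..])`, gives the same value as `u64::from_ne_bytes` on those 8 bytes, regardless of how `data` is aligned.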
@darakian, would you mind testing this? I still don't have access to an ARM chip to test things on.
@koivunej, would you be able to bench on x86 to make sure this doesn't cause a performance regression? I currently only have access to my laptop, which isn't great for benchmarking.