    A faster implementation of the memcpy family · c4fc76f8
    pi_pi3 authored
    The default implementation of the memcpy, memmove, memset and memcmp
    functions in the kernel file `extern.rs` is naive: it copies, assigns
    or compares bytes one by one. This can be slow.
    This commit proposes a reimplementation of those functions that copies,
    assigns or compares in groups of 8 bytes by using the u64 type and
    its respective pointers instead of u8. Alternative versions for 32-bit
    architectures are also supplied for future compatibility with x86.
    Both versions first process whatever they can with wide word types. The
    tail, i.e. the final few bytes that do not fill a dword or qword,
    is then processed byte by byte.
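
The scheme described above could be sketched roughly as follows. This is not the code from `extern.rs`: the names `copy_wide` and `fill_wide` are hypothetical, only the 64-bit path is shown, and the sketch assumes unaligned word accesses through `read_unaligned`/`write_unaligned`, which may differ from how the commit casts and dereferences its pointers.

```rust
use core::mem::size_of;

/// Hypothetical memcpy-style helper: copies `n` bytes from `src` to `dest`,
/// one u64 at a time, then finishes the tail byte by byte.
///
/// # Safety
/// `src` must be valid for reads and `dest` for writes of `n` bytes,
/// and the two regions must not overlap.
pub unsafe fn copy_wide(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    const WORD: usize = size_of::<u64>();
    let words = n / WORD; // number of whole 8-byte groups

    unsafe {
        // Wide part: move 8 bytes per iteration.
        let mut i = 0;
        while i < words {
            let w = src.cast::<u64>().add(i).read_unaligned();
            dest.cast::<u64>().add(i).write_unaligned(w);
            i += 1;
        }

        // Tail: the remaining `n % 8` bytes, copied one by one.
        let mut j = words * WORD;
        while j < n {
            *dest.add(j) = *src.add(j);
            j += 1;
        }
    }
    dest
}

/// Hypothetical memset-style helper: fills `n` bytes at `dest` with `byte`,
/// using a u64 whose eight bytes are all equal to `byte`.
///
/// # Safety
/// `dest` must be valid for writes of `n` bytes.
pub unsafe fn fill_wide(dest: *mut u8, byte: u8, n: usize) -> *mut u8 {
    const WORD: usize = size_of::<u64>();
    let words = n / WORD;
    let pattern = u64::from_ne_bytes([byte; 8]); // `byte` repeated 8 times

    unsafe {
        // Wide part: store 8 identical bytes per iteration.
        let mut i = 0;
        while i < words {
            dest.cast::<u64>().add(i).write_unaligned(pattern);
            i += 1;
        }

        // Tail: the remaining `n % 8` bytes, written one by one.
        let mut j = words * WORD;
        while j < n {
            *dest.add(j) = byte;
            j += 1;
        }
    }
    dest
}
```

A 32-bit variant would presumably follow the same shape with `u32` in place of `u64`. A memmove in this style would additionally need to pick the copy direction when the regions overlap, which this sketch does not cover.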
    
    Here is a comparison for 64 KiB (65536 bytes) on the stack:
    
    x86_64-unknown-linux-gnu: (64-bit)
           | naive (ns) | fast (ns) | speedup (x)
    -------|------------|-----------|------------
    memcpy |   204430   |   32994   |   ~6.20
    memmove|   202540   |   33186   |   ~6.10
    memset |   163391   |   23884   |   ~6.84
    memcmp |   205663   |   34385   |   ~5.98
    
    i686-unknown-linux-gnu: (32-bit)
           | naive (ns) | fast (ns) | speedup (x)
    -------|------------|-----------|------------
    memcpy |   206297   |   66858   |   ~3.09
    memmove|   204576   |   70326   |   ~2.91
    memset |   165599   |   50227   |   ~3.30
    memcmp |   204262   |   70572   |   ~2.89
    
    Copying on the heap behaves similarly.
    
    All tests were performed on an Intel i5 6600K (4x4.2GHz),
    Arch Linux, kernel 4.8.12-3 x86_64.