Add multithreaded gpu_cache rasterization
This change adds multithreaded rasterization in environments with more than one CPU thread available. This can significantly improve worst-case performance of the gpu_cache, i.e. initialization, resizing, etc.
Gpu cache work
- Glyph rasterization work is generally uneven; one glyph may require much more work than the next.
- The pixel data upload function may only be called by the "main" thread.
- We want to spread out work, and have all cores stay as busy as possible.
- We want to avoid holding all glyph pixel data in memory before passing to upload.
To handle the above I used crossbeam-deque, where n-1 threads are work stealers that rasterize glyphs and send the resulting pixels to the "main" thread. The main thread itself continually rasterizes and uploads its own glyphs, and after each one uploads any work the stealers have completed in the meantime.
In this way all threads are working without blocking, and no more than the necessary pixel data is held in memory before being uploaded.
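The scheme above can be sketched roughly as follows. This is an illustration only, not the code in this change: it uses only the standard library, with a `Mutex<VecDeque>` shared queue and an `mpsc` channel standing in for crossbeam-deque, and `rasterize`, `rasterize_and_upload` and the `upload` callback are hypothetical names.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for glyph rasterization; output size varies per glyph
/// to mimic uneven work.
fn rasterize(glyph_id: u32) -> Vec<u8> {
    let len = (glyph_id as usize % 7 + 1) * 16;
    vec![(glyph_id % 256) as u8; len]
}

/// Rasterizes `glyphs` using `threads` threads in total (the caller plus
/// `threads - 1` workers). `upload` is only ever called on the calling thread.
fn rasterize_and_upload<F: FnMut(u32, Vec<u8>)>(glyphs: Vec<u32>, threads: usize, mut upload: F) {
    // Shared work queue (assumption: a simple locked queue instead of crossbeam-deque).
    let queue = Arc::new(Mutex::new(VecDeque::from(glyphs)));
    let (tx, rx) = mpsc::channel::<(u32, Vec<u8>)>();

    // Spawn n-1 workers that pull glyphs, rasterize them, and send pixels back.
    let workers: Vec<_> = (1..threads.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so rasterization happens without holding the lock.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(id) => {
                        if tx.send((id, rasterize(id))).is_err() {
                            break;
                        }
                    }
                    None => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all workers finish.
    drop(tx);

    // The calling thread also rasterizes, uploading its own result immediately
    // and then any worker results that have arrived in the meantime.
    loop {
        let next = queue.lock().unwrap().pop_front();
        let id = match next {
            Some(id) => id,
            None => break,
        };
        upload(id, rasterize(id));
        for (id, pixels) in rx.try_iter() {
            upload(id, pixels);
        }
    }

    // The queue is empty: wait for and upload whatever the workers still have in flight.
    for (id, pixels) in rx {
        upload(id, pixels);
    }
    for worker in workers {
        worker.join().unwrap();
    }
}
```

Calling `rasterize_and_upload((0..100).collect(), 4, |id, pixels| { /* upload */ })` runs the closure once per glyph, always on the calling thread. Because worker results are drained after every glyph the caller rasterizes, only a small backlog of pixel buffers should be waiting in the channel at any time.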
Most of the gpu_cache benchmarks only rasterize during warmup, but the population, thrashing & resizing benchmarks are significantly improved by multithreading.
Benchmark comparison with a 4-core Haswell
| name | control ns/iter | change ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| cache::multi_font_population | 8,361,106 | 2,704,365 | -5,656,741 | -67.66% | x 3.09 |
| cache_bad_cases::moving_text_thrashing | 21,818,522 | 7,153,560 | -14,664,962 | -67.21% | x 3.05 |
| cache_bad_cases::resizing | 15,417,159 | 4,812,120 | -10,605,039 | -68.79% | x 3.20 |