[gpu_cache] Rework + parallelize cache_queued
Created by: alexheretic
This PR reworks the gpu_cache cache_queued method. It came out of me wondering whether using rayon in rusttype would be beneficial.
- Use a full search for matching glyph textures instead of a "close enough" search. The full search is slower, but results in overall better performance (see benchmarks).
- Split the processing into 2 phases. The first splits the glyphs into already-cached and not-cached; the second adds the new glyph textures. The first phase does not require mutability, so it can easily make use of more CPU cores.
- Use rayon to parallelize phase-1 of cache_queued.
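The three points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Glyph`, `Cache`, `is_cached`, `find_uncached`, `cache_queued_sketch`, `scale_tolerance` and the `threads` parameter are all hypothetical names, and `std::thread::scope` stands in for rayon so the sketch has no dependencies. The point is only that phase 1 needs `&self` (so it can be spread across threads), while phase 2 needs `&mut self` and runs sequentially.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical queued glyph: which font, which glyph, and at what scale.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Glyph {
    font_id: usize,
    glyph_id: u32,
    scale: f32,
}

/// Hypothetical cache: for each (font, glyph) pair, the scales that already have a texture.
struct Cache {
    textures: HashMap<(usize, u32), Vec<f32>>,
    scale_tolerance: f32,
}

impl Cache {
    fn new(scale_tolerance: f32) -> Self {
        Cache { textures: HashMap::new(), scale_tolerance }
    }

    /// Full search: check every stored texture for this glyph against the tolerance.
    /// Takes `&self`, so many threads may call it at once.
    fn is_cached(&self, g: &Glyph) -> bool {
        self.textures
            .get(&(g.font_id, g.glyph_id))
            .map_or(false, |scales| {
                scales.iter().any(|s| (s - g.scale).abs() <= self.scale_tolerance)
            })
    }

    /// Phase 1 (read-only, parallel): collect the queued glyphs with no matching texture.
    fn find_uncached(&self, queue: &[Glyph], threads: usize) -> Vec<Glyph> {
        let threads = threads.max(1);
        let chunk_len = ((queue.len() + threads - 1) / threads).max(1);
        thread::scope(|s| {
            // One scoped thread per chunk; each only reads from `self`.
            let handles: Vec<_> = queue
                .chunks(chunk_len)
                .map(|chunk| {
                    s.spawn(move || {
                        chunk
                            .iter()
                            .copied()
                            .filter(|g| !self.is_cached(g))
                            .collect::<Vec<Glyph>>()
                    })
                })
                .collect();
            handles
                .into_iter()
                .flat_map(|h| h.join().unwrap())
                .collect::<Vec<Glyph>>()
        })
    }

    /// Phase 2 (mutable, sequential): add textures for the uncached glyphs.
    /// Returns how many new textures were added.
    fn cache_queued_sketch(&mut self, queue: &[Glyph], threads: usize) -> usize {
        let uncached = self.find_uncached(queue, threads);
        let mut added = 0;
        for g in uncached {
            // Re-check: an earlier glyph in this batch may already cover this one.
            if !self.is_cached(&g) {
                self.textures
                    .entry((g.font_id, g.glyph_id))
                    .or_default()
                    .push(g.scale);
                added += 1;
            }
        }
        added
    }
}
```

In this sketch, queuing the same glyphs a second time adds nothing, because phase 1 finds a match for every one of them. The re-check in phase 2 is what stops two near-identical glyphs in the same batch from both getting a texture.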
In short, this gives a nice performance boost (x 1.13 to x 1.40 in the benchmarks below) when tested on my 4-core system.
It occurred to me that the current 3 benchmarks may be overly favourable to this change, as they run exactly the same text through the cache each run. So I added 2 more benchmarks to try to capture other cases. bench_moving_text cycles through 3 different text variants for each bench run, and bench_multi_font_population clears the cache after each run, meaning the concurrent phase-1 has no benefit.
In the end both of the new benchmarks saw similarly large speedups with the changes. Even though bench_multi_font_population does not benefit from the parallel search phase, it benefits hugely from the full glyph search.
The changes are split into commits, with the benchmarks added at the beginning so that each change is easy to test.
Adding full search for glyph texture matches + split into 2 phases
| name | control ns/iter | change ns/iter | diff ns/iter | diff % | speedup |
|---|---:|---:|---:|---:|---:|
| gpu_cache::cache_bench_tests::bench_high_position_tolerance | 1,797,074 | 1,768,575 | -28,499 | -1.59% | x 1.02 |
| gpu_cache::cache_bench_tests::bench_moving_text | 4,067,186 | 3,902,118 | -165,068 | -4.06% | x 1.04 |
| gpu_cache::cache_bench_tests::bench_multi_font | 3,613,380 | 3,627,586 | 14,206 | 0.39% | x 1.00 |
| gpu_cache::cache_bench_tests::bench_multi_font_population | 12,002,097 | 9,804,160 | -2,197,937 | -18.31% | x 1.22 |
| gpu_cache::cache_bench_tests::bench_single_font | 3,770,866 | 3,950,332 | 179,466 | 4.76% | x 0.95 |
A big benefit for the population/first-run benchmark, as packing is more efficient once duplicate matching glyph textures are eliminated. A bit of give and take for the other benchmarks, which range from 4.06% faster to 4.76% slower. Importantly though, since this commit adds no parallelism, it shows the changes should not cause a significant performance regression on single-core systems.
Use rayon to spread phase-1 work across all cores (4 on test machine)
| name | control ns/iter | change ns/iter | diff ns/iter | diff % | speedup |
|---|---:|---:|---:|---:|---:|
| gpu_cache::cache_bench_tests::bench_high_position_tolerance | 1,797,074 | 1,594,505 | -202,569 | -11.27% | x 1.13 |
| gpu_cache::cache_bench_tests::bench_moving_text | 4,067,186 | 2,901,750 | -1,165,436 | -28.65% | x 1.40 |
| gpu_cache::cache_bench_tests::bench_multi_font | 3,613,380 | 2,705,355 | -908,025 | -25.13% | x 1.34 |
| gpu_cache::cache_bench_tests::bench_multi_font_population | 12,002,097 | 9,704,296 | -2,297,801 | -19.14% | x 1.24 |
| gpu_cache::cache_bench_tests::bench_single_font | 3,770,866 | 2,811,994 | -958,872 | -25.43% | x 1.34 |
Speedups of x 1.13 to x 1.40 (11-29% less time per iteration) across all benchmarks.
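For reference, the "diff %" and "speedup" columns measure the same change in two different ways, which is why a x 1.40 speedup corresponds to roughly 29% less time per iteration rather than 40%. A small sketch of the arithmetic, with function names that are purely illustrative:

```rust
/// Percentage change in time per iteration; negative means the change is faster.
fn diff_percent(control_ns: u64, change_ns: u64) -> f64 {
    (change_ns as f64 - control_ns as f64) / control_ns as f64 * 100.0
}

/// Speedup factor: how many times faster the change runs than the control.
fn speedup(control_ns: u64, change_ns: u64) -> f64 {
    control_ns as f64 / change_ns as f64
}
```

Using the bench_moving_text figures, `diff_percent(4_067_186, 2_901_750)` is about -28.65 and `speedup(4_067_186, 2_901_750)` is about 1.40.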