[gpu_cache] Rework + parallelize cache_queued (!109) · Merge requests · redox-os / rusttype

Jeremy Soller requested to merge alexheretic:master into master May 14, 2018

Created by: alexheretic

This PR reworks the gpu_cache cache_queued method. It came out of me wondering whether using rayon in rusttype would be beneficial.

Use a full search for matching glyph textures instead of a "close enough" search. The full search is slower, but results in overall better performance (see benchmarks).
Split the processing into 2 phases. The first splits the glyphs into already-cached and not-cached; the second adds new glyph textures. The first phase does not require mutability, so it can easily make use of more CPU cores.
Use rayon to parallelize phase 1 of cache_queued.
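
The two-phase split can be sketched roughly as below. This is a toy illustration under stated assumptions, not the code in this PR: `ToyCache`, `GlyphKey`, `find_uncached` and `insert_all` are hypothetical names, the real cache tracks texture rectangles rather than slot numbers, and `std::thread::scope` stands in for rayon so the sketch needs no external crates.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for a queued glyph: (font id, glyph id).
pub type GlyphKey = (usize, u32);

/// Toy cache mapping a glyph to the texture slot it occupies.
#[derive(Default)]
pub struct ToyCache {
    slots: HashMap<GlyphKey, usize>,
}

impl ToyCache {
    /// Phase 1: classify queued glyphs. This only reads the cache, so it
    /// takes `&self` and can be shared across worker threads.
    pub fn find_uncached(&self, queue: &[GlyphKey], workers: usize) -> Vec<GlyphKey> {
        let workers = workers.max(1);
        // Ceiling division, with a minimum of 1 because `chunks(0)` panics.
        let chunk_len = ((queue.len() + workers - 1) / workers).max(1);
        thread::scope(|scope| {
            // Spawn every worker before joining any of them.
            let handles: Vec<_> = queue
                .chunks(chunk_len)
                .map(|chunk| {
                    scope.spawn(move || {
                        chunk
                            .iter()
                            .copied()
                            .filter(|key| !self.slots.contains_key(key))
                            .collect::<Vec<GlyphKey>>()
                    })
                })
                .collect();
            // Join in spawn order so the queue order is preserved.
            handles
                .into_iter()
                .flat_map(|handle| handle.join().expect("worker panicked"))
                .collect::<Vec<GlyphKey>>()
        })
    }

    /// Phase 2: add the missing glyphs. This mutates the cache, so it runs
    /// serially with `&mut self`. Returns how many entries were added.
    pub fn insert_all(&mut self, uncached: Vec<GlyphKey>) -> usize {
        let mut added = 0;
        for key in uncached {
            let next_slot = self.slots.len();
            // The same glyph may be queued twice; store it only once.
            if let Entry::Vacant(entry) = self.slots.entry(key) {
                entry.insert(next_slot);
                added += 1;
            }
        }
        added
    }

    /// Runs both phases and returns the number of newly cached glyphs.
    pub fn cache_queued(&mut self, queue: &[GlyphKey], workers: usize) -> usize {
        let uncached = self.find_uncached(queue, workers);
        self.insert_all(uncached)
    }

    /// Empties the cache, as a population-style benchmark would between runs.
    pub fn clear(&mut self) {
        self.slots.clear();
    }
}
```

With a warm cache, phase 1 does nearly all the work and phase 2 has nothing to insert, which is where extra cores help. After `clear()`, every glyph falls through to the serial phase 2, so the parallel phase buys little.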

In short, this gives quite a nice performance boost when tested on my 4-core system.

It occurred to me that the current 3 benchmarks may be overly favourable to this change, as they run exactly the same text through the cache each run. So I added 2 more benchmarks to try to capture other cases: bench_moving_text cycles through 3 different text variants for each bench run, and bench_multi_font_population clears the cache after each run, meaning the concurrent phase 1 has no benefit.

In the end, both new benchmarks saw similarly large speedups with the changes. Even though bench_multi_font_population does not benefit from the parallel search phase, it benefits hugely from the full glyph search.

The changes are split into commits, with the benchmark added at the beginning so the changes are easy to test.

Adding full search for glyph texture matches + split into 2 phases

name                                                         control ns/iter  change ns/iter  diff ns/iter   diff %  speedup 
gpu_cache::cache_bench_tests::bench_high_position_tolerance  1,797,074        1,768,575            -28,499   -1.59%   x 1.02 
gpu_cache::cache_bench_tests::bench_moving_text              4,067,186        3,902,118           -165,068   -4.06%   x 1.04 
gpu_cache::cache_bench_tests::bench_multi_font               3,613,380        3,627,586             14,206    0.39%   x 1.00 
gpu_cache::cache_bench_tests::bench_multi_font_population    12,002,097       9,804,160         -2,197,937  -18.31%   x 1.22 
gpu_cache::cache_bench_tests::bench_single_font              3,770,866        3,950,332            179,466    4.76%   x 0.95

A big benefit for the population/first-run benchmark, as packing is more efficient once duplicate matching glyph textures are eliminated. A bit of give and take for the other benchmarks (from about 4% faster to about 5% slower). Importantly, though, this shows the changes should not cause a performance regression on single-core systems.
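
To illustrate how a partial lookup can leave duplicate textures behind, here is a toy comparison. Everything in it is an assumption made for the example: the `Key` layout, the tolerances, and in particular `neighbour_search`, which is only a hypothetical stand-in for a "close enough" search and is not taken from rusttype.

```rust
/// Hypothetical glyph key: (scale, sub-pixel offset), both in hundredths.
pub type Key = (u32, u32);

// Assumed tolerances within which a cached texture can be reused.
const SCALE_TOL: u32 = 20;
const OFFSET_TOL: u32 = 10;

/// True when a cached texture for `a` is acceptable for drawing `b`.
fn matches(a: Key, b: Key) -> bool {
    a.0.abs_diff(b.0) <= SCALE_TOL && a.1.abs_diff(b.1) <= OFFSET_TOL
}

/// "Close enough" stand-in: inspect only the entries adjacent to the
/// position where `query` would sit in the sorted list.
pub fn neighbour_search(sorted: &[Key], query: Key) -> Option<Key> {
    let idx = sorted.partition_point(|k| *k < query);
    let before = idx.checked_sub(1).and_then(|i| sorted.get(i));
    let after = sorted.get(idx);
    before
        .into_iter()
        .chain(after)
        .copied()
        .find(|k| matches(*k, query))
}

/// Full search: test every cached entry against the tolerances.
pub fn full_search(sorted: &[Key], query: Key) -> Option<Key> {
    sorted.iter().copied().find(|k| matches(*k, query))
}

/// Counts how many textures end up stored after queuing `queries`,
/// adding a new texture whenever `search` finds no reusable one.
pub fn textures_stored(queries: &[Key], search: fn(&[Key], Key) -> Option<Key>) -> usize {
    let mut cached: Vec<Key> = Vec::new();
    for &query in queries {
        if search(&cached, query).is_none() {
            // Keep the list sorted so `partition_point` stays valid.
            let idx = cached.partition_point(|k| *k < query);
            cached.insert(idx, query);
        }
    }
    cached.len()
}
```

For the queries `(1000, 0)`, `(1005, 90)`, `(1010, 5)`, the third is within tolerance of the first, but its only sorted neighbour is the second, which is not. The neighbour lookup therefore stores a third texture where the full search reuses the first, and fewer stored textures leave more room when packing.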

Use rayon to spread phase-1 work across all cores (4 on test machine)

name                                                         control ns/iter  change ns/iter  diff ns/iter   diff %  speedup 
gpu_cache::cache_bench_tests::bench_high_position_tolerance  1,797,074        1,594,505           -202,569  -11.27%   x 1.13 
gpu_cache::cache_bench_tests::bench_moving_text              4,067,186        2,901,750         -1,165,436  -28.65%   x 1.40 
gpu_cache::cache_bench_tests::bench_multi_font               3,613,380        2,705,355           -908,025  -25.13%   x 1.34 
gpu_cache::cache_bench_tests::bench_multi_font_population    12,002,097       9,704,296         -2,297,801  -19.14%   x 1.24 
gpu_cache::cache_bench_tests::bench_single_font              3,770,866        2,811,994           -958,872  -25.43%   x 1.34

Speedups of 1.13x to 1.40x (11-29% less time per iteration) across all benchmarks.
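
For reading the tables, the derived columns appear to follow diff = change - control, diff % = diff / control, and speedup = control / change. That is my assumption about how the comparison tool computes them, though it does reproduce the rows above. A minimal sketch, with `compare` being a hypothetical helper name:

```rust
/// Returns (diff ns/iter, diff %, speedup) for one benchmark row,
/// assuming the column definitions described above.
pub fn compare(control_ns: u64, change_ns: u64) -> (i64, f64, f64) {
    // Negative diff means the change is faster than the control.
    let diff = change_ns as i64 - control_ns as i64;
    let diff_pct = diff as f64 / control_ns as f64 * 100.0;
    // Speedup is a ratio of times, so it is not simply 1 - diff %.
    let speedup = control_ns as f64 / change_ns as f64;
    (diff, diff_pct, speedup)
}
```

For bench_moving_text with rayon, `compare(4_067_186, 2_901_750)` gives a diff of -1,165,436 ns, roughly -28.65%, and a speedup of roughly 1.40. This is why a 40% speedup corresponds to about 29% less time per iteration.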

