Draft: Coprocesses
This sounds like an exciting concept. I don't have time to go through this completely right now, but I will throw a couple of questions at you. Ignore them if they are already answered in the RFC.
- Can we trap the WRPKRU instruction (writes to PKRU) so the user code can't self-modify its way around our checks?
- How applicable is this to all the current ISAs? RISC-V, ARM, any others?
- Is this a "bonus" feature, as in, can it be turned off with no implications to anything other than performance?
- Is it mainly for schemes, like RedoxFS+NVMe, mainly for user applications, or other? I worry about the safety of running RedoxFS in the same space as a user application. On the other hand, making RedoxFS a crate and building it into the NVMe driver seems like something that could be done easily. I have a whole train of thought on this aspect of RedoxFS.
- Is this mainly a performance enhancement? The big thing with microkernels is that the performance vs security trade-off is a very explicit calculation: a 10x performance improvement means 10x lower hardware cost, but how many $$$ for the loss of security?
- How much value does this provide if we have smart scheduling? Can we optimize Redox to the point where this is not worth doing? Or is this an important concept in its own right that will have value beyond optimization?
A couple of answers:
- Writes to PKRU can't be trapped performantly, so the strategy is instead to ensure that (1) executable pages are read-only, and (2) only redox-rt in master mode is allowed to make existing mappings executable. An exception can be made if a privileged party first checks that the memory does not contain WRPKRU.
- RISC-V does not appear to support protection keys (except through nonstandard extensions), and neither does x86-32. AArch64 supports the Permission Overlay Extension, which appears to be even more powerful than x86's interface. On x86-32, this could be implemented using segmentation, but that is probably not worth the effort.
- Coprocesses would be a strictly opportunistic optimization. This is both because not all architectures support it, and because on both x86_64 and aarch64, it's an optional extension. A useful analogy is a 16-way associative cache with M sets; the set is chosen by subsystem (process), and each set contains 16 ways (coprocesses). Switching between sets is expensive, whereas switching between ways is much cheaper. In the TLB, the CPU can maintain the pages from the entire M×N matrix of "coprocess address subspaces" (the subset of the coprocess's process address space that matches the coprocess's tag). If the CPU supports protection keys then `N=16`, otherwise `N=1`. Similarly, if the CPU supports process-context identifiers, `1 <= M <= 1024`, otherwise `M=1` as well.
- This is mainly for combining programs in the same subsystem, like the net stack, disk stack, graphics stack, etc. It could possibly also apply to user-level programs, e.g. for browser tab isolation, but it's possible they're already using protection keys for that purpose. As for cross-privilege processes, this could work in theory, but would need strong confidentiality and integrity guarantees, and reasonable availability guarantees. Whether this RFC makes sense in those scenarios will require further investigation.
Update: It appears protection keys offer an equivalent level of security against Meltdown-type exploits, if the CPU is already Meltdown-immune. It should thus be possible to use this to isolate any program that is compatible with the runtime restrictions, namely that it is position-independent and doesn't contain the WRPKRU instruction. This generalizes into the graph-clique problem with `k=15` on the "IPC graph". A protection key would be a color, and calling a separate address subspace would require a separate color.
- This is an improvement to the performance/security "tradeoff curve". Specifically, provided there's a reasonable level of trust between same-subsystem programs, one can achieve performance by ditching memory protection entirely and using threads. It's also possible to use the status quo model, or to improve security further through virtualization. This optimization is about retaining a level of security similar to that of separate-process programs while significantly improving performance, by imposing (light) restrictions on the programs' environment.
- Context switch latency is fundamentally bounded by hardware latency, which is itself largely due to page table switching latency. Optimistically, this optimization can cut that latency by an order of magnitude.
There are several possible optimizations that can bring latency closer to the hardware limit, and several for improving indirect latency (e.g. process-context identifiers), but for regular non-async userspace, context switch latency will probably continue to be a significant variable controlling overall performance.
A potential superintelligent scheduler would be able to order processes to minimize unnecessary switches, as would improved queueing, but switch latency will still be significant.
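For illustration, the check mentioned in the first answer (verifying that memory does not contain WRPKRU before it is made executable) could be sketched roughly as below. This is a minimal sketch and not what the RFC or redox-rt implements: `find_wrpkru` is a hypothetical name, and the only hardware fact assumed is the x86 encoding of WRPKRU, `0F 01 EF`. It scans at every byte offset rather than decoding instructions, so it also rejects the sequence when it straddles instruction boundaries or sits inside an immediate, at the cost of false positives. Other ways of writing PKRU, such as XRSTOR, are not covered.

```rust
/// x86 encoding of the WRPKRU instruction.
const WRPKRU: [u8; 3] = [0x0F, 0x01, 0xEF];

/// Hypothetical helper: returns the offset of the first WRPKRU byte sequence
/// in `code`, or `None` if the sequence does not occur at any byte offset.
///
/// The scan is deliberately conservative. Because x86 instructions have
/// variable length, a jump into the middle of an instruction could decode
/// these bytes as WRPKRU, so every offset is checked.
fn find_wrpkru(code: &[u8]) -> Option<usize> {
    code.windows(WRPKRU.len())
        .position(|window| window == &WRPKRU[..])
}
```

Under these assumptions, `find_wrpkru(&[0x90, 0x0F, 0x01, 0xEF])` returns `Some(1)`, and a privileged party would refuse to make such a mapping executable.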
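To make the N=16 figure in the third answer a bit more concrete: on x86_64, PKRU is a 32-bit register with two bits per protection key (access-disable in the even bit, write-disable in the odd bit), which is where the 16 keys come from. A switch between coprocesses then amounts to loading a different PKRU value. The helper below is a hypothetical sketch of how such a value might be computed; `pkru_allowing` and the policy of denying every key by default are assumptions, not something the RFC specifies.

```rust
/// Number of protection keys on x86_64 (4-bit key field in each page table entry).
const NUM_KEYS: u32 = 16;

/// Access-disable (AD) bit set for every key: bits 0, 2, 4, ..., 30.
const DENY_ALL: u32 = 0x5555_5555;

/// Hypothetical helper: builds a PKRU value that permits reads and writes
/// only to pages tagged with one of `keys`, and denies access to all others.
fn pkru_allowing(keys: &[u32]) -> u32 {
    let mut pkru = DENY_ALL;
    for &key in keys {
        assert!(key < NUM_KEYS, "protection key out of range");
        // Clear both AD (bit 2*key) and WD (bit 2*key + 1) for this key.
        pkru &= !(0b11u32 << (2 * key));
    }
    pkru
}
```

For example, `pkru_allowing(&[0, 3])` yields `0x5555_5514`: keys 0 and 3 are fully accessible and the other 14 keys are access-disabled.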
Edited by Jacob Lorentzon
mentioned in issue kernel#160
added 2 commits
added 1 commit
- 8c39a479 - Clarify how capability pages would interact with protection keys.
added 1 commit
- 0e81d0d2 - Describe how dynamic coprocesses could work.
added 2 commits
added 1 commit
- cc298519 - Mention the critical x86 variable-instr flaw.