# kernel issues
https://gitlab.redox-os.org/redox-os/kernel/-/issues

## #148 Support generic interrupts on arm64 (bjorn3, 2024-03-17)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/148

See the thread starting at https://matrix.to/#/%23redox-dev%3Amatrix.org/%24hVX4XI4x2tbwbJCSl1vQ5wx0C7GOBNGfWTmXAtW6e3c?via=mozilla.org&via=matrix.org&via=artifact8.xyz&via=envs.net. In short, only timer and serial interrupts currently work. The `irq:` scheme required by most PCI drivers (and other drivers) is non-functional. This needs to be implemented for each of the three IRQ chips supported by the kernel.

## #147 Support huge pages (Jacob Lorentzon, 2024-03-16)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/147

Huge pages are heavier, slower to CoW, and possibly waste memory, including about 0.4% wasted on unused `PageInfo`s. But they do almost always reduce TLB overhead, and most importantly, they require far fewer page table mappings and flushes than small pages. That means they can potentially, in some cases, be a huge improvement (512x for 2 MiB pages) in IPC latency and, to a lesser extent, throughput. 1 GiB pages may also allow (recently-used) physical addresses to hopefully always reside in at least the L2 DTLB in kernel mode, speeding up e.g. copying of pages.
Jeremy measured that the optimal buffer size for throughput (for most schemes, including redoxfs) was 4 MiB, with larger sizes being slower due to mapping/flushing and possibly TLB overhead.
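A quick sanity check of the figures quoted above; the 16-byte `PageInfo` size is an assumption for illustration, not the kernel's actual struct size:

```rust
// Rough arithmetic behind the numbers in this issue.
fn main() {
    let small: u64 = 4096; // 4 KiB base page
    let huge_2m: u64 = 2 * 1024 * 1024; // 2 MiB huge page

    // One 2 MiB mapping replaces this many 4 KiB mappings,
    // hence the quoted "512x" reduction in mappings/flushes.
    let ratio = huge_2m / small;
    assert_eq!(ratio, 512);

    // Assuming (hypothetically) a 16-byte PageInfo per 4 KiB frame,
    // bookkeeping costs 16/4096 ≈ 0.4% of memory, which sits unused
    // for frames covered by a huge page.
    let overhead_percent = 16.0 / 4096.0 * 100.0;
    println!("{ratio}x fewer mappings, {overhead_percent:.1}% PageInfo overhead");
}
```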
Worth noting that AArch64 supports two additional standard page sizes, 16 KiB and 64 KiB, which are more efficient for some if not most workloads; 16 KiB could maybe even be a better default. Ironically, Zen 3+ AMD CPUs also support a 16 KiB page size, by merging any 4 virtually contiguous pages that are physically contiguous and naturally 16 KiB-aligned into a single 16 KiB TLB entry. Although it would waste 4x as much page table memory, that might also be worth looking into.

## #146 Implement CPU softlockup and hardlockup detection (Ribbon, 2024-03-15)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/146

https://www.kernel.org/doc/html/latest/admin-guide/lockup-watchdogs.html

## #145 Support process-context identifiers (Jacob Lorentzon, 2024-03-11)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/145

Page table switching is a significant part of context switch overhead, which process-context identifiers help reduce. With [TLB shootdown](https://gitlab.redox-os.org/redox-os/kernel/-/merge_requests/282) now properly implemented, it would be a natural extension to add a per-CPU queue of `Weak<AddrSpaceWrapper>`, and retain the address space's CPU users bits as long as they are still in that queue. PCIDs would only be used for userspace mappings, as the Redox kernel memory layout makes a clear distinction between user and kernel addresses: an address space is user-accessible if and only if it is in the lower half, and if and only if it is non-Global.
Because this may impact TLB shootdown performance, it would need to be benchmarked thoroughly before being enabled, at least by default.

## #144 Allow the kernel to log a panic message while the logger is still locked (bjorn3, 2024-02-27)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/144

Normally you can't log anything while the logger is locked, to avoid multiple log messages getting interleaved. When panicking, the logger may never be unlocked, as you may be panicking on the same kernel thread that holds the lock. Getting the panic message interleaved with other messages is much better than hanging without any panic message getting printed at all.

## #142 Kernel as a user-space process (Ribbon, 2023-12-14)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/142

Allow the kernel to run in user-space to improve debugging, like [NetBSD](https://en.wikipedia.org/wiki/Rump_kernel) and [DragonFlyBSD](https://www.dragonflybsd.org/docs/handbook/vkernel/) do.

## #140 Add a system call to delete a namespace (Ron Williams, 2024-03-18)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/140

There does not seem to be a system call to remove a namespace after use. Contain creates namespaces and should delete them when the user session completes.

## #139 Save kernel memory after panic (Ribbon, 2023-12-14)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/139
On Linux, the [kdump](https://en.wikipedia.org/wiki/Kdump_(Linux)) mechanism is used to save kernel memory to a file after a panic (a crash dump); this way, the kernel memory snapshot can be analyzed later for debugging.
This method gives more information than crash logs.

## #138 Support restartable sequences (Jacob Lorentzon, 2023-10-29)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/138

Restartable sequences are available on Linux, and would allow better spinlock performance, and possibly make it easier to move parts of the futex API to userspace (because atomic hashmaps are hard without using spinlocks at least *somewhere*).
This would likely be achieved by userspace providing its TCB page to the kernel. Such a page may also store the sigprocmask, and possibly the pending mask/signal arguments, if most of signal handling is moved to userspace.

## #134 Move address space virtual address range allocation to userspace (Jacob Lorentzon, 2023-10-31)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/134

There's currently a lot of code in the kernel dealing solely with managing user address space virtual address allocation. A more minimal kernel would only store the grants and their ranges, which would (1) allow userspace to implement guard pages, (2) remove the need for `mmap_min`, and (3) simplify mmap and similar operations, so that they always behave as `MAP_FIXED_NOREPLACE`.

## #132 Boot time scales inversely with the number of CPUs (Jacob Lorentzon, 2023-07-13)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/132

This can be checked by comparing the boot time (context switch and syscall heavy) when setting QEMU's `-smp` to 1, 4, or 16.
The global context switch lock, combined with every processor being preempted at exactly the same time (the BSP sends out IPIs when there are PIT ticks), and the context being locked numerous times per syscall, might be the primary causes of this slowdown.

## #131 Support 32-bit userspace when using 64-bit kernels (Jacob Lorentzon, 2023-07-08)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/131

https://gitlab.redox-os.org/redox-os/kernel/-/issues/130 needs to be fixed, and there probably need to be two permanent GDT entries, for FS and GS.
It might be possible to allow enabling/disabling compatibility mode at compile time.

## #130 x86 segment registers are not saved/restored (Jacob Lorentzon, 2023-08-08)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/130

On x86, segment registers are currently not saved and restored when context switching. Userspace is capable of loading any available selector value into a segment register, but since:
- CS is immutable, everything but GDT_USER_CODE #GPs
- SS (will be) immutable, can only be set to GDT_USER_DATA
- DS, ES, FS, and GS can each be either NULL or GDT_USER_DATA,
at most four bits of data can be leaked between contexts when switching.

## #129 Support syscall6 (Jacob Lorentzon, 2023-10-29)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/129

Redox currently supports syscall0..=syscall5, i.e. rax+rdi+rsi+rdx+r10+r8, but some future syscalls like preadv2/pwritev2 (and futex?) on 32-bit architectures would need e.g. SYS_PWRITEV2+fd+addr+len+off_lo+off_hi+flags, i.e. 7 args.
The registers Linux uses for that are:
- x86_64: rax, rdi, rsi, rdx, r10, r8, r9
- x86_32: eax, ebx, ecx, edx, esi, edi, ebp
- aarch64: x8, x0, x1, x2, x3, x4, x5, x6
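The off_lo/off_hi convention mentioned above, where a 64-bit file offset is split across two 32-bit argument slots, can be sketched like this (a `syscall7` wrapper itself is hypothetical; Redox currently stops at syscall5):

```rust
// Sketch: passing a 64-bit offset through two 32-bit syscall argument
// slots, as a hypothetical 7-arg pwritev2 on a 32-bit target would.
fn split_offset(off: u64) -> (u32, u32) {
    (off as u32, (off >> 32) as u32) // (off_lo, off_hi)
}

fn join_offset(lo: u32, hi: u32) -> u64 {
    (hi as u64) << 32 | lo as u64
}

fn main() {
    let off: u64 = 0x1_2345_6789;
    let (lo, hi) = split_offset(off);
    assert_eq!((lo, hi), (0x2345_6789, 0x1));
    // The kernel side reassembles the full offset losslessly.
    assert_eq!(join_offset(lo, hi), off);
}
```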
Might be worth looking into whether supporting full-width syscall return values on x86/x86_64, by using the carry flag to signal errors, improves performance (the BSDs do this IIRC).

## #124 Implement x86 security mitigations (Jacob Lorentzon, 2024-03-16)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/124

Here's the list, based on the x86 CPU vulnerabilities that Linux's lscpu prints. IIRC some of these only require updated microcode (but Redox doesn't currently support microcode updates).
- [ ] Spec store bypass (add IA32_SPEC_CTRL to context state)
- [ ] Spectre v1
- [ ] usercopy lfence barriers
- [ ] swapgs lfence barriers
- [ ] race condition induced Spectre (Ghostrace)
- [ ] etc...
- [ ] Spectre v2
- [ ] Retpolines
- [ ] RSB filling on context switches
- [ ] etc...
- [ ] Meltdown (PTI - unfinished)
- [ ] Retbleed - https://lwn.net/Articles/901834/, https://lwn.net/Articles/907054/
- [ ] Mmio stale data
- [ ] Mds
- [x] L1tf (VMM) - does not affect the Redox kernel... yet (no hypervisor support).
- [x] L1tf (OS) - `Frame`s are statically enforced not to be 0x0, and RMM is clearing page entries to zero (though it could be enforced better: https://gitlab.redox-os.org/redox-os/rmm/-/issues/3)
- [ ] Itlb multihit - does not yet affect the Redox kernel... but once hypervisor support is added, ensure that large/huge pages are not executable on vulnerable CPU models.
- [ ] Srbds - requires microcode update (mitigation can be disabled via MSRs)
- [ ] Tsx async abort - requires microcode update, Linux defaults to disabling TSX entirely in that case
- [ ] Gather data sampling ("DOWNFALL") - requires microcode update. TODO: anything else?
- [ ] RAS overflow ("INCEPTION") - requires microcode update too. TODO: anything else?
- [ ] Register File Data Sampling (only affects Intel Atom though)
Some other useful security-enhancing x86 features less related to side channels:
- [x] UMIP (trivial to add support for)
- [x] SMEP (also trivial) - apparently related to RSB filling
- [x] SMAP (will require [usercopy functions](https://gitlab.redox-os.org/redox-os/kernel/-/issues/115), hard)
- [ ] Protection keys
- [ ] Shadow stacks
It would most likely be wise to prioritize vulnerabilities affecting newer CPUs first, most notably Spec Store Bypass and Spectre V1/V2, then continue with Retbleed and Meltdown, and lastly the Intel-specific, mostly-patched bugs (MDS, L1TF, TSX, MMIO stale data, SRBDS).
Redox also needs to implement microcode loading, which can probably be done from userspace.

## #114 Allow splitting and merging (all) grants (Jacob Lorentzon, 2024-03-16)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/114

Currently, some parts of the kernel assume that the base address alone is enough to identify a grant. However, grants are memory regions with both a base and a size, and the ability to merge grants that are contiguous and have identical attributes would reduce fragmentation and be more correct.
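The merge criterion described above (contiguous, identical attributes) could be sketched as follows; the `Grant` struct and its fields here are simplified stand-ins, not the kernel's actual types:

```rust
// Hypothetical, simplified grant: a contiguous virtual region with
// attribute flags. Not the kernel's actual representation.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Flags(u32);

#[derive(Clone, Copy, Debug)]
struct Grant {
    base: usize,
    size: usize,
    flags: Flags,
}

impl Grant {
    /// Two grants may merge iff they are contiguous and share attributes.
    fn try_merge(self, next: Grant) -> Option<Grant> {
        if self.base + self.size == next.base && self.flags == next.flags {
            Some(Grant { base: self.base, size: self.size + next.size, flags: self.flags })
        } else {
            None
        }
    }
}

fn main() {
    let a = Grant { base: 0x1000, size: 0x2000, flags: Flags(0b101) };
    let b = Grant { base: 0x3000, size: 0x1000, flags: Flags(0b101) };
    let merged = a.try_merge(b).unwrap();
    assert_eq!((merged.base, merged.size), (0x1000, 0x3000));
    // Non-contiguous regions (or differing flags) do not merge.
    assert!(merged
        .try_merge(Grant { base: 0x5000, size: 0x1000, flags: Flags(0b101) })
        .is_none());
}
```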
~~Currently, the primary blocker is the current `UserScheme` code.~~
As of https://gitlab.redox-os.org/redox-os/kernel/-/merge_requests/238, the simplest `Allocated` grants are mergeable, but the remaining grant types (AllocatedShared, External, FmapBorrowed, and PhysBorrowed) need to be mergeable too.

## #113 Consider moving most of signal handling to userspace (Jacob Lorentzon, 2023-07-15)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/113

Currently, signal handling is done mostly in the kernel, and lacks important features (such as [sending arguments to signal handlers](https://gitlab.redox-os.org/redox-os/kernel/-/issues/105)). Additionally, the signal trampoline is done in the kernel, where the entire kernel stack is copied, which is ugly and probably a bit UB.
It would be possible to put the sigprocmask and pending mask in shared memory. Thus, while `SYS_KILL`ers may need exclusive access to that shared memory, the sigprocmask and pending mask can be accessed using atomics, by storing sigmask[i] and pending[i] striped in the same atomic word.

## #112 Single-stack kernel (Jacob Lorentzon, 2023-08-10)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/112

Currently, each context has a 64 KiB stack, which is switched to and from during context switches. However, when switching from user mode (timer interrupts), the kernel stack is by definition empty (before the interrupt). The kernel stack is only populated before switches when switching from within the kernel, which is almost always while waiting for e.g. a scheme operation to complete, or a futex. By switching, state (i.e. local variables) is conveniently restored when the awaited event completes. Usually, what is done after it completes is relatively simple. For example, [here](https://gitlab.redox-os.org/redox-os/kernel/-/blob/cb58500b684ca86c563cfa026dcb7bd522717ed8/src/syscall/fs.rs#L80).
However, one could argue that 64 KiB is too much state for regular scheme ops/futexes/pipes/signal queues, and that instead there could be a state enum in `ContextStatus::Blocked` that stores the local variables used at each wait point. I'm not sure how feasible this would be to change in practice, but it would simplify context switching a lot. (Sidenote: although unnecessary due to the simplicity of most Redox syscall handlers, async/await could be used to manage state across such wait points.)
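The state-enum idea could be sketched roughly like this; every type, variant, and field name here is hypothetical, not the kernel's actual code:

```rust
// Sketch: instead of a 64 KiB kernel stack per context, a blocked
// context carries an enum holding only the locals needed to resume at
// its wait point. All names are illustrative.
enum BlockedState {
    /// Waiting for a scheme op; remember the handle and the user buffer.
    SchemeOp { file: usize, user_buf: (usize, usize) },
    /// Waiting on a futex word at this address.
    Futex { addr: usize },
    /// Waiting for readable data on a pipe.
    PipeRead { pipe_id: usize },
}

enum ContextStatus {
    Runnable,
    Blocked(BlockedState),
}

fn resume(status: &ContextStatus) -> &'static str {
    // On wakeup, the scheduler dispatches on the stored state instead
    // of switching back onto a saved per-context kernel stack.
    match status {
        ContextStatus::Runnable => "already runnable",
        ContextStatus::Blocked(BlockedState::SchemeOp { .. }) => "complete scheme op",
        ContextStatus::Blocked(BlockedState::Futex { .. }) => "recheck futex word",
        ContextStatus::Blocked(BlockedState::PipeRead { .. }) => "copy out pipe data",
    }
}

fn main() {
    let ctx = ContextStatus::Blocked(BlockedState::Futex { addr: 0xdead_b000 });
    assert_eq!(resume(&ctx), "recheck futex word");
}
```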
I'm not sure if this will work before the ugly signal stack-switching code is removed. That probably needs to be [fixed first](https://gitlab.redox-os.org/redox-os/kernel/-/issues/113).
Each context uses a few hundred bytes AFAIK, and with FXSAVE/FXRSTOR state, round that up to the 4 KiB page size. Removing the kernel stack would thus reduce the kernel's per-context memory usage from 64+4 KiB to 4 KiB, i.e. by 17 times. It would also reduce kernel UB (like storing kernel stack bytes in a regular Vec), and make the kernel more like a regular program (1:1 between CPUs and kernel stacks, i.e. what Rust calls "threads"; the only difference is the userspace pages). There will obviously be at least one stack per CPU (and probably more by using the x86_64 IST), but the number of stacks won't scale with the number of contexts.
This idea is called an "event-based kernel" in seL4 literature (https://dl.acm.org/doi/10.1145/2517349.2522720).

## #110 Moving namespace functionality to userspace (Jacob Lorentzon, 2024-03-08)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/110

We can move namespace functionality to relibc (while obviously preserving security). The kernel root scheme will be replaced by a scheme that only gives out anonymous scheme sockets. Userspace will implement `:` instead, as a scheme where namespaces are file descriptors, and where the usual `open(":name")` registers that name and fd-forwards the kernel-provided anonymous fd. FD forwarding will also allow insertion and removal of schemes from namespaces, with great flexibility.
Relibc will have a global variable called ACTIVE_NS, containing a namespace fd, and possibly more namespace fds as well. This eliminates getens/getrns/setrens/makens. Prefixes would be parsed in relibc's open(3); scheme access would be obtained through openat(ns, scheme_name) (possibly cached), and openat(scheme_access, path) would do the rest. The idea is that both the namespace and the scheme access will be fd-based "capabilities".
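The prefix-parsing half of that open(3) flow could be sketched as below; the two-step openat resolution in the comments is this issue's proposal, not an existing API (SYS_OPENAT does not exist yet):

```rust
// Illustrative sketch of relibc-side prefix parsing for the proposed
// fd-based namespace design. Splitting "scheme:path" reflects Redox
// path syntax; the resolution steps in the comments are the proposal.
fn split_scheme(path: &str) -> (Option<&str>, &str) {
    match path.split_once(':') {
        Some((scheme, rest)) => (Some(scheme), rest),
        None => (None, path),
    }
}

fn main() {
    // open("file:/etc/passwd") would conceptually become:
    //   scheme_fd = openat(ACTIVE_NS, "file")    // possibly cached
    //   fd        = openat(scheme_fd, "/etc/passwd")
    assert_eq!(split_scheme("file:/etc/passwd"), (Some("file"), "/etc/passwd"));
    // A path without a prefix stays within the default scheme.
    assert_eq!(split_scheme("relative/path"), (None, "relative/path"));
}
```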
This would require SYS_OPENAT to be implemented, and the libredox migration (https://gitlab.redox-os.org/redox-os/libredox/-/issues/1) needs to be completed first (because this will break `syscall::open`).

## #105 Support passing arguments to signal handlers, to comply with POSIX (Jacob Lorentzon, 2023-06-27)
https://gitlab.redox-os.org/redox-os/kernel/-/issues/105

Currently, the only user-visible argument passed to a signal handler when entering user mode from the kernel is the signal number. As far as I am aware, POSIX requires that signal handlers specified via `sa_sigaction` also take two additional arguments: `siginfo_t *info` and `void *context` (which technically is `ucontext_t *context`). The lack of `sa_sigaction` requires conditional compilation to support Redox, e.g. [jobserver-rs#12](https://github.com/alexcrichton/jobserver-rs/pull/12), which is not ideal given that Redox belongs to the `#[cfg(unix)]` target family in `libstd`.
Implementing this would not be particularly hard; it should be as simple as pushing the required structures to the stack, and then passing pointers to them when calling the usermode code. Alternatively, we could let the kernel deviate from the POSIX spec, and instead use a libc-level wrapper that registers the actual signal handler with the kernel, and then calls the `sa_sigaction` field with the POSIX structures.
Another interesting point is whether we could use all of the available System V registers (or all registers, with our own calling convention, which could also be a long-term optimization in the syscall handlers: they have to push the preserved registers once to comply with ptrace's `InterruptStack`, and then again because the functions they call have no idea that the caller does not care about the preserved registers, since they are already saved). This could be used to implement seL4-like synchronous IPC, despite the obvious drawbacks of re-entrant and asynchronous signals.
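The libc-level wrapper alternative could be modeled like this; the types and the handler table are drastically simplified stand-ins, not relibc's actual implementation:

```rust
// Simplified model of the proposed libc-level wrapper: the kernel only
// delivers a signal number, and libc synthesizes the POSIX siginfo_t /
// context arguments before invoking the user's sa_sigaction handler.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct SigInfo {
    si_signo: i32,
    si_pid: i32, // sender, if known
}

type SigAction = fn(i32, &SigInfo, *mut ());

/// Stand-in for libc's per-process handler table.
struct Registry {
    handlers: [Option<SigAction>; 64],
}

impl Registry {
    fn new() -> Self {
        Registry { handlers: [None; 64] }
    }
    /// What a sigaction(2) wrapper would store; the kernel itself only
    /// ever learns about `kernel_entry` below.
    fn register(&mut self, signo: i32, act: SigAction) {
        self.handlers[signo as usize] = Some(act);
    }
    /// The entry point registered with the kernel: it receives only the
    /// signal number and builds the POSIX arguments itself.
    fn kernel_entry(&self, signo: i32) {
        let info = SigInfo { si_signo: signo, si_pid: 0 };
        if let Some(h) = self.handlers[signo as usize] {
            h(signo, &info, core::ptr::null_mut());
        }
    }
}

fn main() {
    const SIGUSR1: i32 = 10;
    fn on_sigusr1(signo: i32, info: &SigInfo, _ctx: *mut ()) {
        assert_eq!(signo, info.si_signo);
        println!("got signal {signo}");
    }
    let mut reg = Registry::new();
    reg.register(SIGUSR1, on_sigusr1);
    reg.kernel_entry(SIGUSR1); // simulate kernel delivering only the number
}
```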