- Feature Name: io_uring
- Start Date: 2020-06-21
- RFC PR: [redox-os/rfcs#15](https://gitlab.redox-os.org/redox-os/rfcs/-/merge_requests/15)
- Redox Issue: [redox-os/redox#1312](https://gitlab.redox-os.org/redox-os/redox/issues/1312)

# Summary
[summary]: #summary

`io_uring` is a low-latency, high-throughput asynchronous I/O interface,
inspired by Linux's `io_uring` (available since 5.1). In essence, an `io_uring`
behaves like a regular SPSC channel or queue: the producer pushes entries,
which the consumer pops. This interface provides two different rings for each
`io_uring` instance, namely the _submission queue_ (SQ) and the _completion
queue_ (CQ). The process making the syscall (from now on referred to as the
_consumer_) sends a _submission queue entry_ (SQE or SE), which the process
(or kernel) processing the syscall (referred to as the _producer_) handles
before sending back a _completion queue entry_ (CQE or CE). Unlike on Linux,
where the kernel is always the producer, the Redox implementation allows
processes and the kernel to take on either role.

# Motivation
[motivation]: #motivation

Since Redox is a microkernel, context switches will be much more frequent than
on monolithic kernels, which do more of their work inside the kernel. This
context switch overhead is even greater when using KPTI to mitigate the recent
Meltdown vulnerability; this is also the motivation behind Linux's io_uring
API, even if Redox would benefit from it to a greater extent.

By using two separate queues, the _submission queue_ and the _completion
queue_, the only overhead of the syscall is writing the submission queue entry
to a flat shared memory region, and then incrementing two atomic variables
(where the second is optional and only needed for process notification, albeit
highly recommended), in that order. Similarly, when a command completes, all
that has to be done is the same: reading from shared memory, and then
incrementing two counters. Since the channels are lock-free unless the queues
are empty (when receiving) or full (when sending), this also allows two
processes that run in parallel to serve each other's requests simultaneously
in real time by polling the rings.

Another reason why `io_uring` is to be considered is its completion-based
model for async I/O. While the readiness-based model works well for network
drivers, where events are triggered by hardware (or rather, polled by the
driver), other hardware such as NVMe disks will not read from the disk by
itself, but has to be started by the driver. The way `xhcid` handles this is
by having two files allocated for each hardware endpoint: `ctl` and `data`.
Not only is this much more complex, both for the driver and the driver user,
but it also requires two separate syscalls, and thus causes more context
switches than necessary.

# Detailed design
[design]: #detailed-design

## Ring layer

### Ring operation

On the technical side, each `io_uring` instance comprises two rings - the
submission ring and the completion ring. These rings reside in shared memory,
which is shared either by the kernel and a process, or by two processes. Every
individual reference to a ring uses two memory regions (typically regular
memory mappings managed by the kernel). The first region is exactly one page
large (4 KiB on x86_64; however, so long as the ring header fits within a
page, architectures with smaller page sizes would also be able to use those),
and contains the ring header, which is used for atomically accessed counters
such as indices. Meanwhile, the second region contains the ring entries, each
of which must, for performance purposes, be aligned to a multiple of its own
size. Hence, with two rings for command submission and command completion, and
two regions per ring, this results in two fixed-size regions and two
variably-sized regions (dependent on the number of submission and completion
entries, respectively).

### Ring data structures

The current structure of the ring header, simplified, is defined as follows:

```rust
use core::sync::atomic::AtomicUsize;

// Pads its contents to one cache line, to avoid false sharing between the
// head and tail indices.
#[cfg_attr(target_arch = "x86_64", repr(align(128)))]
#[cfg_attr(not(target_arch = "x86_64"), repr(align(64)))]
struct CachePadded<T>(pub T);

// Generic type parameter for the entry type omitted, for simplicity.
#[repr(C)]
pub struct Ring {
    // the index of the head of the ring, from which entries are popped.
    head_index: CachePadded<AtomicUsize>,

    // the index of the tail of the ring, to which new entries are pushed.
    tail_index: CachePadded<AtomicUsize>,

    // various status flags, currently only implemented for shutting down rings.
    status: CachePadded<AtomicUsize>,
}
```

There is one submission entry type called `SqEntry` and one completion entry
type called `CqEntry`. Unlike most other structures within the `syscall`
crate, which mainly use `usize` for integers, these entry structures always
use fixed-size integers, to make sure that every byte in every struct is used
(while still allowing for future expandability). The definitions for these
entry types are the following (somewhat simplified):

```rust
// 64 bytes long
#[repr(C, align(64))]
pub struct SqEntry {
    // the operation code negotiated between the producer and consumer
    pub opcode: u8,
    // flags specific to entries and how they behave
    pub flags: u8,
    // the priority; greater means higher priority
    pub priority: u16,
    // the flags specific to the syscall
    pub syscall_flags: u32,
    // the file descriptor to operate on, for most syscalls
    pub fd: u64,
    // the length, for most syscalls
    pub len: u64,
    // an arbitrary number or pointer passed to the completion entry
    pub user_data: u64,
    // for syscalls taking an address, this will point to that address
    pub addr: u64,
    // for syscalls taking an offset, this specifies that offset
    pub offset: u64,
    // two additional fields to pad the struct to 64 bytes
    pub additional1: u64,
    // these are only used for certain syscalls
    pub additional2: u64,
}

// 16 bytes long
#[repr(C, align(16))]
pub struct CqEntry {
    // an arbitrary number or pointer associating this CQE with its SQE,
    // OR, the `extra` field of the second CQE in a pair.
    pub user_data: u64,
    // the lower 32 bits of the syscall result, or higher 32 bits of the second
    // CQE in a pair.
    pub status: u32,
    // the lower 32 bits of the CQE flags, or higher 32 bits of the second CQE
    // in a pair.
    //
    // this field also indicates whether this CQE is single or double.
    pub flags: u32,
}
```

Sometimes a syscall returns a number wider than 32 bits, and it becomes
crucial that this value be storable in the CQEs. However, since many syscalls
likely never return values which require more than 32 bits (and Linux has an
upper cap of 2 GiB for all I/O requests), a smaller type is used for the
common case. Therefore, there exist both "base" and "extended" CQEs, where
such a pair is indicated by the `EXTENDED` CQE flag. If set, the first CQE
will be followed by an additional one. Together they form a pseudo-structure
like this:

```rust
// 32 bytes long
pub struct AssembledCqes {
    pub user_data: u64,
    pub status: u64,
    pub flags: u64,
    pub extra: u64,
}
```

That structure is never used in the interface, but shows what values are
communicated when dual CQEs are used. The `user_data` field comes from the
first CQE, as do the lower 32 bits of the `status` and `flags` fields. The
`extra` field comes from the `user_data` field of the second CQE, which
logically also stores the upper 32 bits of the `status` and `flags` fields,
respectively.
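
For illustration, a minimal sketch of how a receiver could assemble such a
pair (the exact bit position of the `EXTENDED` flag is an assumption here):

```rust
// Hypothetical flag bit marking a CQE as the first half of an extended pair.
const EXTENDED: u32 = 1 << 31;

fn assemble(first: &CqEntry, second: &CqEntry) -> AssembledCqes {
    debug_assert_ne!(first.flags & EXTENDED, 0);
    AssembledCqes {
        // user_data is taken from the first CQE as-is.
        user_data: first.user_data,
        // lower 32 bits from the first CQE, upper 32 bits from the second.
        status: u64::from(first.status) | (u64::from(second.status) << 32),
        flags: u64::from(first.flags) | (u64::from(second.flags) << 32),
        // the second CQE's user_data field carries the extra value.
        extra: second.user_data,
    }
}
```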

### Algorithms
Every ring has two major operations:

* `push_back`, which tries to push a new entry onto the ring, or fails if the
  ring was full, or shut down
* `pop_front`, which tries to pop an entry from the ring, or fails if the ring
  was empty, or shut down

Implementation-wise, both operations first fetch the status, and fail early if
the ring has been registered as broken (`RingStatus::BROKEN`). The push
operation will also fail early if the ring has been dropped
(`RingStatus::DROP`), while the pop operation will continue to drain the final
elements when that occurs. Any invalid bit pattern in the status field will
result in the `BROKEN` flag being set, marking the ring unrecoverable both for
the producer and the consumer.

After that, the raw head and tail indices are fetched so they can later be
compared. They are converted into actual indices by calculating the bitwise
AND between the raw indices and the index mask, which itself is calculated by
subtracting one from the entry count. Effectively, this constrains the
possible entry counts to powers of two; however, it comes with various
benefits, such as simplifying push and pop operations, as well as eliminating
relatively expensive division instructions. Furthermore, it makes handling of
integer overflows trivial, as it only needs to rely on natural wrapping.
Hence, the entry count must also not exceed half of the largest number
representable by the word size, even though this limit is very unlikely to be
met in practice.

The computed indices represent regular array indices into the entries array.
Therefore, the byte offset of an entry is simply calculated by multiplying the
size of the entry with the index.
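
As a sketch, with an illustrative power-of-two entry count:

```rust
// Illustrative entry count; the real value is chosen at creation time.
const ENTRY_COUNT: usize = 256;
const INDEX_MASK: usize = ENTRY_COUNT - 1;

fn entry_byte_offset(raw_index: usize, entry_size: usize) -> usize {
    // The bitwise AND replaces a relatively expensive modulo, and the raw
    // index is simply allowed to wrap on integer overflow.
    let actual_index = raw_index & INDEX_MASK;
    actual_index * entry_size
}
```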

Both the sender and the receiver can own either one or two contiguous ranges,
depending on the cycle bit, which is computed by shifting the head and tail
indices right by the number of entry count bits, taking the lowest bit of each
result, and then calculating the XOR of those two bits. In other words, the
ring has cycled if the tail has overflown an odd number of times and the head
an even number of times, or vice versa. If the cycle condition is not present,
then the sender logically owns the range `[tail, end)` and also `[0, head)`
(two ranges), whereas the receiver logically owns `[head, tail)` (one range).
Likewise, if the cycle condition _is_ present, then the receiver owns
`[tail, head)` (one range) while the sender owns both `[head, end)` and
`[0, tail)` (two ranges).
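
Expressed as code, the cycle check is a one-liner (a sketch, where
`count_bits` is the base-2 logarithm of the entry count):

```rust
// The ring has cycled when exactly one of the raw indices has wrapped past
// the entry count an odd number of times.
fn has_cycled(raw_head: usize, raw_tail: usize, count_bits: u32) -> bool {
    (((raw_head >> count_bits) ^ (raw_tail >> count_bits)) & 1) != 0
}
```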

The ring header page is always read-write for both the sender and the
receiver; however, the entry pages are normally read-write only for the
sender, and read-only for the receiver. On architectures which support
write-only pages, the sender may then only have write access. Therefore, this
unfortunately does not align well with Rust's ownership rules: while it is
possible to grant mutable references to the sender, the receiver can never
obtain regular shared references that are not inside an `UnsafeCell`.

#### Calculating the number of available and empty entry slots

The number of entries that can be popped by the receiver of a ring is
determined by subtracting the raw head index from the raw tail index. Note
that this subtraction must be wrapping, since those indices can overflow.

Similarly, the number of entry slots that can be pushed into is calculated as
the total number of entries minus the available count.

The _Empty_ condition occurs if and only if the number of available slots
equals zero. When true, the receiver cannot pop further entries until more
become available.

Similarly, the _Full_ condition occurs if and only if the number of free slots
equals zero. When true, the sender will be unable to push a new entry until
the receiver has popped at least one. This naturally allows for basic
congestion control, where the sender can only send additional entries as fast
as the receiver can handle them.
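
A minimal sketch of these computations, building on the `Ring` header above
(the entry count is passed explicitly here, since the simplified header omits
it):

```rust
use core::sync::atomic::Ordering;

impl Ring {
    // Entries currently available for the receiver to pop. The subtraction
    // must wrap, since the raw indices are allowed to overflow.
    fn available(&self) -> usize {
        let head = self.head_index.0.load(Ordering::Acquire);
        let tail = self.tail_index.0.load(Ordering::Acquire);
        tail.wrapping_sub(head)
    }

    // Free slots that the sender can still push into.
    fn free(&self, entry_count: usize) -> usize {
        entry_count - self.available()
    }

    fn is_empty(&self) -> bool {
        self.available() == 0
    }

    fn is_full(&self, entry_count: usize) -> bool {
        self.free(entry_count) == 0
    }
}
```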

#### Pushing

When the ring is not full, the sender may write arbitrary values to the region
between the tail and head indices. Once the desired entries are written to
this region, the sender may then increment the tail index, possibly wrapping
around to zero. After that, the sender is no longer allowed to write to that
range of entries, until the ring has cycled back and the range has become
available again.

Since the ring is strictly SPSC, the producer can assume exclusive access to
the sender-owned region.

#### Popping

Popping works in much the same way as pushing, with the only difference being
that the receiver does not necessarily have write access to the entries, and
can only read from the receiver-owned region. Rather than incrementing the
tail index, the head index is incremented.
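
Putting the pieces together, a simplified `push_back`/`pop_front` pair might
look like the following sketch (building on the `Ring` definition and the
count helpers above; status checking is omitted):

```rust
use core::sync::atomic::Ordering;

// `entries` points to the mapped entry region; callers must pass the same
// power-of-two entry count that the ring was created with.
unsafe fn push_back<T>(
    ring: &Ring,
    entries: *mut T,
    entry_count: usize,
    entry: T,
) -> Result<(), ()> {
    if ring.is_full(entry_count) {
        return Err(()); // full: the receiver must pop first
    }
    // Only the sender ever writes the tail, so a relaxed load suffices.
    let tail = ring.tail_index.0.load(Ordering::Relaxed);
    entries.add(tail & (entry_count - 1)).write(entry);
    // Publish the entry only after it has been fully written.
    ring.tail_index.0.store(tail.wrapping_add(1), Ordering::Release);
    Ok(())
}

unsafe fn pop_front<T>(
    ring: &Ring,
    entries: *const T,
    entry_count: usize,
) -> Result<T, ()> {
    if ring.is_empty() {
        return Err(()); // empty: the sender must push first
    }
    let head = ring.head_index.0.load(Ordering::Relaxed);
    let entry = entries.add(head & (entry_count - 1)).read();
    // Hand the slot back to the sender only after reading it.
    ring.head_index.0.store(head.wrapping_add(1), Ordering::Release);
    Ok(entry)
}
```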

## Interface layer

The Redox `io_uring` implementation comes with a new scheme, `io_uring:`. The
scheme provides only a top-level file, like `debug:`, which when opened
creates a new `io_uring` instance in its _Initial_ state. By then writing the
create info (`IoUringCreateInfo`) to that file descriptor, the `io_uring`
instance transitions into the _Created_ state, carrying basic information such
as the type of submission and completion entries, and the number of entries.
In that state, the four memory regions can then be mmapped; they are located
at the following offsets:

```rust
pub const SQ_HEADER_MMAP_OFFSET: usize = 0x0000_0000;
pub const SQ_ENTRIES_MMAP_OFFSET: usize = 0x0020_0000;
pub const CQ_HEADER_MMAP_OFFSET: usize = 0x8000_0000;
pub const CQ_ENTRIES_MMAP_OFFSET: usize = 0x8020_0000;

// "SQEs" is intentionally omitted, unlike with Linux. This may change in the
// future though.
```
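
For illustration, a consumer might create and map an instance roughly as
follows (a hedged sketch: the `IoUringCreateInfo` contents and the `mmap_fd`
helper are assumptions, not the final API):

```rust
// Open the top-level `io_uring:` file, yielding an instance in its
// _Initial_ state.
let ring_fd = syscall::open("io_uring:", syscall::O_CLOEXEC)?;

// Writing the create info moves the instance to the _Created_ state.
let info = IoUringCreateInfo { /* entry types, entry counts, ... */ };
let info_bytes = unsafe {
    core::slice::from_raw_parts(
        &info as *const IoUringCreateInfo as *const u8,
        core::mem::size_of::<IoUringCreateInfo>(),
    )
};
syscall::write(ring_fd, info_bytes)?;

// Map the four regions at their fixed offsets; `mmap_fd` stands in for a
// wrapper around the fmap syscall.
let sq_header = mmap_fd(ring_fd, SQ_HEADER_MMAP_OFFSET, PAGE_SIZE)?;
let sq_entries = mmap_fd(ring_fd, SQ_ENTRIES_MMAP_OFFSET, sq_entries_len)?;
let cq_header = mmap_fd(ring_fd, CQ_HEADER_MMAP_OFFSET, PAGE_SIZE)?;
let cq_entries = mmap_fd(ring_fd, CQ_ENTRIES_MMAP_OFFSET, cq_entries_len)?;
```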
When all of these are mapped, the `io_uring` instance is ready to be attached
in one of the following ways:

* userspace to userspace
* userspace to kernel
* kernel to userspace

After a successful attachment, the instance will transition into the
_Attached_ state, in which commands can be submitted.

### Userspace-to-userspace

In the userspace-to-userspace scenario, which has the benefit of zero kernel
involvement (except for polling the ring indices to know when to notify), the
consumer process calls the `SYS_IORING_ATTACH` syscall, which takes two
arguments: the file descriptor of the `io_uring` instance, and the name of the
scheme.

The scheme will however receive a different syscall in a regular packet,
namely `SYS_IORING_RECV`, which includes a temporary kernel-mapped buffer
containing already-mapped offsets to the four ring memory regions, as well as
an `io_uring` instance file descriptor for the producer, linked to the
consumer instance file descriptor.

This mode is generally somewhat specialized, because the kernel may have to
poll the rings on its own (perhaps with some magic trickery of putting every
`io_uring` allocation in its own adjacent group of page tables, and checking
the "dirty" flag on x86_64), and because the buffers have to be managed
externally. For that type of memory management to be possible, the kernel
provides an efficient way for consumers to quickly add new preallocated chunks
to a buffer pool: by invoking `SYS_DUP` on a userspace-to-userspace `io_uring`
instance file descriptor, a new pool file descriptor will be created. The pool
file descriptor can be mmapped at an arbitrary offset, which will allocate
memory and push an allocation entry to an internal list.

The producer can do the same with its instance file descriptor, with the
exception that the producer can only map offsets that the consumer has
allocated. The pending allocations that the producer has not yet mapped can be
retrieved by reading the producer pool file descriptor.

Another constraint is that the stack cannot be used at all, since unprivileged
processes are obviously not supposed to access other processes' stacks. The
main point of this mode, however, is the low latency it offers; with some
intelligent kernel scheduling, the kernel could place an audio driver and an
audio application on two different CPUs, and they would be able to process
each other's syscalls without any context switch (apart from scheduling, of
course).

In the future there might be scenarios where real addresses are used rather
than offsets in this attachment mode, e.g. physical addresses for an NVMe
driver. The consumer and producer are free to negotiate such behavior in any
case.

When waiting for notification, the regular `SYS_IORING_ENTER` syscall may be
invoked, which will cause the kernel to wait for the ring to update, blocking
the process in the meantime. However, the primary notification mechanism for
consumers is to use event queues in a primary userspace-to-kernel io_uring; by
using the `RegisterEvents` opcode with the `EVENT_READ` flag (and
`EVENT_WRITE` for notification when a ring is no longer full), with the
`io_uring` instance file descriptor as target, an event will be triggered once
the ring has available completion entries.

TODO: Right now the kernel polls rings using event queues or
`SYS_IORING_ENTER`, but would it make sense to optimize this by placing the
kernel virtual addresses for the rings adjacently in virtual memory, and then
using the "dirty" page table flags instead?

### Userspace-to-kernel

The userspace-to-kernel attachment mode is similar; the process still calls
`SYS_IORING_ATTACH`, except that the kernel handles the request. The kernel
supports the same standard opcode set as other schemes do, with the advantage
of being able to communicate with every open scheme of that process.

This is the recommended mode, especially for applications, since it does not
require one ring per consumer per producer, but only at least one ring on the
consumer side. The kernel will handle the syscalls pushed onto that
`io_uring`, and possibly forward them to schemes using kernel-to-userspace
`io_uring`s. The kernel will also manage the buffers automatically here, since
it has full access to every process's address space when processing a syscall.
This results in less overhead, as no preregistered buffers or other shared
memory is needed.

When using an `io_uring` this way, a process pushes submission entries and
then calls `SYS_IORING_ENTER`, which is a regular syscall. The kernel then
reads and processes the submission entries, and pushes completion entries in
case some requests were already finished or not asynchronous (while `io_uring`
is perfectly suited to async by design, it can also be used to batch regular
syscalls within one syscall, to reduce context switching overhead).
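
A sketch of that flow from the consumer's side (the opcode constant and the
`sys_ioring_enter` wrapper are illustrative, not the final API):

```rust
// Queue a hypothetical read-like operation on the submission ring.
let sqe = SqEntry {
    opcode: OPCODE_READ, // assumed opcode constant
    fd: file_fd as u64,
    addr: buffer.as_mut_ptr() as u64,
    len: buffer.len() as u64,
    user_data: 42, // echoed back in the CQE, for matching
    ..Default::default()
};
unsafe { push_back(&sq_ring, sq_entries, SQ_ENTRY_COUNT, sqe)? };

// One syscall flushes any number of queued submissions; here it also blocks
// until at least one completion is available (assumed argument semantics).
sys_ioring_enter(ring_fd, 1)?;

// Drain whatever completions have arrived.
while let Ok(cqe) = unsafe { pop_front::<CqEntry>(&cq_ring, cq_entries, CQ_ENTRY_COUNT) } {
    assert_eq!(cqe.user_data, 42);
    // cqe.status carries the lower 32 bits of the syscall result.
}
```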

TODO: Are schemes that handle the file descriptors referenced when talking to
the kernel going to be scheduled directly, in or after `SYS_IORING_ENTER`, or
is the kernel going to wait for them to be scheduled regularly first?

### Kernel-to-userspace

The kernel-to-userspace mode is the producer counterpart of the
userspace-to-kernel mode; it shares the same properties as the other, but with
the roles reversed. A scheme may at any time receive a request to attach an
`io_uring` where the kernel is the producer, and this will happen at most once
per scheme.

The kernel is free to use this `io_uring` whenever it wants asynchronous I/O on
a scheme that supports it, for example if a userspace-to-kernel `io_uring`
wrote to a file descriptor owned by that scheme.

One noteworthy difference between userspace-to-userspace and
kernel-to-userspace on the one hand, and userspace-to-kernel on the other, is
that the file descriptor numbers are local to the target scheme in the former
cases. Conversely, in the userspace-to-kernel case, file descriptors global to
the consumer process may be used freely. This is akin to the regular syscall
interface and scheme handlers: the user can specify any file descriptor, and
if it belongs to a certain scheme, the kernel will translate that file
descriptor into a file descriptor local to that scheme. The only distinction
with the `io_uring` interface is that in the userspace-to-userspace case, the
consumer must itself distinguish between file descriptors local to a specific
scheme, and its own global file descriptors.

Since the kernel initiates these by itself and keeps track of state, no
dedicated syscall resembling `SYS_IORING_ENTER` needs to be called; the kernel
can simply check whether the indices have changed or not.

### Deallocation

The four `io_uring` memory regions are kernel-managed, and thus they only
require a close syscall on both the consumer side and the producer side. They
can also be unmapped separately with `funmap`, before closing.

The kernel will then shut down the opposite ring part, so that the other side
of the `io_uring` receives the last items and then also destroys its instance.
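
In terms of syscalls, teardown on one side then amounts to the following
sketch (the exact `funmap` argument conventions may differ):

```rust
// Optionally unmap the ring regions first...
syscall::funmap(sq_header)?;
// ...then close the instance; the kernel shuts down the opposite ring part.
syscall::close(ring_fd)?;
```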

# Drawbacks
[drawbacks]: #drawbacks

Since Linux handles every syscall within ring 0, no extra communication needs
to occur when dealing with addresses; the kernel can simply map a userspace
address (if it is not already mapped), do its work, and then push a completion
entry. Redox has a similar mode: userspace-to-kernel. One downside compared to
how Linux works is that the kernel has to use two rings: one between the
consumer and the kernel, and one between the kernel and the producer. There is
also a direct mode, userspace-to-userspace, although it suffers from various
limitations.

Since the Redox `io_uring` in userspace-to-userspace mode has no direct kernel
involvement, all buffers have to be registered in order for the producer
process to access them. The current workaround here is to use a different
`io_uring` with the kernel as the producer and allocate buffers there, or to
simply mmap them at a somewhat higher cost using regular syscalls. While this
might not impose much greater overhead than simply letting the kernel deal
with it during a syscall, it does completely eliminate the ability to use the
stack for buffers, since sharing the stack would be a major issue for both
program security and safety.

An additional drawback is the increased complexity of using both asynchronous
I/O and regular blocking I/O in the kernel. The kernel may also have to keep
track of the state of certain operations, instead of bouncing between
userspace and kernel in a stack-like way when schemes call schemes. Futures
(and maybe async/await) might help with this. Alternatively, one could prevent
more complex syscalls, e.g. `SYS_OPEN`, from having to be repeated, by only
allowing simple paths without symlinks, which reduces complex state.

# Alternatives
[alternatives]: #alternatives

Even though SPSC queues may not be strictly necessary aside from their
increased performance, one could instead simply introduce `O_COMPLETION_IO`
(which would be paired with `O_NONBLOCK`), `EVENT_READ_COMPLETION` and
`EVENT_WRITE_COMPLETION`, and let the kernel retain the grants of the buffers
until events with the aforementioned flags have been sent, to indicate that
the buffers are no longer needed by the scheme process. However, this would
still limit applications to executing only one system call at a time. While it
would be non-blocking, it would remain synchronous, unlike `io_uring`.

Additionally, while the kernel can obviously assist with notifying processes
at lower overhead, this API could be implemented in userspace, by using shared
memory for communication, and futexes or even regular event queues (although
with a higher overhead due to syscalls) for notification.

The major problem with this approach, however, would be buffer management,
since it would require processes to use regular syscalls every time they want
to register a buffer. Another feature that the current design makes possible
is fast communication between the kernel and userspace processes: since the
current Redox `io_uring` design also allows the kernel to be the producer of
the rings, traditional event queues can be completely replaced. This also
applies to other kernel schemes.

# Unresolved questions
[unresolved]: #unresolved-questions

The API is for the most part actually quite similar to Linux's API, with the
only incompatible difference being that the SQ is not separated from its
entries (which Linux does). The problem, however, is that `io_uring`s can have
different producers, unlike on Linux where the producer is always the kernel.
Thus, there might be motivation for a kernel-managed queue that sends
submissions to the respective schemes, and collects completions from different
schemes, so that the consumer only needs one `io_uring` instance, like on
Linux.

If that were implemented, there would be a possibility for source
compatibility with liburing, especially if the entry types were exactly the
same as Linux's (`struct io_uring_sqe` and `struct io_uring_cqe`). If the
syscalls were also emulated, this could even make complete emulation possible.

And finally, the potential use of an SQE array with an indirect SQ (pushing
indices into the SQE array rather than the SQEs directly) could be beneficial
for certain applications.

# References
1. [https://kernel.dk/io_uring.pdf](https://kernel.dk/io_uring.pdf) - A
   reference of the io_uring API present from Linux 5.1 and onwards.