jD91mZM2 · jD91mZM2 · jD91mZM2 · 69f0187f
--- a/text/0000-ptrace.md
+++ b/text/0000-ptrace.md
@@ -16,102 +16,179 @@ no interface for tracing a process' system calls or instructions, and
 no interface for managing another process' memory.

 A good first step for implementing `gdb` or a similar utility would be
-to implement the former, a Linux `ptrace(...)` alternative for
-Redox. This should not only open up the possibility for debuggers, but
-also system-call translation processes like WINE, perhaps for Linux
-compatibility which would rid us the problem of porting software.
-
-And even with *pure* `ptrace`, without any memory reading or
-otherwise, one can still implement the immensely useful tool `strace`,
-which could serve as an alternative to recompiling the kernel with
-system call debugging turned on or off. This is probably what we
-should focus on getting to work initially, before getting to the good
-stuff.
+to implement a Linux `ptrace(...)` alternative for Redox. This should
+not only open up the possibility for debuggers, but also system-call
+translation processes like WINE, perhaps for Linux compatibility which
+would rid us of the problem of porting software.
+
+And even with *pure* `ptrace`, without any register or memory reading,
+one can still implement the immensely useful tool `strace`, which
+could serve as an alternative to recompiling the kernel with system
+call debugging turned on or off. This is probably what we should focus
+on getting to work initially, before getting to the good stuff.

 # Detailed design
 [design]: #detailed-design

+The Linux ptrace interface is sometimes considered a huge mistake due
+to its inconsistency and it being just one massive function. The Redox
+interface will have to keep that in mind, as well as remove any
+duplicate or otherwise redundant functions.
+
 All process-controlling functions are implemented as one kernel
 scheme, `proc:`. Opening it up with the `pid` of a tracee and a path
-will perform a specific operation on the process.
-
-## ptrace
-
-Opening `proc:<pid>/trace` will attach to that process and internally
-stop it similar to how `SIGSTOP` would work (it does, however, not
-rely on signals in any way), and closing the file descriptor will
-detach from it automatically. The benefit of using schemes as opposed
-to a Linux-style function is that we get the ability to disallow this
-feature using namespacing for free. The `ptrace(PTRACE_TRACEME, ...)`
-call from Linux is not implemented, as one can just attach to the
-`pid` of the child.
+will perform a specific operation on the process. The benefit of using
+schemes as opposed to a Linux-style function is that we get the
+ability to disallow this feature using namespacing for free. It also
+allows multiplexing using `event:`, so the file can be used in a
+nonblocking fashion just like any other file descriptor.
+
+## Process trace
+
+Opening `proc:<pid>/trace` will give you a file to which you can write
+proc-related functions, and closing the file descriptor will detach
+from it automatically. Any breakpoints still set when the file is
+closed are deleted and the process is resumed. Only *one* tracer
+can control a process, as I am too close-minded to come up with a
+design that would make sense for running multiple tracers on a single
+tracee.
+
+That said, if the tracer opens the file with the flag `O_EXCL`,
+closing the file will instead send `SIGKILL` to the tracee. This is to
+prevent any ptrace-contained processes from breaking out. (`O_EXCL`
+can be thought of as meaning the tracer is the only one who controls
+the process, and the process can't live on its own.)
+
+Another flag used in `open` is `O_TRUNC` which will stop the process
+*immediately*. This can be compared to using `PTRACE_ATTACH` on Linux
+as opposed to `PTRACE_SEIZE`. (`O_TRUNC` can be thought of as
+*truncating*/clearing the file's execution. It's a stretch, but I have
+no better idea)

 The most important operation of `ptrace` is of course to put a
-breakpoints! This can be done by `write`ing a specific byte to the
-file, in a 1-length array. This byte can be one of the following
-combinations:
-
- `PTRACE_SYSCALL` breaks both before and after the next syscall.
- `PTRACE_SYSCALL | PTRACE_SYSEMU` breaks *instead* of the next
-  syscall.
- `PTRACE_SINGLESTEP` allows one single instruction to be ran and then
-  stops again.
- `PTRACE_SINGLESTEP | PTRACE_SYSEMU` allows one single instruction to
-  be ran and then stops again, but doesn't execute any syscalls.
- `PTRACE_CONT` runs the program to completion.
- `PTRACE_SIGNAL` runs the program until the next signal is called and
-  handled by user code (so `SIG_DFL` can not be overwritten this
-  way). Before this breakpoint, an event specifying the signal number
-  is called (see paragraph on events below).
- `PTRACE_WAIT` ignores the `O_NONBLOCK` flag and waits until a
-  breakpoint is reached, but does nothing else.
-
-The `PTRACE_SYSEMU` flag can be used by, for example, emulators that
-do not want to run any system calls on the host system.
-
-Unless `O_NONBLOCK` is set, each `write` invocation blocks until the
-breakpoint is hit. With `O_NONBLOCK`, one can still use `PTRACE_WAIT`
-to wait for any breakpoint to hit. Because `ptrace` does **not** rely
-on signals, when a process is ptrace-stopped you can send `SIGCONT`
-without actually restarting the process. The process is restarted only
-using a ptrace operation such as `PTRACE_CONT` or when the tracer file
-handle is closed.
+breakpoint! Redox tries to unify the Linux event system and the
+breakpoint system by making the input a bitmask of *one or more*
+breakpoints. The `write` call returns when the first breakpoint/event
+specified within the bitmask is reached, in which case it'll add that
+event for reading using the `read` system call (see events below). If
+an event is already set, the `write` returns immediately. (A slight
+exception is `O_NONBLOCK`, see that below as well.)
+
+Each breakpoint is set and optionally awaited using the `write` system
+call. Each such call will also resume the tracee in case it's stopped
+after another breakpoint. So if you write a value with no stop bits
+set, the program will run to completion. The exception is manually
+specifying `PTRACE_FLAG_WAIT` (even in blocking mode, see below), which
+will - unless any new stop is set - only wait for an existing
+breakpoint to be reached.
+
+## Breakpoints
+
+- If `PTRACE_STOP_PRE_SYSCALL` is set, the tracee will break on the
+  next start of a syscall. This diverges from Linux' way of using
+  `PTRACE_SYSCALL` for *both* pre- and post- syscall. However it's for
+  a good reason: Signals can occur in the middle of a syscall, and
+  unlike Linux which just delays the signal, we should go the simplest
+  route to minimize kernel code size and let the user choose the
+  behavior they want and not choose for them.
+- If `PTRACE_STOP_POST_SYSCALL` is set, the tracee will break at the
+  end of a syscall, when the return value has just been set in the
+  appropriate register.
+- If `PTRACE_STOP_SINGLESTEP` is set, the tracee is stopped after the
+  execution of just one assembly instruction. If used together with
+  any system-call trace, the system-call method will take precedence
+  and allow you to fine-tune how that should work. (Not a special
+  case, the syscall trace returns before the instruction returns and
+  thus is what is used by the multiplexing trace call!)
+- If `PTRACE_STOP_SIGNAL` is set, the tracee is stopped before the next
+  signal is handled. The signal number is pushed as the first parameter
+  to the break event, see section on events below.
+
+### Non-breakpoint events
+
+These events will not stop the tracee; it keeps running in the
+background until whatever breakpoint was set alongside them is reached.
+
+- If `PTRACE_EVENT_CLONE` is set, the tracer will wake up when the
+  tracee creates a new child process. An event will be delivered to the
+  tracer with the child's PID as the first parameter. The child process
+  will be in a stopped state, but unless attached to with a separate
+  tracer, it will be restarted upon the next ptrace invocation.
+
+### Flags
+
+- If `PTRACE_FLAG_SYSEMU` is set and the tracee was just stopped
+  pre-syscall, don't continue running the syscall but rather return
+  directly without setting a return value (the tracer might have set
+  that)! This way you will not have to decide whether you want to
+  change or replace a system call before inspecting it.
+- If `PTRACE_FLAG_WAIT` is set, the `write` call will not return
+  before the breakpoint is reached, but rather await that. This is the
+  default behavior whenever `O_NONBLOCK` is not set, but this flag
+  lets nonblocking tracers override that behavior. As explained
+  briefly above, this flag will not restart a stopped tracee unless a
+  new stop bit was set - which is behavior *not* replicated by default
+  without `O_NONBLOCK`.
+
+---
+
+Because `ptrace` does **not** rely on signals, when a process is
+ptrace-stopped (such as after attaching to the tracee with `O_TRUNC`,
+explained above) you can send `SIGCONT` without actually restarting
+the process. The process is restarted only using a ptrace operation or
+when the tracer file handle is closed. The signal is instead just
+scheduled to get handled whenever the tracee starts, which allows the
+tracee to raise `SIGSTOP` and lets the tracer restart it only after
+a ptrace operation was completed.

 When the tracee exits, any blocking operation depending on it stops
 and instead returns `ESRCH`. It does not, however, reap the zombie
-process. Therefore `waitpid` is to be attempted after a `ESRCH` error,
-which will also allow you to obtain the exit status.
-
-### Trace threads/subprocesses
-
-By default, a tracer only traces the exact process it specifies. The
-creation of a child process, using `clone`, sends a special event that
-allows you to attach a new tracer to that process, before it starts
-running. Unless restarted by a tracer attached to the child itself,
-the subprocess remains stopped until the next operation on the parent
-process, which will continue (and ignore) the subprocess only if no
-tracer was attached to it.
-
-You can use `O_NONBLOCK` to multiplex multiple tracers for events
-using the `event:` scheme:
-
- `EVENT_READ` is sent out when an event is available.
- `EVENT_WRITE` is sent out when a breakpoint is reached and a
-   nonblocking `write` operation therefore completed.
+process. Therefore, if the tracee is your own child process you should
+invoke `waitpid` immediately after an `ESRCH` error, which will also
+allow you to obtain the exit status.

 ### Events

-Some special events are sent out, such as the event described
-above. The way you receive events is by `read`ing a `PtraceEvent`
-structure. Reads are not blocking, and will return `0` when no event
-was able to be read.
-
-If an event occurs during a ptrace operation, such as
-`PTRACE_SYSCALL`, this operation returns early specifying it has
-written `0` bytes. The standard way to handle this situation when in
-blocking mode is to detect the `0`, read and handle all events, and
-call `PTRACE_WAIT` until the byte is written, looping back to the
-point of reading all events when it returns `0`.
+Events give the tracer information about breakpoints or actions the
+tracee has taken. There are two types of events: Breakpoint events,
+and non-breakpoint events. Only breakpoint events stop the tracee when
+reached, other events only wake up the tracer, while the tracee keeps
+going. The way you receive events is by `read`ing a `PtraceEvent`
+structure from the file. Reads are not blocking, and will return `0`
+when no event was able to be read.
+
+Events are read sequentially, i.e. follow first-in-last-out. The
+standard behavior for handling non-breakpoint events is to read them
+all and then retry waiting for the breakpoint to be reached using
+`PTRACE_FLAG_WAIT`. Any unread events from the last operation will
+cause a new one to return immediately, in order to prevent a possible
+race condition where you think you've read all events but another one
+occurs right when you want to retry the wait for a breakpoint to be reached.
+
+The structure has a value `kind` specifying what bit caused the tracer
+to wake up, as well as a set of values like `a` (first parameter), `b`
+(second parameter), `c` (third parameter), etc. For example, if the
+input was `PTRACE_STOP_SIGNAL | PTRACE_EVENT_CLONE`, `kind` may
+be either `PTRACE_STOP_SIGNAL` or `PTRACE_EVENT_CLONE` depending on
+which event was hit first. The `a` value of `PTRACE_STOP_SIGNAL` is
+the signal number which caused the breakpoint to be hit, while the `a`
+value of `PTRACE_EVENT_CLONE` is the PID of the tracee's new child
+process.
+
+### Nonblocking mode
+
+In nonblocking mode, a ptrace call without the `PTRACE_FLAG_WAIT` bit
+set will return `1` immediately. Any breakpoint specified is set, and
+will like usual overwrite any existing breakpoints. Note that the file
+will send events to the `event:` scheme, meaning you can multiplex
+multiple tracers.
+
+`EVENT_READ` is triggered whenever the first event arrives. Since an
+event only gets pushed to the stack if it's within the specified write
+bitmask, all events in the stack are of interest and this notification
+means you should immediately read them all.
+
+`EVENT_WRITE` is reserved, for now.

 ## Modify registers

@@ -136,10 +213,10 @@ unification of the following calls in Linux:

 ## Security

-By default a process should only be allowed to trace a child processes
-owned by the current user, direct or indirect. The main motivation for
-allowing indirect subprocesses is so one can trace threads of a direct
-subprocess.
+By default a process should only be allowed to control a process that
+is owned by the current user and of which it is an ancestor,
+direct or indirect. The main motivation for allowing indirect
+subprocesses is so one can trace threads of a direct subprocess.

 This restriction is lifted by processes owned by `root`, which can
 trace any process. In the future, a capability-like system could be
@@ -168,7 +245,11 @@ if pid == 0 {

    // ptrace attach: Stop the process using internal ptrace mechanism
    // (not SIGSTOP!)
-    let mut trace = File::open(&format!("proc:{}/trace", pid))?;
+    let mut trace = OpenOptions::new()
+        .read(true)
+        .write(true)
+        .truncate(true)
+        .open(&format!("proc:{}/trace", pid))?;
    // obtain a handle to the process registers
    let mut regs = File::open(&format!("proc:{}/regs/int", pid))?;
    let mut status = 0;
@@ -177,13 +258,22 @@ if pid == 0 {
    // it is still stopped by ptrace
    syscall::kill(pid, SIGCONT)?;

-    let mut written = trace.write(&[syscall::PTRACE_SYSCALL])?;
-    while written == 0 {
-        // Ignore events
-        let mut _event: PtraceEvent = PtraceEvent::default();
-        trace.read(&mut _event)?;
-        written = trace.write(&[syscall::PTRACE_WAIT])?;
+    trace.write(&(syscall::PTRACE_STOP_PRE_SYSCALL | syscall::PTRACE_EVENT_CLONE))?;
+    // Mostly ignore event... usually you can get some interesting
+    // data from it
+    let mut event: PtraceEvent = PtraceEvent::default();
+    trace.read(&mut event)?;
+    while event.kind & syscall::PTRACE_EVENT_MASK != 0 {
+        // In reality, you'll actually want to handle this event, or
+        // it makes no sense to listen for it at all. This is just an
+        // example to show you how you can handle non-breakpoint
+        // events though.
+        trace.write(&(syscall::PTRACE_FLAG_WAIT))?;
+        trace.read(&mut event)?;
    }
+    // This assertion is safe because if the process exits, the write
+    // call returns ESRCH
+    assert_eq!(event.kind, syscall::PTRACE_STOP_PRE_SYSCALL);

    let mut registers = syscall::IntRegisters::default();
    regs.read(&mut registers)?;
@@ -196,10 +286,10 @@ if pid == 0 {

    regs.write(&registers)?;

-    // trace.write(&[syscall::PTRACE_SYSCALL])?; // wait for the completion of the system call
+    // trace.write(&[syscall::PTRACE_STOP_POST_SYSCALL])?; // wait for the completion of the system call

-    trace.write(&[syscall::PTRACE_CONT])?; // run the program to the end, which is like right now
-    syscall::waitpid(pid, &mut status, 0)?; // reap zombie processes
+    trace.write(&[])?; // don't set any stops, rather run the program to the end, which is like right now
+    syscall::waitpid(pid, &mut status, 0)?; // reap zombie process

    // trace file dropped here: process tracing detached and process
    // implicitly resumed if it hadn't already been, y'know, killed
@@ -231,6 +321,10 @@ writing memory. This was what the original RFC first suggested, before
 implements a `ptrace(...)` function as a userspace library over their
 ProcFS.

+There are lots of possible alternatives, one of which was implemented
+and tried out. However, out of the ones I've considered, this one
+should be the most scalable over time.
+
 # Unresolved questions
 [unresolved]: #unresolved-questions

@@ -244,3 +338,4 @@ ProcFS.
 - Should one be able to override behavior of `SIG_DFL`-handled
  signals?
 - How should `int3` be user-handled? (perhaps by catching `SIGTRAP`?)
+- How should a user read memory maps?