Commits on Source (3)
......@@ -16,102 +16,179 @@ no interface for tracing a process' system calls or instructions, and
no interface for managing another process' memory.
A good first step for implementing `gdb` or a similar utility would be
to implement the former, a Linux `ptrace(...)` alternative for
Redox. This should not only open up the possibility for debuggers, but
also system-call translation processes like WINE, perhaps for Linux
compatibility, which would rid us of the problem of porting software.
And even with *pure* `ptrace`, without any memory reading or
otherwise, one can still implement the immensely useful tool `strace`,
which could serve as an alternative to recompiling the kernel with
system call debugging turned on or off. This is probably what we
should focus on getting to work initially, before getting to the good
stuff.
to implement a Linux `ptrace(...)` alternative for Redox. This should
not only open up the possibility for debuggers, but also system-call
translation processes like WINE, perhaps for Linux compatibility, which
would rid us of the problem of porting software.
And even with *pure* `ptrace`, without any register or memory reading,
one can still implement the immensely useful tool `strace`, which
could serve as an alternative to recompiling the kernel with system
call debugging turned on or off. This is probably what we should focus
on getting to work initially, before getting to the good stuff.
# Detailed design
[design]: #detailed-design
The Linux ptrace interface is sometimes considered a huge mistake due
to its inconsistency and it being just one massive function. The Redox
interface will have to keep that in mind, as well as remove any
duplicate or otherwise redundant functions.
All process-controlling functions are implemented as one kernel
scheme, `proc:`. Opening it up with the `pid` of a tracee and a path
will perform a specific operation on the process.
## ptrace
Opening `proc:<pid>/trace` will attach to that process and internally
stop it similar to how `SIGSTOP` would work (it does, however, not
rely on signals in any way), and closing the file descriptor will
detach from it automatically. The benefit of using schemes as opposed
to a Linux-style function is that we get the ability to disallow this
feature using namespacing for free. The `ptrace(PTRACE_TRACEME, ...)`
call from Linux is not implemented, as one can just attach to the
`pid` of the child.
will perform a specific operation on the process. The benefit of using
schemes as opposed to a Linux-style function is that we get the
ability to disallow this feature using namespacing for free. The file
also supports multiplexing using `event:` and can therefore be used in
a nonblocking fashion just like any other file descriptor.
## Process trace
Opening `proc:<pid>/trace` will give you a file to which you can write
proc-related requests, and closing the file descriptor will detach
from the process automatically. If any breakpoints are set when the
file is closed, they are deleted and the process is resumed. Only *one* tracer
can control a process, as I am too close-minded to come up with a
design that would make sense for running multiple tracers on a single
tracee.
That said, if the tracer opens the file with the `O_EXCL` flag, the
kernel will instead send `SIGKILL` to the tracee when the tracer closes
its file. This is to
prevent any ptrace-contained processes from breaking out. (`O_EXCL`
can be thought of as meaning the tracer is the only one who controls
the process, and the process can't live on its own)
Another flag used in `open` is `O_TRUNC` which will stop the process
*immediately*. This can be compared to using `PTRACE_ATTACH` on Linux
as opposed to `PTRACE_SEIZE`. (`O_TRUNC` can be thought of as
*truncating*/clearing the file's execution. It's a stretch, but I have
no better idea)
The most important operation of `ptrace` is of course to set
breakpoints! This can be done by `write`ing a specific byte to the
file, in a 1-length array. This byte can be one of the following
combinations:
- `PTRACE_SYSCALL` breaks both before and after the next syscall.
- `PTRACE_SYSCALL | PTRACE_SYSEMU` breaks *instead* of the next
syscall.
- `PTRACE_SINGLESTEP` allows one single instruction to be run and then
stops again.
- `PTRACE_SINGLESTEP | PTRACE_SYSEMU` allows one single instruction to
be run and then stops again, but doesn't execute any syscalls.
- `PTRACE_CONT` runs the program to completion.
- `PTRACE_SIGNAL` runs the program until the next signal is called and
handled by user code (so `SIG_DFL` can not be overwritten this
way). Before this breakpoint, an event specifying the signal number
is called (see paragraph on events below).
- `PTRACE_WAIT` ignores the `O_NONBLOCK` flag and waits until a
breakpoint is reached, but does nothing else.
The `PTRACE_SYSEMU` flag can be used by, for example, emulators that
do not want to run any system calls on the host system.
Unless `O_NONBLOCK` is set, each `write` invocation blocks until the
breakpoint is hit. With `O_NONBLOCK`, one can still use `PTRACE_WAIT`
to wait for any breakpoint to hit. Because `ptrace` does **not** rely
on signals, when a process is ptrace-stopped you can send `SIGCONT`
without actually restarting the process. The process is restarted only
using a ptrace operation such as `PTRACE_CONT` or when the tracer file
handle is closed.
breakpoints! Redox tries to unify the Linux event system and the
breakpoint system by making the input a set of bitflags with *one or
more* breakpoints. The call returns when the first breakpoint/event
specified within the bitmask is reached, in which case it'll add that
event for reading using the `read` system call (see events below). If
an event is already set, the `write` returns immediately. (A slight
exception is `O_NONBLOCK`; see that below as well.)
Each breakpoint is set and optionally awaited using the `write` system
call. Each such call will also resume the tracee in case it's stopped
after another breakpoint. So if you write a value with no stop bits
set, the program will run to completion. The exception is manually
specifying `PTRACE_WAIT` (even in blocking mode, see below), which
will - unless any new stop is set - only wait for an existing
breakpoint to be reached.
## Breakpoints
- If `PTRACE_STOP_PRE_SYSCALL` is set, the tracee will break on the
next start of a syscall. This diverges from Linux' way of using
`PTRACE_SYSCALL` for *both* pre- and post- syscall. However it's for
a good reason: Signals can occur in the middle of a syscall, and
unlike Linux which just delays the signal, we should go the simplest
route to minimize kernel code size and let the user choose the
behavior they want and not choose for them.
- If `PTRACE_STOP_POST_SYSCALL` is set, the tracee will break at the
end of a syscall, when the return value has just been set in the
appropriate register.
- If `PTRACE_STOP_SINGLESTEP` is set, the tracee is stopped after the
execution of just one assembly instruction. If used together with
any system-call trace, the system-call method will take precedence
and allow you to fine-tune how that should work. (Not a special
case, the syscall trace returns before the instruction returns and
thus is what is used by the multiplexing trace call!)
- If `PTRACE_STOP_SIGNAL` is set, the tracee is stopped before the next
signal is handled. The signal number is passed as the first parameter
to the break event, see section on events below.
### Non-breakpoint events
These events will not stop the tracee; it keeps running in the
background until whatever breakpoint was set alongside the event is reached.
- If `PTRACE_EVENT_CLONE` was set, the tracer will wake up when the
tracee creates a new child process. An event will be delivered to the
tracer with the PID as the first parameter. The child process will
be in a stopped state, but unless attached to with a separate
tracer, it will be restarted upon the next ptrace invocation.
### Flags
- If `PTRACE_FLAG_SYSEMU` is set and the tracee was just stopped
pre-syscall, don't continue running the syscall but rather return
directly without setting a return value (the tracer might have set
that)! This way you will not have to decide whether you want to
change or replace a system call before inspecting it.
- If `PTRACE_FLAG_WAIT` is set, the `write` call will not return
before the breakpoint is reached, but rather await that. This is the
default behavior whenever `O_NONBLOCK` is not set, but this flag
lets nonblocking tracers override that behavior. As explained
briefly above, this flag will not restart a stopped tracee unless a
new stop bit was set - which is behavior *not* replicated by default
without `O_NONBLOCK`.
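
As a rough illustration of how the stop, event, and flag bits above might combine into one request, here is a minimal sketch. The flag names come from this RFC, but the numeric values, the `PTRACE_STOP_MASK` constant, and both helper functions are assumptions made up for the example, not the actual kernel definitions.

```rust
// Hypothetical bit values: the RFC names these flags but assigns no numbers.
pub const PTRACE_STOP_PRE_SYSCALL: u8 = 1 << 0;
pub const PTRACE_STOP_POST_SYSCALL: u8 = 1 << 1;
pub const PTRACE_STOP_SINGLESTEP: u8 = 1 << 2;
pub const PTRACE_STOP_SIGNAL: u8 = 1 << 3;
pub const PTRACE_EVENT_CLONE: u8 = 1 << 4;
pub const PTRACE_FLAG_SYSEMU: u8 = 1 << 5;
pub const PTRACE_FLAG_WAIT: u8 = 1 << 6;

// Hypothetical helper mask covering every breakpoint ("stop") bit.
pub const PTRACE_STOP_MASK: u8 = PTRACE_STOP_PRE_SYSCALL
    | PTRACE_STOP_POST_SYSCALL
    | PTRACE_STOP_SINGLESTEP
    | PTRACE_STOP_SIGNAL;

/// Whether writing `request` would resume a stopped tracee.
/// A plain write always resumes; a write carrying `PTRACE_FLAG_WAIT`
/// only resumes when it also sets at least one new stop bit.
pub fn resumes_tracee(request: u8) -> bool {
    request & PTRACE_FLAG_WAIT == 0 || request & PTRACE_STOP_MASK != 0
}

/// Whether `request` lets the tracee run with no breakpoint to halt it,
/// i.e. it resumes the tracee and sets no stop bits.
pub fn runs_to_completion(request: u8) -> bool {
    resumes_tracee(request) && request & PTRACE_STOP_MASK == 0
}
```

Under these assumptions, `PTRACE_STOP_PRE_SYSCALL | PTRACE_FLAG_SYSEMU` would stop before the next syscall and let the tracer skip it, while a lone `PTRACE_FLAG_WAIT` would only wait on a breakpoint that is already set.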
---
Because `ptrace` does **not** rely on signals, when a process is
ptrace-stopped (such as attaching to the tracee with `O_TRUNC`
explained above) you can send `SIGCONT` without actually restarting
the process. The process is restarted only using a ptrace operation or
when the tracer file handle is closed. This signal is instead just
scheduled to get handled whenever the tracee starts, which allows the
tracee to raise `SIGSTOP` and let the tracer restart it only after
a ptrace operation was completed.
When the tracee exits, any blocking operation depending on it stops
and instead returns `ESRCH`. It does not, however, reap the zombie
process. Therefore `waitpid` is to be attempted after an `ESRCH` error,
which will also allow you to obtain the exit status.
### Trace threads/subprocesses
By default, a tracer only traces the exact process it specifies. The
creation of a child process, using `clone`, sends a special event that
allows you to attach a new tracer to that process, before it starts
running. Unless restarted by a tracer attached to the child itself,
the subprocess remains stopped until the next operation on the parent
process, which will continue (and ignore) the subprocess only if no
tracer was attached to it.
You can use `O_NONBLOCK` to multiplex multiple tracers for events
using the `event:` scheme:
- `EVENT_READ` is sent out when an event is available.
- `EVENT_WRITE` is sent out when a breakpoint is reached and a
nonblocking `write` operation therefore completed.
process. Therefore, if the tracee is your own child process you should
invoke `waitpid` immediately after an `ESRCH` error, which will also
allow you to obtain the exit status.
### Events
Some special events are sent out, such as the event described
above. The way you receive events is by `read`ing a `PtraceEvent`
structure. Reads are not blocking, and will return `0` when no event
was able to be read.
If an event occurs during a ptrace operation, such as
`PTRACE_SYSCALL`, this operation returns early specifying it has
written `0` bytes. The standard way to handle this situation when in
blocking mode is to detect the `0`, read and handle all events, and
call `PTRACE_WAIT` until the byte is written, looping back to the
point of reading all events when it returns `0`.
Events give the tracer information about breakpoints or actions the
tracee has taken. There are two types of events: Breakpoint events,
and non-breakpoint events. Only breakpoint events stop the tracee when
reached, other events only wake up the tracer, while the tracee keeps
going. The way you receive events is by `read`ing a `PtraceEvent`
structure from the file. Reads are not blocking, and will return `0`
when no event was able to be read.
Events are read sequentially, i.e. follow first-in-last-out. The
standard behavior for handling non-breakpoint events is to read them
all and then retry waiting for the breakpoint to be reached using
`PTRACE_FLAG_WAIT`. Any unread events from the last operation will
cause a new one to return immediately, in order to prevent a possible
race condition where you think you've read all events but another one
occurs right when you want to retry the wait for a breakpoint to be reached.
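
The read-everything-then-wait-again pattern described above could look roughly like the following sketch. It does not talk to a real `proc:` file: the `TraceFile` trait, `MockTrace`, `TraceeExited`, the two-field `PtraceEvent` stand-in, and all bit values are hypothetical and exist only to make the loop self-contained.

```rust
use std::collections::VecDeque;

// Hypothetical bit values; the RFC names the flags but assigns no numbers.
pub const PTRACE_STOP_PRE_SYSCALL: u8 = 1 << 0;
pub const PTRACE_EVENT_CLONE: u8 = 1 << 4;
pub const PTRACE_FLAG_WAIT: u8 = 1 << 6;
pub const PTRACE_EVENT_MASK: u8 = PTRACE_EVENT_CLONE;

/// Stand-in for the real structure, reduced to `kind` and `a`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PtraceEvent { pub kind: u8, pub a: usize }

/// Stand-in for the `ESRCH` error returned once the tracee has exited.
#[derive(Debug, PartialEq, Eq)]
pub struct TraceeExited;

/// Hypothetical abstraction over the `proc:<pid>/trace` file.
pub trait TraceFile {
    fn write_flags(&mut self, flags: u8) -> Result<(), TraceeExited>;
    fn read_event(&mut self) -> Option<PtraceEvent>;
}

/// Sets `stops`, hands non-breakpoint events to `on_event`, and returns
/// the first breakpoint event.
pub fn run_until_stop<T: TraceFile>(
    trace: &mut T,
    stops: u8,
    mut on_event: impl FnMut(PtraceEvent),
) -> Result<PtraceEvent, TraceeExited> {
    trace.write_flags(stops)?;
    loop {
        // Read every event that is currently available.
        while let Some(event) = trace.read_event() {
            if event.kind & PTRACE_EVENT_MASK == 0 {
                return Ok(event); // a breakpoint: the tracee is stopped
            }
            on_event(event);
        }
        // No breakpoint yet: wait again without setting new stops.
        trace.write_flags(PTRACE_FLAG_WAIT)?;
    }
}

/// In-memory mock: each write lets the "tracee" produce one more event,
/// unless unread events remain, in which case it returns immediately.
pub struct MockTrace {
    pending: VecDeque<PtraceEvent>,
    ready: VecDeque<PtraceEvent>,
    pub writes: Vec<u8>,
}

impl MockTrace {
    pub fn new(events: Vec<PtraceEvent>) -> Self {
        MockTrace { pending: events.into(), ready: VecDeque::new(), writes: Vec::new() }
    }
}

impl TraceFile for MockTrace {
    fn write_flags(&mut self, flags: u8) -> Result<(), TraceeExited> {
        self.writes.push(flags);
        if self.ready.is_empty() {
            match self.pending.pop_front() {
                Some(event) => self.ready.push_back(event),
                None => return Err(TraceeExited),
            }
        }
        Ok(())
    }

    fn read_event(&mut self) -> Option<PtraceEvent> {
        self.ready.pop_front()
    }
}
```

The mock returns immediately from a write while unread events remain, mirroring the race-avoidance rule above, so the loop never sleeps past an event it has not seen.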
The structure has a value `kind` specifying what bit caused the tracer
to wake up, as well as a set of values like `a` (first parameter), `b`
(second parameter), `c` (third parameter), etc. For example, if the
input was `PTRACE_STOP_SIGNAL | PTRACE_EVENT_CLONE`, the bitmask may
be either `PTRACE_STOP_SIGNAL` or `PTRACE_EVENT_CLONE` depending on
which event was hit first. The `a` value of `PTRACE_STOP_SIGNAL` is
the signal number which caused the breakpoint to be hit, while the `a`
value of `PTRACE_EVENT_CLONE` is the PID of the tracee's new child
process.
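
To make the `kind`/`a` convention concrete, here is a small sketch of how a tracer might decode an event. The field layout, the integer types, the bit values, and the `Decoded` enum are all assumptions for illustration; only the flag names and the meaning of `a` for these two kinds come from the text above.

```rust
// Hypothetical bit values; the RFC names the flags but assigns no numbers.
pub const PTRACE_STOP_SIGNAL: u8 = 1 << 3;
pub const PTRACE_EVENT_CLONE: u8 = 1 << 4;

/// Assumed layout: one `kind` bit plus positional parameters.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct PtraceEvent {
    pub kind: u8,
    pub a: usize,
    pub b: usize,
    pub c: usize,
}

/// Hypothetical tracer-side view of an event.
#[derive(Debug, PartialEq, Eq)]
pub enum Decoded {
    /// Breakpoint: the tracee stopped before handling signal `signo`.
    Signal { signo: usize },
    /// Non-breakpoint event: the tracee created child `child_pid`.
    ChildCreated { child_pid: usize },
    /// Any other kind, passed through untouched.
    Other { kind: u8 },
}

/// Interprets `a` according to which bit woke the tracer up.
pub fn decode(event: &PtraceEvent) -> Decoded {
    match event.kind {
        PTRACE_STOP_SIGNAL => Decoded::Signal { signo: event.a },
        PTRACE_EVENT_CLONE => Decoded::ChildCreated { child_pid: event.a },
        kind => Decoded::Other { kind },
    }
}
```

Matching on `kind` before touching `a` keeps the per-event meaning of the parameters in one place, which seems preferable to scattering raw `event.a` reads through the tracer.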
### Nonblocking mode
In nonblocking mode, a ptrace call without the `PTRACE_WAIT` bit set
will return `1` immediately. Any breakpoint specified is set and will,
as usual, overwrite any existing breakpoints. Note that the file
will send events to the `event:` scheme, meaning you can multiplex
multiple tracers.
`EVENT_READ` is triggered whenever the first event arrives. Since an
event only gets pushed to the stack if it's within the specified write
bitmask, all events in the stack are of interest and this notification
means you should immediately read them all.
`EVENT_WRITE` is reserved, for now.
## Modify registers
......@@ -136,10 +213,10 @@ unification of the following calls in Linux:
## Security
By default a process should only be allowed to trace child processes
owned by the current user, direct or indirect. The main motivation for
allowing indirect subprocesses is so one can trace threads of a direct
subprocess.
By default a process should only be allowed to control a process owned
by the current user, and only if it is an ancestor of that process,
direct or indirect. The main motivation for allowing indirect
subprocesses is so one can trace threads of a direct subprocess.
This restriction is lifted by processes owned by `root`, which can
trace any process. In the future, a capability-like system could be
......@@ -168,7 +245,11 @@ if pid == 0 {
// ptrace attach: Stop the process using internal ptrace mechanism
// (not SIGSTOP!)
let mut trace = File::open(&format!("proc:{}/trace", pid))?;
let mut trace = OpenOptions::new()
.read(true)
.write(true)
.truncate(true)
.open(&format!("proc:{}/trace", pid))?;
// obtain a handle to the process registers
let mut regs = File::open(&format!("proc:{}/regs/int", pid))?;
let mut status = 0;
......@@ -177,13 +258,22 @@ if pid == 0 {
// it is still stopped by ptrace
syscall::kill(pid, SIGCONT)?;
let mut written = trace.write(&[syscall::PTRACE_SYSCALL])?;
while written == 0 {
// Ignore events
let mut _event: PtraceEvent = PtraceEvent::default();
trace.read(&mut _event)?;
written = trace.write(&[syscall::PTRACE_WAIT])?;
trace.write(&[syscall::PTRACE_STOP_PRE_SYSCALL | syscall::PTRACE_EVENT_CLONE])?;
// Mostly ignore event... usually you can get some interesting
// data from it
let mut event: PtraceEvent = PtraceEvent::default();
trace.read(&mut event)?;
while event.kind & syscall::PTRACE_EVENT_MASK != 0 {
// In reality, you'll actually want to handle this event, or
// it makes no sense to listen for it at all. This is just an
// example to show you how you can handle non-breakpoint
// events though.
trace.write(&[syscall::PTRACE_FLAG_WAIT])?;
trace.read(&mut event)?;
}
// This assertion is safe because if the process exits, the write
// call returns ESRCH
assert_eq!(event.kind, syscall::PTRACE_STOP_PRE_SYSCALL);
let mut registers = syscall::IntRegisters::default();
regs.read(&mut registers)?;
......@@ -196,10 +286,10 @@ if pid == 0 {
regs.write(&registers)?;
// trace.write(&[syscall::PTRACE_SYSCALL])?; // wait for the completion of the system call
// trace.write(&[syscall::PTRACE_STOP_POST_SYSCALL])?; // wait for the completion of the system call
trace.write(&[syscall::PTRACE_CONT])?; // run the program to the end, which is like right now
syscall::waitpid(pid, &mut status, 0)?; // reap zombie processes
trace.write(&[])?; // don't set any stops, rather run the program to the end, which is like right now
syscall::waitpid(pid, &mut status, 0)?; // reap zombie process
// trace file dropped here: process tracing detached and process
// implicitly resumed if it hadn't already been, y'know, killed
......@@ -231,6 +321,10 @@ writing memory. This was what the original RFC first suggested, before
implements a `ptrace(...)` function as a userspace library over their
ProcFS.
There are lots of possible alternatives, one of which was implemented
and tried out. However, out of the ones I've considered, this one
should be the most scalable over time.
# Unresolved questions
[unresolved]: #unresolved-questions
......@@ -244,3 +338,4 @@ ProcFS.
- Should one be able to override behavior of `SIG_DFL`-handled
signals?
- How should `int3` be user-handled? (perhaps by catching `SIGTRAP`?)
- How should a user read memory maps?