Since syscalls are near the very bottom of any software stack, their misbehavior can be particularly hard to test for. Stuff like running out of disk space, network connections timing out or bumping into system limits all ultimately manifest as a syscall failing somewhere. If you want your code to be resilient to these kinds of failures, it sure would be nice if you could simulate these situations easily.

Now, you might already know strace lets you trace system calls, but did you know it can also change their behavior? You can modify their input and output, inject errors and add time delays (though be aware of the limitations).

To demonstrate how it could be useful, I’ve added the ability to easily use this functionality from Python and Ruby to Cirron (my grab-bag of a project). You can now do things like:

Test how code handles insufficient space (I’ll use Python for demonstration purposes here; check out the readme for examples of how to do this in Ruby).

from cirron import Injector

injector = Injector()
# Make the "openat" syscall return the ENOSPC error.
injector.inject("openat", "error", "ENOSPC")

# All "openat" calls will return ENOSPC within this context.
with injector:
    # Fails with "No space left on device".
    f = open("test.txt", "w")

# From here on "openat" behaves normally again.

Inject occasional errors and delays into network operations.

(...)
# Make every other "connect" syscall return the ETIMEDOUT error.
# when="2+2" means to perform the injection for the second syscall
# invocation and then again every two invocations.
injector.inject("connect", "error", "ETIMEDOUT", when="2+2")

# Also add 1s of latency to "send".
injector.inject("send", "delay_exit", "1s")

with injector:
    (...)

Simulate signals being sent before particular syscalls.

(...)
# Simulate user pressing Ctrl+C before the first "read".
injector.inject("read", "signal", "SIGINT", when="1")

with injector:
    (...)

And more! In addition to the error, delay_exit and signal actions demonstrated above, it also supports retval for changing a return value without making it an error, delay_enter for delaying entry into a syscall rather than exit from it, and poke_enter and poke_exit for modifying the process memory on syscall entry or exit. See the “Tampering” section of strace’s man page for details on all of these, including a description of the format of the when argument.
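
To get a feel for the when format, here is a small Python sketch that mimics which invocations an expression selects. It is not strace’s own parser and only covers the three simplest forms ("N", "N+" and "N+S"); the function name when_matches is made up for illustration.

```python
def when_matches(expr: str, invocation: int) -> bool:
    """Return True if the invocation-th call (counting from 1) is selected.

    Illustrative only: handles "N", "N+" and "N+S"; see strace's man page
    for the full grammar.
    """
    first, plus, step = expr.partition("+")
    first = int(first)
    if not plus:
        # "N": only the N-th invocation.
        return invocation == first
    # "N+" means every invocation from the N-th on, i.e. a step of 1.
    step = int(step) if step else 1
    return invocation >= first and (invocation - first) % step == 0


# when="2+2" from the example above: second call, then every second call.
print([n for n in range(1, 9) if when_matches("2+2", n)])  # [2, 4, 6, 8]
```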

If you want to try this but are unsure which syscalls your code uses, you can easily get a list of them with Cirron too:

from cirron import Tracer

# Tracer records all syscalls made within the context.
with Tracer() as t:
    print("Hello!")

print(t)
# [Syscall(name='write', args='1, "Hello!\\n", 7', retval='7', duration='0.000197', timestamp='1725900869.238673', pid='438862')]
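
With longer traces it can help to tally the syscall names to see which ones are worth injecting into. The sketch below does that with a stand-in record type and made-up sample data, since how you pull the individual records out of the Tracer is not shown here; only the name field mirrors the output above.

```python
from collections import Counter, namedtuple

# Stand-in for the records shown above; only the name field is used here.
Syscall = namedtuple("Syscall", ["name", "retval"])

# Made-up sample data, not real Tracer output.
recorded = [
    Syscall("openat", "3"),
    Syscall("write", "7"),
    Syscall("write", "12"),
    Syscall("close", "0"),
]

# Count how often each syscall appears, most frequent first.
counts = Counter(call.name for call in recorded)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```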


How?

Cirron implements the Injector in the simplest way possible: it executes strace with the appropriate inject options and points it to the current process. After leaving the injection context the strace process is killed. If you were to make heavy use of this, it’s probably worth implementing the functionality directly, rather than executing strace every time (let me know if you’d find it useful if Cirron did this more efficiently).
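
As a rough illustration of that approach (not Cirron’s actual code), the sketch below assembles an strace command line for a set of injection rules. The helper name build_strace_command and the choice of the -f and -o flags are my assumptions; the -p and -e inject=... options are the ones described in strace’s man page.

```python
import os


def build_strace_command(pid, injections):
    """Build an strace argv that attaches to pid and applies injections.

    injections is a list of (syscall, action, value, when) tuples, with
    when set to None if the rule should apply to every invocation.
    """
    # -f also attaches to the other threads of the process; the trace
    # output itself is not needed here, so it is discarded.
    cmd = ["strace", "-f", "-p", str(pid), "-o", os.devnull]
    for syscall, action, value, when in injections:
        rule = f"{syscall}:{action}={value}"
        if when is not None:
            rule += f":when={when}"
        cmd += ["-e", f"inject={rule}"]
    return cmd


cmd = build_strace_command(
    os.getpid(),
    [("openat", "error", "ENOSPC", None), ("connect", "error", "ETIMEDOUT", "2+2")],
)
print(" ".join(cmd))
# A real implementation would now start this with subprocess.Popen(cmd), wait
# until strace has attached, and terminate it when the context exits.
```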

So how does strace do this? Ptrace! It attaches to (or seizes; dramatic!) a process with ptrace(PTRACE_ATTACH, ...) or ptrace(PTRACE_SEIZE, ...). Resuming the traced process with PTRACE_SYSCALL then causes it to stop on (among other things) entry to and exit from syscalls. Strace can then inspect and modify the program before letting it continue.

To inject a delay, as with delay_enter or delay_exit, strace simply waits before continuing the process.

To modify the syscall’s inputs or output, it uses either PTRACE_POKEDATA (or process_vm_writev) to mess with the traced process’s memory (poke_enter, poke_exit), or PTRACE_POKEUSER to modify the USER area, which holds the process’s register state and lets you, for example, change the return value (error, retval).

Limitations

There’s the obvious performance impact, particularly when shelling out to strace rather than using ptrace directly.

Also consider what actually happens to the call. According to strace’s man page, error and retval injection replace the syscall number with -1 (an invalid syscall), unless a substitute is given with the syscall option, so the original call is never executed: a “write” made to fail this way writes nothing. The remaining actions (signal, the delays and the pokes) leave the call in place, so it still runs with all of its side effects. Along the same lines, the delay injections delay before syscall entry or after exit, which may have a very different impact compared to delaying something in the middle of the call.