CESA-2009-001 - rev 1
Linux syscall interception technologies partial bypass
Programs affected: Probably many; for example systrace with ptrace()
Fixed: Systrace 1.6f for 32-bit build (but you are cautioned that ptrace()
is not a security technology).
Severity: Syscall policy violation.
Various security technologies make use of syscall filtering. Such technologies
are very powerful because they restrict a compromise not just in terms of
access to files, networks and processes -- but also access to the rich kernel
API (a great source of ring 0 bugs).
Syscall filtering technologies typically make an allow / deny decision based
on the syscall (identified by a number) and sometimes also the exact arguments
to the syscall. A vulnerability exists due to the identification of the syscall by
number. On 64-bit aware Linux kernels (x86_64), the syscall number can map to
either the 32-bit or 64-bit syscall table. Since the syscall tables are
different for 32-bit vs. 64-bit, and the user space process gets to control
which table it hits, the syscall number check can often be fooled.
For example, a syscall filter technology might be monitoring a 64-bit process,
and configured to allow some subset of the very common open() syscall. That's
syscall number 2 in 64-bit land. However, the monitored process can switch to
32-bit mode and issue syscall 2. That appears to be open() to the monitor but
will execute as fork() in the kernel - possibly leading to an unmonitored
process.
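The mismatch can be made concrete with a small lookup, given here as a sketch only. The two tables hold the first six entries of the i386 and x86_64 syscall tables; the function names syscall_name() and filter_is_fooled() are made up for this illustration and do not belong to any monitoring tool.

```c
#include <stddef.h>
#include <string.h>

/* First six syscall numbers of each ABI, indexed by syscall number. */
static const char *const i386_names[] = {
    "restart_syscall", "exit", "fork", "read", "write", "open"
};
static const char *const x86_64_names[] = {
    "read", "write", "open", "close", "stat", "fstat"
};

/* Hypothetical helper: name of syscall `nr` under the chosen table,
 * or NULL if `nr` is outside the small excerpt above. */
const char *syscall_name(int nr, int is_32bit_entry)
{
    if (nr < 0 || nr >= 6)
        return NULL;
    return is_32bit_entry ? i386_names[nr] : x86_64_names[nr];
}

/* Hypothetical helper: returns 1 when a monitor that assumes the 64-bit
 * table would misidentify a syscall issued through the 32-bit entry. */
int filter_is_fooled(int nr)
{
    const char *seen = syscall_name(nr, 0);     /* what the monitor assumes */
    const char *executed = syscall_name(nr, 1); /* what the kernel runs */
    if (seen == NULL || executed == NULL)
        return 0;
    return strcmp(seen, executed) != 0;
}
```

With these tables, syscall_name(2, 0) yields "open" while syscall_name(2, 1) yields "fork", which is exactly the disagreement described above.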
Here is a sample piece of code which does a 32-bit syscall:
int main(int argc, const char* argv[])
{
  /* Syscall 1 is exit on i386 but write on x86_64. */
  asm volatile("movl $1, %eax\n"
               "int $0x80\n");
}
When built 64-bit, and run under strace on my 64-bit machine, the difference
in opinion on the syscall is apparent:
write(1, "\370V\355\365\377\177", 140737319359320 <unfinished ... exit status 208>
The monitor sees write() but the kernel sees (and executes!) exit().
Detecting this situation has some subtleties. Which syscall table the kernel
hits depends on both the instruction used to trap into the kernel, and also
the "long mode" bit in the current descriptor referred to by the code segment
(CS) register. int 0x80 and sysenter always cause a 32-bit syscall, but
syscall looks at the descriptor. Therefore,
to securely monitor a 32-bit process, it is sufficient just to validate that
the CS register references a privileged 32-bit descriptor. Unfortunately, to
securely monitor a 64-bit process, not only must the CS register be checked,
but the instruction initiating the syscall must be checked. This involves
reading user-space memory, which is of course subject to well-documented lethal
race conditions when other processes share a writeable address space.
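The decision logic described above might be sketched as follows. This is an illustration under stated assumptions, not a hardened check: the selector values 0x23 and 0x33 are the user code segment selectors conventionally used by x86_64 Linux, and the function and enum names are invented for this example. A real monitor would have to fetch CS and the instruction bytes from the traced process (for instance via ptrace()), and the instruction bytes remain subject to the race noted above. The sketch also ignores cases where the saved instruction pointer does not directly follow the trapping instruction.

```c
#include <stdint.h>

/* Assumption: user code segment selectors on x86_64 Linux. */
#define USER32_CS 0x23u /* descriptor without the long mode bit */
#define USER64_CS 0x33u /* descriptor with the long mode bit set */

/* Hypothetical result type for this sketch. */
enum abi { ABI_UNKNOWN = 0, ABI_I386, ABI_X86_64 };

/*
 * cs:   code segment selector of the stopped process
 * insn: the two bytes immediately before its instruction pointer,
 *       i.e. the instruction assumed to have trapped into the kernel
 */
enum abi classify_syscall_abi(uint64_t cs, const uint8_t insn[2])
{
    if (insn[0] == 0xcd && insn[1] == 0x80)
        return ABI_I386;            /* int $0x80: always 32-bit table */
    if (insn[0] == 0x0f && insn[1] == 0x34)
        return ABI_I386;            /* sysenter: always 32-bit table */
    if (insn[0] == 0x0f && insn[1] == 0x05) {
        /* syscall: the table depends on the CS descriptor */
        if (cs == USER64_CS)
            return ABI_X86_64;
        if (cs == USER32_CS)
            return ABI_I386;
    }
    return ABI_UNKNOWN;             /* anything else: refuse to guess */
}
```

Returning ABI_UNKNOWN for anything unrecognised lets a caller deny by default rather than fall back to one table.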