Written on: 2013-02-01

Stack traces on ARM

Not too long ago, I was trying to profile a Linux application running on an embedded ARM platform. I found it remarkably difficult to capture representative stack traces. Here's why...

To record a stack trace, you need to walk the stack. That is, given a stack frame you need to find your way to the caller's frame, then to that caller's caller, and so on until you reach the very first frame at the top of the stack (modern ARM stacks grow downwards, so callers live at higher addresses).

So how might you do that? One common technique is to use a frame pointer. For example, on an x86 machine the ebp register is used to point to the beginning of the current stack frame. If the calling convention requires that ebp be pushed onto the stack in the function's prologue, you can use the current ebp to look up the value of the previous ebp. In effect, you get a linked list of stack frames.
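As a rough illustration of that linked list, here is a minimal sketch in C. The frame_record layout and the walk_frames name are hypothetical, chosen to resemble what "push ebp; mov ebp, esp" leaves behind; a real walker would start from the live frame pointer register rather than a struct handed to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration: the frame pointer points at the
 * saved caller's frame pointer, with the return address just above it. */
struct frame_record {
    const struct frame_record *prev; /* saved frame pointer; NULL ends the chain */
    uintptr_t return_address;        /* where this function returns to */
};

/* Follow the chain of frame records, storing up to max return addresses
 * in out (innermost first). Returns the number of frames recorded. */
size_t walk_frames(const struct frame_record *fp, uintptr_t *out, size_t max)
{
    size_t depth = 0;
    assert(out != NULL || max == 0);
    while (fp != NULL && depth < max) {
        out[depth++] = fp->return_address;
        /* The stack grows downwards, so a caller's frame must sit at a
         * higher address. Anything else means the chain is corrupt. */
        if (fp->prev != NULL && (uintptr_t)fp->prev <= (uintptr_t)fp)
            break;
        fp = fp->prev;
    }
    return depth;
}
```

The sanity check on addresses is there because a sampling profiler can interrupt a function halfway through its prologue, when the chain is not yet consistent.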

Once upon a time, the ARM Procedure Call Standard (APCS) specified a similar system on ARM machines. The suggestively named fp register was used to store a frame pointer, which was saved to the stack as part of the function calling convention and so could be used to walk the stack just like ebp on x86.

But this changed with the introduction of the "new" ARM ABI back in 2003. This ABI included a new calling convention, the Procedure Call Standard for the ARM Architecture (AAPCS). (AAPCS, not to be confused with APCS...)

To make matters slightly more confusing, some people refer to the "new" ARM ABI as the ARM EABI. Apparently the "E" stands for "Embedded" and indicates that the ABI is suitable for use on embedded systems... but wasn't the old ARM ABI used for embedded systems too?

Anyway, the ARM (E)ABI is actually more of a family of ABIs than a single specification, as it allows different implementations to make different choices about various operating system-specific things. Hence, on Linux, we get the GNUEABI, which is the GNU variant of the "new" ARM ABI that uses the AAPCS calling convention.

So where does this leave us with respect to frame pointers? Out of luck, mostly. The AAPCS does not reserve any of the registers as a frame pointer. The twelfth register is now called v8 and is just "variable-register 8" with no special purpose.

However, you can still ask your compiler to use frame pointers. If you use GCC to target GNUEABI then frame pointers will be disabled by default. But if you explicitly enable frame pointers then it will use r11 (otherwise known as fp or v8) as a frame pointer, just like in the good old APCS days.
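For what it's worth, here is a small sketch of how you might check what the compiler gives you. It assumes GCC or Clang, since __builtin_frame_address is a compiler builtin rather than standard C, and own_frame_address is a hypothetical name. The build line in the comment uses the -fno-omit-frame-pointer flag to turn frame pointers back on.

```c
#include <assert.h>
#include <stdint.h>

/* Build with frame pointers explicitly enabled, for example:
 *   gcc -O2 -fno-omit-frame-pointer -marm example.c
 * (-marm asks for the ARM instruction set when targeting ARM). */

/* Returns the frame address of this function itself. With frame
 * pointers enabled, this is typically the value held in the frame
 * pointer register while the function runs. */
__attribute__((noinline)) uintptr_t own_frame_address(void)
{
    uintptr_t fp = (uintptr_t)__builtin_frame_address(0);
    assert(fp != 0);
    return fp;
}
```

A walker would start from an address like this one and then follow the saved frame pointers from frame to frame.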

Unfortunately, because AAPCS does not include any conventions for frame pointers, it won't always be saved to the same place on the stack. Registers are saved to the stack in register order, starting with the highest:

r15 is the program counter (pc), which doesn't get saved.

r14 is the link register (lr), which contains the return address. A function that calls other functions has to push this to the stack in its prologue, because its own calls will overwrite lr. So it's usually the first item on the stack. Except when it isn't: a leaf function, i.e. a function that does not call any others, doesn't need to save its return address so it doesn't push lr.

r13 is the stack pointer (sp), which doesn't get saved.

r12 is the intra-procedure-call scratch register (ip), which doesn't get saved.

r11 is our frame pointer!

So the value we want will be either the first or the second item on the stack, depending on whether the current stack frame is for a leaf function or not.

How can we tell which one it is? AAPCS specifies that the stack must grow downwards, therefore the previous stack frame must be at a higher address than the current stack frame. If we're running Linux, we know that the program will be stored at a lower address than the stack. So we look at the first item on the stack and compare it to our current frame pointer: if it's bigger, it must be the previous frame pointer; if it's smaller, it must be a return address so the previous frame pointer must be the next item on the stack.
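That rule fits in a few lines of C. This is only a sketch: previous_frame is a hypothetical name, and it assumes the frame pointer points at the "first item" described above, so that in a non-leaf function the saved frame pointer is the word just below the saved lr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of the walk. fp[0] is either the saved frame pointer (leaf
 * function) or the saved lr (non-leaf function). In the non-leaf case
 * the saved frame pointer is assumed to be the word below it, fp[-1]. */
uintptr_t *previous_frame(uintptr_t *fp)
{
    assert(fp != NULL);
    uintptr_t first = fp[0];
    if (first > (uintptr_t)fp) {
        /* Bigger than our own frame pointer: it points further up the
         * stack, so it must be the caller's frame pointer. */
        return (uintptr_t *)first;
    }
    /* Smaller: it points into program code, so it is a return address
     * and the caller's frame pointer is the next item. */
    return (uintptr_t *)fp[-1];
}
```

Note that the test leans entirely on code being mapped below the stack, which is why the text above is careful to say "if we're running Linux".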

This combination of frame pointers with a bit of stack-scanning is actually how ARM's own profiling tool works: the Streamline Performance Analyzer in ARM Development Studio 5 (DS-5) does just this.

However, it has a few problems. First of all, it relies on a quirk of GCC. What if you use a different compiler? Does Clang/LLVM also put the frame pointer in r11? Even if you always use GCC, how do you know GCC will always do this? It's not part of any standard so GCC could change this behaviour at any time.

More importantly, it only works in ARM-mode. Modern ARM CPUs support two different instruction sets: ARM and Thumb2. The original Thumb instruction set used 16-bit instructions to achieve better code density than the 32-bit ARM instructions. Thumb2 is a variable-width instruction set that combines the best of both worlds.

When I tested my application, the ARM version was about 12% slower than the Thumb2 version. So I'd really rather use Thumb2 than ARM. Unfortunately, GCC in Thumb2-mode stores the frame pointer in r7. Since r8, r9, etc. may or may not be saved to the stack alongside it, the saved r7 ends up at an unpredictable distance from the saved lr, and there's no reliable way to find the previous frame pointer.

So ARM's profiling tool cannot profile ARM code that uses ARM's Thumb2 instruction set. What about other profiling tools?

OProfile tries to walk the stack using frame pointers but it assumes the old APCS calling convention. So that won't work on any ARM code written in the last decade or so.

Gperftools also tries to use frame pointers. It does slightly better than OProfile in that it assumes GNUEABI... but it misses the fact that r11 is the first item on the stack in leaf functions. So the stack walking breaks whenever a sample lands on a leaf function.

It's simple to patch Gperftools to use the same logic as ARM's Streamline Performance Analyzer. Patch stacktrace_arm-inl.h like so:

- void **new_sp = (void**) old_sp[-1];
+ void **new_sp;
+
+ if (((void**) old_sp[0]) > old_sp) {
+     new_sp = (void**) old_sp[0];
+ } else {
+     new_sp = (void**) old_sp[-1];
+ }

But it still won't work in Thumb2-mode.

Profiling tools aren't the only tools that need to walk stacks. Debuggers do it too. How do they manage when there aren't any frame pointers? Why did ARM remove frame pointers from their calling convention? Did they perhaps do it for the same reason that frame pointers were removed from x86_64 calling conventions? Itanium doesn't require frame pointers either. It's almost like frame pointers aren't really the best way to walk stacks these days...

The DWARF standard for debug information has included Call Frame Information (CFI) since the 1990s (I'm not sure when DWARF-2 was published but it has been the standard on Linux for a couple of decades now). This is how GDB and other debuggers know how to interpret the stack.

So maybe we could use that to generate stack traces... except it's a bit slow. If we're going to be sampling our code at a high frequency, maybe we can't afford to spend a long time calculating each stack trace.

Who else needs to unwind a stack quickly? How about any application written in a language that supports exceptions? The "new" ARM ABI includes an Exception Handling ABI, which turns out to contain exactly what we need. This specifies a system of unwinding tables, indexed by program counter and included in the executable in a special ELF section, that allows C++ programs (and others) to quickly unwind the stack. (Itanium and x86_64 have similar unwinding tables.)
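To give a feel for "indexed by program counter", here is a sketch of the lookup step in C. It follows my understanding of the index table format in ARM's Exception Handling ABI: each entry is two 32-bit words, the first a 31-bit place-relative offset ("prel31") to the start of a function, the second either the value 1 (can't unwind), a word with bit 31 set (unwind instructions stored inline) or another prel31 offset to a fuller table entry. The names exidx_entry, prel31_to_addr and exidx_lookup are hypothetical, and the table's load address is passed in explicitly so that the sketch can run on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXIDX_CANTUNWIND 1u

/* One index table entry: two 32-bit words. */
struct exidx_entry {
    uint32_t fn_offset; /* prel31 offset to the start of the function */
    uint32_t content;   /* EXIDX_CANTUNWIND, inline entry, or prel31 offset */
};

/* Decode a prel31 value: sign-extend from 31 bits, then add the address
 * of the word that holds it. Arithmetic wraps modulo 2^32 on purpose. */
uint32_t prel31_to_addr(uint32_t word, uint32_t word_addr)
{
    uint32_t offset = word & 0x7fffffffu;
    if (offset & 0x40000000u)
        offset |= 0x80000000u;
    return word_addr + offset;
}

/* Binary-search for the last entry whose function starts at or before
 * pc. Entries are sorted by function address. Returns NULL if pc lies
 * before the first function in the table. */
const struct exidx_entry *exidx_lookup(const struct exidx_entry *table,
                                       size_t count, uint32_t table_addr,
                                       uint32_t pc)
{
    size_t lo = 0, hi = count;
    assert(table != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t word_addr =
            table_addr + (uint32_t)(mid * sizeof(struct exidx_entry));
        if (prel31_to_addr(table[mid].fn_offset, word_addr) <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &table[lo - 1];
}
```

Finding the entry is only the first half of the job: the unwinder then has to interpret the unwind instructions it points to in order to recover the caller's registers, which this sketch doesn't attempt.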

ARM calls these sections EXIDX (for Exception Index, I think) and the ELF section itself is named .ARM.exidx. They are automatically included in any C++ program: for a C program you might have to specify -funwind-tables to get them.
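One way to exercise these tables from C, without parsing them yourself, is the unwinder that ships with GCC's runtime. The sketch below uses _Unwind_Backtrace and _Unwind_GetIP from <unwind.h>; trace_state, record_frame and capture_backtrace are hypothetical names. To be clear, this is not something I tried on the ARM target described here, and I'm assuming the code is built with -funwind-tables (or an equivalent default) so that the tables exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unwind.h>

struct trace_state {
    uintptr_t *pcs; /* where to store the program counters */
    size_t max;     /* capacity of pcs */
    size_t count;   /* frames recorded so far */
};

/* Called by the unwinder once per frame, innermost first. Returning
 * anything other than _URC_NO_REASON stops the walk. */
static _Unwind_Reason_Code record_frame(struct _Unwind_Context *ctx, void *arg)
{
    struct trace_state *state = arg;
    if (state->count >= state->max)
        return _URC_END_OF_STACK;
    state->pcs[state->count++] = (uintptr_t)_Unwind_GetIP(ctx);
    return _URC_NO_REASON;
}

/* Fill pcs with up to max program counters from the current call stack.
 * Returns the number of frames recorded. */
size_t capture_backtrace(uintptr_t *pcs, size_t max)
{
    struct trace_state state = { pcs, max, 0 };
    assert(pcs != NULL || max == 0);
    _Unwind_Backtrace(record_frame, &state);
    return state.count;
}
```

Whether this is fast enough, or safe enough, to call from a sampling signal handler is a separate question that the sketch doesn't answer.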

So how do we use them? Luckily, Android's debuggerd already does, so we can take a look at system/core/debuggerd/arm/unwind.c and do what they do. Alternatively, rather than writing our own profiler, we can get Gperftools to use this technique by compiling it with libunwind.

At least, libunwind claims it can handle EXIDX. I confess I never got it to work. By the time I reached this stage I had already obtained the information I needed through other means (mostly by judicious use of timestamps and printf...)