HUGLO: Hyper-Ultra-Giga Low-Overhead Tracing Profiler for Ruby
I’ve built what I think is a pretty neat Ruby tracing profiler. It captures four event streams: Ruby function calls, system calls, thread-state changes, and garbage-collection activity, while adding less than 30 ns of overhead per Ruby function call, low enough for continuous use in large-scale production systems.
As far as I know, no other Ruby tracer offers this mix of signals at this cost. If you’re aware of one, please let me know and I’ll add a note here (and remove a couple of adjectives from the title).
I haven’t open-sourced the code yet because it is very much in a proof-of-concept state and I’m busy with other projects. If you think it would be valuable to you, let me know. If I see enough interest I’ll move it up my priority list.
Overhead Measurement
Above is a runtime histogram of 200 runs of a sample workload on an i5-13500 system running Ubuntu 24, comparing this profiler against baseline Ruby and ruby-prof 1.7.1, the existing profiler that comes closest in overhead. Estimated per-function-call overhead is 23 ns for this profiler and 538 ns for ruby-prof.
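The exact benchmark isn't reproduced here, but the per-call estimate can be derived roughly as follows: time a call-heavy workload once with the tracer off and once with it on, then divide the runtime difference by the number of calls. The snippet below is my own sketch of that arithmetic, not the original methodology.

# Rough per-call overhead estimate (illustrative reconstruction, not the
# original benchmark): time N cheap method calls, run this once with the
# tracer disabled and once with it enabled, and divide the runtime
# difference by N.
N = 10_000_000

def noop; end

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
N.times { noop }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start

puts format("%.1f ns per call", elapsed / N * 1e9)
# overhead per call ≈ (ns/call with tracing) - (ns/call baseline)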
Why Trace in Production
In a word: outliers. Sampling profilers show where a program spends its time on average. However, issues like high tail latency are, by definition, poorly represented in an average view. That’s where a tracing profiler comes in handy.
Let me illustrate this with a sophisticated SaaS simulator that handles a bunch of API calls and records the duration of each one:
durations = []
100_000.times do
  start_time = Time.now
  api_handler
  durations << Time.now - start_time
end
Let’s assume api_handler is a black-box function with a P99.9 latency SLO of < 10 ms. Running the snippet yields:
P50 latency: 1.1 ms
P99.9 latency: 101.3 ms
Our median (P50) looks fine, but the P99.9 violates the SLO.
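In case you want to reproduce those numbers from the durations array, here is one way to compute the percentiles; the sort-and-index helper below is my own sketch, not part of the original snippet.

# Compute latency percentiles from the recorded durations (sketch).
# Assumes `durations` is the array filled in by the loop above, in seconds.
def percentile(samples, pct)
  sorted = samples.sort
  index = ((pct / 100.0) * (sorted.length - 1)).round
  sorted[index]
end

puts "P50 latency:   #{(percentile(durations, 50)   * 1000).round(1)} ms"
puts "P99.9 latency: #{(percentile(durations, 99.9) * 1000).round(1)} ms"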
If you care about latency, you’re probably already running a sampling profiler, so let’s see what it shows:
We see that two methods account for most of the runtime: api_handler->Object#a and api_handler->Object#b. Of those, a consumes the majority of the time slice. You might therefore conclude that optimizing a is the right move, but in this case, that would be a waste of effort!
To demonstrate this, let’s record a trace from one of the slow api_handler executions using our new profiler. First, we instrument the code:
+ require 'tracing'
durations = []
+ SLA = 0.01
100_000.times do
  start_time = Time.now
+ Trace.reset
  api_handler
  duration = Time.now - start_time
+ if duration > SLA
+   Trace.save("api_call_trace.json")
+ end
  durations << duration
end
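The pattern here is a simple flight recorder: always trace, but only persist the trace when a call blows its budget. If you instrument more than one call site, it could be wrapped in a small helper; the wrapper below is my own sketch and only uses the Trace.reset / Trace.save calls shown in the diff.

# Hypothetical helper around the Trace API from the diff above.
# Trace.reset and Trace.save come from the profiler; the wrapper
# name and structure are illustrative.
def with_trace_on_slow_call(slo_seconds, path)
  Trace.reset
  start_time = Time.now
  result = yield
  Trace.save(path) if Time.now - start_time > slo_seconds
  result
end

With that helper, the body of the loop could become with_trace_on_slow_call(SLA, "api_call_trace.json") { api_handler }.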
Now let’s look at one of the slow traces:
Here we see the exact opposite: b now consumes the majority of the time slice. So who’s right?
Let’s peek inside the api_handler black box:
# api_handler always executes a, which sleeps for 1 ms
# 0.1% of the time it also executes b, which sleeps for 100 ms
def api_handler
  a
  if rand >= 0.999
    b
  end
end

def a
  sleep(0.001)
end

def b
  sleep(0.1)
end
The sampling profiler correctly shows that over the entire run, we spend more time in a, but when chasing tail-latency issues, that is not what matters. Here, a tracing profiler is the right tool for the job and b is the right function to optimize.
One small additional advantage of tracing is that it gives us a true time axis: events actually occur in the order shown, unlike in a sampled flame graph. This can be quite helpful when debugging.
Thread-State Changes
Another event stream worth highlighting is thread-state changes, so let me show why tracking them helps. You might assume that whenever you see a stack trace in a trace view, like the one outlined by the red box below, the code is actively executing.
But that isn’t necessarily the case. Notice the light-green Runnable slice and the dark-green Running slice in the example above. That track shows that the Ruby thread was not running at first, later became runnable, and finally began executing. In this case the reason is fairly clear when you inspect the stack trace: the thread is in sleep, waiting on a futex, and in the yellow-and-purple CPU 12 track you can see the Swapper process running on the core instead.
But less obvious cases exist as well: if the CPU you’re running on is oversubscribed, the kernel scheduler may boop your thread off its core as it tries to give other threads time to run. If code you own starts running slowly because someone else is stealing your cycles, you want to know!
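You don’t even need a tracer to get a first hint that this is happening: comparing a thread’s CPU time against wall-clock time already reveals off-CPU gaps. The snippet below is a rough illustration using Ruby’s standard clocks, not part of the profiler itself.

# Rough check for off-CPU time: if wall time grows much faster than
# thread CPU time, the thread spent part of the interval descheduled
# (sleeping, blocked, or preempted). Illustrative only.
wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
cpu_start  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)

api_handler

wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start
cpu  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - cpu_start

puts format("wall: %.1f ms, on-CPU: %.1f ms, off-CPU: %.1f ms",
            wall * 1000, cpu * 1000, (wall - cpu) * 1000)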