Infrastructure · High severity

Sidekiq Memory Bloat

Long-running Sidekiq worker processes whose RSS grows monotonically over hours or days, eventually triggering the kernel OOM killer. The cause is rarely a true leak in Ruby code; it is usually heap fragmentation in glibc's default malloc allocator.

Before / After

Problematic Pattern
# Default Dockerfile for legacy Rails Sidekiq worker
FROM ruby:3.2
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .
CMD ["bundle", "exec", "sidekiq"]

# RSS after 6 hours: 3.2 GB
# Peak working set: ~800 MB
# Heap fragmentation: ~75% of RSS
# Kubernetes: OOMKilled every 8 hours.
Target Architecture
# Dockerfile with jemalloc preloaded
FROM ruby:3.2
RUN apt-get update && \
  apt-get install -y --no-install-recommends libjemalloc2 && \
  rm -rf /var/lib/apt/lists/*

# Path is correct for the Debian-based ruby:3.2 image on amd64.
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
ENV MALLOC_CONF=background_thread:true,metadata_thp:auto

WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .
CMD ["bundle", "exec", "sidekiq"]

# RSS after 6 hours: ~950 MB (stable)
# No OOMKilled in production.

Why this hurts

Sidekiq workers are long-lived Ruby processes that allocate enormous numbers of short-lived objects. Ruby’s memory manager (mmap-backed pages divided into slots) requests memory from libc’s malloc, which in turn requests it from the OS. glibc malloc uses per-thread arenas that tend to hold freed memory rather than return it to the OS, producing fragmentation: freed chunks scattered across many arenas, none large enough to satisfy a new large allocation but collectively consuming gigabytes.

The process RSS stays high even after Ruby’s garbage collector has marked most objects as unreachable, because free() in glibc does not always return pages to the kernel. top and ps show RSS climbing while GC.stat shows a stable heap_live_slots count. Nothing in APM flags a leak because Ruby’s object graph is not growing; the growth is below the GC layer, in the C-level allocator. Ruby’s own heap compaction (GC.compact) does not help because it only moves objects between Ruby’s own heap pages; it cannot defragment the C-level malloc arenas underneath.
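One way to observe this divergence in a running worker is to sample the kernel's view (RSS from /proc) and Ruby's view (GC.stat) side by side. The helper below is a minimal, Linux-only sketch; the method name rss_vs_heap and the hardcoded 40-byte base slot size are illustrative assumptions:

```ruby
# Compare the kernel's RSS figure with Ruby's own live-slot count.
# Linux-only: reads /proc/self/status. Run it periodically (for example
# from Sidekiq server middleware) and log both numbers together.
def rss_vs_heap
  rss_kb = File.read("/proc/self/status")[/^VmRSS:\s+(\d+)/, 1].to_i
  stat   = GC.stat
  {
    rss_mb:          rss_kb / 1024,
    heap_live_slots: stat[:heap_live_slots],
    # Rough bytes held by live Ruby objects, assuming the common
    # 40-byte base slot size.
    live_slot_mb:    stat[:heap_live_slots] * 40 / (1024 * 1024),
  }
end
```

When rss_mb keeps climbing while heap_live_slots stays flat, the growth is below the GC layer, which is the fragmentation signature described above.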

When the container memory limit is hit, the kernel OOM killer terminates the process (surfaced by Kubernetes as OOMKilled), losing any in-flight job state that Sidekiq had not yet persisted back to Redis. Sidekiq::Shutdown handlers do not run, so current jobs crash mid-execution and go to the retry queue. Scheduling becomes noisy: restart storms ripple through the cluster, queue depth spikes every time a pod dies, and alerting thresholds must be tuned around the forced recycling pattern.
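Because a hard OOM kill skips Sidekiq::Shutdown handlers, long jobs bear the full blast radius of each restart. One mitigation while the allocator fix rolls out is to checkpoint progress so a retried job resumes rather than restarting from scratch. The sketch below illustrates the idea only: the BackfillWorker name and cursor key are invented, and the store is injected so the resume logic is visible without a Redis dependency (a real worker would include Sidekiq::Job and use Redis):

```ruby
# Sketch of a resumable job. The injected store stands in for Redis;
# anything responding to #[] and #[]= works.
class BackfillWorker
  def initialize(store)
    @store = store
  end

  def perform(batch_ids)
    key    = "backfill:cursor"
    cursor = (@store[key] || 0).to_i

    batch_ids.drop(cursor).each_with_index do |id, offset|
      process(id)
      # Persist the cursor after every item, so an OOM kill between
      # items costs at most one unit of re-done work on retry.
      @store[key] = cursor + offset + 1
    end
    @store[key] = nil # done; clear the checkpoint
  end

  def process(id)
    # real work goes here
  end
end
```

On retry after a kill, perform picks up from the stored cursor instead of reprocessing the whole batch.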

Preloading libjemalloc via LD_PRELOAD swaps the allocator without code changes. jemalloc uses fewer arenas, more aggressive coalescing, and a background thread that returns unused pages to the OS. Typical production observation: RSS stabilizes at 1.5-2x the working set instead of 4-6x, and OOM kills stop entirely.
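After deploying, it is worth confirming the preload actually took effect inside the container, since a typo in the LD_PRELOAD path fails silently and the process falls back to glibc. A quick Linux-only check reads the process's own memory maps (the helper name is illustrative):

```ruby
# Returns true if libjemalloc is mapped into the current process.
# Linux-only: inspects /proc/self/maps. Run inside the container, e.g.
# via kubectl exec against a worker pod.
def jemalloc_loaded?
  File.foreach("/proc/self/maps").any? { |line| line.include?("libjemalloc") }
end

puts jemalloc_loaded?
```

Another confirmation is setting MALLOC_CONF=stats_print:true on a throwaway process, which makes jemalloc dump allocator statistics to stderr at exit; glibc malloc prints nothing.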

See also: Sidekiq jemalloc Legacy Rails PoC.

Get Expert Help

Inheriting a legacy Rails codebase with this problem? Request a Technical Debt Audit.