Infrastructure · Medium severity

Memory-Heavy Rake Enumeration

Data migration or maintenance Rake tasks that call Model.all.each or Model.where(...).map load millions of rows into memory as Ruby objects before iterating, crashing the process with an OOM kill or pushing the server deep into swap.

Before / After

Problematic Pattern
namespace :backfill do
  task user_slugs: :environment do
    # Loads 8M rows into memory.
    # Allocates 8M User objects.
    # OOM before even starting the update loop.
    User.all.each do |user|
      user.update!(slug: user.email.parameterize)
    end
  end
end
Target Architecture
namespace :backfill do
  task user_slugs: :environment do
    # find_each loads 1,000 rows per batch (the default),
    # releasing each batch to GC before loading the next.
    # Use .select to load only the columns you need
    # when rows contain wide text/jsonb columns.
    User
      .where(slug: nil)
      .select(:id, :email)
      .find_each(batch_size: 1_000) do |user|
        user.update_columns(
          slug: user.email.parameterize
        )
      end
  end

  # For pure SQL backfills, skip Ruby entirely:
  task user_slugs_sql: :environment do
    User.where(slug: nil).in_batches(of: 1_000) do |batch|
      batch.update_all(
        "slug = regexp_replace(
           lower(email),
           '[^a-z0-9]+', '-', 'g'
         )"
      )
    end
  end
end

Why this hurts

Each ActiveRecord instance carries the attribute hash, type-cast values, association caches, and dirty-tracking metadata. For a User model with a dozen columns, including one text biography and one jsonb preferences column, a single instance averages 5-10 KB in memory. Loading 8 million rows with .all therefore allocates 40-80 GB of Ruby objects, far exceeding any reasonable server's memory. The Ruby process thrashes into swap and either hangs indefinitely or gets OOM-killed before the first update! runs.
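A rough floor for the per-row cost is visible with ObjectSpace even outside Rails. This standalone sketch measures a plain attribute hash standing in for one row; the column names and sizes are illustrative, and a real ActiveRecord instance is larger still because of type-casting, dirty-tracking, and association-cache state:

```ruby
require "objspace"

# Illustrative attribute hash standing in for one User row's raw data.
# The column names and payload sizes here are assumptions for the demo.
row = {
  id: 1,
  email: "user@example.com",
  biography: "x" * 2_000,                        # wide text column
  preferences: { theme: "dark", locale: "en" }   # jsonb-ish payload
}

# memsize_of is shallow, so sum over nested values for a rough total.
bytes = ObjectSpace.memsize_of(row) +
        row.values.sum { |v| ObjectSpace.memsize_of(v) }

puts "~#{bytes} bytes per row before ActiveRecord overhead"
```

Multiply that figure by 8 million and the raw attribute data alone lands in the tens of gigabytes before any ActiveRecord bookkeeping is counted.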

Even when memory is sufficient, garbage collection dominates runtime. Ruby’s generational GC promotes surviving objects to older generations, and batch loads push the major GC into running on every collection cycle, stalling the process for seconds at a time. The task that should have taken an hour takes a day, and progress indicators stop updating during GC pauses. On replicated databases, each update! call ships a WAL entry, flooding replication bandwidth and widening replica lag into multi-minute territory during the backfill window.
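The promotion effect described above can be observed with GC.stat in plain Ruby. This sketch retains every allocation, the way Model.all.each retains every instantiated row, and watches the old-generation object count climb after a few full collections (the 500,000 figure is an arbitrary demo size):

```ruby
before_old = GC.stat(:old_objects)

# Retain every allocation, as Model.all does: nothing becomes garbage
# until the whole collection is released.
retained = Array.new(500_000) { +"row-payload" }

# Objects are promoted to the old generation after surviving a few GCs,
# which is what makes subsequent major collections expensive.
4.times { GC.start }

after_old = GC.stat(:old_objects)
puts "old-generation objects grew by #{after_old - before_old}"
retained.size # keep the array alive past the GC calls
```

Once millions of row objects sit in the old generation, every major collection has to scan them all, which is where the multi-second pauses come from.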

in_batches(of: N) is safer but still requires careful sizing. Raising the batch size to 10,000 is too aggressive for tables with wide rows: ten thousand rows at 10 KB each is 100 MB per batch, all held in memory by the ActiveRecord result materializer on top of the Ruby object overhead during iteration, and PostgreSQL’s statement_timeout may fire before the batch completes on slow disks. The default batch size of 1,000 strikes a reliable balance for most production workloads, and using .select(:id, :needed_column) reduces per-row memory proportionally by avoiding wide-column transport.
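The sizing arithmetic above can be captured in a small helper. A sketch, where the 10 MB budget and the per-row byte estimates are assumed inputs rather than measured values:

```ruby
# Pick a batch size that keeps one materialized batch under a memory
# budget, given an estimated per-row footprint in bytes.
# budget_bytes of 10 MB is an assumed default, matching ~1,000 wide
# rows of ~10 KB each.
def safe_batch_size(per_row_bytes:, budget_bytes: 10 * 1024 * 1024)
  [budget_bytes / per_row_bytes, 1].max
end

safe_batch_size(per_row_bytes: 10 * 1024)  # wide text/jsonb rows => 1024
safe_batch_size(per_row_bytes: 200)        # narrow id+email rows => much larger
```

The helper only encodes the estimate; in practice you would still confirm batch duration against statement_timeout before committing to a size.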

update_columns skips callbacks and validations, which is the intent for mechanical backfills. update_all with a SQL expression lets PostgreSQL compute the new value server-side, bypassing Ruby object allocation entirely and running orders of magnitude faster than row-by-row updates.
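For ASCII emails the SQL expression mirrors what parameterize produces (parameterize additionally transliterates accented characters and trims leading or trailing hyphens). A plain-Ruby sketch of the same transformation, useful for spot-checking the two backfill paths against each other:

```ruby
# Ruby equivalent of the server-side expression:
#   regexp_replace(lower(email), '[^a-z0-9]+', '-', 'g')
def sql_style_slug(email)
  email.downcase.gsub(/[^a-z0-9]+/, "-")
end

sql_style_slug("John.Doe@Example.com") # => "john-doe-example-com"
```

If the column can contain non-ASCII addresses, verify both paths produce matching slugs on a sample before running the SQL variant.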

Get Expert Help

Inheriting a legacy Rails codebase with this problem? Request a Technical Debt Audit.