
Let’s DO this: detecting Workers Builds errors across 1 million Durable Objects

2025-05-29

8 min read

Cloudflare Workers Builds is our CI/CD product that makes it easy to build and deploy Workers applications every time code is pushed to GitHub or GitLab. What makes Workers Builds special is that projects can be built and deployed with minimal configuration. Just hook up your project and let us take care of the rest!

But what happens when things go wrong, such as failing to install tools or dependencies? What usually happens is that we don’t fix the problem until a customer contacts us about it, at which point many other customers have likely faced the same issue. This can be a frustrating experience for both us and our customers because of the lag time between issues occurring and us fixing them.

We want Workers Builds to be reliable, fast, and easy to use so that developers can focus on building, not dealing with our bugs. That’s why we recently started building an error detection system that can detect, categorize, and surface all build issues occurring on Workers Builds, enabling us to proactively fix issues and add missing features.

It’s also no secret that we’re big fans of being “Customer Zero” at Cloudflare, and Workers Builds is itself a product that’s built end-to-end on our Developer Platform using Workers, Durable Objects, Hyperdrive, Containers, Queues, Workers KV, R2, and Workers Observability.

In this post, we will dive into how we used the Cloudflare Developer Platform to check for issues across more than 1 million Durable Objects.

Background: Workers Builds architecture

Back in October 2024, we wrote about how we built Workers Builds entirely on the Workers platform. To recap, Builds is built using Workers, Durable Objects, Workers KV, R2, Queues, Hyperdrive, and a Postgres database. Some of these pieces (for example, Queues and KV) were not yet in place when we launched back in October, but the core of the architecture is the same.

A client Worker receives GitHub/GitLab webhooks and stores build metadata in Postgres (via Hyperdrive). A build management Worker uses two Durable Object classes: a Scheduler class to find builds in Postgres that need scheduling, and a class called BuildBuddy to manage the lifecycle of a build. When a build needs to be started, Scheduler creates a new BuildBuddy instance which is responsible for creating a container for the build (using Cloudflare Containers), monitoring the container with health checks, and receiving build logs so that they can be viewed in the Cloudflare Dashboard.
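To make this concrete, the “one Durable Object per build” pattern looks roughly like the sketch below from the Scheduler’s side. The BUILD_BUDDY binding name and the startBuild method are illustrative, not our exact API:

// Sketch: hand a build off to its own BuildBuddy Durable Object.
async function startBuild(env: Bindings, buildId: number) {
  // One BuildBuddy instance per build: derive the Durable Object ID from the build ID.
  const id = env.BUILD_BUDDY.idFromName(String(buildId))
  const bb = env.BUILD_BUDDY.get(id)
  // With RPC-enabled Durable Objects, methods can be called directly on the stub.
  await bb.startBuild({ build_id: buildId })
}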

In addition to this core scheduling logic, we have several Workers Queues for background work such as sending PR comments to GitHub/GitLab.

The problem: builds are failing

While this architecture has worked well for us so far, we found ourselves with a problem: compared to Cloudflare Pages, a concerning percentage of builds were failing. We needed to dig deeper, figure out what was going wrong, and understand how we could improve Workers Builds so that developers can focus more on shipping instead of build failures.

Types of build failures

Not all build failures are the same. We have several categories of failures that we monitor:

  • Initialization failures: when the container fails to start.

  • Clone failures: failing to clone the repository from GitHub/GitLab.

  • Build timeouts: builds that ran past the limit and were terminated by BuildBuddy.

  • Builds failing health checks: the container stopped responding to health checks, e.g. the container crashed for an unknown reason.

  • Failure to install tools or dependencies.

  • Failed user build/deploy commands.

The first few failure types were straightforward, and we’ve been able to track down and fix issues in our build system and control plane to improve what we call “build completion rate”. We define build completion as the following:

  1. We successfully started the build.

  2. We attempted to install tools/dependencies (considering failures as “user error”).

  3. We attempted to run the user-defined build/deploy commands (again, considering failures as “user error”).

  4. We successfully marked the build as stopped in our database.

For example, we had a bug where builds for a deleted Worker would attempt to run and continuously fail, which affected our build completion rate metric.
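In code terms, the classification behind this metric amounts to something like the following sketch (the outcome names are illustrative, not our actual schema; the real metric also requires that the build was successfully marked as stopped in our database):

// Sketch: possible outcomes for a build (names are illustrative).
type BuildOutcome =
  | 'succeeded'
  | 'initialization_failed'
  | 'clone_failed'
  | 'timed_out'
  | 'health_check_failed'
  | 'install_failed' // counted as user error
  | 'user_command_failed' // counted as user error

// A build "completes" if we got far enough to attempt the user's steps,
// even if those steps themselves failed.
function countsAsCompleted(outcome: BuildOutcome): boolean {
  return (
    outcome === 'succeeded' ||
    outcome === 'install_failed' ||
    outcome === 'user_command_failed'
  )
}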

User error

We’ve made a lot of progress improving the reliability of build and container orchestration, but we had a significant percentage of build failures in the “user error” metric. We started asking ourselves “is this actually user error? Or is there a problem with the product itself?”

This presented a challenge because questions like “did the build command fail due to a bug in the build system, or user error?” are a lot harder to answer than pass/fail issues like failing to create a container for the build. To answer these questions, we had to build something new, something smarter.

Build logs

The most obvious way to determine why a build failed is to look at its logs. When spot-checking build failures, we can typically identify what went wrong. For example, some builds fail to install dependencies because of an outdated lockfile (e.g. a package-lock.json that’s out of sync with package.json). But looking through build failures one by one doesn’t scale. We also didn’t want engineers looking through customer build logs without at least suspecting that there was an issue with our build system that we could fix.

Automating error detection

At this point, the next steps were clear: we needed an automated way to identify why a build failed based on its logs, and a way for engineers to see what the top issues were while preserving privacy (e.g. removing account-specific identifiers and file paths from the aggregate data).

Detecting errors in build logs using Workers Queues

The first thing we needed was a way to categorize build errors after a build fails. To do this, we created a queue named BuildErrorsQueue to process builds and look for errors. After a build fails, BuildBuddy sends the build ID to BuildErrorsQueue, and the queue consumer fetches the logs, checks for issues, and saves the results to Postgres.
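On the producer side, this is a single queue send from BuildBuddy once a build reaches a failed state. A minimal sketch of the call (the surrounding BuildBuddy code and the buildId field are simplified):

// Inside BuildBuddy, after the build is marked as failed (sketch):
await this.env.BUILD_ERRORS_QUEUE.send({ build_id: this.buildId })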

We started out with a few static patterns to match things like Wrangler errors in log lines:

export const DetectedErrorCodes = {
  wrangler_error: {
    detect: async (lines: LogLines) => {
      const errors: DetectedError[] = []
      for (const line of lines) {
        if (line[2].trim().startsWith('✘ [ERROR]')) {
          errors.push({
            error_code: 'wrangler_error',
            error_group: getWranglerLogGroupFromLogLine(line, wranglerRegexMatchers),
            detected_on: new Date(),
            lines_matched: [line],
          })
        }
      }
      return errors
    },
  },
  installing_tools_or_dependencies_failed: { ... },
}
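A note on the shapes involved: each log line is a tuple whose third element is the message text, and detection produces DetectedError records. The real types are a bit richer, but roughly (the first two tuple members here are assumptions on our part):

// Approximate shapes used by the detectors (simplified):
type LogLine = [timestamp: string, stream: string, message: string]
type LogLines = LogLine[]

interface DetectedError {
  error_code: string
  error_group: string
  detected_on: Date
  lines_matched: LogLine[]
}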

It wouldn’t be useful if all Wrangler errors were lumped under a single generic “wrangler_error” code, so we went a step further and normalized the log lines into error groups:

function getWranglerLogGroupFromLogLine(
  logLine: LogLine,
  regexMatchers: RegexMatcher[]
): string {
  const original = logLine[2].trim().replaceAll(/[\t\n\r]+/g, ' ')
  let message = original
  let group = original
  for (const { mustMatch, patterns, stopOnMatch, name, useNameAsGroup } of regexMatchers) {
    if (mustMatch !== undefined) {
      const matched = matchLineToRegexes(message, mustMatch)
      if (!matched) continue
    }
    if (patterns) {
      for (const [pattern, mask] of patterns) {
        message = message.replaceAll(pattern, mask)
      }
    }
    if (useNameAsGroup === true) {
      group = name
    } else {
      group = message
    }
    if (Boolean(stopOnMatch) && message !== original) break
  }
  return group
}

const wranglerRegexMatchers: RegexMatcher[] = [
  {
    name: 'could_not_resolve',
    // ✘ [ERROR] Could not resolve "./balance"
    // ✘ [ERROR] Could not resolve "node:string_decoder" (originally "string_decoder/")
    mustMatch: [/^✘ \[ERROR\] Could not resolve "[@\w :/\\.-]*"/i],
    stopOnMatch: true,
    patterns: [
      [/(?<=^✘ \[ERROR\] Could not resolve ")[@\w :/\\.-]*(?=")/gi, '<MODULE>'],
      [/(?<=\(originally ")[@\w :/\\.-]*(?=")/gi, '<MODULE>'],
    ],
  },
  {
    name: 'no_matching_export_for_import',
    // ✘ [ERROR] No matching export in "src/db/schemas/index.ts" for import "someCoolTable"
    mustMatch: [/^✘ \[ERROR\] No matching export in "/i],
    stopOnMatch: true,
    patterns: [
      [/(?<=^✘ \[ERROR\] No matching export in ")[@~\w:/\\.-]*(?=")/gi, '<MODULE>'],
      [/(?<=" for import ")[\w-]*(?=")/gi, '<IMPORT>'],
    ],
  },
  // ...many more added over time
]
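For completeness, here’s roughly what the supporting pieces referenced above could look like: the RegexMatcher shape, the matchLineToRegexes helper, and a detectErrorsInLogLines function that runs every detector over a chunk of log lines. These are simplified sketches rather than the exact production code:

// Shape of a matcher entry, as used by getWranglerLogGroupFromLogLine (sketch).
interface RegexMatcher {
  name: string
  mustMatch?: RegExp[]
  patterns?: [RegExp, string][]
  stopOnMatch?: boolean
  useNameAsGroup?: boolean
}

// True if any of the given regexes match the line.
function matchLineToRegexes(line: string, regexes: RegExp[]): boolean {
  return regexes.some((re) => re.test(line))
}

// Run every registered detector over a chunk of log lines and collect the results.
async function detectErrorsInLogLines(lines: LogLines): Promise<DetectedError[]> {
  const errors: DetectedError[] = []
  for (const { detect } of Object.values(DetectedErrorCodes)) {
    errors.push(...(await detect(lines)))
  }
  return errors
}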

Once we had our error detection matchers and normalizing logic in place, implementing the BuildErrorsQueue consumer was easy:

export async function handleQueue(
  batch: MessageBatch,
  env: Bindings,
  ctx: ExecutionContext
): Promise<void> {
  ...
  await pMap(batch.messages, async (msg) => {
    try {
      const { build_id } = BuildErrorsQueueMessageBody.parse(msg.body)
      await store.buildErrors.deleteErrorsByBuildId({ build_id })
      const bb = getBuildBuddy(env, build_id)
      const errors: DetectedError[] = []
      let cursor: LogsCursor | undefined
      let hasMore = false

      do {
        using maybeNewLogs = await bb.getLogs(cursor, false)
        const newLogs = LogsWithCursor.parse(maybeNewLogs)
        cursor = newLogs.cursor
        const newErrors = await detectErrorsInLogLines(newLogs.lines)
        errors.push(...newErrors)
        hasMore = Boolean(cursor) && newLogs.lines.length > 0
      } while (hasMore)

      if (errors.length > 0) {
        await store.buildErrors.insertErrors(
          errors.map((e) => ({
            build_id,
            error_code: e.error_code,
            error_group: e.error_group,
          }))
        )
      }
      msg.ack()
    } catch (e) {
      msg.retry()
      sentry.captureException(e)
    }
  })
}

Here, we’re fetching logs from each build’s BuildBuddy Durable Object, detecting why the build failed using the matchers we wrote, and saving the errors to the Postgres DB. We also delete any existing errors for the build first, so that re-running detection (for example, after improving our error detection patterns) doesn’t add duplicate rows to the database.
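For reference, wiring the consumer up is just a matter of exporting a queue() handler from the Worker and pointing BuildErrorsQueue’s consumer configuration at it. A minimal sketch:

export default {
  // Dispatch batches from BuildErrorsQueue to the handler above.
  async queue(batch: MessageBatch, env: Bindings, ctx: ExecutionContext): Promise<void> {
    await handleQueue(batch, env, ctx)
  },
} satisfies ExportedHandler<Bindings>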

What about historical builds?

The BuildErrorsQueue was great for new builds, but it still left us not knowing why all of the previous build failures happened, other than “user error”. We considered only tracking errors in new builds, but that was unacceptable: it would significantly slow down our ability to improve the error detection system, because each iteration would require waiting days for enough new failures to identify the issues we needed to prioritize.

Problem: logs are stored across one million+ Durable Objects

Remember how every build has an associated BuildBuddy DO to store logs? This is a great design for ensuring our logging pipeline scales with our customers, but it presented a challenge for aggregating issues based on logs: something would need to go through every historical build (more than 1 million at the time), fetch its logs, and detect why it failed.

If we were using Go and Kubernetes, we might solve this using a long-running container that goes through all builds and runs our error detection. But how do we solve this in Workers?

How do we backfill errors for historical builds?

At this point, we already had the queue to process new builds. If we could somehow send all of the old build IDs to it, Queues concurrent consumers would let us work through every historical build quickly. We thought about hacking together a local script to fetch all of the build IDs and send them to an API that put them on the queue, but we wanted something more secure and easier to use, so that running a new backfill was as simple as an API call.

That’s when an idea hit us: what if we used a Durable Object with alarms to fetch a range of builds and send them to BuildErrorsQueue? At first, it seemed far-fetched, given that Durable Object alarms have a limited amount of work they can do per invocation. But wait, if AI Agents built on Durable Objects can manage background tasks, why can’t we fetch millions of build IDs and forward them to queues?

Building a Build Errors Agent with Durable Objects

The idea was simple: create a Durable Object class named BuildErrorsAgent and run a single instance that loops through the specified range of builds in the database and sends them to BuildErrorsQueue.
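The rough shape of the class is a Durable Object with two entry points: an RPC method to kick off a backfill, and an alarm handler that does the actual work. A simplified sketch (this.kv wraps Durable Object storage and this.store wraps our Postgres access layer; their setup is omitted here):

import { DurableObject } from 'cloudflare:workers'

export class BuildErrorsAgent extends DurableObject<Bindings> {
  state: DurableObjectState

  constructor(state: DurableObjectState, env: Bindings) {
    super(state, env)
    this.state = state
    // this.kv, this.store, this.sentry, etc. are initialized here (omitted)
  }

  async start(params: { min_build_id: number; max_build_id: number }): Promise<void> {
    // shown below
  }

  async alarm(): Promise<void> {
    // shown below
  }
}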

The first thing we did was set up an RPC method to start a backfill and save its parameters in Durable Object KV storage so that they can be read each time the alarm executes:

async start({
  min_build_id,
  max_build_id,
}: {
  min_build_id: BuildRecord['build_id']
  max_build_id: BuildRecord['build_id']
}): Promise<void> {
  logger.setTags({ handler: 'start', environment: this.env.ENVIRONMENT })
  try {
    if (min_build_id < 0) throw new Error('min_build_id cannot be negative')
    if (max_build_id < min_build_id) {
      throw new Error('max_build_id cannot be less than min_build_id')
    }
    const [started_on, stopped_on] = await Promise.all([
      this.kv.get('started_on'),
      this.kv.get('stopped_on'),
    ])
    await match({ started_on, stopped_on })
      .with({ started_on: P.not(null), stopped_on: P.nullish }, () => {
        throw new Error('BuildErrorsAgent is already running')
      })
      .otherwise(async () => {
        // delete all existing data and start queueing failed builds
        await this.state.storage.deleteAlarm()
        await this.state.storage.deleteAll()
        this.kv.put('started_on', new Date())
        this.kv.put('config', { min_build_id, max_build_id })
        void this.state.storage.setAlarm(this.getNextAlarmDate())
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}
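Kicking off a backfill is then a single RPC call to the one agent instance from an internal API endpoint. The binding name, instance name, and ID range below are illustrative:

// Sketch: trigger a backfill over a range of build IDs.
const id = env.BUILD_ERRORS_AGENT.idFromName('build-errors-agent')
const agent = env.BUILD_ERRORS_AGENT.get(id)
await agent.start({ min_build_id: 0, max_build_id: 1_000_000 })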

The most important part of the implementation is the alarm that runs every second until the job is complete. Each alarm invocation has the following steps:

  1. Set a new alarm (always first to ensure an error doesn’t cause it to stop).

  2. Retrieve state from KV (the shape is sketched after this list).

  3. Validate that the agent is supposed to be running:

    1. Ensure a backfill has been started and has not been stopped.

    2. Ensure we haven’t reached the max build ID set in the config.

  4. Finally, queue up another batch of builds by querying Postgres and sending to the BuildErrorsQueue.
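The per-backfill state the alarm reads and writes is just a handful of keys in the Durable Object’s storage. The field names below come from the code; the exact types are approximations:

// Approximate shape of the KV-backed state (sketch):
interface AgentKVState {
  started_on: Date | null
  stopped_on: Date | null
  config: { min_build_id: number; max_build_id: number } | null
  latest_build_id: number | null // highest build ID queued so far
  total_builds_processed: number | null // running count, for observability
}

With that state in mind, here’s the alarm handler: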

async alarm(): Promise<void> {
  logger.setTags({ handler: 'alarm', environment: this.env.ENVIRONMENT })
  try {
    void this.state.storage.setAlarm(Date.now() + 1000)
    const kvState = await this.getKVState()
    this.sentry.setContext('BuildErrorsAgent', kvState)
    const ctxLogger = logger.withFields({ state: JSON.stringify(kvState) })

    await match(kvState)
      .with({ started_on: P.nullish }, async () => {
        ctxLogger.info('BuildErrorsAgent is not started, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with({ stopped_on: P.not(null) }, async () => {
        ctxLogger.info('BuildErrorsAgent is stopped, cancelling alarm')
        await this.state.storage.deleteAlarm()
      })
      .with(
        // we should never have started_on set without config set, but just in case
        { started_on: P.not(null), config: P.nullish },
        async () => {
          const msg =
            'BuildErrorsAgent started but config is empty, stopping and cancelling alarm'
          ctxLogger.error(msg)
          this.sentry.captureException(new Error(msg))
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .when(
        // make sure there are still builds to enqueue
        (s) =>
          s.latest_build_id !== null &&
          s.config !== null &&
          s.latest_build_id >= s.config.max_build_id,
        async () => {
          ctxLogger.info('BuildErrorsAgent job complete, cancelling alarm')
          this.kv.put('stopped_on', new Date())
          await this.state.storage.deleteAlarm()
        }
      )
      .with(
        {
          started_on: P.not(null),
          stopped_on: P.nullish,
          config: P.not(null),
          latest_build_id: P.any,
        },
        async ({ config, latest_build_id }) => {
          // 1. select batch of ~1000 builds
          // 2. send them to Queues 100 at a time, updating
          //    latest_build_id after each batch is sent
          const failedBuilds = await this.store.builds.selectFailedBuilds({
            min_build_id: latest_build_id !== null ? latest_build_id + 1 : config.min_build_id,
            max_build_id: config.max_build_id,
            limit: 1000,
          })
          if (failedBuilds.length === 0) {
            ctxLogger.info(`BuildErrorsAgent: ran out of builds, stopping and cancelling alarm`)
            this.kv.put('stopped_on', new Date())
            await this.state.storage.deleteAlarm()
            // nothing left in this range, so exit early
            return
          }

          for (
            let i = 0;
            i < BUILDS_PER_ALARM_RUN && i < failedBuilds.length;
            i += QUEUES_BATCH_SIZE
          ) {
            const batch = failedBuilds
              .slice(i, i + QUEUES_BATCH_SIZE)
              .map((build) => ({ body: build }))

            if (batch.length === 0) {
              ctxLogger.info(`BuildErrorsAgent: ran out of builds in current batch`)
              break
            }
            ctxLogger.info(
              `BuildErrorsAgent: sending ${batch.length} builds to build errors queue`
            )
            await this.env.BUILD_ERRORS_QUEUE.sendBatch(batch)
            this.kv.put(
              'latest_build_id',
              Math.max(...batch.map((m) => m.body.build_id).concat(latest_build_id ?? 0))
            )

            this.kv.put(
              'total_builds_processed',
              ((await this.kv.get('total_builds_processed')) ?? 0) + batch.length
            )
          }
        }
      )
      .otherwise(() => {
        const msg = 'BuildErrorsAgent has nothing to do - this should never happen'
        this.sentry.captureException(new Error(msg))
        ctxLogger.info(msg)
      })
  } catch (e) {
    this.sentry.captureException(e)
    throw e
  }
}

Using pattern matching with ts-pattern made it much easier to see which states we expected and what would happen in each, compared to the equivalent procedural code. We considered using a more powerful library like XState, but decided on ts-pattern due to its simplicity.

Running the backfill

Once everything rolled out, we were able to trigger an errors backfill for over a million failed builds in a couple of hours with a single API call, categorizing 80% of failed builds on the first run. With a fast backfill process, we were able to iterate on our regex matchers to further refine our error detection and improve error grouping. Here’s what the error list looks like in our staging environment:

Fixes and improvements

Having a better understanding of what’s going wrong has already enabled us to make several improvements:

  • Wrangler now shows a clearer error message when no config file is found.

  • Fixed multiple edge cases where the wrong package manager was used in TypeScript/JavaScript projects.

  • Added support for bun.lock (previously only checked for bun.lockb).

  • Fixed several edge cases where build caching did not work in monorepos.

  • Projects that use a runtime.txt file to specify a Python version no longer fail.

  • …and more!

We’re still working on fixing the other bugs we’ve found, but we’re making steady progress. Reliability is a feature we’re striving for in Workers Builds, and this project has helped us make meaningful progress towards that goal. Instead of waiting for customers to contact support, we can proactively identify and fix issues (and catch regressions more easily).

One of the great things about building on the Developer Platform is how easy it is to ship things. The core of this error detection pipeline (the Queue and Durable Object) only took two days to build, which meant we could spend more time working on improving Workers Builds instead of spending weeks on the error detection pipeline itself.

What’s next?

In addition to continuing to improve build reliability and speed, we’ve also started thinking about other ways to help developers build their applications on Workers. For example, we built a Builds MCP server that allows users to debug builds directly in Cursor/Claude/etc. We’re also thinking about ways we can expose these detected issues in the Cloudflare Dashboard so that users can identify issues more easily without scrolling through hundreds of logs.

Ready to get started?

Building applications on Workers has never been easier! Try deploying a Durable Object-backed chat application with Workers Builds.

