Infrastructure Debugging

Debug flow run failures from a single pane of glass, with less context-switching between Kubernetes dashboards, AWS consoles, and log aggregators.

When a flow run fails, the first question is always the same: what went wrong?

Typically, answering that question means jumping between Kubernetes dashboards, AWS consoles, CloudWatch logs, and container runtimes while piecing together clues from systems outside your workflow context. By the time you find the root cause, you’ve lost minutes (or hours) of productivity.

Prefect’s infrastructure debugging capabilities change that by surfacing lifecycle states, failure diagnostics, resource metrics, and container logs directly in the Prefect UI and CLI. Prefect becomes your single pane of glass for understanding why a run failed, whether the issue is in your code, your infrastructure configuration, or the underlying platform.

See each infrastructure stage

Before a flow run starts executing your code, it passes through several infrastructure stages: submission, scheduling, container startup, and process initialization. Previously, all of this was invisible; a run sat in Pending until it either started or failed.

Now, Prefect tracks every stage of the infrastructure lifecycle with dedicated states:

Submitting: The worker is actively creating infrastructure (a Kubernetes Job, an ECS task, a local process)
InfrastructurePending: Infrastructure exists, but the flow run process has not started yet (for example, a pod is pulling images or waiting for node capacity)

These states appear in the UI timeline and are available through the API, so you always know whether a delay is caused by infrastructure provisioning or something else entirely.
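As a rough illustration of how these states help attribute a delay, the sketch below totals the time a run spent in each state. The state names come from the list above; the shape of the history records and the helper names (time_in_states, infrastructure_delay) are assumptions made for this example, not Prefect's API.

```python
from datetime import datetime, timedelta

# Hypothetical state history for one flow run: (state name, time entered).
# The record shape is an assumption for illustration only.
history = [
    ("Submitting", datetime(2025, 1, 1, 12, 0, 0)),
    ("InfrastructurePending", datetime(2025, 1, 1, 12, 0, 5)),
    ("Running", datetime(2025, 1, 1, 12, 1, 35)),
]

# States that represent infrastructure provisioning rather than user code.
INFRA_STATES = {"Submitting", "InfrastructurePending"}


def time_in_states(history):
    """Return how long the run spent in each state except the final one."""
    durations = {}
    # Pair each state with the one that follows it to get its exit time.
    for (name, entered), (_, left) in zip(history, history[1:]):
        durations[name] = durations.get(name, timedelta()) + (left - entered)
    return durations


def infrastructure_delay(history):
    """Total time spent in infrastructure lifecycle states."""
    return sum(
        (d for name, d in time_in_states(history).items() if name in INFRA_STATES),
        timedelta(),
    )


if __name__ == "__main__":
    for name, duration in time_in_states(history).items():
        print(f"{name}: {duration.total_seconds():.0f}s")
    total = infrastructure_delay(history)
    print(f"infrastructure delay: {total.total_seconds():.0f}s")
```

In this example nearly all of the startup delay falls under InfrastructurePending, which points at image pulls or node capacity rather than at the worker or the flow code.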

Turn error codes into answers

When infrastructure fails, Prefect reports not only that it failed but also why, and what to do about it.

Automatic failure diagnosis for Kubernetes and Amazon ECS inspects the actual infrastructure state and translates it into actionable guidance:

OOMKilled

Container exceeded its memory limit. Increase the memory request/limit in your work pool’s job template, or optimize your flow to reduce memory usage.

ImagePullBackOff

Failed to pull the container image. Verify the image name and tag exist, and check that image pull secrets are configured.

CrashLoopBackOff

Container is repeatedly crashing on startup. Check application logs for import errors, missing dependencies, or configuration issues.

Unschedulable

No nodes match the pod’s resource requests or node selectors. Verify cluster capacity and check node affinity/toleration settings.
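To make the idea concrete, here is a minimal sketch of this kind of diagnosis for Kubernetes, assuming the pod has already been fetched as a plain dictionary. The status fields it reads (conditions, containerStatuses, state, lastState) follow the Kubernetes Pod API, and the guidance text is abbreviated from the entries above. The function name find_failure_reasons is hypothetical, and this is not Prefect's actual implementation.

```python
# Guidance text abbreviated from the entries listed above.
GUIDANCE = {
    "OOMKilled": "Container exceeded its memory limit. Increase the memory "
                 "request/limit or reduce memory usage.",
    "ImagePullBackOff": "Failed to pull the container image. Verify the image "
                        "name, tag, and image pull secrets.",
    "CrashLoopBackOff": "Container is repeatedly crashing on startup. Check "
                        "application logs.",
    "Unschedulable": "No nodes match the pod's requests or selectors. Verify "
                     "cluster capacity and affinity/toleration settings.",
}


def find_failure_reasons(pod: dict) -> list:
    """Collect known failure reasons from a pod's status, in discovery order."""
    status = pod.get("status", {})
    candidates = []

    # Scheduling failures are reported as a PodScheduled=False condition.
    for condition in status.get("conditions", []):
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            candidates.append(condition.get("reason"))

    # Container failures appear under the current or previous container state.
    for container in status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            state = container.get(key) or {}
            for phase in ("waiting", "terminated"):
                candidates.append((state.get(phase) or {}).get("reason"))

    # Keep only reasons we have guidance for, without duplicates.
    reasons = []
    for reason in candidates:
        if reason in GUIDANCE and reason not in reasons:
            reasons.append(reason)
    return reasons


# A made-up pod whose container was OOM killed and is now being restarted.
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

if __name__ == "__main__":
    for reason in find_failure_reasons(example_pod):
        print(f"{reason}: {GUIDANCE[reason]}")
```

Checking lastState as well as state matters here: a pod in CrashLoopBackOff often records the underlying cause, such as OOMKilled, only on the previous container termination.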

A centralized exit code registry also translates cryptic process exit codes into plain English. For example, code 137 (128 + 9, meaning the process was terminated by SIGKILL) typically indicates an OOM kill, while code 127 usually means the command was not found. Each explanation includes resolution steps so you can fix the problem from the same workflow.
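The exit-code translation described above can be sketched with the standard library alone. The version below relies on the common shell convention that codes above 128 mean "terminated by signal (code - 128)" and assumes a Unix-like system where SIGKILL is defined. The function name and the wording of each explanation are illustrative, not the contents of Prefect's registry.

```python
import signal

# Conventional shell exit codes; these are not specific to Prefect.
KNOWN_CODES = {
    126: "Command found but not executable. Check file permissions and the entrypoint.",
    127: "Command not found. Check the entrypoint and that the executable exists in the image.",
}


def explain_exit_code(code: int) -> str:
    """Translate a process exit code into a short, human-readable explanation."""
    if code == 0:
        return "Process exited successfully."
    if code in KNOWN_CODES:
        return KNOWN_CODES[code]
    if code > 128:
        # By convention, 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"Exit code {code}: terminated by unknown signal {signum}."
        if name == "SIGKILL":
            return (
                "Terminated by SIGKILL (128 + 9), commonly an OOM kill. "
                "Check memory limits and usage."
            )
        return f"Terminated by {name} (128 + {signum})."
    return f"Exit code {code}: application-defined error. Check the run's logs."


if __name__ == "__main__":
    for code in (0, 1, 127, 137, 143):
        print(code, "->", explain_exit_code(code))
```

A SIGKILL can also come from a manual kill or a node eviction, so the explanation hedges with "commonly" rather than asserting an OOM kill outright.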

Monitor resource usage in real time

Wondering if your flow is running out of memory or maxing out CPU? Prefect collects CPU and memory metrics from flow run processes and displays them as real-time charts in the UI.

Flow run detail page: Time-series charts for CPU utilization and memory usage appear in the Infrastructure panel, with peak values shown in the title bar for at-a-glance monitoring
Deployment page: Summary cards show the highest CPU and memory usage across all recent flow runs, with a direct link to the run that produced the peak

Use these metrics to right-size your infrastructure, catch memory leaks early, and determine whether a failure was caused by resource exhaustion, all without leaving Prefect.
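As a small worked example of the right-sizing reasoning, the sketch below reduces a set of metric samples to per-run peaks, finds the run responsible for the highest memory usage, and flags a peak that sits close to a configured limit. The Sample record, the function names, and the 90% threshold are all assumptions made for illustration; they do not describe how Prefect stores or exposes metrics.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical metrics sample taken from a flow run process."""
    run_id: str
    cpu_percent: float
    memory_mib: float


def peak_by_run(samples):
    """Return {run_id: (peak CPU percent, peak memory MiB)}."""
    peaks = {}
    for s in samples:
        cpu, mem = peaks.get(s.run_id, (0.0, 0.0))
        peaks[s.run_id] = (max(cpu, s.cpu_percent), max(mem, s.memory_mib))
    return peaks


def highest_memory_run(samples):
    """Return (run_id, memory MiB) for the single highest memory sample."""
    top = max(samples, key=lambda s: s.memory_mib)
    return top.run_id, top.memory_mib


def near_limit(peak_mib, limit_mib, threshold=0.9):
    """True when peak usage reaches the given fraction of the limit."""
    return peak_mib >= threshold * limit_mib


# Made-up samples from two runs of the same deployment.
samples = [
    Sample("run-a", 40.0, 512.0),
    Sample("run-a", 85.0, 980.0),
    Sample("run-b", 60.0, 700.0),
]

if __name__ == "__main__":
    print(peak_by_run(samples))
    run_id, peak = highest_memory_run(samples)
    print(f"highest memory: {peak} MiB in {run_id}")
    print("near 1024 MiB limit:", near_limit(peak, 1024))
```

A peak that sits within a few percent of the limit, as in this example, suggests raising the memory limit before the next run is OOM killed.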

Read container logs without leaving Prefect

Some of the hardest failures to debug are the ones where a run crashes before it ever connects to Prefect: an OOM kill during import, a bad entrypoint, or a missing dependency. In these cases, the flow run process never initializes its logging handler, so no logs reach Prefect.

When a Kubernetes pod or ECS task crashes before the flow run establishes connectivity, the observer automatically fetches the container’s stdout and stderr and forwards them as flow run logs. The Python traceback that explains what went wrong appears right alongside the rest of your run’s logs in the Prefect UI.

Understand concurrency at a glance

When flow runs queue up waiting for concurrency slots, it’s important to understand which runs are holding those slots and for how long. Prefect now surfaces concurrency utilization across the UI and CLI:

Work pool and work queue pages show active slot counts so you can see utilization at a glance
CLI commands (prefect work-pool slots, prefect work-queue slots) show which flow runs occupy each slot and how long they’ve been running
Concurrency status endpoints provide programmatic access to slot occupancy data for custom dashboards and alerting

Trace every step of the journey

Enhanced lifecycle logging gives you a detailed narrative of what happens at every stage of a flow run’s execution:

Workers report each step of infrastructure creation, from job submission to container startup, with timing information and error context
Runners log process management details including subprocess creation, signal handling, and graceful shutdown sequences
Pull steps log code retrieval progress and surface resolution hints when storage access fails (wrong credentials, missing buckets, network issues)

When something goes wrong, each component suggests concrete next steps. Instead of a generic “infrastructure exited with code 1,” you get a clear explanation and a path to resolution.

Get started

Infrastructure debugging capabilities are available today in Prefect Cloud and the latest open source release. Explore the related documentation to learn more:

States

Learn about Prefect’s state model, including the new infrastructure lifecycle states.

Work pools

Configure and manage the infrastructure that runs your flows.

Workers

Understand how workers submit and monitor flow run infrastructure.

Kubernetes integration

Deploy and observe flow runs on Kubernetes clusters.