Infrastructure Debugging
Debug flow run failures from a single pane of glass, with less context-switching between Kubernetes dashboards, AWS consoles, and log aggregators.
When a flow run fails, the first question is always the same: what went wrong?
Today, answering that question means jumping between Kubernetes dashboards, AWS consoles, CloudWatch logs, and container runtimes while piecing together clues from systems outside your workflow context. By the time you find the root cause, you’ve lost minutes (or hours) of productivity.
Prefect’s infrastructure debugging capabilities change that by surfacing lifecycle states, failure diagnostics, resource metrics, and container logs directly in the Prefect UI and CLI. Prefect becomes your single pane of glass for understanding why a run failed, whether the issue is in your code, your infrastructure configuration, or the underlying platform.
See each infrastructure stage
Before a flow run starts executing your code, it passes through several infrastructure
stages: submission, scheduling, container startup, and process initialization. Previously,
all of this was invisible; a run sat in Pending until it either started or failed.
Now, Prefect tracks every stage of the infrastructure lifecycle with dedicated states:
- Submitting: The worker is actively creating infrastructure (a Kubernetes Job, an ECS task, a local process)
- InfrastructurePending: Infrastructure exists while the flow run process is still waiting to start (for example, a pod is pulling images or waiting for node capacity)
These states appear in the UI timeline and are available through the API, so you always know whether a delay is caused by infrastructure provisioning or something else entirely.
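As a rough illustration of how these states help separate provisioning delay from everything else, the sketch below totals the time a run spent in each state. The timeline data, its `(state name, time entered)` shape, and the `time_in_states` helper are hypothetical; they are not taken from Prefect's API.

```python
from datetime import datetime, timedelta


def time_in_states(timeline):
    """Total the time spent in each state.

    `timeline` is a list of (state name, time entered) pairs sorted by time.
    The final entry only marks when the previous state ended, so it gets no
    duration of its own.
    """
    durations = {}
    for (name, entered), (_, left) in zip(timeline, timeline[1:]):
        durations[name] = durations.get(name, timedelta()) + (left - entered)
    return durations


# Hypothetical timeline for a single flow run (assumed data, for illustration).
start = datetime(2025, 1, 1, 12, 0, 0)
timeline = [
    ("Pending", start),
    ("Submitting", start + timedelta(seconds=2)),
    ("InfrastructurePending", start + timedelta(seconds=5)),
    ("Running", start + timedelta(seconds=65)),
]

durations = time_in_states(timeline)
for name, spent in durations.items():
    print(f"{name}: {spent.total_seconds():.0f}s")
```

In this made-up run, most of the wait falls under InfrastructurePending, which points at image pulls or node capacity rather than the worker.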
Turn error codes into answers
When infrastructure fails, Prefect tells you that it failed, then shows you why and what to do about it.
Automatic failure diagnosis for Kubernetes and Amazon ECS inspects the actual infrastructure state and translates it into actionable guidance:
- OOMKilled
- ImagePullBackOff
- CrashLoopBackOff
- Unschedulable
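To make the idea concrete, here is a minimal sketch of how such a diagnosis could work for Kubernetes. It reads the reason fields that Kubernetes reports in a pod's `status` (container `waiting` and `terminated` states, plus the `PodScheduled` condition) and maps them to guidance. The `GUIDANCE` wording and the `diagnose_pod` helper are assumptions for illustration, not Prefect's implementation.

```python
# Guidance text is illustrative only; it is not Prefect's wording.
GUIDANCE = {
    "OOMKilled": "The container exceeded its memory limit; raise the limit or reduce usage.",
    "ImagePullBackOff": "The image could not be pulled; check the image name, tag, and registry credentials.",
    "CrashLoopBackOff": "The container keeps exiting on startup; check the entrypoint and container logs.",
    "Unschedulable": "No node can fit the pod; check resource requests, node capacity, and taints.",
}


def diagnose_pod(status):
    """Return (reason, guidance) pairs found in a pod status dictionary.

    `status` follows the shape of the `status` field of a Kubernetes Pod
    object serialized as JSON.
    """
    reasons = []
    # Scheduling problems show up as a failed PodScheduled condition.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            reasons.append(cond.get("reason"))
    # Container problems show up in the current or previous container state.
    for container in status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            for phase in ("waiting", "terminated"):
                detail = container.get(key, {}).get(phase)
                if detail and detail.get("reason"):
                    reasons.append(detail["reason"])
    return [(r, GUIDANCE[r]) for r in reasons if r in GUIDANCE]


# Example: a pod that restarts repeatedly after being OOM-killed.
example_status = {
    "conditions": [{"type": "PodScheduled", "status": "True"}],
    "containerStatuses": [
        {
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
        }
    ],
}

for reason, advice in diagnose_pod(example_status):
    print(f"{reason}: {advice}")
```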
A centralized exit code registry also translates cryptic process exit codes into
plain English. For example, code 137 indicates an OOM kill, while code
127 usually means the command is missing. Each explanation includes resolution
steps so you can fix the problem from the same workflow.
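A registry of this kind can be sketched as a lookup table with a fallback for signal-based codes. By shell convention, an exit code above 128 means the process was terminated by signal number `code - 128`, which is why 137 (128 + 9, SIGKILL) commonly points to an OOM kill. The table entries and the `explain_exit_code` helper below are illustrative assumptions, not Prefect's actual registry.

```python
import signal

# Illustrative entries only; the wording is an assumption.
EXIT_CODE_HINTS = {
    1: "General error; check the traceback in the run logs.",
    126: "Command found but not executable; check file permissions.",
    127: "Command not found; check the entrypoint and the image contents.",
    137: "Killed by SIGKILL, often the OOM killer; raise the memory limit.",
    143: "Terminated by SIGTERM; the platform or a user stopped the process.",
}


def explain_exit_code(code):
    """Translate a process exit code into a short explanation."""
    if code == 0:
        return "Process exited successfully."
    if code in EXIT_CODE_HINTS:
        return EXIT_CODE_HINTS[code]
    if code > 128:
        # Codes above 128 conventionally encode the terminating signal.
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"Terminated by {name}."
    return f"Unrecognized exit code {code}."


for code in (0, 127, 137, 139):
    print(code, "->", explain_exit_code(code))
```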
Monitor resource usage in real time
Wondering if your flow is running out of memory or maxing out CPU? Prefect collects CPU and memory metrics from flow run processes and displays them as real-time charts in the UI.
- Flow run detail page: Time-series charts for CPU utilization and memory usage appear in the Infrastructure panel, with peak values shown in the title bar for at-a-glance monitoring
- Deployment page: Summary cards show the highest CPU and memory usage across all recent flow runs, with a direct link to the run that produced the peak
Use these metrics to right-size your infrastructure, catch memory leaks early, and understand whether a failure was caused by resource exhaustion while staying in Prefect.
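The right-sizing step can be sketched as a small calculation over collected samples: find the peaks and compare peak memory with the configured limit. The `Sample` shape, the `summarize` helper, the 90% threshold, and the sample values are all assumptions for illustration; Prefect's own metric format may differ.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds: float      # time since the run started
    cpu_percent: float  # CPU utilization at this moment
    memory_mib: float   # resident memory at this moment


def summarize(samples, memory_limit_mib, threshold=0.9):
    """Report peak usage and whether memory came close to the limit."""
    peak_cpu = max(s.cpu_percent for s in samples)
    peak_memory = max(s.memory_mib for s in samples)
    return {
        "peak_cpu_percent": peak_cpu,
        "peak_memory_mib": peak_memory,
        "memory_headroom_mib": memory_limit_mib - peak_memory,
        # Flag runs whose peak memory reached the chosen fraction of the limit.
        "near_memory_limit": peak_memory >= threshold * memory_limit_mib,
    }


# Hypothetical samples from one flow run with a 512 MiB memory limit.
samples = [
    Sample(seconds=0, cpu_percent=12.0, memory_mib=200.0),
    Sample(seconds=10, cpu_percent=85.0, memory_mib=450.0),
    Sample(seconds=20, cpu_percent=40.0, memory_mib=480.0),
]

summary = summarize(samples, memory_limit_mib=512)
print(summary)
```

A run flagged as near its memory limit is a candidate for a larger limit, while steadily growing memory across samples would suggest a leak.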
Read container logs without leaving Prefect
Some of the hardest failures to debug are the ones where a run crashes before it ever connects to Prefect, such as an OOM kill during import, a bad entrypoint, or a missing dependency. In these cases, the flow run process never initializes its logging handler, so no logs reach Prefect.
When a Kubernetes pod or ECS task crashes before the flow run establishes connectivity, the observer automatically fetches the container’s stdout and stderr and forwards them as flow run logs. The Python traceback that explains what went wrong appears right alongside the rest of your run’s logs in the Prefect UI.
Understand concurrency at a glance
When flow runs queue up waiting for concurrency slots, it’s important to understand who is holding those slots and for how long. Prefect now surfaces concurrency utilization across the UI and CLI:
- Work pool and work queue pages show active slot counts so you can see utilization at a glance
- CLI commands (prefect work-pool slots, prefect work-queue slots) show which flow runs occupy each slot and how long they’ve been running
- Concurrency status endpoints provide programmatic access to slot occupancy data for custom dashboards and alerting
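For a custom dashboard or alert, slot occupancy data could be reduced to a short report like the one sketched below. The input shape (a list of runs with `name` and `started` fields), the `slot_report` helper, and the sample data are hypothetical; they do not reflect the actual response format of the concurrency status endpoints.

```python
from datetime import datetime, timedelta, timezone


def slot_report(runs, limit, now):
    """Summarize which runs hold concurrency slots and for how long.

    Holders are listed longest-running first.
    """
    occupied = sorted(runs, key=lambda run: run["started"])
    return {
        "active": len(occupied),
        "limit": limit,
        "available": max(limit - len(occupied), 0),
        "holders": [(run["name"], now - run["started"]) for run in occupied],
    }


# Hypothetical occupancy data for a work pool with three slots.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"name": "nightly-etl", "started": now - timedelta(minutes=5)},
    {"name": "backfill", "started": now - timedelta(hours=2)},
]

report = slot_report(runs, limit=3, now=now)
print(f"{report['active']}/{report['limit']} slots in use")
for name, held_for in report["holders"]:
    print(f"  {name}: held for {held_for}")
```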
Trace every step of the journey
Enhanced lifecycle logging gives you a detailed narrative of what happens at every stage of a flow run’s execution:
- Workers report each step of infrastructure creation, from job submission to container startup, with timing information and error context
- Runners log process management details including subprocess creation, signal handling, and graceful shutdown sequences
- Pull steps log code retrieval progress and surface resolution hints when storage access fails (wrong credentials, missing buckets, network issues)
When something goes wrong, each component suggests concrete next steps. Instead of a generic “infrastructure exited with code 1,” you get a clear explanation and a path to resolution.
Get started
Infrastructure debugging capabilities are available today in Prefect Cloud and the latest open source release. Explore the related documentation to learn more.