Sudden infrastructure failures (like machine crashes or container evictions) can cause flow runs to become unresponsive and appear stuck in a Running state. To mitigate this, flow runs triggered by deployments can emit heartbeats to drive Automations that detect and respond to these “zombie” flow runs, ensuring they are marked as Crashed if they stop reporting heartbeats.

Enable flow run heartbeat events

To emit flow run heartbeat events, run Prefect 3.1.8 or later and set PREFECT_FLOWS_HEARTBEAT_FREQUENCY, the number of seconds between heartbeats, to an integer greater than 30.

Create the automation

To create an automation that marks zombie flow runs as crashed, run this script:
from datetime import timedelta

from prefect.automations import Automation
from prefect.client.schemas.objects import StateType
from prefect.events.actions import ChangeFlowRunState
from prefect.events.schemas.automations import EventTrigger, Posture
from prefect.events.schemas.events import ResourceSpecification


my_automation = Automation(
    name="Crash zombie flows",
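    trigger=EventTrigger(
        # Arm the trigger each time a flow run emits a heartbeat.
        after={"prefect.flow-run.heartbeat"},
        # Any later flow run event (heartbeat or state change) satisfies it.
        expect={
            "prefect.flow-run.*",
        },
        match=ResourceSpecification({"prefect.resource.id": ["prefect.flow-run.*"]}),
        # Evaluate the trigger separately for each flow run.
        for_each={"prefect.resource.id"},
        # Proactive: fire when the expected events do NOT arrive in time.
        posture=Posture.Proactive,
        threshold=1,
        within=timedelta(seconds=90),
    ),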
    actions=[
        ChangeFlowRunState(
            state=StateType.CRASHED,
            message="Flow run marked as crashed due to missing heartbeats.",
        )
    ],
)

if __name__ == "__main__":
    my_automation.create()
The trigger definition says that after each heartbeat event for a flow run we expect to see any flow run event (heartbeat or state change) for that same flow run within 90 seconds. Using the prefect.flow-run.* wildcard in expect ensures the automation works correctly even when flows return custom-named states (for example, Completed(name="SuccessfullyProcessed")), since flow run event names are based on the state’s name rather than its type.
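
To make the arm-and-disarm behavior described above concrete, here is a minimal, standard-library-only sketch of the logic for a single flow run. It is a simplified model for illustration, not Prefect's implementation: the would_fire function and its constants are hypothetical names, and the wildcard is approximated with fnmatchcase.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase

# Values mirror the trigger definition above; the names are hypothetical.
AFTER = "prefect.flow-run.heartbeat"
EXPECT = "prefect.flow-run.*"
WITHIN = timedelta(seconds=90)


def would_fire(events, now):
    """Return True if the simplified proactive trigger has fired by `now`.

    `events` is a time-ordered list of (timestamp, event_name) tuples
    for one flow run.
    """
    deadline = None  # None means the trigger is not armed
    for occurred, name in events:
        if deadline is not None and occurred > deadline:
            return True  # the window elapsed before this event arrived
        if deadline is not None and fnmatchcase(name, EXPECT):
            deadline = None  # an expected event arrived in time: disarm
        if name == AFTER:
            deadline = occurred + WITHIN  # every heartbeat (re)arms the trigger
    return deadline is not None and now > deadline


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0, 0)
    heartbeat = "prefect.flow-run.heartbeat"
    # Two heartbeats, then silence: the last deadline is t0 + 120 seconds.
    zombie = [(t0, heartbeat), (t0 + timedelta(seconds=30), heartbeat)]
    print(would_fire(zombie, now=t0 + timedelta(seconds=121)))  # True
```

In this model a heartbeat both satisfies the previous window and opens a new one, while a state-change event such as prefect.flow-run.SuccessfullyProcessed only closes the window, so a run that finishes normally never fires the trigger.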

Custom state names

Flow run event names are based on the state’s name, not its type. If your flows return states with custom names (for example, return Completed(name="SuccessfullyProcessed")), the emitted event will be prefect.flow-run.SuccessfullyProcessed rather than prefect.flow-run.Completed. The wildcard prefect.flow-run.* in the example above handles this automatically. If you need finer-grained control over which events disarm the trigger, you can explicitly list your custom state names in the expect set instead:
expect={
    "prefect.flow-run.heartbeat",
    "prefect.flow-run.Completed",
    "prefect.flow-run.Failed",
    "prefect.flow-run.Cancelled",
    "prefect.flow-run.Crashed",
    "prefect.flow-run.SuccessfullyProcessed",  # your custom state name
},
When using explicit state names, you must include every custom state name your flows may return. A missing name means the automation won’t recognize that terminal state, causing a false-positive zombie detection for that flow run.
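
One way to reduce the risk of a missing name is to generate the expect set from a single list of custom state names. The sketch below assumes the event naming scheme described above (prefect.flow-run.<state name>); build_expect_set and DEFAULT_EVENT_SUFFIXES are hypothetical helpers, not part of Prefect.

```python
# Heartbeat plus the state names listed in the explicit example above.
DEFAULT_EVENT_SUFFIXES = ["heartbeat", "Completed", "Failed", "Cancelled", "Crashed"]


def build_expect_set(custom_state_names):
    """Return event names for the default events plus any custom state names.

    Hypothetical helper: it only formats strings and does not call Prefect.
    """
    suffixes = [*DEFAULT_EVENT_SUFFIXES, *custom_state_names]
    return {f"prefect.flow-run.{suffix}" for suffix in suffixes}


if __name__ == "__main__":
    # Pass the result as `expect=` when building the EventTrigger.
    for event_name in sorted(build_expect_set(["SuccessfullyProcessed"])):
        print(event_name)
```

Keeping the custom names in one list next to the flow code makes it easier to notice when a newly introduced state name has not been added to the automation.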

Adjusting behavior with settings

If PREFECT_FLOWS_HEARTBEAT_FREQUENCY is set to 30, the automation will trigger only after 3 heartbeats have been missed. To change how quickly the automation fires after the server stops receiving flow run heartbeats, adjust the within value in the trigger definition, PREFECT_FLOWS_HEARTBEAT_FREQUENCY, or both. You can also add actions to your automation, for example to send a notification when zombie runs are detected.
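
The relationship between the two values is simple arithmetic, sketched below. The function names are hypothetical, and the result is an approximation that ignores network delay and timing jitter.

```python
def missed_heartbeats_before_firing(within_seconds, heartbeat_frequency_seconds):
    """Approximate how many heartbeats must be missed before the trigger fires."""
    return within_seconds // heartbeat_frequency_seconds


def within_for(missed_heartbeats, heartbeat_frequency_seconds):
    """Return a `within` value (seconds) that tolerates the given number of misses."""
    return missed_heartbeats * heartbeat_frequency_seconds


if __name__ == "__main__":
    # The values from the paragraph above: a 90-second window, 30-second heartbeats.
    print(missed_heartbeats_before_firing(90, 30))  # 3
    # A 60-second heartbeat that should tolerate two misses needs within=120.
    print(within_for(2, 60))  # 120
```

A shorter window detects zombie runs sooner but leaves less tolerance for a delayed heartbeat, so it is usually safer to keep within at two or more heartbeat intervals.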