From Core to Cloud: Deployment Considerations
Moving from a local workflow to a distributed, Dockerized workflow involves many considerations. This document highlights the most common ones, along with potential "gotchas" that users might encounter as they promote their workflows to Prefect Cloud.
Prefect Cloud requires the use of Docker containers. Docker provides an excellent industry-standard abstraction for shipping code along with all of its dependencies for runtime consistency in diverse environments. Ultimately, to deploy a Flow to Cloud it needs to be "packaged up" inside a Docker image that is then pushed to a registry of your choosing.
How are Prefect Flows stored inside Docker containers?
Whenever you call flow.deploy or build a Docker storage object yourself, Prefect performs the following actions:
- calls cloudpickle.dumps(flow) on your Flow object to convert it to serialized bytes
- stores these bytes in a file inside the Docker image
- runs various health checks on your Flow inside the image to try and catch any issues
cloudpickle is an excellent alternative to the standard library's pickle module for converting Python objects to a serialized byte representation. Note that cloudpickle typically stores imported objects as importable references. So, for example, if you used a function foo that you imported with the statement from my_file import foo, cloudpickle (and consequently Prefect) will assume this same import can take place inside the Docker container. For this reason, it is considered best practice in Prefect to ensure that all utility scripts and custom Python code are importable inside your Docker image.
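The import-by-reference behavior described above can be reproduced with a small experiment. This is a minimal sketch that uses the standard library's pickle as a stand-in for cloudpickle, since both store a function imported from a regular module as a reference to that module. The module name my_file_demo and the helper names are hypothetical and exist only for this illustration.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path


def pickle_then_lose_module() -> bytes:
    """Pickle a function imported from a module, then make that module unavailable."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical helper module, playing the role of `my_file` in the text.
        Path(tmp, "my_file_demo.py").write_text("def foo():\n    return 'hello'\n")
        sys.path.insert(0, tmp)
        try:
            module = importlib.import_module("my_file_demo")
            payload = pickle.dumps(module.foo)
        finally:
            # Simulate a container where the module was never installed.
            sys.path.remove(tmp)
            sys.modules.pop("my_file_demo", None)
    return payload


def try_load(payload: bytes) -> str:
    """Attempt to deserialize and report whether the import could be repeated."""
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return f"load failed: {exc}"
    return "load succeeded"


payload = pickle_then_lose_module()
# The bytes name the module, but they do not contain the function body.
print(b"my_file_demo" in payload, b"hello" in payload)
print(try_load(payload))
```

The serialized bytes only record where foo lives, so deserialization fails as soon as the module cannot be imported. The same kind of failure appears inside a Docker image that lacks your utility code.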
Flows have a save / load interface
Oftentimes users want to separate their flow's build logic from its deploy logic. Because of the nature of cloudpickle and relative imports, instead of importing your Flow object from another file it is recommended that you save your Flow to disk using flow.save, and then load it using Flow.load prior to deployment.
How are Prefect Flows run inside Docker containers?
Whenever a Prefect Cloud flow run is created and submitted for execution, Prefect performs the following actions inside your Flow's Docker image:
- calls cloudpickle.load(...) on the file described above containing the byte representation of your Flow
- calls flow.environment.setup for your Flow's specified execution environment
Ultimately, regardless of the execution environment you use, a single CloudFlowRunner is created to run your Flow and configure it to communicate back to Prefect Cloud.
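The store-then-load cycle described in the two sections above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Prefect's implementation: pickle stands in for cloudpickle, a plain dict stands in for a Flow, a temporary directory stands in for the image filesystem, and the file name workflow.pickle and the round-trip "health check" are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def build_step(workflow: dict, image_dir: Path) -> Path:
    """Build time: serialize the workflow and store the bytes in a file."""
    payload = pickle.dumps(workflow)         # object -> bytes
    target = image_dir / "workflow.pickle"   # hypothetical file name
    target.write_bytes(payload)
    # Stand-in for a health check: confirm the stored file loads again.
    with target.open("rb") as handle:
        pickle.load(handle)
    return target


def run_step(stored_file: Path) -> dict:
    """Run time: load the stored bytes back into an object."""
    with stored_file.open("rb") as handle:
        return pickle.load(handle)


def round_trip(workflow: dict) -> dict:
    """Run both steps against a temporary directory acting as the image."""
    with tempfile.TemporaryDirectory() as tmp:
        stored = build_step(workflow, Path(tmp))
        return run_step(stored)
```

Checking at build time that the file loads again is useful because it surfaces serialization problems before a flow run is ever scheduled.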
Typically Prefect Flows have many dependencies; sometimes these dependencies are popular public Python packages, other times they are intricate non-Python bindings. Either way, Docker provides a convenient abstraction for handling all forms of Flow dependencies:
- PyPI dependencies: for pip-installable dependencies from PyPI, you can use the python_dependencies keyword argument on Docker storage objects and Prefect will automatically install these dependencies for you
- non-PyPI dependencies: for all other forms of dependencies, the best way to include them in your Docker container is through your choice of base_image. You can specify any base image that both you and your Prefect Agent have access to (note that Cloud never requires access to your registries) that contains all your necessary dependencies. Note that you might have to build one yourself if your dependencies are proprietary. When not provided, Prefect automatically detects your local version of Prefect and Python and attempts to select an appropriate base image for your Flow.
Prefect's Docker storage abstraction also exposes the ability to set environment variables on your image. Oftentimes environment variables are used to store sensitive information (e.g., GOOGLE_APPLICATION_CREDENTIALS). As a matter of best practice, you should only hardcode environment variables in Docker images if you are comfortable with all users who have pull access to your image seeing these values.
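To make the dependency and environment variable options above more concrete, the following sketch shows how such settings could map onto Dockerfile instructions. It is a hypothetical illustration and not the Dockerfile that Prefect generates: the function names, the env_vars parameter name, and the python:X.Y-slim fallback image are assumptions, while python_dependencies and base_image are the names used in the text.

```python
import sys
from typing import Dict, List, Optional


def default_base_image() -> str:
    """Pick a base image matching the local Python version (hypothetical default)."""
    major, minor = sys.version_info[:2]
    return f"python:{major}.{minor}-slim"


def render_dockerfile(
    python_dependencies: Optional[List[str]] = None,
    base_image: Optional[str] = None,
    env_vars: Optional[Dict[str, str]] = None,
) -> str:
    """Render a minimal Dockerfile from the three storage options."""
    lines = [f"FROM {base_image or default_base_image()}"]
    # ENV values become part of the image and are readable by anyone who can pull it.
    for name, value in sorted((env_vars or {}).items()):
        lines.append(f'ENV {name}="{value}"')
    if python_dependencies:
        lines.append("RUN pip install " + " ".join(python_dependencies))
    return "\n".join(lines)


print(
    render_dockerfile(
        python_dependencies=["pandas", "requests"],
        base_image="my-registry/etl-base:1.0",  # hypothetical image name
        env_vars={"LOG_LEVEL": "DEBUG"},
    )
)
```

Because ENV instructions are baked into the image, a credential set this way travels with every copy of the image, which is why the text recommends hardcoding only values you are comfortable sharing.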
Data can be exchanged between Prefect Tasks as a first-class operation. This is achieved by creating tasks which accept inputs and return values (using Python's standard return statement). Note that this section is only focused on this type of data exchange. Prefect does not track data which is handled within your tasks (e.g., if your task extracts data from some third-party location or writes to some persisted storage but never returns this data).
During normal execution, the data exchanged between tasks is usually passed in memory. However, there are many situations in which this data needs to be persisted somewhere. Data is only persisted in Prefect Cloud using a Result Handler. Note that unless you turn checkpointing on for your local Core flows, Result Handlers are never exercised in Core.
You want to choose a result handler that matches both your Task's data type and your preferred location for tracking the data. For example, the JSONResultHandler is only capable of handling JSON-compatible data, whereas the GCSResultHandler can handle any cloudpickle-able Python object. You can also always write a completely custom handler for your Flows and Tasks to use.
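The trade-off between a JSON-only handler and one that accepts arbitrary Python objects can be illustrated with two small classes. This is a sketch under stated assumptions and not Prefect's API: the class names and the write/read method pair are hypothetical, the standard library's pickle stands in for cloudpickle, and a local directory stands in for remote storage such as GCS.

```python
import json
import pickle
import uuid
from pathlib import Path
from typing import Any


class JSONStringHandler:
    """Keeps the result as a JSON string; only JSON-compatible data works."""

    def write(self, result: Any) -> str:
        return json.dumps(result)

    def read(self, stored: str) -> Any:
        return json.loads(stored)


class LocalPickleHandler:
    """Pickles the result to a file and returns the file's location."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)

    def write(self, result: Any) -> str:
        location = self.directory / f"{uuid.uuid4().hex}.pickle"
        location.write_bytes(pickle.dumps(result))
        return str(location)

    def read(self, stored: str) -> Any:
        return pickle.loads(Path(stored).read_bytes())
```

A set, for example, round-trips through the pickle-based handler but makes the JSON-based one raise a TypeError, so the handler has to be chosen with the Task's return type in mind.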
If you experience a Task failure with the message:
AssertionError: Result has no ResultHandler
it means that something triggered Cloud to persist data, but neither your Task nor your Flow had a result handler to use.
If your Flow relies on the use of Prefect Secrets, you will need to communicate those Secrets to Prefect Cloud via one of Prefect's APIs. We are currently working on a more pluggable version of Secrets that will allow you to more easily swap out Prefect's Secret storage with your favorite secret provider.
Prefect Cloud workflows are executed inside Docker containers running in the execution environment of your choosing. The only requirement for Prefect Cloud and its Agents is the ability to pull Docker images and "submit" them for execution in some fashion. This means that you may need to reconsider references to local filepaths, ensure certain environment variables are set, and make sure you understand any networking configurations that you rely on.
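One way to handle the filepath and environment variable concerns above is to read locations from the environment and fail fast when they are missing, rather than hardcoding a path that only exists on your laptop. The sketch below is a generic pattern and not a Prefect feature; the variable name FLOW_DATA_DIR and the function name are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Read the data directory from the environment, failing fast if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("FLOW_DATA_DIR")  # hypothetical variable name
    if not raw:
        raise RuntimeError("FLOW_DATA_DIR is not set in this container")
    return Path(raw)
```

Calling resolve_data_dir() at the start of a task turns a confusing "file not found" deep inside a run into an explicit configuration error.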
Prefect makes it easy to control:
- all aspects of the Docker image your Flow is stored within
- what types of Prefect Agents can submit your Flows for execution
- what execution environment your Flow runs in
Note that different workflows will have different resource requirements during execution. For example, if you run a CPU-intensive Flow using a Kubernetes Agent, you should make sure your Kubernetes cluster has a sufficiently large node pool to run on.