Camilo Fernandez

07/01/2022, 8:56 AM
Good morning everyone, 👋 I hope you are having a great day. My problem: I have Prefect 1.2.2 set up in an EKS cluster. The Kubernetes agent schedules flow runs, but they never execute; they just stay queued in a Submitted state.

Anna Geller

07/01/2022, 11:21 AM
could you please move your entire message into the thread and in the main message only explain the problem you're trying to solve?

Camilo Fernandez

07/01/2022, 11:52 AM
The Kubernetes agent was created together with the deployment of Prefect Server by the Helm chart, and it receives a job template whose container uses an image I built:
#values.yaml:

agent:
  enabled: "true"
  prefectLabels: ["index"]
  jobTemplateFilePath: "s3://prefect-flows/workload.yaml --log-level DEBUG"
  
  image: 
    name: "prefecthq/prefect"
    tag: "1.2.2-python3.9"
The image is created like this:
#Dockerfile

FROM prefecthq/prefect:1.2.2-python3.9
RUN pip install "prefect[kubernetes]" kubernetes

WORKDIR /

COPY ./init.sh /

# init.sh runs `prefect backend server`; it is executed in every flow run job.
ENTRYPOINT [ "bash", "./init.sh" ]
The flow is stored in an S3 bucket and uses an image I built, which is stored in ECR:
#workload.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: my-k8s-flow
  namespace: prefect
spec:
  template:
    spec:
      serviceAccountName: "prefect-server-serviceaccount"
      
      containers:
      - name: my-k8s-flow
        image: public.ecr.aws/abcd1234/myimage:1.0
        imagePullPolicy: Always

      restartPolicy: Never
  backoffLimit: 4
The agent queues the flow run job properly, but the flow run never changes state. The Kubernetes job appears to complete, but the flow itself doesn't seem to be executed at all: the pod pulls the image from ECR, starts the container, runs the entrypoint script, and then the Kubernetes job completes and the pod is deleted. The dashboard shows the flow run in a Submitted state for a while; the logs show Lazarus rescheduling it 3 times, after which its state changes to Failed. The agent only logs a completed deployment:
DEBUG - agent | Querying for ready flow runs...
DEBUG:agent:Found 1 ready flow run(s): {'0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7'}                                                           
DEBUG:agent:Retrieving metadata for 1 flow run(s)...                                                                                      
DEBUG - agent | Found 1 ready flow run(s): {'0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7'}                                                       
DEBUG - agent | Retrieving metadata for 1 flow run(s)...
DEBUG:agent:Submitting flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 for deployment...
DEBUG - agent | Submitting flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 for deployment...
DEBUG:agent:Sleeping flow run poller for 0.25 seconds...                                                                                  
INFO:agent:Deploying flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 to execution environment...
DEBUG - agent | Sleeping flow run poller for 0.25 seconds...
INFO - agent | Deploying flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 to execution environment...
DEBUG:agent:Updating flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 state from Scheduled -> Submitted...
DEBUG - agent | Updating flow run 0b317604-ebf8-490f-b9fb-9cd8ba7e9fa7 state from Scheduled -> Submitted...
DEBUG:agent:Loading job template from 's3://prefect-flows/workload.yaml'
DEBUG - agent | Loading job template from 's3://prefect-flows/workload.yaml'
DEBUG - agent | Querying for ready flow runs...                                                                                           
DEBUG:agent:Querying for ready flow runs...
The flow should run on a schedule every 5 minutes and print something to stdout. Interestingly, logging inside the flow with either the Prefect logger or the Python logging package does not work at all.
#my_flow.py
...
schedule = IntervalSchedule(
    start_date=datetime.utcnow() + timedelta(seconds=1),
    interval=timedelta(minutes=5),
)


def set_run_config() -> RunConfig:
    return KubernetesRun(
        job_template_path="s3://prefect-flows/workload.yaml",
        image="public.ecr.aws/abcd1234/myimage:1.0",
        labels=["index"],
    )


def set_storage() -> S3:
    return S3(
        stored_as_script=True,
        key="my_flow.py",
        bucket="prefect-flows",
    )


@task(state_handlers=[alert_failed], log_stdout=True, slug="test-task")
def test_task():
    print('hellooooo ', file=sys.stdout)


with Flow(
    "kubernetes",
    schedule=schedule,
    storage=set_storage(),
    run_config=set_run_config(),
) as flow:
    test_task()
 
# Uncommenting this changes nothing
#flow.run(run_on_schedule=True)
Is there anything I'm missing, or a way to debug this problem further? Thanks in advance.

Anna Geller

07/01/2022, 12:07 PM
where are you in your Prefect adoption? Are you building a PoC atm or have you been using Server for a bit? asking since it would be easier to start with Prefect 2.0 now
logging inside the flow with either Prefect or Python packages logger does not work at all
this is expected - in 1.0 you can only log from within tasks - this is enhanced in 2.0 so that you can also log from flows
Can you remove the ENTRYPOINT from your Dockerfile? As it stands, you are overriding Prefect's entrypoint, which is likely the culprit here.

Camilo Fernandez

07/01/2022, 12:15 PM
I started setting up a dev environment for Prefect a bit before Orion came out. My team decided to keep working on this until we have a working environment and are comfortable using it. But we actually don't know how much easier it would be, since Prefect is very new to us. I'm expected to finish this in a couple of days. I have the feeling I have understood the moving parts of Prefect 1, but I keep encountering small, obscure problems like this that cause long delays. I'm very tempted to upgrade, but I'm afraid it may involve so many changes that the progress of the last few weeks would be lost. Is there a migration guide you would recommend?

Anna Geller

07/01/2022, 12:23 PM
The thing is: sooner or later you will have to upgrade since 2.0 is "the future" 🙂 better now than after investing months

Camilo Fernandez

07/01/2022, 12:24 PM
Removing the ENTRYPOINT did it! Awesome @Anna Geller thanks a lot! I will check Orion out. 🚀 to the future!

Anna Geller

07/01/2022, 12:25 PM
With GA of 2.0 just around the corner, it's definitely worth checking out Orion. Glad that worked, and thanks for the update! 🙌
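
Why removing the ENTRYPOINT fixed the problem in this thread can be sketched with the rules Kubernetes uses to combine an image's ENTRYPOINT and CMD with a container's `command` and `args`. The sketch below is only an illustration: the helper name `effective_command` is hypothetical, and it assumes that the Prefect 1.x Kubernetes agent passes the flow run command (`prefect execute flow-run`) as container `args`, and that the stock image's entrypoint is a wrapper that executes whatever arguments it receives. Under those assumptions, a custom ENTRYPOINT script that ignores its arguments would exit without ever starting the flow run, which matches the completed job and the flow run stuck in Submitted.

```python
from typing import List, Optional


def effective_command(
    image_entrypoint: List[str],
    image_cmd: List[str],
    k8s_command: Optional[List[str]] = None,
    k8s_args: Optional[List[str]] = None,
) -> List[str]:
    """Return the argv a container starts with, following Kubernetes' rules."""
    if k8s_command is not None:
        # `command` replaces the image ENTRYPOINT; the image CMD is ignored.
        return k8s_command + (k8s_args or [])
    if k8s_args is not None:
        # Only `args` is set: the image ENTRYPOINT is kept, `args` replaces CMD.
        return image_entrypoint + k8s_args
    # Neither is set: the image defaults apply.
    return image_entrypoint + image_cmd


# Assumption: the agent injects the flow run command as container `args`.
agent_args = ["prefect", "execute", "flow-run"]

# Custom image from the thread: ENTRYPOINT ["bash", "./init.sh"].
custom = effective_command(["bash", "./init.sh"], [], k8s_args=agent_args)

# Assumption: the stock image uses an entrypoint wrapper that execs its arguments.
stock = effective_command(
    ["tini", "-g", "--", "entrypoint.sh"], [], k8s_args=agent_args
)

print(custom)  # init.sh receives the flow run command only as positional arguments
print(stock)   # the wrapper receives the same arguments and can exec them
```

If a custom entrypoint script is still needed for setup, one option is to have it end by executing the arguments it was given (for example `exec "$@"` in bash), so that the command injected by the agent still runs.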