
Morgan Omind

02/01/2021, 4:58 PM
Hi everyone, nice to join your community 🙂 We are facing an issue that seems to impact different Google Cloud Platform projects, while it should not. In the thread-attached GCP log, we can see that it refers to both the quetzal-omind and cartographie-cps projects. We absolutely do not understand why. Our flow runs have been pretty unpredictable for a few days now: most of the time they simply do not start, often they fail in an HTTP request loop, and sometimes they run successfully! We suspect a wrong setup of our Prefect agent, but we are pretty new to Prefect development, so we are really not sure where to look for our issues. Maybe someone here will have some clues to help us 🤞 best!
👋 1
👀 1
{
  "insertId": "fl6tfmd7rpcqwqaig",
  "jsonPayload": {
    "_SYSTEMD_UNIT": "kubelet.service",
    "_SYSTEMD_SLICE": "system.slice",
    "_CAP_EFFECTIVE": "3fffffffff",
    "_HOSTNAME": "gke-iguazu-default-pool-f66089ee-s1lr",
    "SYSLOG_FACILITY": "3",
    "_PID": "1087",
    "_SYSTEMD_INVOCATION_ID": "3eebf72f9a3b413681b38725870a2629",
    "_BOOT_ID": "fa0576989dd2496b8c759f2a2e34514a",
    "_GID": "0",
    "_STREAM_ID": "d30fb7ed4a014871a2f8970efe80c5ad",
    "SYSLOG_IDENTIFIER": "kubelet",
    "PRIORITY": "6",
    "_TRANSPORT": "stdout",
    "_COMM": "kubelet",
    "MESSAGE": "E0201 16:20:44.424380    1087 pod_workers.go:191] Error syncing pod 171b3eee-639d-4da7-9deb-a155f1c87c4a (\"prefect-job-a2802066-vjmf9_default(171b3eee-639d-4da7-9deb-a155f1c87c4a)\"), skipping: failed to \"StartContainer\" for \"flow\" with ImagePullBackOff: \"Back-off pulling image \\\"eu.gcr.io/quetzal-omind/iguazu/dev/flows/bilan_vr:2020-09-17t18-14-03-371762-00-00\\\"\"",
    "_MACHINE_ID": "6f4f58b7702e59ad315033846174af48",
    "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
    "_EXE": "/home/kubernetes/bin/kubelet",
    "_UID": "0",
    "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experimental-check-node-capabilities-before-mount=true --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --cni-bin-dir=/home/kubernetes/bin --kubeconfig=/var/lib/kubelet/kubeconfig --image-pull-progress-deadline=5m --experimental-kernel-memcg-notification=true --max-pods=110 --non-masquerade-cidr=0.0.0.0/0 --network-plugin=kubenet --node-labels=cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos --volume-plugin-dir=/home/kubernetes/flexvolume --bootstrap-kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig --node-status-max-images=25 --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml --pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=75,net.ipv4.tcp_keepalive_probes=9,net.ipv4.tcp_keepalive_time=7200,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.default.accept_ra=0,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "cluster_name": "iguazu",
      "project_id": "cartographie-cps",
      "node_name": "gke-iguazu-default-pool-f66089ee-s1lr",
      "location": "europe-west4-a"
    }
  },
  "timestamp": "2021-02-01T16:20:44.424424Z",
  "labels": {
    "gke.googleapis.com/log_type": "system"
  },
  "logName": "projects/cartographie-cps/logs/kubelet",
  "receiveTimestamp": "2021-02-01T16:20:49.212486852Z"
}
Some more details, perhaps: we deploy our Prefect flows as Docker images on GCP Kubernetes clusters.

Kyle Moon-Wright

02/01/2021, 8:54 PM
Hey @Morgan Omind, this is a bit difficult to parse, but it looks like your Prefect project is "cartographie-cps", and the only reference to quetzal-omind I see is your storage method - containerized code retrieved from your GCR registry. Your Agent will pull the image from this registry URL at runtime for execution inside its environment. Keep in mind that your code storage method is independent from your Prefect Project directory in your tenant. I'm not sure how best to help here, but happy to answer any questions you may have to clarify further.
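For illustration, a minimal sketch of what that separation typically looks like with Prefect 0.14-style Docker storage (older releases import Docker from prefect.environments.storage instead); the registry URL, image name, and task are placeholders taken from the log above, not the actual flow code:
from prefect import Flow, task
from prefect.storage import Docker

@task
def say_hello():
    print("hello")

with Flow("bilan_vr") as flow:
    say_hello()

# The storage decides which registry the image is pushed to at registration
# time and pulled from at runtime...
flow.storage = Docker(
    registry_url="eu.gcr.io/quetzal-omind/iguazu/dev/flows",
    image_name="bilan_vr",
)

# ...while the project is only a grouping in the Prefect backend; it has no
# effect on which registry the Kubernetes job pulls from.
flow.register(project_name="cartographie-cps")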

Morgan Omind

02/01/2021, 9:09 PM
Thx @Kyle Moon-Wright for your clarification and your time investigating our issue. It confirms what we suspected! So, I guess, my misunderstanding now is how to set up the Prefect agent so that it pulls from the desired GCP registry for a specific Prefect project?

Kyle Moon-Wright

02/01/2021, 9:20 PM
Well first off, you should definitely be using a Kubernetes Agent on your cluster and ensure that you've added the necessary dependencies to your image for the Agent - in this case, we'll likely need:
pip install prefect[kubernetes]
and maybe
pip install prefect[google]
as optional dependencies as described here. Otherwise, your Agent doesn't need access to a specified Prefect Project (which has very little bearing on execution), but it does need access to the GCP registry to pull the image, as well as other RBAC/network permissions on the GCP side to submit Jobs as containers from the image you pulled.
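As a hedged sketch of the registry-access part only: if the cluster's nodes cannot pull from the other project's registry directly, one option is to attach an imagePullSecret to the flow's job via its run config. The secret name below is hypothetical (you would create it yourself, e.g. from a service-account key with read access to eu.gcr.io), and the image_pull_secrets argument only exists in recent 0.14.x releases, so check your Prefect version:
from prefect.run_configs import KubernetesRun

# "flow" is the Flow object being registered, as in the earlier sketch.
# "gcr-quetzal-omind" is a hypothetical kubernetes.io/dockerconfigjson secret
# living in the namespace the agent submits jobs to.
flow.run_config = KubernetesRun(
    image="eu.gcr.io/quetzal-omind/iguazu/dev/flows/bilan_vr:latest",  # placeholder tag
    image_pull_secrets=["gcr-quetzal-omind"],
)
On older versions, a custom job template whose pod spec sets imagePullSecrets achieves the same effect.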
I’m definitely no K8s expert, but it is the most widely used Agent for production purposes, so there are definitely lots of docs and threads here on its usage.

Morgan Omind

02/02/2021, 9:51 AM
Ok, thanks for the hints. Sadly, I have already checked the RBAC permissions and followed the instructions to set them up accordingly. But I may have done it wrong, I'm going to double-check this 😉 And of course we are going to keep looking for docs and threads from the community on this topic. However, what seems very weird to me is the randomness of our flow runs. As I previously explained, almost none of them succeed (though sometimes one actually does, and we don't know why or what is different from the other runs). Most of the time, either a run stays in the Pending state and is relaunched by the Lazarus process until its maximum number of retries, and then fails, or it launches, but after the flow starts it enters an infinite loop of HTTP requests and eventually fails. I attach the log of this loop:
15:47:20
DEBUG
urllib3.connectionpool
Starting new HTTPS connection (1): api.prefect.io:443
15:47:20
DEBUG
urllib3.connectionpool
https://api.prefect.io:443 "POST /graphql HTTP/1.1" 200 45
This random behavior, and the fact that we are not able to identify any pattern in it, makes debugging very frustrating and unpredictable. I guess this randomness is not normal/common behavior, but I wonder what we did wrong to end up with it?
Moreover, there are many logs in GCP that seem very weird, so we may have configuration issues. But one of the weirdest, in my opinion, is the following:
{
"textPayload": "[2021-02-02 10:44:58,462] DEBUG - agent | No flow runs found\n",
"insertId": "28isj4g32dpqfj",
"resource": {
"type": "k8s_container",
"labels": {
"namespace_name": "default",
"container_name": "k8s-agent",
"pod_name": "prefect-1612261772-k8s-agent-76968b94dc-vrb48",
"project_id": "cartographie-cps",
"cluster_name": "iguazu",
"location": "europe-west1-c"
}
},
"timestamp": "2021-02-02T10:44:58.463191937Z",
"severity": "INFO",
"labels": {
"<http://compute.googleapis.com/resource_name|compute.googleapis.com/resource_name>": "gke-iguazu-default-pool-4df57660-prlp",
"k8s-pod/app": "prefect",
"k8s-pod/app_kubernetes_io/version": "0.10.7",
"k8s-pod/app_kubernetes_io/instance": "prefect-1612261772",
"k8s-pod/pod-template-hash": "76968b94dc",
"k8s-pod/app_kubernetes_io/component": "k8s-agent",
"k8s-pod/app_kubernetes_io/managed-by": "Helm",
"k8s-pod/helm_sh/chart": "prefect-0.1.0",
"k8s-pod/app_kubernetes_io/name": "prefect"
},
"logName": "projects/cartographie-cps/logs/stdout",
"receiveTimestamp": "2021-02-02T10:45:02.024152388Z"
}
Do you have any idea what could be happening here to produce such a 'No flow runs found' log?
Here is the query log (sorry to post so many verbose logs!):
{
"textPayload": "[2021-02-02 10:48:19,205] DEBUG - agent | Querying for flow runs\n",
"insertId": "jzrqvyg3j63ez5",
"resource": {
"type": "k8s_container",
"labels": {
"project_id": "cartographie-cps",
"container_name": "k8s-agent",
"namespace_name": "default",
"location": "europe-west1-c",
"cluster_name": "iguazu",
"pod_name": "prefect-1612261772-k8s-agent-76968b94dc-vrb48"
}
},
"timestamp": "2021-02-02T10:48:19.208557067Z",
"severity": "INFO",
"labels": {
"k8s-pod/helm_sh/chart": "prefect-0.1.0",
"k8s-pod/pod-template-hash": "76968b94dc",
"k8s-pod/app_kubernetes_io/component": "k8s-agent",
"k8s-pod/app": "prefect",
"<http://compute.googleapis.com/resource_name|compute.googleapis.com/resource_name>": "gke-iguazu-default-pool-4df57660-prlp",
"k8s-pod/app_kubernetes_io/name": "prefect",
"k8s-pod/app_kubernetes_io/instance": "prefect-1612261772",
"k8s-pod/app_kubernetes_io/managed-by": "Helm",
"k8s-pod/app_kubernetes_io/version": "0.10.7"
},
"logName": "projects/cartographie-cps/logs/stdout",
"receiveTimestamp": "2021-02-02T10:48:22.022280350Z"
}
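For context, "Querying for flow runs" followed by "No flow runs found" is the agent's normal polling output when there is nothing waiting that it can pick up; a run sitting in Scheduled/Pending while the agent keeps logging this usually points to a label mismatch between the agent and the flow run, or to another agent claiming the run. A rough way to list the waiting runs from Python, assuming the standard Prefect GraphQL schema (adjust the query if your Server version differs):
from prefect import Client

client = Client()

# List runs that are still waiting to be picked up; compare them by hand with
# the labels the agent prints at startup and the labels on the flow's
# run config / storage.
result = client.graphql(
    """
    query {
      flow_run(where: {state: {_in: ["Scheduled", "Pending"]}}) {
        name
        state
        scheduled_start_time
      }
    }
    """
)
print(result)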

Florent VanDeMoortele

02/02/2021, 5:35 PM
Hi @Kyle Moon-Wright Thank you for your help. I work with @Morgan Omind and I'd like to add our latest tests. We had two K8s agents (one on version 0.10.7 and the other on version 0.12.5). The 0.12.5 one runs continuously, but the 0.10.7 one pauses. In the GCP logs, we noticed that our flows were calling the one on version 0.10.7. So we killed the 0.10.7 agent and updated the K8s agent version in Helm's values.yml. After redeployment, we see that the flow finally calls the correct agent (v. 0.12.5). First piece of good news! But now, when I start the flow, it remains Pending for several minutes. I finally cancel it and relaunch it immediately. This time the flow runs perfectly. When I restart the flow again, it remains Pending. I cancel it to restart it, and so on. This is weird (and seems random), but maybe I misunderstand something about instantiation and startup time... Do you have any idea?

Kyle Moon-Wright

02/02/2021, 6:50 PM
Hey @Florent VanDeMoortele, This is definitely a tough one, especially because I'm not too familiar with the Helm chart implementation of Prefect Server and troubleshooting there. However, given your last note, I would definitely recommend making sure all versions of both Prefect and Python are the same across the board, as mismatches can cause a multitude of issues - particularly the Prefect version, which is currently at 0.14.5. Every version increase brings new functionality/configurability, and this is quite important for the Agent, as Agents will not be able to run flows using a higher version of Prefect. This Agent misbehavior could be a result of these mismatches, but it's hard to tell given it's a Server implementation. Flows staying in Pending definitely sounds networking related, and I recall something similar in earlier versions of Prefect, so an update could be fruitful. If this persists, however, this may be better triaged via a GitHub issue/discussion - though do note that I believe the Helm chart implementation was user-contributed, so our ability to help with very specific issues may be limited. Hope that makes sense!
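A small sketch of the version-alignment point, assuming Docker storage is used to build the flow images: Docker storage accepts a prefect_version pin, so the image can be built against the same release the agent runs (the version string below is simply the one mentioned above):
from prefect.storage import Docker

# "flow" is the Flow object being registered.
flow.storage = Docker(
    registry_url="eu.gcr.io/quetzal-omind/iguazu/dev/flows",
    image_name="bilan_vr",
    prefect_version="0.14.5",  # keep identical to the agent's Prefect version
)

# Quick check inside the image and on the agent host:
#   python -c "import prefect; print(prefect.__version__)"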

Florent VanDeMoortele

02/03/2021, 9:56 AM
Thank you very much for the help and for your time @Kyle Moon-Wright. Indeed, that's a good lead. We will try to redeploy all the flows and the project with the latest versions, and we'll see. How the agent works remains a mystery to us!