Has anyone faced this before the schedule for this flow is d Prefect Community #ask-community

Gabriel Milan

03/10/2022, 8:13 PM

Has anyone faced this before? The schedule for this flow is defined as:

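from datetime import datetime, timedelta

from prefect.schedules import Schedule
from prefect.schedules.clocks import IntervalClock

# Hourly schedule anchored at 2021-01-01; labels and parameter defaults are elided.
bot_schedule = Schedule(
    clocks=[
        IntervalClock(
            interval=timedelta(hours=1),
            start_date=datetime(2021, 1, 1),
            labels=[
                ...,
            ],
            parameter_defaults={
                ...,
            },
        ),
    ],
)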

At first, we thought it was a UI issue. But then we confirmed that the flow wasn't being scheduled at all for this time window. Any ideas?
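One way to confirm that runs are really missing for a time window, rather than just not drawn in the UI, is to compare the times an hourly clock should produce against the scheduled start times actually recorded. The following is a minimal sketch using only the standard library; `expected_runs` and `missing_runs` are hypothetical helper names, and the list of actual run times is assumed to come from wherever you can export it (for example a query against the server).

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def expected_runs(
    start: datetime,
    interval: timedelta,
    window_start: datetime,
    window_end: datetime,
) -> List[datetime]:
    """Return the ticks of a fixed-interval clock inside [window_start, window_end)."""
    if window_start <= start:
        tick = start
    else:
        # Jump straight to the first tick at or after window_start.
        elapsed = window_start - start
        steps = -(-elapsed // interval)  # ceiling division on timedeltas
        tick = start + steps * interval
    ticks = []
    while tick < window_end:
        ticks.append(tick)
        tick += interval
    return ticks


def missing_runs(expected: Iterable[datetime], actual: Iterable[datetime]) -> List[datetime]:
    """Return the expected ticks that have no matching scheduled run."""
    seen = set(actual)
    return [t for t in expected if t not in seen]


if __name__ == "__main__":
    expected = expected_runs(
        start=datetime(2021, 1, 1),
        interval=timedelta(hours=1),
        window_start=datetime(2022, 3, 10, 12),
        window_end=datetime(2022, 3, 10, 16),
    )
    # Hypothetical data: only two of the four hourly runs were scheduled.
    actual = [datetime(2022, 3, 10, 12), datetime(2022, 3, 10, 15)]
    for gap in missing_runs(expected, actual):
        print("missing run at", gap.isoformat())
```

The comparison is exact, so it assumes the recorded start times and the clock's `start_date` share the same timezone handling.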

Kevin Kho

03/10/2022, 8:18 PM

Hey @Gabriel Milan, no immediate answer, but we’re looking into it. I think this is similar to this. Could you give me some flow ids?

Gabriel Milan

03/10/2022, 8:21 PM

Yes! It seems like the exact same issue, though we're using Prefect Server

Gabriel Milan

03/10/2022, 8:22 PM

And it did work after re-registering the flow; I forgot to mention that in the initial message

Kevin Kho

03/10/2022, 8:23 PM

What is your server version?

Gabriel Milan

03/10/2022, 8:24 PM

0.15.9

Gabriel Milan

03/10/2022, 8:26 PM

but we're willing to upgrade, if that's necessary

Kevin Kho

03/10/2022, 8:27 PM

I don't know for now

Gabriel Milan

03/10/2022, 8:31 PM

Is there any information I can gather that could help you debug this?

Kevin Kho

03/10/2022, 8:32 PM

I actually don't know enough on this one, so maybe one of the engineers will reach out

Gabriel Milan

03/11/2022, 5:48 PM

Do we have any updates on this issue?

Kevin Kho

03/11/2022, 5:53 PM

I don't have any beyond the GitHub issue. Did it happen after re-registration?

Kevin Kho

03/11/2022, 5:53 PM

Oh the issue was opened in our private repo

Gabriel Milan

03/14/2022, 7:02 PM

idk if this is related, but it seems like no new runs are being scheduled at all for some of our flows, and re-registering them doesn't fix it. any thoughts on this? (and any updates on the original issue?)

Kevin Kho

03/14/2022, 8:07 PM

Is it across all flows?

Gabriel Milan

03/14/2022, 8:13 PM

not all flows, just some of them

Kevin Kho

03/14/2022, 8:16 PM

How many late runs do you have across your whole server?

Zanie

03/14/2022, 8:17 PM

How is your server deployed?

Gabriel Milan

03/14/2022, 8:27 PM

we're using the latest helm chart with these values:

serverVersionTag: "core-0.15.9"

prefectVersionTag: "0.15.9"

uiVersionTag: "latest"

imagePullSecrets:
  []

annotations: {}

postgresql:
  postgresqlDatabase: prefect
  postgresqlUsername: prefect
  existingSecret: postgresql-password
  servicePort: 5432
  externalHostname: "cloud-sql-proxy"
  useSubChart: false

  persistence:
    enabled: false
    size: 8Gi

  initdbUser: postgres

  initdbScripts:
    create_pgcrypto.sql: |
      -- create pgcrypto extension, required for Hasura UUID
      CREATE EXTENSION IF NOT EXISTS pgcrypto;
      CREATE EXTENSION IF NOT EXISTS "pg_trgm";
      SET TIME ZONE 'UTC';

prefectConfig:
  services:
    towel:
      max_scheduled_runs_per_flow: "50"

hasura:
  image:
    name: hasura/graphql-engine
    tag: v1.3.3
    pullPolicy: IfNotPresent
    pullSecrets: []

  service:
    type: ClusterIP
    port: 3000

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: []
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

graphql:
  image:
    name: prefecthq/server
    tag: null
    pullPolicy: Always
    pullSecrets: []

  service:
    type: ClusterIP
    port: 4201

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

  init:
    env: {}
    resources: {}

apollo:
  image:
    name: prefecthq/apollo
    tag: null
    pullPolicy: Always
    pullSecrets: []

  options:
    telemetryEnabled: true

  service:
    type: ClusterIP
    port: 4200

  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts: []
    path: /

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

ui:
  image:
    name: prefecthq/ui
    tag: null
    pullPolicy: Always
    pullSecrets: []

  apolloApiUrl: http://prefect-apollo.prefect.svc.cluster.local:4200/graphql

  service:
    type: ClusterIP
    port: 8080

  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts: []
    path: /

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

towel:
  image:
    name: prefecthq/server
    tag: null
    pullPolicy: Always
    pullSecrets: []

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

agent:
  enabled: true
  prefectLabels:
    - emd
  jobTemplateFilePath: "https://storage.googleapis.com/datario-public/job_template_mount.yaml"
  image:
    name: prefecthq/prefect
    tag: null
    pullPolicy: Always
    pullSecrets: []

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

  resources:
    limits:
      cpu: 100m
      memory: 128Mi

  job:
    resources:
      limits:
        memory: ""
        cpu: ""
      requests:
        memory: ""
        cpu: ""
    imagePullPolicy: ""
    imagePullSecrets: []

serviceAccount:
  create: true
  name: null

jobs:
  createTenant:
    enabled: false
    tenant:
      name: default
      slug: default
    image:
      name: prefecthq/prefect
      tag: null
      pullPolicy: Always
      pullSecrets: []
    labels: {}
    annotations: {}
    podSecurityContext: {}
    securityContext: {}
    nodeSelector: {}
    tolerations: []
    affinity: {}
    backoffLimit: 10

Zanie

03/14/2022, 8:28 PM

Are there any scheduler service logs in the towel pod?

Zanie

03/14/2022, 8:29 PM

If not, can you adjust the log level in that container to DEBUG and see if there’s anything relevant?

Gabriel Milan

03/14/2022, 8:35 PM

i can see several error messages there

prefect-towel-79795cf75d-9kz9d.log

Zanie

03/14/2022, 8:38 PM

Here’s an unpacked scheduler error

2022-03-14T18:48:52.673123542Z {"severity": "ERROR", "name": "prefect-server.Scheduler", "message": "Unexpected error: APIError('Unable to complete operation. An internal API error occurred.')", "exc_info": "Traceback (most recent call last):
  File "/prefect-server/src/prefect_server/utilities/exceptions.py", line 87, in reraise_as_api_error
    yield
  File "/prefect-server/src/prefect_server/utilities/graphql.py", line 64, in execute
    timeout=30,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1385, in post
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1148, in request
    request, auth=auth, allow_redirects=allow_redirects, timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1169, in send
    request, auth=auth, timeout=timeout, allow_redirects=allow_redirects,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1196, in send_handling_redirects
    request, auth=auth, timeout=timeout, history=history
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1232, in send_handling_auth
    response = await self.send_single_request(request, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1269, in send_single_request
    timeout=timeout.as_dict(),
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection_pool.py", line 153, in request
    method, url, headers=headers, stream=stream, timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py", line 65, in request
    self.socket = await self._open_socket(timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py", line 86, in _open_socket
    hostname, port, ssl_context, timeout
  File "/usr/local/lib/python3.7/site-packages/httpcore/_backends/auto.py", line 38, in open_tcp_stream
    return await self.backend.open_tcp_stream(hostname, port, ssl_context, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_backends/asyncio.py", line 234, in open_tcp_stream
    stream_reader=stream_reader, stream_writer=stream_writer
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc) from None
httpcore._exceptions.ConnectError: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 3000), [Errno 99] Cannot assign requested address

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/prefect-server/src/prefect_server/services/loop_service.py", line 60, in run
    await self.run_once()
  File "/prefect-server/src/prefect_server/services/towel/scheduler.py", line 47, in run_once
    offset=500 * iterations,
  File "/prefect-server/src/prefect_server/database/orm.py", line 501, in get
    as_box=not apply_schema,
  File "/prefect-server/src/prefect_server/database/hasura.py", line 85, in execute
    as_box=as_box,
  File "/prefect-server/src/prefect_server/utilities/graphql.py", line 64, in execute
    timeout=30,
  File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
    await self.gen.athrow(typ, value, traceback)
  File "/prefect-server/src/prefect_server/utilities/exceptions.py", line 93, in reraise_as_api_error
    raise APIError() from exc
prefect_server.utilities.exceptions.APIError: Unable to complete operation. An internal API error occurred."}
2022-03-14T18:48:52.673586081Z {"severity": "DEBUG", "name": "prefect-server.Scheduler", "message": "Heartbeat from Scheduler: next run at 2022-03-14T18:51:22+00:00"}

Zanie

03/14/2022, 8:39 PM

It looks like the services are failing to connect to the hasura pod.
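A quick way to test this kind of connection failure from inside a pod is to try opening a TCP socket to the address the service is using. This is a minimal sketch with the standard library only; `can_connect` is a hypothetical helper, not part of Prefect, and the host and port you pass in are whatever your deployment uses (the traceback above shows `127.0.0.1` and port `3000`).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Address taken from the traceback; replace with your Hasura service address.
    print("127.0.0.1:3000 reachable:", can_connect("127.0.0.1", 3000))
```

If the check fails for `127.0.0.1` but succeeds for the Hasura service name, the service itself is up and the client is simply pointed at the wrong host.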

Gabriel Milan

03/14/2022, 8:41 PM

Hasura seems to be fine, though.. Some other flows are running fine

Zanie

03/14/2022, 8:41 PM

Yeah, it’s trying to access `127.0.0.1`, which doesn’t really make sense

Zanie

03/14/2022, 8:41 PM

What is the value of the `PREFECT_SERVER__HASURA__HOST` environment variable on the towel pod?

Gabriel Milan

03/14/2022, 8:43 PM

there are none, which is very weird

Zanie

03/14/2022, 8:43 PM

Yeah… https://github.com/PrefectHQ/server/blob/master/helm/prefect-server/templates/towel/deployment.yaml#L52-L54 should be setting that

Zanie

03/14/2022, 8:44 PM

Without that variable, the services won’t be able to connect to the API and consequently you won’t get any scheduled runs.
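To illustrate why a missing variable leads to the `127.0.0.1:3000` connection attempt in the traceback, here is a minimal sketch of an environment lookup with a local fallback. It is not Prefect Server's actual config loader: `resolve_hasura_address` is a hypothetical function, the `localhost` and `3000` defaults are assumptions inferred from the traceback, and `PREFECT_SERVER__HASURA__PORT` is assumed by analogy with the host variable.

```python
import os
from typing import Mapping, Tuple

# Assumed defaults, inferred from the failed connection to 127.0.0.1:3000.
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 3000


def resolve_hasura_address(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (host, port), falling back to local defaults when variables are unset."""
    host = env.get("PREFECT_SERVER__HASURA__HOST", DEFAULT_HOST)
    # The port variable name is an assumption for this sketch.
    port = int(env.get("PREFECT_SERVER__HASURA__PORT", DEFAULT_PORT))
    return host, port


if __name__ == "__main__":
    host, port = resolve_hasura_address()
    print(f"services would connect to {host}:{port}")
```

With the variable unset, the lookup silently resolves to the pod's own loopback address, where nothing is listening, so the failure surfaces only as a connection error in the service logs.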

Gabriel Milan

03/14/2022, 8:45 PM

I've re-run `helm upgrade --install prefect -n prefect prefecthq/prefect-server -f values.yaml` and now there's `PREFECT_SERVER__HASURA__HOST: prefect-hasura.prefect` in my environment, which seems correct

Gabriel Milan

03/14/2022, 8:45 PM

I'll check if the flow schedules are back

Zanie

03/14/2022, 8:47 PM

Great!

Zanie

03/14/2022, 8:48 PM

You were likely seeing some scheduling because we will schedule some runs at registration time independently of the scheduling service.

Gabriel Milan

03/14/2022, 8:58 PM

we were actually passing `None`s as some default parameters, and it seems not to work, unfortunately
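One possible workaround, sketched here as an assumption rather than a confirmed fix, is to leave `None`-valued keys out of the defaults before passing them to the clock, so that only concrete values are sent. `drop_none_defaults` is a hypothetical helper and the parameter names are made up for illustration.

```python
from typing import Any, Dict


def drop_none_defaults(defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the defaults without keys whose value is None."""
    return {key: value for key, value in defaults.items() if value is not None}


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    raw = {"dataset_id": "example", "table_id": None, "batch_size": 0}
    print(drop_none_defaults(raw))
```

Falsy values such as `0` or an empty string are kept; only `None` is removed. Whether omitting a key is acceptable depends on the flow's own parameter definitions.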

Gabriel Milan

03/14/2022, 9:05 PM

are there any updates on the original issue?

Zanie

03/14/2022, 9:15 PM

Now that you’ve got the scheduler service running again, are there any error logs?

Gabriel Milan

03/14/2022, 9:25 PM

no errors other than `Field may not be null`

Zanie

03/14/2022, 9:28 PM

Can you share the full error?

Gabriel Milan

03/14/2022, 9:43 PM

this error is related to our schedule definitions. I was actually worried about the "gray bars" issue that started this thread

Gabriel Milan

03/15/2022, 2:04 PM

@Kevin Kho would you know anything about it? I remember you mentioned an issue that was opened in your private repo

Kevin Kho

03/15/2022, 2:08 PM

No, I don't have any update on it. It hasn’t been investigated yet

Gabriel Milan

03/15/2022, 7:09 PM

is there somewhere I can track this?

Kevin Kho

03/15/2022, 7:18 PM

Let me try creating a server issue to link to our private Cloud issue

Kevin Kho

03/15/2022, 7:28 PM

I can’t link to the private issue, so the best I can do is make a public issue that references the private one, so that the private one has a note pointing to the public one. The private issue has a lot more detail related to the specific Cloud user, so I can’t copy-paste the content either

Gabriel Milan

03/15/2022, 7:31 PM

alright, thank you very much for this!
