# prefect-community
g
Has anyone faced this before? the schedule for this flow is defined as
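from datetime import datetime, timedelta

from prefect.schedules import Schedule
from prefect.schedules.clocks import IntervalClock

bot_schedule = Schedule(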
    clocks=[
        IntervalClock(
            interval=timedelta(hours=1),
            start_date=datetime(2021, 1, 1),
            labels=[
                ...,
            ],
            parameter_defaults={
                ...,
            },
        ),
    ],
)
At first, we thought it was a UI issue. But then we confirmed that the flow wasn't being scheduled at all for this time window. Any ideas?
k
Hey @Gabriel Milan, I think this is similar to this. No immediate answer, but we're looking into it. Could you give me some flow ids?
g
Yes! It seems like the exact same issue, though we're using Prefect Server
And it did work after re-registering the flow; I forgot to mention that in the initial message
k
What is your server version?
g
0.15.9
but we're willing to upgrade, if that's necessary
k
I don't know for now
g
Is there any information I can gather that could help you debug this?
k
I actually don't know enough about this one, so maybe one of the engineers will reach out
g
Do we have any updates on this issue?
k
I don't have any beyond the GitHub issue. Did it happen after re-registration?
Oh the issue was opened in our private repo
g
idk if this is related, but it seems like no new runs are being scheduled at all for some of our flows, and re-registering them doesn't fix it. any thoughts on this? (and any updates on the original issue?)
k
Is it across all flows?
g
not all flows, just some of them
k
How many late runs do you have across your whole server?
z
How is your server deployed?
g
we're using the latest helm chart with these values:
serverVersionTag: "core-0.15.9"

prefectVersionTag: "0.15.9"

uiVersionTag: "latest"

imagePullSecrets:
  []

annotations: {}

postgresql:
  postgresqlDatabase: prefect
  postgresqlUsername: prefect
  existingSecret: postgresql-password
  servicePort: 5432
  externalHostname: "cloud-sql-proxy"
  useSubChart: false

  persistence:
    enabled: false
    size: 8Gi

  initdbUser: postgres

  initdbScripts:
    create_pgcrypto.sql: |
      -- create pgcrypto extension, required for Hasura UUID
      CREATE EXTENSION IF NOT EXISTS pgcrypto;
      CREATE EXTENSION IF NOT EXISTS "pg_trgm";
      SET TIME ZONE 'UTC';

prefectConfig:
  services:
    towel:
      max_scheduled_runs_per_flow: "50"

hasura:
  image:
    name: hasura/graphql-engine
    tag: v1.3.3
    pullPolicy: IfNotPresent
    pullSecrets: []

  service:
    type: ClusterIP
    port: 3000

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: []
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

graphql:
  image:
    name: prefecthq/server
    tag: null
    pullPolicy: Always
    pullSecrets: []

  service:
    type: ClusterIP
    port: 4201

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

  init:
    env: {}
    resources: {}

apollo:
  image:
    name: prefecthq/apollo
    tag: null
    pullPolicy: Always
    pullSecrets: []

  options:
    telemetryEnabled: true

  service:
    type: ClusterIP
    port: 4200

  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts: []
    path: /

  labels: {}
  annotations: {}
  replicas: 2
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources:
    limits:
      cpu: "500m"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "256Mi"
  nodeSelector: {}
  tolerations: []
  affinity: {}

ui:
  image:
    name: prefecthq/ui
    tag: null
    pullPolicy: Always
    pullSecrets: []

  apolloApiUrl: http://prefect-apollo.prefect.svc.cluster.local:4200/graphql

  service:
    type: ClusterIP
    port: 8080

  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts: []
    path: /

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

towel:
  image:
    name: prefecthq/server
    tag: null
    pullPolicy: Always
    pullSecrets: []

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

agent:
  enabled: true
  prefectLabels:
    - emd
  jobTemplateFilePath: "https://storage.googleapis.com/datario-public/job_template_mount.yaml"
  image:
    name: prefecthq/prefect
    tag: null
    pullPolicy: Always
    pullSecrets: []

  labels: {}
  annotations: {}
  replicas: 1
  strategy: {}
  podSecurityContext: {}
  securityContext: {}
  env: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}

  resources:
    limits:
      cpu: 100m
      memory: 128Mi

  job:
    resources:
      limits:
        memory: ""
        cpu: ""
      requests:
        memory: ""
        cpu: ""
    imagePullPolicy: ""
    imagePullSecrets: []

serviceAccount:
  create: true
  name: null

jobs:
  createTenant:
    enabled: false
    tenant:
      name: default
      slug: default
    image:
      name: prefecthq/prefect
      tag: null
      pullPolicy: Always
      pullSecrets: []
    labels: {}
    annotations: {}
    podSecurityContext: {}
    securityContext: {}
    nodeSelector: {}
    tolerations: []
    affinity: {}
    backoffLimit: 10
z
Are there any scheduler service logs in the towel pod?
If not, can you adjust the log level in that container to DEBUG and see if there’s anything relevant?
g
i can see several error messages there
z
Here’s an unpacked scheduler error
2022-03-14T18:48:52.673123542Z {"severity": "ERROR", "name": "prefect-server.Scheduler", "message": "Unexpected error: APIError('Unable to complete operation. An internal API error occurred.')", "exc_info": "Traceback (most recent call last):
  File "/prefect-server/src/prefect_server/utilities/exceptions.py", line 87, in reraise_as_api_error
    yield
  File "/prefect-server/src/prefect_server/utilities/graphql.py", line 64, in execute
    timeout=30,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1385, in post
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1148, in request
    request, auth=auth, allow_redirects=allow_redirects, timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1169, in send
    request, auth=auth, timeout=timeout, allow_redirects=allow_redirects,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1196, in send_handling_redirects
    request, auth=auth, timeout=timeout, history=history
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1232, in send_handling_auth
    response = await self.send_single_request(request, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1269, in send_single_request
    timeout=timeout.as_dict(),
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection_pool.py", line 153, in request
    method, url, headers=headers, stream=stream, timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py", line 65, in request
    self.socket = await self._open_socket(timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py", line 86, in _open_socket
    hostname, port, ssl_context, timeout
  File "/usr/local/lib/python3.7/site-packages/httpcore/_backends/auto.py", line 38, in open_tcp_stream
    return await self.backend.open_tcp_stream(hostname, port, ssl_context, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_backends/asyncio.py", line 234, in open_tcp_stream
    stream_reader=stream_reader, stream_writer=stream_writer
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc) from None
httpcore._exceptions.ConnectError: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 3000), [Errno 99] Cannot assign requested address

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/prefect-server/src/prefect_server/services/loop_service.py", line 60, in run
    await self.run_once()
  File "/prefect-server/src/prefect_server/services/towel/scheduler.py", line 47, in run_once
    offset=500 * iterations,
  File "/prefect-server/src/prefect_server/database/orm.py", line 501, in get
    as_box=not apply_schema,
  File "/prefect-server/src/prefect_server/database/hasura.py", line 85, in execute
    as_box=as_box,
  File "/prefect-server/src/prefect_server/utilities/graphql.py", line 64, in execute
    timeout=30,
  File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
    await self.gen.athrow(typ, value, traceback)
  File "/prefect-server/src/prefect_server/utilities/exceptions.py", line 93, in reraise_as_api_error
    raise APIError() from exc
prefect_server.utilities.exceptions.APIError: Unable to complete operation. An internal API error occurred."}
2022-03-14T18:48:52.673586081Z {"severity": "DEBUG", "name": "prefect-server.Scheduler", "message": "Heartbeat from Scheduler: next run at 2022-03-14T18:51:22+00:00"}
It looks like the services are failing to connect to the hasura pod.
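For anyone who wants to unpack these log lines themselves, here is a minimal sketch. It assumes each raw line is a timestamp, a single space, and then one JSON object with the traceback stored in `exc_info` with escaped newlines, which matches the lines above; the helper names `parse_log_line` and `unpack` are made up for illustration.

```python
import json
from typing import Tuple


def parse_log_line(line: str) -> Tuple[str, dict]:
    """Split '<timestamp> <json>' into the timestamp and the decoded record."""
    timestamp, _, payload = line.partition(" ")
    return timestamp, json.loads(payload)


def unpack(line: str) -> str:
    """Render one log line; decoding the JSON expands the traceback's newlines."""
    timestamp, record = parse_log_line(line)
    header = f"{timestamp} [{record['severity']}] {record['name']}: {record['message']}"
    exc_info = record.get("exc_info")
    return header if not exc_info else f"{header}\n{exc_info}"
```

Feeding it the output of `kubectl logs` line by line should give a readable traceback for every record that carries `exc_info`.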
g
Hasura seems to be fine, though.. Some other flows are running fine
z
Yeah it’s trying to access `127.0.0.1`, which doesn’t really make sense
What is the value of the `PREFECT_SERVER__HASURA__HOST` environment variable on the towel pod?
g
there are none, which is very weird
z
Without that variable, the services won’t be able to connect to the API and consequently you won’t get any scheduled runs.
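A rough way to check this from inside the towel pod is sketched below. It assumes the service falls back to `localhost:3000` when the variable is unset (consistent with the `('127.0.0.1', 3000)` in the traceback, but still an assumption), and `PREFECT_SERVER__HASURA__PORT` is assumed by analogy with the host variable. The function names are hypothetical.

```python
import socket
from typing import Mapping, Tuple


def resolve_hasura_target(env: Mapping[str, str]) -> Tuple[str, int]:
    """Return the (host, port) the services would try, given an environment.

    The localhost:3000 fallback is an assumption based on the traceback.
    """
    host = env.get("PREFECT_SERVER__HASURA__HOST", "localhost")
    port = int(env.get("PREFECT_SERVER__HASURA__PORT", "3000"))
    return host, port


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect(*resolve_hasura_target(os.environ))` (after `import os`) in the pod would show whether the resolved target is reachable at all.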
g
I've re-run `helm upgrade --install prefect -n prefect prefecthq/prefect-server -f values.yaml` and now there's `PREFECT_SERVER__HASURA__HOST: prefect-hasura.prefect` in my environment, which seems correct
I'll check if the flow schedules are back
z
Great!
You were likely seeing some scheduling because we will schedule some runs at registration time independently of the scheduling service.
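To make that concrete, here is a small stdlib-only sketch of how a fixed batch of runs created at registration time covers only a limited window for an hourly interval clock. It does not use Prefect itself, and the batch size of 3 is purely hypothetical; the real number of runs created at registration is not stated in this thread.

```python
from datetime import datetime, timedelta
from typing import List


def upcoming_runs(
    start: datetime, interval: timedelta, after: datetime, n: int
) -> List[datetime]:
    """Return the first n interval ticks strictly after `after`."""
    if after < start:
        first = start
    else:
        # Whole intervals elapsed since start, then step to the next tick.
        elapsed = (after - start) // interval
        first = start + (elapsed + 1) * interval
    return [first + i * interval for i in range(n)]


# Hypothetical registration at 12:30 with a batch of 3 runs.
batch = upcoming_runs(
    start=datetime(2021, 1, 1),
    interval=timedelta(hours=1),
    after=datetime(2022, 3, 14, 12, 30),
    n=3,
)
# batch covers 13:00, 14:00 and 15:00 on 2022-03-14; nothing later exists
# until the scheduler service (or another registration) adds more runs.
```

Under these assumptions, a flow would look healthy for a few hours after each re-registration and then stop getting runs, which matches what was observed.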
g
we were actually passing `None`s as some default parameters, and it seems not to work, unfortunately
are there any updates on the original issue?
z
Now that you’ve got the scheduler service running again, are there any error logs?
g
no errors other than `Field may not be null`
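If the `None` default parameters are what triggers this (an assumption, the thread doesn't confirm it), a quick check like the sketch below can list the offending keys before the flow is registered. The helper name and the parameter names in the example are hypothetical.

```python
from typing import Any, Dict, List


def none_valued_keys(parameter_defaults: Dict[str, Any]) -> List[str]:
    """Return the sorted names of parameters whose default value is None."""
    return sorted(k for k, v in parameter_defaults.items() if v is None)


# Hypothetical defaults for one clock.
example_defaults = {"dataset_id": "abc", "table_id": None, "mode": None}
offending = none_valued_keys(example_defaults)
```

Here `offending` would be `["mode", "table_id"]`, which narrows down which defaults to replace or drop.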
z
Can you share the full error?
g
this error is related to our schedule definitions. I was actually worried about the "gray bars" issue, that started this thread
@Kevin Kho would you know anything about it? I remember you mentioned an issue that was opened in your private repo
k
No, I don't have any update on it. It hasn’t been investigated yet
g
is there somewhere I can track this?
k
Let me try creating a server issue to link to our private Cloud issue
I can’t link to the private issue, so the best I can do is open a public issue that references the private one and add a note on the private one pointing to the public issue. The private issue has a lot more detail related to the specific Cloud user, so I can’t copy-paste the content either
g
alright, thank you very much for this!