# ask-community
j
Hi everyone, I've run into what looks like a bug in Prefect Cloud. No tasks are running, yet this State Message is being reported:
Queued due to concurrency limits. The local process will attempt to run the task for the next 10 minutes, after which time it will be made available to other agents.
That string does not appear in the open-source part of Prefect, so it must be part of Prefect Cloud. The concurrency limit on that task is 10, and things were working until I changed some of Dask's configuration parameters to try to resolve an issue with it. The most likely cause of the above message is that some error happened and it didn't get handled correctly. https://cloud.prefect.io/stockwell/flow-run/4460703d-3c91-4573-b85c-a4b001048999
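(For context, the tag-level limits themselves can be listed through the Cloud GraphQL API. A minimal sketch, assuming the task_tag_limit field that Prefect Cloud exposes for task concurrency limits:)
query {
  task_tag_limit {
    tag
    limit
  }
}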
k
Hey @Jeremy Phelps, could you try querying for the relevant tags and seeing if there are task runs in a running state? Something like:
query {
  task(where: {tags: {_eq: []}}) {
    flow {
      id
      name
    }
    id
    name
    tags
    task_runs(where: {state: {_eq: "Running"}}) {
      id
      name
      state
    }
  }
}
j
I filled in that query as:
query {
  task(where: {tags: {_eq: ["staging"]}}) {
    flow {
      id
      name
    }
    id
    name
    tags
    task_runs(where: {state: {_eq: "Running"}}) {
      id
      name
      state
    }
  }
}
...and it returned no tasks.
I also tried the tag for the production cluster and found nothing there (as it should be).
The agent's logs don't have any useful information either:
[2021-07-16 13:12:55-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'demand-forecasting-delivery-scheduler'
[2021-07-16 13:12:55-0500] INFO - prefect.DaskExecutor | Connecting to an existing Dask cluster at tcp://dask-scheduler:8786
Logs from the Dask scheduler show that a client connected right when I started the flow run. But no tasks were forwarded to the workers.
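(One way to see why those task runs sit in a Queued state is to pull the flow run's task runs directly. A sketch that reuses the flow run id from the Cloud URL above and assumes the flow_run and task_runs fields follow the same Hasura-style schema as the other queries in this thread:)
query {
  flow_run(where: {id: {_eq: "4460703d-3c91-4573-b85c-a4b001048999"}}) {
    name
    state
    task_runs {
      id
      state
      state_message  # assumed field; should echo the "Queued due to concurrency limits" message
    }
  }
}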
n
Hi @Jeremy Phelps - can you run this query instead?
query {
  task(where: {tags: {_eq: ["staging"]}}) {
    id
    name
    tags
    task_runs(where: {state: {_in: ["Running", "Submitted", "Queued", "Cancelling", "Retrying", "Resume", "Paused"]}}) {
      id
      name
      state
    }
  }
}
j
That also returns nothing.
n
Interesting... let me dig around and see what I can find
j
Taking off all the parameters after the task token returns something, but Slack won't let me send it.
Pastebin it is, I guess: https://pastebin.com/SMpbVxz7
n
And staging is the tag you're having issues with, yeah? (Prefect employees can't see your UI links, just fyi)
j
Yes.
Are Prefect employees also blind to the contents of the database that these GQL queries operate on?
n
@Jeremy Phelps could you clarify what you mean?
j
When I run the GQL query you suggested, it performs a lookup in a database that Prefect owns. Can Prefect employees see what's in that database?
n
Prefect does have access, you're correct. Here's what I found: the only concurrency limit that's set is 10, on a tag called mysql-write, and you already have 10 tasks in a running state with that tag; the tasks in the flow run you provided also have the mysql-write tag and are queued correctly as a result. There are no tasks with a staging tag and no concurrency limits with that tag either. One thing I did notice is that the tasks in running states don't all come from the same flow run, which could be causing the confusion.
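(Adapting the earlier query to the mysql-write tag should surface the task runs that are actually holding those concurrency slots. A sketch, assuming the tag is stored exactly as ["mysql-write"]:)
query {
  task(where: {tags: {_eq: ["mysql-write"]}}) {
    id
    name
    task_runs(where: {state: {_eq: "Running"}}) {
      id
      name
      state
      flow_run_id  # assumed to be exposed on task_run, to see which flow runs the slots belong to
    }
  }
}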
j
These tasks are not actually running. How do I find and get rid of them?
n
Let me see if I can grab some flow run IDs and names for you, and you can manually mark them as finished or cancelled.
j
That doesn't solve the problem going forward. Things will reach this state again.
I found a bunch of Kubernetes pods that appear to be stale (the Dask schedulers they are talking to have been taken down). I deleted them, so maybe that will help.
I confused tags with labels. Do tasks with the same "tag" but different "labels" share the same concurrency pool?
n
Two things you can do for the future: you can set up flow SLA automations for that flow that will fail the flow if it exceeds some time threshold, and you can manually mark the flow runs that are holding onto concurrency slots but whose jobs are stale as failed/completed (along with their associated task runs). Basically you'll need to kill the stale jobs in some way, whether through Prefect or your cluster, so they're not holding onto those slots.
Tasks only have tags and so share the same concurrency pool; flows have labels and share a different concurrency pool, though this is something we'd like to clarify in the future.
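(To illustrate the manual clean-up described above: a stale flow run can be marked Failed through the GraphQL API with something like the sketch below, assuming a set_flow_run_states mutation analogous to the set_task_run_states one shown further down; the flow run id is a placeholder.)
mutation {
  set_flow_run_states(input: {states: [{flow_run_id: "<<stale flow run id>>", state: "{\"type\": \"Failed\", \"message\": \"Stale Dask job; failed manually\"}"}]}) {
    states {
      id
      status
    }
  }
}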
j
It seems that the only way to have a truly separate staging environment is to have a separate Prefect account for it.
n
I think what you're describing is entirely doable with a single tenant, using separate tags and labels on your flows to direct execution to different environments, but I can put you in touch with one of our account managers to discuss multi-tenancy, which would give you database-level sharding of environments.
j
Does multi-tenancy cost additional money?
n
It does, it's an enterprise-grade feature
j
Management will never agree to it.
Is there any documentation for the set_task_run_states mutation?
n
The GraphQL API has docs attached to the schema (you can view these in the interactive API), which denote all input and output types. You can run that mutation through the interactive API like this:
mutation {
  set_task_run_states(input: {states: [{task_run_id: "<<task run id>>", state: "{\"type\": \"Failed\", \"message\": \"<<your message>>\"}"}]}) {
    states {
      id
      status
    }
  }
}
Note the escaping of the state field, which is a JSON payload.
j
The problem I'm running into is that I don't know which fields are expected in the state.
Oh, I see.
Ty, that worked.