Jeremy Phelps

    1 year ago
    Hi everyone, I've run into what looks like a bug in Prefect Cloud. No tasks are running, yet this State Message is being reported:
    Queued due to concurrency limits. The local process will attempt to run the task for the next 10 minutes, after which time it will be made available to other agents.
    That string does not appear in the open-source part of Prefect, so it must be part of Prefect Cloud. The concurrency limit on that task is 10, and things were working until I changed some of Dask's configuration parameters to try to resolve an issue with it. The most likely cause of the above message is that some error happened and it didn't get handled correctly. https://cloud.prefect.io/stockwell/flow-run/4460703d-3c91-4573-b85c-a4b001048999
    Kevin Kho

    1 year ago
    Hey @Jeremy Phelps, could you try querying for the relevant tags and seeing if there are task runs in a running state? Something like:
    query {
      task(where: {tags: {_eq: []}}) {
        flow {
          id
          name
        }
        id
        name
        tags
        task_runs(where: {state: {_eq: "Running"}}) {
          id
          name
          state
        }
      }
    }
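    (A minimal sketch of running this kind of query from Python, assuming the Prefect 1.x client; client.graphql is the documented call, the tag list below is illustrative:)

    from prefect import Client

    client = Client()  # uses the API token already configured for the tenant

    query = """
    query {
      task(where: {tags: {_eq: ["staging"]}}) {
        id
        name
        tags
        task_runs(where: {state: {_eq: "Running"}}) {
          id
          name
          state
        }
      }
    }
    """

    result = client.graphql(query)
    print(result)  # result.data.task holds the matching tasks and their running task_runs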
    Jeremy Phelps

    1 year ago
    I filled in that query as:
    query {
      task(where: {tags: {_eq: ["staging"]}}) {
        flow {
          id
          name
        }
        id
        name
        tags
        task_runs(where: {state: {_eq: "Running"}}) {
          id
          name
          state
        }
      }
    }
    ...and it returned no tasks.
    I also tried the tag for the production cluster and found nothing there (as it should be).
    The agent's logs don't have any useful information either:
    [2021-07-16 13:12:55-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'demand-forecasting-delivery-scheduler'
    [2021-07-16 13:12:55-0500] INFO - prefect.DaskExecutor | Connecting to an existing Dask cluster at tcp://dask-scheduler:8786
    Logs from the Dask scheduler show that a client connected right when I started the flow run. But no tasks were forwarded to the workers.
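    (For reference, a sketch of how a Prefect 1.x flow ends up pointed at that existing Dask scheduler; the address is taken from the log line above, the flow and task bodies are placeholders:)

    from prefect import Flow, task
    from prefect.executors import DaskExecutor

    @task
    def example_task():
        pass

    with Flow("demand-forecasting-delivery-scheduler") as flow:
        example_task()

    # Connect to the already-running scheduler rather than spinning up a local cluster
    flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")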
    nicholas

    1 year ago
    Hi @Jeremy Phelps - can you run this query instead?
    query {
      task(where: {tags: {_eq: ["staging"]}}) {
        id
        name
        tags
        task_runs(where: {state: {_in: ["Running", "Submitted", "Queued", "Cancelling", "Retrying", "Resume", "Paused"]}}) {
          id
          name
          state
        }
      }
    }
    Jeremy Phelps

    1 year ago
    That also returns nothing.
    nicholas

    1 year ago
    Interesting... let me dig around and see what I can find
    Jeremy Phelps

    1 year ago
    Taking off all the parameters after the task token returns something, but Slack won't let me send it.
    Pastebin it is, I guess: https://pastebin.com/SMpbVxz7
    nicholas

    1 year ago
    And staging is the tag you're having issues with, yeah? (Prefect employees can't see your UI links, just fyi)
    Jeremy Phelps

    1 year ago
    Yes.
    Are Prefect employees also blind to the contents of the database that these GQL queries operate on?
    nicholas

    1 year ago
    @Jeremy Phelps could you clarify what you mean?
    Jeremy Phelps

    1 year ago
    When I run the GQL query you suggested, it performs a lookup in a database that Prefect owns. Can Prefect employees see what's in that database?
    nicholas

    1 year ago
    Prefect does have access, you're correct. Here's what I found: the only concurrency limit that's set is 10, on a tag called mysql-write, and you already have 10 tasks in a running state with that tag; the tasks in the flow run you provided also have the mysql-write tag and are queued correctly as a result. There are no tasks with a staging tag and no concurrency limits with that tag either. One thing I did notice is that the tasks in running states don't all come from the same flow run, which could be causing the confusion.
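    (A sketch of seeing which runs are holding those ten slots, reusing the query shape from earlier in the thread; the flow_run sub-selection on task_run is assumed to exist in the schema:)

    from prefect import Client

    client = Client()

    query = """
    query {
      task(where: {tags: {_eq: ["mysql-write"]}}) {
        id
        name
        task_runs(where: {state: {_eq: "Running"}}) {
          id
          name
          state
          flow_run {
            id
            name
          }
        }
      }
    }
    """

    # Each returned task_run's flow_run tells you which flow run is occupying a slot
    print(client.graphql(query))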
    Jeremy Phelps

    1 year ago
    These tasks are not actually running. How do I find and get rid of them?
    nicholas

    1 year ago
    Let me see if I can grab some flow run ids and names for you so you can manually mark them as finished or cancelled.
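    (A sketch of how those stale flow runs could then be marked Failed from Python; it assumes Prefect Cloud exposes a set_flow_run_states mutation shaped like the set_task_run_states mutation shown later in this thread, and the ids are placeholders:)

    import json
    from prefect import Client

    client = Client()

    stale_flow_run_ids = ["<flow-run-id-1>", "<flow-run-id-2>"]

    # The state field is a JSON payload; dumping it twice produces the escaped
    # GraphQL string literal the API expects.
    state = json.dumps(json.dumps({"type": "Failed", "message": "Stale run; Dask cluster gone"}))

    for flow_run_id in stale_flow_run_ids:
        mutation = (
            'mutation { set_flow_run_states(input: {states: [{flow_run_id: "'
            + flow_run_id
            + '", state: ' + state + '}]}) { states { id status } } }'
        )
        client.graphql(mutation)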
    Jeremy Phelps

    1 year ago
    That doesn't solve the problem going forward. Things will reach this state again.
    I found a bunch of Kubernetes pods that appear to be stale (the Dask schedulers they are talking to have been taken down). I deleted them, so maybe that will help.
    I confused tags with labels. Do tasks with the same "tag" but different "labels" share the same concurrency pool?
    nicholas

    1 year ago
    Two things you can do for the future: first, you can set up flow SLA automations for that flow that will fail it if it exceeds some time threshold; second, you can manually mark the flow runs that are holding onto concurrency slots but whose jobs are stale as failed/completed (along with their associated tasks). Basically you'll need to kill the jobs in some way to make sure they're not holding onto those slots, whether that's through Prefect or your cluster.
    Tasks only have tags and so share the same concurrency pool; flows have labels and share a different concurrency pool, though this is something we'd like to clarify in the future.
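    (For concreteness, a sketch of where the two concepts live in Prefect 1.x; the tag, label, and flow names here are illustrative:)

    from prefect import Flow, task
    from prefect.run_configs import UniversalRun

    # Tags live on tasks: every task carrying the "mysql-write" tag shares one
    # task-concurrency pool, regardless of which flow it belongs to.
    @task(tags=["mysql-write"])
    def write_to_mysql():
        ...

    with Flow("example-flow") as flow:
        write_to_mysql()

    # Labels live on the flow's run config (and on agents): they control which
    # agents pick up the run, and flow-run concurrency is keyed off labels.
    flow.run_config = UniversalRun(labels=["staging"])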
    Jeremy Phelps

    1 year ago
    It seems that the only way to have a truly separate staging environment is to have a separate Prefect account for it.
    nicholas

    1 year ago
    I think what you're describing is entirely doable within a single tenant by using separate tags and labels on your flows to route execution to different environments, but I can put you in touch with one of our account managers to discuss multi-tenancy, which will give you database-level sharding of environments.
    Jeremy Phelps

    1 year ago
    Does multi-tenancy cost additional money?
    nicholas

    1 year ago
    It does, it's an enterprise-grade feature
    Jeremy Phelps

    1 year ago
    Management will never agree to it.
    Is there any documentation for the set_task_run_states mutation?
    nicholas

    1 year ago
    The GraphQL API has docs attached to the schema (you can view these in the interactive API), which denote all input and output types. You can run that mutation through the interactive API like this:
    mutation {
      set_task_run_states(input: {states: [{task_run_id: "<<task run id>>", state: "{\"type\": \"Failed\", \"message\": \"<<your message>>\"}"}]}) {
        states {
          id
          status
        }
      }
    }
    Note the escaping of the state field, which is a JSON payload.
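    (A small sketch of building that escaped state string from Python, which also shows the fields the payload expects; json.dumps is standard library, the message text is a placeholder:)

    import json

    # The state is itself a JSON document with a "type" and an optional "message".
    state_payload = {"type": "Failed", "message": "<your message>"}

    # Dumping once gives the JSON payload...
    print(json.dumps(state_payload))
    # {"type": "Failed", "message": "<your message>"}

    # ...and dumping that string again gives the escaped GraphQL string literal
    # used in the mutation above.
    print(json.dumps(json.dumps(state_payload)))
    # "{\"type\": \"Failed\", \"message\": \"<your message>\"}"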
    Jeremy Phelps

    1 year ago
    The problem I'm running into is that I don't know which fields are expected in the state.
    Oh, I see.
    Ty, that worked.