Hi there, I'm having difficulty with managing task...
# prefect-community
p
Hi there, I'm having difficulty with managing tasks and killing the tasks that are already started. Here's some background:
• I have one `flow` with two `task`s
  ◦ The first `task` downloads a CSV
  ◦ The second `task` downloads a PDF for every row in the CSV.
    ▪︎ It takes around 10s to download a PDF, and there are around 500 PDFs to be downloaded
    ▪︎ Each task is named after the PDF it is downloading (I'm using `with_options` to assign task names on the fly)
    ▪︎ I've added a concurrency-limit of 8 on this task.
Now, when my flow completes, I still see some 6-7 tasks in `Running` state on the UI. However, the corresponding PDFs are downloaded and saved on my disk. I have three questions:
• Why is this happening? The fact that PDFs are downloaded means that the tasks are completed. Is prefect somehow failing to detect that the job ended?
• I'm using `prefect concurrency-limit inspect 'pdf-downloader'` to look for the running tasks. I am able to extract task-ids, but can't find a documented way of killing them. Is there a command which takes task ID and kills it? If not, what is the preferred way of killing?
• Is there a way to specify maximum time limit for a task?
Thanks in advance 🙂
I should also mention that I am using `selenium` with `geckodriver` to fetch the PDFs. Hence, whenever the second task is executed, it essentially starts a separate process under the hood.
Was going through the logs and found an error level message [Posting only the relevant sections]
sqlite3.OperationalError: database is locked
...
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO task_run_state (id, created, updated, type, timestamp, name, message, state_details, data, task_run_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: ('b5f91989-fd17-439d-ba9c-c4d2fea3f98d', '2022-08-24 02:54:02.330280', '2022-08-24 02:54:02.330286', 'RUNNING', '2022-08-24 02:54:01.949339', 'Running', None, '{"flow_run_id": "55b0e054-2384-4609-b4c2-9376226ade52", "task_run_id": "c84d2773-0165-42a2-b459-0b6c2cf1d9c0", "child_flow_run_id": null, "scheduled_time": null, "cache_key": "/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02557229", "cache_expiration": null}', None, 'c84d2773-0165-42a2-b459-0b6c2cf1d9c0')]
(Background on this error at: <https://sqlalche.me/e/14/e3q8>)
....
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.
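For anyone reading along, here is a minimal standard-library sketch of one way `database is locked` can arise in SQLite: a second writer tries to insert while another connection still holds the write lock. This is not Prefect's code; the throwaway database file, the two connections, and the cut-down `task_run_state` table are assumptions made purely for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical throwaway database; this is not Prefect's database file.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None means autocommit, so transactions are controlled
# explicitly with BEGIN/COMMIT below.
writer_a = sqlite3.connect(db_path, isolation_level=None)
writer_a.execute("CREATE TABLE task_run_state (id TEXT, type TEXT)")

# timeout=0 makes this connection give up immediately instead of waiting
# for the lock (the sqlite3 module's default is 5 seconds).
writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)

# The first writer takes the write lock and keeps its transaction open.
writer_a.execute("BEGIN IMMEDIATE")
writer_a.execute("INSERT INTO task_run_state VALUES ('a', 'RUNNING')")

try:
    # The second writer cannot get the lock while writer_a holds it.
    writer_b.execute("INSERT INTO task_run_state VALUES ('b', 'RUNNING')")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Releasing the lock lets the same insert go through.
writer_a.execute("COMMIT")
writer_b.execute("INSERT INTO task_run_state VALUES ('b', 'RUNNING')")

row_count = writer_a.execute("SELECT COUNT(*) FROM task_run_state").fetchone()[0]
print(error_message, row_count)
```

Whether this is what happens inside the flow above is only a guess, but it shows that the error comes from competing writers rather than from the data being written.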
e
Hi @Prakash Rai 👋 A couple of questions for you:
1. Would you be comfortable sharing your code or a pared-down version of it?
2. Are you running the tasks asynchronously with `task_fn.submit(path)` or something like that?
3. Are you caching the result of the PDF downloader function? I’ve found that when caching it’s more performant to return a path to a file rather than the content itself.
Why is this happening? The fact that PDFs are downloaded means that the tasks are completed. Is prefect somehow failing to detect that the job ended?
I wouldn’t think so. More likely there’s some background operation holding things up.
Is there a command which takes task ID and kills it? If not, what is the preferred way of killing
To my knowledge, killing the flow should also kill any tasks running as part of the flow. Are you running locally or through an agent?
Is there a way to specify maximum time limit for a task?
Not on the task level, but you can set `timeout_seconds` on the flow.
p
Hi @Emil Christensen, Thanks for your reply.
Would you be comfortable sharing your code or a pared-down version of it?
I won't be able to share the exact code, but I'll try to share a version with dummy data. I'm not sure whether I'll be able to reproduce the problems that way.
Are you running the tasks asynchronously with `task_fn.submit(path)` or something like that?
Yes
Are you caching the result of the PDF downloader function? I’ve found that when caching it’s more performant to return a path to a file rather than the content itself.
Yes, and I am returning the path of the downloaded files.
To my knowledge, killing the flow should also kill any tasks running as part of the flow. Are you running locally or through an agent?
I also expected that. Surprisingly, the tasks are still running after I kill the flow. Running the `prefect concurrency-limit inspect 'pdf-downloader'` command lists the active task run IDs, even if there are no flows running (Am I interpreting it in a wrong way?). Attaching an image for your reference. I'm running these tasks locally. Planning to shift to agents later. Also, thanks for sharing the `timeout_seconds` link. It might be able to solve this issue.
e
I also expected that. Surprisingly, the tasks are still running after I kill the flow. Running the `prefect concurrency-limit inspect 'pdf-downloader'` command lists the active task run IDs, even if there are no flows running (Am I interpreting it in a wrong way?)
It could be that the task states in the DB aren’t updated since the flow is killed. If you’re only running locally and you successfully kill the flow, then the tasks shouldn’t be able to keep processing.
Attaching an image for your reference.
I don’t think it made it 😞
p
@Emil Christensen Ah sorry, here's the command I ran and output I got. As of now, no active flow-runs are there, and yet you can see the tasks
The command I'm using to kill the flow is `prefect flow-runs delete <id>`
e
@Prakash Rai Gotcha. `prefect flow-runs delete` just deletes the metadata about the flow run. If there are no running flows then there won’t be any actively running tasks. I’m fairly confident that the task states just haven’t been updated.
p
Okay, is there a straightforward way to kill a given `flow-run`?