# prefect-community
t
Hello! I’ve got a load of `invalid duration format` errors showing up in my Orion API, just checking if this is a bug or misconfig on my part?
20:39:34.009 | ERROR   | uvicorn.error - Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 366, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/message_logger.py", line 78, in __call__
    await self.app(scope, inner_receive, inner_send)
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 261, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 92, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 147, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 408, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 261, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/utilities/server.py", line 87, in handle_response_scoped_depends
    response = await default_handler(request)
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 227, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 160, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/api/flow_runs.py", line 118, in flow_run_history
    return await run_history(
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/database/dependencies.py", line 112, in async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/api/run_history.py", line 158, in run_history
    return pydantic.parse_obj_as(List[schemas.responses.HistoryResponse], records)
  File "pydantic/tools.py", line 38, in pydantic.tools.parse_obj_as
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ParsingModel[List[prefect.orion.schemas.responses.HistoryResponse]]
__root__ -> 28 -> states -> 0 -> sum_estimated_lateness
  invalid duration format (type=value_error.duration)
☝️ the stack trace from one of them. There’s lots, seem to be a batch of 2-3 every few minutes.
It might be happening every time I view the flow runs list in the UI, but I wasn’t able to confirm that behaviour 100%
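For anyone triaging a log full of these, here is a minimal sketch (not part of Prefect; `failing_fields` and `LOC_LINE` are hypothetical names) that tallies which field each pydantic error path ends in. It assumes the location line always has the `__root__ -> ... -> field` shape shown in the trace above.

```python
import re
from collections import Counter

# Matches the location line pydantic prints under a ValidationError,
# e.g. "__root__ -> 28 -> states -> 0 -> sum_estimated_lateness".
# Assumption: every path starts at "__root__", as in the trace above.
LOC_LINE = re.compile(r"^\s*__root__(?:\s*->\s*\S+)+\s*$")


def failing_fields(log_text: str) -> Counter:
    """Count the final path segment of each pydantic error location line."""
    counts = Counter()
    for line in log_text.splitlines():
        if LOC_LINE.match(line):
            # The last path segment is the field that failed validation.
            field = line.split("->")[-1].strip()
            counts[field] += 1
    return counts


sample = """\
pydantic.error_wrappers.ValidationError: 1 validation error for ParsingModel[List[prefect.orion.schemas.responses.HistoryResponse]]
__root__ -> 28 -> states -> 0 -> sum_estimated_lateness
  invalid duration format (type=value_error.duration)
"""
print(failing_fields(sample))
```

If every hit lands on `sum_estimated_lateness`, that would suggest one field is responsible rather than a general parsing problem, though it would not say which flow runs produce the bad value.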
a
can you share `prefect version` output?
t
If that was the case, I have two flow runs that are pending and therefore have no duration, but all of this is wild ass guessing rather than proper debugging.
Sure, one sec
Version:             2.0b3
API version:         0.3.0
Python version:      3.9.12
Git commit:          58a401bc
Built:               Wed, Apr 13, 2022 11:21 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          postgresql
a
to check if this is a misconfiguration or a bug: when exactly did you see this error showing up in your API logs? Was it when you triggered a specific run? You said it might be when viewing flow runs from the UI, could you confirm that? If this happens when running a flow, could you share the flow code or deployment spec? That could help troubleshoot.
t
It’s definitely not when running a flow, I can confirm that part
👍 1
a
so far I don't have enough info to reproduce or say anything helpful to fix the issue
t
One moment, I’ll see if I can reliably get it to happen.
a
that would be helpful, thanks a lot! If I can't reproduce or help further, I'll open an issue and ask the team
t
Right, I just refreshed the UI three times with the flow runs tab selected and had it appear each time.
I’m not sure how to delete flow runs, if I can do that I can nuke them all and see if it still happens
a
so it looks like if I start ephemeral Orion with Postgres and view the UI anywhere, I should be able to reproduce?
sorry, but I think the only reliable way to nuke old flow runs would be to reset the DB entirely
t
Okay. I can do that. Let me gather more info from the current state first
🙌 1
Dammit, hadn’t realised nuking would wipe the different storage options
As in the fixtures, rather than the user configured ones
Well, refreshing now it’s totally nuked no longer causes the error, so that’s good info.
a
nice work!
t
I want to recreate my flows and have some that aren’t pending to see if that’s the cause but I don’t know how to set storage up again without that seed data. I’d have thought DB recreation would have included those entries but apparently not.
a
sorry to hear that you lost some of your configurations. we tried to make the storage setup easy with the CLI
t
Oh no, it’s not my storage config I’m worried about, that’s dead easy
a
some engineers are working on adding storage to the DeploymentSpec so if that's of any consolation, you won't need to rely on a global storage config in the near future
🙌 1
t
it’s the storage configs that come baked into the DB that allow you to set storage up
Without those the create storage command won’t work, it just tells me to select an option from an empty list
a
Can you make sure that all your browser windows with Orion are closed before doing that?
I had a similar issue - closing all browser tabs, then resetting DB and starting orion/creating storage should work. We have an open issue for that
t
some engineers are working on adding storage to the DeploymentSpec so if that’s of any consolation, you won’t need to rely on a global storage config in the near future
This will be awesome. Storage has been by far the hardest part to fully automate deployment for.
👍 1
upvote 1
a
100% agreed
t
Hmmm, something isn’t right here. I closed all UI windows and reset the DB again but via the CLI
it confirmed reset against local SQLite, but I’m using a kubernetes cluster and PostgreSQL
a
I actually meant that via CLI reset command
t
Checked config and the API URL is still set to the proper location, so it should be looking at/interacting with the deployed version.
a
in that case you would need to recreate a Kubernetes deployment?
t
I’ll try doing that, it seemed to still be running happily post reset, but can’t have been quite as happy as it seemed
👍 1
Well I didn’t cover myself in glory there. Was bouncing the wrong pod over and over wondering why nothing was happening.
😂 1
a
still, looks like everything is working now?
I'll be away for a bit now, LMK if you still have any issues, I can check later
t
Okay, everything is restored and I have deployments back in. Refreshing isn’t causing the error. Let’s test my theory
👍 1
No worries. I’ll leave a note so people can debug. It’s not urgent on my side really, I just wanted cleaner logs to debug other stuff.
👍 1
• Tried refreshing UI with one failed flow run in place. Exception didn’t occur
• Tried refreshing UI with failed and pending flow run in place. Exception didn’t occur
• Error has eventually returned after a period of letting the system run (schedule a few flow runs, normal UI usage etc)
• I can now trigger the error reliably again by refreshing the UI
Sorry, this isn’t a great starting point for debugging but I can’t seem to pin down a reliable set of steps to get into this state. Only config I added post reset was storage. I also have `PREFECT_API_URL` and `PREFECT_AGENT_QUERY_INTERVAL` env vars set.
a
I could try to replicate with Kubernetes setup but just to let you know, if this doesn't work for you, you always can try those two alternatives:
• running Orion without Kubernetes - e.g. on a local instance you could set it up on EC2
• using Cloud 2.0
How did you set up your Orion instance on Kubernetes? Did you follow this tutorial here?
t
I did follow that tutorial. Everything seems to work, it’s just there’s stacks of errors in the logs that make it hard to find errors I care about.
Mostly timeouts and then the ones I describe above. I figure the timeouts are fairly self explanatory, but the invalid duration seemed like it might be a bug
I wouldn’t want to use a different deployment mechanism like EC2 as that’s going to bring down a lot of additional maintenance burden. I’d be okay with using cloud (I’m considering it now), but I’d still need to know I could run it successfully on my own stack first as a safety net since it’s such a core part of our system.
That safety net isn’t just hypothetical either. There’s a decent chance our product will need to be self-hostable for an enterprise version in future. We’d be able to use the cloud offering for the main SaaS solution, but couldn’t use it for those self-hosted instances.
a
I can 100% understand that, and I'm sure this is a solvable problem. Hard to say what exactly went wrong there, but I can reassure you that we build Prefect 2.0 in a way that you can use the OSS product for your production deployments.
t
Oh yeah for sure, I don’t mind there being some teething problems. You’re pretty up front about this being a beta and the product has lots of features that offset that
👍 1
for a chunk of the problems I’ve faced I’m hoping to wrap them up in a kubernetes operator and make it open source so others can pick it up, I just want to make sure I’ve got the right solutions internally before I do that.
I’d also need to wait on the new storage interface as I wouldn’t want to hack around the current one and then immediately replace it. Everything else is pretty workable as-is.
a
Nice! If you want a sneak peek of the storage interface, it's based on the `fsspec` interface which provides a lot of flexibility in that regard - if you want to check this: https://github.com/PrefectHQ/prefect/blob/orion/src/prefect/blocks/storage.py
👍 1
t
@Anna Geller I might have just found the issue, mea culpa. I’m on an M1 Mac locally, so it’s emulating. I only just noticed prefect doesn’t have an arm64 image. My experience of emulation with docker is that it’s pretty ropey, and the errors you can get are… weird. Are there any plans for an arm build of the image?
I can build a local version in the meantime, but would be awesome if the official one had both.
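A quick way to confirm which architecture a container is actually running as is to ask Python from inside it. This is a minimal sketch using only the standard library; `describe_arch` is a hypothetical helper, and the expectation that an emulated image reports `x86_64` on an M1 host is an assumption about how the emulation presents itself.

```python
import platform


def describe_arch() -> str:
    """Normalise the machine type the current Python process reports."""
    machine = platform.machine().lower()
    if machine in ("arm64", "aarch64"):
        return "arm64"
    if machine in ("x86_64", "amd64"):
        return "x86_64"
    # Anything else is returned as reported (lowercased).
    return machine


print(describe_arch())
```

Run inside the Orion container, `x86_64` would match the `OS/Arch: linux/x86_64` line in the version output above, while a native build would be expected to print `arm64`.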
a
Nice to hear you found out the root cause of the issue, good work! Let me open an issue for that. @Marvin open "Consider adding arm64 base images for Prefect 2.0"
🙌 1
t
Never mind. I just rebuilt locally using arm64, build was successful and the app does seem more performant, but the errors described above still exist. So it’s worth doing, but not the cause of this problem.
👍 1