How to connect EFS to AWS Lambda used together with Prefect?

Viet Nguyen

10/27/2022, 2:50 PM
I have a Prefect agent running on a Lambda function that is connected to a VPC. At the moment the security group's inbound rules allow all traffic from everywhere, which is not ideal. Besides port 4200, which other ports do I need to allow for Prefect to run on Lambda in a VPC without issues? Thank you.
@Anna Geller long time no chat
hope you're doing well 🙂

Anna Geller

10/27/2022, 3:44 PM
I hope you're well too! 👋
this repo template may be a good way to get started, there's a link to a blog post in readme https://github.com/anna-geller/prefect-aws-lambda

Viet Nguyen

10/27/2022, 3:53 PM
yeah, I got it working perfectly 10 mins ago when the Lambda func was not connected to a VPC
but since I need to use AWS EFS for the Lambda func, I need to hook it to a VPC
and that requires security group settings
[ERROR] RuntimeError: Cannot create flow run. Failed to reach API at <https://api.prefect.cloud/api/accounts>
and it used to run successfully, like this:
[image: logs from CloudWatch]
code and env variables for the Lambda func stay the same
just adding a VPC to Lambda
so I wonder which ports Prefect uses in the background, I don't want to open all ports.
and the Lambda func has to hook into an EFS because multiple Lambda calls writing to the same Zarr store require a process synchronizer for write consistency.
thanks @Anna Geller

Anna Geller

10/27/2022, 7:21 PM
sorry I can't help with EFS and I'm not sure if I would recommend that with Lambda, it adds so much complexity, why not use S3?

Viet Nguyen

10/27/2022, 10:55 PM
S3 is still the destination for Zarr, EFS is for the sync files
multiple Lambda invocations trying to write data to the same Zarr store on S3 need to use zarr.ProcessSynchronizer() so that different invocations won't overlap each other and corrupt the Zarr store on S3
anyway I figured it out.
all I needed to know was which ports the Prefect API and agent use. I wasn't asking how to use Prefect with Lambda, or with the Lambda + EFS combination, so I think it was a miscommunication 🙂

Anna Geller

10/27/2022, 11:37 PM
oh wow, can you share how you did it? I'm curious
it's just more that you know more about it than me -- I wouldn't honestly know how to use self-hosted Prefect with Lambda, I only used it with Cloud anyway, great to hear you figured it out and if you want to share how you did it, that could be useful to others 🙌

Viet Nguyen

10/27/2022, 11:41 PM
I don't self-host Prefect either, I use Prefect Cloud
hang on I have some diagram
so this is the high-level logic: https://zarr.readthedocs.io/en/stable/api/sync.html
"Path to a directory on a file system that is shared by all processes. N.B., this should be a different path to where you store the array."
and this is the slightly more detailed logic
EFS solves this "Path to a directory on a file system that is shared by all processes" requirement on a distributed system like Lambda, to maintain the accuracy of the S3 source data store when multiple invocations try to update the same Zarr store at the same time.
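To illustrate why the shared directory matters, here is a minimal standard-library sketch of file-based locking. It is not zarr's implementation of `zarr.ProcessSynchronizer`, only the same general idea: every writer takes an exclusive lock on a file in a directory that all processes can see before doing a read-modify-write. The names `chunk_lock` and `increment_counter` are hypothetical, a local counter file stands in for the Zarr store on S3, and it is an assumption that the sync directory would be an EFS mount (for example a path like `/mnt/efs/sync`) in the Lambda setup described here.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def chunk_lock(sync_dir: str, key: str):
    """Hold an exclusive advisory lock on one lock file per key."""
    os.makedirs(sync_dir, exist_ok=True)
    # One lock file per key, so writers to different keys do not block each other.
    lock_path = os.path.join(sync_dir, key.replace("/", "_") + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding this lock releases it.
        fcntl.lockf(lock_file, fcntl.LOCK_EX)
        try:
            yield lock_path
        finally:
            fcntl.lockf(lock_file, fcntl.LOCK_UN)


def increment_counter(sync_dir: str, data_path: str) -> int:
    """Read-modify-write a counter file while holding the lock.

    The counter file is a stand-in for shared data; without the lock,
    two concurrent writers could both read the same old value.
    """
    with chunk_lock(sync_dir, "counter"):
        current = 0
        if os.path.exists(data_path):
            with open(data_path) as f:
                current = int(f.read() or 0)
        with open(data_path, "w") as f:
            f.write(str(current + 1))
        return current + 1
```

The lock directory is kept separate from the data path, which mirrors the note in the Zarr docs that the synchronizer path should differ from where the array is stored.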
so it's not self-hosted Prefect, the flow runs show up in your Cloud: look at the address bar
reupload first pic, high level diagram
@Anna Geller
because there are API connections between the Lambda func and Prefect Cloud, and the function now sits in a VPC with a security group attached (required if you attach EFS to a Lambda func), it needs to be granted access so that the two can talk to each other, e.g. so the Lambda can reach the Prefect Cloud API. Does that make sense to you?

Anna Geller

10/28/2022, 2:19 AM
so Lambda is needed so that some flow runs are triggered in an event-driven way when new data arrives, but this Lambda needs access to EFS which hosts NumPy arrays serialized with Zarr - did I get it? So it looks like your Lambda is doing all the work and Prefect is used only to observe Lambda
Thanks so much for sharing! Appreciate it

Viet Nguyen

10/28/2022, 2:26 AM
no worries. Besides observing, I have some state management logic within the flows and tasks, and I also use Prefect notifications on tagged runs. To me Lambda is just like an agent: we process only 1 file per invocation, so Lambda has enough compute resources and the 15-min timeout is not a problem.
"EFS which hosts NumPy arrays serialized with Zarr"
not quite: EFS only hosts the sync files that multiple Lambda invocations access, which makes sure they write to S3 appropriately; the arrays are stored on S3. This pipeline only manages data updates (new files added occasionally, or an ingested file receiving an update). The actual Zarr store holding aggregated data from hundreds of thousands of NetCDF files is built separately, run by a scheduler.
The pipelines that pre-generate the big Zarr store run on an ECS Fargate cluster and use Dask 🙂 The Lambda pipelines run without Dask: it's just 1 file, not a big deal.
sync files
Process Synchronizer specifically
realised that I didn't tell you how I fixed the problem when adding a VPC to Lambda:
When you connect a function to a VPC in your account, it does not have access to the internet unless your VPC provides access. To give your function access to the internet, route outbound traffic to a NAT gateway in a public subnet.
I needed to update the route table to give Lambda internet access! Ports weren't the problem @Anna Geller
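A quick way to tell a routing problem from a port problem is to test outbound connectivity from inside the function. The sketch below is one possible check, assuming the only traffic needed is outbound HTTPS (port 443) to the `api.prefect.cloud` host from the error above. `can_reach` and `handler` are hypothetical names, not part of Prefect or the Lambda runtime API.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def handler(event, context):
    # Hypothetical Lambda entry point: fail fast with a clear message
    # if the VPC gives the function no route to the internet.
    if not can_reach("api.prefect.cloud"):
        raise RuntimeError(
            "No outbound route to api.prefect.cloud:443; "
            "check the NAT gateway and the subnet route table"
        )
    return {"prefect_api_reachable": True}
```

If this check times out in a VPC-attached function but passes in the same function without a VPC, the missing piece is likely the route to a NAT gateway rather than a security group port.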