# ask-community
d
Hi all. I'm trying to deploy my ETL using the GitHub filesystem. I've pushed the Python script containing the flow to GitHub, created a GitHub block in Prefect, and I have this deployment code:
from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from etl_web_to_gcs import etl

github_block = GitHub.load("github-block")

# Deployment Object
github_dep = Deployment.build_from_flow(
    flow=etl,
    name="github-flow",
    infrastructure=github_block
)

if __name__ == "__main__":
    github_dep.apply()
When I run the code above, I get the following error:
...
cls_init(__pydantic_self__, **data)
 File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Deployment
infrastructure
 Infrastructure block must have 'run-infrastructure' capabilities. (type=value_error)
Any idea how to fix this?
r
I was replying when you removed the post:)
the github_block is not infra but storage
d
Sorry for that
r
you will also need some sort of infrastructure so prefect knows how you want the flow to run
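e.g. something along these lines — a minimal sketch assuming you want to run the flow locally with a Process infrastructure (reusing the block name and flow from your snippet):

from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from prefect.infrastructure import Process

from etl_web_to_gcs import etl

# GitHub is a *storage* block: it tells Prefect where to pull the flow code from
github_block = GitHub.load("github-block")

github_dep = Deployment.build_from_flow(
    flow=etl,
    name="github-flow",
    storage=github_block,       # where the code lives
    infrastructure=Process(),   # how the flow run is executed
)

if __name__ == "__main__":
    github_dep.apply()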
a
Hi @redsquare, sorry for the ping! I have a question that is somewhat related to this issue. We have our infrastructure set up in the cloud using ECS, and I usually develop and test things using a local `Process`. Yesterday we did the Prefect associate certificate training, and part of that was hosting code on GitHub (basically telling the flow where the code that needs to be executed lives). Our team only recently migrated from Prefect 1 to 2 and we're still trying to figure out how to work with Prefect 2 efficiently; a few flows that we have migrated from 1 to 2 are hosted on an S3 bucket, which is a headache for us because all of our code already lives on GitHub. Could you advise me on how I can configure Prefect 2, both in the cloud (using ECS) and for local testing/dev (using `Process`), to fetch the code from GitHub? The GitHub repos are of course private. Is it as simple as creating a GitHub block and passing credentials? Side question: during training the usage of prefect.yaml came up. How does that come into play in this context? Is it necessary? Do you write your own YAML file before deployment of a flow if required to connect to GitHub?
C.C. @Emil Christensen - we had a brief conversation about this yesterday.
e
Storage blocks (e.g. the `GitHub` block) and infrastructure blocks are intended for use with agents and are generally incompatible with workers and the `prefect.yaml` way of deploying flows. When using workers and `prefect.yaml`, the `prefect.deployments.steps.git_clone` pull step serves the same purpose as the `GitHub` storage block. Whether you have an agent or a worker, what happens is that at run time the flow will pull code from git; this happens regardless of the infrastructure. The caveat here is that I'm assuming you are running a deployment that gets executed by an agent or worker. If you're just running your flow directly (e.g. `python my_flow.py`), then no code gets downloaded.
Hopefully that clarifies to some extent. Basically you have two options:
1. If you're happy using the `prefect.yaml` file and the `prefect deploy` CLI command, then add a `prefect.deployments.steps.git_clone` step to your pull section (example; see the sketch below). Now every time a worker executes a flow run, it'll pull the code from GitHub.
2. Alternatively, if you want to define your deployments in Python, create a `GitHub` storage block and pass it to your deployment. Now, when an agent executes a flow run, it'll pull code from GitHub.
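For option 1, the pull section of `prefect.yaml` would look roughly like this (the repository URL, branch, and Secret block name here are placeholders, not something specific to your setup):

pull:
  - prefect.deployments.steps.git_clone:
      repository: https://github.com/your-org/your-repo.git
      branch: main
      access_token: "{{ prefect.blocks.secret.github-access-token }}"

Every worker-executed flow run will then clone that repo before running the flow.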
a
So imagine you have a script compatible with Prefect 2 on your github repo. If you wanted to deploy that piece of code and have it be runnable on the prefect cloud how would you approach this? How would you configure the authentication to github? I feel like I am mixing 2 different concepts/method up and confusing myself
e
^ does the message above help?
a
Haha yes sorry I was typing as you were already answering my question 😄
Thank you!
e
In either case, you would want to create a personal access token and assign that to your `GitHub` block or your `prefect.deployments.steps.git_clone` step.
(in both cases the argument name is `access_token`)
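e.g. for the block route, something like this — the repository URL and token are placeholders, and `reference` is the branch (or tag) to pull:

from prefect.filesystems import GitHub

github_block = GitHub(
    repository="https://github.com/your-org/your-repo.git",  # placeholder URL
    reference="main",                                         # branch or tag to pull
    access_token="<personal-access-token>",                   # PAT with read access to the repo
)
github_block.save("github-block", overwrite=True)

Then load it with GitHub.load("github-block") and pass it as storage= when building the deployment.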
a
I do like the usage of prefect.yaml; a couple of clarifying questions about it. Can you refresh my memory on how this file gets created (and later gets modified by the developer)? Or does the user create it based on their needs? Also, do you have a single prefect.yaml for the whole repo, or is it flow/script specific?
@Alyssa Harris
@Jordan Lessard
e
how this file gets created
You can initialize it with `prefect init`, optionally from a recipe. Also, if you run `prefect deploy` and deploy a flow, you can write out a new prefect.yaml file at the end of that.
Also do you have a single Prefect.yaml for the whole repo or is it flow/script specific?
Generally one per repo. Within it you can have multiple different deployments (usually one per flow).
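Roughly like this — the flow names and entrypoints below are made up for illustration:

deployments:
  - name: etl-web-to-gcs
    entrypoint: flows/etl_web_to_gcs.py:etl
    work_pool:
      name: test-pool
  - name: read-csv
    entrypoint: flows/read_csv.py:read_csv
    work_pool:
      name: test-pool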
a
Thank you so much Emil!
Hi @Emil Christensen; I hope you had a great weekend. Following your guidance I managed to deploy and test a very simple "hello world" flow using GitHub as its source code successfully. I ran into a couple of challenges when I wanted to deploy our real production flows and managed to solve all but one of them. Let me recap the steps that I have done. The code is pushed to a private repo and I'm working off of a feature branch. Using `prefect deploy` I provided the repo URL and the branch name, then I generated an access token and provided that as well. I selected `process` as our ECS is not fully fleshed out yet. Now, the flow uses 2 external packages, pandas and s3fs. They are imported in this order:
import pandas as pd
import s3fs
After deploying and starting the work pool with `prefect worker start --pool 'test-pool'`, I ran the flow using the UI. Now I faced an issue with the packages:
prefect.exceptions.ScriptError: Script at 'flows/read_csv.py' encountered an exception: ModuleNotFoundError("No module named 's3fs'")
Now I am confused because of a few questions:
1. Why did the flow execute locally and not on cloud?
2. How do I make sure that the Prefect process has access to the packages that the script requires?
3. And oddly enough, why did it fail on `s3fs` and not on `pandas`, even though the latter was imported first?
4. How can I remedy this?
Thank you so much for helping me out with this
C.C. @Nicole Garza
m
Hey Emil - I have been having a similar issue where my flow code does not seem to be cloned into the docker container. We are deploying our flows on ECS using a Python `Deployment.build_from_flow` deployment, so we are not using a prefect.yaml file to deploy. I have created a GitHub block with the credentials. Is there a port mapping or security group that may be required when the task is spinning up in ECS? Is there a way to validate, print, or pause the container so I can check whether the files are cloned into the task container? Any help would be greatly appreciated as I have spent a lot of time on this seemingly trivial implementation 🙂
I have also tried all sorts of paths and entrypoints to see if I could find the file, with no luck.
e
@Ali Mir assuming you've installed the packages, the most likely explanation is that two different virtualenvs are in use. What's the output of `which prefect`? Can you run the flow successfully with something like `python flow.py`? If so, what's the output of `which python`?
@Mitch when you're using `Deployment.build_from_flow`, are you passing the GitHub block you created? You should see a log in the agent that says something to the effect of "pulling files from repo ...". You could peek at the files by adding something like the following to your flow:
import os
print(f"Current dir is {os.path.abspath('.')}")
print(f"Files: {os.listdir('.')}")