Woraphon Tontaweewong
06/29/2025, 12:46 PM
Marvin
06/29/2025, 12:47 PM
Marvin
06/29/2025, 12:48 PM
ini
[Unit]
Description=Prefect Worker Service
After=network.target
[Service]
Type=simple
User=your_user
Environment="HTTP_PROXY=http://your.proxy:port"
Environment="HTTPS_PROXY=http://your.proxy:port"
Environment="NO_PROXY=localhost,127.0.0.1"
# Add any other environment variables you need
ExecStart=/path/to/prefect worker start -p "your-pool-name"
Restart=always
[Install]
WantedBy=multi-user.target
2. Prefect UI Configuration
Yes, you can set environment variables through the UI:
1. Go to your work pools in the Prefect UI
2. Click on your process work pool
3. Click the "Edit" button
4. Look for the "Base Job Template" section
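5. In the template's `env` variable, add the environment variables to apply to every flow run started from this pool
Flow runs started by a process worker inherit the worker's own environment (including the variables from your systemd service file), and the work pool's variables are added on top. Which value wins on a conflict can depend on your Prefect version, so confirm it with a small test flow rather than assuming one source takes precedence.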
A few important notes:
- Environment variables set through the UI will apply to all workers using that work pool
- Sensitive information (like proxy credentials) might be better set through the systemd service file for security reasons
- Make sure to restart your systemd service after making changes to the service file:
bash
sudo systemctl daemon-reload
sudo systemctl restart prefect-worker.service
For proxy settings specifically, remember to include these common variables if needed:
- HTTP_PROXY
- HTTPS_PROXY
- NO_PROXY
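To see which of these variables a process actually has, a short report helps. The following is a minimal sketch, assuming hypothetical helper names (`redact`, `proxy_report`) that are not part of Prefect; it checks both upper- and lower-case spellings and hides any credentials embedded in a proxy URL before printing.

```python
import os
from urllib.parse import urlsplit

# The variable names listed above; lower-case spellings are checked as well.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def redact(value: str) -> str:
    """Hide any user:password part of a proxy URL before printing it."""
    parts = urlsplit(value)
    if "@" not in parts.netloc:
        return value
    host = parts.netloc.rsplit("@", 1)[1]
    return f"{parts.scheme}://***@{host}{parts.path}"


def proxy_report(environ=None) -> dict:
    """Return the proxy variables that are set, with credentials redacted."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in PROXY_VARS:
        for key in (name, name.lower()):
            if key in environ:
                report[key] = redact(environ[key])
    return report


if __name__ == "__main__":
    for key, value in proxy_report().items():
        print(f"{key}={value}")
```

Running this once in your shell and once inside a flow shows whether the worker's environment differs from the one you tested in.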
You can confirm the worker is running, and list the environment variables systemd passes to it, with:
bash
sudo systemctl status prefect-worker.service
sudo systemctl show prefect-worker.service --property=Environment
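To inspect the variables systemd passes to the unit, `sudo systemctl show prefect-worker.service --property=Environment` prints them on a single line. Below is a minimal sketch for turning that line into a dictionary. The function name is hypothetical, and the assumption that the output is a space-separated list with shell-style quoting should be checked against your systemd version.

```python
import shlex


def parse_systemd_environment(line: str) -> dict:
    """Parse one 'Environment=...' line as printed by `systemctl show`."""
    prefix = "Environment="
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'Environment='")
    env = {}
    # Assumption: assignments are separated by spaces and quoted shell-style.
    for assignment in shlex.split(line[len(prefix):]):
        key, sep, value = assignment.partition("=")
        if sep:
            env[key] = value
    return env


if __name__ == "__main__":
    sample = "Environment=HTTP_PROXY=http://your.proxy:port NO_PROXY=localhost,127.0.0.1"
    print(parse_systemd_environment(sample))
```

Comparing this dictionary with the values in the unit file is a quick way to catch a forgotten `daemon-reload`.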
For more details on managing work pools and their configuration, you can refer to the Prefect documentation on managing work pools.
Woraphon Tontaweewong
06/29/2025, 12:52 PM
Marvin
06/29/2025, 12:52 PM
Marvin
06/29/2025, 12:53 PM
bash
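HTTP_PROXY=http://your.proxy:port
HTTPS_PROXY=http://your.proxy:port
# Important for gRPC connections; gRPC's C core documents the lowercase names
# (grpc_proxy, https_proxy, http_proxy), so set those too if needed
GRPC_PROXY=http://your.proxy:port
NO_PROXY=localhost,127.0.0.1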
2. For Spark Connect specifically, you need to set additional environment variables:
python
import os
os.environ['SPARK_REMOTE'] = "sc://your-databricks-instance.cloud.databricks.com:443/default"
os.environ['DATABRICKS_TOKEN'] = 'your-token'
os.environ['DATABRICKS_CLUSTER_ID'] = 'your-cluster-id'
3. When creating your Spark session, make sure to include the proper configurations:
python
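import os
from pyspark.sql import SparkSession
# Spark Connect URIs usually take the form sc://host:port/;key=value;
# confirm the exact string for your workspace in the Databricks Connect docs.
# The extraJavaOptions settings target JVM processes and may have no effect
# on a Python Spark Connect client.
spark = SparkSession.builder \
    .remote("sc://your-databricks-instance.cloud.databricks.com:443/default") \
    .config("spark.databricks.service.token", os.environ.get('DATABRICKS_TOKEN')) \
    .config("spark.driver.extraJavaOptions", "-Dhttps.proxyHost=your-proxy-host -Dhttps.proxyPort=your-proxy-port") \
    .config("spark.executor.extraJavaOptions", "-Dhttps.proxyHost=your-proxy-host -Dhttps.proxyPort=your-proxy-port") \
    .getOrCreate()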
4. Verify your connection settings:
- Ensure your Databricks cluster is actually running and accepting connections
- Check that your proxy allows gRPC traffic (HTTP/2 over TLS on port 443)
- Verify that your Databricks token has the necessary permissions
- Make sure your cluster's network settings allow incoming connections
5. If you're running this in a Prefect flow, you can set these environment variables at the start of the flow (or in your work pool configuration, which keeps them out of the code):
python
import os
from prefect import flow
from pyspark.sql import SparkSession

@flow
def your_flow():
    # Set environment variables at the start of your flow if needed
    os.environ.update({
        'SPARK_REMOTE': "sc://your-databricks-instance.cloud.databricks.com:443/default",
        'GRPC_PROXY': "http://your.proxy:port",
        # other necessary variables
    })
    # Your spark session creation and operations
    spark = SparkSession.builder.remote(os.environ['SPARK_REMOTE']).getOrCreate()
6. Common troubleshooting steps:
- Try increasing the connection timeout (confirm that your PySpark version recognizes this configuration key):
python
spark = SparkSession.builder \
    .config("spark.connect.timeout", "120s") \
    .remote(...) \
    .getOrCreate()
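- Verify proxy connectivity (requests picks up HTTP_PROXY and HTTPS_PROXY from the environment; this checks HTTPS reachability, not gRPC itself):
python
import requests
response = requests.get('https://your-databricks-instance.cloud.databricks.com', timeout=10)
print(response.status_code)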
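Before sending real traffic, it can help to check which proxy, if any, the current environment selects for a given URL. The sketch below uses the standard library's `urllib.request` helpers, and `proxy_for` is a hypothetical name. It reflects how Python's `urllib` reads `*_PROXY` and `NO_PROXY`. Other clients, such as requests or gRPC, apply their own rules, which may differ in detail.

```python
import os
from urllib.parse import urlsplit
from urllib.request import getproxies_environment, proxy_bypass_environment


def proxy_for(url: str):
    """Return the proxy URL the environment selects for `url`, or None if direct."""
    parts = urlsplit(url)
    proxies = getproxies_environment()  # e.g. {'https': ..., 'no': ...}
    if proxy_bypass_environment(parts.hostname or "", proxies):
        return None  # host matches NO_PROXY
    return proxies.get(parts.scheme)


if __name__ == "__main__":
    targets = [
        "https://your-databricks-instance.cloud.databricks.com",
        os.environ.get("PREFECT_API_URL", "http://127.0.0.1:4200/api"),
    ]
    for target in targets:
        print(target, "->", proxy_for(target) or "direct")
```

With `NO_PROXY=localhost,127.0.0.1`, the local Prefect API should resolve to a direct connection while the Databricks host goes through the proxy.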
If these steps don't resolve the issue, you might need to:
1. Check your Databricks cluster logs for any connection rejection messages
2. Verify that your proxy administrator hasn't blocked gRPC traffic
3. Ensure all necessary Databricks endpoints are whitelisted in your proxy configuration
For more detailed information about configuring Prefect with Databricks, you can refer to the Prefect-Databricks integration documentation.
Woraphon Tontaweewong
06/29/2025, 1:07 PM
Marvin
06/29/2025, 1:07 PM
Marvin
06/29/2025, 1:07 PM
ini
[Unit]
Description=Prefect Worker Service
After=network.target
[Service]
Type=simple
User=daservice
# Add your proxy settings
Environment="HTTP_PROXY=http://your.proxy:port"
Environment="HTTPS_PROXY=http://your.proxy:port"
Environment="GRPC_PROXY=http://your.proxy:port"
Environment="NO_PROXY=localhost,127.0.0.1"
Environment="PREFECT_API_URL=http://127.0.0.1:4200/api"
# If you have any Databricks-specific environment variables, add them here
Environment="DATABRICKS_HOST=your-databricks-host"
Environment="DATABRICKS_TOKEN=your-token"
Environment="DATABRICKS_CLUSTER_ID=your-cluster-id"
# Changed to your project directory (systemd does not allow a comment after a value on the same line)
WorkingDirectory=/home/daservice/projects/prefect-da
ExecStart=/home/daservice/projects/prefect-da/.venv/bin/prefect worker start -p 'process-work-pool'
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Key changes and notes:
1. Added the proxy environment variables that were previously available only in your shell (a systemd service does not inherit your login shell's environment)
2. Changed the WorkingDirectory to your actual project directory (this is important for relative paths)
3. Added Databricks-specific environment variables if you're using them
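When values are copied into a unit file from chat or documentation, link brackets such as `<http://...>` are easy to carry along. The following is a minimal sketch, assuming a hypothetical helper `render_environment_lines` (not part of Prefect or systemd), that formats a dictionary as `Environment=` lines. It rejects bracket-wrapped values and newlines, and escapes backslashes, double quotes, and percent signs. The escaping rules reflect my reading of the systemd documentation and should be verified for your version.

```python
def render_environment_lines(env: dict) -> list:
    """Format key/value pairs as systemd Environment= lines."""
    lines = []
    for key, value in env.items():
        if "\n" in value:
            raise ValueError(f"{key}: newlines are not allowed in a unit file value")
        if value.startswith("<") and value.endswith(">"):
            raise ValueError(f"{key}: remove the surrounding angle brackets")
        # Assumption: backslash and double quote need escaping inside a quoted
        # value, and '%' must be doubled because unit files expand specifiers.
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("%", "%%")
        )
        lines.append(f'Environment="{key}={escaped}"')
    return lines


if __name__ == "__main__":
    settings = {
        "HTTP_PROXY": "http://your.proxy:port",
        "NO_PROXY": "localhost,127.0.0.1",
    }
    print("\n".join(render_environment_lines(settings)))
```

The printed lines can be pasted into the `[Service]` section, followed by the usual `daemon-reload` and restart.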
After making these changes:
1. Reload the systemd daemon and restart the service:
bash
sudo systemctl daemon-reload
sudo systemctl restart prefect-worker.service
2. Check the status and logs:
bash
sudo systemctl status prefect-worker.service
sudo journalctl -u prefect-worker.service -f
Additional troubleshooting tips:
1. Make sure the daservice user has all necessary permissions
2. You can verify the environment variables are being set correctly by modifying your flow to print them:
python
import os
from prefect import flow

@flow
def debug_flow():
    # Print the proxy variables as seen from inside the flow run
    print("Proxy settings:", {
        'HTTP_PROXY': os.getenv('HTTP_PROXY'),
        'HTTPS_PROXY': os.getenv('HTTPS_PROXY'),
        'GRPC_PROXY': os.getenv('GRPC_PROXY')
    })
3. If you're using any virtual environment dependencies, make sure they're all accessible to the systemd service
Remember to replace the proxy URLs, Databricks host, and other placeholder values with your actual configuration values.