Alan
08/19/2024, 4:42 AM

from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner
from prefect_ray.context import remote_options

import ray
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


@task
def train_model():
    # Define configurations.
    train_loop_config = {"num_epochs": 20, "lr": 0.01, "batch_size": 32}
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)
    run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1))

    # Define datasets.
    train_dataset = ray.data.from_items(
        [{"input": [x], "label": [2 * x + 1]} for x in range(2000)]
    )
    datasets = {"train": train_dataset}

    # Initialize the Trainer.
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config=train_loop_config,
        scaling_config=scaling_config,
        run_config=run_config,
        datasets=datasets,
    )

    # Train the model.
    result = trainer.fit()

    # Inspect the results.
    final_loss = result.metrics["loss"]
    return final_loss

@flow(
    task_runner=RayTaskRunner(
        address="ray://raycluster-kuberay-head-svc.kuberay.svc.cluster.local:10001",
        init_kwargs={"runtime_env": {"pip": ["prefect-ray", "torch", "torchvision", "boto3", "botocore"]}},
    )
)
def training_pipeline():
    # Equivalent to setting @ray.remote(num_cpus=4, num_gpus=1) on the task.
    with remote_options(num_cpus=4, num_gpus=1):
        train_model.submit()
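
I didn't paste my train_loop_per_worker above; it's basically the standard Ray Train loop, roughly like this (simplified from memory, so the details may differ a bit):

import torch
from ray import train
from ray.train.torch import prepare_model


# Rough sketch of the per-worker training loop (not the exact code I'm running).
def train_loop_per_worker(config):
    # Tiny linear model to fit the y = 2x + 1 data defined in train_model().
    model = prepare_model(torch.nn.Linear(1, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.MSELoss()

    dataset_shard = train.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        for batch in dataset_shard.iter_torch_batches(
            batch_size=config["batch_size"], dtypes=torch.float32
        ):
            optimizer.zero_grad()
            loss = loss_fn(model(batch["input"]), batch["label"])
            loss.backward()
            optimizer.step()
        # Report the last loss of each epoch so result.metrics["loss"] is populated.
        train.report({"loss": loss.item()})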
The flow run shows up in the dashboard, but the issue is that the job never actually makes progress. It just sits in "Running" forever until I kill it from the Prefect dashboard.
The worker output only has two lines. I'm running the latest versions of prefect, prefect-ray, and ray! My question is: should I not be using the latest versions of ray/prefect-ray? My guess is that Python 3.11 just isn't tested with them.
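
To narrow it down, I'm thinking of pushing a trivial task through the same RayTaskRunner setup first; something like this (hypothetical smoke test, say_hi isn't part of my real pipeline):

from prefect import flow, task
from prefect_ray.task_runners import RayTaskRunner


@task
def say_hi():
    # Trivial task just to confirm work actually reaches the Ray cluster.
    return "hi from ray"


@flow(
    task_runner=RayTaskRunner(
        address="ray://raycluster-kuberay-head-svc.kuberay.svc.cluster.local:10001"
    )
)
def smoke_test():
    return say_hi.submit()

If that also sits in Running forever, the problem is the Prefect/Ray connection (or the runtime_env pip install) rather than the TorchTrainer code.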