I'm aiming to copy database backups or snapshots from AWS RDS (MariaDB) into S3 so the data can be ingested with Prefect flows, and I'm kind of overwhelmed by the number of options:
1. Export RDS snapshots to S3 using AWS Lambda (scheduled with CloudWatch Events)
2. AWS Glue to S3
3. AWS Database Migration Service (DMS) to S3
4. AWS Backup to S3
5. AWS Data Pipeline to S3
My leading candidate is Lambda, but I thought I'd post this and see if anyone had suggestions.
01/20/2022, 11:33 AM
Interesting question! First of all, database snapshots are usually used for disaster recovery or data migration. If you just want to incrementally replicate data from this database into a centralized repository such as a data warehouse or data lake for analysis, perhaps you could leverage Airbyte or Fivetran? We have integrations for both in our task library.
Having said that, you can export a DB snapshot from RDS to S3 using nothing but RDS 😄 And the nice thing about it is that the export is written in Parquet format, which (with some extra work using a Glue crawler) allows you to query this data directly using Athena. This documentation shows how you can set up RDS snapshots to be exported to S3 directly.
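For anyone who wants to script the direct RDS-to-S3 export rather than click through the console, here is a rough sketch using boto3's `start_export_task`. The helper names (`build_export_params`, `start_export`), the `rds-exports` prefix, and the timestamp-based task identifier are my own assumptions for illustration, not something from the AWS docs. You would need to supply your own snapshot ARN, bucket, IAM role, and KMS key.

```python
from datetime import datetime, timezone


def build_export_params(snapshot_arn, bucket, iam_role_arn, kms_key_id,
                        prefix="rds-exports", now=None):
    """Build keyword arguments for the RDS start_export_task call.

    Hypothetical helper: the naming scheme and default prefix are
    illustrative assumptions, not AWS requirements.
    """
    now = now or datetime.now(timezone.utc)
    # A timestamp suffix is one simple way to keep task identifiers distinct.
    task_id = f"snapshot-export-{now:%Y%m%d-%H%M%S}"
    return {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,     # ARN of the snapshot to export
        "S3BucketName": bucket,        # destination bucket
        "S3Prefix": prefix,            # key prefix inside the bucket
        "IamRoleArn": iam_role_arn,    # role RDS assumes to write to S3
        "KmsKeyId": kms_key_id,        # key used to encrypt the export
    }


def start_export(params):
    """Start the export; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the helper above has no AWS dependency

    rds = boto3.client("rds")
    return rds.start_export_task(**params)
```

Splitting parameter building from the actual API call keeps the first part easy to check without AWS access. If you run this from Prefect, `start_export` is the piece you would decorate with `@task` so it can be reused across flows.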
AWS Data Pipeline is a purely GUI-based service which is super limiting imo.
When it comes to AWS Lambda execution vs. processing this data using Prefect, Prefect makes it much easier to deploy your Python code and provides significantly better observability into the execution state.
01/21/2022, 5:11 PM
Just wanted to say thanks, Anna. For whatever reason it didn't occur to me to use Prefect for this task (!!). I created a flow for exporting the RDS snapshot to S3 using a Prefect task, and it's great because I can now integrate that into other flows. Again, thanks!