# ask-community
k
Hi all, I'm an epidemiologist by training, and my former colleagues at the University of Washington (healthdata.org), along with collaborators around the world, are currently doing a lot of work on Covid-19 data collation and analysis, which is being documented here:

- Github repo: https://github.com/beoutbreakprepared/nCoV2019
- Initial visualization: healthmap.org/covid-19/
- Nature article: https://www.nature.com/articles/s41597-020-0448-0.pdf
- Lancet article: https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30119-5/fulltext

They're having a hard time keeping up with all of the new sources, which they're largely checking manually and copying data from into a spreadsheet a few times a day. They're looking for help automating that process so that they can focus more on building models and delivering results to stakeholders in governments, health systems, and non-profits around the world.

They are currently compiling a list of all of the websites they regularly gather data from, and then I'd like to help them create Prefect Flows that run BeautifulSoup tasks a few times a day to check the sites for updates, parse the results, and add them to their data sources. @Chris White and @David Abraham have generously offered up a free Prefect Cloud account for us to use, so it should be easy for us to get started.

I expect there to be about 100 different sites to write parsers for, so I'd love to get some help crowdsourcing that effort. If you'd like to be involved, please respond here or email me at kyleforeman@gmail.com. I'll aim to have a kickoff meeting this Saturday so that we can figure out how best to tackle the problem.
🚀 2
👨‍⚕️ 6
👍 6
upvote 9
😍 5
s
Count me in
❤️ 1
d
Never written a parser but I’m in
❤️ 1
k
great! for anyone who is new to parsing, this is a good tutorial to get you started: https://realpython.com/beautiful-soup-web-scraper-python/
💯 1
👍 1
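For a flavor of what the tutorial above covers, here is a minimal BeautifulSoup example. The HTML snippet is made up for illustration; real health-department pages will each have their own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet of a source page (real sites vary widely).
html = """
<table id="cases">
  <tr><th>Date</th><th>Region</th><th>Confirmed</th></tr>
  <tr><td>2020-03-10</td><td>King County</td><td>190</td></tr>
  <tr><td>2020-03-10</td><td>Snohomish</td><td>54</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table#cases tr")[1:]:  # skip the header row
    date, region, confirmed = (td.get_text(strip=True) for td in tr.find_all("td"))
    rows.append({"date": date, "region": region, "confirmed": int(confirmed)})

print(rows)
```

The output is a list of dicts, one per case-count row, which maps naturally onto the "one row per record" spreadsheet the team maintains.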
j
I spent almost all of December 2017 writing custom parsers to scrape data for a news event aggregation pipeline for the project I was on at the time. The biggest headache I ran into was when sites would change their layout, thus breaking the parser without us knowing that it happened. So I am definitely in because Prefect is built for this kind of stuff!
❤️ 2
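One way to catch the silent-breakage problem described above is to check, before scraping, that the selectors a parser depends on still match the page. A sketch, with hypothetical selectors:

```python
from bs4 import BeautifulSoup

# Hypothetical selectors this parser relies on.
EXPECTED_SELECTORS = ["table#cases", "table#cases tr th"]


def validate_layout(html, selectors=EXPECTED_SELECTORS):
    """Return the expected selectors that no longer match the page.

    An empty list means the layout still looks like what the parser expects;
    anything else should fail the task loudly instead of quietly scraping
    garbage after a site redesign.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if not soup.select(sel)]


good = "<table id='cases'><tr><th>Date</th></tr></table>"
redesigned = "<div class='cases-widget'>190 confirmed</div>"

print(validate_layout(good))        # nothing missing
print(validate_layout(redesigned))  # both selectors missing
```

Raising an error when the list is non-empty turns "the parser broke without us knowing" into a visible failed task run, which is exactly the kind of observability Prefect is meant to provide.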
b
Can you send me a link to where Prefect has a parser for this type of stuff?
a
👋 kyle, i can help with scraping too! about.trout@gmail.com
❤️ 1
d
@bardovv Prefect doesn’t have a parser built in. But, because it’s all Python, you can import any Python parser (like, in this case, Beautiful Soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and create flows to parse, transform, store, and analyze information. I hope this helps! Feel free to reach out if you have more questions.
👍 1
k
thanks to everyone who has expressed interest here or via email! since we have interested parties who aren't in this Slack, I'll start an email chain tomorrow to coordinate a first video call. If you haven't already, please send me your email (either here, via DM, or at kyleforeman@gmail.com) - looking forward to it!
💯 1
e
@Kyle Foreman (Convoy) How does your effort differ from https://covidtracking.com/? Looks like they are just doing US. But wondering if worth combining efforts
a
Hello @Kyle Foreman (Convoy)! Happy to help where I can. :)
❤️ 1
k
@Elliot that project focuses primarily on US testing and includes aggregates. This project gets microdata ("line lists", i.e. one row per case, so that you can do much more detailed epi analysis) for cases and deaths globally. Both important, but different use cases!
👍 1
just sent out an email - let me know if I missed anyone!
a
I would be happy to help in any way too!