tash lai
04/07/2021, 4:33 AM
Tyler Wanner
04/07/2021, 4:49 AM
tash lai
04/07/2021, 4:52 AM
from prefect import Flow, flatten

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    # flatten turns the per-show lists of episode URLs into one flat list
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
Now we have all the data we need, but how do I save it to my database? The approach below won't work: after flattening, zip-like mapping no longer lines up, because there are fewer url and show_name values than episode_name and episode_desc values.
    save_to_database.map(url, show_name, episode_name, episode_desc)
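To illustrate the mismatch outside of Prefect, here is a minimal plain-Python sketch. The sample data is made up, and the built-in zip stands in for the zip-like pairing of mapped inputs; it is not meant to reproduce exactly what Prefect does with inputs of different lengths.

```python
# Hypothetical sample data standing in for the mapped task results.
show_names = ["Show A", "Show B"]
episode_urls = [["a/1", "a/2", "a/3"], ["b/1"]]  # one list per show

# Equivalent of flattening: one entry per episode.
flat_episode_urls = [ep for eps in episode_urls for ep in eps]

# zip stops at the shorter input, so pairing show-level values with
# episode-level values drops episodes and pairs the rest incorrectly.
mismatched = list(zip(show_names, flat_episode_urls))

# Repeating each show name once per episode restores the alignment.
episode_show_names = [
    name for name, eps in zip(show_names, episode_urls) for _ in eps
]
aligned = list(zip(episode_show_names, flat_episode_urls))
```

Here mismatched pairs "Show B" with an episode of "Show A", while aligned keeps every episode next to its own show.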
Another approach is to annotate episode_urls with show_name and url before flattening:
from prefect import Flow, task, flatten, unmapped

@task
def annotate_episode_url(url, show_name, episode_urls):
    # One (show url, show name, episode url) tuple per episode of the show
    return [(url, show_name, episode_url) for episode_url in episode_urls]

@task
def get_item(annotated, index):
    # Helper added here: flattened[0] would select the first tuple, not a column
    return annotated[index]

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated_episode_urls = annotate_episode_url.map(url, show_name, episode_urls)
    flattened = flatten(annotated_episode_urls)
    episode_show_url = get_item.map(flattened, unmapped(0))
    episode_show_name = get_item.map(flattened, unmapped(1))
    episode_url = get_item.map(flattened, unmapped(2))
    episode_details = download_episode_page.map(url=episode_url)
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
But this is overly verbose and inefficient. Is there some other way to do this?
emre
04/07/2021, 9:49 AM
Making save_to_database a mapped task is inefficient: you are forcing a round trip to the database for every episode. Instead, consider collecting the output of your mapped tasks and having save_to_database do a single bulk insert.
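As a rough sketch of what a bulk insert could look like, assuming SQLite through the standard sqlite3 module (the thread does not say which database is used). The episodes table, its columns, and the sample rows are hypothetical. In the flow, this would be the body of a regular, non-mapped save_to_database task that receives the full lists.

```python
import sqlite3

def save_to_database(rows, conn):
    """Insert all (show_url, show_name, episode_name, episode_desc) rows in one call."""
    # Hypothetical schema, only for illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "show_url TEXT, show_name TEXT, episode_name TEXT, episode_desc TEXT)"
    )
    # executemany runs the same INSERT for every row, inside one transaction.
    conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# Made-up values standing in for the collected outputs of the mapped tasks.
rows = [
    ("shows/a", "Show A", "Pilot", "First episode"),
    ("shows/a", "Show A", "Second", "Second episode"),
    ("shows/b", "Show B", "Pilot", "Another first episode"),
]
conn = sqlite3.connect(":memory:")
save_to_database(rows, conn)
row_count = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
```

The point is only that the database sees one batched call instead of one call per episode.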
Anyway, as you said, annotating the episode URLs with the show URL and name is the simplest way to maintain the relation between show and episode. If you can modify your save_to_database task, I would just make it accept the annotated tuple instead of episode_show_url and episode_show_name, and unpack the tuple inside the task. That way, you can avoid the verbose per-field extraction steps. Something like:
from prefect import Flow, task, flatten

@task
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    # flatten(episode_urls) and flatten(annotated) have the same order, so this works
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    # save_to_database takes the (url, show_name, episode_url) tuple as its first argument
    save_to_database.map(flatten(annotated), episode_name, episode_desc)
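For the "unpack the tuple inside the task" part, a minimal plain-Python sketch follows. build_row is a hypothetical stand-in for the body of the modified save_to_database task, and the sample values are made up; the database write itself is left out.

```python
def build_row(annotated, episode_name, episode_desc):
    # Unpack the annotated tuple inside the function instead of in the flow.
    show_url, show_name, episode_url = annotated
    return (show_url, show_name, episode_url, episode_name, episode_desc)

# Made-up values: one annotated tuple, name, and description per episode.
annotated = [("shows/a", "Show A", "a/1"), ("shows/a", "Show A", "a/2")]
episode_names = ["Pilot", "Second"]
episode_descs = ["First episode", "Second episode"]

# All three lists have one entry per episode, so zip-like pairing lines up.
rows = [
    build_row(a, n, d)
    for a, n, d in zip(annotated, episode_names, episode_descs)
]
```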
If you don't want to modify save_to_database, consider placing a collection task just before the database task to unzip the annotated tuples. Use nout to reduce clutter.
from prefect import Flow, task, flatten

@task
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

@task(nout=2)
def collect(annotated):
    # annotated holds one list of tuples per show, so flatten it first
    flat = [t for per_show in annotated for t in per_show]
    return (
        [u for u, s, ep in flat],
        [s for u, s, ep in flat],
    )

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    # flatten(episode_urls) and the flattened annotated tuples have the same order, so this works
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    episode_show_url, episode_show_name = collect(annotated)
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
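The ordering claim in the comment can be checked without Prefect. This is a plain-Python sketch with made-up data, where ordinary functions and list comprehensions stand in for the tasks and for .map and flatten.

```python
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, ep) for ep in episode_url_list]

def collect(annotated):
    # Flatten the per-show lists, then split the tuples into columns.
    flat = [t for per_show in annotated for t in per_show]
    return [u for u, s, ep in flat], [s for u, s, ep in flat]

# Hypothetical sample data: two shows with two and one episodes.
urls = ["shows/a", "shows/b"]
show_names = ["Show A", "Show B"]
episode_urls = [["a/1", "a/2"], ["b/1"]]

annotated = [
    annotate_episode_url(u, s, eps)
    for u, s, eps in zip(urls, show_names, episode_urls)
]
flat_episode_urls = [ep for eps in episode_urls for ep in eps]
episode_show_url, episode_show_name = collect(annotated)

# The episode URL inside each flattened tuple matches the flattened episode_urls.
same_order = [t[2] for per_show in annotated for t in per_show] == flat_episode_urls
```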
Apparently nout does not work if the task is mapped, so the collection step is necessary in that regard.
tash lai
04/07/2021, 10:36 AM
Kevin Kho
url and show_name versus episode_name. I don’t have any suggestions beyond what you have, but I’ll come back here if I come across something.