tash lai
04/07/2021, 4:33 AM
Tyler Wanner
04/07/2021, 4:49 AM
tash lai
04/07/2021, 4:52 AM
from prefect import Flow, flatten

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
Now we have all the data we need, but how do I save it to my database? The approach below won't work: after flattening, the zip-like behaviour of map doesn't help, because there are fewer url and show_name values than episode_name and episode_desc values.
    save_to_database.map(url, show_name, episode_name, episode_desc)
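To make the length mismatch concrete, here is a plain-Python sketch with no Prefect involved. The show and episode values are made up for illustration, and the built-in zip only stands in for the zip-like pairing described above; it is an analogy, not Prefect's actual implementation.

```python
# Hypothetical data: 2 shows, 3 episodes in total after flattening.
show_urls = ["/show-a", "/show-b"]
show_names = ["Show A", "Show B"]
episode_names = ["A ep1", "A ep2", "B ep1"]

# Position-wise pairing matches elements by index, not by show.
rows = list(zip(show_urls, show_names, episode_names))

for row in rows:
    print(row)
```

The second row pairs "A ep2" with "Show B", and "B ep1" is not paired with anything, which is why the show information has to travel with each episode URL.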
Another approach is to annotate episode_urls with show_name and url before flattening:
from prefect import Flow, task, flatten

@task
def annotate_episode_url(url, show_name, episode_url):
    return (url, show_name, episode_url)

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated_episode_urls = annotate_episode_url.map(url, show_name, episode_urls)
    flattened = flatten(annotated_episode_urls)
    episode_show_url = flattened[0]
    episode_show_name = flattened[1]
    episode_url = flattened[2]
    episode_details = download_episode_page.map(url=episode_url)
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
But this is overly verbose and ineffective. Is there some other way to do this?
emre
04/07/2021, 9:49 AM
Making save_to_database a mapped task is inefficient. You are forcing yourself to make a round trip to the db for each episode you have. Instead, consider collecting the output of your mapped tasks and making save_to_database do a bulk insert.
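As a rough sketch of what a bulk insert could look like, here is a version using the standard-library sqlite3 module with an in-memory database. The table layout, the bulk_save name, and the sample rows are all assumptions for illustration; the thread does not say which database or schema is in use.

```python
import sqlite3

def bulk_save(conn, rows):
    """Insert all episode rows with a single executemany call."""
    # Hypothetical schema; adjust to the real table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes "
        "(show_url TEXT, show_name TEXT, episode_name TEXT, episode_desc TEXT)"
    )
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]

conn = sqlite3.connect(":memory:")
rows = [
    ("/show-a", "Show A", "A ep1", "first"),
    ("/show-a", "Show A", "A ep2", "second"),
    ("/show-b", "Show B", "B ep1", "pilot"),
]
inserted = bulk_save(conn, rows)
print(inserted)
```

In a flow, logic like this would presumably sit in a regular (non-mapped) task that receives the mapped results as lists, so the database is hit once rather than once per episode.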
Anyway, as you said, annotating the episode URLs with the show URL and name is the simplest way to maintain the relation between show and episode. If you can modify your save_to_database task, I would just make it accept the annotated tuple instead of episode_show_url and episode_show_name, and parse the tuple inside the task. That way, you can avoid the verbose flattened[x] steps. Something like:
from prefect import Flow, task, flatten

@task
def annotate_episode_url(url, show_name, episode_url_list):
    # One (url, show_name, episode_url) tuple per episode of this show
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    # flatten(episode_urls) and flatten(annotated) have the same order, so this works
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    save_to_database.map(flatten(annotated), episode_name, episode_desc)
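The ordering claim in the comment above can be checked outside Prefect. This is a minimal plain-Python sketch with made-up data, where itertools.chain.from_iterable stands in for a one-level flatten and the annotate helper mirrors the task body.

```python
from itertools import chain

def annotate(url, show_name, episode_url_list):
    # Same logic as the annotate_episode_url task body
    return [(url, show_name, ep) for ep in episode_url_list]

# Hypothetical scrape results: one list of episode URLs per show
urls = ["/show-a", "/show-b"]
names = ["Show A", "Show B"]
episode_urls = [["/a/1", "/a/2"], ["/b/1"]]

annotated = [annotate(u, n, eps) for u, n, eps in zip(urls, names, episode_urls)]

# Flatten both nested lists by one level
flat_episode_urls = list(chain.from_iterable(episode_urls))
flat_annotated = list(chain.from_iterable(annotated))

# Every annotated tuple lines up with the episode URL at the same index
aligned = all(ann[2] == ep for ann, ep in zip(flat_annotated, flat_episode_urls))
print(aligned)
```

Because both nested lists are built per show in the same order, flattening them one level keeps each tuple at the same index as its episode URL.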
If you don't want to modify save_to_database, consider placing a collection task just before the database task to unzip the annotated tuples. Use nout to reduce clutter.
from prefect import Flow, task, flatten

@task
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

@task(nout=2)
def collect(annotated_lists):
    # annotated_lists holds one list of tuples per show; unzip across all of them
    return (
        [u for show in annotated_lists for u, s, ep in show],
        [s for show in annotated_lists for u, s, ep in show],
    )

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    # flatten(episode_urls) and the collected lists have the same order, so this works
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    episode_show_url, episode_show_name = collect(annotated)
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
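The collection step is essentially an unzip. As a plain-Python sketch of the same idea on an already flattened list of tuples, zip(*...) can do the split in one pass; the unzip_shows name and the sample data are hypothetical.

```python
def unzip_shows(annotated):
    """Split (show_url, show_name, episode_url) tuples into two parallel lists."""
    if not annotated:
        # zip(*[]) yields nothing, so handle the empty case explicitly
        return [], []
    show_urls, show_names, _episode_urls = zip(*annotated)
    return list(show_urls), list(show_names)

annotated = [
    ("/show-a", "Show A", "/a/1"),
    ("/show-a", "Show A", "/a/2"),
    ("/show-b", "Show B", "/b/1"),
]
episode_show_url, episode_show_name = unzip_shows(annotated)
print(episode_show_url)
print(episode_show_name)
```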
Apparently nout does not work if the task is mapped, so the collection task is necessary in that regard.
tash lai
04/07/2021, 10:36 AM
Kevin Kho
url and show_name versus episode_name. I don’t have any suggestions beyond what you have, but I’ll come back here if I come across something