Scrape ITU OB issues in MS Word format from itu.int #16
@strogonoff we should probably scrape all the languages available...?
@andrew2net do you have time for this?
@ronaldtse I’d rather have English versions sooner than all languages later (and I don’t want us to accidentally DoS ITU’s site), so I vote to handle it incrementally
it’s obvious that we’ll need other languages eventually, so if we have them I won’t mind. They’ll be useless for now though, merging translations will be a challenge for later…
Maybe Relaton-ITU should also provide the links for Word/PDF docs in English/other languages? Thoughts @andrew2net ? Then this can be a simple wrapper script.
Updated issue description to add a note about annex handling, and updated path specification for downloaded contents. |
Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year. From what I saw of the Relaton-ITU implementation, this is outside what it was intended to do (naturally), but perhaps there is a use for it somewhere…
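For illustration, the issue-page URL shape can be derived from the one example above. Note this pattern is an assumption generalized from a single URL; other issues may deviate:

```python
# Sketch: build an OB issue page URL from its integer ID and publication year.
# The query format is inferred from the single example URL in the thread.
BASE = "https://www.itu.int/en/publications/ITU-T/pages/publications.aspx"

def issue_page_url(ob_id: int, year: int) -> str:
    """Return the (assumed) publications page URL for one OB issue."""
    return f"{BASE}?parent=T-SP-OB.{ob_id}-{year}"
```

This also shows why the year is a problem: `issue_page_url(1166, ???)` cannot be formed without knowing the publication year of issue 1166.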
That's fine. Moreover, some OB issues have translations and some do not, so it is necessary to visit every page anyway.
It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.
@ronaldtse I have a bunch of uncompleted tasks. But of course, you can change the prioritization.
Didn’t know, that’s unfortunate.
If your point is that Relaton-ITU users may want to reference OB issues, then I can see how this may be in scope. If we could give Relaton-ITU the integer ID of an OB issue and get document links in return, this would be easy: we could simply iterate over integers from 1 and gather links until it returns an error. Even if Relaton requires a year in addition to the issue ID (OB issue URLs contain the year), that can be worked around. That said, if Relaton-ITU doesn’t have ITU OB support yet, I believe it may be much faster to write a quick bespoke scraper just for this purpose.
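The iterate-from-1-until-error idea could look roughly like this. `fetch_links` here is a hypothetical stand-in for whatever Relaton-ITU (or a bespoke scraper) would expose, so treat the interface as an assumption:

```python
def collect_all_issue_links(fetch_links):
    """Probe sequential OB issue IDs until the first lookup fails.

    fetch_links(ob_id) is a hypothetical callable returning a list of
    document links, or raising LookupError when the issue doesn't exist.
    """
    links = {}
    ob_id = 1
    while True:
        try:
            links[ob_id] = fetch_links(ob_id)
        except LookupError:
            # We ran past the newest issue; return what we gathered.
            return links
        ob_id += 1
```

A real version would want to distinguish "issue does not exist" from transient server errors before stopping.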
This should be on hold until we sort out #20. |
Rough logic for scraping OB issues from itu.int: approach one

This does not handle issues older than 567, since they are not available through the same archive index on itu.int.

Assumptions
Preliminary tasks
Logic

This pattern implies non-parallel, sequential execution. It can be parallelized (but be careful to throttle downloads to avoid taking the ITU site down by accident).
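If parallelized, the throttling could be as simple as a shared rate limiter that spaces requests by a fixed interval. This is an illustrative sketch, not part of the original plan:

```python
import threading
import time

class Throttle:
    """Block callers so acquisitions are at least `interval` seconds apart.

    Safe to share between download threads: the lock serializes the
    bookkeeping, and each caller sleeps outside the lock.
    """
    def __init__(self, interval: float):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_ok = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            delay = self._next_ok - now
            self._next_ok = max(now, self._next_ok) + self.interval
        if delay > 0:
            time.sleep(delay)
```

Every worker calls `throttle.wait()` before each request, so total request rate stays bounded no matter how many threads run.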
Pseudocode

Clumsy (it tries URL variations with conditionals and keeps state in global variables), but the idea should be clear.

cur_ob = 567
cur_yr = 1994
cur_fmt = 'MSW'
cur_ext = 'doc'
use_oas = false
def download_issue():
succeeded = download_all_languages()
# If MSW, try ZIP
if not succeeded and cur_fmt == 'MSW':
cur_fmt = 'ZIP'
cur_ext = 'zip'
succeeded = download_all_languages()
# If MSW, try different MS Word format
if not succeeded and cur_fmt == 'MSW':
if cur_ext == 'docx':
cur_ext = 'doc'
else:
cur_ext = 'docx'
succeeded = download_all_languages()
# If ZIP, try MSW
if not succeeded and cur_fmt == 'ZIP':
cur_fmt = 'MSW'
cur_ext = 'doc'
succeeded = download_all_languages()
# Try toggling OAS
if not succeeded:
use_oas = not use_oas
succeeded = download_all_languages()
# Perhaps we ran out of issues for the year; increment the year
if not succeeded:
cur_yr += 1
succeeded = download_all_languages()
# Ran out of ideas
if not succeeded:
return report_error("Failed to download issue")
# One of the tries succeeded, increment issue ID and continue from the top
cur_ob = cur_ob + 1
return download_issue()
def download_all_languages():
for language in ['E', 'F', …]:
url = format_url(cur_ob, cur_yr, cur_fmt, use_oas, language, cur_ext)
try:
downloaded_file = try_download_file(url)
except NotFound:
# English version not found means the URL is broken.
if language == 'E':
return False
# Otherwise we may be fine, not all the languages are always present so we’ll try the next one.
else:
continue
else:
if is_archive(downloaded_file):
expand_archive(downloaded_file)
move_files_in_place(downloaded_file)
return True
def try_download_file(url):
try:
return download_file(url)
except ServerDownOrThrottling:
sleep(10)
return try_download_file(url)
def format_url(ob, yr, fmt, use_oas, language, ext):
pass # Outputs a URL according to a format string
def download_file(url):
pass # Does the download
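The retry helper in the pseudocode can be made concrete with exponential backoff instead of a fixed 10-second sleep (and without unbounded recursion). The `download` callable is injected here so the sketch stays self-contained; the real version would do an HTTP GET:

```python
import time

class ServerDownOrThrottling(Exception):
    """Raised (hypothetically) when the server is down or rate-limiting us."""

def try_download_file(url, download, retries=5, base_delay=1.0):
    """Retry download(url) with exponential backoff on server trouble.

    Re-raises the last error after `retries` failed attempts, so a dead
    server doesn't spin forever the way the recursive pseudocode would.
    """
    for attempt in range(retries):
        try:
            return download(url)
        except ServerDownOrThrottling:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Bounding the retries also gives the caller a clear point to log the failure and move on to the next URL variant.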
Rough scraping logic: approach two
The goal is to write a utility that scrapes all ITU OB issues in .docx format (just English versions for now).
The issues are here: https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB&version_date=2019
The site was recently down, so we should throttle requests and be gentle on it.
The utility should store .docx files as <issue ID>/en.docx, where the issue ID is a simple integer like 1023.

If the MS Word download link on the ITU site leads to an archive, the issue has annexes, and the download will contain the issue itself and the annexes as separate .docx files. In such cases, the utility should expand the archive and place the annexes in the same directory as the issue, as <issue ID>/annex<N>-en.docx.

Utility output should not be versioned. If we need to share the downloaded .docx file archive, we can upload it somewhere else.
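The storage layout above can be captured in a small helper. The function name and signature are illustrative, not part of the spec:

```python
from pathlib import Path

def issue_paths(issue_id: int, annex_count: int = 0, lang: str = "en"):
    """Return target paths for one OB issue: the main document plus
    numbered annexes, following the <issue ID>/en.docx layout."""
    base = Path(str(issue_id))
    paths = [base / f"{lang}.docx"]
    paths += [base / f"annex{n}-{lang}.docx" for n in range(1, annex_count + 1)]
    return paths
```

Keeping the language code in the filename (rather than the directory) means other languages can later land next to en.docx without restructuring.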