Scrape ITU OB issues in MS Word format from itu.int #16

Open
strogonoff opened this issue Jul 4, 2019 · 13 comments

Comments

@strogonoff
Contributor

strogonoff commented Jul 4, 2019

The goal is to write a utility that scrapes all ITU OB issues in .docx format (just English versions for now).

The issues are here: https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB&version_date=2019

The site was recently down so we should probably throttle requests & be gentle on it.
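One way to stay gentle on the site is to enforce a minimum gap between requests. The following is a minimal sketch, not a prescribed design: the `Throttle` class, its `min_interval` parameter and the injectable `clock`/`sleep` arguments are all hypothetical names introduced for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait().

    Hypothetical helper: call wait() right before each request to itu.int.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value is up to the caller)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one `Throttle` and call `wait()` before every download, so the delay applies across all languages and issues rather than per file.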

The utility should store .docx files as <issue ID>/en.docx, where issue ID is a simple integer like 1023.

If MS Word download link on ITU site leads to an archive, it must mean the issue has annexes, and download will contain the issue itself and annexes as separate .docx files. In such cases, the utility should expand the archive and place annexes in the same directory as the issue as <issue ID>/annex<N>-en.docx.
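The storage layout above could be sketched roughly as follows. This is only an illustration under stated assumptions: `store_download` is a hypothetical name, annexes are numbered in sorted filename order (the issue does not say how N is chosen), annexes are recognized by the word "annex" in the file name, and a legacy `.doc` keeps its own extension instead of being renamed to `.docx`.

```python
import zipfile
from pathlib import Path


def store_download(downloaded, issue_id, out_root, lang="en"):
    """Place a downloaded file under out_root/<issue ID>/ and return the written paths."""
    downloaded = Path(downloaded)
    issue_dir = Path(out_root) / str(issue_id)
    issue_dir.mkdir(parents=True, exist_ok=True)

    # A .docx is itself a ZIP container, so decide by extension, not by content.
    if downloaded.suffix.lower() != ".zip":
        target = issue_dir / f"{lang}{downloaded.suffix.lower()}"
        target.write_bytes(downloaded.read_bytes())
        return [target]

    written = []
    with zipfile.ZipFile(downloaded) as archive:
        # Only Word documents are of interest; directory entries are skipped.
        members = sorted(
            name for name in archive.namelist()
            if Path(name).suffix.lower() in (".doc", ".docx")
        )
        annex_number = 0
        for name in members:
            suffix = Path(name).suffix.lower()
            if "annex" in Path(name).name.lower():
                # Assumption: annex numbering follows sorted filename order.
                annex_number += 1
                target = issue_dir / f"annex{annex_number}-{lang}{suffix}"
            else:
                target = issue_dir / f"{lang}{suffix}"
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

If an archive ever contained more than one non-annex document, the later one would overwrite the earlier one in this sketch, so a real utility would probably want to detect that case and report it.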

Utility output should not be versioned. If we need to share the downloaded .docx file archive, we can upload it somewhere else.

@strogonoff strogonoff added this to the Back-fill OB issue data milestone Jul 4, 2019
@ronaldtse
Contributor

@strogonoff we should probably scrape all the languages available...?

@ronaldtse
Contributor

@andrew2net do you have time for this?

@strogonoff
Contributor Author

@ronaldtse I’d rather have English versions sooner than all languages later (and I don’t want us to accidentally DoS ITU’s site) so I vote to handle it incrementally

@strogonoff
Contributor Author

strogonoff commented Jul 4, 2019

it’s obvious that we’ll need other languages eventually, so if we have them I won’t mind. They’ll be useless for now though, merging translations will be a challenge for later…

@ronaldtse
Contributor

Maybe Relaton-ITU should also provide the links for Word/PDF docs in English/other languages? Thoughts @andrew2net ?

Then this can be a simple wrapper script.

@strogonoff
Contributor Author

strogonoff commented Jul 4, 2019

Updated issue description to add a note about annex handling, and updated path specification for downloaded contents.

@strogonoff
Contributor Author

strogonoff commented Jul 4, 2019

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.
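To illustrate the ID/year coupling, here is a small sketch that pulls both values out of an issue page URL of the shape shown above. The function name `parse_issue_page_url` is hypothetical, and the pattern assumes the `parent` query parameter always looks like `T-SP-OB.<ID>-<year>`, which is inferred from the single example URL only.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed shape of the "parent" query parameter, based on the example URL.
_PARENT_RE = re.compile(r"^T-SP-OB\.(\d+)-(\d{4})$")


def parse_issue_page_url(url):
    """Return (issue_id, year) from an OB issue page URL, or None if it does not match."""
    query = parse_qs(urlparse(url).query)
    for parent in query.get("parent", []):
        match = _PARENT_RE.match(parent)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None
```

Run over the links found on each yearly archive page, this would give the ID-to-year mapping without guessing.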

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

@ronaldtse
Contributor

> Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.

That's fine. Moreover, some OB issues have translations, and some not. So it is necessary to visit every page anyway.

> From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

@andrew2net
Contributor

> @andrew2net do you have time for this?

@ronaldtse I have a bunch of uncompleted tasks. But of course, you can change prioritizing.

@strogonoff
Contributor Author

@ronaldtse

> some OB issues have translations, and some not

Didn’t know, that’s unfortunate.

> It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

If your point is that Relaton-ITU users may want to reference OB issues, then I can see how this may be in scope.

If we could give Relaton-ITU integer ID of OB issue and get document links in return, this would be easy. We can simply iterate over integers from 1 and gather links until it returns an error. Even if Relaton requires a year in addition to issue ID (OB issue URLs contain the year), that can be worked around.
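The iteration idea could look roughly like this. It is a generic sketch only: `lookup` is a hypothetical callable standing in for whatever would resolve an issue ID to its document links, and raising `LookupError` for an unknown ID is an assumption. Nothing here reflects Relaton-ITU's actual API.

```python
def gather_links(lookup, start=1):
    """Collect links for consecutive issue IDs until the lookup fails.

    `lookup` is a hypothetical callable: issue ID in, document links out,
    LookupError when the ID does not exist.
    """
    links = {}
    issue_id = start
    while True:
        try:
            links[issue_id] = lookup(issue_id)
        except LookupError:
            # First missing ID: assume we have run past the latest issue.
            break
        issue_id += 1
    return links
```

Stopping at the first failure is only safe if the IDs have no gaps; a more forgiving variant might tolerate a few consecutive misses before giving up.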

That said, if Relaton-ITU doesn’t have ITU OB support yet, I believe it may be much faster to write a quick bespoke scraper just for this purpose.

@strogonoff
Contributor Author

This should be on hold until we sort out #20.

@strogonoff
Contributor Author

strogonoff commented Jul 15, 2019

Rough logic for scraping OB issues from itu.int: approach one

This does not handle issues older than 567, since they are not available through the same archive index on itu.int.

Assumptions

  • All .docx download URLs follow this format:

    https://www.itu.int/dms_pub/itu-t/opb/sp/T-SP-OB.<integer OB ID>-<year>-[OAS-]<file format>-<one-letter language code>.<file extension>
    
    • The OAS- substring is present for some issues, seems common for newer issues, not for older.
    • File formats we are interested in are MSW (stands for Microsoft Word) and ZIP. The latter is used when there are annexes present.
    • File extension can be .doc, .docx (both with MSW) or .zip (with ZIP).
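Under these assumptions, the URL could be assembled as in the sketch below. It mirrors the signature of the `format_url` stub in the pseudocode of this comment; the `BASE_URL` constant is a name introduced here, and whether any particular resulting URL exists on itu.int still has to be probed.

```python
# Common prefix taken from the format string above.
BASE_URL = "https://www.itu.int/dms_pub/itu-t/opb/sp"


def format_url(ob, yr, fmt, use_oas, language, ext):
    """Build a download URL from its parts.

    ob: integer OB ID, yr: four-digit year, fmt: "MSW" or "ZIP",
    use_oas: whether to include the "OAS-" substring,
    language: one-letter language code, ext: "doc", "docx" or "zip".
    """
    oas = "OAS-" if use_oas else ""
    return f"{BASE_URL}/T-SP-OB.{ob}-{yr}-{oas}{fmt}-{language}.{ext}"
```

For example, `format_url(1165, 2019, "MSW", True, "E", "docx")` produces a URL ending in `T-SP-OB.1165-2019-OAS-MSW-E.docx`.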

Preliminary tasks

Logic

This pattern implies non-parallel sequential execution. It can be parallelized (but be careful to throttle downloads to avoid taking ITU site down by accident).

  • Start with given OB ID and year (567 and 1994 if we choose to auto-process all old issues available on itu.int).
  • Substitute variables in the URL format string & try downloading a docx for each language.
  • If URL reports not found, try changing format from docx to zip; then from zip to doc. Try each both with and without the OAS- substring in the URL.
  • If the result resembles a throttling response or “site is down” type of response, halt the loop for some period of time.
  • If still not found, increment year by one & repeat the probing.
  • If downloaded file is a .zip, extract it and determine which of the files is the issue itself and which are the annexes based on filename pattern (annexes usually have the word “annex” in file name, case-insensitive).
  • Increment OB ID by one and start again.

Pseudocode

This is clumsy: it tries URL variations with conditionals and keeps state in global variables, but the idea should be clear.

cur_ob = 567
cur_yr = 1994
cur_fmt = 'MSW'
cur_ext = 'doc'
use_oas = False

def download_issue():
  global cur_ob, cur_yr

  # Try the current settings first (they worked for the previous issue)
  succeeded = download_all_languages()

  # Otherwise try every format/extension pair, with and without OAS-
  if not succeeded:
    succeeded = try_url_variations()

  # Perhaps we ran out of issues for the year, increment year and probe again
  if not succeeded:
    cur_yr += 1
    succeeded = download_all_languages() or try_url_variations()

  # Ran out of ideas
  if not succeeded:
    return report_error("Failed to download issue")

  # One of the tries succeeded, increment issue ID and continue from the top
  cur_ob = cur_ob + 1
  return download_issue()


def try_url_variations():
  global cur_fmt, cur_ext, use_oas

  # Order follows the logic above: docx, then zip, then doc
  for fmt, ext in [('MSW', 'docx'), ('ZIP', 'zip'), ('MSW', 'doc')]:
    for oas in [False, True]:
      cur_fmt, cur_ext, use_oas = fmt, ext, oas
      if download_all_languages():
        return True

  return False


def download_all_languages():
  for language in ['E', 'F', …]:
    url = format_url(cur_ob, cur_yr, cur_fmt, use_oas, language, cur_ext)

    try:
      downloaded_file = try_download_file(url)

    except NotFound:
      # English version not found means the URL is broken.
      if language == 'E':
        return False
      # Otherwise we may be fine, not all the languages are always present so we’ll try the next one.
      else:
        continue

    else:
      if is_archive(downloaded_file):
        expand_archive(downloaded_file)
      move_files_in_place(downloaded_file)

  return True


def try_download_file(url):
  try:
    return download_file(url)
  except ServerDownOrThrottling:
    sleep(10)
    return try_download_file(url)


def format_url(ob, yr, fmt, use_oas, language, ext):
  pass # Outputs a URL according to format string


def download_file(url):
  pass # Does the download
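As a possible refinement of the retry step, the sketch below bounds the number of attempts and lengthens the pause each time, instead of sleeping a fixed 10 seconds indefinitely. It is an illustration, not part of the plan above: `max_attempts`, `base_delay` and the injected `download_file` and `sleep` arguments are assumptions, and `ServerDownOrThrottling` is defined here only so that the snippet is self-contained.

```python
import time


class ServerDownOrThrottling(Exception):
    """Raised by the downloader when the site looks down or is throttling us."""


def try_download_file(url, download_file, max_attempts=5, base_delay=10, sleep=time.sleep):
    """Call download_file(url), backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return download_file(url)
        except ServerDownOrThrottling:
            if attempt == max_attempts - 1:
                # Give up and let the caller decide what to do.
                raise
            # Wait 10 s, 20 s, 40 s, ... with the default base_delay.
            sleep(base_delay * 2 ** attempt)
```

Passing `download_file` in as an argument keeps the retry policy separate from the HTTP code and makes it easy to exercise with a fake downloader.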

@strogonoff strogonoff changed the title Scrape English versions of ITU OB issues as .docx files Scrape ITU OB issues in MS Word format from itu.int Jul 15, 2019
@strogonoff
Contributor Author

strogonoff commented Jul 15, 2019

Rough scraping logic: approach two

  • Parse ITU OB archive webpages on ITU.int year by year and issue by issue.
  • Build a data structure of download URLs. For each issue ID & language combination there will be one download URL (some languages may be missing for some issues).
  • With that data structure as input, download all issues, expanding ZIP archives and placing/renaming files as needed.
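The data structure in the second step might be as simple as a nested mapping from issue ID to language code to URL. The sketch below builds it from a flat list of links scraped off the archive pages. The names `build_index` and `_DOWNLOAD_RE` are hypothetical, the pattern assumes the download URL format described in approach one, and keeping the first URL seen for each issue and language pair is an arbitrary choice.

```python
import re
from collections import defaultdict

# Assumed download URL shape:
#   T-SP-OB.<ID>-<year>-[OAS-]<MSW|ZIP>-<language letter>.<doc|docx|zip>
_DOWNLOAD_RE = re.compile(
    r"T-SP-OB\.(?P<ob>\d+)-(?P<year>\d{4})-(?:OAS-)?"
    r"(?P<fmt>MSW|ZIP)-(?P<lang>[A-Z])\.(?P<ext>docx?|zip)$"
)


def build_index(urls):
    """Return {issue_id: {language_code: url}} for all recognizable download URLs."""
    index = defaultdict(dict)
    for url in urls:
        match = _DOWNLOAD_RE.search(url)
        if match is None:
            continue  # not a Word/ZIP download link
        # Keep the first URL seen for each issue and language.
        index[int(match["ob"])].setdefault(match["lang"], url)
    return dict(index)
```

Languages missing for an issue are simply absent from its inner dictionary, which matches the note that not every issue has every translation.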

3 participants