Scrape ITU OB issues in MS Word format from itu.int #16

strogonoff · 2019-07-04T05:55:52Z

The goal is to write a utility that scrapes all ITU OB issues in .docx format (just English versions for now).

The issues are here: https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB&version_date=2019

The site was recently down so we should probably throttle requests & be gentle on it.
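One way to be gentle on the site is to enforce a minimum delay between requests. The following is a minimal sketch, not an agreed design: the `Throttle` class and its parameters are hypothetical names, and the right interval for itu.int is unknown.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (value is an assumption)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before every download keeps the request rate below one per `min_interval` seconds; the clock and sleep functions are injectable so the behaviour can be tested without real delays.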

The utility should store .docx files as <issue ID>/en.docx, where issue ID is a simple integer like 1023.

If the MS Word download link on the ITU site leads to an archive, it must mean the issue has annexes, and the download will contain the issue itself and the annexes as separate .docx files. In such cases, the utility should expand the archive and place the annexes in the same directory as the issue, as <issue ID>/annex<N>-en.docx.
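The layout described above could be produced roughly as follows. This is a sketch under assumptions: `store_archive` is a hypothetical name, annexes are recognised by the word "annex" in the file name, annex numbers follow case-insensitive file name order, and each archive holds exactly one non-annex .docx file.

```python
import io
import zipfile
from pathlib import Path


def store_archive(zip_bytes, issue_id, out_root, lang="en"):
    """Expand a downloaded archive into <issue ID>/<lang>.docx and annex<N>-<lang>.docx."""
    root = Path(out_root)
    issue_dir = root / str(issue_id)
    issue_dir.mkdir(parents=True, exist_ok=True)
    written = []
    annex_no = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Sort case-insensitively so annex numbering is deterministic.
        for name in sorted(archive.namelist(), key=str.lower):
            if not name.lower().endswith(".docx"):
                continue  # skip anything that is not a Word document
            if "annex" in Path(name).name.lower():
                annex_no += 1
                target = issue_dir / f"annex{annex_no}-{lang}.docx"
            else:
                # Assumption: the only non-annex .docx is the issue itself.
                target = issue_dir / f"{lang}.docx"
            target.write_bytes(archive.read(name))
            written.append(target.relative_to(root).as_posix())
    return written
```

A plain (non-archive) download would simply be written to `<issue ID>/en.docx` without going through this function.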

Utility output should not be versioned. If we need to share the downloaded .docx file archive, we can upload it somewhere else.


ronaldtse · 2019-07-04T06:02:13Z

@strogonoff we should probably scrape all the languages available...?

ronaldtse · 2019-07-04T06:07:00Z

@andrew2net do you have time for this?

strogonoff · 2019-07-04T06:08:26Z

@ronaldtse I’d rather have English versions sooner than all languages later (and I don’t want us to accidentally DoS ITU’s site) so I vote to handle it incrementally

strogonoff · 2019-07-04T06:09:31Z

it’s obvious that we’ll need other languages eventually, so if we have them I won’t mind. They’ll be useless for now though, merging translations will be a challenge for later…

ronaldtse · 2019-07-04T06:11:25Z

Maybe Relaton-ITU should also provide the links for Word/PDF docs in English/other languages? Thoughts @andrew2net ?

Then this can be a simple wrapper script.

strogonoff · 2019-07-04T06:15:51Z

Updated issue description to add a note about annex handling, and updated path specification for downloaded contents.

strogonoff · 2019-07-04T06:21:07Z

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

ronaldtse · 2019-07-04T06:55:12Z

> Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.

That's fine. Moreover, some OB issues have translations, and some not. So it is necessary to visit every page anyway.

> From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

andrew2net · 2019-07-04T09:05:05Z

> @andrew2net do you have time for this?

@ronaldtse I have a bunch of uncompleted tasks, but of course you can change the priorities.

strogonoff · 2019-07-05T08:36:26Z

@ronaldtse

> some OB issues have translations, and some not

Didn’t know, that’s unfortunate.

> It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

If your point is that Relaton-ITU users may want to reference OB issues, then I can see how this may be in scope.

If we could give Relaton-ITU integer ID of OB issue and get document links in return, this would be easy. We can simply iterate over integers from 1 and gather links until it returns an error. Even if Relaton requires a year in addition to issue ID (OB issue URLs contain the year), that can be worked around.

That said, if Relaton-ITU doesn’t have ITU OB support yet, I believe it may be much faster to write a quick bespoke scraper just for this purpose.

strogonoff · 2019-07-15T12:16:45Z

This should be on hold until we sort out #20.

strogonoff · 2019-07-15T12:25:30Z

Rough logic for scraping OB issues from itu.int: approach one

This does not handle issues older than 567, since they are not available through the same archive index on itu.int.

Assumptions

All MS Word download URLs follow this format:
```
https://www.itu.int/dms_pub/itu-t/opb/sp/T-SP-OB.<integer OB ID>-<year>-[OAS-]<file format>-<one-letter language code>.<file extension>
```
- The OAS- substring is present for some issues, seems common for newer issues, not for older.
- File formats we are interested in are MSW (stands for Microsoft Word) and ZIP. The latter is used when there are annexes present.
- File extension can be .doc, .docx (both with MSW) or .zip (with ZIP).
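The format string above can be filled in with a small helper. This is a minimal sketch that takes the assumed URL pattern at face value; the argument order mirrors the `format_url` stub in the pseudocode, and `BASE` is a name introduced here for illustration.

```python
# Base path taken from the assumed URL format; not verified against every issue.
BASE = "https://www.itu.int/dms_pub/itu-t/opb/sp"


def format_url(ob, yr, fmt, use_oas, language, ext):
    """Build a download URL from OB ID, year, format, OAS flag, language code and extension."""
    oas = "OAS-" if use_oas else ""
    return f"{BASE}/T-SP-OB.{ob}-{yr}-{oas}{fmt}-{language}.{ext}"
```

For example, `format_url(1165, 2019, "MSW", True, "E", "docx")` yields a URL ending in `T-SP-OB.1165-2019-OAS-MSW-E.docx`. Whether that particular file exists on the site is not something this sketch can tell.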

Preliminary tasks

- Build a map of languages & one-letter codes. E.g., English is E.
- Determine the starting ITU OB & year and formats of interest (depends on #20, “Determine the scope of ITU OB issue archive processing”: whether we have to process older issues in .doc or not).

Logic

This pattern implies non-parallel sequential execution. It can be parallelized (but be careful to throttle downloads to avoid taking ITU site down by accident).

1. Start with the given OB ID and year (567 and 1994 if we choose to auto-process all old issues available on itu.int).
2. Substitute variables in the URL format string & try downloading a docx for each language.
3. If the URL reports not found, try changing format from docx to zip; then from zip to doc. Try each both with and without the OAS- substring in the URL.
4. If the result resembles a throttling response or “site is down” type of response, halt the loop for some period of time.
5. If still not found, increment the year by one & retry the probing.
6. If the downloaded file is a .zip, extract it and determine which of the files is the issue itself and which are the annexes based on filename pattern (annexes usually have the word “annex” in file name, case-insensitive).
7. Increment OB ID by one and start again.
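The probing order in the steps above can also be written as a flat list of candidates instead of nested conditionals. This is only a sketch: `candidates`, `probe` and the `exists` callback are hypothetical names, and bumping the year at most once per issue is an assumption.

```python
from itertools import product

# (format, extension) pairs in the order described above: docx, then zip, then doc.
FORMATS = [("MSW", "docx"), ("ZIP", "zip"), ("MSW", "doc")]


def candidates(start_year, max_year_bump=1):
    """Yield (year, format, extension, use_oas) combinations in probing order."""
    for year in range(start_year, start_year + max_year_bump + 1):
        for (fmt, ext), use_oas in product(FORMATS, (False, True)):
            yield year, fmt, ext, use_oas


def probe(ob, start_year, exists):
    """Return the first combination for which exists(ob, year, fmt, ext, use_oas) is true."""
    for cand in candidates(start_year):
        if exists(ob, *cand):
            return cand
    return None  # ran out of ideas for this OB ID
```

Here `exists` stands in for whatever performs the actual request for the English version, so the search order can be checked without touching the network.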

Pseudocode

Clumsy: it tries URL variations with conditionals and keeps state in global variables, but the idea should be clear.

cur_ob = 567
cur_yr = 1994
cur_fmt = 'MSW'
cur_ext = 'doc'
use_oas = False

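def download_issue():
  # State lives in module-level variables, so declare them before reassigning
  global cur_ob, cur_yr, cur_fmt, cur_ext, use_oas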
  succeeded = download_all_languages()

  # If MSW, try the other MS Word extension first
  # (the ZIP attempt below changes cur_fmt, so this check must come before it)
  if not succeeded and cur_fmt == 'MSW':
    if cur_ext == 'docx':
      cur_ext = 'doc'
    else:
      cur_ext = 'docx'
    succeeded = download_all_languages()

  # If still MSW, try ZIP
  if not succeeded and cur_fmt == 'MSW':
    cur_fmt = 'ZIP'
    cur_ext = 'zip'
    succeeded = download_all_languages()

  # If ZIP, try MSW
  if not succeeded and cur_fmt == 'ZIP':
    cur_fmt = 'MSW'
    cur_ext = 'doc'
    succeeded = download_all_languages()

  # Try toggling OAS
  if not succeeded:
    use_oas = not use_oas
    succeeded = download_all_languages()

  # Perhaps we ran out of issues for the year, increment year
  if not succeeded:
    cur_yr += 1
    succeeded = download_all_languages()

  # Ran out of ideas
  if not succeeded:
    return report_error("Failed to download issue")

  # One of the tries succeeded, increment issue ID and continue from the top
  cur_ob  = cur_ob + 1
  return download_issue()


def download_all_languages():
  for language in ['E', 'F', …]:
    url = format_url(cur_ob, cur_yr, cur_fmt, use_oas, language, cur_ext)

    try:
      downloaded_file = try_download_file(url)

    except NotFound:
      # English version not found means the URL is broken.
      if language == 'E':
        return False
      # Otherwise we may be fine, not all the languages are always present so we’ll try the next one.
      else:
        continue

    else:
      if is_archive(downloaded_file):
        expand_archive(downloaded_file)
      move_files_in_place(downloaded_file)

  return True


def try_download_file(url):
  try:
    return download_file(url)
  except ServerDownOrThrottling:
    sleep(10)
    return try_download_file(url)


def format_url(ob, yr, fmt, use_oas, language, ext):
  pass # Outputs a URL according to the format string


def download_file(url):
  pass # Does the download
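The recursive `try_download_file` above retries forever if the site stays down. A bounded variant with growing delays might look like the sketch below; `download_with_retry`, the attempt limit and the doubling delay are assumptions, not something decided in this thread.

```python
import time


class ServerDownOrThrottling(Exception):
    """Raised by the download function when the site looks down or is throttling us."""


def download_with_retry(download, url, max_attempts=5, base_delay=10.0, sleep=time.sleep):
    """Call download(url), backing off on ServerDownOrThrottling up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return download(url)
        except ServerDownOrThrottling:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller decide what to do
            # Wait 10 s, 20 s, 40 s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Passing the download function and `sleep` in as arguments keeps the retry policy separate from the HTTP client, which has not been chosen yet.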

strogonoff · 2019-07-15T13:21:00Z

Rough scraping logic: approach two

1. Parse ITU OB archive webpages on ITU.int year by year and issue by issue.
2. Build a data structure of download URLs. For each issue ID & language combination there will be one download URL (some languages may be missing for some issues).
3. With that data structure as input, download all issues, expanding ZIP archives and placing/renaming files as needed.
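The data structure in step two could be a mapping from (issue ID, language code) to URL. The sketch below assumes that the archive pages contain direct `href` links whose file names follow the URL format from approach one, which has not been checked here; `collect_links` and both regular expressions are illustrative.

```python
import re
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"]+)"')
# File name pattern from the assumed URL format; only MS Word and ZIP downloads are kept.
FILE_RE = re.compile(
    r"T-SP-OB\.(?P<ob>\d+)-(?P<year>\d{4})-(?:OAS-)?(?:MSW|ZIP)-(?P<lang>[A-Z])\.(?:docx?|zip)"
)


def collect_links(html, page_url):
    """Map (issue ID, language code) to the first matching download URL found in html."""
    index = {}
    for href in HREF_RE.findall(html):
        match = FILE_RE.search(href)
        if match is None:
            continue  # not a Word/ZIP download link, e.g. a PDF
        key = (int(match["ob"]), match["lang"])
        # Resolve relative links against the page they were found on.
        index.setdefault(key, urljoin(page_url, href))
    return index
```

Merging the dictionaries from every yearly archive page gives the input for step three, and missing languages simply have no key.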

strogonoff added this to the Back-fill OB issue data milestone Jul 4, 2019

strogonoff changed the title ~~Scrape English versions of ITU OB issues as .docx files~~ Scrape ITU OB issues in MS Word format from itu.int Jul 15, 2019
