Updated utilized API and added new features #3

Open

wants to merge 1 commit into base: master
36 changes: 17 additions & 19 deletions README.md
@@ -1,6 +1,6 @@
# U.S.News-Scrapper

U.S.News Scrapper is a Python library that collects data from the website of [usnews.com](https://www.usnews.com/best-graduate-schools) and output those data in a file for offline usage. Till now, it is only capable of collecting graduate schools data and output it in `.xls` format. After generating the `.xls` file, it will be opened by default excel file opener.
U.S.News Scrapper is a Python library that collects data from the website of [usnews.com](https://www.usnews.com/best-colleges) and saves it to a file for offline use. It collects college data and outputs it in `.xlsx`, `.csv`, or `.html` format.

## Setup
Make sure that [Python 3](https://www.python.org/downloads) is already installed in your system.
@@ -23,59 +23,57 @@ Then it can be used via the command line. See [Command line example](#command-line-example)

### Command line usage
```
python -m usnews_scrapper [-h] -u URL [-o OUTPUTFILENAME] [-p PAUSETIME] [--from STARTPAGE] [--to ENDPAGE]
python -m usnews_scrapper [-h] outputfilename [-s STARTPAGE] [-e ENDPAGE] [-f {xlsx,csv,html}] [-p PAUSETIME]
```
Collects data from usnews and generates excel file.
Collects data from usnews and generates an Excel, CSV, or HTML file.

Necessary Arguments:
```
-u URL, --url URL The usnews address to collect data from.
Put the URL within qoutes i.e. " or ' .
OUTPUTFILENAME The output file name without extension.
```
Optional Arguments:
```
-h, --help Show this help message and exit
-o OUTPUTFILENAME The output file name without extension.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
--from STARTPAGE The page number from which the scrapper starts working.
--to ENDPAGE The page number to which the scrapper works.
-h, --help Show this help message and exit.
-s STARTPAGE, --start STARTPAGE The page number from which the scrapper starts working.
-e ENDPAGE, --end ENDPAGE The page number to which the scrapper works.
-f FORMAT, --format FORMAT The format of the output file.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
```

### Module usage
`usnews_scrapper.unsc()` takes input the `url` as string. The other arguments are optional. This function will return absolute path to the output file.
`usnews_scrapper.unsc()` takes the `outputfilename` as a string. The other arguments are optional. This function returns the absolute path to the output file.

```python
from usnews_scrapper import unsc
unsc(url:str, output_file_name:str, pause_time:int, from_page:int, to_page:int) -> str
unsc(outputfilename:str, pausetime:int, format:str, startpage:int, endpage:int) -> str
```
See [Module example](#module-example) for examples.

## Examples

### Command line example
Copy the address of the page from usnews website and in the Command Prompt and enter this command -
Enter this command -

```bash
$ python -m usnews_scrapper --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
$ python -m usnews_scrapper file_name --start 1 --end 2 --format xlsx --pause 2
```

If you want to run from the source, then enter this command instead.

```bash
$ cd USNews-Scrapper/usnews_scrapper/
$ python usnews_scrapper.py --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
$ python usnews_scrapper.py file_name --start 1 --end 2 --format xlsx --pause 2
```
In both cases, The output file will be saved in current directory under the name of `file_name_*.xls`.
In both cases, the output file will be saved in the `usnews_scrapper` directory under the name `file_name_*.xlsx`.

### Module example

```python
>>> from usnews_scrapper import unsc
>>> url = "https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings"
>>> output_file = unsc(url=url, output_file_name="output", pause_time=2, from_page=2, to_page=5)
>>> output_file = unsc(outputfilename="file_name", startpage=1, endpage=2, format="xlsx", pausetime=2)
```
`output_file` will contain the absolute path to the output file.

## Author
## Authors

* **Joy Ghosh** - [www.ijoyghosh.com](https://www.ijoyghosh.com)
25 changes: 12 additions & 13 deletions README.rst
@@ -2,7 +2,7 @@
U.S.News-Scrapper
=================

U.S.News Scrapper is a Python library that collect data from the website of usnews_ and output those data in a file for offline usage. Till now, it is only capable of collecting graduate schools data and output it in .xls format. After generating the .xls file, it will be opened by default excel file opener.
U.S.News Scrapper is a Python library that collects data from the usnews_ website and saves it to a file for offline use. It collects college data and outputs it in .xlsx, .csv or .html format.
*Visit the github_ page for detailed information.*

Setup
@@ -14,35 +14,34 @@ Setup

Usage
=====
usage: python usnews_scrapper.py [-h] -u URL [-o OUTPUTFILENAME] [-p PAUSETIME] [--from STARTPAGE] [--to ENDPAGE]
usage: python usnews_scrapper.py [-h] outputfilename [-s STARTPAGE] [-e ENDPAGE] [-f {xlsx,csv,html}] [-p PAUSETIME]

Collects data from usnews and generates excel file
Collects data from usnews and generates an Excel, CSV or HTML file

optional arguments:
-h, --help Show this help message and exit
-u URL, --url URL The usnews address to collect data from. Put the URL within qoutes i.e. " or ' .
-o OUTPUTFILENAME The output file name without extension.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
--from STARTPAGE The page number from which the scrapper starts working.
--to ENDPAGE The page number to which the scrapper works.
-h, --help Show this help message and exit.
-s STARTPAGE, --start STARTPAGE The page number from which the scrapper starts working.
-e ENDPAGE, --end ENDPAGE The page number to which the scrapper works.
-f FORMAT, --format FORMAT The format of the output file.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.


Examples
========

Copy the address of the page from usnews website and in the Command Prompt and enter this command -
To produce an Excel file covering pages 1 through 2 with a pause time of 2 seconds, enter this command -

| $ cd USNews-Scrapper
| $ python usnews_scrapper.py --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
| $ python usnews_scrapper.py file_name --start 1 --end 2 --format xlsx --pause 2

The output file will be saved in current directory under the name of file_name_*.xls
The output file will be saved in the ``usnews_scrapper`` directory under the name ``file_name_*.xlsx``

Authors
=======

* *Joy Ghosh* - www.ijoyghosh.com_

.. _usnews: https://www.usnews.com/best-graduate-schools
.. _usnews: https://www.usnews.com/best-colleges
.. _pip: https://pip.pypa.io/en/stable/
.. _www.ijoyghosh.com : https://www.ijoyghosh.com
.. _github : https://github.com/OvroAbir/USNews-Scrapper
Binary file not shown.
2 changes: 1 addition & 1 deletion usnews_scrapper/__init__.py
@@ -1 +1 @@
from .usnews_scrapper import usnews_scrapper as unsc
from .usnews_scrapper import usnews_scrapper as unsc
107 changes: 107 additions & 0 deletions usnews_scrapper/college.py
@@ -0,0 +1,107 @@
import locale

locale.setlocale(locale.LC_ALL, '')

class College:
def __init__(self, name, state, rank, tuition, acceptance_rate, sat_range, act_range,
engineering_rep_score, business_rep_score, cs_rep_score, nursing_rep_score):
self.__name = name
self.__state = state
self.__rank = rank
self.__tuition = tuition
self.__acceptance_rate = acceptance_rate
self.__sat_range = sat_range
self.__act_range = act_range
self.__engineering_rep_score = engineering_rep_score
self.__business_rep_score = business_rep_score
self.__cs_rep_score = cs_rep_score
self.__nursing_rep_score = nursing_rep_score

@classmethod
def getFromJSON(cls, json_data):
name = state = rank = tuition = acceptance_rate = sat_range = act_range = None
engineering_rep_score = business_rep_score = cs_rep_score = nursing_rep_score = None

try:
name = json_data["institution"]["displayName"]
except KeyError:
pass

try:
state = json_data["institution"]["state"]
except KeyError:
pass

try:
rank = int(json_data["parent"]["sortRank"])
except KeyError:
pass

try:
    tuition = locale.atof(json_data["searchData"]["tuition"]["displayValue"].replace("$", ""))
except (KeyError, AttributeError):
    # Some payloads nest displayValue one level deeper; give up quietly if that shape is missing too.
    try:
        tuition = locale.atof(json_data["searchData"]["tuition"]["displayValue"][0]["value"].replace("$", ""))
    except (KeyError, IndexError, TypeError, ValueError):
        pass
except ValueError:
    pass

try:
acceptance_rate = float(json_data["searchData"]["acceptanceRate"]["displayValue"].strip("%"))/100
except KeyError:
pass

try:
sat_range = json_data["searchData"]["testAvgs"]["displayValue"][0]["value"]
except KeyError:
pass

try:
act_range = json_data["searchData"]["testAvgs"]["displayValue"][1]["value"]
except KeyError:
pass

try:
engineering_rep_score = float(json_data["searchData"]["engineeringRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
business_rep_score = float(json_data["searchData"]["businessRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
cs_rep_score = float(json_data["searchData"]["computerScienceRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
nursing_rep_score = float(json_data["searchData"]["nursingRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

return cls(name, state, rank,
tuition, acceptance_rate, sat_range, act_range,
engineering_rep_score, business_rep_score, cs_rep_score, nursing_rep_score)

def __iter__(self):
yield self.__rank
yield self.__name
yield self.__state
yield self.__tuition
yield self.__acceptance_rate
yield self.__sat_range
yield self.__act_range
yield self.__engineering_rep_score
yield self.__business_rep_score
yield self.__cs_rep_score
yield self.__nursing_rep_score

def __str__(self):
return "name : {} \nstate : {} \nrank : {} \ntuition : {} \nacceptance rate : {} \nsat range : {} \n"\
"act range : {} \nengineering score : {} \nbusiness score : {} \n"\
"computer science score : {} \nnursing score : {}".format(self.__name, self.__state, self.__rank,
self.__tuition, self.__acceptance_rate,
self.__sat_range, self.__act_range,
self.__engineering_rep_score, self.__business_rep_score,
self.__cs_rep_score, self.__nursing_rep_score)

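`getFromJSON` above leans on repeated `try`/`except` blocks around nested dictionary lookups so that one missing field never aborts the whole record. A minimal standalone sketch of that defensive-parsing idea, using a hypothetical `safe_get` helper and made-up sample data (not the library's actual API):

```python
def safe_get(data, *keys, convert=None):
    # Walk nested dict keys; return None when any level is missing or malformed.
    try:
        for key in keys:
            data = data[key]
        return convert(data) if convert is not None else data
    except (KeyError, IndexError, TypeError, ValueError):
        return None

# Hypothetical payload shaped like the usnews JSON fields parsed above.
sample = {
    "institution": {"displayName": "Example University", "state": "CA"},
    "searchData": {"tuition": {"displayValue": "$52,000"}},
}

name = safe_get(sample, "institution", "displayName")
tuition = safe_get(sample, "searchData", "tuition", "displayValue",
                   convert=lambda s: float(s.replace("$", "").replace(",", "")))
rank = safe_get(sample, "parent", "sortRank", convert=int)  # key absent -> None
```

One helper replaces a dozen near-identical `try`/`except` blocks, at the cost of collapsing the distinct exception cases the class handles individually.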
79 changes: 79 additions & 0 deletions usnews_scrapper/table_data/main.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
// Initializing Bootstrap attributes
document.querySelectorAll("th").forEach(function(th){
th.setAttribute("scope", "col")
})

const table = document.querySelector("table")
table.classList.add("table")

const thead = document.querySelector("thead")
thead.classList.add("thead-dark")

const tBody = document.getElementsByTagName('tbody')[0]
tBody.querySelectorAll("th").forEach(function(th){
th.setAttribute("scope", "row")
})


/**
* Sorts a HTML table.
*
* @param {HTMLTableElement} table The table to sort
* @param {number} column The index of the column to sort
* @param {boolean} asc Determines if the sorting will be in ascending order
*/

function sortTableByColumn(table, column, asc=true){
const dirModifier = asc ? 1 : -1
const tBody = document.getElementsByTagName('tbody')[0]
const rows = Array.from(tBody.querySelectorAll("tr"))

const regInteger = /^[0-9]*$/
const regFloat = /^[+-]?\d+(\.\d+)?$/
const columnValues = Array.from(tBody.rows).map(row => row.cells[column].textContent.trim())
// Treat the column as numeric only if every non-"N/A" value parses as a number
const columnDataTypeIsNum = columnValues.every(value => value === "N/A" || regInteger.test(value) || regFloat.test(value))

const sortedRows = rows.sort(function(a, b){
const aColText = a.querySelector(`td:nth-child(${column + 1})`).textContent.trim()
const bColText = b.querySelector(`td:nth-child(${column + 1})`).textContent.trim()

// "N/A" values always sort last, regardless of direction
if (aColText === "N/A") return 1
if (bColText === "N/A") return -1

if (columnDataTypeIsNum){
return (Number(aColText) - Number(bColText)) >= 0 ? (1 * dirModifier) : (-1 * dirModifier)
}
return aColText >= bColText ? (1 * dirModifier) : (-1 * dirModifier)
})

changeTableOrder(tBody, sortedRows)
trackSortedColumn(table, column, asc)
}

function changeTableOrder(tableBody, newRows){
while (tableBody.firstChild){
tableBody.removeChild(tableBody.firstChild)
}
tableBody.append(...newRows)
}

function trackSortedColumn(table, column, asc){
table.querySelectorAll("th").forEach(function(th){
th.classList.remove("th-sort-asc", "th-sort-desc")
})
table.querySelector(`th:nth-child(${column + 1})`).classList.toggle("th-sort-asc", asc)
table.querySelector(`th:nth-child(${column + 1})`).classList.toggle("th-sort-desc", !asc)
}


document.querySelectorAll("th").forEach(function(headerCell){
headerCell.addEventListener("click", function(){
const tableElement = headerCell.parentElement.parentElement.parentElement
const headerIndex = Array.prototype.indexOf.call(headerCell.parentElement.children, headerCell)
const currentIsAscending = headerCell.classList.contains("th-sort-asc")

sortTableByColumn(tableElement, headerIndex, !currentIsAscending)
})
})
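The comparator rules in `sortTableByColumn` ("N/A" rows sink to the bottom, numeric columns compare as numbers, everything else as strings, with a direction modifier) can be sketched in Python; the row data here is invented for illustration:

```python
import re
from functools import cmp_to_key

# Matches signed integers and decimals, mirroring the regexes in main.js.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")

def compare(a, b, column=1, asc=True):
    direction = 1 if asc else -1
    x, y = a[column].strip(), b[column].strip()
    if x == "N/A":
        return 1   # "N/A" rows always sink to the bottom
    if y == "N/A":
        return -1
    if NUMERIC.match(x) and NUMERIC.match(y):
        return direction if float(x) >= float(y) else -direction
    return direction if x >= y else -direction

rows = [["B", "12"], ["A", "N/A"], ["C", "3.5"]]
ordered = sorted(rows, key=cmp_to_key(lambda a, b: compare(a, b)))
# numeric order ("3.5" before "12"), with the "N/A" row last
```

As in the JS version, the comparator returns a fixed sign for "N/A" so those rows stay at the bottom whether the sort is ascending or descending.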
19 changes: 19 additions & 0 deletions usnews_scrapper/table_data/styles.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
*,
::after,
::before {
margin: 0;
padding: 0;
box-sizing: border-box;
}

th:hover {
cursor: pointer;
}

.th-sort-asc::after {
content: "\25b4";
}

.th-sort-desc::after {
content: "\25be";
}