Updated utilized API and added new features #3

Open

wants to merge 1 commit into base: master
36 changes: 17 additions & 19 deletions README.md
@@ -1,6 +1,6 @@
# U.S.News-Scrapper

U.S.News Scrapper is a Python library that collects data from the website of [usnews.com](https://www.usnews.com/best-graduate-schools) and output those data in a file for offline usage. Till now, it is only capable of collecting graduate schools data and output it in `.xls` format. After generating the `.xls` file, it will be opened by default excel file opener.
U.S.News Scrapper is a Python library that collects data from the website of [usnews.com](https://www.usnews.com/best-colleges) and saves it to a file for offline use. It collects college data and outputs it in `.xlsx`, `.csv`, or `.html` format.

## Setup
Make sure that [Python 3](https://www.python.org/downloads) is already installed in your system.
@@ -23,59 +23,57 @@ Then it can be used via the command line. See [Command line example](#command-line-example)

### Command line usage
```
python -m usnews_scrapper [-h] -u URL [-o OUTPUTFILENAME] [-p PAUSETIME] [--from STARTPAGE] [--to ENDPAGE]
python -m usnews_scrapper [-h] outputfilename [-s STARTPAGE] [-e ENDPAGE] [-f {xlsx,csv,html}] [-p PAUSETIME]
```
Collects data from usnews and generates excel file.
Collects data from usnews and generates an Excel, CSV, or HTML file.

Necessary Arguments:
```
-u URL, --url URL The usnews address to collect data from.
Put the URL within qoutes i.e. " or ' .
OUTPUTFILENAME The output file name without extension.
```
Optional Arguments:
```
-h, --help Show this help message and exit
-o OUTPUTFILENAME The output file name without extension.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
--from STARTPAGE The page number from which the scrapper starts working.
--to ENDPAGE The page number to which the scrapper works.
-h, --help Show this help message and exit.
-s STARTPAGE, --start STARTPAGE The page number from which the scrapper starts working.
-e ENDPAGE, --end ENDPAGE The page number to which the scrapper works.
-f FORMAT, --format FORMAT The format of the output file.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
```

### Module usage
`usnews_scrapper.unsc()` takes input the `url` as string. The other arguments are optional. This function will return absolute path to the output file.
`usnews_scrapper.unsc()` takes the `outputfilename` as a string. The other arguments are optional. This function returns the absolute path to the output file.

```python
from usnews_scrapper import unsc
unsc(url:str, output_file_name:str, pause_time:int, from_page:int, to_page:int) -> str
unsc(outputfilename:str, pausetime:int, format:str, startpage:int, endpage:int) -> str
```
See [Module example](#module-example) for examples.

## Examples

### Command line example
Copy the address of the page from usnews website and in the Command Prompt and enter this command -
Enter this command -

```bash
$ python -m usnews_scrapper --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
$ python -m usnews_scrapper file_name --start 1 --end 2 --format xlsx --pause 2
```

If you want to run from the source, then enter this command instead.

```bash
$ cd USNews-Scrapper/usnews_scrapper/
$ python usnews_scrapper.py --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
$ python usnews_scrapper.py file_name --start 1 --end 2 --format xlsx --pause 2
```
In both cases, The output file will be saved in current directory under the name of `file_name_*.xls`.
In both cases, the output file will be saved in the `usnews_scrapper` directory under the name `file_name_*.xlsx`.

### Module example

```python
>>> from usnews_scrapper import unsc
>>> url = "https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings"
>>> output_file = unsc(url=url, output_file_name="output", pause_time=2, from_page=2, to_page=5)
>>> output_file = unsc(outputfilename="file_name", startpage=1, endpage=2, format="xlsx", pausetime=2)
```
`output_file` will contain the absolute path to the output file.

## Author
## Authors

* **Joy Ghosh** - [www.ijoyghosh.com](https://www.ijoyghosh.com)
25 changes: 12 additions & 13 deletions README.rst
@@ -2,7 +2,7 @@
U.S.News-Scrapper
=================

U.S.News Scrapper is a Python library that collect data from the website of usnews_ and output those data in a file for offline usage. Till now, it is only capable of collecting graduate schools data and output it in .xls format. After generating the .xls file, it will be opened by default excel file opener.
U.S.News Scrapper is a Python library that collects data from the usnews_ website and saves it to a file for offline use. It collects college data and outputs it in .xlsx, .csv or .html format.
*Visit the github_ page for detailed information.*

Setup
@@ -14,35 +14,34 @@ Setup

Usage
=====
usage: python usnews_scrapper.py [-h] -u URL [-o OUTPUTFILENAME] [-p PAUSETIME] [--from STARTPAGE] [--to ENDPAGE]
usage: python usnews_scrapper.py [-h] outputfilename [-s STARTPAGE] [-e ENDPAGE] [-f {xlsx,csv,html}] [-p PAUSETIME]

Collects data from usnews and generates excel file
Collects data from usnews and generates an Excel, CSV or HTML file

optional arguments:
-h, --help Show this help message and exit
-u URL, --url URL The usnews address to collect data from. Put the URL within qoutes i.e. " or ' .
-o OUTPUTFILENAME The output file name without extension.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.
--from STARTPAGE The page number from which the scrapper starts working.
--to ENDPAGE The page number to which the scrapper works.
-h, --help Show this help message and exit.
-s STARTPAGE, --start STARTPAGE The page number from which the scrapper starts working.
-e ENDPAGE, --end ENDPAGE The page number to which the scrapper works.
-f FORMAT, --format FORMAT The format of the output file.
-p PAUSETIME, --pause PAUSETIME The pause time between loading pages from usnews.


Examples
========

Copy the address of the page from usnews website and in the Command Prompt and enter this command -
To produce an Excel file covering pages 1 through 2 with a pause time of 2 seconds, enter this command -

| $ cd USNews-Scrapper
| $ python usnews_scrapper.py --url="https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings" -o file_name -p 2 --from=2 --to=5
| $ python usnews_scrapper.py file_name --start 1 --end 2 --format xlsx --pause 2

The output file will be saved in current directory under the name of file_name_*.xls
The output file will be saved in the ``usnews_scrapper`` directory under the name ``file_name_*.xlsx``

Authors
=======

* *Joy Ghosh* - www.ijoyghosh.com_

.. _usnews: https://www.usnews.com/best-graduate-schools
.. _usnews: https://www.usnews.com/best-colleges
.. _pip: https://pip.pypa.io/en/stable/
.. _www.ijoyghosh.com : https://www.ijoyghosh.com
.. _github : https://github.com/OvroAbir/USNews-Scrapper
Binary file not shown.
2 changes: 1 addition & 1 deletion usnews_scrapper/__init__.py
@@ -1 +1 @@
from .usnews_scrapper import usnews_scrapper as unsc
from .usnews_scrapper import usnews_scrapper as unsc
107 changes: 107 additions & 0 deletions usnews_scrapper/college.py
@@ -0,0 +1,107 @@
import locale

locale.setlocale(locale.LC_ALL, '')

class College:
def __init__(self, name, state, rank, tuition, acceptance_rate, sat_range, act_range,
engineering_rep_score, business_rep_score, cs_rep_score, nursing_rep_score):
self.__name = name
self.__state = state
self.__rank = rank
self.__tuition = tuition
self.__acceptance_rate = acceptance_rate
self.__sat_range = sat_range
self.__act_range = act_range
self.__engineering_rep_score = engineering_rep_score
self.__business_rep_score = business_rep_score
self.__cs_rep_score = cs_rep_score
self.__nursing_rep_score = nursing_rep_score

@classmethod
def getFromJSON(cls, json_data):
name = state = rank = tuition = acceptance_rate = sat_range = act_range = None
engineering_rep_score = business_rep_score = cs_rep_score = nursing_rep_score = None

try:
name = json_data["institution"]["displayName"]
except KeyError:
pass

try:
state = json_data["institution"]["state"]
except KeyError:
pass

try:
rank = int(json_data["parent"]["sortRank"])
except KeyError:
pass

try:
    tuition = locale.atof(json_data["searchData"]["tuition"]["displayValue"].replace("$", ""))
except (KeyError, AttributeError):
    # Some payloads nest displayValue one level deeper; give up quietly if that shape is missing too.
    try:
        tuition = locale.atof(json_data["searchData"]["tuition"]["displayValue"][0]["value"].replace("$", ""))
    except (KeyError, IndexError, TypeError, ValueError):
        pass
except ValueError:
    pass

try:
acceptance_rate = float(json_data["searchData"]["acceptanceRate"]["displayValue"].strip("%"))/100
except KeyError:
pass

try:
sat_range = json_data["searchData"]["testAvgs"]["displayValue"][0]["value"]
except KeyError:
pass

try:
act_range = json_data["searchData"]["testAvgs"]["displayValue"][1]["value"]
except KeyError:
pass

try:
engineering_rep_score = float(json_data["searchData"]["engineeringRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
business_rep_score = float(json_data["searchData"]["businessRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
cs_rep_score = float(json_data["searchData"]["computerScienceRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

try:
nursing_rep_score = float(json_data["searchData"]["nursingRepScore"]["rawValue"])
except (KeyError, ValueError, TypeError):
pass

return cls(name, state, rank,
tuition, acceptance_rate, sat_range, act_range,
engineering_rep_score, business_rep_score, cs_rep_score, nursing_rep_score)

def __iter__(self):
yield self.__rank
yield self.__name
yield self.__state
yield self.__tuition
yield self.__acceptance_rate
yield self.__sat_range
yield self.__act_range
yield self.__engineering_rep_score
yield self.__business_rep_score
yield self.__cs_rep_score
yield self.__nursing_rep_score

def __str__(self):
return "name : {} \nstate : {} \nrank : {} \ntuition : {} \nacceptance rate : {} \nsat range : {} \n"\
"act range : {} \nengineering score : {} \nbusiness score : {} \n"\
"computer science score : {} \nnursing score : {}".format(self.__name, self.__state, self.__rank,
self.__tuition, self.__acceptance_rate,
self.__sat_range, self.__act_range,
self.__engineering_rep_score, self.__business_rep_score,
self.__cs_rep_score, self.__nursing_rep_score)

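`getFromJSON` above leans on repeated `try`/`except` blocks around nested dictionary lookups so that one missing field never aborts the whole record. A minimal standalone sketch of that defensive-parsing idea, using a hypothetical `safe_get` helper and made-up sample data (not the library's actual API):

```python
def safe_get(data, *keys, convert=None):
    # Walk nested dict keys; return None when any level is missing or malformed.
    try:
        for key in keys:
            data = data[key]
        return convert(data) if convert is not None else data
    except (KeyError, IndexError, TypeError, ValueError):
        return None

# Hypothetical payload shaped like the usnews JSON fields parsed above.
sample = {
    "institution": {"displayName": "Example University", "state": "CA"},
    "searchData": {"tuition": {"displayValue": "$52,000"}},
}

name = safe_get(sample, "institution", "displayName")
tuition = safe_get(sample, "searchData", "tuition", "displayValue",
                   convert=lambda s: float(s.replace("$", "").replace(",", "")))
rank = safe_get(sample, "parent", "sortRank", convert=int)  # key absent -> None
```

One helper replaces a dozen near-identical `try`/`except` blocks, at the cost of collapsing the distinct exception cases the class handles individually.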
79 changes: 79 additions & 0 deletions usnews_scrapper/table_data/main.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
// Initializing Bootstrap attributes
document.querySelectorAll("th").forEach(function(th){
th.setAttribute("scope", "col")
})

const table = document.querySelector("table")
table.classList.add("table")

const thead = document.querySelector("thead")
thead.classList.add("thead-dark")

const tBody = document.getElementsByTagName('tbody')[0]
tBody.querySelectorAll("th").forEach(function(th){
th.setAttribute("scope", "row")
})


/**
* Sorts a HTML table.
*
* @param {HTMLTableElement} table The table to sort
* @param {number} column The index of the column to sort
* @param {boolean} asc Determines if the sorting will be in ascending order
*/

function sortTableByColumn(table, column, asc=true){
const dirModifier = asc ? 1 : -1
const tBody = document.getElementsByTagName('tbody')[0]
const rows = Array.from(tBody.querySelectorAll("tr"))

const regInteger = /^[0-9]*$/
const regFloat = /^[+-]?\d+(\.\d+)?$/
const columnValues = Array.from(tBody.rows).map(row => row.cells[column].textContent.trim())
// Treat the column as numeric only if every non-"N/A" value parses as a number
const columnDataTypeIsNum = columnValues.every(value => value === "N/A" || regInteger.test(value) || regFloat.test(value))

const sortedRows = rows.sort(function(a, b){
const aColText = a.querySelector(`td:nth-child(${column + 1})`).textContent.trim()
const bColText = b.querySelector(`td:nth-child(${column + 1})`).textContent.trim()

// "N/A" values always sort last, regardless of direction
if (aColText === "N/A") return 1
if (bColText === "N/A") return -1

if (columnDataTypeIsNum){
return (Number(aColText) - Number(bColText)) >= 0 ? (1 * dirModifier) : (-1 * dirModifier)
}
return aColText >= bColText ? (1 * dirModifier) : (-1 * dirModifier)
})

changeTableOrder(tBody, sortedRows)
trackSortedColumn(table, column, asc)
}

function changeTableOrder(tableBody, newRows){
while (tableBody.firstChild){
tableBody.removeChild(tableBody.firstChild)
}
tableBody.append(...newRows)
}

function trackSortedColumn(table, column, asc){
table.querySelectorAll("th").forEach(function(th){
th.classList.remove("th-sort-asc", "th-sort-desc")
})
table.querySelector(`th:nth-child(${column + 1})`).classList.toggle("th-sort-asc", asc)
table.querySelector(`th:nth-child(${column + 1})`).classList.toggle("th-sort-desc", !asc)
}


document.querySelectorAll("th").forEach(function(headerCell){
headerCell.addEventListener("click", function(){
const tableElement = headerCell.parentElement.parentElement.parentElement
const headerIndex = Array.prototype.indexOf.call(headerCell.parentElement.children, headerCell)
const currentIsAscending = headerCell.classList.contains("th-sort-asc")

sortTableByColumn(tableElement, headerIndex, !currentIsAscending)
})
})
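The comparator rules in `sortTableByColumn` ("N/A" rows sink to the bottom, numeric columns compare as numbers, everything else as strings, with a direction modifier) can be sketched in Python; the row data here is invented for illustration:

```python
import re
from functools import cmp_to_key

# Matches signed integers and decimals, mirroring the regexes in main.js.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")

def compare(a, b, column=1, asc=True):
    direction = 1 if asc else -1
    x, y = a[column].strip(), b[column].strip()
    if x == "N/A":
        return 1   # "N/A" rows always sink to the bottom
    if y == "N/A":
        return -1
    if NUMERIC.match(x) and NUMERIC.match(y):
        return direction if float(x) >= float(y) else -direction
    return direction if x >= y else -direction

rows = [["B", "12"], ["A", "N/A"], ["C", "3.5"]]
ordered = sorted(rows, key=cmp_to_key(lambda a, b: compare(a, b)))
# numeric order ("3.5" before "12"), with the "N/A" row last
```

As in the JS version, the comparator returns a fixed sign for "N/A" so those rows stay at the bottom whether the sort is ascending or descending.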
19 changes: 19 additions & 0 deletions usnews_scrapper/table_data/styles.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
*,
::after,
::before {
margin: 0;
padding: 0;
box-sizing: border-box;
}

th:hover {
cursor: pointer;
}

.th-sort-asc::after {
content: "\25b4";
}

.th-sort-desc::after {
content: "\25be";
}