The files in the main folder are used to process the big data into small data. The main file here is monthly_host_country_referrals.py, which pre-processes referrer URLs down to just the host and then aggregates the referrer visits per country of origin.
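A minimal sketch of that idea is shown below; the column names (`referrer`, `country`, `visits`) and the use of pandas are assumptions for illustration, while the real script runs over the full dataset on the cluster.

```python
# Hedged sketch: reduce referrer URLs to their host and sum visits
# per (country, host). Column names are assumptions, not the real schema.
from urllib.parse import urlparse

import pandas as pd


def url_to_host(url: str) -> str:
    """Keep only the host part of a referrer URL,
    e.g. 'https://en.wikipedia.org/wiki/X' -> 'en.wikipedia.org'."""
    if "://" not in url:
        url = "//" + url  # let urlparse recognise the network location
    return urlparse(url).netloc.lower()


def aggregate_host_country(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate visits per country and referrer host."""
    df = df.assign(host=df["referrer"].map(url_to_host))
    return df.groupby(["country", "host"], as_index=False)["visits"].sum()
```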
hdsf_to_local.py uses pandas to transfer files from HDFS to the regular filesystem on the cluster, preserving the file structure and naming. It needs to be run with Python 3 and pandas installed.
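A rough idea of such a transfer, assuming CSV files under a known HDFS prefix and that pyarrow's libhdfs bindings are available on the cluster; the paths and file format are placeholders, not the script's actual settings.

```python
# Hedged sketch of an HDFS-to-local copy that mirrors the directory structure.
import os

import pandas as pd
from pyarrow import fs

HDFS_PREFIX = "/user/example/small_data"   # hypothetical source directory
LOCAL_PREFIX = "./small_data"              # hypothetical local mirror

hdfs = fs.HadoopFileSystem(host="default")  # use the cluster's Hadoop config

for info in hdfs.get_file_info(fs.FileSelector(HDFS_PREFIX, recursive=True)):
    if info.type != fs.FileType.File:
        continue
    # Re-create the same relative path and file name locally.
    rel_path = os.path.relpath(info.path, HDFS_PREFIX)
    local_path = os.path.join(LOCAL_PREFIX, rel_path)
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with hdfs.open_input_file(info.path) as f:
        pd.read_csv(f).to_csv(local_path, index=False)
```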
This folder contains the scripts that process the small data further by cleaning up the host strings and grouping related hosts. The main logic can be found in post_processing.py, with the other scripts applying its functions to different data.
For example, all_hosts_ever.py aggregates all referrers in the dataset, and all_hosts_post_processing.py applies the grouping to that data. With mapping.py, the difference between these two datasets can be turned into a mapping from original to grouped hostnames for easy comparison.
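To illustrate the kind of functions involved (the actual names and rules in post_processing.py and mapping.py may differ): cleaning might strip noise such as a leading "www.", grouping might collapse related hosts onto a common label, and the mapping then pairs each original host with its grouped form.

```python
# Hedged sketch of cleaning, grouping and mapping; function names and
# grouping rules are illustrative, not the actual post_processing.py API.

def clean_host(host: str) -> str:
    """Normalise a host string, e.g. 'WWW.Google.COM' -> 'google.com'."""
    host = host.strip().lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def group_host(host: str) -> str:
    """Collapse related hosts onto one label, e.g. 'en.m.wikipedia.org' -> 'wikipedia'."""
    host = clean_host(host)
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host  # crude second-level-domain rule


def build_mapping(original_hosts):
    """Build a dict from each original hostname to its grouped form."""
    return {host: group_host(host) for host in original_hosts}
```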
To apply the processing to the main data, monthly_ch_post_processing.py is used; it keeps the country data alongside the referrers.
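In the same spirit, a per-country version keeps the country column while summing visits over the grouped hosts (again with assumed column names, reusing the `group_host` sketch above):

```python
# Hedged sketch of re-aggregating the per-country data with grouped hosts;
# the columns 'country', 'host' and 'visits' are assumptions about the schema.
import pandas as pd


def regroup_by_country(df: pd.DataFrame) -> pd.DataFrame:
    """Sum visits per (country, grouped host), keeping the country dimension."""
    df = df.assign(grouped_host=df["host"].map(group_host))
    return df.groupby(["country", "grouped_host"], as_index=False)["visits"].sum()
```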
This folder contains scripts for examining the processed small data.