Skip to content

JordyForNow/BigDataProject

Repository files navigation

processing folder

The files in the main folder are used for processing the big data into small data. The main file here is monthly_host_country_referrals.py, which pre-processes referrer urls to just the host, and then aggregates the visits for referrers per country of origin.

hdsf_to_local.py uses pandas to transfer files from the HDFS to the normal filesystem on the cluster, without losing the file structure and naming. It does need to be run with Python3 and pandas installed to work.

post_processing folder

This folder contains the scripts to process the small data further by cleaning strings up further and grouping hosts. The main logic for this can be found in post_processing.py, with the order scripts applying its functions to different data.

For example, all_hosts_ever.py aggregates all referrers in the dataset, with all_hosts_post_processing.py applying the grouping to this data. With mapping.py the difference between these datasets can be turned into a mapping of original to grouped hostnames for easy comparison.

To apply the processing to the main data monthly_ch_post_processing.py is used, which keeps the country data alongside the referrers.

exploration folder

This folder contains scripts for examining the processed small data.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •