PROJECT STRUCTURE:
NRT_pipeline
  : *.py scripts for real-time consuming of data from HDFS using the Kafka2HDFS mechanism
runAllScripts.sh
  : main shell runner
scenarios/
  : collection of .py scripts for offline consuming of HDFS data, followed by data preprocessing via the pyspark lib
  new_data_slice.py
    : makes a data slice from the external HDFS store
  scenario_base.py
  sources_update.py
    : data updater: loads the necessary tables from Oracle
  scenario_stats.py
    : calculates overall statistics after all clickstream scenarios in /scenarios/ finish
ga_all_snenarios_insert.py
  : script for inserting processed data into Hive
mail_sender.py
  : class for auto-emailing and message broadcasting
tools.py
  : config with global variables / JSON definitions
export_to_iskra.py
  : JDBC data loader from Hive to Oracle
Hive_External_Tbl
  : pyspark script for working with external HDFS partitions via HiveQL; handles partition auto-sync
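As an illustration of the mail_sender.py role above, here is a minimal sketch of an auto-emailing class built on the Python standard library. This is an assumption about its shape, not the actual implementation; the SMTP host, sender address, and recipient list are placeholder values.

```python
import smtplib
from email.message import EmailMessage


class MailSender:
    """Hypothetical sketch of a mail_sender.py-style broadcaster."""

    def __init__(self, host: str, sender: str, recipients: list[str]):
        # Placeholder connection settings (assumptions, not real config).
        self.host = host
        self.sender = sender
        self.recipients = recipients

    def build_message(self, subject: str, body: str) -> EmailMessage:
        # Assemble a plain-text message addressed to every recipient.
        msg = EmailMessage()
        msg["From"] = self.sender
        msg["To"] = ", ".join(self.recipients)
        msg["Subject"] = subject
        msg.set_content(body)
        return msg

    def broadcast(self, subject: str, body: str) -> None:
        # Open one SMTP session and deliver the message to all recipients.
        with smtplib.SMTP(self.host) as smtp:
            smtp.send_message(self.build_message(subject, body))
```

A pipeline step such as scenario_stats.py could then call `MailSender(...).broadcast("stats ready", "...")` after its run completes.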