debu999/spark-training

rpsconsulting spark training code

  • sample command: `spark-submit --master "local[*]" sparkcore\spark_map.py`

  • important difference: `sortBy` works on any RDD and sorts by an arbitrary key function; `sortByKey` works only on pair RDDs and always sorts by the key.

  • Spark shuffle operations (e.g. `repartition`) trigger new stages.

  • Executors are created when a job is submitted.

```
df = spark.read.option("multiLine", True).json("C:/spark-training/samples/sample.json")
df.show()

+----+-------+-----+
| age|   name|pcode|
+----+-------+-----+
|null|  Alice|94304|
|  30|Brayden|94304|
|  19|  Carla|10036|
|  46|  Diana| null|
|null|Étienne|94104|
+----+-------+-----+
```

  • Use `--master` to set the master on the command line instead of in code; a master set in code takes precedence over `--master`.

```
df_sel = df.select(["name", "pcode"])
df_sel.show()

+-------+-----+
|   name|pcode|
+-------+-----+
|  Alice|94304|
|Brayden|94304|
|  Carla|10036|
|  Diana| null|
|Étienne|94104|
+-------+-----+
```

  • DataFrame operations: `select` / `take` / `first` / `count` / `show` / `printSchema` / `collect` / `write`

```
userDF2.write.option("header", True).csv("c:/test-out-csv-2")

userDF2.write.format("csv").option("header", True).save("c:/test-out-csv-3")
```
