common-tungsten

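common-tungsten is a syntactic-sugar component for Spark SQL. It hides the complex concepts that developers otherwise have to manage when writing Spark SQL programs, such as org.apache.spark.sql.Encoder and scala.reflect.api.TypeTags.TypeTag.
This matters most for Java: common-tungsten takes over these chores so that developers can concentrate on core functionality.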
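To make "managing an Encoder" concrete: in plain Spark, Java callers pass an explicit encoder such as `Encoders.STRING()`, `Encoders.bean(SomeType.class)` or `Encoders.kryo(SomeType.class)` to typed operations. The sketch below shows one way a wrapper might pick an encoder from a class token. It is an illustration only and not common-tungsten's actual logic: the `EncoderChoice` class, the `SampleBean` type and the selection rule are all hypothetical, and the code returns the name of the Spark factory call as a string so that it runs without Spark on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper (not part of common-tungsten): names the Spark encoder
// factory a wrapper might choose for a given element type.
final class EncoderChoice {
    private static final Map<Class<?>, String> BUILT_IN = new HashMap<>();
    static {
        // Real factory methods on org.apache.spark.sql.Encoders, named as text only.
        BUILT_IN.put(String.class, "Encoders.STRING()");
        BUILT_IN.put(Integer.class, "Encoders.INT()");
        BUILT_IN.put(Long.class, "Encoders.LONG()");
        BUILT_IN.put(Double.class, "Encoders.DOUBLE()");
        BUILT_IN.put(Boolean.class, "Encoders.BOOLEAN()");
    }

    // Assumed rule: built-in type first, then bean, then Kryo as a fallback.
    static String describe(Class<?> type) {
        String builtIn = BUILT_IN.get(type);
        if (builtIn != null) {
            return builtIn;
        }
        if (hasPublicNoArgConstructor(type)) {
            return "Encoders.bean(" + type.getSimpleName() + ".class)";
        }
        return "Encoders.kryo(" + type.getSimpleName() + ".class)";
    }

    private static boolean hasPublicNoArgConstructor(Class<?> type) {
        try {
            type.getConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}

// Hypothetical bean-style record type used only for the demo.
final class SampleBean {
    private String name;
    public SampleBean() {}
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

final class EncoderChoiceDemo {
    public static void main(String[] args) {
        System.out.println(EncoderChoice.describe(String.class));     // Encoders.STRING()
        System.out.println(EncoderChoice.describe(SampleBean.class)); // Encoders.bean(SampleBean.class)
        System.out.println(EncoderChoice.describe(UUID.class));       // Encoders.kryo(UUID.class)
    }
}
```

Without a wrapper, the caller has to make this choice by hand at every typed `map`, `flatMap` or `groupByKey` call, which is the bookkeeping the README says common-tungsten takes over.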


What common-tungsten does

The core of common-tungsten is a proxy for Dataset: DataSet/JDataSet. Following the proxy pattern, DataSet/JDataSet enhance Dataset in three ways:

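  1. Smoothly and transparently generate a suitable Tungsten encoder (org.apache.spark.sql.Encoder);
  2. Transparently generate scala.reflect.api.TypeTags.TypeTag (Java only);
  3. Wrap the strongly typed XXJoin and cogroup methods that Dataset itself does not provide, so that business teams can use them directly.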
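The proxy idea in the list above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not common-tungsten's implementation: `ListProxy` is a hypothetical class, a `java.util.List` stands in for the wrapped Spark Dataset, a `Class<T>` token stands in for the Encoder the proxy would supply, and the two-key-function `leftOuterJoin` signature is invented for the example (the README only shows a call with a single key function).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for the proxy pattern; NOT common-tungsten's code.
final class ListProxy<T> {
    private final List<T> delegate; // stands in for the wrapped Dataset<T>
    private final Class<T> type;    // stands in for the Encoder<T> kept by the proxy

    private ListProxy(List<T> delegate, Class<T> type) {
        this.delegate = delegate;
        this.type = type;
    }

    static <T> ListProxy<T> of(Class<T> type, List<T> rows) {
        return new ListProxy<>(new ArrayList<>(rows), type);
    }

    List<T> collect() {
        return new ArrayList<>(delegate);
    }

    // The element type is unchanged, so the proxy reuses its stored type token
    // and the caller never has to pass one.
    ListProxy<T> filter(Predicate<T> keep) {
        List<T> out = new ArrayList<>();
        for (T row : delegate) {
            if (keep.test(row)) {
                out.add(row);
            }
        }
        return new ListProxy<>(out, type);
    }

    // Typed left outer join: every left row appears at least once; rows with no
    // match on the right are paired with Optional.empty().
    <U, K> List<Map.Entry<T, Optional<U>>> leftOuterJoin(
            ListProxy<U> right, Function<T, K> leftKey, Function<U, K> rightKey) {
        Map<K, List<U>> index = new HashMap<>();
        for (U r : right.delegate) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        List<Map.Entry<T, Optional<U>>> out = new ArrayList<>();
        for (T l : delegate) {
            List<U> matches = index.get(leftKey.apply(l));
            if (matches == null) {
                out.add(new SimpleEntry<T, Optional<U>>(l, Optional.<U>empty()));
            } else {
                for (U m : matches) {
                    out.add(new SimpleEntry<T, Optional<U>>(l, Optional.of(m)));
                }
            }
        }
        return out;
    }
}

final class ListProxyDemo {
    public static void main(String[] args) {
        ListProxy<String> left = ListProxy.of(String.class, Arrays.asList("a", "bb", null))
                .filter(s -> s != null);
        ListProxy<Integer> right = ListProxy.of(Integer.class, Arrays.asList(1, 3));
        // Join each string to the integer equal to its length.
        for (Map.Entry<String, Optional<Integer>> row
                : left.leftOuterJoin(right, String::length, n -> n)) {
            System.out.println(row); // a=Optional[1], then bb=Optional.empty
        }
    }
}
```

The point of the sketch is the division of labour: the caller writes only lambdas and method references, while the wrapper carries the type information and implements the typed join.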

common-tungsten usage examples

Scala example:

def convert(str: String): T = ...
val dataSet = DataSets.readTextFromHdfs[T]("/your-file-path", convert _)
// Keep the result of the chain; a discarded filter/join would have no effect.
val joined = dataSet.filter(record => record != null)
  .leftOuterJoin(anotherDataSet, record => record.toString)

DataSets.writeToHdfs(joined, "/your-file-path")

Java example:

private T parseStr(String str) {...}
JDataSet<T> dataSet = JDataSets.readTextFromHdfs("/your-file-path", this::parseStr)
    .filter((record) -> record != null)
    .leftOuterJoin(anotherJDataSet, T::toString);

JDataSets.writeToHdfs(dataSet,"/your-file-path");

About

A devoted evangelist of Project Tungsten: a general-purpose Tungsten Encoder
