sql workload fails when writing the output file -- Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into results-data-gen.csv #196
Open · eddytruyen opened this issue on Dec 11, 2020 · 2 comments
Spark-Bench version (version number, tag, or git commit hash)
2.3.0
Hadoop is not installed.
Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)
2.4.6
Scala version on your cluster
Scala version 2.11.12
Your exact configuration file (with system details anonymized for security)
I am running the csv-parquet example without a Spark master, writing to local files on NFS:
spark-bench = {
  spark-submit-config = [{
    //spark-args = {
    //  executor-memory = "2147483648"
    //}
    //conf = {
    //  // Any configuration you need for your setup goes here, like:
    //  //spark.dynamicAllocation.enabled = "true"
    //}
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "results-data-gen.csv"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000
            cols = 14
            output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv"
            output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "file:///opt/bitnami/spark/spark_data/spark-bench-test/results-sql.csv"
        parallel = false
        repeat = 1
        workloads = [
          {
            name = "sql"
            input = ["file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv", "file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.parquet"]
            query = ["select * from input", "select c0, c22 from input where c0 < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
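For reference, the sql workload in the first suite does roughly the equivalent of the following read/query/write round trip (a minimal sketch under the assumption that spark-bench reads the generated CSV with a header and registers it under the table name input; this is not the project's actual implementation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sql-workload-sketch").getOrCreate()

// Read the generated k-means data and expose it under the table name used in the query.
val kmeansData = spark.read
  .option("header", "true")        // assumption: spark-bench writes/reads a header row
  .option("inferSchema", "true")
  .csv("file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.csv")
kmeansData.createOrReplaceTempView("input")

// Run the configured query and write the result out as Parquet.
spark.sql("select * from input")
  .write.mode("overwrite")
  .parquet("file:///opt/bitnami/spark/spark_data/spark-bench-test/kmeans-data.parquet")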
Relevant stacktrace
Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/opt/bitnami/spark/spark_data/spark-bench/results-data-gen.csv: `total_runtime`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:65)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:665)
at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.writeToDisk(SparkFuncs.scala:104)
at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:88)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Description of your problem and any other relevant info
It seems the output file results-data-gen.csv cannot be written by the sql workload.
The same error also appears in a Spark cluster that uses HDFS instead of NFS.
Note that this error does not appear when running the sql workload separately on already-generated CSV and Parquet data.
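For what it's worth, the exception itself is a generic Spark check: Spark refuses to write out any DataFrame whose schema contains the same column name more than once. The spark-shell sketch below (illustrative only, not spark-bench's actual code; the column name total_runtime and the output path are placeholders taken from the error message) reproduces the same AnalysisException, which suggests the results DataFrame assembled in SuiteKickoff.run ends up carrying total_runtime twice before SparkFuncs.writeToDisk is called.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("duplicate-column-repro").getOrCreate()

// A DataFrame may carry two columns with the same name in memory...
val results = spark.range(3).selectExpr("id as total_runtime", "id as total_runtime")
results.printSchema()  // total_runtime appears twice

// ...but writing it to disk triggers the duplicate-column check:
// org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into ...: `total_runtime`
results.write.mode("overwrite").csv("file:///tmp/results-data-gen.csv")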