Have you been able to launch jobs with Java? #39
I have the same problem.
I suppose it's the same problem.
@JonathanLoscalzo I'm running into the same issue. Were you able to solve it?
@jaskiratr Not yet. Instead, I have installed an instance of Spark in a Google Colab notebook with this code:
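Something along these lines (a minimal sketch; the Spark version, archive URL, and paths are assumptions on my part):

```python
# Colab notebook cell: lines prefixed with "!" run as shell commands (IPython syntax).
# The Spark/Hadoop versions and the mirror URL below are illustrative assumptions.
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
!tar xf spark-2.3.0-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark

# Tell findspark where the extracted distribution lives so pyspark becomes importable.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.0-bin-hadoop2.7"
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("colab").getOrCreate()
```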
I still haven't figured out whether we must install Spark locally in order to connect to a remote instance. It should be easier than that, but it isn't.
Not clear how you got this. The driver URL should be:
Sorry @Cricket007, when you said you had not got this, were you referring to the initial error or to how I use pyspark in Colab? Where do you write "--driver-url" when you run the containers? (It is a docker-compose file.) Could you explain in more detail? I have found this link
@JonathanLoscalzo I was referring to the OP. Your Colab setup is likely very different from running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in Colab.
You need the Spark client libraries, yes. Or you can install Apache Livy as a REST interface to submit Spark jobs.
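As a rough sketch of what a Livy submission looks like (the host, port, jar path, and class name here are placeholders, not from this setup), you POST a batch to its `/batches` endpoint:

```python
# Sketch of submitting a Spark job through Apache Livy's batch REST API.
# The Livy host/port and the jar/class names are illustrative placeholders.
import json
import requests

LIVY_URL = "http://livy-host:8998"  # 8998 is Livy's default port

payload = {
    "file": "/path/to/my-spark-job.jar",   # must be readable by the cluster
    "className": "com.example.MySparkJob",
    "args": ["arg1"],
}
resp = requests.post(
    f"{LIVY_URL}/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
batch = resp.json()

# Poll the batch state until the job reaches a terminal state.
state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()
print(state)
```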
Did you find my answer there? See if that network diagram answers any of your networking issues. (Make sure you can telnet/netcat between all the relevant ports.)
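For example, a quick reachability check from the driver machine (a sketch; the worker and driver endpoints are taken from the executor command below, and the master host is a placeholder):

```python
# Sketch of a port-reachability check between driver, master, and workers.
# Substitute the hosts/ports from your own setup.
import socket

endpoints = [
    ("spark-master", 7077),   # standalone master (placeholder host, default port)
    ("172.19.0.3", 8881),     # worker RPC port from the executor command
    ("yeikel-pc", 59906),     # driver port the executors must reach back to
]

for host, port in endpoints:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK   {host}:{port}")
    except OSError as err:
        print(f"FAIL {host}:{port} -> {err}")
```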
Thanks @Cricket007, I realized that I need Spark installed locally (or something along those lines) on the machine that runs the scripts (in this case, the machine running Jupyter). Now you have confirmed my issue 👍. I suppose it is the same issue as @yeikel's (?)
I didn't use Apache Livy; do you recommend it?
I don't know what "DataProc" in GCP is. Is it like Databricks for Azure? (I will check it tomorrow.) Thanks for your answer!
DataProc is the managed Hadoop/Spark service from Google. Amazon and Azure have similar offerings, if that's what you want. Databricks is purely Spark; if you want more than that, Qubole is another option. If all you really want is to learn Spark locally, either extract it locally or use a VM, simply because the networking is easier and the way you would install an actual cluster would not be in containers (and there are plenty of ways to automate the installation, such as Apache Ambari or Ansible). Otherwise, the Cloudera/Hortonworks sandboxes work fine.
I've used it indirectly via the Hue interface, but it was fairly straightforward to set up. And I personally use Zeppelin over Jupyter because Spark (Scala) is more tightly integrated, though it handles Python fine.
Hi,
I am running Spark with the following configuration:
And I have the following simple Java program:
The problem that I am having is that it generates the following command:
```
Spark Executor Command: "/usr/jdk1.8.0_131/bin/java" "-cp" "/conf:/usr/spark-2.3.0/jars/*:/usr/hadoop-2.8.3/etc/hadoop/:/usr/hadoop-2.8.3/etc/hadoop/*:/usr/hadoop-2.8.3/share/hadoop/common/lib/*:/usr/hadoop-2.8.3/share/hadoop/common/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/*:/usr/hadoop-2.8.3/share/hadoop/tools/lib/*" "-Xmx1024M" "-Dspark.driver.port=59906" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906" "--executor-id" "6" "--hostname" "172.19.0.3" "--cores" "2" "--app-id" "app-20180401005243-0000" "--worker-url" "spark://Worker@172.19.0.3:8881"
```
Which results in: