Have you been able to launch jobs with Java? #39

Open
yeikel opened this issue Apr 1, 2018 · 11 comments

Comments

@yeikel

yeikel commented Apr 1, 2018

Hi,

I am running Spark with the following configuration:

version: '2'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7006
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 7016
      - 8881
    ports:
      - 8081:8081

And I have the following simple Java program:

SparkConf conf = new SparkConf().setMaster("spark://localhost:7077").setAppName("Word Count Sample App");
conf.set("spark.dynamicAllocation.enabled","false");
String file = "test.txt";
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> textFile = sc.textFile("src/main/resources/" + file);
JavaPairRDD<String, Integer> counts = textFile
        .flatMap(s -> Arrays.asList(s.split("[ ,]")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
counts.foreach(p -> System.out.println(p));
System.out.println("Total words: " + counts.count());
counts.saveAsTextFile(file + "out.txt");

The problem is that it generates the following executor command:

Spark Executor Command: "/usr/jdk1.8.0_131/bin/java" "-cp" "/conf:/usr/spark-2.3.0/jars/*:/usr/hadoop-2.8.3/etc/hadoop/:/usr/hadoop-2.8.3/etc/hadoop/*:/usr/hadoop-2.8.3/share/hadoop/common/lib/*:/usr/hadoop-2.8.3/share/hadoop/common/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/*:/usr/hadoop-2.8.3/share/hadoop/hdfs/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/lib/*:/usr/hadoop-2.8.3/share/hadoop/yarn/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/*:/usr/hadoop-2.8.3/share/hadoop/tools/lib/*" "-Xmx1024M" "-Dspark.driver.port=59906" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906" "--executor-id" "6" "--hostname" "172.19.0.3" "--cores" "2" "--app-id" "app-20180401005243-0000" "--worker-url" "spark://[email protected]:8881"

Which results in

Caused by: java.io.IOException: Failed to connect to yeikel-pc:59906
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: yeikel-pc
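
The trace above suggests that the executor, running inside the worker container, cannot resolve the host name taken from the `--driver-url` argument. A minimal sketch for checking that from any machine follows; the helper names `driver_endpoint` and `can_resolve` are hypothetical and only the Python standard library is used.

```python
import socket
from urllib.parse import urlsplit


def driver_endpoint(driver_url):
    """Return (host, port) parsed from a Spark driver URL such as
    spark://CoarseGrainedScheduler@yeikel-pc:59906."""
    parts = urlsplit(driver_url)
    return parts.hostname, parts.port


def can_resolve(host):
    """Return True if this machine can resolve `host` to an IP address."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
        return False


host, port = driver_endpoint("spark://CoarseGrainedScheduler@yeikel-pc:59906")
print(host, port, "resolvable" if can_resolve(host) else "not resolvable")
```

Running this inside the worker container (assuming Python is available there) would show whether `yeikel-pc` is resolvable from where the executor starts.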


@JonathanLoscalzo

I have the same problem.
If you look at stdout instead of stderr:

2019-02-28 03:02:00 INFO  CoarseGrainedExecutorBackend:2566 - Started daemon with process name: 436@worker
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for TERM
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for HUP
2019-02-28 03:02:00 INFO  SignalUtils:54 - Registered signal handler for INT
2019-02-28 03:02:00 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing view acls to: root,jloscalzo
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing modify acls to: root,jloscalzo
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-02-28 03:02:00 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-02-28 03:02:00 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, jloscalzo); groups with view permissions: Set(); users  with modify permissions: Set(root, jloscalzo); groups with modify permissions: Set()

I suppose it's the same problem.
I have been trying to run some simple code from a Jupyter notebook:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# conf = SparkConf().setMaster("http://localhost:7077").setAppName("prueba")
# sc = SparkContext(conf=conf)
spark = SparkSession.builder.master("spark://localhost:7077").config('spark.submit.deployMode', 'client').appName("example").getOrCreate()
sc = spark.sparkContext

# this doesn't execute:
sc.parallelize([1,2,3,4]).sumApprox(1)
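
One thing that may be worth checking in a setup like this: in client mode the executors connect back to the driver, so the driver has to advertise an address and ports that the worker container can reach. The sketch below collects the relevant settings; the property names are real Spark configuration keys, but the function names, the default port numbers, and the example address are assumptions, and this has not been verified against this image.

```python
def client_mode_driver_conf(driver_host, driver_port=7001, block_manager_port=7005):
    """Spark settings that pin the address and ports the driver advertises.

    `driver_host` must be an address the worker container can reach.
    The default port numbers are placeholders, not values from this thread.
    """
    return {
        "spark.driver.host": driver_host,
        "spark.driver.bindAddress": "0.0.0.0",
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
    }


def build_session(master, app_name, settings):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without pyspark

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical address of the notebook machine as seen from the containers.
settings = client_mode_driver_conf("192.168.1.10")
print(settings)
```

With pyspark installed, `build_session("spark://localhost:7077", "example", settings)` would apply these settings before the context starts.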

@jaskiratr

@JonathanLoscalzo I'm running into the same issue. Were you able to solve it?

@JonathanLoscalzo

@jaskiratr Not yet.
Maybe the problem is that we need to install Spark locally as a master, but I haven't tested it.

Instead, I installed a Spark instance in a Google Colab notebook with this code:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"

I haven't figured out whether we must install Spark locally to connect to a remote instance.

It should be easier, but it isn't.

@OneCricketeer

OneCricketeer commented Jan 20, 2020

Not clear how you got this.

"--driver-url" "spark://CoarseGrainedScheduler@yeikel-pc:59906"

The driver url should be spark://master:7077 if you mount your JAR into the worker container and run spark-submit from there rather than your host machine.
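
A minimal sketch of that approach, building the `docker exec` command that runs `spark-submit` inside a container. The container name, JAR path, and class name are hypothetical, and the relative `bin/spark-submit` path assumes the container's working directory is the Spark home (as the `bin/spark-class` commands in the compose file suggest).

```python
import shlex


def spark_submit_in_container(container, jar_path, main_class,
                              master="spark://master:7077"):
    """Build a `docker exec` command that runs spark-submit inside `container`.

    `jar_path` is the path of the JAR as seen inside the container,
    so the JAR has to be mounted or copied there first.
    """
    return [
        "docker", "exec", container,
        "bin/spark-submit",
        "--master", master,
        "--class", main_class,
        jar_path,
    ]


# Hypothetical container name, JAR path, and main class.
cmd = spark_submit_in_container(
    "spark_worker_1", "/jobs/wordcount.jar", "com.example.WordCount"
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it; because the driver then runs inside the Compose network, the executors should be able to resolve it by container host name.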

@JonathanLoscalzo

Sorry @Cricket007, when you said it was not clear, were you referring to the initial error or to how I use pyspark in Colab?

Where do you set "--driver-url" when you run the containers? (This is a docker-compose file.)

Could you explain in more detail? I have found this link

@OneCricketeer

OneCricketeer commented Jan 20, 2020

@JonathanLoscalzo I was referring to OP.

Your colab setup is likely very different than running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in CoLab

@OneCricketeer

I haven't figured out whether we must install Spark locally to connect to a remote instance

You need Spark client libraries, yes. Or you can docker exec or ssh to somewhere else that does.

Or you can install Apache Livy as a REST interface to submit Spark jobs
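
As a rough illustration of the Livy route, the sketch below prepares a batch submission for Livy's `POST /batches` endpoint without sending it. The Livy URL (8998 is Livy's default port), the JAR location, and the class name are assumptions, and the function name `livy_batch_request` is hypothetical.

```python
import json
import urllib.request


def livy_batch_request(livy_url, jar_path, main_class, args=()):
    """Prepare (but do not send) a Livy batch submission request.

    `jar_path` must be a location the Livy server and cluster can read.
    """
    payload = {"file": jar_path, "className": main_class, "args": list(args)}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Livy endpoint, JAR location, and main class.
req = livy_batch_request(
    "http://localhost:8998", "/jobs/wordcount.jar", "com.example.WordCount"
)
print(req.get_method(), req.full_url)
```

Calling `urllib.request.urlopen(req)` would submit the job. The client then needs only HTTP access to Livy, not a local Spark installation.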

@OneCricketeer

I have found this link

Did you find my answer there? See if that network diagram answers any of your networking issues. (Make sure you can telnet / netcat between all relevant ports.)
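
Where telnet or netcat is not installed, a similar reachability check can be sketched with Python's standard library. The host and port pairs in the usage lines are taken from the compose file earlier in the thread and are only examples; the function names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def check_ports(targets, timeout=2.0):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {(host, port): port_is_open(host, port, timeout) for host, port in targets}


# Example targets: the published master ports from the compose file.
for target, is_open in check_ports([("localhost", 7077), ("localhost", 8080)]).items():
    print(target, "open" if is_open else "closed")
```

The check is only meaningful when run from each side of the connection, for example from the host towards the master and from the worker container towards the driver.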

@JonathanLoscalzo

JonathanLoscalzo commented Jan 21, 2020

I haven't figured out whether we must install Spark locally to connect to a remote instance

You need Spark client libraries, yes. Or you can docker exec or ssh to somewhere else that does.

Thanks @Cricket007. I realized that I need Spark installed locally, or something similar, to run the scripts (in this case, on the machine that was running Jupyter). You have now confirmed my issue 👍. I suppose it is also @yeikel's issue (?)

Or you can install Apache Livy as a REST interface to submit Spark jobs

I haven't used Apache Livy. Do you recommend it?

@JonathanLoscalzo I was referring to OP.

Your colab setup is likely very different than running Docker Compose, and I would suggest DataProc in the GCP environment rather than doing anything manual in CoLab

I don't know what "DataProc" in GCP is. Is it like Databricks for Azure? (I will check it tomorrow.)
For testing purposes, Colab is good enough, I suppose (testing scripts, or teaching pyspark syntax). I don't know whether it is related to the issue, but could you recommend some approaches to using Spark in a "development stage"?

Thanks for your answer!

@OneCricketeer

DataProc is the managed Hadoop/Spark service by Google. Amazon and Azure have similar offerings, if that's what you want.

Databricks is purely Spark. If you want more than that, Qubole is another option.

If all you really want is to learn Spark locally, either extract it locally or use a VM, simply because networking is easier and an actual cluster would not be installed in containers. (There are plenty of ways to automate the installation, such as Apache Ambari or Ansible.)

Otherwise, the Cloudera/Hortonworks Sandboxes work fine.

@OneCricketeer

OneCricketeer commented Jan 21, 2020

I haven't used Apache Livy. Do you recommend it?

I've used it indirectly via the HUE interface, but it was fairly straightforward to set up.

And I personally use Zeppelin over Jupyter because Spark (Scala) is more tightly integrated, though it can handle Python fine.
