Created attachment 1669741 [details]
full oc logs --all-containers=true my-spark-app-1584038448633-driver

Description of problem:
The my-spark-app pod runs to completion, but does not produce an event. The log shows that the .jar file it was asked to run doesn't exist.

Version-Release number of selected component (if applicable):
sparkoperator.v1.0.7   Apache Spark Operator   1.0.7   Succeeded

oc version
Client Version: openshift-clients-4.4.0-202003060720
Server Version: 4.4.0-0.nightly-2020-03-12-082023
Kubernetes Version: v1.17.1

How reproducible:
Always

Steps to Reproduce:
1. Subscribe to the Spark operator
2. Create a Spark cluster
3. Create a Spark history server
4. Create a Spark application (a command sketch follows this comment)

Actual results:
oc get pod
NAME                                READY   STATUS      RESTARTS   AGE
my-spark-app-1584038448633-driver   0/1     Completed   0          15m
my-spark-app-submitter-sqzb9        1/1     Running     0          15m
my-spark-cluster-m-cm6jd            1/1     Running     0          27m
my-spark-cluster-w-gwhbc            1/1     Running     0          27m
my-spark-cluster-w-sr4zn            1/1     Running     0          27m
spark-operator-5dc5fd9944-vp8tc     1/1     Running     0          31m

oc logs --all-containers=true my-spark-app-1584038448633-driver | grep Pi
+ exec /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.128.2.21 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
20/03/12 18:41:00 WARN SparkSubmit$$anon$2: Failed to load org.apache.spark.examples.SparkPi.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi

Expected results:
Pi is roughly 3.142155710778554

Additional info:
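For reference, the steps above map roughly to the commands below. This is only a sketch: it assumes the operator was subscribed through OperatorHub/OLM and that the example CRs from the radanalyticsio/spark-operator repo are checked out locally; the history-server file name in particular is an assumption.

# (2) Create the Spark cluster from the repo's example CR
oc create -f examples/cluster.yaml
# (3) Create the Spark history server (file name is an assumption)
oc create -f examples/history-server.yaml
# (4) Create the Spark application; this CR references the SparkPi jar
oc create -f examples/app.yaml

# Watch the driver pod, then inspect its log once it completes
oc get pods -w
oc logs --all-containers=true my-spark-app-<id>-driver | grep Pi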
Hello Tom,

If we remove the "| grep Pi" from your log command and look at the full output, we can see that the app is trying to load spark-examples_2.11-2.3.0.jar. That comes from:

https://github.com/radanalyticsio/spark-operator/blob/41d7d77022b2956896fea0546122d6c1a68138a3/manifest/olm/crd/sparkclusteroperator.1.0.1.clusterserviceversion.yaml#L31

20/03/16 15:31:04 WARN DependencyUtils: Local jar /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar does not exist, skipping.
20/03/16 15:31:04 WARN SparkSubmit$$anon$2: Failed to load org.apache.spark.examples.SparkPi.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:806)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/03/16 15:31:04 INFO ShutdownHookManager: Shutdown hook called
20/03/16 15:31:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-1a9cc374-1c13-4740-9c17-3409ab18109c

But if we run "oc debug pod/my-spark-app-<pod-number>-driver" we can see what is actually in that path:

sh-4.2$ ls /opt/spark/examples/jars/
scopt_2.11-3.7.0.jar  spark-examples_2.11-2.4.5.jar

Those jars come from here:

https://github.com/radanalyticsio/spark-operator/blob/41d7d77022b2956896fea0546122d6c1a68138a3/examples/app.yaml#L7

So the submit is looking for an outdated jar file: the CR references 2.3.0 while the image ships 2.4.5. I'm doing some more tests just to verify and will post the results here.
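As a workaround until the manifest is fixed, the jar path in the SparkApplication CR can be pointed at the version that actually ships in the image. A minimal sketch, assuming the CR shape from the repo's examples/app.yaml (the apiVersion and surrounding field names are taken from those examples and may differ; the only change needed is the jar version):

# Sketch only: field names follow the radanalyticsio examples and are
# not verified against this exact operator version.
apiVersion: radanalytics.io/v1
kind: SparkApplication
metadata:
  name: my-spark-app
spec:
  mainClass: org.apache.spark.examples.SparkPi
  # was: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar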
Default configurations on an OpenShift install can also fail, depending on what else is loaded on the cluster, because each executor requires one dedicated vCPU core:

NAME                                READY   STATUS    RESTARTS   AGE
my-spark-app-1584383030507-exec-3   0/1     Pending   0          19s
my-spark-app-1584383030507-exec-4   0/1     Pending   0          19s

Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.

But after upgrading the worker nodes to m5.2xlarge machines with 8 CPUs and fixing the jar file name, we get the correct output on the last line:

oc logs --all-containers=true my-spark-app-1584387507724-driver -n spark | grep Pi
+ exec /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.128.2.15 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
20/03/16 19:38:41 INFO SparkContext: Submitted application: Spark Pi
20/03/16 19:39:22 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/03/16 19:39:22 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
20/03/16 19:39:22 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/03/16 19:39:22 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/03/16 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/03/16 19:39:24 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.359 s
20/03/16 19:39:24 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.597624 s
Pi is roughly 3.141395706978535

I'm contacting the developers to open a PR fixing it. I'll post here when it's done.
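For anyone hitting the scheduling half of this, the CPU pressure is easy to confirm before resizing nodes. A quick sketch (the pod name is the one from the output above):

# Show the scheduler events for the stuck executor pod
oc describe pod my-spark-app-1584383030507-exec-3

# Compare requested vs. allocatable CPU across the workers
oc describe nodes | grep -A 5 "Allocated resources"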
Hi Tom,

There is a PR to fix this, thanks to Trevor Mckay. Please follow up here:

https://github.com/operator-framework/community-operators/pull/1361

Let me know if you need anything else on this case. Thanks!
Switched my YAML to use spark-examples_2.11-2.4.5.jar:

Pi is roughly 3.142955714778574

oc version
Client Version: openshift-clients-4.4.0-202003060720
Server Version: 4.4.0-0.nightly-2020-03-19-075457
Kubernetes Version: v1.17.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3075