Bug 1813037 - my-spark-app example does not run/is missing
Summary: my-spark-app example does not run/is missing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ISV Operators
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.z
Assignee: Alexandre Menezes
QA Contact: Tom Buskey
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-12 19:08 UTC by Tom Buskey
Modified: 2020-07-28 12:37 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Just a typo in the example's JSON input, pointing to a nonexistent file.
Clone Of:
Environment:
Last Closed: 2020-07-28 12:37:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
full oc logs --all-containers=true my-spark-app-1584038448633-driver (2.44 KB, text/plain)
2020-03-12 19:08 UTC, Tom Buskey


Links
Red Hat Product Errata RHBA-2020:3075 (last updated 2020-07-28 12:37:28 UTC)

Description Tom Buskey 2020-03-12 19:08:37 UTC
Created attachment 1669741 [details]
full oc logs --all-containers=true my-spark-app-1584038448633-driver

Description of problem:
The my-spark-app pod runs to completion, but does not produce the expected result.
The log shows that the .jar file it references doesn't exist.


Version-Release number of selected component (if applicable):
sparkoperator.v1.0.7   Apache Spark Operator   1.0.7                Succeeded
oc version
Client Version: openshift-clients-4.4.0-202003060720
Server Version: 4.4.0-0.nightly-2020-03-12-082023
Kubernetes Version: v1.17.1


How reproducible:
Always


Steps to Reproduce:
1. Subscribe to the spark operator
2. Create spark cluster
3. Create spark history server
4. Create spark application (a sketch of the example CR follows)
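
For reference, a minimal sketch of the example SparkApplication CR involved (field names are assumed from the radanalyticsio spark-operator examples and may not match the CRD exactly):

apiVersion: radanalytics.io/v1
kind: SparkApplication
metadata:
  name: my-spark-app
spec:
  mainClass: org.apache.spark.examples.SparkPi
  # the shipped example pointed at spark-examples_2.11-2.3.0.jar, which is what triggers this bug
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar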

Actual results:
oc get pod
NAME                                READY   STATUS      RESTARTS   AGE
my-spark-app-1584038448633-driver   0/1     Completed   0          15m
my-spark-app-submitter-sqzb9        1/1     Running     0          15m
my-spark-cluster-m-cm6jd            1/1     Running     0          27m
my-spark-cluster-w-gwhbc            1/1     Running     0          27m
my-spark-cluster-w-sr4zn            1/1     Running     0          27m
spark-operator-5dc5fd9944-vp8tc     1/1     Running     0          31m

oc logs --all-containers=true my-spark-app-1584038448633-driver  | grep Pi
+ exec /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.128.2.21 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
20/03/12 18:41:00 WARN SparkSubmit$$anon$2: Failed to load org.apache.spark.examples.SparkPi.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi



Expected results:
Pi is roughly 3.142155710778554

Additional info:

Comment 2 Alexandre Menezes 2020-03-16 16:16:27 UTC
Hello Tom,

If we drop the "| grep Pi" from your log command and look at the full messages, we can see that the app is trying to load spark-examples_2.11-2.3.0.jar.


That comes from: https://github.com/radanalyticsio/spark-operator/blob/41d7d77022b2956896fea0546122d6c1a68138a3/manifest/olm/crd/sparkclusteroperator.1.0.1.clusterserviceversion.yaml#L31


20/03/16 15:31:04 WARN DependencyUtils: Local jar /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar does not exist, skipping.
20/03/16 15:31:04 WARN SparkSubmit$$anon$2: Failed to load org.apache.spark.examples.SparkPi.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:806)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/03/16 15:31:04 INFO ShutdownHookManager: Shutdown hook called
20/03/16 15:31:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-1a9cc374-1c13-4740-9c17-3409ab18109c


But if we run "oc debug pod/my-spark-app-<pod-number>-driver", we can see what is actually in that path:

sh-4.2$ ls /opt/spark/examples/jars/
scopt_2.11-3.7.0.jar  spark-examples_2.11-2.4.5.jar

Those are coming from here: https://github.com/radanalyticsio/spark-operator/blob/41d7d77022b2956896fea0546122d6c1a68138a3/examples/app.yaml#L7

It seems the example is referencing an outdated jar file. I'm running some more tests to verify and will post the results here.
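
If so, the fix should just be pointing the example at the jar that actually ships in the image. A sketch of the corrected snippet (same assumed field names as above; the app.yaml linked earlier is authoritative):

spec:
  mainClass: org.apache.spark.examples.SparkPi
  # was .../spark-examples_2.11-2.3.0.jar, which does not exist in the image
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar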

Comment 3 Alexandre Menezes 2020-03-16 20:04:21 UTC
Default configurations on an OpenShift install can also fail, depending on what else is loaded on the cluster, because each executor requires 1 dedicated vCPU core:

NAME                                READY   STATUS        RESTARTS   AGE
my-spark-app-1584383030507-exec-3   0/1     Pending       0          19s
my-spark-app-1584383030507-exec-4   0/1     Pending       0          19s

Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
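
If adding CPU capacity isn't an option, an alternative sketch would be to lower the executor count in the CR so the requested cores fit the existing nodes (field names assumed, not verified against the CRD):

spec:
  executor:
    instances: 1   # fewer executors, since each one requests a full dedicated core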

But after upgrading the worker nodes to m5.2xlarge machines with 8 CPUs and fixing the jar file name, we get the correct output on the last line:

oc logs --all-containers=true  my-spark-app-1584387507724-driver -n spark | grep Pi
+ exec /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.128.2.15 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
20/03/16 19:38:41 INFO SparkContext: Submitted application: Spark Pi
20/03/16 19:39:22 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/03/16 19:39:22 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
20/03/16 19:39:22 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/03/16 19:39:22 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/03/16 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/03/16 19:39:24 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.359 s
20/03/16 19:39:24 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.597624 s
Pi is roughly 3.141395706978535

I'm contacting the developers about opening a PR to fix it. I'll post here when it's done.

Comment 4 Alexandre Menezes 2020-03-18 19:28:21 UTC
Hi Tom,

There is a PR to fix this, thanks to Trevor Mckay. Please follow up here: https://github.com/operator-framework/community-operators/pull/1361

Let me know if you need anything else on this case.

Thanks!

Comment 5 Tom Buskey 2020-03-19 14:25:49 UTC
Switched my yaml to use spark-examples_2.11-2.4.5.jar
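
For anyone following along, that change is a one-line edit of the jar path in the application yaml (field name assumed, as in the sketches above):

spec:
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar   # was ...2.11-2.3.0.jar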

Pi is roughly 3.142955714778574

 oc version
Client Version: openshift-clients-4.4.0-202003060720
Server Version: 4.4.0-0.nightly-2020-03-19-075457
Kubernetes Version: v1.17.1

Comment 8 errata-xmlrpc 2020-07-28 12:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3075

