Bug 2011654

Summary: OC deploys causing error: "Invalid or corrupt jarfile"
Product: OpenShift Container Platform
Reporter: Vinya Nema <vnema>
Component: Build
Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED DUPLICATE
QA Contact: Jitendar Singh <jitsingh>
Severity: medium
Priority: unspecified
Version: 4.6
CC: aos-bugs, dwalsh, gmontero, jokerman, pbhattac, spandura, tsweeney
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-10-27 13:00:19 UTC
Type: Bug

Description Vinya Nema 2021-10-07 01:39:14 UTC
Description of problem:
- The apps are built and deployed with "oc new-build" or "oc start-build", which could be causing this error.
- In doing so, the customer has started to encounter random failures, around one out of every four apps failing to deploy due to invalid or corrupt jarfiles.

The error message:
~~~
[e411mxml@ocp-web-install:~ #] oc logs pod/dev1-fedonedocgen-currentview-messaging-34-xzx9c
Starting the Java application using /opt/jboss/container/java/run/run-java.sh ...
INFO exec  java -XX:ParallelGCThreads=2 -Xmx9600m -Dcom.instana.agent.jvm.name=dev1-fedonedocgen-currentview-messaging -javaagent:/usr/share/java/jolokia-jvm-agent/jolokia-jvm.jar=config=/opt/jboss/container/jolokia/etc/jolokia.properties -Xms1250m -XX:+UseParallelOldGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:MaxMetaspaceSize=300m -XX:ParallelGCThreads=1 -Djava.util.concurrent.ForkJoinPool.common.parallelism=1 -XX:CICompilerCount=2 -XX:+ExitOnOutOfMemoryError -cp "." -jar /deployments/dev1-fedonedocgen-currentview-messaging.jar
Error: Invalid or corrupt jarfile /deployments/dev1-fedonedocgen-currentview-messaging.jar
~~~

Version-Release number of selected component (if applicable):
4.6.18

Attachments: http://collab-shell.usersys.redhat.com/03020274/


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results: Deployment of the app completes on the first attempt


Additional info:
The customer is able to take the same jar files and use them elsewhere without any problems.
Also, after multiple deploy attempts the error stops occurring and the app deployment completes successfully.

Comment 1 Gabe Montero 2021-10-14 19:34:36 UTC
OK it took a decent amount of digging in the customer case, but I was able to cobble together a few things:

1) this is a binary build situation:

spec:
  failedBuildsHistoryLimit: 5
  nodeSelector: null
  output:
    to:
      kind: ImageStreamTag
      name: prod-mk-proposalgenerator-messaging:latest
  postCommit: {}
  resources: {}
  runPolicy: Serial
  source:
    binary: {}
    type: Binary
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: openjdk18-openshift:fedprod
    type: Source
  successfulBuildsHistoryLimit: 5


so an upload of contents from the local file system, presumably including the jar file in question, seems probable,
especially given the name 'dev1-fedonedocgen-currentview-messaging.jar'

I *doubt* that jar comes from the openjdk18-openshift imagestream; it seems more likely to come from the prod-mk-proposalgenerator-messaging imagestream

2) So, a quick refresher on how binary builds work

a) the 'oc start-build' "transfers" the data to the api server, which then streams it to the pod
b) exactly how this is done depends very much on the arguments supplied to the 'oc start-build'
c) for example, based on whether --from-file, --from-dir, --from-archive, or --from-repo is supplied, the 'oc start-build' will vary the upload mechanism
d) the transfer mechanisms include, for example, golang's http client and whatever git binary is installed on the local host
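The variants in (c) can be sketched as follows. The buildconfig name 'myapp' and the paths are hypothetical, and the whole thing is guarded so it is a no-op on a host without 'oc':

```shell
# Hypothetical buildconfig 'myapp'; each flag changes how 'oc start-build' packages the upload.
if command -v oc >/dev/null 2>&1; then
  oc start-build myapp --from-file=target/myapp.jar  # single file streamed to the apiserver
  oc start-build myapp --from-dir=target             # directory tarred locally, then streamed
  oc start-build myapp --from-archive=myapp.tar      # pre-built archive streamed as-is
  oc start-build myapp --from-repo=.                 # local git repo; uses the host's git binary
  mode="oc"
else
  mode="illustration-only"
fi
echo "$mode"
```

Which of these paths the customer hit is exactly the detail requested in (3) below.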

3) I see mention of running 'oc start-build' with trace in the customer case, but I see no evidence of that trace being provided; is that correct?  If we don't have that trace, we really need it.  And aside from the trace itself, we need to know the precise list of arguments supplied on the 'oc start-build' invocation.

4) I also need clarification on "cu is able to take the same jar files and use them at other places without any problems".  Does that mean they can do 'oc start-build' on some hosts and everything works fine, but on some host the image produced by 'oc start-build' results in the corrupt jar message?
Or do they just mean they copy that jar file and use it successfully in a fashion other than running the prod-mk-proposalgenerator-messaging:latest imagestreamtag in a Pod?  

If images produced by 'oc start-build' work from some local systems, but not others, that is a clue that one of the dependencies could be off on that system.

Or, if the version of java used when the jar file works is different than the version of java when the jar file does not, that could be a clue.

5) I did look at the must-gather initially provided, but saw no mention of the prod-mk-proposalgenerator-messaging build config in the controller manager or api server logs.  Now, that said, both are involved in the binary build process.  In fact, 'oc start-build' transfers the local data to the api server, which in turn transfers it to the build pod, which then produces the output image.  So there are plenty of opportunities for unexpected items to occur.  In particular, we have seen an issue with the 'tar' command that openshift picks up from RHEL impacting builds, as tar is used by the apiserver to stream data over the socket to the build pod.  So, go ahead and get a must-gather along with the 'oc start-build' information I asked for in 3.
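The tar hand-off described above can be exercised locally. This self-contained sketch (dummy file, no cluster needed) streams a file through a tar pipe the way the apiserver streams build input to the build pod, then compares checksums. Here both ends use the same local tar, so the checksums match; the suspected failure mode in this bug is the same round trip with a different tar version on each end:

```shell
# Create a stand-in for the jar; the interesting part is the tar round trip, not the contents.
printf 'dummy jar bytes' > app.jar
before=$(cksum app.jar | awk '{print $1}')
mkdir -p out
# Pack on one end of a pipe, unpack on the other, mirroring apiserver -> build pod streaming.
tar -cf - app.jar | (cd out && tar -xf -)
after=$(cksum out/app.jar | awk '{print $1}')
echo "before=$before after=$after"
```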

6) lastly, I see them mention old versions of the deployment.  I take it that means deployments that use older versions of the imagestreamtag prod-mk-proposalgenerator-messaging:latest, based on the dc yaml they provided.

if so, get them to use 'oc debug' to create test pods using the version of the imagestreamtag that works and ones with the imagestreamtag that fails.  In each of those debug pods, run cksum and 'ls -la' on the jar file in question, so we can compare the results with each other, as well as with the version of the jar file that is currently being picked up by 'oc start-build' for the failing imagestream tags.
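A sketch of that debug-pod comparison; the tag names 'good' and 'bad' are placeholders for whichever imagestreamtag versions work and fail, and the jar path is taken from the error output above:

```shell
# Placeholder tags; guarded so this is a no-op on hosts without 'oc' or cluster access.
if command -v oc >/dev/null 2>&1; then
  for tag in prod-mk-proposalgenerator-messaging:good prod-mk-proposalgenerator-messaging:bad; do
    # 'oc debug istag/...' starts a throwaway pod from that image version.
    oc debug "istag/$tag" -- sh -c 'cksum /deployments/*.jar && ls -la /deployments/*.jar'
  done
fi
status="sketched"
echo "$status"
```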

OK, lots of data to capture.  Good luck.

Comment 2 Adam Kaplan 2021-10-18 16:25:48 UTC
Is this a potential duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1952929 ? We believed the root cause there was a RHEL 7 (or equivalent) client system using binary builds to stream content to OCP 4.6+, which is based on RHEL 8. However, we didn't explore the issue further since the affected image had reached the end of its support lifecycle.

Comment 3 Gabe Montero 2021-10-18 16:56:59 UTC
Conceivably.  But assuming we are not dealing with EOL versions this time, any suggestions on what from https://bugzilla.redhat.com/show_bug.cgi?id=1952929 we can use to move things along here?

Comment 10 Adam Kaplan 2021-10-27 13:00:19 UTC
Closing this as a duplicate of 1952929.

In this issue an `oc` client on a RHEL 7 based system will occasionally corrupt a byte of information when streaming contents to an OpenShift cluster v4.6 or higher.
The root cause is that the `oc` client uses the host system's `tar` utility to stream contents to OpenShift, while OpenShift 4.6 and higher uses a newer version of tar to unpack the streamed contents.
Given that RHEL 7 is currently in the "Maintenance Support 2" phase of its lifecycle and this bug does not have the "Urgent" priority, this issue will likely not be addressed in a future RHEL 7 release and may not be fixed in RHEL 8.

Using `oc start-build --from-dir` instead of `--from-file` appears to work around this issue.

*** This bug has been marked as a duplicate of bug 1952929 ***