Bug 1574675
Summary: | oc start-build --follow does not terminate when run from Jenkins | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Justin Holmes <jholmes>
Component: | Build | Assignee: | Ben Parees <bparees>
Status: | CLOSED CANTFIX | QA Contact: | Wenjing Zheng <wzheng>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 3.9.0 | CC: | aos-bugs, bparees, dspring, gmontero, jholmes, jokerman, kmcanoy, mmccomas, obedin, wzheng
Target Milestone: | --- | |
Target Release: | 3.9.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-03-12 16:45:43 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Justin Holmes
2018-05-03 20:08:53 UTC
Just to clarify, we have tested this with oc on a developer's laptop and cannot recreate there.

Please provide the oc binary version being used in your slave image (or is it the 3.9.14 value above?).

Hey Ben. We've reproduced the issue on 3.9.14 and 3.9.25 Jenkins masters against 3.9.14 OCP api masters. My understanding is that the Jenkins master versions follow the oc client versions.

Yes, that's how it's supposed to work, but it would be good to confirm it is correct in your env.

I can verify .14 is using .14. Will verify .25 is using .25 today.

Verified that .25 is using .25.

Do you see the same behavior when running oc start-build from a normal client (not a Jenkins slave)? What about from the Jenkins master? I can't recreate this on latest master (3.10). Can you try running it with "oc start-build buildname --loglevel=8 --follow"? (Note that is --loglevel, not --build-loglevel.) Please provide the logs once the started build has completed and the oc start-build command appears to be stuck.

Ben, we see this on masters and slaves. We have not been able to reproduce from outside the cluster (e.g. a dev laptop). We've seen this with a few clusters now, and could probably give you access to one if you'd like (it's a Red Hat-controlled env with customers in it). I'm asking team members to see if they can run your command.

Created attachment 1432803 [details]
Jenkins Output loglevel 8
I've attached the output from a Jenkins job that did not complete.
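For reference, a minimal sketch of the diagnostic invocation requested above, as it might be issued from a pipeline sh step. The build config name "java-app" and the node label are placeholders; --loglevel=8 applies to the oc client itself rather than the build.

```groovy
// Sketch: run the requested diagnostic from a Jenkins agent.
// 'java-app' is a placeholder build config name; 'maven' is a placeholder node label.
node('maven') {
    // --loglevel=8 turns on verbose client-side logging (note: --loglevel, not --build-loglevel),
    // so the output captures the API requests oc makes while following the build
    sh "oc start-build java-app --loglevel=8 --follow"
}
```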
You didn't mention that this appears to be a binary build. Do you see the same issue when doing non-binary builds (i.e. not passing a --from-xxx arg)?

Created attachment 1432810 [details]
Jenkins Output loglevel 8 non-binary
Here is the output from a failing build that is not a binary build.
Would it be possible for you to gather level 5 logs from your api server while recreating this?

The binary note is in the issue description. We don't perform s2i source builds under normal circumstances, and pipeline builds are triggered via Jenkins/OCP UIs or webhooks, so I honestly can't tell you what happens on non-binary builds. Would you like us to create an s2i build? As to the api server logs, we'll probably need to pick that one up tomorrow; the team is in US Eastern time and is dealing with a few other infra issues too.

Ben, which logs do you want specifically? I can recreate the problem but want to make sure I'm giving you the right info.

I'm looking for logs from the apiserver, after configuring the apiserver for loglevel=5, covering a time period during which you ran the oc start-build --follow --loglevel=8 command. Please also include output from the start-build command (I know you've supplied one set already, but I'd like to correlate the output with the apiserver logs).

I'm having trouble replicating this issue. I'll update with logs when I see it again (will continue to do more builds to speed up this process).

I can't reproduce this bug for a binary build with 'oc start-build <bc> --from-dir=<dir> --follow' in a slave node. The s2i build also works well with 'oc start-build <bc> --follow' in a slave node. Tried oc 3.9.25 & jenkins-2-rhel7:v3.9.25 & cluster27, and oc 3.9.27 & jenkins-2-rhel7:v3.9.27 & cluster27.

Created attachment 1433571 [details]
Binary build in slave node logs
Created attachment 1433572 [details]
s2i build in slave node logs
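For context, the two reproduction attempts described in the comment above, written out as pipeline sh steps (a minimal sketch; the build config name, source directory, and node label are placeholders).

```groovy
// Sketch of the QA reproduction attempts from a Jenkins agent (slave) node.
// 'java-app', 'app-src', and 'maven' are placeholder names.
node('maven') {
    // binary build: upload local source with --from-dir and follow the build logs
    sh "oc start-build java-app --from-dir=app-src --follow"

    // non-binary (s2i) build: no --from-xxx argument, still following the logs
    sh "oc start-build java-app --follow"
}
```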
Any luck reproducing this, or can we close it?

We've seen this problem again. I'll get the logs when I'm able to restart the master service to update the log level.

Created attachment 1450717 [details]
OpenShift Master API log
We were able to reproduce the issue tonight with the log levels set. Attached here is the master api log. I'll attach the additional logs in the next comment.
Created attachment 1450718 [details]
oc start-build logs
These are the oc start-build logs with log level 8 set. The end of this log is where it hangs; there are no logs afterwards.
Just an additional note: it seems that running oc start-build with -w works as a temporary fix.

I saw this again today on our s9 cluster building the java-app as part of CI. The workaround we have been using so far is to replace the --follow with a -w (as Logan has just said) and then use the plugin to verify the build, using something along these lines: openshiftVerifyBuild(bldCfg: "${env.APP_NAME}", checkForTriggeredDeployments: 'false', namespace: "${env.CI_CD_PROJECT}", verbose: 'false', waitTime: '')

Can you supply the build pod yaml for the associated build? I think there may be an issue in the init container state reporting.

Created attachment 1450999 [details]
Build Pod yaml
I've attached the yaml of the build pod.
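For clarity, a minimal Jenkinsfile sketch of that workaround: wait for the build with -w instead of streaming logs with --follow, then verify the result with the openshiftVerifyBuild step quoted above. APP_NAME and CI_CD_PROJECT are the environment variables from the comment; the node label and the -n flag (assuming the build config lives in the CI_CD_PROJECT namespace) are assumptions added for this sketch.

```groovy
// Sketch of the workaround: wait for the build instead of following its logs,
// then verify the result with the OpenShift Pipeline plugin.
node('maven') {
    stage('Build') {
        // -w waits for the build to complete but does not stream logs,
        // which avoids the hang seen with --follow
        sh "oc start-build ${env.APP_NAME} -w -n ${env.CI_CD_PROJECT}"

        // confirm the build actually succeeded (OpenShift Pipeline plugin step)
        openshiftVerifyBuild(bldCfg: "${env.APP_NAME}",
                             checkForTriggeredDeployments: 'false',
                             namespace: "${env.CI_CD_PROJECT}",
                             verbose: 'false',
                             waitTime: '')
    }
}
```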
Just to make sure this point has been made: this issue is intermittent and does not always happen. We've encountered times where it would occur for 90% of the builds; other times we would never encounter it. Recently we've been seeing it a lot.

That pod appears to be for build "java-app-2". Do you have the build pod for the "java-app-6" build that you provided the other logs for? And yes, definitely understood that it's intermittent.

Unfortunately no, the build pod for java-app-6 is gone, but it had the same hanging issue. If needed I can reproduce it with the build pod as well.

I'm afraid it is needed, sorry.

The build pod always starts and completes successfully, with no different messaging whether Jenkins hangs or not.

> The build pod always starts and completes successfully, with no different messaging whether Jenkins hangs or not.
My concern is that the init container state in the pod may not be correct, which will trip up the logic that is attempting to stream the logs from the various containers that make up the build pod. I need to see the build pod yaml to confirm or rule out that possibility. We've had bugs in this space before (init container state not being set correctly in a pod).
Created attachment 1453023 [details]
oc start-build logs (rerun)
Reproduced the issue; I'm attaching the `oc start-build` level-8 logs, the master-api logs with log level 5, and the build-pod yaml.
Created attachment 1453024 [details]
OpenShift Master API log (rerun)
Created attachment 1453025 [details]
build pod yaml (rerun)
Ben - another path to consider is the Jenkins durable task plugin. That is what provides the guts of the "sh" step when they call sh 'oc start-build ... --follow'. Especially since they see this only on Jenkins masters/agents, and given the command line and the intermittent nature, I could very well see the process/signal handling that occurs when `oc start-build --follow` exits fooling the durable task plugin into thinking the command has not exited. I've been sifting through Jenkins JIRAs and commits for this plugin (https://github.com/jenkinsci/durable-task-plugin) but have not seen anything of note specific to this yet; generally speaking, though, there have been issues with parent/child pid processing in this area. For example, we hit an odd timing issue with the client plugin when starting processes, where the "sh" step would abort prematurely. If nothing stands out from the logs they attached in #Comment 36, #Comment 37, #Comment 38, we may want to look at providing a debug version of the durable task plugin that tries to capture relevant information on the Jenkins pipeline side of things.

Logan, in your comment 26, did you pass --follow *and* --wait, or just --wait?

Ben, that was only using -w and not -f.

OK, and have you ever recreated this behavior running oc start-build outside of Jenkins (e.g. just on a client machine)?

For the most part, my testing was done in Jenkins slaves. I seem to recall others, like Kevin, ran some of it locally and encountered the same problem, but I can't be certain (and he's away on PTO for a while).

Created attachment 1470558 [details]
debug version of the jenkins durable task plugin
OK, I was able to vet the debug version of the durable task plugin today, in particular the "sh" pipeline step.
I'm attaching the hpi file here. Manually install it on your running Jenkins instance.
The initial output message in a shell script will now read:
"Running shell script - DEBUG PLUGIN"
so in other words, I added the "DEBUG PLUGIN" piece.
Other changes:
- I've added some prints in the plugins error paths that will show up in the job run output, in case there is something amiss there
- after 10 minutes, if the command is still running, it will get a golang thread dump for every instance of "oc" currently running
- you can change the number of minutes it waits by setting the env var OC_THREAD_DUMP_INTERVAL_MIN to another integer value
- Setting a global env var should make it available to the master; update the env vars on any pod templates you have to see the change when running on agents
If you all think you'll be able to give this a go, just update here.
Thanks
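As an illustration of the pod-template note above, one way to expose OC_THREAD_DUMP_INTERVAL_MIN to jobs running on an agent, assuming the Kubernetes plugin's podTemplate step is in use. The label, container image, interval value, and build config name are placeholders for this sketch.

```groovy
// Sketch: make the debug plugin's env var visible on an agent pod template.
// 'oc-debug', the image, and the value '5' are placeholders; the env var shortens
// the wait before the debug plugin captures oc thread dumps.
podTemplate(label: 'oc-debug',
            containers: [containerTemplate(name: 'jnlp', image: 'openshift/jenkins-slave-maven-centos7')],
            envVars: [envVar(key: 'OC_THREAD_DUMP_INTERVAL_MIN', value: '5')]) {
    node('oc-debug') {
        sh "oc start-build java-app --follow"
    }
}
```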
We were never able to recreate this, and I've heard of no other reports of it happening. If we get a new report on a current version we can revisit, but for now there's no ongoing investigation of this bug, so I'm going to close it to make that clear.