Bug 1574675

Summary: oc start-build --follow does not terminate when run from Jenkins
Product: OpenShift Container Platform
Reporter: Justin Holmes <jholmes>
Component: Build
Assignee: Ben Parees <bparees>
Status: CLOSED CANTFIX
QA Contact: Wenjing Zheng <wzheng>
Severity: medium
Priority: medium
Version: 3.9.0
CC: aos-bugs, bparees, dspring, gmontero, jholmes, jokerman, kmcanoy, mmccomas, obedin, wzheng
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-03-12 16:45:43 UTC
Attachments:
- Binary build in slave node logs
- s2i build in slave node logs
- debug version of the jenkins durable task plugin

Description Justin Holmes 2018-05-03 20:08:53 UTC
Description of problem: 

Binary build with --follow does not terminate. The build will complete in OpenShift, but the CLI command will hang in Jenkins


Version-Release number of selected component (if applicable):
3.9.25 and 3.9.14 CLI

OpenShift Master:
v3.9.14
Kubernetes Master:
v1.9.1+a0ce1bc657
OpenShift Web Console:
v3.9.14

How reproducible:
Very. I have a cluster that I can give support personnel access to for testing if need be.

Steps to Reproduce:
1. Set up a Jenkins master/slave build
2. Build your app on the slave
3. Kick off the s2i binary build from the slave

Actual results:
oc start-build --follow hangs

Expected results:
oc start-build --follow terminates when the s2i build completes

Additional info:
oc start-build -w works as expected.
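
For reference, a minimal sketch of the kind of pipeline step that hangs (the build config name, project, and --from-dir path are placeholders):

    node('maven') {
        stage('build') {
            sh 'mvn package'
            // this is the call that never returns in Jenkins, even though the build completes in OpenShift
            sh 'oc start-build myapp --from-dir=target --follow -n my-project'
        }
    }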

Comment 1 Justin Holmes 2018-05-03 20:25:05 UTC
Just to clarify, we have tested this with oc on a developer's laptop and cannot recreate it there.

Comment 2 Ben Parees 2018-05-03 23:09:59 UTC
please provide the oc binary version being used in your slave image.

Comment 3 Ben Parees 2018-05-03 23:10:32 UTC
(or is it the 3.9.14 value above?)

Comment 4 Justin Holmes 2018-05-04 04:54:54 UTC
Hey Ben. We've reproduced the issue on 3.9.14 and 3.9.25 Jenkins masters against 3.9.14 OCP api masters. My understanding is that the Jenkins master versions follow the oc client versions.

Comment 5 Ben Parees 2018-05-04 06:56:25 UTC
Yes that's how it's supposed to work, but it would be good to confirm it is correct in your env.

Comment 6 Justin Holmes 2018-05-04 12:21:46 UTC
I can verify .14 is using .14. Will verify .25 is using .25 today

Comment 7 Justin Holmes 2018-05-05 02:20:08 UTC
verified that .25 is using .25

Comment 8 Ben Parees 2018-05-07 17:45:54 UTC
Do you see the same behavior when running oc start-build from a normal client (not a jenkins slave)?

What about from the jenkins master?

Comment 9 Ben Parees 2018-05-07 18:20:30 UTC
I can't recreate this on latest master (3.10).

Can you try running it with "oc start-build buildname --loglevel=8 --follow"?  (note that is --loglevel not --build-loglevel) and provide the logs once the started build has completed and the oc start-build command appears to be stuck?

Comment 10 Justin Holmes 2018-05-07 18:36:06 UTC
Ben, we see this on masters and slaves. We have not been able to reproduce from outside the cluster (e.g. dev laptop). We've seen this with a few clusters now, and probably could give you access to one if you'd like (it's a Red Hat controlled env with customers in it).

I'm asking team members to see if they can run your command.

Comment 11 Kevin McAnoy 2018-05-07 19:52:39 UTC
Created attachment 1432803 [details]
Jenkins Output loglevel 8

I've attached the output from a Jenkins job that did not complete.

Comment 12 Ben Parees 2018-05-07 19:58:34 UTC
you didn't mention that this appears to be a binary build.

Do you see the same issue when doing non-binary builds?  (not passing a --from-xxx arg)
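
(For clarity, the two variants being compared, with a placeholder build config name:)

    oc start-build myapp --from-dir=target --follow   # binary build: source streamed from the client
    oc start-build myapp --follow                     # non-binary build: source pulled by the cluster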

Comment 13 Kevin McAnoy 2018-05-07 20:19:28 UTC
Created attachment 1432810 [details]
Jenkins Output loglevel 8 non-binary

Here is the output from a failing build that is not a binary build

Comment 14 Ben Parees 2018-05-07 20:29:32 UTC
would it be possible for you to gather level 5 logs from your api server while recreating this?

Comment 15 Justin Holmes 2018-05-07 21:34:36 UTC
The binary note is in the issue description.

We don't perform s2i source builds under normal circumstances, and pipeline builds are triggered via the Jenkins/OCP UIs or webhooks, so I honestly can't tell you what happens on non-binary builds. Would you like us to create an s2i source build to test?

As to the api server logs, we'll probably need to pick that one up tomorrow; the team is in US Eastern time and is dealing with a few other infra issues too.

Comment 16 Logan Donley 2018-05-08 17:25:09 UTC
Ben, which logs do you want specifically? I can recreate the problem but want to make sure I'm giving you the right info you need.

Comment 17 Ben Parees 2018-05-08 17:31:39 UTC
i'm looking for logs from the apiserver, after configuring the apiserver for loglevel=5, covering a time period during which you ran the oc start-build --follow --loglevel=8 command.

please also include output from the start-build command (i know you've supplied one set already but i'd like to correlate the output w/ the apiserver logs).
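
(Roughly, on an RPM-based 3.9 master with the native HA layout - adjust the unit/file names if your install differs:)

    # set --loglevel=5 in the OPTIONS line, then restart the api server
    vi /etc/sysconfig/atomic-openshift-master-api
    systemctl restart atomic-openshift-master-api
    # after reproducing, grab the window that covers the start-build run
    journalctl -u atomic-openshift-master-api --since "30 min ago" > master-api.log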

Comment 18 Logan Donley 2018-05-08 21:00:00 UTC
I'm having trouble replicating this issue. I'll update with logs when I see it again (will continue to do more builds to speed up this process).

Comment 19 XiuJuan Wang 2018-05-09 07:43:05 UTC
I can't reproduce this bug for a binary build with 'oc start-build <bc> --from-dir=<dir> --follow' on a slave node.

The s2i build also works well with 'oc start-build <bc> --follow' on a slave node.

Tried oc (3.9.25) & jenkins-2-rhel7:v3.9.25 & cluster27, and oc27 & jenkins-2-rhel7:v3.9.27 & cluster27.

Comment 20 XiuJuan Wang 2018-05-09 07:46:36 UTC
Created attachment 1433571 [details]
Binary build in slave node logs

Comment 21 XiuJuan Wang 2018-05-09 07:47:13 UTC
Created attachment 1433572 [details]
s2i build in slave node logs

Comment 22 Ben Parees 2018-06-08 19:55:11 UTC
any luck reproducing this, or can we close it?

Comment 23 Logan Donley 2018-06-11 16:32:19 UTC
We've seen this problem again, I'll get the logs when I'm able to reboot the master service to update the log level.

Comment 24 Logan Donley 2018-06-12 23:33:56 UTC
Created attachment 1450717 [details]
OpenShift Master API log

We were able to reproduce the issue tonight with the log levels set. Attached here is the master api log. I'll attach the additional logs in the next comment.

Comment 25 Logan Donley 2018-06-12 23:35:41 UTC
Created attachment 1450718 [details]
oc start-build logs

This is the oc start-build log with log level 8 set. The end of this log is where it hangs; there are no logs afterwards.

Comment 26 Logan Donley 2018-06-13 16:07:34 UTC
Just an additional note, it seems that running the oc start-build with -w works as a temporary fix.

Comment 27 Dónal Spring 2018-06-13 16:38:32 UTC
I saw this again today on our s9 cluster building the java-app as part of CI.
The workaround we have been using so far is to replace the --follow with a -w (as Logan has just said) and then use the plugin to verify the build, using something along these lines:
         openshiftVerifyBuild (bldCfg: "${env.APP_NAME}",
                     checkForTriggeredDeployments: 'false',
                     namespace: "${env.CI_CD_PROJECT}",
                     verbose: 'false',
                     waitTime: '')
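
Putting it together, the step ends up looking roughly like this (the --from-dir path is a placeholder for whatever the job already builds):

         sh "oc start-build ${env.APP_NAME} --from-dir=oc-build -w -n ${env.CI_CD_PROJECT}"
         openshiftVerifyBuild (bldCfg: "${env.APP_NAME}",
                     checkForTriggeredDeployments: 'false',
                     namespace: "${env.CI_CD_PROJECT}",
                     verbose: 'false',
                     waitTime: '')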

Comment 28 Ben Parees 2018-06-13 16:56:06 UTC
can you supply the build pod yaml for the associated build?  i think there may be an issue in the init container state reporting.

Comment 29 Logan Donley 2018-06-13 17:13:06 UTC
Created attachment 1450999 [details]
Build Pod yaml

I've attached the yaml of the build pod.

Comment 30 Logan Donley 2018-06-13 17:19:30 UTC
Just to make sure this point has been made, this issue is intermittent and does not always happen. At times it has hit roughly 90% of our builds; at other times we would never see it. Recently we've been seeing it a lot.

Comment 31 Ben Parees 2018-06-13 17:29:16 UTC
that pod appears to be for build "java-app-2", do you have the build pod for the "java-app-6" build that you provided the other logs for?

and yes, definitely understand it's intermittent.

Comment 32 Logan Donley 2018-06-19 16:01:14 UTC
Unfortunately no, the build pod for java-app-6 is gone, but it had the same hanging issue.

If needed I can reproduce it with the build pod as well.

Comment 33 Ben Parees 2018-06-19 16:07:03 UTC
i'm afraid it is needed, sorry.

Comment 34 Kevin McAnoy 2018-06-19 16:12:07 UTC
The build pod always starts and completes successfully, with no different messaging whether Jenkins hangs or not.

Comment 35 Ben Parees 2018-06-19 17:31:40 UTC
> The build pod always starts and completes successfully, with no different messaging whether Jenkins hangs or not.

my concern is that the init container state in the pod may not be correct, which will trip up the logic that is attempting to stream the logs from the various containers that make up the build pod.  I need to see the build pod yaml to confirm/deny that possibility.  We've had bugs in this space before (init container state not being set correctly in a pod).
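
(For whoever gathers it: build pods follow the <buildname>-build naming convention, so something like this should capture the relevant state:)

    oc get pod java-app-6-build -o yaml > build-pod.yaml
    # or just the piece in question
    oc get pod java-app-6-build -o jsonpath='{.status.initContainerStatuses}'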

Comment 36 Logan Donley 2018-06-19 20:47:44 UTC
Created attachment 1453023 [details]
oc start-build logs (rerun)

Reproduced the issue, I'm attaching the `oc start-build` level-8 logs, the master-api logs with log level 5, and the build-pod yml.

Comment 37 Logan Donley 2018-06-19 20:48:23 UTC
Created attachment 1453024 [details]
OpenShift Master API log (rerun)

Comment 38 Logan Donley 2018-06-19 20:48:51 UTC
Created attachment 1453025 [details]
build pod yaml (rerun)

Comment 39 Gabe Montero 2018-07-24 19:35:45 UTC
Ben - another path to also consider is the Jenkins durable task plugin.  That is what provides the guts of the "sh" step when they call "sh 'oc start-build .... --follow'".

Especially since they see this only on Jenkins masters/agents and not from the command line, and it is intermittent, I could very well see the various process/signal handling that occurs when `oc start-build --follow` exits fooling the durable task plugin side such that it does not think the command has exited.

I've been sifting through some Jenkins jiras / commits for this plugin (https://github.com/jenkinsci/durable-task-plugin) and have not seen anything of note yet specific to this, but generally speaking there have been issues with parent/child pid handling with this stuff.  For example, we hit an odd timing issue with the client plugin when starting processes, where the "sh" step would abort prematurely.

If nothing stands out from the logs they attached in #Comment 36, #Comment 37, #Comment 38, we may want to look at providing a debug version of the durable task plugin that tries to capture relevant information on the jenkins pipeline side of things.

Comment 40 Ben Parees 2018-07-24 20:45:12 UTC
Logan, in your comment 26, did you pass --follow *and* --wait, or just --wait?

Comment 41 Logan Donley 2018-07-24 21:06:09 UTC
Ben, that was only using -w and not -f.

Comment 42 Ben Parees 2018-07-24 21:07:09 UTC
ok, and have you ever recreated this behavior running oc start-build outside of jenkins? (e.g. just on a client machine)?

Comment 43 Logan Donley 2018-07-24 21:24:19 UTC
For the most part, my testing was done in Jenkins slaves. I seem to recall others, like Kevin, ran some of it locally and encountered the same problem, but I can't be certain (and he's away on PTO for a while).

Comment 44 Gabe Montero 2018-07-25 21:15:14 UTC
Created attachment 1470558 [details]
debug version of the jenkins durable task plugin

OK, I was able to vet the debug version of the durable task plugin today, and in particular the "sh" pipeline step.

I'm attaching the hpi file here.  Manually install it on your running Jenkins instance.

The initial output message in a shell script will now read:

"Running shell script - DEBUG PLUGIN"

so in other words, I added the "DEBUG PLUGIN" piece.

Other changes:
- I've added some prints in the plugin's error paths that will show up in the job run output, in case there is something amiss there
- after 10 minutes, if the command is still running, it will get a golang thread dump for every instance of "oc" currently running
- you can change the number of minutes it waits by setting the env var OC_THREAD_DUMP_INTERVAL_MIN to another integer value
- setting a global env var should make it available to the master; update the env vars on any pod templates you have to see the change when running on agents (see the sketch below)
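
A rough sketch of setting it for an agent, assuming the kubernetes plugin's podTemplate step (the label and interval value are placeholders):

    podTemplate(label: 'maven-debug',
                envVars: [envVar(key: 'OC_THREAD_DUMP_INTERVAL_MIN', value: '5')]) {
        node('maven-debug') {
            // the existing sh 'oc start-build ... --follow' step runs here
        }
    }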

If you all think you'll be able to give this a go, just update here.

Thanks

Comment 45 Ben Parees 2019-03-12 16:45:43 UTC
We were never able to recreate this and I've heard of no other reports of it happening. If we get a new report on a current version we can revisit, but for now there's no ongoing investigation of this bug, so I'm going to close it to make that clear.