Bug 1551500 - Can't login jenkins webconsole due to "Liveness probe failed"
Summary: Can't login jenkins webconsole due to "Liveness probe failed"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Gabe Montero
QA Contact: Dongbo Yan
URL:
Whiteboard:
Depends On:
Blocks: 1554711
 
Reported: 2018-03-05 10:06 UTC by XiuJuan Wang
Modified: 2021-06-10 15:03 UTC (History)
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A combination of upgrading to Jenkins 2.89.x (with its increased interaction with the Jenkins update center on initial login) and the default memory limit of the example OpenShift Jenkins templates can lead to enough GC on the first login attempt that the OpenShift route may time out the request.
Consequence: A gateway/504 error shows up on login attempts while the update center activity and subsequent Java GC is occurring; you currently have to wait until the GC dies down and then hit refresh to finish the login. Sometimes the liveness/readiness probes may even fail such that Jenkins is redeployed, or you have to instantiate Jenkins in a pod with a larger MEMORY_LIMIT (say 2Gi instead of the default 512Mi).
Fix: Updated the route timeout and liveness/readiness probes to account for this initial GC cycle with the default MEMORY_LIMIT from the template.
Result: No 504s on initial login, and no redeploys of the pod as a result of intense GC.
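Concretely, the fix adjusts the template's probe settings along these lines. This is an illustrative sketch only, not the exact merged diff; the probe numbers are placeholders, and the real values are in the origin PRs linked in the comments below.

```yaml
# Illustrative sketch: relax the /login probes so the first-login GC
# cycle does not trip them. All numeric values are placeholders, not
# the settings actually merged; consult the origin PRs for those.
livenessProbe:
  httpGet:
    path: /login
    port: 8080
  initialDelaySeconds: 420   # placeholder: wait out the initial GC cycle
  timeoutSeconds: 240        # placeholder: tolerate long GC pauses
readinessProbe:
  httpGet:
    path: /login
    port: 8080
  initialDelaySeconds: 3     # placeholder
  timeoutSeconds: 240        # placeholder
```

The route timeout side of the fix is discussed in Comment #20, where it was bumped to 4m based on testing.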
Clone Of:
: 1554711 (view as bug list)
Environment:
Last Closed: 2018-08-28 18:35:22 UTC
Target Upstream Version:
Embargoed:
xiuwang: needinfo-


Attachments
Events after pod being ready when image is re-imported (9.96 KB, text/plain)
2018-03-08 06:32 UTC, Wenjing Zheng
Jenkins pods log after re-deployment (20.97 KB, text/plain)
2018-03-08 06:32 UTC, Wenjing Zheng

Description XiuJuan Wang 2018-03-05 10:06:06 UTC
Description of problem:
Can't log in to the jenkins web console due to the errors below:

  Warning  Unhealthy              10m               kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: Get http://10.128.0.80:8080/login: dial tcp 10.128.0.80:8080: getsockopt: connection refused
  Warning  Unhealthy              7m (x7 over 9m)   kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy              2m (x11 over 9m)  kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: Get http://10.128.0.80:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy              2m                kubelet, ip-172-31-76-218.us-east-2.compute.internal  Liveness probe failed: Get http://10.128.0.80:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The jenkins pod keeps restarting every few minutes.
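As an interim workaround, the Doc Text above notes that instantiating Jenkins with a larger memory limit avoids the worst of the first-login GC. A hedged sketch of the template parameter override (the MEMORY_LIMIT parameter name comes from the example templates; the CLI form shown in the comment is assumed):

```yaml
# Sketch: override the jenkins-persistent template's memory parameter so
# first-login GC churn is less likely to trip the probes.
# Assumed equivalent CLI: oc new-app jenkins-persistent -p MEMORY_LIMIT=2Gi
parameters:
  - name: MEMORY_LIMIT
    value: 2Gi   # default in the example templates is 512Mi per the Doc Text
```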

Version-Release number of selected component (if applicable):
registry.reg-aws.openshift.com:443/openshift3/jenkins-2-rhel7:v3.9.1

How reproducible:
always

Steps to Reproduce:
1. Create a jenkins app:
oc new-app jenkins-persistent
2. Wait for the jenkins pod to become ready, then access the jenkins web console via the route.

Actual results:
Got a 504 error, and the errors above are shown when describing the jenkins pod.

Expected results:
The jenkins web console can be accessed.

Additional info:

Comment 1 Ben Parees 2018-03-05 14:40:16 UTC
Gabe, maybe use this bug to make the router timeout changes?

I'm a little surprised at the liveness+readiness probe failures, though; that may warrant a separate investigation, unless you think those would also fail due to the login GC churn (in which case we may have to increase their tolerance).

Comment 2 Gabe Montero 2018-03-05 14:47:47 UTC
GC impacting the probes during the login attempt is quite conceivable

Initial reaction: we should adjust those probe settings in addition to the router timeout setting we discussed this weekend.

We can discuss further either here or in the PR I was in the process of submitting.

Comment 3 Gabe Montero 2018-03-05 16:16:22 UTC
PR https://github.com/openshift/origin/pull/18832 is in flight.

I believe we can cherry-pick from there to the 3.9 branch, but if not, a second PR will be coming.

Comment 4 Gabe Montero 2018-03-05 17:49:19 UTC
3.9 PR https://github.com/openshift/origin/pull/18834

Comment 5 Gabe Montero 2018-03-05 18:35:26 UTC
The 3.9 PR is now https://github.com/openshift/origin/pull/18839

Comment 6 Gabe Montero 2018-03-06 01:03:21 UTC
PR https://github.com/openshift/jenkins/pull/440 is also going to help here, as it moves the update center processing to startup vs. on the initial login

We are still also moving forward with the template changes noted above.

I'll move this to ON_QA when we think enough of these PRs have merged.

Comment 7 Gabe Montero 2018-03-06 18:48:36 UTC
Both PRs have merged: the 3.9 origin template changes, and the openshift jenkins rhel image change to do the update center processing during startup.


https://github.com/openshift/openshift-ansible/pull/7409 is submitted to update the 3.9 install, but is stuck with test flakes / merge pain.

I will move this to ON_QA as soon as the ansible change merges.

Comment 8 Gabe Montero 2018-03-06 19:38:11 UTC
OK Scott Dodson was able to manually merge the ansible PR.

Moving to ON_QA

Note, for the jenkins image, QA needs to make sure they have an openshift jenkins rhel image with commit 4bfb7f025f9821edabee2dea70f331f266d38f27 as well.

This most likely will entail importing that image from brew-pulp into the jenkins image stream and updating the "2" tag.

Comment 9 Gabe Montero 2018-03-06 20:02:57 UTC
Hey Dongbo Yan - ignore my comment about pulling in the updated image in Comment #8.  

Simply use the updated template from the ansible update to verify.

We are going to back out the image change to do the processing during startup, at least for the short term.

Comment 10 XiuJuan Wang 2018-03-07 09:48:46 UTC
Hi, Gabe
When I used the ansible-updated template in free-stg, I still got a 504 error when accessing the jenkins web console for the first time.

Comment 11 XiuJuan Wang 2018-03-07 10:04:51 UTC
(In reply to XiuJuan Wang from comment #10)
> Hi, Gabe
> When I used the ansible updated template in free-stg, still met 504 error
> when access jenkinswebconsole at first time

More info:
Using the ansible-updated template in a 3.9 OCP cluster, I don't get a 504 error when accessing the web console for the first time.

Comment 12 Gabe Montero 2018-03-07 14:38:51 UTC
OK, great, per Comment #11, XiuJuan.

So we can mark this verified then, right? Free-stg just has not been updated yet, presumably?

Or can you not mark this verified until free-stg is updated?

If you could provide the yaml/json for the template in free-stg, I can confirm if it does in fact have the update.

Comment 15 Wenjing Zheng 2018-03-08 06:32:13 UTC
Created attachment 1405706 [details]
Events after pod being ready when image is re-imported

Comment 16 Wenjing Zheng 2018-03-08 06:32:52 UTC
Created attachment 1405707 [details]
Jenkins pods log after re-deployment

Comment 17 Wenjing Zheng 2018-03-08 06:33:37 UTC
I am sure the pod is running when I try to access the console after re-deployment.

Comment 18 Gabe Montero 2018-03-08 14:43:32 UTC
Hey Wenjing - the verification procedure and image from the other bug you noted in Comment #14 do not apply to this bug.

Hey Xiujuan - yeah, if you used the template from my commit to provision jenkins and still got the 504, it is possible that we need to relax the timeouts even more in the online environments.

I think we need to do the following:

1) Xiujuan - in Comment #13 you mentioned getting a 504 when "describing the pod", but Comment #0 of this bug, plus all the associated github issues and my personal experience, are all about logging into the jenkins console using the route.

You meant logging into the console in comment #13, right?

2) Xiujuan - can you please provide the yaml for the route after using the modified template, as well as the jenkins pod logs from when you get the 504 on the first login?

3) I'll see about logging into free-stg myself (I believe I have an account) and experiment with the timeouts.

Marking needinfo for 1) and 2) while I do 3).

Comment 19 Gabe Montero 2018-03-08 15:11:20 UTC
For 3), I don't have access to create projects on free-stg (Justin tells me that is more restricted), but I do have access to free-int.

I'll be testing there on the assumption that it is comparable to free-stg.

Comment 20 Gabe Montero 2018-03-08 18:08:58 UTC
Based on testing in free-int and us-starter-east-1, I've bumped the route timeout to 4m.

origin pr https://github.com/openshift/origin/pull/18900
ansible pr https://github.com/openshift/openshift-ansible/pull/7460

I will move this to ON_QA when they merge, though Sam/Adam will need to kick off new 3.9 builds when they do.

Comment 21 XiuJuan Wang 2018-03-09 06:00:24 UTC
Gabe,
Yes, I did indeed log into the jenkins web console via the route.

I don't get a 504 on the first login when using the updated template from comment #20. If the 504 error pops up again in free-stg, I will try increasing "haproxy.router.openshift.io/timeout".

I will verify this bug when the template gets updated on the OSO starter env.
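The annotation mentioned above can be set on the route itself. A minimal sketch (the route name `jenkins` is assumed from the template; "4m" mirrors the value Gabe settled on in Comment #20):

```yaml
# Hypothetical route fragment: raise the HAProxy timeout so the first
# login can ride out the initial GC churn.
apiVersion: v1
kind: Route
metadata:
  name: jenkins   # assumed: the route created by the jenkins-persistent template
  annotations:
    haproxy.router.openshift.io/timeout: 4m
spec:
  to:
    kind: Service
    name: jenkins
```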

Comment 22 XiuJuan Wang 2018-03-13 03:35:23 UTC
Since this bug was reported against OCP and has been fixed in OCP 3.9 (v3.9.7), QE will verify this bug, then clone a new bug against the OSO starter env to track the same issue.

