Description of problem:
Can't log in to the Jenkins web console due to the errors below; the jenkins pod keeps restarting every few minutes.

  Warning  Unhealthy  10m               kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: Get http://10.128.0.80:8080/login: dial tcp 10.128.0.80:8080: getsockopt: connection refused
  Warning  Unhealthy  7m (x7 over 9m)   kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  2m (x11 over 9m)  kubelet, ip-172-31-76-218.us-east-2.compute.internal  Readiness probe failed: Get http://10.128.0.80:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  2m                kubelet, ip-172-31-76-218.us-east-2.compute.internal  Liveness probe failed: Get http://10.128.0.80:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Version-Release number of selected component (if applicable):
registry.reg-aws.openshift.com:443/openshift3/jenkins-2-rhel7:v3.9.1

How reproducible:
Always

Steps to Reproduce:
1. Create a jenkins app: oc new-app jenkins-persistent
2. Wait for the jenkins pod to become ready, then access the Jenkins web console via the route (see the sketch below)

Actual results:
Got a 504 error, and the errors above show up when describing the jenkins pod

Expected results:
Can access the Jenkins web console

Additional info:
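For reference, a minimal reproduction/diagnosis sketch of the steps above; the project name and the name=jenkins label are assumptions based on the template defaults, not taken from this report:

    # Create the app and wait for the deployment to roll out
    oc new-project jenkins-test
    oc new-app jenkins-persistent
    oc rollout status dc/jenkins

    # Find the route host for the web console, then attempt the login in a browser
    oc get route jenkins -o jsonpath='{.spec.host}{"\n"}'

    # If the login returns a 504, inspect pod events for the probe failures quoted above
    oc describe pod -l name=jenkins
    oc get events --sort-by=.lastTimestamp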
Gabe, maybe use this bug to make the router timeout changes? I'm a little surprised at the liveness+readiness probe failures, though; that may warrant a separate investigation, unless you think those would also fail due to the login GC churn (in which case we may have to increase their tolerance).
GC impacting the probes during the login attempt is quite conceivable. Initial reaction: we should adjust those probe settings in addition to the router timeout setting we discussed this weekend. We can discuss further either here or in the PR I was in the process of submitting.
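For illustration only (the real values land via the template PRs below), one way to relax the probes on an existing deployment, assuming oc set probe and these flags are available in this release; the numbers are placeholders, not the values from the PR:

    # Give the readiness probe more headroom so GC/update-center churn during the
    # first login does not mark the pod unready (placeholder values)
    oc set probe dc/jenkins --readiness --get-url=http://:8080/login \
        --initial-delay-seconds=120 --timeout-seconds=240 --failure-threshold=10

    # Same idea for the liveness probe, with an even longer tolerance
    oc set probe dc/jenkins --liveness --get-url=http://:8080/login \
        --initial-delay-seconds=420 --timeout-seconds=240 --failure-threshold=12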
PR https://github.com/openshift/origin/pull/18832 is in flight. I believe we can cherry-pick from there to the 3.9 branch, but if not, a second PR will be coming.
3.9 PR https://github.com/openshift/origin/pull/18834
The 3.9 PR is now https://github.com/openshift/origin/pull/18839
PR https://github.com/openshift/jenkins/pull/440 is also going to help here, as it moves the update center processing to startup vs. on the initial login. We are still also moving forward with the template changes noted above. I'll move this to ON_QA when we think enough of these PRs have merged.
Both PRs for the 3.9 origin template changes and the openshift jenkins rhel image change to do the update center processing during startup have merged. https://github.com/openshift/openshift-ansible/pull/7409 is submitted to update the 3.9 install, but is stuck with test flakes / merge pain. I will move this to ON_QA as soon as the ansible change merges.
OK, Scott Dodson was able to manually merge the ansible PR. Moving to ON_QA. Note: for the jenkins image, QA needs to make sure they have an openshift jenkins rhel image with commit 4bfb7f025f9821edabee2dea70f331f266d38f27 as well. This most likely will entail importing that image from brew-pulp into the jenkins image stream and updating the "2" tag.
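If it helps, a sketch of that import/retag, assuming the target is the jenkins image stream in the openshift namespace; the registry path is a placeholder, not the real brew-pulp location:

    # Import the candidate build (substitute the brew-pulp location of the build
    # containing commit 4bfb7f025f9821edabee2dea70f331f266d38f27)
    oc import-image jenkins:candidate \
        --from=<brew-pulp-registry>/openshift3/jenkins-2-rhel7:<tag> \
        --confirm -n openshift

    # Point the "2" tag used by the jenkins-persistent template at the imported image
    oc tag openshift/jenkins:candidate openshift/jenkins:2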
Hey Dongbo Yan - ignore my comment about pulling in the updated image in Comment #8. Simply use the updated template from the ansible update to verify. We are going to back out the image change to do the processing during startup, at least for the short term.
Hi Gabe, when I used the ansible-updated template in free-stg, I still hit the 504 error when accessing the Jenkins web console for the first time.
(In reply to XiuJuan Wang from comment #10)
> Hi, Gabe
> When I used the ansible updated template in free-stg, still met 504 error
> when access jenkinswebconsole at first time

More info: using the ansible-updated template in a 3.9 OCP cluster, I don't hit the 504 error when accessing the web console for the first time.
OK, great. Per Comment #11, XiuJuan, we can mark this verified then, right? ... free-stg just has not been updated yet, presumably? Or can you not mark this verified until free-stg is updated? If you could provide the yaml/json for the template in free-stg, I can confirm whether it does in fact have the update.
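Something like this should dump the installed template for comparison, assuming it lives under its default name in the openshift namespace:

    oc get template jenkins-persistent -n openshift -o yaml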
Created attachment 1405706 [details] Events after pod being ready when image is re-imported
Created attachment 1405707 [details] Jenkins pods log after re-deployment
I am sure the pod is running when I try to access the console after re-deployment.
Hey Wenjian - the verify procedure and image from the other bug you noted in Comment #14 do not apply to this bug.

Hey Xiujuan - yeah, if you used the template from my commit to provision Jenkins and still got the 504, it is possible that we need to relax the timeouts even more in the online environments. I think we need to do the following:

1) Xiujuan - in Comment #13 you mentioned getting a 504 when "describing the pod", but Comment #0 of this bug, plus all the associated GitHub issues and my personal experience, are all about logging into the Jenkins console using the route. You meant logging into the console in Comment #13, right?

2) Xiujuan - please provide the yaml for the route after using the modified template, as well as the Jenkins pod logs from when you get the 504 on the first login (example commands below).

3) I'll see about logging into free-stg myself (I believe I have an account) and experiment with the timeouts.

Marking need info for 1) and 2) while I do 3).
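For 2), commands along these lines should capture what I'm after (object names assume the template defaults):

    # Route definition after provisioning with the modified template
    oc get route jenkins -o yaml

    # Jenkins pod logs captured while the first login is returning the 504
    oc logs dc/jenkins --timestamps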
For 3), I don't have access to create projects on free-stg (Justin tells me that is more restricted), but I do have it for free-int. I'll be testing there with the assumption it is comparable to free-stg.
Based on testing in free-int and us-starter-east-1, I've bumped the route timeout to 4m.

origin PR: https://github.com/openshift/origin/pull/18900
ansible PR: https://github.com/openshift/openshift-ansible/pull/7460

Will move to ON_QA when they merge, though Sam/Adam will need to kick off new 3.9 builds when they do.
Gabe, yes, I did log into the Jenkins web console via the route. I don't hit the 504 on the first login when using the updated template from comment #20. If the 504 error pops up again in free-stg, I will try increasing "haproxy.router.openshift.io/timeout". I will verify this bug when the template gets updated on the OSO starter env.
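If it does reappear, a stop-gap along these lines should work until the template update lands (the route name assumes the template default; 4m matches the value in the PRs above):

    oc annotate route jenkins haproxy.router.openshift.io/timeout=4m --overwrite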
Since this bug was reported against OCP and has been fixed in OCP 3.9 (v3.9.7), QE will verify this bug, then clone a new bug against the OSO-starter env to track the same issue.