Created attachment 1283832 [details]
customer reproducer with only 1 instance

Description of problem:
When liveness and readiness probes are defined on the deployment config and the application is scaled up, the application is restarted multiple times because the readiness check fails.

Version-Release number of selected component (if applicable):
OCP 3.5.5.8

How reproducible:
Customer can reproduce it frequently

Steps to Reproduce:
1. Create a project and a new application using jboss-webserver-3/webserver30-tomcat8-openshift without any source
2. Configure liveness and readiness probes with the following parameters (an equivalent oc set probe sketch follows under Additional info):

   livenessProbe:
     httpGet:
       path: /docs
       port: 8080
       scheme: HTTP
     initialDelaySeconds: 300
     timeoutSeconds: 30
     periodSeconds: 10
     successThreshold: 1
     failureThreshold: 5

3. Scale up to 10 instances
4. Curl'ing pods directly to the same path as the probe succeeds
5. The pods eventually become ready, but only after several restarts caused by the probes reporting failures

Actual results:
Even with one replica it sometimes takes 4 container restarts to get it right (see attachment):

Mon May 29 08:59:28 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   0/1       Running   1          10m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Mon May 29 09:09:29 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   0/1       Running   3          20m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Mon May 29 09:19:30 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   1/1       Running   4          30m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Expected results:
I would expect the pods to become ready without restarts if the server is started and serving requests.

Additional info:
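The same probes can also be applied from the CLI; a minimal sketch, assuming the DeploymentConfig created in step 1 is named tomcat and that the readiness probe uses the same parameters as the liveness probe (both assumptions, not confirmed above):

# Hypothetical DC name "tomcat"; parameters mirror the livenessProbe spec above
oc set probe dc/tomcat --liveness --readiness \
  --get-url=http://:8080/docs \
  --initial-delay-seconds=300 \
  --timeout-seconds=30 \
  --period-seconds=10 \
  --success-threshold=1 \
  --failure-threshold=5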
Are all 10 scaled instances running on the same node?

We recently fixed a bug in kubernetes 1.7, and backported the fix to OCP 3.6, where failureThresholds were not being reset across pod restarts. I am wondering if that is the source of the restarts > 1 in the referenced scenario.

See: https://github.com/openshift/origin/pull/14332
(In reply to Derek Carr from comment #1)
> Are all 10 scaled instances running on the same node?

Not necessarily; it has been seen even with one instance.

> We recently fixed a bug in kubernetes 1.7, and backported the fix to OCP
> 3.6, where failureThresholds were not being reset across pod restarts. I am
> wondering if that is the source of the restarts > 1 in the referenced
> scenario.
>
> See: https://github.com/openshift/origin/pull/14332

I don't think it is related to this issue. In the "Actual results" section you can see how the restart count grew every 10 minutes, meaning the pod kept being restarted. I will try to get the events, the journal from the nodes the pods were deployed on, and any other information you find relevant.
Created attachment 1284145 [details]
journal from azuur05

Created attachment 1284146 [details]
journal from pew05

Created attachment 1284147 [details]
namespace events
Created attachment 1284148 [details]
pods dump

$ for pod in `oc get po | grep -o "tomcat-[0-9a-z\-]*"`; do
    echo --POD $pod
    echo ---- POD DESCRIPTION
    oc get po $pod -o yaml
    echo ---- POD LOGS
    oc logs $pod
    echo ---- PODS ACCESS.log
    oc exec $pod -- cat /opt/webserver/logs/localhost_access_log.2017-06-01.txt
  done > tomcat_pods.dump
Pay attention to tomcat-16-mv9sj (the last one).

**** POD: tomcat-16-7hd54 (Started OK, is ready and processed curl requests)

...
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-01T08:58:49Z
  hostIP: 10.254.250.55
  phase: Running
  podIP: 10.1.17.66
  startTime: 2017-06-01T08:58:44Z

12m  12m  1  tomcat-16-7hd54  Pod  Normal  Scheduled  {default-scheduler }  Successfully assigned tomcat-16-7hd54 to osclu1-azuur-05.uz.kuleuven.ac.be
12m  12m  1  tomcat-16-7hd54  Pod  spec.containers{tomcat}  Normal  Pulling  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  pulling image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-7hd54  Pod  spec.containers{tomcat}  Normal  Pulled  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Successfully pulled image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-7hd54  Pod  spec.containers{tomcat}  Normal  Created  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Created container with docker id b92065aa1408; Security:[seccomp=unconfined]
12m  12m  1  tomcat-16-7hd54  Pod  spec.containers{tomcat}  Normal  Started  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Started container with docker id b92065aa1408

Processed the first request at:
10.1.4.1 - - [01/Jun/2017:09:01:28 +0000] "GET / HTTP/1.1" 404 -
10.1.4.1 - - [01/Jun/2017:09:03:19 +0000] "GET / HTTP/1.1" 404 -
10.1.17.1 - - [01/Jun/2017:09:04:09 +0000] "GET /docs HTTP/1.1" 302 -

**** POD: tomcat-16-fszzw (Started OK, is ready and processed curl requests)

...
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-01T08:58:47Z
  hostIP: 10.254.185.49
  phase: Running
  podIP: 10.1.11.156
  startTime: 2017-06-01T08:58:44Z

12m  12m  1  tomcat-16-fszzw  Pod  Normal  Scheduled  {default-scheduler }  Successfully assigned tomcat-16-fszzw to osclu1-pew-05.uz.kuleuven.ac.be
12m  12m  1  tomcat-16-fszzw  Pod  spec.containers{tomcat}  Normal  Pulling  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  pulling image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-fszzw  Pod  spec.containers{tomcat}  Normal  Pulled  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Successfully pulled image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-fszzw  Pod  spec.containers{tomcat}  Normal  Created  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Created container with docker id 6dec2b636981; Security:[seccomp=unconfined]
12m  12m  1  tomcat-16-fszzw  Pod  spec.containers{tomcat}  Normal  Started  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Started container with docker id 6dec2b636981

Processed the first request at:
10.1.4.1 - - [01/Jun/2017:09:01:15 +0000] "GET / HTTP/1.1" 404 -
10.1.4.1 - - [01/Jun/2017:09:03:13 +0000] "GET / HTTP/1.1" 404 -
10.1.11.1 - - [01/Jun/2017:09:03:54 +0000] "GET /docs HTTP/1.1" 302 -

**** POD: tomcat-16-f9h76 (Started OK, is ready and processed curl requests)

...
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-01T08:58:48Z
  hostIP: 10.254.185.49
  phase: Running
  podIP: 10.1.11.157
  startTime: 2017-06-01T08:58:44Z

12m  12m  1  tomcat-16-f9h76  Pod  Normal  Scheduled  {default-scheduler }  Successfully assigned tomcat-16-f9h76 to osclu1-pew-05.uz.kuleuven.ac.be
12m  12m  1  tomcat-16-f9h76  Pod  spec.containers{tomcat}  Normal  Pulling  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  pulling image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-f9h76  Pod  spec.containers{tomcat}  Normal  Pulled  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Successfully pulled image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-f9h76  Pod  spec.containers{tomcat}  Normal  Created  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Created container with docker id b52c9dda5bb7; Security:[seccomp=unconfined]
12m  12m  1  tomcat-16-f9h76  Pod  spec.containers{tomcat}  Normal  Started  {kubelet osclu1-pew-05.uz.kuleuven.ac.be}  Started container with docker id b52c9dda5bb7

Processed the first requests at:
10.1.4.1 - - [01/Jun/2017:09:01:21 +0000] "GET / HTTP/1.1" 404 -
10.1.4.1 - - [01/Jun/2017:09:03:16 +0000] "GET / HTTP/1.1" 404 -
10.1.11.1 - - [01/Jun/2017:09:03:54 +0000] "GET /docs HTTP/1.1" 302 -

**** POD: tomcat-16-mv9sj (Required 1 restart, is NOT ready and processed curl requests)

...
    ready: false
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-01T08:58:49Z
  hostIP: 10.254.250.55
  phase: Running
  podIP: 10.1.17.68
  startTime: 2017-06-01T08:58:44Z

12m  12m  1  tomcat-16-mv9sj  Pod  Normal  Scheduled  {default-scheduler }  Successfully assigned tomcat-16-mv9sj to osclu1-azuur-05.uz.kuleuven.ac.be
4m   12m  2  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Pulling  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  pulling image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
4m   12m  2  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Pulled  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Successfully pulled image "172.30.241.23:5000/test-tomcat/tomcat@sha256:2754206e231365d5f0ae59107ccfcd677cf8769b524618051323b448b05c732e"
12m  12m  1  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Created  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Created container with docker id 1698b56bdbe0; Security:[seccomp=unconfined]
12m  12m  1  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Started  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Started container with docker id 1698b56bdbe0
4m   6m   5  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Warning  Unhealthy  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Liveness probe failed: Get http://10.1.17.68:8080/docs: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
4m   6m   5  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Warning  Unhealthy  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Readiness probe failed: Get http://10.1.17.68:8080/docs: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
4m   4m   1  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Killing  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Killing container with docker id 1698b56bdbe0: pod "tomcat-16-mv9sj_test-tomcat(84649f7a-46a8-11e7-ac29-005056ab1e93)" container "tomcat" is unhealthy, it will be killed and re-created.
4m   4m   1  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Created  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Created container with docker id 84508fbfedb0; Security:[seccomp=unconfined]
4m   4m   1  tomcat-16-mv9sj  Pod  spec.containers{tomcat}  Normal  Started  {kubelet osclu1-azuur-05.uz.kuleuven.ac.be}  Started container with docker id 84508fbfedb0

Processed the first request at (I couldn't find any request to /docs here):
10.1.4.1 - - [01/Jun/2017:09:01:33 +0000] "GET / HTTP/1.1" 404 -
10.1.4.1 - - [01/Jun/2017:09:03:24 +0000] "GET / HTTP/1.1" 404 -
10.1.4.1 - - [01/Jun/2017:09:05:19 +0000] "GET / HTTP/1.1" 404 -
Analysis given the previous comment:

1. Pod tomcat-16-mv9sj is created at 08:58:44
   -- startTime: 2017-06-01T08:58:44Z
2. The tomcat server is started at 08:59:14
   -- 2017-06-01 08:59:14,685 [main] INFO org.apache.catalina.startup.Catalina- Server startup in 2943 ms
3. The first request is received at 09:01:33
   -- 10.1.4.1 - - [01/Jun/2017:09:01:33 +0000] "GET / HTTP/1.1" 404 -
4. About 6 minutes after the pod was created (approx. 09:04), the probe fails by timeout and the pod is killed
   -- Liveness probe failed: Get http://10.1.17.68:8080/docs: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
5. About 8 minutes after the pod was created (approx. 09:06), a new container is created. It might or might not have failed after 6 more minutes; the logs don't show it.
Looking at the probe:

  initialDelaySeconds: 300
  timeoutSeconds: 30
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 5

The time between container restarts, if the probe URI is timing out, is 300 + 30 * 5 + 10 * 4 = 490s = 8m10s. That is initialDelaySeconds + timeoutSeconds * failureThreshold + periodSeconds * (failureThreshold - 1); see the sanity-check sketch after this comment.

> 5. About 8 minutes after the pod was created (approx. 09:06), a new
> container is created.

That aligns with what you are reporting. Due to the bug Derek mentioned, I would expect to see a restart every 330 seconds (5m30s) after a single probe failure on the restarted pod. If the application is ready by the first probe, then the container appears as Ready.

Looking at the report:

Mon May 29 08:59:28 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   0/1       Running   1          10m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Mon May 29 09:09:29 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   0/1       Running   3          20m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Mon May 29 09:19:30 CEST 2017
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
tomcat-16-vpc7n   1/1       Running   4          30m       10.1.17.129   osclu1-azuur-05.uz.kuleuven.ac.be

Timeline:

@0m      the container is started
@8m10s   the container restarts due to 5 probe failures
@10m     get shows 1 restart and 0/1 Ready
@13m40s  the container restarts due to 1 probe failure
@19m10s  the container restarts due to 1 probe failure
@20m     get shows 3 restarts and 0/1 Ready
@24m40s  the container restarts due to 1 probe failure
@29m40s  the container probe succeeds
@30m     get shows 4 restarts and 1/1 Ready

tl;dr The results you are reporting are the expected results once the bug referenced in comment 1 is taken into account.
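A quick sanity check of that arithmetic; a minimal shell sketch with the probe parameters hard-coded from the spec above:

# Probe parameters from the reported deployment config
DELAY=300; TIMEOUT=30; PERIOD=10; FAIL=5

# First restart: initial delay, then failureThreshold probes that each hang
# for timeoutSeconds, spaced periodSeconds apart
echo $(( DELAY + TIMEOUT * FAIL + PERIOD * (FAIL - 1) ))  # 490 (8m10s)

# Later restarts while the failure counter is not reset (the bug from
# comment 1): a single failed probe after the initial delay is enough
echo $(( DELAY + TIMEOUT ))  # 330 (5m30s)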
*** This bug has been marked as a duplicate of bug 1455056 ***
The question here is why the probe fails and the pod is restarted even though the server is capable of processing requests, regardless of the restart count.
Sorry Ruben, I misunderstood what the issue was.

If the openshift node process cannot contact the application on its pod IP address, then yes, the probe will always fail. This looks like a networking issue.

However, I do see you mentioned:

> 4. Curl'ing pods directly to the same path as the probe succeeds

Is this confirming that a manual curl from the node to the pod works at the same time the probe is failing? Could there be an intermittent network issue?

Is this a containerized install?
(In reply to Seth Jennings from comment #14)
> Sorry Ruben, I misunderstood what the issue was.
>
> If the openshift node process cannot contact the application on its pod IP
> address, then yes, the probe will always fail. This looks like a networking
> issue.
>
> However, I do see you mentioned:
>
> > 4. Curl'ing pods directly to the same path as the probe succeeds
>
> Is this confirming that a manual curl from the node to the pod works at the
> same time the probe is failing? Could there be an intermittent network
> issue?

See the networking issue the installation is also experiencing: https://bugzilla.redhat.com/show_bug.cgi?id=1454948

> Is this a containerized install?

No, it is not, but it is partially deployed on Azure and OpenStack.
One reproducer is a containerized install, the other is not.
Reported case also on OCP 3.4.1-24. Updated the version label accordingly.
At the moment of a liveness failure, can you collect `ip neigh` on the host and `ip a` inside the container that is failing to respond?
I'd also love to see a tcpdump inside and outside the container in question during a (couple of?) liveness failures. Something like:

tcpdump -i any -nn port 8080

If you don't have tcpdump inside the container, you can actually grab the same thing from outside if you know the PID of a process in the right container. You should be able to get that from docker ps and docker inspect (see the sketch below). Then on the host you can run:

nsenter -t $PID -n -- tcpdump -i any -nn port 8080
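A minimal sketch of that PID lookup, assuming the docker CLI is available on the node and that the container name contains "tomcat" (the name filter is an assumption):

# Container ID of the first matching tomcat container (name filter assumed)
CID=$(docker ps --filter name=tomcat --format '{{.ID}}' | head -n1)

# PID of the container's init process, as reported by docker
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# Run the capture from inside the container's network namespace
nsenter -t "$PID" -n -- tcpdump -i any -nn port 8080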
Also, at the moment of failure, from the host the container is running on, can you do:

tcpdump -n -i tun0 -w capture.cap host 192.168.1.1

(but replace that IP with the IP of the pod that is failing to respond)

Then please get me the capture.cap file.
Created attachment 1288095 [details]
nsenter.tcpdump.port.8080.cap

Created attachment 1288096 [details]
tcpdump.port.8080.cap

Created attachment 1288097 [details]
ipneigh.out

Created attachment 1288098 [details]
capture.cap
[root@osclu1-cross-01 ~]# oc get pods -o wide
NAME              READY     STATUS      RESTARTS   AGE       IP            NODE
tomcat-20-f9d7v   0/1       Running     0          30m       10.1.11.34    osclu1-pew-05.uz.kuleuven.ac.be
tomcat-20-fmnqt   1/1       Running     0          30m       10.1.17.88    osclu1-azuur-05.uz.kuleuven.ac.be
tomcat-3-build    0/1       Completed   0          7d        10.1.11.128   osclu1-pew-05.uz.kuleuven.ac.be

ip neigh > /tmp/ipneigh.out
nsenter -t 15455 -n -- ip a > /tmp/ipa.out
tcpdump -i any -nn port 8080 -w /tmp/tcpdump.port.8080
nsenter -t 15455 -n -- tcpdump -i any -nn port 8080 -w /tmp/nsenter.tcpdump.port.8080
tcpdump -n -i tun0 -w capture.cap host 10.1.11.34

I let the dumps run for about 30 seconds. The checks are done every 10 seconds.
Can we please get the /tmp/ipa.out attachment? That has the critical piece of information in it.
Created attachment 1288158 [details]
ipa.out
From ipneigh.out (the node's ARP cache entry for the failing pod):

10.1.11.34 dev tun0 lladdr de:0e:f4:db:09:7c REACHABLE

From ipa.out (inside the container):

3: eth0@if2140: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
    link/ether 62:fd:49:57:02:9b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.11.34/24 scope global eth0
Maybe I'm kicking in an open door, but the MAC addresses are not the same. Is that why the checks are not working?
markv: Precisely. https://github.com/openshift/origin/pull/14685 will address the symptom. I'm still not sure why the ARP cache gets into this state, though.
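For anyone reproducing this, a minimal sketch of the comparison, reusing the pod IP (10.1.11.34) and container PID (15455) captured in comment 26:

# MAC the node's ARP cache holds for the pod IP
ip neigh show 10.1.11.34

# Actual MAC of eth0 inside the container's network namespace
nsenter -t 15455 -n -- ip a show eth0 | grep link/ether

# If the lladdr from the first command differs from the link/ether in the
# second, probe packets from the node are addressed to a stale MAC and never
# reach the container, so the probe times out.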
Ben, can I dup this to 1451854 then?
*** Bug 1461505 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1451854 ***