Bug 2092647
| Summary: | Router pod consumes entire hour of terminationGracePeriodSeconds | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Networking | Assignee: | aos-network-edge-staff <aos-network-edge-staff> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | aos-bugs, hongli, mmasters |
| Version: | 4.11 | Target Release: | 4.11.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | | |
| Last Closed: | 2022-06-16 23:12:07 UTC | Type: | Bug |
Description
W. Trevor King
2022-06-01 22:49:53 UTC
Is this a frequent failure? These two log messages indicate a clean shutdown:

```
I0601 11:13:19.088066 1 router.go:618] template "msg"="router reloaded" "output"=" - Shutting down\n"
I0601 11:13:19.088093 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

The first message indicates that the OpenShift router observed that all haproxy processes terminated. The second message indicates that the openshift-router process was exiting (after that log message, the openshift-router process sleeps 1 second and then returns to the operating system; see the sketch after this comment). So we are fairly confident that the processes in the router pod did terminate. This is likely a problem with CRI-O or the kubelet. Therefore I am sending this BZ to the node team.

I'm not sure how to get a handle on frequency. Jerry found [1] with similar symptoms:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-866fd867bc-rxr9p_machine-config-controller.log | grep -o 'node.*Drain failed.*' | sort | uniq -c
  1 node ip-10-0-147-171.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "apiserver-7dbfd8d945-7kn67" terminating: global timeout reached: 1m30s
 18 node ip-10-0-194-58.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "router-default-855b574d4d-wpbbj" terminating: global timeout reached: 1m30s
```

And again the hour delay:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-194-58.us-east-2.compute.internal/journal | zgrep router-default-855b574d4d-wpbbj | grep -B1 'Stopped container'
May 31 17:14:19.597384 ip-10-0-194-58 hyperkube[1404]: I0531 17:14:19.597377 1404 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-855b574d4d-wpbbj" podUID=7f13f3fd-272b-49f0-8c1c-2d522c471b5a containerName="router" containerID="cri-o://9f05ca8e6a44b097e8a171c4fe5798ea22db4f91ffc01a79044054087b281a12" gracePeriod=3600
May 31 18:14:19.903483 ip-10-0-194-58 crio[1374]: time="2022-05-31 18:14:19.903342974Z" level=info msg="Stopped container 9f05ca8e6a44b097e8a171c4fe5798ea22db4f91ffc01a79044054087b281a12: openshift-ingress/router-default-855b574d4d-wpbbj/router" id=693d22a7-578a-464a-9df3-1408f2d96bb7 name=/runtime.v1.RuntimeService/StopContainer
```

And in Loki, the query

```
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896"} | unpack | pod_name="router-default-855b574d4d-wpbbj"
```

includes:

```
I0531 17:33:35.327412 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

So in both cases it is taking the router surprisingly long (~15m) to get from the kubelet's claimed "Killing container" to the router's "Shutdown requested", and then another ~45m to get from the router's "Shutdown complete" to CRI-O noticing the container as stopped. I'm unclear about both of those delays.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896
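For readers unfamiliar with the shutdown sequence described above, here is a minimal Go sketch of it. This is not the openshift-router source; the `sleep` child is a purely illustrative stand-in for haproxy. On SIGTERM the parent stops the child, waits for it, logs the completion message, pauses one second, and returns to the operating system.

```go
// Minimal sketch (not the actual openshift-router code) of the shutdown
// sequence described above.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Stand-in for the haproxy process that the router manages.
	child := exec.Command("sleep", "3600")
	if err := child.Start(); err != nil {
		log.Fatalf("starting child: %v", err)
	}

	// Block until a TERM arrives, e.g. the one delivered when the
	// container's grace period starts.
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
	<-term

	// "Shutting down": ask the child to stop, then reap it.
	_ = child.Process.Signal(syscall.SIGTERM)
	_ = child.Wait()

	// Mirror the "Shutdown complete, exiting" message quoted above, then
	// sleep one second before returning to the operating system, as the
	// comment describes.
	log.Println(`router "msg"="Shutdown complete, exiting"`)
	time.Sleep(1 * time.Second)
	os.Exit(0)
}
```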
It looks like CRI-O is behaving as expected here. We can see in the Loki logs:

```
2022-06-14 17:52:52 I0614 17:52:51.988338 1 template.go:437] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 49dce8fe\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2022-06-11T07:34:52Z\n"
...
2022-06-14 17:52:52 I0614 17:52:52.047282 1 template.go:437] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 11109e4028b69749d6f842a4da682916e0d91d2f\nversionFromGit: 4.0.0-370-g11109e40\ngitTreeState: clean\nbuildDate: 2022-06-01T00:00:21Z\n"
```

I actually think this is a bug in Loki. According to a conversation with Miciah on Slack, those two versions could be expected to be the 4.10 and 4.11 versions of the pod. Looking at the CRI-O logs, we see:

```
Jun 14 17:52:39.082363 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.082315 1511 kubelet.go:2125] "SyncLoop ADD" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:39.437726 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:39.436727436Z" level=info msg="Running pod sandbox: openshift-ingress/router-default-7678459c85-9twn7/POD" id=a9411cdd-33c8-4e34-bf72-554475493896 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:39.480946 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.480906 1511 kubelet.go:2141] "SyncLoop DELETE" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:39.551603 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.546844 1511 kubelet.go:2125] "SyncLoop ADD" source="api" pods=[openshift-ingress/router-default-75c97bc477-bc4jn]
Jun 14 17:52:39.976520 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:39.976427379Z" level=info msg="Running pod sandbox: openshift-ingress/router-default-75c97bc477-bc4jn/POD" id=62362160-9a99-4a11-9786-d897904ce128 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:41.106115 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:41.106080 1511 kubelet.go:2141] "SyncLoop DELETE" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:41.121657 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:41.121602041Z" level=info msg="Ran pod sandbox 499a80bc4a1339bcd298a1f3373ddd60ce21e05f3a4a79a1367acf865b41731f with infra container: openshift-ingress/router-default-7678459c85-9twn7/POD" id=a9411cdd-33c8-4e34-bf72-554475493896 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:44.043785 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:44.042515875Z" level=info msg="Ran pod sandbox 59c1397a57b90750b8e689967240fe37c49e4bdd45ceb6e3eda60400a636539c with infra container: openshift-ingress/router-default-75c97bc477-bc4jn/POD" id=62362160-9a99-4a11-9786-d897904ce128 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:51.666362 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.666317438Z" level=info msg="Creating container: openshift-ingress/router-default-7678459c85-9twn7/router" id=6ad0c5f9-5f49-4fef-84ef-4e50c20329c3 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:51.689527 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.689479865Z" level=info msg="Creating container: openshift-ingress/router-default-75c97bc477-bc4jn/router" id=632e8841-f9bf-4e08-ac0c-8a4773cdf947 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:51.936824 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.934960884Z" level=info msg="Started container" PID=50483 containerID=ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697 description=openshift-ingress/router-default-75c97bc477-bc4jn/router id=ebae24b5-1530-4686-94b7-4d84989ad56f name=/runtime.v1.RuntimeService/StartContainer sandboxID=59c1397a57b90750b8e689967240fe37c49e4bdd45ceb6e3eda60400a636539c
Jun 14 17:52:52.011222 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:52.011168423Z" level=info msg="Started container" PID=50475 containerID=110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6 description=openshift-ingress/router-default-7678459c85-9twn7/router id=69904695-ffc5-4775-9616-b901cb9e438c name=/runtime.v1.RuntimeService/StartContainer sandboxID=499a80bc4a1339bcd298a1f3373ddd60ce21e05f3a4a79a1367acf865b41731f
Jun 14 17:52:51.982618 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.982552043Z" level=info msg="Created container 110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6: openshift-ingress/router-default-7678459c85-9twn7/router" id=6ad0c5f9-5f49-4fef-84ef-4e50c20329c3 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:52.331977 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:52.331917 1511 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-7678459c85-9twn7" podUID=15ac6c91-6429-493d-9706-1d0e7ce9e61a containerName="router" containerID="cri-o://110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6" gracePeriod=3600
```

which confirms that there are two versions of the pod running, supporting the theory that Loki is confusing them.
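For context on the gracePeriod=3600 in that last kubelet line: the runtime delivers SIGTERM at the start of the grace period and only falls back to SIGKILL after terminationGracePeriodSeconds (an hour for router-default) if the process has not exited. A process that registers a SIGTERM handler but never acts on it, which is roughly the failure mode attributed to bug 2076297 later in this report, therefore survives for exactly that hour. A minimal Go sketch of that behavior, illustrative only and not taken from the router code:

```go
// Illustrative sketch of a process that swallows SIGTERM and so consumes the
// full termination grace period before being killed. Not the router code.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Registering for SIGTERM replaces the default "terminate" disposition:
	// the signal is delivered to the channel instead of stopping the process.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		for s := range sigs {
			// The handler never initiates a shutdown, so the runtime's TERM
			// is effectively ignored.
			log.Printf("received %v, ignoring", s)
		}
	}()

	// The process keeps running until SIGKILL, which cannot be caught,
	// arrives once the grace period has elapsed.
	for {
		time.Sleep(time.Minute)
	}
}
```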
Much later, we see

```
Jun 14 18:29:18.059094 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 18:29:18.059060 1511 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-75c97bc477-bc4jn" podUID=448256d1-ca47-418b-9418-07757523c031 containerName="router" containerID="cri-o://ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697" gracePeriod=3600
Jun 14 18:30:05.306223 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 18:30:05.306103421Z" level=info msg="Stopped container ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697: openshift-ingress/router-default-75c97bc477-bc4jn/router" id=e3194e4a-90b9-47fb-b114-1d50ba843ef6 name=/runtime.v1.RuntimeService/StopContainer
```

which corresponds in time to the Loki message

```
I0614 18:30:04.088614 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

Since we've shown that Loki is conflating the two pods, and that one of the router pods is stopped as expected shortly after it reports shutdown, I don't think CRI-O is missing the original one's exit. Much later, after the hour grace period, we see

```
Jun 14 18:52:52.650055 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 18:52:52.647888879Z" level=info msg="Stopped container 110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6: openshift-ingress/router-default-7678459c85-9twn7/router" id=dfcab30e-8cf6-462b-95dc-ef1186c36574 name=/runtime.v1.RuntimeService/StopContainer
```

indicating both pods were eventually cleaned up. As such, I believe this is a router bug, where having two copies of the router causes one to catch and ignore the signal from CRI-O saying to shut down. I am rerouting back to router and unsetting the blocker flag. Since this happens pretty infrequently, it's not clear to me that it should be a blocker from the router perspective. From the CRI-O side, if we were missing a container exit for so long, it would feel more blockery to me.

OK, lots of digging through logs and theorizing with Peter and Miciah, and I think we've unraveled this. Unpacking [1]:

* The 4.10 cluster comes up and happens to assign both router-default pods to the same node, because bug 2089336 is still POST for 4.10.z (verification discussion continues in bug 2062459).
* The 4.10 cluster is running ReplicaSet router-default-7678459c85.
* 17:52:09, the cluster update triggers the incoming ingress-operator-64785c445d.
* 17:52:36, MalscheduledPod: pod/router-default-7678459c85-7gcvk pod/router-default-7678459c85-rgfpc should be one per node, but all were placed on node/ci-op-853gri6b-fde6e-x6vmd-worker-centralus2-jcc7b; evicting pod/router-default-7678459c85-7gcvk.
* The ingress operator bumps the router-default Deployment to request the updated image, etc.
* Incoming ReplicaSet router-default-75c97bc477 with the update.
* 17:52:39.082363, the kubelet tells CRI-O to create router-default-7678459c85-9twn7.
* 17:52:39.437726, CRI-O creates a9411cdd-... for ...-9twn7.
* 17:52:39.480946, the kubelet tells CRI-O to delete ...-9twn7, only ~0.4s after asking for its creation; no idea why.
* 17:52:39.551603, the kubelet tells CRI-O to create router-default-75c97bc477-bc4jn.
* 17:52:39.976520, CRI-O creates 62362160-... for ...-bc4jn.
* 17:52:41.106115, the kubelet tells CRI-O to delete ...-9twn7 again.
* 17:52:41.121657, CRI-O is done creating a9411cdd-... for ...-9twn7 (which is no longer needed).
* 17:52:44.043785, CRI-O is done creating 62362160-... for ...-bc4jn.
* 17:52:51.666362, CRI-O creates 6ad0c5f9-... for ...-9twn7 (which is no longer needed).
* 17:52:51.689527, CRI-O creates 632e8841-... for ...-bc4jn.
* 17:52:51.936824, CRI-O logs PID 50483 for containerID=ba35... for ...-bc4jn.
* 17:52:52.011222, CRI-O logs PID 50475 for containerID=110d... for ...-9twn7 (which is no longer needed).
* 17:52:52.331977, CRI-O TERMs containerID=110d... for ...-9twn7 with a 1h grace period.
* containerID=110d... ignores the TERM, because 4.10 routers are apparently also impacted by bug 2076297, and the fix there has not yet been backported to 4.10.z.
* ...
* 18:29:18.059094, CRI-O TERMs containerID=ba35... for ...-bc4jn with a 1h grace period.
* 18:30:04.088614, the router logs "Shutdown complete, exiting".
* 18:30:05.306223, CRI-O logs "Stopped container" (reaping the process?) for containerID=ba35... for ...-bc4jn.
* ...
* 18:52:52.650055, CRI-O logs "Stopped container" for containerID=110d... for ...-9twn7.

So I'm going to close this as a dup of bug 2076297, although if bug 2089336 had landed, we'd get bitten by this less frequently in CI.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1536750651021725696

*** This bug has been marked as a duplicate of bug 2076297 ***