Bug 2092647
| Summary: | Router pod consumes entire hour of terminationGracePeriodSeconds | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Networking | Assignee: | aos-network-edge-staff <aos-network-edge-staff> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | aos-bugs, hongli, mmasters |
| Version: | 4.11 | Target Release: | 4.11.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | | |
| Last Closed: | 2022-06-16 23:12:07 UTC | Type: | Bug |
Description
W. Trevor King
2022-06-01 22:49:53 UTC
Is this a frequent failure? These two log messages indicate a clean shutdown:

```
I0601 11:13:19.088066 1 router.go:618] template "msg"="router reloaded" "output"=" - Shutting down\n"
I0601 11:13:19.088093 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

The first message indicates that the OpenShift router observed that all haproxy processes terminated. The second message indicates that the openshift-router process was exiting (after that log message, the openshift-router process sleeps 1 second and then returns to the operating system; see the sketch after this comment). So we are fairly confident that the processes in the router pod did terminate. This is likely a problem with CRI-O or the kubelet. Therefore I am sending this BZ to the node team.

I'm not sure how to get a handle on frequency. Jerry found [1] with similar symptoms:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-866fd867bc-rxr9p_machine-config-controller.log | grep -o 'node.*Drain failed.*' | sort | uniq -c
  1 node ip-10-0-147-171.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "apiserver-7dbfd8d945-7kn67" terminating: global timeout reached: 1m30s
 18 node ip-10-0-194-58.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "router-default-855b574d4d-wpbbj" terminating: global timeout reached: 1m30s
```

And again the hour delay:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-194-58.us-east-2.compute.internal/journal | zgrep router-default-855b574d4d-wpbbj | grep -B1 'Stopped container'
May 31 17:14:19.597384 ip-10-0-194-58 hyperkube[1404]: I0531 17:14:19.597377 1404 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-855b574d4d-wpbbj" podUID=7f13f3fd-272b-49f0-8c1c-2d522c471b5a containerName="router" containerID="cri-o://9f05ca8e6a44b097e8a171c4fe5798ea22db4f91ffc01a79044054087b281a12" gracePeriod=3600
May 31 18:14:19.903483 ip-10-0-194-58 crio[1374]: time="2022-05-31 18:14:19.903342974Z" level=info msg="Stopped container 9f05ca8e6a44b097e8a171c4fe5798ea22db4f91ffc01a79044054087b281a12: openshift-ingress/router-default-855b574d4d-wpbbj/router" id=693d22a7-578a-464a-9df3-1408f2d96bb7 name=/runtime.v1.RuntimeService/StopContainer
```

And in Loki, the query

```
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896"} | unpack | pod_name="router-default-855b574d4d-wpbbj"
```

includes:

```
I0531 17:33:35.327412 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

So in both cases it is taking the router surprisingly long (~15m) to get from the kubelet's claimed "Killing container" to the router's "Shutdown requested", and then another ~45m to get from the router's "Shutdown complete" to CRI-O noticing the container as stopped. I'm unclear about both of those delays.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade/1531666974952656896
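For readers unfamiliar with the shutdown sequence described above, here is a minimal Go sketch of it. This is not the openshift-router source; the `sleep` child is a purely illustrative stand-in for haproxy. On SIGTERM the parent stops the child, waits for it, logs the completion message, pauses one second, and returns to the operating system.

```go
// Minimal sketch (not the actual openshift-router code) of the shutdown
// sequence described above.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Stand-in for the haproxy process that the router manages.
	child := exec.Command("sleep", "3600")
	if err := child.Start(); err != nil {
		log.Fatalf("starting child: %v", err)
	}

	// Block until a TERM arrives, e.g. the one delivered when the
	// container's grace period starts.
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
	<-term

	// "Shutting down": ask the child to stop, then reap it.
	_ = child.Process.Signal(syscall.SIGTERM)
	_ = child.Wait()

	// Mirror the "Shutdown complete, exiting" message quoted above, then
	// sleep one second before returning to the operating system, as the
	// comment describes.
	log.Println(`router "msg"="Shutdown complete, exiting"`)
	time.Sleep(1 * time.Second)
	os.Exit(0)
}
```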
It looks like CRI-O is behaving as expected here. We can see in the Loki logs:

```
2022-06-14 17:52:52 I0614 17:52:51.988338 1 template.go:437] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 49dce8fe\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2022-06-11T07:34:52Z\n"
...
2022-06-14 17:52:52 I0614 17:52:52.047282 1 template.go:437] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 11109e4028b69749d6f842a4da682916e0d91d2f\nversionFromGit: 4.0.0-370-g11109e40\ngitTreeState: clean\nbuildDate: 2022-06-01T00:00:21Z\n"
```

I actually think this is a bug in Loki. According to a conversation with Miciah on Slack, those two versions could be expected to be the 4.10 and 4.11 versions of the pod. Looking at the CRI-O logs, we see:

```
Jun 14 17:52:39.082363 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.082315 1511 kubelet.go:2125] "SyncLoop ADD" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:39.437726 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:39.436727436Z" level=info msg="Running pod sandbox: openshift-ingress/router-default-7678459c85-9twn7/POD" id=a9411cdd-33c8-4e34-bf72-554475493896 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:39.480946 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.480906 1511 kubelet.go:2141] "SyncLoop DELETE" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:39.551603 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:39.546844 1511 kubelet.go:2125] "SyncLoop ADD" source="api" pods=[openshift-ingress/router-default-75c97bc477-bc4jn]
Jun 14 17:52:39.976520 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:39.976427379Z" level=info msg="Running pod sandbox: openshift-ingress/router-default-75c97bc477-bc4jn/POD" id=62362160-9a99-4a11-9786-d897904ce128 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:41.106115 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:41.106080 1511 kubelet.go:2141] "SyncLoop DELETE" source="api" pods=[openshift-ingress/router-default-7678459c85-9twn7]
Jun 14 17:52:41.121657 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:41.121602041Z" level=info msg="Ran pod sandbox 499a80bc4a1339bcd298a1f3373ddd60ce21e05f3a4a79a1367acf865b41731f with infra container: openshift-ingress/router-default-7678459c85-9twn7/POD" id=a9411cdd-33c8-4e34-bf72-554475493896 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:44.043785 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:44.042515875Z" level=info msg="Ran pod sandbox 59c1397a57b90750b8e689967240fe37c49e4bdd45ceb6e3eda60400a636539c with infra container: openshift-ingress/router-default-75c97bc477-bc4jn/POD" id=62362160-9a99-4a11-9786-d897904ce128 name=/runtime.v1.RuntimeService/RunPodSandbox
Jun 14 17:52:51.666362 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.666317438Z" level=info msg="Creating container: openshift-ingress/router-default-7678459c85-9twn7/router" id=6ad0c5f9-5f49-4fef-84ef-4e50c20329c3 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:51.689527 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.689479865Z" level=info msg="Creating container: openshift-ingress/router-default-75c97bc477-bc4jn/router" id=632e8841-f9bf-4e08-ac0c-8a4773cdf947 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:51.936824 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.934960884Z" level=info msg="Started container" PID=50483 containerID=ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697 description=openshift-ingress/router-default-75c97bc477-bc4jn/router id=ebae24b5-1530-4686-94b7-4d84989ad56f name=/runtime.v1.RuntimeService/StartContainer sandboxID=59c1397a57b90750b8e689967240fe37c49e4bdd45ceb6e3eda60400a636539c
Jun 14 17:52:52.011222 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:52.011168423Z" level=info msg="Started container" PID=50475 containerID=110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6 description=openshift-ingress/router-default-7678459c85-9twn7/router id=69904695-ffc5-4775-9616-b901cb9e438c name=/runtime.v1.RuntimeService/StartContainer sandboxID=499a80bc4a1339bcd298a1f3373ddd60ce21e05f3a4a79a1367acf865b41731f
Jun 14 17:52:51.982618 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 17:52:51.982552043Z" level=info msg="Created container 110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6: openshift-ingress/router-default-7678459c85-9twn7/router" id=6ad0c5f9-5f49-4fef-84ef-4e50c20329c3 name=/runtime.v1.RuntimeService/CreateContainer
Jun 14 17:52:52.331977 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 17:52:52.331917 1511 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-7678459c85-9twn7" podUID=15ac6c91-6429-493d-9706-1d0e7ce9e61a containerName="router" containerID="cri-o://110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6" gracePeriod=3600
```

which confirms that there are two versions of the pod running, supporting the theory that Loki is confusing them.
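For context on the gracePeriod=3600 in that last kubelet line: the runtime delivers SIGTERM at the start of the grace period and only falls back to SIGKILL after terminationGracePeriodSeconds (an hour for router-default) if the process has not exited. A process that registers a SIGTERM handler but never acts on it, which is roughly the failure mode attributed to bug 2076297 later in this report, therefore survives for exactly that hour. A minimal Go sketch of that behavior, illustrative only and not taken from the router code:

```go
// Illustrative sketch of a process that swallows SIGTERM and so consumes the
// full termination grace period before being killed. Not the router code.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Registering for SIGTERM replaces the default "terminate" disposition:
	// the signal is delivered to the channel instead of stopping the process.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		for s := range sigs {
			// The handler never initiates a shutdown, so the runtime's TERM
			// is effectively ignored.
			log.Printf("received %v, ignoring", s)
		}
	}()

	// The process keeps running until SIGKILL, which cannot be caught,
	// arrives once the grace period has elapsed.
	for {
		time.Sleep(time.Minute)
	}
}
```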
Much later, we see

```
Jun 14 18:29:18.059094 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt hyperkube[1511]: I0614 18:29:18.059060 1511 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-ingress/router-default-75c97bc477-bc4jn" podUID=448256d1-ca47-418b-9418-07757523c031 containerName="router" containerID="cri-o://ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697" gracePeriod=3600
Jun 14 18:30:05.306223 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 18:30:05.306103421Z" level=info msg="Stopped container ba35c5add80459b330dda07a02f2f36d059d2aa29e1fa5d1954341577724a697: openshift-ingress/router-default-75c97bc477-bc4jn/router" id=e3194e4a-90b9-47fb-b114-1d50ba843ef6 name=/runtime.v1.RuntimeService/StopContainer
```

which corresponds in time to the Loki message

```
I0614 18:30:04.088614 1 template.go:710] router "msg"="Shutdown complete, exiting"
```

Since we've shown that Loki is conflating the two pods, and that one of the router pods is stopped as expected shortly after it reports shutdown, I don't think CRI-O is missing the original one's exit. Much later, after the hour grace period, we see

```
Jun 14 18:52:52.650055 ci-op-853gri6b-fde6e-x6vmd-worker-centralus3-9nvjt crio[1477]: time="2022-06-14 18:52:52.647888879Z" level=info msg="Stopped container 110d55901a25ccbfb4001a91fe6db0c2e29557f1288bdbbe79ebebc7edf762a6: openshift-ingress/router-default-7678459c85-9twn7/router" id=dfcab30e-8cf6-462b-95dc-ef1186c36574 name=/runtime.v1.RuntimeService/StopContainer
```

indicating both pods were eventually cleaned up. As such, I believe this is a router bug, where having two copies of the router causes one to catch and ignore the signal from CRI-O saying to shut down. I am rerouting back to router and unsetting the blocker flag. Since this happens pretty infrequently, it's not clear to me that it should be a blocker from the router perspective. From the CRI-O side, if we were missing a container exit for so long, it would feel more blockery to me.

OK, lots of digging through logs and theorizing with Peter and Miciah, and I think we've unraveled this. Unpacking [1]:

* The 4.10 cluster comes up and happens to assign both router-default pods to the same node, because bug 2089336 is still POST for 4.10.z (verification discussion continues in bug 2062459).
* The 4.10 cluster is running ReplicaSet router-default-7678459c85.
* 17:52:09, the cluster update triggers the incoming ingress-operator-64785c445d.
* 17:52:36, MalscheduledPod: pod/router-default-7678459c85-7gcvk pod/router-default-7678459c85-rgfpc should be one per node, but all were placed on node/ci-op-853gri6b-fde6e-x6vmd-worker-centralus2-jcc7b; evicting pod/router-default-7678459c85-7gcvk.
* The ingress operator bumps the router-default Deployment to request the updated image, etc.
* Incoming ReplicaSet router-default-75c97bc477 with the update.
* 17:52:39.082363, the kubelet tells CRI-O to create router-default-7678459c85-9twn7.
* 17:52:39.437726, CRI-O creates a9411cdd-... for ...-9twn7.
* 17:52:39.480946, the kubelet tells CRI-O to delete ...-9twn7, only ~0.4s after asking for its creation; no idea why.
* 17:52:39.551603, the kubelet tells CRI-O to create router-default-75c97bc477-bc4jn.
* 17:52:39.976520, CRI-O creates 62362160-... for ...-bc4jn.
* 17:52:41.106115, the kubelet tells CRI-O to delete ...-9twn7 again.
* 17:52:41.121657, CRI-O is done creating a9411cdd-... for ...-9twn7 (which is no longer needed).
* 17:52:44.043785, CRI-O is done creating 62362160-... for ...-bc4jn.
* 17:52:51.666362, CRI-O creates 6ad0c5f9-... for ...-9twn7 (which is no longer needed).
* 17:52:51.689527, CRI-O creates 632e8841-... for ...-bc4jn.
* 17:52:51.936824, CRI-O logs PID 50483 for containerID=ba35... for ...-bc4jn.
* 17:52:52.011222, CRI-O logs PID 50475 for containerID=110d... for ...-9twn7 (which is no longer needed).
* 17:52:52.331977, CRI-O TERMs containerID=110d... for ...-9twn7 with a 1h grace period.
* containerID=110d... ignores the TERM, because 4.10 routers are apparently also impacted by bug 2076297, and the fix there has not yet been backported to 4.10.z.
* ...
* 18:29:18.059094, CRI-O TERMs containerID=ba35... for ...-bc4jn with a 1h grace period.
* 18:30:04.088614, the router logs "Shutdown complete, exiting".
* 18:30:05.306223, CRI-O logs "Stopped container" (reaping the process?) for containerID=ba35... for ...-bc4jn.
* ...
* 18:52:52.650055, CRI-O logs "Stopped container" for containerID=110d... for ...-9twn7.

So I'm going to close this as a dup of bug 2076297, although if bug 2089336 had landed, we'd get bitten by this less frequently in CI.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1536750651021725696

*** This bug has been marked as a duplicate of bug 2076297 ***