Bug 2076297
Summary: | Router process ignores shutdown signal while starting up | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Grant Spence <gspence> |
Component: | Networking | Assignee: | Grant Spence <gspence> |
Networking sub component: | router | QA Contact: | Shudi Li <shudili> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | aos-bugs, hongli, wking |
Version: | 4.10 | ||
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: The openshift-router process briefly ignored the SIGTERM shutdown signal while it was starting up.
Consequence: If a SIGTERM was sent while the router process was starting up, the signal was ignored. This meant the container would ignore a Kubernetes shutdown request, resulting in the container taking one hour to shut down (terminationGracePeriodSeconds).
Fix: Propagate the SIGTERM handler in the Go code to the cache initialization function.
Result: The router now responds to SIGTERM signals during its initialization.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-08-10 11:07:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2098230 |
Description
Grant Spence
2022-04-18 15:30:26 UTC
The issue where pods were stuck in Terminating for an hour was also seen in Bug 2076193 - oc patch command for the liveness probe and readiness probe parameters of an OpenShift router deployment doesn't take effect.

Tested with 4.11.0-0.nightly-2022-04-21-025500; the issue could not be reproduced.

1. oc patch an OpenShift router deployment with liveness probe and readiness probe:

```
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
```

2. The router pod is terminated within about 45 seconds:

```
% oc -n openshift-ingress get pods
NAME                              READY   STATUS              RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running             0          140m
router-default-75577c49f9-6rtts   1/1     Terminating         0          140m
router-default-75577c49f9-dkkxd   0/1     ContainerCreating   0          7s

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          140m
router-default-75577c49f9-6rtts   1/1     Terminating   0          140m
router-default-75577c49f9-dkkxd   1/1     Running       0          20s

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          140m
router-default-75577c49f9-6rtts   1/1     Terminating   0          140m
router-default-75577c49f9-dkkxd   1/1     Running       0          32s

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          141m
router-default-75577c49f9-6rtts   1/1     Terminating   0          141m
router-default-75577c49f9-dkkxd   1/1     Running       0          38s

% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running   0          143m
router-default-75577c49f9-dkkxd   1/1     Running   0          2m55s
```

3. The cluster version:

```
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-21-025500   True        False         132m    Cluster version is 4.11.0-0.nightly-2022-04-21-025500
```

Apparently applies to at least 4.10 as well; comments to come in bug 2092647, which I'm about to close as a dup of this one.

*** Bug 2092647 has been marked as a duplicate of this bug. ***

Ok, [1] walks through how a 4.10 router with this issue could expose us to problems on 4.10 to 4.11 updates. Fixing bug 2089336 in 4.10.z would also have avoided the race we're seeing in CI, because getting bitten requires:

1. Install a 4.10.z that does not include the fix for bug 2089336.
2. Get bit by bug 2089336's bootstrap-router-lease race to get mis-scheduled router pods.
3. Never do anything that would trigger a post-install reschedule, like updating within 4.10, or touching a drain-inducing MachineConfig.
4. Update to 4.11 and pick up the MalscheduledPod-instrumented ingress operator.

And even then, the impact is just "that node takes an hour to update". So while it's probably worth taking this fix back to 4.10.z, we can probably let it cook longer before backporting.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2092647#c9

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069