Bug 2076297 - Router process ignores shutdown signal while starting up
Summary: Router process ignores shutdown signal while starting up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Grant Spence
QA Contact: Shudi Li
URL:
Whiteboard:
Duplicates: 2092647 (view as bug list)
Depends On:
Blocks: 2098230
 
Reported: 2022-04-18 15:30 UTC by Grant Spence
Modified: 2022-08-10 11:08 UTC (History)
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The openshift-router process ignored the SIGTERM shutdown signal for a brief moment while it was starting. Consequence: If a SIGTERM was sent while the router process was starting up, the signal was ignored. The container therefore ignored a Kubernetes shutdown request, and shutting it down took 1 hour (terminationGracePeriodSeconds). Fix: Propagate the SIGTERM handler in the Go code to the cache initialization function. Result: The router now responds to SIGTERM signals during its initialization.
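The fix described above ("propagate the SIGTERM handler to the cache initialization function") follows a common Go pattern. A minimal sketch of that pattern, assuming signal.NotifyContext; waitForCaches is an illustrative stand-in, not the actual openshift/router code:

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// waitForCaches is an illustrative stand-in for the router's cache
// initialization. The point of the fix is that it now takes a context that is
// cancelled on SIGTERM, so a shutdown request interrupts the wait.
func waitForCaches(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // normal startup path
		return nil
	case <-ctx.Done(): // SIGTERM arrived while caches were still syncing
		return ctx.Err()
	}
}

func main() {
	// Tie the process lifetime to SIGTERM from the very start.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	if err := waitForCaches(ctx); err != nil {
		// Exit promptly instead of sitting out terminationGracePeriodSeconds.
		fmt.Println("startup interrupted:", err)
		return
	}

	fmt.Println("caches synced, serving")
	<-ctx.Done()
	fmt.Println("shutting down")
}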
Clone Of:
Environment:
Last Closed: 2022-08-10 11:07:40 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
- Github openshift router pull 383 (Merged): Bug 2076297: Fix gap in router's handling of graceful shutdowns. Last updated 2022-06-16 22:53:34 UTC
- Red Hat Product Errata RHSA-2022:5069. Last updated 2022-08-10 11:08:02 UTC

Description Grant Spence 2022-04-18 15:30:26 UTC
Description of problem:
For a brief window while the openshift-router binary is starting up, it ignores shutdown signals (SIGTERM) and never shuts down in response to them.

This becomes a larger issue when Kubernetes sends a graceful shutdown while the router is starting up and subsequently waits for the terminationGracePeriodSeconds specified in the router deployment, which is 1 hour.

This becomes even more of an issue with https://github.com/openshift/cluster-ingress-operator/pull/724, which makes the ingress controller wait for all of its pods before deleting itself. So if those pods are stuck in Terminating for an hour, the ingress controller will also be stuck in Terminating for an hour.
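To make the window concrete, here is a minimal, self-contained Go sketch of this failure mode (illustrative only, not the actual openshift-router code; blockingStartup stands in for the cache initialization):

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// blockingStartup stands in for the router's cache initialization: nothing
// hands it a stop signal, so it always runs to completion.
func blockingStartup() {
	time.Sleep(30 * time.Second)
}

func main() {
	// Installing the handler early suppresses Go's default behavior of
	// terminating the process on SIGTERM...
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	// ...but nothing on the startup path watches sigCh. A SIGTERM delivered
	// during this window is captured and then sits unread, so Kubernetes ends
	// up waiting out terminationGracePeriodSeconds (1 hour for the router).
	blockingStartup()

	// Graceful shutdown is only wired up after startup has finished.
	<-sigCh
	fmt.Println("shutting down")
}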

OpenShift release version:


Cluster Platform:


How reproducible:
You can start and stop the router pod quickly to get it stuck in an hour-long Terminating state.

Steps to Reproduce (in detail):
1. Create a YAML file with the following content:

apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: loadbalancer
    namespace: openshift-ingress-operator
  spec:
    replicas: 1
    routeSelector:
      matchLabels:
        type: loadbalancer
    endpointPublishingStrategy:
      type: LoadBalancerService
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

2. Run the following command:

oc apply  -f <YAML_FILE>.yaml && while ! oc get pod -n openshift-ingress | grep -q router-loadbalancer; do echo "Waiting"; done; oc delete pod -n openshift-ingress $(oc get pod -n openshift-ingress --no-headers | grep router-loadbalancer | awk '{print $1}');

It is considered a failure if it hangs for more than 45 seconds. You can Ctrl-C after it deletes the pod and run "oc get pods -n openshift-ingress" to see that it is stuck in a Terminating state with an AGE longer than 45 seconds.

The pod will take 1 hour to terminate, but you can always clean up by force deleting it.


Actual results:
Pod takes 1 hour to be deleted.


Expected results:
Pod should be deleted in about 45 seconds.

Impact of the problem:
Router pods hang in Terminating for 1 hour, which affects the user experience.


Additional info:



** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report.  You may also mark the bug private if you wish.

Comment 4 Shudi Li 2022-04-21 10:31:32 UTC
The issue where pods were stuck in Terminating for an hour was also seen in Bug 2076193 (oc patch command for the liveness probe and readiness probe parameters of an OpenShift router deployment doesn't take effect).
Tested with 4.11.0-0.nightly-2022-04-21-025500; the issue couldn't be reproduced.

1. Patch the OpenShift router deployment's liveness probe and readiness probe with oc patch:
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
% 

2. The router pod is terminated within about 45 seconds:
% oc -n openshift-ingress get pods                                                                                                                                                                                           
NAME                              READY   STATUS              RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running             0          140m
router-default-75577c49f9-6rtts   1/1     Terminating         0          140m
router-default-75577c49f9-dkkxd   0/1     ContainerCreating   0          7s
% 

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          140m
router-default-75577c49f9-6rtts   1/1     Terminating   0          140m
router-default-75577c49f9-dkkxd   1/1     Running       0          20s
% 

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          140m
router-default-75577c49f9-6rtts   1/1     Terminating   0          140m
router-default-75577c49f9-dkkxd   1/1     Running       0          32s
%

% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running       0          141m
router-default-75577c49f9-6rtts   1/1     Terminating   0          141m
router-default-75577c49f9-dkkxd   1/1     Running       0          38s
% 

% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-75577c49f9-44vjm   1/1     Running   0          143m
router-default-75577c49f9-dkkxd   1/1     Running   0          2m55s
%

3. The cluster version:
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-21-025500   True        False         132m    Cluster version is 4.11.0-0.nightly-2022-04-21-025500
%

Comment 6 W. Trevor King 2022-06-16 23:05:22 UTC
Apparently applies to at least 4.10 as well; comments to come in bug 2092647, which I'm about to close as a dup of this one.

Comment 7 W. Trevor King 2022-06-16 23:12:07 UTC
*** Bug 2092647 has been marked as a duplicate of this bug. ***

Comment 8 W. Trevor King 2022-06-16 23:17:00 UTC
Ok, [1] walks through how a 4.10 router with this issue could expose us to problems on 4.10 to 4.11 updates.  Fixing bug 2089336 in 4.10.z would also have avoided the race we're seeing in CI, because getting bitten requires:

1. Install a 4.10.z that does not include the fix for bug 2089336.
2. Get bit by bug 2089336's bootstrap-router-lease race to get mis-scheduled router pods.
3. Never do anything that would trigger a post-install reschedule, like updating within 4.10 or touching a drain-inducing MachineConfig.
4. Update to 4.11 and pick up the MalscheduledPod-instrumented ingress operator.

And even then, the impact is just "that node takes an hour to update".  So while it's probably worth taking this fix back to 4.10.z, we can probably let it cook longer before backporting.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2092647#c9

Comment 9 errata-xmlrpc 2022-08-10 11:07:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

