Bug 1956372 - openshift-gcp-routes causes disruption during upgrade by stopping before all pods terminate
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Antonio Ojea
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1966595
Reported: 2021-05-03 14:41 UTC by Clayton Coleman
Modified: 2021-07-29 07:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The OpenShift script that handles the Google Cloud load balancer logic was exiting before the network was down.
Consequence: The OpenShift components that depend on load balancers were disrupted, so they could not exit gracefully.
Fix: Wait until the network is down before exiting.
Result: Graceful shutdown works correctly.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:05:53 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2561 | 0 | None | open | Bug 1956372: gcp-routes should wait until network is stopped | 2021-05-03 14:43:35 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:06:24 UTC

Description Clayton Coleman 2021-05-03 14:41:27 UTC
openshift-gcp-routes is required for traffic from GCP load balancers to reach the node (it is what actually connects the VIP to host networking). Right now it stops *before* networking stops, which means that while kube-apiserver is draining we kill the VIP, which causes disruption.

We should terminate the openshift-gcp-routes service when the network is shutting down, not before.
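
For context, systemd stops units in the reverse of their start order, so a unit that declares Before=network-online.target is stopped only after that target has been stopped. The following is a minimal sketch of checking a unit's text for that ordering. The helper name has_before is hypothetical and not part of the machine-config-operator, and the check only reads the unit file text, not the dependency graph systemd actually resolved.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): succeed if the unit text
# on stdin lists the given target in a Before= line of its [Unit] section.
has_before() {
    target=$1
    awk -v t="$target" '
        # Track whether the current line belongs to the [Unit] section.
        /^\[/ { in_unit = ($0 == "[Unit]") }
        # Before= may list several space-separated units on one line.
        in_unit && /^Before=/ {
            n = split(substr($0, 8), units, /[ \t]+/)
            for (i = 1; i <= n; i++) if (units[i] == t) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a node (assumes systemctl is available there):
#   systemctl cat openshift-gcp-routes | has_before network-online.target \
#       && echo "stops after network-online.target"
```

Because stop order mirrors start order, a successful check means the service should outlive network-online.target during shutdown.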

Comment 2 Michael Nguyen 2021-05-07 18:21:39 UTC
Verified on 4.8.0-0.nightly-2021-05-07-075528. openshift-gcp-routes.service is now stopped after the network-online target is stopped.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-07-075528   True        False         32m     Cluster version is 4.8.0-0.nightly-2021-05-07-075528
$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-9wr6012-f76d1-z7bjv-master-0         Ready    master   53m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-1         Ready    master   53m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-2         Ready    master   53m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-b-gwn8c   Ready    worker   44m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-c-c2ndb   Ready    worker   44m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-d-2sc2x   Ready    worker   44m   v1.21.0-rc.0+291e731
$ oc debug node/ci-ln-9wr6012-f76d1-z7bjv-master-0
Starting pod/ci-ln-9wr6012-f76d1-z7bjv-master-0-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl cat openshift-gcp-routes
# /etc/systemd/system/openshift-gcp-routes.service
[Unit]
Description=Update GCP routes for forwarded IPs.
ConditionKernelCommandLine=|ignition.platform.id=gce
ConditionKernelCommandLine=|ignition.platform.id=gcp
Before=network-online.target

[Service]
Type=simple
ExecStart=/bin/bash /opt/libexec/openshift-gcp-routes.sh start
ExecStopPost=/bin/bash /opt/libexec/openshift-gcp-routes.sh cleanup
User=root
RestartSec=30
Restart=always

[Install]
WantedBy=multi-user.target
# Ensure that network-online.target will not complete until the node has working external LBs.
RequiredBy=network-online.target
sh-4.4# journalctl
...snip...
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped target Network is Online.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: node-valid-hostname.service: Succeeded.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped Ensure the node hostname is valid for the cluster.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: node-valid-hostname.service: Consumed 0 CPU time
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopping Update GCP routes for forwarded IPs....
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: NetworkManager-wait-online.service: Succeeded.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped Network Manager Wait Online.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: NetworkManager-wait-online.service: Consumed 0 CPU time
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped target sshd-keygen.target.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: systemd-user-sessions.service: Succeeded.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped Permit User Sessions.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: systemd-user-sessions.service: Consumed 13ms CPU time
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped target Remote File Systems.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopped target Network.
May 07 17:15:11 ci-ln-9wr6012-f76d1-z7bjv-master-0.c.openshift-gce-devel-ci.inte systemd[1]: Stopping Network Manager...
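
The ordering visible in the journal above ("Stopped target Network is Online." before "Stopping Update GCP routes for forwarded IPs....") can also be checked mechanically. This is a rough sketch, not part of the verification that was actually performed: the function name stops_after is hypothetical, and it only compares the first occurrence of each message, so it assumes the input covers a single shutdown.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first line containing $1 appears
# earlier in stdin than the first line containing $2.
stops_after() {
    first=$1    # message expected earlier in the log
    second=$2   # message expected later in the log
    awk -v a="$first" -v b="$second" '
        !pa && index($0, a) { pa = NR }   # remember first match of each
        !pb && index($0, b) { pb = NR }
        END { exit((pa && pb && pa < pb) ? 0 : 1) }
    '
}

# Usage on a node, after a reboot (assumes the journal is persistent so
# that the previous boot, -b -1, is still available):
#   journalctl -b -1 --no-pager | stops_after \
#       'Stopped target Network is Online.' \
#       'Stopping Update GCP routes for forwarded IPs'
```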

Comment 5 errata-xmlrpc 2021-07-27 23:05:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

