Bug 2003947 - [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded: master nodes drained too quickly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jan Chaloupka
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 2003946
Blocks:
Reported: 2021-09-14 08:12 UTC by Jan Chaloupka
Modified: 2022-03-02 12:51 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2003946
Environment:
Last Closed: 2022-03-02 12:51:57 UTC
Target Upstream Version:
Embargoed:



Description Jan Chaloupka 2021-09-14 08:12:46 UTC
+++ This bug was initially created as a clone of Bug #2003946 +++

+++ This bug was initially created as a clone of Bug #1998031 +++

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade/1430418602502131712:
```
[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded 

1 unexpected clusteroperator state transitions during e2e test run 

Aug 25 08:06:51.762 - 808s  E clusteroperator/openshift-apiserver condition/Degraded status/True reason/APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
```

From loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221629842400000%22,%221629928799000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade%2F1430418602502131712%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-authentication-operator%5C%22%20%7C%20container_name%3D%5C%22authentication-operator%5C%22%22%7D%5D):
```
2021-08-25 10:06:51
I0825 08:06:51.740397       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-25T08:06:51Z","message":"APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()\nOAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()","reason":"APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-25T07:45:00Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-08-25T07:57:01Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-25T06:49:31Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2021-08-25 10:06:24
I0825 08:06:24.738959       1 request.go:668] Waited for 1.015853827s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-oauth-apiserver/configmaps/audit
```

Checking metrics (`kube_deployment_status_replicas_available{namespace="openshift-authentication"}`), the deployment had 1 replica missing at 08:06:51.740397.

There were 3 ReplicaSets (from oldest to youngest):
- oauth-openshift-7fb8f95bd8
- oauth-openshift-6f975f59c
- oauth-openshift-65d76d9bcd

- 08:04:43 -> 08:20:55: fewer than 3 replicas available

Checking events.json:
```
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 1",
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 2",
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 3",
```
No events about scaling down oauth-openshift-65d76d9bcd. Checking kubelet:
```
ovirt14-vhnb7-master-1:18830:Aug 25 07:43:47.183806 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:43:47.183768    1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:22979:Aug 25 08:09:43.697971 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:43.697943    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:24774:Aug 25 08:10:11.266126 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:11.266097    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:32270:Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4]
ovirt14-vhnb7-master-2:16925:Aug 25 07:42:44.615329 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:42:44.615293    1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:23385:Aug 25 08:14:50.475225 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.475018    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:26008:Aug 25 08:15:24.629063 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:24.629032    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:33998:Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7]
ovirt14-vhnb7-master-0:21275:Aug 25 07:44:30.963045 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:44:30.962513    1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:24990:Aug 25 08:04:44.819607 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.819553    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:26129:Aug 25 08:05:15.971355 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:15.971321    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:34407:Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm]
```

- oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???>
- oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???>
- oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???>

Checking `kube_replicaset_spec_replicas{namespace="openshift-authentication",replicaset=~"oauth-openshift-65d76d9bcd"}`, the value stayed at 3 the whole time, meaning something other than the Deployment/ReplicaSet controllers deleted the pods one by one.

The same can be observed with the openshift-apiserver instances (see `kube_deployment_status_replicas_available{namespace="openshift-apiserver"}`).

```
Aug 25 07:33:31.075468 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:33:31.075433    1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:04:44.754920 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.754791    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:05:57.242004 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:57.241938    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9]

Aug 25 07:31:59.161083 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:31:59.161061    1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:09:44.274744 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:44.274715    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:10:59.127089 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:59.127057    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:14:43.407859 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.407823    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-kdvqw]
Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx]

Aug 25 07:30:31.271077 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:30:31.271018    1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:14:50.386938 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.386718    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp]
Aug 25 08:14:50.493640 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.492425    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:15:04.621295 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:04.621266    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp]
Aug 25 08:16:14.597465 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:16:14.597413    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh]
```

- apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???>
- apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???>
- apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???>

Putting both side by side (master-0 -> master-1 -> master-2):
- apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???>
- oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???>
- apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???>
- oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???>
- apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???>
- oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???>

Checking kubelets for cordoning/drain/update:
```
Aug 25 08:04:42.977292 ovirt14-vhnb7-master-0 root[392852]: machine-config-daemon[382060]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:04:43.016573 ovirt14-vhnb7-master-0 root[392853]: machine-config-daemon[382060]: Node has been successfully cordoned
Aug 25 08:04:43.020530 ovirt14-vhnb7-master-0 root[392854]: machine-config-daemon[382060]: Update prepared; beginning drain
Aug 25 08:05:57.764428 ovirt14-vhnb7-master-0 root[400279]: machine-config-daemon[382060]: drain complete
Aug 25 08:06:40.588616 ovirt14-vhnb7-master-0 root[403514]: machine-config-daemon[382060]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:06:40.600122 ovirt14-vhnb7-master-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:06:40.706811 ovirt14-vhnb7-master-0 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736...
Aug 25 08:06:40.827596 ovirt14-vhnb7-master-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9]
Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm]

Aug 25 08:09:41.927616 ovirt14-vhnb7-master-1 root[392706]: machine-config-daemon[357804]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:09:41.968324 ovirt14-vhnb7-master-1 root[392714]: machine-config-daemon[357804]: Node has been successfully cordoned
Aug 25 08:09:41.971817 ovirt14-vhnb7-master-1 root[392715]: machine-config-daemon[357804]: Update prepared; beginning drain
Aug 25 08:10:59.273200 ovirt14-vhnb7-master-1 root[402449]: machine-config-daemon[357804]: drain complete
Aug 25 08:11:45.707314 ovirt14-vhnb7-master-1 root[405834]: machine-config-daemon[357804]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:11:45.719771 ovirt14-vhnb7-master-1 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:11:45.767006 ovirt14-vhnb7-master-1 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4]
Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx]

Aug 25 08:14:48.609668 ovirt14-vhnb7-master-2 root[427626]: machine-config-daemon[354533]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:14:48.646779 ovirt14-vhnb7-master-2 root[427627]: machine-config-daemon[354533]: Node has been successfully cordoned
Aug 25 08:14:48.652760 ovirt14-vhnb7-master-2 root[427628]: machine-config-daemon[354533]: Update prepared; beginning drain
Aug 25 08:16:15.397745 ovirt14-vhnb7-master-2 root[439943]: machine-config-daemon[354533]: drain complete
Aug 25 08:17:13.039866 ovirt14-vhnb7-master-2 root[443741]: machine-config-daemon[354533]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:17:13.052337 ovirt14-vhnb7-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:17:13.159769 ovirt14-vhnb7-master-2 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736...
Aug 25 08:17:13.283820 ovirt14-vhnb7-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7]
Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh]
```

From openshift-authentication/oauth-openshift deployment:
```
{
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        ...
        "name": "oauth-openshift",
        "namespace": "openshift-authentication",
        "resourceVersion": "28554",
        "uid": "ca7e25cc-c3d8-400f-89ef-49776b1e6094"
    },
    "spec": {
        ...
        "replicas": 3,
        "strategy": {
            "rollingUpdate": {
                "maxSurge": 0,
                "maxUnavailable": 1
            },
            "type": "RollingUpdate"
        },
        "template": {
            ...
            "spec": {
                "affinity": {
                    "podAntiAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": [
                            {
                                "labelSelector": {
                                    "matchLabels": {
                                        "app": "oauth-openshift",
                                        "oauth-openshift-anti-affinity": "true"
                                    }
                                },
                                "topologyKey": "kubernetes.io/hostname"
                            }
                        ]
                    }
                },
```

From openshift-apiserver/apiserver deployment:
```
{
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        ...
        "name": "apiserver",
        "namespace": "openshift-apiserver",
    },
    "spec": {
        ...
        "replicas": 3,
        "strategy": {
            "rollingUpdate": {
                "maxSurge": 0,
                "maxUnavailable": 1
            },
            "type": "RollingUpdate"
        },
        "template": {
            ...
            "spec": {
                "affinity": {
                    "podAntiAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": [
                            {
                                "labelSelector": {
                                    "matchLabels": {
                                        "apiserver": "true",
                                        "app": "openshift-apiserver-a",
                                        "openshift-apiserver-anti-affinity": "true"
                                    }
                                },
                                "topologyKey": "kubernetes.io/hostname"
                            }
                        ]
                    }
                },
```

Conclusion:
1. Master nodes are drained one by one, approximately 5 minutes apart.
2. Neither the apiserver nor the oauth-openshift pods have enough time to get back running/available before the next drain starts: with maxSurge: 0 and required pod anti-affinity on kubernetes.io/hostname, the replacement pod can only be scheduled onto the node that just lost its pod, which stays cordoned until the reboot completes (the kubelet ADD events above line up with the post-reboot timestamps).
3. There's no PDB to prevent draining two apiserver/oauth-openshift instances at the same time, so the number of available instances dropped from 3 to 1 at some point.

Current behavior:
- Both the apiserver and oauth-openshift operators change condition/Degraded to True as soon as at least one replica is missing.

Expected behavior:
- The operators tolerate 1 unavailable replica and change condition/Degraded to True only when at least 2 replicas are unavailable (since both deployments set maxUnavailable to 1).
- A PDB is defined to prevent 2 replicas from being unavailable at the same time (a minimal sketch follows below).
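
A minimal sketch of such a PDB, assuming the `app: oauth-openshift` pod label from the deployment quoted above; the object name is hypothetical, and 4.7-era clusters would use `policy/v1beta1` rather than `policy/v1`:

```
# Illustrative sketch only: the name is hypothetical and the selector is
# assumed to match the "app: oauth-openshift" pod label shown above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: oauth-openshift-pdb
  namespace: openshift-authentication
spec:
  minAvailable: 2      # with replicas: 3, evictions are blocked once 1 pod is down
  selector:
    matchLabels:
      app: oauth-openshift
```

Since the machine-config-daemon drains through the eviction API, which honors PDBs, the eviction of the next pod would be blocked until the previously evicted replica is rescheduled and ready again, keeping at least 2 of 3 instances available.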

Comment 3 Jan Chaloupka 2021-09-22 13:35:38 UTC
https://github.com/openshift/library-go/pull/1056/commits/8efd1883d406cc389eb25e2a257c4451dfbd668c needs to be ported to library-go in 4.7 as well. Let's revisit the 4.7 backport once there's a customer need.

Comment 4 Ke Wang 2022-02-24 09:20:00 UTC
The bug was reproduced in our 4.6 -> 4.7 upgrade; the fix needs to be backported to library-go in 4.7 as well, otherwise customers will hit the bug.

Comment 5 Ke Wang 2022-02-24 09:29:00 UTC
During the upgrade from 4.6 -> 4.7, kube-apiserver goes Degraded:

```
$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
dns                                        4.6.55    True        False         False      4h3m
...
kube-apiserver                             4.7.43    True        True          True       4h1m
...
machine-config                             4.6.55    True        False         False      4h3m
```
Checking the must-gather logs, kube-apiserver eventually goes Degraded because of the unready node newugd-5971-pl9kh-control-plane-1:

```
2022-02-22T09:28:57.052569491Z I0222 09:28:57.039818       1 status_controller.go:213] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-02-22T09:20:37Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -oyaml`","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-02-22T09:18:17Z","message":"NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-02-22T07:44:15Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 6; 0 nodes have achieved new revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-02-22T07:41:58Z","message":"KubeletMinorVersionUpgradeable: Kubelet minor versions on 6 nodes are behind the expected API server version; nevertheless, they will continue to be supported in the next OpenShift minor version upgrade.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

2022-02-22T09:28:57.072422040Z I0222 09:28:57.072352       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c99ea6b0-5aae-4cdb-9f89-a63525f9aa26", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"newugd-5971-pl9kh-control-plane-1\" not ready since 2022-02-22 09:28:46 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])\nNodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -o yaml`" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -oyaml`"
```


Tracking the node status during the upgrade via the machine-config-daemon log (namespaces/openshift-machine-config-operator/pods/machine-config-daemon-99jhw/machine-config-daemon/machine-config-daemon/logs/current.log), for node newugd-5971-pl9kh-control-plane-1:

```
2022-02-22T09:18:08.851003218Z I0222 09:18:08.850955   99197 daemon.go:383] Node newugd-5971-pl9kh-control-plane-1 is part of the control plane
...
2022-02-22T09:25:30.912795104Z I0222 09:25:30.912751   99197 update.go:1714] Update prepared; beginning drain
...
2022-02-22T09:26:46.204696209Z I0222 09:26:46.204678   99197 update.go:1714] drain complete
...
2022-02-22T09:26:46.326050179Z I0222 09:26:46.325997   99197 update.go:1714] initiating reboot: Node will reboot into config rendered-master-c43d10a537590420b7654f576eb5668a
...
2022-02-22T09:28:57.312231275Z I0222 09:28:57.312169    2771 update.go:1714] completed update for config rendered-master-c43d10a537590420b7654f576eb5668a
```

The machine config update on node newugd-5971-pl9kh-control-plane-1 finished at 09:28:57.312169.


Checked the machine-config-controller log, namespaces/openshift-machine-config-operator/pods/machine-config-controller-7799944b89-8pq4j/machine-config-controller/machine-config-controller/logs/current.log:
```
...
2022-02-22T09:28:01.367703684Z I0222 09:28:01.367628       1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting OutOfDisk=Unknown
2022-02-22T09:28:46.839548728Z I0222 09:28:46.839434       1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting NotReady=False
2022-02-22T09:28:57.004252577Z I0222 09:28:57.004209       1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting Unschedulable
2022-02-22T09:28:57.255035779Z I0222 09:28:57.251994       1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Completed update to rendered-master-c43d10a537590420b7654f576eb5668a
2022-02-22T09:28:57.347001835Z I0222 09:28:57.346582       1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting ready
```

Node newugd-5971-pl9kh-control-plane-1 was reported ready at 09:28:57.346582, yet at 09:28:57.072352 the clusteroperator/kube-apiserver Degraded message still referenced "NodeControllerDegraded: The master nodes not ready: node \"newugd-5971-pl9kh-control-plane-1\" not ready since 2022-02-22 09:28:46".

We can see the kube-apiserver pods do not have enough time to get back running/available when the master nodes are drained, and the kube-apiserver operator changes condition/Degraded to True as soon as 1 replica is unavailable.

Comment 6 Jan Chaloupka 2022-03-02 12:51:57 UTC
Ke Wang,

this issue is specifically about the openshift-apiserver operator. In order to prevent kube-apiserver from going Degraded, there's more to be backported; please see https://issues.redhat.com/browse/WRKLDS-293 for more detailed information. The improvements are targeted for 4.10.

Given the issue is reported against 4.7, it is unlikely we would backport all the functionality there without a customer escalation.

