Bug 1998031 - [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded: master nodes drained too quickly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jan Chaloupka
QA Contact: Rahul Gangwar
URL:
Whiteboard:
Depends On:
Blocks: 2003946
 
Reported: 2021-08-26 10:10 UTC by Jan Chaloupka
Modified: 2022-02-24 09:18 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2000004 2003946
Environment:
jobs=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade=all
Last Closed: 2021-10-18 17:49:07 UTC
Target Upstream Version:
Embargoed:


Attachments
kube_deployment_status_replicas_available{namespace="openshift-apiserver"} (85.58 KB, image/png)
2021-08-26 10:12 UTC, Jan Chaloupka
kube_deployment_status_replicas_available{namespace="openshift-authentication"} (86.92 KB, image/png)
2021-08-26 10:12 UTC, Jan Chaloupka


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 476 0 None None None 2021-09-03 11:04:19 UTC
Github openshift cluster-openshift-apiserver-operator pull 472 0 None None None 2021-09-03 11:04:21 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:49:25 UTC

Description Jan Chaloupka 2021-08-26 10:10:36 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade/1430418602502131712:
```
[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded 

1 unexpected clusteroperator state transitions during e2e test run 

Aug 25 08:06:51.762 - 808s  E clusteroperator/openshift-apiserver condition/Degraded status/True reason/APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
```

From loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221629842400000%22,%221629928799000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade%2F1430418602502131712%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-authentication-operator%5C%22%20%7C%20container_name%3D%5C%22authentication-operator%5C%22%22%7D%5D):
```
2021-08-25 10:06:51
I0825 08:06:51.740397       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-25T08:06:51Z","message":"APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()\nOAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()","reason":"APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-25T07:45:00Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-08-25T07:57:01Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-25T06:49:31Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2021-08-25 10:06:24
I0825 08:06:24.738959       1 request.go:668] Waited for 1.015853827s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-oauth-apiserver/configmaps/audit
```

Checking the metric `kube_deployment_status_replicas_available{namespace="openshift-authentication"}`, the deployment had 1 replica missing at 08:06:51.740397.

There were 3 ReplicaSets (from oldest to youngest):
- oauth-openshift-7fb8f95bd8
- oauth-openshift-6f975f59c
- oauth-openshift-65d76d9bcd

- 08:04:43 -> 08:20:55: fewer than 3 replicas available

Checking events.json:
```
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 1",
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 2",
            "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 3",
```
No events about scaling down oauth-openshift-65d76d9bcd. Checking the kubelet logs:
```
ovirt14-vhnb7-master-1:18830:Aug 25 07:43:47.183806 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:43:47.183768    1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:22979:Aug 25 08:09:43.697971 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:43.697943    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:24774:Aug 25 08:10:11.266126 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:11.266097    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s]
ovirt14-vhnb7-master-1:32270:Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4]
ovirt14-vhnb7-master-2:16925:Aug 25 07:42:44.615329 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:42:44.615293    1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:23385:Aug 25 08:14:50.475225 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.475018    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:26008:Aug 25 08:15:24.629063 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:24.629032    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p]
ovirt14-vhnb7-master-2:33998:Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7]
ovirt14-vhnb7-master-0:21275:Aug 25 07:44:30.963045 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:44:30.962513    1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:24990:Aug 25 08:04:44.819607 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.819553    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:26129:Aug 25 08:05:15.971355 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:15.971321    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m]
ovirt14-vhnb7-master-0:34407:Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm]
```

- oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???>
- oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???>
- oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???>

Checking `kube_replicaset_spec_replicas{namespace="openshift-authentication",replicaset=~"oauth-openshift-65d76d9bcd"}`, the value stayed at 3 the whole time, meaning something outside the Deployment deleted the pods one by one.

The same can be observed with the openshift-apiserver instances (see `kube_deployment_status_replicas_available{namespace="openshift-apiserver"}`):

```
Aug 25 07:33:31.075468 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:33:31.075433    1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:04:44.754920 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.754791    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:05:57.242004 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:57.241938    1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4]
Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9]

Aug 25 07:31:59.161083 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:31:59.161061    1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:09:44.274744 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:44.274715    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:10:59.127089 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:59.127057    1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4]
Aug 25 08:14:43.407859 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.407823    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-kdvqw]
Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx]

Aug 25 07:30:31.271077 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:30:31.271018    1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:14:50.386938 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.386718    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp]
Aug 25 08:14:50.493640 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.492425    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:15:04.621295 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:04.621266    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp]
Aug 25 08:16:14.597465 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:16:14.597413    1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w]
Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh]
```

- apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???>
- apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???>
- apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???>

Putting both side by side (master-0 -> master-1 -> master-2):
- apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???>
- oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???>
- apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???>
- oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???>
- apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???>
- oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???>
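
The delete -> replacement gaps in the pairing above can be computed directly from the kubelet timestamps. A minimal sketch (apiserver pods only; the timestamps are copied from the kubelet logs above, and the replacement ADD is only a lower bound, since the pod also needs time to become Ready):

```python
from datetime import datetime

# (DELETE seen by kubelet, replacement ADD seen by kubelet) for the three
# openshift-apiserver pods, copied from the kubelet logs above
windows = [
    ("08:04:44.754791", "08:09:40.255312"),  # master-0: 5mlx4 -> g6nt9
    ("08:09:44.274715", "08:14:43.407823"),  # master-1: wktv4 -> kdvqw
    ("08:14:50.386718", "08:19:59.338960"),  # master-2: krq8w -> gslsh
]

def gap_seconds(start: str, end: str) -> float:
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

gaps = [gap_seconds(s, e) for s, e in windows]
print([round(g) for g in gaps])  # [296, 299, 309] - each node leaves a ~5 min hole
print(round(sum(gaps)))          # 904 - the 808s Degraded window sits inside this span
```

Each pod is gone for roughly 5 minutes before its replacement is even bound to a node, which lines up with the 808s Degraded condition reported by the test.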

Checking the kubelet journals for cordon/drain/update:
```
Aug 25 08:04:42.977292 ovirt14-vhnb7-master-0 root[392852]: machine-config-daemon[382060]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:04:43.016573 ovirt14-vhnb7-master-0 root[392853]: machine-config-daemon[382060]: Node has been successfully cordoned
Aug 25 08:04:43.020530 ovirt14-vhnb7-master-0 root[392854]: machine-config-daemon[382060]: Update prepared; beginning drain
Aug 25 08:05:57.764428 ovirt14-vhnb7-master-0 root[400279]: machine-config-daemon[382060]: drain complete
Aug 25 08:06:40.588616 ovirt14-vhnb7-master-0 root[403514]: machine-config-daemon[382060]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:06:40.600122 ovirt14-vhnb7-master-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:06:40.706811 ovirt14-vhnb7-master-0 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736...
Aug 25 08:06:40.827596 ovirt14-vhnb7-master-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9]
Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906    1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm]

Aug 25 08:09:41.927616 ovirt14-vhnb7-master-1 root[392706]: machine-config-daemon[357804]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:09:41.968324 ovirt14-vhnb7-master-1 root[392714]: machine-config-daemon[357804]: Node has been successfully cordoned
Aug 25 08:09:41.971817 ovirt14-vhnb7-master-1 root[392715]: machine-config-daemon[357804]: Update prepared; beginning drain
Aug 25 08:10:59.273200 ovirt14-vhnb7-master-1 root[402449]: machine-config-daemon[357804]: drain complete
Aug 25 08:11:45.707314 ovirt14-vhnb7-master-1 root[405834]: machine-config-daemon[357804]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:11:45.719771 ovirt14-vhnb7-master-1 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:11:45.767006 ovirt14-vhnb7-master-1 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4]
Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981    1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx]

Aug 25 08:14:48.609668 ovirt14-vhnb7-master-2 root[427626]: machine-config-daemon[354533]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Aug 25 08:14:48.646779 ovirt14-vhnb7-master-2 root[427627]: machine-config-daemon[354533]: Node has been successfully cordoned
Aug 25 08:14:48.652760 ovirt14-vhnb7-master-2 root[427628]: machine-config-daemon[354533]: Update prepared; beginning drain
Aug 25 08:16:15.397745 ovirt14-vhnb7-master-2 root[439943]: machine-config-daemon[354533]: drain complete
Aug 25 08:17:13.039866 ovirt14-vhnb7-master-2 root[443741]: machine-config-daemon[354533]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736
Aug 25 08:17:13.052337 ovirt14-vhnb7-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:17:13.159769 ovirt14-vhnb7-master-2 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736...
Aug 25 08:17:13.283820 ovirt14-vhnb7-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736.
Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7]
Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960    1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh]
```

From openshift-authentication/oauth-openshift deployment:
```
{
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        ...
        "name": "oauth-openshift",
        "namespace": "openshift-authentication",
        "resourceVersion": "28554",
        "uid": "ca7e25cc-c3d8-400f-89ef-49776b1e6094"
    },
    "spec": {
        ...
        "replicas": 3,
        "strategy": {
            "rollingUpdate": {
                "maxSurge": 0,
                "maxUnavailable": 1
            },
            "type": "RollingUpdate"
        },
        "template": {
            ...
            "spec": {
                "affinity": {
                    "podAntiAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": [
                            {
                                "labelSelector": {
                                    "matchLabels": {
                                        "app": "oauth-openshift",
                                        "oauth-openshift-anti-affinity": "true"
                                    }
                                },
                                "topologyKey": "kubernetes.io/hostname"
                            }
                        ]
                    }
                },
```

From openshift-apiserver/apiserver deployment:
```
{
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        ...
        "name": "apiserver",
        "namespace": "openshift-apiserver"
    },
    "spec": {
        ...
        "replicas": 3,
        "strategy": {
            "rollingUpdate": {
                "maxSurge": 0,
                "maxUnavailable": 1
            },
            "type": "RollingUpdate"
        },
        "template": {
            ...
            "spec": {
                "affinity": {
                    "podAntiAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": [
                            {
                                "labelSelector": {
                                    "matchLabels": {
                                        "apiserver": "true",
                                        "app": "openshift-apiserver-a",
                                        "openshift-apiserver-anti-affinity": "true"
                                    }
                                },
                                "topologyKey": "kubernetes.io/hostname"
                            }
                        ]
                    }
                },
```

Conclusion:
1. Master nodes are drained one by one, approximately 5 minutes apart.
2. The apiserver/oauth-openshift pods do not have enough time to get back running/available before the next node is drained.
3. There is no PDB to prevent two apiserver/oauth-openshift instances from being drained at the same time, so the number of available instances dropped from 3 to 1 at some point.
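
Point 3 is exactly what a PodDisruptionBudget addresses: the eviction API refuses an eviction that would leave fewer than minAvailable healthy pods, so the next drain blocks (and retries) until the replacement on the previously drained node is Ready. A toy model of that gating (minAvailable=2 is an illustrative choice matching a 3-replica deployment with maxUnavailable: 1, not a value taken from this bug's fix):

```python
def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Mimic the eviction API's PDB check: admit an eviction only if the
    remaining healthy pod count stays at or above minAvailable."""
    return healthy_pods - 1 >= min_available

# 3 healthy replicas, minAvailable=2: the first drain may evict one pod...
print(eviction_allowed(healthy_pods=3, min_available=2))  # True
# ...but a second concurrent eviction is refused until a replacement is Ready
print(eviction_allowed(healthy_pods=2, min_available=2))  # False
```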

Current behavior:
- Both the apiserver and oauth-openshift operators change condition/Degraded to True when at least one replica is missing.

Expected behavior:
- The operators tolerate 1 unavailable replica and change condition/Degraded to True only when at least 2 replicas are unavailable (since both deployments set maxUnavailable to 1).
- A PDB is defined to prevent 2 replicas from being unavailable at the same time.
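
The first expectation amounts to comparing the number of unavailable replicas against the deployment's own rollout budget instead of against zero. A hedged sketch of that threshold (illustrative only, not the operators' actual workload-controller code):

```python
def degraded(desired: int, available: int, max_unavailable: int = 1) -> bool:
    """Report Degraded only when more replicas are missing than the rollout
    itself is allowed to take down (spec.strategy.rollingUpdate.maxUnavailable)."""
    return desired - available > max_unavailable

print(degraded(desired=3, available=2))  # False - one missing replica is tolerated
print(degraded(desired=3, available=1))  # True  - two missing is genuinely degraded
```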

Comment 1 Jan Chaloupka 2021-08-26 10:12:14 UTC
Created attachment 1817835 [details]
kube_deployment_status_replicas_available{namespace="openshift-apiserver"}

Comment 2 Jan Chaloupka 2021-08-26 10:12:43 UTC
Created attachment 1817836 [details]
kube_deployment_status_replicas_available{namespace="openshift-authentication"}

Comment 3 Stefan Schimanski 2021-08-26 12:10:29 UTC
Moving to Ravi, as solving this involves a PDB.

Comment 7 Jan Chaloupka 2021-09-01 13:05:49 UTC
For QA: only the PDBs are included in the PRs/this BZ, the rest of the BZ will get sorted out through https://bugzilla.redhat.com/show_bug.cgi?id=2000004

Comment 9 Rahul Gangwar 2021-09-08 12:37:14 UTC
Created attachment 1821518 [details]
verification result screenshot for apiserver

Comment 10 Rahul Gangwar 2021-09-08 12:41:28 UTC
Created attachment 1821520 [details]
verification result screenshot for openshift-authentication


Comment 12 Xingxing Xia 2021-09-10 07:27:55 UTC
Hi, Jan Chaloupka, thanks for the keen insight in capturing the cause; great enhancement. Do you think the fix is worth backporting to 4.8 (or even earlier releases)? It seems worth it to me :)
I also added a comment in bug 2000004#c2.

Comment 13 Jan Chaloupka 2021-09-14 08:24:10 UTC
Yes, I have just backported the changes to 4.8 and 4.7.

Comment 15 errata-xmlrpc 2021-10-18 17:49:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

