From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade/1430418602502131712: ``` [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded 1 unexpected clusteroperator state transitions during e2e test run Aug 25 08:06:51.762 - 808s E clusteroperator/openshift-apiserver condition/Degraded status/True reason/APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver () ``` From loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221629842400000%22,%221629928799000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade%2F1430418602502131712%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-authentication-operator%5C%22%20%7C%20container_name%3D%5C%22authentication-operator%5C%22%22%7D%5D): ``` 2021-08-25 10:06:51 I0825 08:06:51.740397 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-25T08:06:51Z","message":"APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()\nOAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()","reason":"APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-25T07:45:00Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-08-25T07:57:01Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-25T06:49:31Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}} 2021-08-25 10:06:24 I0825 08:06:24.738959 1 request.go:668] Waited for 1.015853827s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-oauth-apiserver/configmaps/audit ``` Checking metrics (`kube_deployment_status_replicas_available{namespace="openshift-authentication"}`) the deployment had 1 replicas missing at 08:06:51.740397 There were 3 RSs (from oldest to youngest) - oauth-openshift-7fb8f95bd8 - oauth-openshift-6f975f59c - oauth-openshift-65d76d9bcd - 08:04:43 -> 08:20:55 less than 3 replicas available Checking events.json: ``` "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 1", "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 2", "message": "Scaled up replica set oauth-openshift-65d76d9bcd to 3", ``` No events about scaling oauth-openshift-65d76d9bcd. Checking kubelet: ``` ovirt14-vhnb7-master-1:18830:Aug 25 07:43:47.183806 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:43:47.183768 1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s] ovirt14-vhnb7-master-1:22979:Aug 25 08:09:43.697971 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:43.697943 1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s] ovirt14-vhnb7-master-1:24774:Aug 25 08:10:11.266126 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:11.266097 1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mlj5s] ovirt14-vhnb7-master-1:32270:Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314 1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4] ovirt14-vhnb7-master-2:16925:Aug 25 07:42:44.615329 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:42:44.615293 1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p] ovirt14-vhnb7-master-2:23385:Aug 25 08:14:50.475225 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.475018 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p] ovirt14-vhnb7-master-2:26008:Aug 25 08:15:24.629063 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:24.629032 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-k4g7p] ovirt14-vhnb7-master-2:33998:Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828 1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7] ovirt14-vhnb7-master-0:21275:Aug 25 07:44:30.963045 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:44:30.962513 1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m] ovirt14-vhnb7-master-0:24990:Aug 25 08:04:44.819607 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.819553 1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m] ovirt14-vhnb7-master-0:26129:Aug 25 08:05:15.971355 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:15.971321 1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-snm9m] ovirt14-vhnb7-master-0:34407:Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906 1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm] ``` - oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???> - oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???> - oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???> Checking `kube_replicaset_spec_replicas{namespace="openshift-authentication",replicaset=~"oauth-openshift-65d76d9bcd"}`, the value was set to 3 the whole time. Meaning something just deleted the pods one by one. Although, the same can be observed with openshift-apiserver instances (see `kube_deployment_status_replicas_available{namespace="openshift-apiserver"}`). ``` Aug 25 07:33:31.075468 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 07:33:31.075433 1851 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4] Aug 25 08:04:44.754920 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:04:44.754791 1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4] Aug 25 08:05:57.242004 ovirt14-vhnb7-master-0 hyperkube[1851]: I0825 08:05:57.241938 1851 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-5mlx4] Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312 1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9] Aug 25 07:31:59.161083 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 07:31:59.161061 1846 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4] Aug 25 08:09:44.274744 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:09:44.274715 1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4] Aug 25 08:10:59.127089 ovirt14-vhnb7-master-1 hyperkube[1846]: I0825 08:10:59.127057 1846 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-wktv4] Aug 25 08:14:43.407859 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.407823 1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-kdvqw] Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981 1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx] Aug 25 07:30:31.271077 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 07:30:31.271018 1850 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w] Aug 25 08:14:50.386938 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.386718 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp] Aug 25 08:14:50.493640 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:14:50.492425 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w] Aug 25 08:15:04.621295 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:15:04.621266 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-p2wbp] Aug 25 08:16:14.597465 ovirt14-vhnb7-master-2 hyperkube[1850]: I0825 08:16:14.597413 1850 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-krq8w] Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960 1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh] ``` - apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???> - apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???> - apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???> Putting both alongside (master0 -> master1 -> master2) - apiserver-7478d5665f-5mlx4: <07:33:31.075433; 08:04:44.754791> -> apiserver-7478d5665f-g6nt9: <08:09:40.255312; ???> - oauth-openshift-65d76d9bcd-snm9m: <07:44:30.962513; 08:04:44.819553> -> oauth-openshift-65d76d9bcd-mnvdm: <08:09:40.277906; ???> - apiserver-7478d5665f-wktv4: <07:31:59.161061; 08:09:44.274715> -> apiserver-7478d5665f-kdvqw: <08:14:43.407823; ???> - oauth-openshift-65d76d9bcd-mlj5s: <07:43:47.183768; 08:09:43.697943> -> oauth-openshift-65d76d9bcd-c8mv4: <08:14:43.471314; ???> - apiserver-7478d5665f-krq8w: <07:30:31.271018; 08:14:50.386718> -> apiserver-7478d5665f-gslsh: <08:19:59.338960; ???> - oauth-openshift-65d76d9bcd-k4g7p: <07:42:44.615293; 08:14:50.475018> -> oauth-openshift-65d76d9bcd-ptwm7: <08:19:59.335828; ???> Checking kubelets for cordoning/drain/update: ``` Aug 25 08:04:42.977292 ovirt14-vhnb7-master-0 root[392852]: machine-config-daemon[382060]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false} Aug 25 08:04:43.016573 ovirt14-vhnb7-master-0 root[392853]: machine-config-daemon[382060]: Node has been successfully cordoned Aug 25 08:04:43.020530 ovirt14-vhnb7-master-0 root[392854]: machine-config-daemon[382060]: Update prepared; beginning drain Aug 25 08:05:57.764428 ovirt14-vhnb7-master-0 root[400279]: machine-config-daemon[382060]: drain complete Aug 25 08:06:40.588616 ovirt14-vhnb7-master-0 root[403514]: machine-config-daemon[382060]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736 Aug 25 08:06:40.600122 ovirt14-vhnb7-master-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:06:40.706811 ovirt14-vhnb7-master-0 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736... Aug 25 08:06:40.827596 ovirt14-vhnb7-master-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:09:40.255344 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.255312 1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-g6nt9] Aug 25 08:09:40.277931 ovirt14-vhnb7-master-0 hyperkube[1991]: I0825 08:09:40.277906 1991 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-mnvdm] Aug 25 08:09:41.927616 ovirt14-vhnb7-master-1 root[392706]: machine-config-daemon[357804]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false} Aug 25 08:09:41.968324 ovirt14-vhnb7-master-1 root[392714]: machine-config-daemon[357804]: Node has been successfully cordoned Aug 25 08:09:41.971817 ovirt14-vhnb7-master-1 root[392715]: machine-config-daemon[357804]: Update prepared; beginning drain Aug 25 08:10:59.273200 ovirt14-vhnb7-master-1 root[402449]: machine-config-daemon[357804]: drain complete Aug 25 08:11:45.707314 ovirt14-vhnb7-master-1 root[405834]: machine-config-daemon[357804]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736 Aug 25 08:11:45.719771 ovirt14-vhnb7-master-1 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:11:45.767006 ovirt14-vhnb7-master-1 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:14:43.473159 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:43.471314 1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-c8mv4] Aug 25 08:14:50.437017 ovirt14-vhnb7-master-1 hyperkube[1971]: I0825 08:14:50.436981 1971 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver-operator/openshift-apiserver-operator-7954c899b6-hq7xx] Aug 25 08:14:48.609668 ovirt14-vhnb7-master-2 root[427626]: machine-config-daemon[354533]: Starting update from rendered-master-9faf006a78c47c3fb7ca6a9223f67108 to rendered-master-7242100dcf595a1cee714eaa5fee9736: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false} Aug 25 08:14:48.646779 ovirt14-vhnb7-master-2 root[427627]: machine-config-daemon[354533]: Node has been successfully cordoned Aug 25 08:14:48.652760 ovirt14-vhnb7-master-2 root[427628]: machine-config-daemon[354533]: Update prepared; beginning drain Aug 25 08:16:15.397745 ovirt14-vhnb7-master-2 root[439943]: machine-config-daemon[354533]: drain complete Aug 25 08:17:13.039866 ovirt14-vhnb7-master-2 root[443741]: machine-config-daemon[354533]: initiating reboot: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736 Aug 25 08:17:13.052337 ovirt14-vhnb7-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:17:13.159769 ovirt14-vhnb7-master-2 systemd[1]: Stopping machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736... Aug 25 08:17:13.283820 ovirt14-vhnb7-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7242100dcf595a1cee714eaa5fee9736. Aug 25 08:19:59.335853 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.335828 1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-authentication/oauth-openshift-65d76d9bcd-ptwm7] Aug 25 08:19:59.338981 ovirt14-vhnb7-master-2 hyperkube[1965]: I0825 08:19:59.338960 1965 kubelet.go:2065] "SyncLoop ADD" source="api" pods=[openshift-apiserver/apiserver-7478d5665f-gslsh] ``` From openshift-authentication/oauth-openshift deployment: ``` { "apiVersion": "apps/v1", "kind": "Deployment", "metadata": { ... "name": "oauth-openshift", "namespace": "openshift-authentication", "resourceVersion": "28554", "uid": "ca7e25cc-c3d8-400f-89ef-49776b1e6094" }, "spec": { ... "replicas": 3, "strategy": { "rollingUpdate": { "maxSurge": 0, "maxUnavailable": 1 }, "type": "RollingUpdate" }, "template": { ... "spec": { "affinity": { "podAntiAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": [ { "labelSelector": { "matchLabels": { "app": "oauth-openshift", "oauth-openshift-anti-affinity": "true" } }, "topologyKey": "kubernetes.io/hostname" } ] } }, ``` From openshift-apiserver/apiserver deployment: ``` { "apiVersion": "apps/v1", "kind": "Deployment", "metadata": { ... "name": "apiserver", "namespace": "openshift-apiserver", }, "spec": { ... "replicas": 3, "strategy": { "rollingUpdate": { "maxSurge": 0, "maxUnavailable": 1 }, "type": "RollingUpdate" }, "template": { ... "spec": { "affinity": { "podAntiAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": [ { "labelSelector": { "matchLabels": { "apiserver": "true", "app": "openshift-apiserver-a", "openshift-apiserver-anti-affinity": "true" } }, "topologyKey": "kubernetes.io/hostname" } ] } }, ``` Conclusion: 1. Master nodes are drained one by one (approximately after 5 minute of each other). 2. Both apiserver/oauth-openshift pods do not have enough time to get back running/available in time. 3. There's no PDB that can avoid draining two apiserver/oauth-openshift instances at the same time so the number of available instances dropped from 3 to 1 at some point. Current behavior: - Both apiserver/oauth-openshift operators changes condition/Degraded to true when there's at least one replicas missing. Expected behavior: - The operators allow 1 replica to be unavailable and change condition/Degraded to true only when there are at least 2 replicas unavailable (since both deployments set maxUnavailable to 1) - PDB is defined to avoid 2 unavailable replicas at the same time
Created attachment 1817835 [details] kube_deployment_status_replicas_available{namespace="openshift-apiserver"}
Created attachment 1817836 [details] kube_deployment_status_replicas_available{namespace="openshift-authentication"}
Moving to Ravi as this involves PDB to be solved.
For QA: only the PDBs are included in the PRs/this BZ, the rest of the BZ will get sorted out through https://bugzilla.redhat.com/show_bug.cgi?id=2000004
Created attachment 1821518 [details] verification result screenshot for apiserver
Created attachment 1821520 [details] verification result screenshot for Opensshift-Authentication Verification result screenshot for Opensshift-Authentication
Hi, Jan Chaloupka, thanks for the keen insight to capture the cause, great enhancement; do you think the fix is worthy to back-port to 4.8 (or even earlier releases)? Seems worthy for me :) I also added one comment in bug 2000004#c2.
Yes, have just backported the changes to 4.8 and 4.7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759