Created attachment 1819532 [details]
oc describe co output that failed to upgrade

Description of problem:

Cluster operator machine-config has not yet successfully rolled out during an upgrade from OCP 4.7.13-x86_64 to 4.7.0-0.nightly-2021-08-27-190811, with the network and monitoring cluster operators degraded.

Profile: IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on

One of the RHEL worker nodes was also in NotReady,SchedulingDisabled state.

oc describe co network shows:

[2021-08-30T08:54:21.794Z]     DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-08-30T07:14:53Z
[2021-08-30T08:54:21.794Z]     DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-08-30T07:15:32Z
[2021-08-30T08:54:21.794Z]     Reason:                RolloutHung
[2021-08-30T08:54:21.794Z]     Status:                True
[2021-08-30T08:54:21.794Z]     Type:                  Degraded
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T03:44:55Z
[2021-08-30T08:54:21.794Z]     Status:                False
[2021-08-30T08:54:21.794Z]     Type:                  ManagementStateDegraded
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T03:44:55Z
[2021-08-30T08:54:21.794Z]     Status:                True
[2021-08-30T08:54:21.794Z]     Type:                  Upgradeable
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T07:00:42Z
[2021-08-30T08:54:21.794Z]     Message:               DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z]                            DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z]                            DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z]                            DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)

Operand pod logs from must-gather show:

# grep -ir E0830 openshift-ovn-kubernetes/
openshift-ovn-kubernetes//core/pods.yaml: message: "s received\nW0830 07:06:50.041459 1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition\nI0830 07:06:51.037099 1 leaderelection.go:346] lock is held by kewang30113609-r5nwz-master-2 and has not yet expired\nI0830 07:06:51.037123 1 leaderelection.go:248] failed to acquire lease openshift-ovn-kubernetes/ovn-kubernetes-master\nI0830 07:07:27.431575 1 leaderelection.go:253] successfully acquired lease openshift-ovn-kubernetes/ovn-kubernetes-master\nI0830 07:07:27.431725 1 master.go:83] Won leader election; in active mode\nI0830 07:07:27.432607 1 master.go:254] Starting cluster master\n2021/08/30 07:09:00 rpc2: client protocol error: read tcp 172.31.249.133:53970->172.31.249.102:9642: read: connection timed out\n2021/08/30 07:09:00 ssl:172.31.249.102:9642,ssl:172.31.249.133:9642,ssl:172.31.249.233:9642 disconnected. Reconnecting ... \n2021/08/30 07:09:00 ssl:172.31.249.102:9642,ssl:172.31.249.133:9642,ssl:172.31.249.233:9642 reconnected after 0 retries.\n2021/08/30 07:09:15 rpc2: client protocol error: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\n2021/08/30 07:09:15 ssl:172.31.249.102:9641,ssl:172.31.249.133:9641,ssl:172.31.249.233:9641 disconnected. Reconnecting ... \nE0830 07:09:15.819221 1 master.go:238] Failed to enable logical datapath groups: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\npanic: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\n\ngoroutine 530 [running]:\ngithub.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Start.func1(0x1e42660, 0xc000d36780)\n\t/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/master.go:94 +0x265\ncreated by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113\n"

# grep -ir E0830 openshift-monitoring
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:07:00.169523165Z E0830 07:07:00.169427 1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:07:30.177166106Z E0830 07:07:30.177099 1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:12:40.614749633Z E0830 07:12:40.614655 1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/openshift-state-metrics-6bd979c55c-5d4hr/openshift-state-metrics/openshift-state-metrics/logs/current.log:2021-08-30T03:01:36.819693276-04:00 E0830 07:01:36.819613 1 reflector.go:127] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Group: failed to list *v1.Group: an error on the server ("Internal Server Error: \"/apis/user.openshift.io/v1/groups?resourceVersion=105615\": Post \"https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s\": http2: client connection lost") has prevented the request from succeeding (get groups.user.openshift.io)
openshift-monitoring/pods/openshift-state-metrics-6bd979c55c-5d4hr/openshift-state-metrics/openshift-state-metrics/logs/current.log:2021-08-30T03:01:39.031183755-04:00 E0830 07:01:39.031118 1 reflector.go:127] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Group: failed to list *v1.Group: the server is currently unable to handle the request (get groups.user.openshift.io)

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
[2021-08-30T08:53:09.190Z] authentication                             4.7.0-0.nightly-2021-08-27-190811   True    False   False   98m
[2021-08-30T08:53:09.190Z] baremetal                                  4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h5m
[2021-08-30T08:53:09.190Z] cloud-credential                           4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h8m
[2021-08-30T08:53:09.190Z] cluster-autoscaler                         4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h3m
[2021-08-30T08:53:09.190Z] config-operator                            4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h5m
[2021-08-30T08:53:09.190Z] console                                    4.7.0-0.nightly-2021-08-27-190811   True    False   False   107m
[2021-08-30T08:53:09.190Z] csi-snapshot-controller                    4.7.0-0.nightly-2021-08-27-190811   True    False   False   108m
[2021-08-30T08:53:09.190Z] dns                                        4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h4m
[2021-08-30T08:53:09.190Z] etcd                                       4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h3m
[2021-08-30T08:53:09.190Z] image-registry                             4.7.0-0.nightly-2021-08-27-190811   True    False   False   4h6m
[2021-08-30T08:53:09.190Z] ingress                                    4.7.0-0.nightly-2021-08-27-190811   True    False   False   4h54m
[2021-08-30T08:53:09.190Z] insights                                   4.7.0-0.nightly-2021-08-27-190811   True    False   False   4h58m
[2021-08-30T08:53:09.190Z] kube-apiserver                             4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h1m
[2021-08-30T08:53:09.190Z] kube-controller-manager                    4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h1m
[2021-08-30T08:53:09.190Z] kube-scheduler                             4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h2m
[2021-08-30T08:53:09.190Z] kube-storage-version-migrator              4.7.0-0.nightly-2021-08-27-190811   True    False   False   106m
[2021-08-30T08:53:09.190Z] machine-api                                4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h1m
[2021-08-30T08:53:09.190Z] machine-approver                           4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h4m
[2021-08-30T08:53:09.190Z] machine-config                             4.7.13                              False   True    True    140m
[2021-08-30T08:53:09.190Z] marketplace                                4.7.0-0.nightly-2021-08-27-190811   True    False   False   101m
[2021-08-30T08:53:09.190Z] monitoring                                 4.7.0-0.nightly-2021-08-27-190811   False   True    True    106m
[2021-08-30T08:53:09.190Z] network                                    4.7.0-0.nightly-2021-08-27-190811   True    True    True    5h4m
[2021-08-30T08:53:09.190Z] node-tuning                                4.7.0-0.nightly-2021-08-27-190811   True    False   False   153m
[2021-08-30T08:53:09.190Z] openshift-apiserver                        4.7.0-0.nightly-2021-08-27-190811   True    False   False   100m
[2021-08-30T08:53:09.190Z] openshift-controller-manager               4.7.0-0.nightly-2021-08-27-190811   True    False   False   151m
[2021-08-30T08:53:09.190Z] openshift-samples                          4.7.0-0.nightly-2021-08-27-190811   True    False   False   153m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager                 4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h4m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h4m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-08-27-190811   True    False   False   105m
[2021-08-30T08:53:09.190Z] service-ca                                 4.7.0-0.nightly-2021-08-27-190811   True    False   False   5h5m
[2021-08-30T08:53:09.190Z] storage                                    4.7.0-0.nightly-2021-08-27-190811   True    False   False   100m

# oc get node
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-0       Ready                         master   5h8m    v1.20.0+4593a24   172.31.249.102   172.31.249.102   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-1       Ready                         master   5h8m    v1.20.0+4593a24   172.31.249.133   172.31.249.133   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-2       Ready                         master   5h7m    v1.20.0+4593a24   172.31.249.233   172.31.249.233   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-rhel-0         NotReady,SchedulingDisabled   worker   3h53m   v1.20.0+9689d22   172.31.249.247   172.31.249.247   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.36.2.el7.x86_64    cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-rhel-1         Ready                         worker   3h53m   v1.20.0+9689d22   172.31.249.85    172.31.249.85    Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.36.2.el7.x86_64    cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-worker-5564x   Ready                         worker   4h55m   v1.20.0+df9c838   172.31.249.20    172.31.249.20    Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-worker-s5bqb   Ready                         worker   4h55m   v1.20.0+df9c838   172.31.249.89    172.31.249.89    Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8

Version-Release number of selected component (if applicable):

OCP version:
- starting from: 4.7.13
- upgrading to:  4.7.0-0.nightly-2021-08-27-190811

kubelet / cri-o versions on RHEL nodes:  v1.20.0+9689d22, cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
kubelet / cri-o versions on RHCOS nodes: v1.20.0+df9c838, cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8

How reproducible:
Once so far

Steps to Reproduce:
1. Install OCP 4.7.13 IPI on vSphere 7.0 with RHCOS & FIPS on & OVN & Etcd Encryption on
2. Scale the cluster to add 2 RHEL 7.9 worker nodes
3. Upgrade the cluster to 4.7.0-0.nightly-2021-08-27-190811:
   ./oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-27-190811 --force=true --allow-explicit-upgrade=true

Actual results:
The machine-config, network and monitoring cluster operators are degraded and one RHEL worker node is NotReady.

Expected results:
Successful upgrade to 4.7.0-0.nightly-2021-08-27-190811 with all cluster operators available and not degraded on the new version.

Additional info:
Cluster operator oc describe outputs are in the attachment. Must-gather logs are linked in the next comment. The status checks run after the upgrade are sketched below.
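For reference, a minimal sketch of the post-upgrade checks behind the output pasted above. Only oc get co, oc get node and the must-gather greps are copied from the job; the other invocations here are assumed, not taken verbatim from the CI run:

$ oc get clusterversion                               # confirm upgrade target and progress
$ oc get co                                           # machine-config, network, monitoring report Degraded/not Available
$ oc get node -o wide                                 # kewang30113609-r5nwz-rhel-0 is NotReady,SchedulingDisabled
$ oc describe co machine-config network monitoring    # condition details, as captured in the attachment
$ oc adm must-gather                                  # collect the operator/operand logs grepped above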
Assigning to Andrew since I think he's already looked at a couple of "RHEL 7.9 worker node" bugs. If I am wrong, feel free to un-assign, Andrew.

/Alex