Bug 1999867 - OCP 4.7.13: networking CO degraded during upgrade of OCP with RHEL, FIPS on, OVN, and Etcd Encryption on IPI vSphere cluster
Summary: OCP 4.7.13: networking CO degraded during upgrade of OCP with RHEL, FIPS on, OVN, and Etcd Encryption on IPI vSphere cluster
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andrew Stoycos
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1974962
Blocks:
 
Reported: 2021-08-31 20:39 UTC by Walid A.
Modified: 2021-10-05 14:55 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-05 14:55:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oc describe co output that failed to upgrade (137.65 KB, text/plain)
2021-08-31 20:39 UTC, Walid A.

Description Walid A. 2021-08-31 20:39:09 UTC
Created attachment 1819532 [details]
oc describe co output that failed to upgrade

Description of problem:

Cluster operator machine-config has not yet successfully rolled out during an upgrade from OCP 4.7.13-x86_64 to 4.7.0-0.nightly-2021-08-27-190811; the network and monitoring cluster operators are also degraded.

Profile: IPI on vSphere 7.0 with RHCOS & RHEL 7.9 & FIPS on & OVN & Etcd Encryption on
Also, one of the RHEL worker nodes was in the NotReady,SchedulingDisabled state.
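For triage, the degraded operators and pools can be inspected directly with standard oc commands (listed here for reference, not output captured from this cluster):

# oc get co machine-config network monitoring
# oc get mcp
# oc describe co network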

Status conditions for the network cluster operator (from oc describe co network) show:

[2021-08-30T08:54:21.794Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-08-30T07:14:53Z
[2021-08-30T08:54:21.794Z] DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-08-30T07:15:32Z
[2021-08-30T08:54:21.794Z]     Reason:                RolloutHung
[2021-08-30T08:54:21.794Z]     Status:                True
[2021-08-30T08:54:21.794Z]     Type:                  Degraded
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T03:44:55Z
[2021-08-30T08:54:21.794Z]     Status:                False
[2021-08-30T08:54:21.794Z]     Type:                  ManagementStateDegraded
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T03:44:55Z
[2021-08-30T08:54:21.794Z]     Status:                True
[2021-08-30T08:54:21.794Z]     Type:                  Upgradeable
[2021-08-30T08:54:21.794Z]     Last Transition Time:  2021-08-30T07:00:42Z
[2021-08-30T08:54:21.794Z]     Message:               DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z] DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
[2021-08-30T08:54:21.794Z] DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
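The hung rollouts reported above can be examined per DaemonSet. These are standard oc commands shown for reference; the node name is taken from the oc get node output further below:

# oc -n openshift-ovn-kubernetes rollout status ds/ovnkube-node
# oc -n openshift-multus get ds multus network-metrics-daemon -o wide
# oc get pods -A -o wide --field-selector spec.nodeName=kewang30113609-r5nwz-rhel-0 | grep -v Running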

operand pod logs from must-gather show:

# grep -ir E0830 openshift-ovn-kubernetes/
openshift-ovn-kubernetes//core/pods.yaml:          message: "s received\nW0830 07:06:50.041459       1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition\nI0830 07:06:51.037099       1 leaderelection.go:346] lock is held by kewang30113609-r5nwz-master-2 and has not yet expired\nI0830 07:06:51.037123       1 leaderelection.go:248] failed to acquire lease openshift-ovn-kubernetes/ovn-kubernetes-master\nI0830 07:07:27.431575       1 leaderelection.go:253] successfully acquired lease openshift-ovn-kubernetes/ovn-kubernetes-master\nI0830 07:07:27.431725       1 master.go:83] Won leader election; in active mode\nI0830 07:07:27.432607       1 master.go:254] Starting cluster master\n2021/08/30 07:09:00 rpc2: client protocol error: read tcp 172.31.249.133:53970->172.31.249.102:9642: read: connection timed out\n2021/08/30 07:09:00 ssl:172.31.249.102:9642,ssl:172.31.249.133:9642,ssl:172.31.249.233:9642 disconnected. Reconnecting ... \n2021/08/30 07:09:00 ssl:172.31.249.102:9642,ssl:172.31.249.133:9642,ssl:172.31.249.233:9642 reconnected after 0 retries.\n2021/08/30 07:09:15 rpc2: client protocol error: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\n2021/08/30 07:09:15 ssl:172.31.249.102:9641,ssl:172.31.249.133:9641,ssl:172.31.249.233:9641 disconnected. Reconnecting ... \nE0830 07:09:15.819221       1 master.go:238] Failed to enable logical datapath groups: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\npanic: read tcp 172.31.249.133:46338->172.31.249.102:9641: read: connection reset by peer\n\ngoroutine 530 [running]:\ngithub.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Start.func1(0x1e42660, 0xc000d36780)\n\t/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/master.go:94 +0x265\ncreated by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113\n"
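The panic shows ovnkube-master exiting when its connection to the OVN northbound database (port 9641) was reset while it was enabling logical datapath groups. Whether the master pods crash-looped afterwards, and whether the NB raft cluster recovered, can be checked along these lines (the exec pod name is a placeholder, not taken from this cluster):

# oc -n openshift-ovn-kubernetes logs -l app=ovnkube-master -c ovnkube-master --previous
# oc -n openshift-ovn-kubernetes exec ovnkube-master-<pod> -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound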

# grep -ir E0830 openshift-monitoring
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:07:00.169523165Z E0830 07:07:00.169427       1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:07:30.177166106Z E0830 07:07:30.177099       1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/alertmanager-main-0/alertmanager-proxy/alertmanager-proxy/logs/current.log:2021-08-30T07:12:40.614749633Z E0830 07:12:40.614655       1 webhook.go:111] Failed to make webhook authenticator request: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded
openshift-monitoring/pods/openshift-state-metrics-6bd979c55c-5d4hr/openshift-state-metrics/openshift-state-metrics/logs/current.log:2021-08-30T03:01:36.819693276-04:00 E0830 07:01:36.819613       1 reflector.go:127] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Group: failed to list *v1.Group: an error on the server ("Internal Server Error: \"/apis/user.openshift.io/v1/groups?resourceVersion=105615\": Post \"https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s\": http2: client connection lost") has prevented the request from succeeding (get groups.user.openshift.io)
openshift-monitoring/pods/openshift-state-metrics-6bd979c55c-5d4hr/openshift-state-metrics/openshift-state-metrics/logs/current.log:2021-08-30T03:01:39.031183755-04:00 E0830 07:01:39.031118       1 reflector.go:127] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Group: failed to list *v1.Group: the server is currently unable to handle the request (get groups.user.openshift.io)
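All of these monitoring errors are failures to reach the kube-apiserver service address (172.30.0.1:443) from pods, which is consistent with a broken OVN datapath on the affected node rather than an apiserver outage. One way to confirm, assuming curl is available inside the proxy container:

# oc -n openshift-monitoring get pods -o wide | grep -E 'alertmanager|state-metrics'
# oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager-proxy -- curl -sk -m 5 https://172.30.0.1:443/healthz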



# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
[2021-08-30T08:53:09.190Z] authentication                             4.7.0-0.nightly-2021-08-27-190811   True        False         False      98m
[2021-08-30T08:53:09.190Z] baremetal                                  4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h5m
[2021-08-30T08:53:09.190Z] cloud-credential                           4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h8m
[2021-08-30T08:53:09.190Z] cluster-autoscaler                         4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h3m
[2021-08-30T08:53:09.190Z] config-operator                            4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h5m
[2021-08-30T08:53:09.190Z] console                                    4.7.0-0.nightly-2021-08-27-190811   True        False         False      107m
[2021-08-30T08:53:09.190Z] csi-snapshot-controller                    4.7.0-0.nightly-2021-08-27-190811   True        False         False      108m
[2021-08-30T08:53:09.190Z] dns                                        4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h4m
[2021-08-30T08:53:09.190Z] etcd                                       4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h3m
[2021-08-30T08:53:09.190Z] image-registry                             4.7.0-0.nightly-2021-08-27-190811   True        False         False      4h6m
[2021-08-30T08:53:09.190Z] ingress                                    4.7.0-0.nightly-2021-08-27-190811   True        False         False      4h54m
[2021-08-30T08:53:09.190Z] insights                                   4.7.0-0.nightly-2021-08-27-190811   True        False         False      4h58m
[2021-08-30T08:53:09.190Z] kube-apiserver                             4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h1m
[2021-08-30T08:53:09.190Z] kube-controller-manager                    4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h1m
[2021-08-30T08:53:09.190Z] kube-scheduler                             4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h2m
[2021-08-30T08:53:09.190Z] kube-storage-version-migrator              4.7.0-0.nightly-2021-08-27-190811   True        False         False      106m
[2021-08-30T08:53:09.190Z] machine-api                                4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h1m
[2021-08-30T08:53:09.190Z] machine-approver                           4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h4m
[2021-08-30T08:53:09.190Z] machine-config                             4.7.13                              False       True          True       140m
[2021-08-30T08:53:09.190Z] marketplace                                4.7.0-0.nightly-2021-08-27-190811   True        False         False      101m
[2021-08-30T08:53:09.190Z] monitoring                                 4.7.0-0.nightly-2021-08-27-190811   False       True          True       106m
[2021-08-30T08:53:09.190Z] network                                    4.7.0-0.nightly-2021-08-27-190811   True        True          True       5h4m
[2021-08-30T08:53:09.190Z] node-tuning                                4.7.0-0.nightly-2021-08-27-190811   True        False         False      153m
[2021-08-30T08:53:09.190Z] openshift-apiserver                        4.7.0-0.nightly-2021-08-27-190811   True        False         False      100m
[2021-08-30T08:53:09.190Z] openshift-controller-manager               4.7.0-0.nightly-2021-08-27-190811   True        False         False      151m
[2021-08-30T08:53:09.190Z] openshift-samples                          4.7.0-0.nightly-2021-08-27-190811   True        False         False      153m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager                 4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h4m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h4m
[2021-08-30T08:53:09.190Z] operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-08-27-190811   True        False         False      105m
[2021-08-30T08:53:09.190Z] service-ca                                 4.7.0-0.nightly-2021-08-27-190811   True        False         False      5h5m
[2021-08-30T08:53:09.190Z] storage                                    4.7.0-0.nightly-2021-08-27-190811   True        False         False      100m

# oc get node
NAME                                STATUS                        ROLES    AGE     VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-0       Ready                         master   5h8m    v1.20.0+4593a24   172.31.249.102   172.31.249.102   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-1       Ready                         master   5h8m    v1.20.0+4593a24   172.31.249.133   172.31.249.133   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-master-2       Ready                         master   5h7m    v1.20.0+4593a24   172.31.249.233   172.31.249.233   Red Hat Enterprise Linux CoreOS 47.84.202108250831-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-12.rhaos4.7.git9275d5c.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-rhel-0         NotReady,SchedulingDisabled   worker   3h53m   v1.20.0+9689d22   172.31.249.247   172.31.249.247   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.36.2.el7.x86_64    cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-rhel-1         Ready                         worker   3h53m   v1.20.0+9689d22   172.31.249.85    172.31.249.85    Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.36.2.el7.x86_64    cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-worker-5564x   Ready                         worker   4h55m   v1.20.0+df9c838   172.31.249.20    172.31.249.20    Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-08-30T08:53:09.190Z] kewang30113609-r5nwz-worker-s5bqb   Ready                         worker   4h55m   v1.20.0+df9c838   172.31.249.89    172.31.249.89    Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
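Since the NotReady,SchedulingDisabled RHEL node is the one the DaemonSets are waiting on, its kubelet state is the next thing to check. Note that oc adm node-logs depends on the kubelet still being reachable, which may not hold while the node is NotReady:

# oc describe node kewang30113609-r5nwz-rhel-0
# oc adm node-logs kewang30113609-r5nwz-rhel-0 -u kubelet | tail -n 200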


Version-Release number of selected component (if applicable):
ocp version: 
  - starting from: 4.7.13
  - upgrading to:  4.7.0-0.nightly-2021-08-27-190811
kubelet version on RHEL nodes:  v1.20.0+9689d22, cri-o://1.20.4-12.rhaos4.7.git9275d5c.el7
kubelet version on RHCOS nodes:  v1.20.0+df9c838, cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8


How reproducible:
Once so far

Steps to Reproduce:
1. Install OCP 4.7.13 IPI on vSphere 7.0 with RHCOS & FIPS on & OVN & Etcd Encryption on
2. Scale cluster to add 2 RHEL 7.9 worker nodes
3. Upgrade cluster to 4.7.0-0.nightly-2021-08-27-190811:  ./oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-27-190811 --force=true --allow-explicit-upgrade=true
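Progress of the forced upgrade can be followed with standard commands such as:

# oc get clusterversion
# oc get co
# oc get nodes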


Actual results:
The MCO, network, and monitoring cluster operators are degraded, and one RHEL worker node is NotReady,SchedulingDisabled.

Expected results:
Successful upgrade to version 4.7.0-0.nightly-2021-08-27-190811 with all cluster operators available and not degraded on the new version
 

Additional info:
Cluster operator oc describe output is in the attachment.
Link to must-gather logs is in the next comment.

Comment 2 Alexander Constantinescu 2021-09-06 15:13:24 UTC
Assigning to Andrew since I think he's already looked at a couple of "RHEL 7.9 worker node" bugs. If I am wrong, feel free to un-assign, Andrew.

/Alex

