Description of problem:
When upgrading an OCP 4.7.11-x86_64 cluster with RHEL 7.9 workers to 4.8.0-0.nightly-2021-05-15-141455, the upgrade got stuck with the ingress operator Degraded.

Version-Release number of selected component (if applicable):
$ oc version -o yaml
clientVersion:
  buildDate: "2021-05-14T22:17:07Z"
  compiler: gc
  gitCommit: 629bdbe335bbf2f68e5a5f6e3fc25de8c249fd3c
  gitTreeState: clean
  gitVersion: 4.8.0-202105142152.p0-629bdbe
  goVersion: go1.16.1
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.8.0-0.nightly-2021-05-15-141455
releaseClientVersion: 4.8.0-0.nightly-2021-05-17-231618
serverVersion:
  buildDate: "2021-05-15T02:13:14Z"
  compiler: gc
  gitCommit: 5f48aec83243eea46a34f011f978e8949714a9a8
  gitTreeState: clean
  gitVersion: v1.21.0-rc.0+5f48aec
  goVersion: go1.16.1
  major: "1"
  minor: 21+
  platform: linux/amd64

How reproducible:
Always

Steps to Reproduce:
1. UPI install (AWS) of one OCP 4.7 cluster with OVNKubernetes + fips_enable
2. Remove the default RHEL CoreOS workers and scale up with two RHEL workers
3. Upgrade to the 4.8 nightly

Actual results:
The upgrade was stuck with the ingress operator Degraded.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.11    True        True          12h     Unable to apply 4.8.0-0.nightly-2021-05-15-141455: wait has exceeded 40 minutes for these operators: ingress

$ oc get no -owide
NAME                                        STATUS                        ROLES    AGE   VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-49-27.us-east-2.compute.internal    Ready                         master   14h   v1.21.0-rc.0+5f48aec   10.0.49.27    <none>        Red Hat Enterprise Linux CoreOS 48.84.202105150503-0 (Ootpa)   4.18.0-293.el8.x86_64         cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8
ip-10-0-51-150.us-east-2.compute.internal   Ready                         master   14h   v1.21.0-rc.0+5f48aec   10.0.51.150   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105150503-0 (Ootpa)   4.18.0-293.el8.x86_64         cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8
ip-10-0-54-16.us-east-2.compute.internal    NotReady,SchedulingDisabled   worker   13h   v1.20.0+df9c838        10.0.54.16    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.25.1.el7.x86_64   cri-o://1.20.2-11.rhaos4.7.git704b03d.el7
ip-10-0-57-248.us-east-2.compute.internal   Ready                         worker   13h   v1.20.0+df9c838        10.0.57.248   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.25.1.el7.x86_64   cri-o://1.20.2-11.rhaos4.7.git704b03d.el7
ip-10-0-64-188.us-east-2.compute.internal   Ready                         master   14h   v1.21.0-rc.0+5f48aec   10.0.64.188   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105150503-0 (Ootpa)   4.18.0-293.el8.x86_64         cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8

$ oc get co --no-headers | grep -v '.True.*False.*False'
dns              4.8.0-0.nightly-2021-05-15-141455   True    True    False   12h
image-registry   4.8.0-0.nightly-2021-05-15-141455   True    True    False   14h
ingress          4.8.0-0.nightly-2021-05-15-141455   True    False   True    12h
machine-config   4.7.11                              False   True    True    11h
monitoring       4.8.0-0.nightly-2021-05-15-141455   False   True    True    11h
network          4.8.0-0.nightly-2021-05-15-141455   True    True    True    14h
storage          4.8.0-0.nightly-2021-05-15-141455   True    True    False   11h

$ oc describe node/ip-10-0-54-16.us-east-2.compute.internal
Name:               ip-10-0-54-16.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-54-16.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.hostpath.csi/node=ip-10-0-54-16.us-east-2.compute.internal
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0feccb71b4ea1d6df","hostpath.csi.k8s.io":"ip-10-0-54-16.us-east-2.compute.internal"}
                    k8s.ovn.org/host-addresses: ["10.0.54.16"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-54-16.us-east-2.compute.internal","mac-address":"02:57:99:4a:de:3e","ip-addresse...
                    k8s.ovn.org/node-chassis-id: 1a869f3d-5225-47ae-8733-12c8b07f26fe
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.12.21"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: b2:d9:c1:2f:56:5f
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.54.16/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.2.0/23"}
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-510ff379c2cfff6cf15b0027c5307a1b
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-431be08e2490567fbb1fa6af64b1d1b1
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 17 May 2021 20:41:15 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-54-16.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 17 May 2021 22:28:29 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 17 May 2021 22:27:25 +0800   Mon, 17 May 2021 22:29:11 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 17 May 2021 22:27:25 +0800   Mon, 17 May 2021 22:29:11 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 17 May 2021 22:27:25 +0800   Mon, 17 May 2021 22:29:11 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 17 May 2021 22:27:25 +0800   Mon, 17 May 2021 22:29:11 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.54.16
  Hostname:     ip-10-0-54-16.us-east-2.compute.internal
  InternalDNS:  ip-10-0-54-16.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264956Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113980Ki
  pods:                        250
System Info:
  Machine ID:                 8e81451db8f24a6c9315a20728f21b53
  System UUID:                EC2B2713-93F5-AB07-98DE-57562B745B40
  Boot ID:                    17bd4ba2-061c-4639-85aa-d1a4161fe199
  Kernel Version:             3.10.0-1160.25.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.2-11.rhaos4.7.git704b03d.el7
  Kubelet Version:            v1.20.0+df9c838
  Kube-Proxy Version:         v1.20.0+df9c838
ProviderID:                   aws:///us-east-2a/i-0feccb71b4ea1d6df
Non-terminated Pods:          (15 in total)
  Namespace                                Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                ----                           ------------  ----------  ---------------  -------------  ---
  node-upgrade                             hello-daemonset-vfsrt          0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
  openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-bvh62  30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         12h
  openshift-cluster-node-tuning-operator   tuned-p99rt                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         12h
  openshift-dns                            dns-default-4f4fp              60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         12h
  openshift-dns                            node-resolver-f84ct            5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         12h
  openshift-image-registry                 node-ca-xddxs                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         12h
  openshift-ingress-canary                 ingress-canary-hgmt2           10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         12h
  openshift-logging                        fluentd-925p8                  100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     13h
  openshift-machine-config-operator        machine-config-daemon-7ds4t    40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         12h
  openshift-monitoring                     node-exporter-8zhlp            9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         12h
  openshift-multus                         multus-9lffm                   10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         12h
  openshift-multus                         network-metrics-daemon-qctth   20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         12h
  openshift-network-diagnostics            network-check-target-j9922     10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         12h
  openshift-ovn-kubernetes                 ovnkube-node-rrtlw             40m (1%)      0 (0%)      640Mi (4%)       0 (0%)         12h
  ui-upgrade                               hello-daemonset-md8kd          0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         354m (10%)    0 (0%)
  memory                      2332Mi (15%)  736Mi (4%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:                       <none>

$ oc describe co/ingress
Name:         ingress
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-05-17T11:48:25Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:include.release.openshift.io/ibm-cloud-managed:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-05-17T11:48:25Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:         ingress-operator
    Operation:       Update
    Time:            2021-05-17T12:05:30Z
  Resource Version:  177666
  UID:               4e29dfd6-a65f-4cac-af5d-63a93a44a108
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-05-17T14:05:07Z
    Message:               The "default" ingress controller reports Available=True.
    Reason:                IngressAvailable
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-05-17T14:05:08Z
    Message:               desired and current number of IngressControllers are equal
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-05-17T15:27:31Z
    Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-85cb58644b-fbzwl" cannot be scheduled: 0/5 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
    Reason:                IngressDegraded
    Status:                True
    Type:                  Degraded
  Extension:               <nil>
  Related Objects:
    Group:
    Name:       openshift-ingress-operator
    Resource:   namespaces
    Group:      operator.openshift.io
    Name:
    Namespace:  openshift-ingress-operator
    Resource:   IngressController
    Group:      ingress.operator.openshift.io
    Name:
    Namespace:  openshift-ingress-operator
    Resource:   DNSRecord
    Group:
    Name:       openshift-ingress
    Resource:   namespaces
    Group:
    Name:       openshift-ingress-canary
    Resource:   namespaces
  Versions:
    Name:     operator
    Version:  4.8.0-0.nightly-2021-05-15-141455
    Name:     ingress-controller
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eb85465e5f39282ba92746666e4e2cb85c71a80fae3a92021d8aff8cd54166d7
    Name:     canary-server
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0e35f3d008ea901d17535883decf0b19a360c2db7a896fa75fe4bb7182254039
Events:       <none>

----

Logged in to the AWS console and checked the NotReady node ip-10-0-54-16.us-east-2.compute.internal: it shows 'Instance reachability check failed at May 17, 2021 at 10:31:00 PM UTC+8 (13 hours and 13 minutes ago)', but the instance was still running.

Logged in to the bastion server and tested connectivity to the two RHEL 7.9 workers:

[root@bastion ~]# nc -vz 10.0.57.248 22
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.57.248:22.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

[root@bastion ~]# nc -vz 10.0.54.16 22
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

This confirms that something is wrong with the connectivity of the NotReady node ip-10-0-54-16.us-east-2.compute.internal.

Expected results:
The upgrade should succeed.

Additional info:
Must-gather output:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: a642c38f-039b-44ec-bcbc-f4b3074c13d7
ClusterVersion: Updating to "4.8.0-0.nightly-2021-05-15-141455" from "4.7.11" for 15 hours: Unable to apply 4.8.0-0.nightly-2021-05-15-141455: wait has exceeded 40 minutes for these operators: ingress
ClusterOperators:
  clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available node-resolver pods, want 5."
  clusteroperator/image-registry is progressing: Progressing: The deployment has not completed
  clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-85cb58644b-fbzwl" cannot be scheduled: 0/5 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
  clusteroperator/machine-config is not available (Cluster not available for 4.8.0-0.nightly-2021-05-15-141455) because Unable to apply 4.8.0-0.nightly-2021-05-15-141455: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 4, unavailable: 1)
  clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes
  clusteroperator/network is degraded because DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-05-17T14:45:59Z
    DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-05-17T14:45:52Z
  clusteroperator/storage is progressing: AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
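(Not part of the original triage, just a suggestion for anyone hitting the same symptom.) Since the AWS console already reported a failed instance reachability check, the same data can be pulled from the CLI; a minimal sketch, assuming the AWS CLI is configured for this account/region and using the instance ID from the node's ProviderID above:

# Show the EC2 status checks for the NotReady node's instance.
aws ec2 describe-instance-status \
    --instance-ids i-0feccb71b4ea1d6df \
    --region us-east-2 \
    --include-all-instances

# Pull the serial console output, since SSH from the bastion times out.
aws ec2 get-console-output \
    --instance-id i-0feccb71b4ea1d6df \
    --region us-east-2 \
    --output text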
This bug blocks this upgrade path test, so I added the UpgradeBlocker keyword.
This is a hard-to-reproduce issue, so I suggest collecting 'oc adm must-gather' output as soon as you see "Unable to apply 4.8.0-0.nightly-2021-06-03-101158: wait has exceeded 40 minutes for these operators: ingress".
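A minimal sketch of how that could be automated, assuming you just want to poll the ClusterVersion Failing condition and trigger must-gather when the ingress time-out message shows up (the polling interval and message match are my own assumptions, not QE tooling):

#!/usr/bin/env bash
# Poll the ClusterVersion Failing condition and collect must-gather as soon as
# the ingress wait time-out appears, so the failure state is captured early.
set -euo pipefail

while true; do
  msg=$(oc get clusterversion version \
          -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}')
  if [[ "${msg}" == *"wait has exceeded 40 minutes for these operators: ingress"* ]]; then
    oc adm must-gather --dest-dir="must-gather-$(date +%Y%m%d-%H%M%S)"
    break
  fi
  sleep 60
done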
*** Bug 1971832 has been marked as a duplicate of this bug. ***
The root cause is a bug in Open vSwitch's handling of the check_pkt_larger action on older kernels: the packet is punted to userspace and does not make it to conntrack afterwards. Filed: https://bugzilla.redhat.com/show_bug.cgi?id=1973465
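For reference, one way to see whether a given node's kernel datapath supports the action is sketched below; this assumes an OVS build that provides the dpif/show-dp-features appctl command and is run on the node (or inside the ovnkube-node pod), with br-int as the bridge to query:

# Compare the RHEL 7.9 worker (3.10 kernel) with an RHCOS worker (4.18 kernel).
uname -r

# List the datapath features OVS negotiated with the kernel module and
# look for the "check pkt length" capability.
ovs-appctl dpif/show-dp-features br-int | grep -i 'check pkt'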
Even if we get a fix for OVS with userspace handling, network performance will be poor for packets directed towards OVN on RHEL 7.9 nodes that don't support the check_pkt_len action. After some discussion within the team, it makes sense to disable these flows when the kernel does not support the action. The trade-off is that ICMP "fragmentation needed" will no longer work on these nodes for packets sent to OVN that are larger than the pod MTU, but that is better than regressing performance for RHEL nodes.
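To make the trade-off concrete, a quick probe could look like the sketch below; the destination 10.131.2.10 is a hypothetical pod IP behind OVN and 1400 is just a payload size larger than the pod MTU, so treat this purely as an illustration:

# Send a packet larger than the pod MTU with the DF bit set.
# With the check_pkt_len flows in place the sender gets an ICMP
# "fragmentation needed" reply and PMTU discovery converges; with the
# flows disabled on kernels that lack the action, that signal is no
# longer generated for traffic heading into OVN (the trade-off above).
ping -M do -s 1400 -c 3 10.131.2.10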
Removing the OVS bug (https://bugzilla.redhat.com/show_bug.cgi?id=1973465) as a dependency, as we will detect proper support using ovn-kube: https://github.com/ovn-org/ovn-kubernetes/pull/2267
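One way to confirm what ovn-kube ended up programming on a particular node is to look for the action in the installed OpenFlow rules; a sketch, where br-int as the integration bridge and OpenFlow15 as the protocol version are assumptions on my side:

# Count flows using the check_pkt_larger action on this node's integration
# bridge; on a node where ovn-kube detected missing kernel support the
# count should be zero (grep exits non-zero in that case).
ovs-ofctl -O OpenFlow15 dump-flows br-int | grep -c 'check_pkt_larger'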
Adding the testblocker keyword since this is blocking the regression test of OVN with a RHEL cluster.
Per comment 18, I also tested a fresh-install OVN cluster with the latest build; the ingress router pod running on the RHEL worker node works well.

$ oc get network/cluster -o jsonpath='{.status.networkType}'
OVNKubernetes

$ oc -n openshift-ingress get pod -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                                       NOMINATED NODE   READINESS GATES
router-default-5787f6cd56-6sxl2   1/1     Running   0          17h   10.130.2.107   ip-10-0-56-71.us-east-2.compute.internal   <none>           <none>

$ oc get node -owide
NAME                                        STATUS   ROLES    AGE   VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-52-205.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.52.205   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-56-22.us-east-2.compute.internal    Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.56.22    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-56-71.us-east-2.compute.internal    Ready    worker   22h   v1.21.0-rc.0+766a5fe   10.0.56.71    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-11.rhaos4.8.git30ca719.el7
ip-10-0-58-118.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.58.118   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-58-129.us-east-2.compute.internal   Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.58.129   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-66-137.us-east-2.compute.internal   Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.66.137   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-68-81.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.68.81    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-06-28-165738   True        False         False      17h

$ curl https://canary-openshift-ingress-canary.apps.hongli-ovn.qe.devcluster.openshift.com -k
Healthcheck requested
OK, so this issue now only affects customers on the 4.7 -> 4.7 and 4.7 -> 4.8 upgrade paths; we found that 4.6 -> 4.7 is fine. I added a release note in https://github.com/openshift/openshift-docs/issues/29652#issuecomment-871157879. So according to comments 38 and 42, I think we can move this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days