Description of problem:
Unable to upgrade from 4.7.17 to 4.8.0-0.nightly-2021-06-19-005119 due to the error "pool worker is not ready".

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-19-005119

How reproducible:
First time seen

Steps to Reproduce:
1. Install a 4.7.17 cluster (IPI on GCP, OVN with IPsec).
2. Upgrade to 4.8.0-0.nightly-2021-06-19-005119:
   ./oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-06-19-005119 --force=true --allow-explicit-upgrade=true

Actual results:
The cluster cannot be upgraded from 4.7.17 to 4.8.0-0.nightly-2021-06-19-005119:
Unable to apply 4.8.0-0.nightly-2021-06-19-005119: wait has exceeded 40 minutes for these operators: ingress

Expected results:
The cluster upgrades to 4.8.0-0.nightly-2021-06-19-005119 with no issues/holdups.

Additional info:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.17    True        True          3h      Unable to apply 4.8.0-0.nightly-2021-06-19-005119: wait has exceeded 40 minutes for these operators: ingress

# oc get node
NAME                                                      STATUS                     ROLES    AGE     VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
wsun211502-ccvmn-master-0.c.openshift-qe.internal         Ready                      master   4h23m   v1.21.0-rc.0+120883f   10.0.0.2      <none>        Red Hat Enterprise Linux CoreOS 48.84.202106161818-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
wsun211502-ccvmn-master-1.c.openshift-qe.internal         Ready                      master   4h23m   v1.21.0-rc.0+120883f   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 48.84.202106161818-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
wsun211502-ccvmn-master-2.c.openshift-qe.internal         Ready                      master   4h23m   v1.21.0-rc.0+120883f   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 48.84.202106161818-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
wsun211502-ccvmn-worker-a-kd4h9.c.openshift-qe.internal   Ready                      worker   4h12m   v1.21.0-rc.0+120883f   10.0.32.4     <none>        Red Hat Enterprise Linux CoreOS 48.84.202106161818-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
wsun211502-ccvmn-worker-b-88szm.c.openshift-qe.internal   Ready                      worker   4h12m   v1.21.0-rc.0+120883f   10.0.32.2     <none>        Red Hat Enterprise Linux CoreOS 48.84.202106161818-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
wsun211502-ccvmn-worker-c-k4pd8.c.openshift-qe.internal   Ready,SchedulingDisabled   worker   4h12m   v1.20.0+2817867        10.0.32.3     <none>        Red Hat Enterprise Linux CoreOS 47.83.202106120438-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-3.rhaos4.7.git845b2fa.el8

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      30m
baremetal                                  4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h20m
cloud-credential                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h26m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h19m
config-operator                            4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h20m
console                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      74m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h32m
dns                                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      129m
etcd                                       4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h17m
image-registry                             4.8.0-0.nightly-2021-06-19-005119   True        True          False      4h11m
ingress                                    4.8.0-0.nightly-2021-06-19-005119   True        False         True       150m
insights                                   4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h13m
kube-apiserver                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h17m
kube-controller-manager                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h18m
kube-scheduler                             4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h17m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-19-005119   True        False         False      73m
machine-api                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h9m
machine-approver                           4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h19m
machine-config                             4.8.0-0.nightly-2021-06-19-005119   False       False         True       8m3s
marketplace                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h36m
monitoring                                 4.8.0-0.nightly-2021-06-19-005119   False       True          True       79m
network                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h17m
node-tuning                                4.8.0-0.nightly-2021-06-19-005119   True        False         False      149m
openshift-apiserver                        4.8.0-0.nightly-2021-06-19-005119   True        False         False      68m
openshift-controller-manager               4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h11m
openshift-samples                          4.8.0-0.nightly-2021-06-19-005119   True        False         False      149m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h19m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h19m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-19-005119   True        False         False      3h39m
service-ca                                 4.8.0-0.nightly-2021-06-19-005119   True        False         False      4h20m
storage                                    4.8.0-0.nightly-2021-06-19-005119   True        False         False      71m

Name:         image-registry
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-21T07:24:33Z
    Message:               Available: The registry has minimum availability
                           NodeCADaemonAvailable: The daemon set node-ca has available replicas
                           ImagePrunerAvailable: Pruner CronJob has been created
    Reason:                MinimumAvailability
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-06-21T10:12:36Z
    Message:               Progressing: The deployment has not completed
    Reason:                DeploymentNotCompleted
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-06-21T10:10:06Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
  Extension:  <nil>
~~~~~~~~~~~~~~~~~~~~~~~

Name:         ingress
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-21T09:06:04Z
    Message:               The "default" ingress controller reports Available=True.
    Reason:                IngressAvailable
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-06-21T10:21:36Z
    Message:               desired and current number of IngressControllers are equal
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-06-21T11:12:36Z
    Message:               The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
    Reason:                IngressDegraded
    Status:                True
    Type:                  Degraded
~~~~~~~~~~~~~~~~~~~~~~~

Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-21T10:28:27Z
    Message:               Cluster version is 4.8.0-0.nightly-2021-06-19-005119
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-06-21T11:28:10Z
    Message:               Failed to resync 4.8.0-0.nightly-2021-06-19-005119 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-21T11:28:10Z
    Message:               Cluster not available for 4.8.0-0.nightly-2021-06-19-005119
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-06-21T10:08:04Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)
    Master:           all 3 nodes are at latest configuration rendered-master-734b3a767d88af12786770075ab197a2
    Worker:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node wsun211502-ccvmn-worker-c-k4pd8.c.openshift-qe.internal is reporting: \"failed to drain node : wsun211502-ccvmn-worker-c-k4pd8.c.openshift-qe.internal after 1 hour\""
....
~~~~~~~~~~~~~~~~~~~~~~~

Name:         monitoring
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-21T10:34:30Z
    Message:               Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/grafana: expected 2 replicas, got 1 updated replicas
    Reason:                UpdatingGrafanaFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-21T11:34:54Z
    Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2021-06-21T10:16:59Z
    Message:               Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
    Reason:                UpdatingGrafanaFailed
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-06-21T11:34:54Z
    Message:               Rolling out the stack.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Progressing
  Extension:  <nil>
....
> got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)"}

This is the operator complaining that only 1 of 2 router pod replicas is available.

> haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory

This is normal during router pod startup; if it appears only once, it is an innocuous error.

It looks like kubelet health checks are failing for the image-registry and router pods, so the registry and the router each have only 1 available pod. Furthermore, the registry and the router each have a PodDisruptionBudget (PDB), which prevents eviction of each one's single available pod. It looks like the PDBs are doing their job, and we need to understand why the pods are unhealthy. Given that health probes for both the registry and the router are getting "i/o timeout" and "no route to host" errors, it looks like there is some underlying networking issue.
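To make the PDB behavior concrete, here is a minimal Go sketch of the arithmetic behind it (simplified: the real disruption controller also handles maxUnavailable and percentage budgets; the values below are hypothetical stand-ins for the 1-of-2 router situation described above):

```go
package main

import "fmt"

// disruptionsAllowed roughly mirrors the check behind a PDB: an eviction
// is permitted only while it would still leave at least minAvailable
// healthy pods behind.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	allowed := currentHealthy - minAvailable
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	// Router deployment: 2 desired replicas, PDB minAvailable=1, but only
	// 1 replica is passing its health probes.
	fmt.Println(disruptionsAllowed(1, 1)) // 0 -> eviction blocked, drain stalls
	// Healthy steady state: both replicas available.
	fmt.Println(disruptionsAllowed(2, 1)) // 1 -> eviction may proceed
}
```

With only 1 healthy replica against minAvailable: 1, zero disruptions are allowed, so the node drain during the machine-config rollout stalls exactly as the machine-config operator reports above.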
Hi @rioliu, @prubenda, @jiazha,

Because this was not reproduced here (https://bugzilla.redhat.com/show_bug.cgi?id=1974403#c4), I am going to set this as blocker-. Please try to reproduce with the latest nightly; if it can be reproduced, I will re-evaluate bumping to blocker+.

Thanks,
Andrew
So after much investigation, we think this exposed a race in OVN-Kubernetes:

1. A CNI ADD for a previous sandbox is seen in ovnkube-node.
2. That ADD event is old, and it grabs the old pod IP (10.129.2.49) from the API server.
3. In OCP 4.8 we use the `checkExternalIDs` option (https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/cni/ovs.go#L236) to wait for ovn-controller to tell us the port is up. HOWEVER, this does not check that the port's MAC + IP match what the ADD request fetched from the API server (they should match).
4. While this is happening, a pod event comes in that causes the master to re-IPAM the pod (assigning the new address 10.129.2.67) and update OVN.
5. Now ovn-controller is setting up 10.129.2.67 even though the CNI plugin thinks the pod's IP is 10.129.2.49 (which it returns to multus).
6. So Kubernetes puts 10.129.2.49 in Pod.Status.PodIP, because that is what the CNI returned, even though OVN thinks it should be 10.129.2.67, because that is what the ovnkube-master process said to use. (A toy sketch of this interleaving follows this comment.)

Thankfully, a forceful `oc delete pod openshift-ingress_router-default-9b58d8984-st7c8` seemed to fix the issue and allowed the upgrade to complete successfully:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-07-04-112043   True        False         32h     Cluster version is 4.8.0-0.nightly-2021-07-04-112043

Due to this workaround, and the fact that this race does not occur 100% of the time, I am going to set the target release to 4.9.0 and backport to 4.8.z accordingly.

Thanks,
Andrew
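To make the interleaving above concrete, here is a deterministic toy Go sketch of the race (all names are hypothetical; the real code paths are in ovn-kubernetes' CNI shim and ovnkube-master):

```go
package main

import (
	"fmt"
	"sync"
)

// apiServer is a toy stand-in for the pod record the CNI ADD reads.
type apiServer struct {
	mu    sync.Mutex
	podIP string
}

func (a *apiServer) get() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.podIP
}

func (a *apiServer) set(ip string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.podIP = ip
}

func main() {
	api := &apiServer{podIP: "10.129.2.49"}

	// Step 2: the stale CNI ADD reads the old IP up front.
	staleIP := api.get()

	// Step 4: concurrently, the master re-IPAMs the pod and reprograms OVN.
	reIPAMed := make(chan struct{})
	go func() {
		api.set("10.129.2.67")
		close(reIPAMed)
	}()

	// Step 3: the CNI side waits only for "port up" (a stand-in for the
	// checkExternalIDs wait); it never re-validates the IP it read.
	<-reIPAMed

	// Steps 5-6: the stale IP goes back to multus and into Pod.Status.PodIP,
	// while ovn-controller has actually programmed the new one.
	fmt.Printf("CNI returned %s, but OVN programmed %s\n", staleIP, api.get())
}
```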
Taking a stab at a more specific bug summary based on comment 15.
This upstream patch (https://github.com/ovn-org/ovn-kubernetes/pull/2275) should fix the issue by ensuring the pod UID has not changed during the pod ADD workflow.
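For illustration, a minimal client-go sketch of the shape of such a guard (function and parameter names here are hypothetical; see the linked PR for the actual change):

```go
// Package cniadd sketches the stale-ADD guard described above.
package cniadd

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkPodUID rejects a CNI ADD whose sandbox belongs to an older pod
// instance: if the live pod's UID no longer matches the UID the ADD was
// issued for, the request is stale and must not commit its (old) IP.
func checkPodUID(ctx context.Context, client kubernetes.Interface,
	namespace, name, requestUID string) error {

	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if string(pod.UID) != requestUID {
		return fmt.Errorf("pod %s/%s UID changed (have %s, want %s): refusing stale CNI ADD",
			namespace, name, pod.UID, requestUID)
	}
	return nil
}
```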
Posting a possible "known issue" entry for the 4.8 release notes: https://github.com/openshift/openshift-docs/pull/34401
Tried the upgrade from 4.8.2-x86_64 to 4.9.0-0.nightly-2021-07-27-043424 many times; this issue cannot be reproduced. Moving this bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759