Bug 1974403
Summary: | OVN-Kube Node race occasionally leads to invalid pod IP | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Paige Rubendall <prubenda>
Component: | Networking | Assignee: | Andrew Stoycos <astoycos>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | aos-bugs, astoycos, jiazha, kgarriso, mmasters, mmckiern, prubenda, rioliu, vlaad, wking, zzhao
Version: | 4.9 | |
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Last Closed: | 2021-10-18 17:35:54 UTC | Type: | Bug
Bug Blocks: | 2019809 | |
Description
Paige Rubendall
2021-06-21 15:12:24 UTC
> got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)"}

This is the operator complaining that only 1 of 2 router pod replicas is available.

> haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory

This is normal during router pod startup; if it only appears once, it is an innocuous error.

It looks like kubelet health checks are failing for the image registry and router pods, so the registry and the router each have only one available pod. Furthermore, the registry and the router each have a PodDisruptionBudget (PDB), which prevents eviction of each one's single available pod. The PDBs are doing their job; we need to understand why the pods are unhealthy. Given that health probes for both the registry and the router are getting "i/o timeout" and "no route to host" errors, it looks like there is some underlying networking issue.

Hi @rioliu, @prubenda, @jiazha,

Because this was not reproduced here (https://bugzilla.redhat.com/show_bug.cgi?id=1974403#c4), I am going to set this as blocker-. Please try to reproduce with the latest nightly, and if it can be reproduced I will re-evaluate bumping to blocker+.

Thanks,
Andrew

After much investigation, we think this exposed a race in OVN-Kubernetes:

1. A CNI ADD for a previous sandbox is seen by OVN-Kube Node.
2. That ADD event is stale and grabs the old pod IP (10.129.2.49) from the API server.
3. In OCP 4.8 we use the `checkExternalIDs` option (https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/cni/ovs.go#L236) to wait for ovn-controller to tell us the port is up. However, this does not verify that the MAC and IP match what the ADD request got from the API server (they should match).
4. While this is happening, a pod event comes in that causes the master to re-IPAM the pod (assigning a new address, 10.129.2.67) and update OVN.
5. Now ovn-controller is setting up 10.129.2.67 even though the CNI plugin thinks the pod's IP is 10.129.2.49 (which it returns to Multus).
6. So Kubernetes puts 10.129.2.49 in Pod.Status.PodIP, because that is what the CNI returned, even though OVN thinks it should be 10.129.2.67, because that is what the ovnkube-master process said to use.

Thankfully, a forceful `oc delete pod openshift-ingress_router-default-9b58d8984-st7c8` seemed to fix the issue and allowed the upgrade to complete successfully:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-07-04-112043    True        False         32h     Cluster version is 4.8.0-0.nightly-2021-07-04-112043

Due to this workaround, and the fact that this race does not occur 100% of the time, I am going to set the target release to 4.9.0 and backport to 4.8.z accordingly.

Thanks,
Andrew

Taking a stab at a more specific bug summary based on comment 15.

This upstream patch (https://github.com/ovn-org/ovn-kubernetes/pull/2275) should fix the issue by ensuring the pod UID has not changed during the pod ADD workflow.

Posting a possible "known issue" entry for the 4.8 release notes: https://github.com/openshift/openshift-docs/pull/34401
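For illustration only, here is a minimal Go sketch of the kind of UID guard described above: before committing the IP obtained at the start of a CNI ADD, re-read the pod from the API server and confirm its UID is unchanged. The function and parameter names (`verifyPodUIDUnchanged`, `origUID`) are hypothetical and not taken from the ovn-kubernetes code; see the linked PR for the actual change.

```go
// Hypothetical sketch of the UID guard: verify that the pod seen at the start
// of a CNI ADD is still the same API object before returning its IP to the runtime.
package cni

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// verifyPodUIDUnchanged re-reads the pod and fails if it was deleted and
// recreated (different UID) since the ADD request captured origUID; in that
// case the IP fetched earlier may belong to a stale sandbox.
func verifyPodUIDUnchanged(ctx context.Context, client kubernetes.Interface,
	namespace, name string, origUID types.UID) error {

	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to re-read pod %s/%s: %w", namespace, name, err)
	}
	if pod.UID != origUID {
		// The pod changed underneath us: the IP about to be returned (e.g.
		// 10.129.2.49) may not match what ovnkube-master programmed into OVN
		// (e.g. 10.129.2.67). Abort so the ADD is retried against the current pod.
		return fmt.Errorf("pod %s/%s UID changed from %s to %s during CNI ADD",
			namespace, name, origUID, pod.UID)
	}
	return nil
}
```

In this sketch, the CNI ADD handler would call verifyPodUIDUnchanged with the UID captured when it first read the pod, just before returning the result to Multus.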
Tried the upgrade from 4.8.2-x86_64 to 4.9.0-0.nightly-2021-07-27-043424 many times; this issue could not be reproduced. Moving this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759