Bug 1974403
Summary: | OVN-Kube Node race occasionally leads to invalid pod IP | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Paige Rubendall <prubenda>
Component: | Networking | Assignee: | Andrew Stoycos <astoycos>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | aos-bugs, astoycos, jiazha, kgarriso, mmasters, mmckiern, prubenda, rioliu, vlaad, wking, zzhao
Version: | 4.9 | |
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Last Closed: | 2021-10-18 17:35:54 UTC | Type: | Bug
Bug Blocks: | 2019809 | |
Description
Paige Rubendall
2021-06-21 15:12:24 UTC
> got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)"}

This is the operator complaining that only 1 of 2 router pod replicas is available.

> haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory

This is normal during router pod startup; if it only appears once, it is an innocuous error.

It looks like kubelet health checks are failing for the image registry and router pods, so the registry and the router each have only one available pod. Furthermore, the registry and the router each have a PodDisruptionBudget (PDB), which prevents eviction of each one's single available pod. The PDBs are doing their job; we need to understand why the pods are unhealthy. Given that health probes for both the registry and the router are getting "i/o timeout" and "no route to host" errors, it looks like there is some underlying networking issue.

Hi @rioliu, @prubenda, @jiazha,

Because this was not reproduced here (https://bugzilla.redhat.com/show_bug.cgi?id=1974403#c4), I am going to set this as blocker-. Please try to reproduce with the latest nightly, and if it can be reproduced I will re-evaluate bumping to blocker+.

Thanks,
Andrew

After much investigation, we think this exposed a race in OVN-Kubernetes:

1. A CNI ADD for a previous sandbox is seen by OVN-Kube Node.
2. That ADD event is stale and grabs the old pod IP (10.129.2.49) from the API server.
3. In OCP 4.8 we use the `checkExternalIDs` option (https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/cni/ovs.go#L236) to wait for ovn-controller to tell us the port is up. However, this does not verify that the MAC and IP match what the ADD request got from the API server (they should match).
4. While this is happening, a pod event comes in that causes the master to re-IPAM the pod (assigning a new address, 10.129.2.67) and update OVN.
5. Now ovn-controller is setting up 10.129.2.67 even though the CNI plugin thinks the pod's IP is 10.129.2.49 (which it returns to Multus).
6. So Kubernetes puts 10.129.2.49 in Pod.Status.PodIP, because that is what the CNI returned, even though OVN thinks it should be 10.129.2.67, because that is what the ovnkube-master process said to use.

Thankfully, a forceful `oc delete pod openshift-ingress_router-default-9b58d8984-st7c8` seemed to fix the issue and allowed the upgrade to complete successfully:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-07-04-112043    True        False         32h     Cluster version is 4.8.0-0.nightly-2021-07-04-112043

Due to this workaround, and the fact that this race does not occur 100% of the time, I am going to set the target release to 4.9.0 and backport to 4.8.z accordingly.

Thanks,
Andrew

Taking a stab at a more specific bug summary based on comment 15.

This upstream patch (https://github.com/ovn-org/ovn-kubernetes/pull/2275) should fix the issue by ensuring the pod UID has not changed during the pod ADD workflow.

Posting a possible "known issue" entry for the 4.8 release notes: https://github.com/openshift/openshift-docs/pull/34401
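For illustration only, here is a minimal Go sketch of the kind of UID guard described above: before committing the IP obtained at the start of a CNI ADD, re-read the pod from the API server and confirm its UID is unchanged. The function and parameter names (`verifyPodUIDUnchanged`, `origUID`) are hypothetical and not taken from the ovn-kubernetes code; see the linked PR for the actual change.

```go
// Hypothetical sketch of the UID guard: verify that the pod seen at the start
// of a CNI ADD is still the same API object before returning its IP to the runtime.
package cni

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// verifyPodUIDUnchanged re-reads the pod and fails if it was deleted and
// recreated (different UID) since the ADD request captured origUID; in that
// case the IP fetched earlier may belong to a stale sandbox.
func verifyPodUIDUnchanged(ctx context.Context, client kubernetes.Interface,
	namespace, name string, origUID types.UID) error {

	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to re-read pod %s/%s: %w", namespace, name, err)
	}
	if pod.UID != origUID {
		// The pod changed underneath us: the IP about to be returned (e.g.
		// 10.129.2.49) may not match what ovnkube-master programmed into OVN
		// (e.g. 10.129.2.67). Abort so the ADD is retried against the current pod.
		return fmt.Errorf("pod %s/%s UID changed from %s to %s during CNI ADD",
			namespace, name, origUID, pod.UID)
	}
	return nil
}
```

In this sketch, the CNI ADD handler would call verifyPodUIDUnchanged with the UID captured when it first read the pod, just before returning the result to Multus.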
Tried the upgrade from 4.8.2-x86_64 to 4.9.0-0.nightly-2021-07-27-043424 many times; this issue could not be reproduced. Moving this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759