Bug 2042001

Summary: unexpectedly found multiple load balancers
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 4.10
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Reporter: Dan Williams <dcbw>
Assignee: Tim Rozet <trozet>
QA Contact: Ross Brattain <rbrattai>
CC: anbhat, elevin, jcaamano, rbrattai, surya, trozet, vpickard
Last Closed: 2022-03-12 04:41:03 UTC
Type: Bug

Description Dan Williams 2022-01-18 16:51:15 UTC
/tmp/dbgeq1.txt:I0110 17:54:22.725955       1 services_controller.go:224] "Error syncing service, retrying" service="openshift-machine-config-operator/machine-config-daemon" err="failed to ensure service openshift-machine-config-operator/machine-config-daemon load balancers: unexpectedly found multiple load balancers: [{UUID:a8f708c5-8694-4fb8-b95e-d716a21f4367 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc00a28c010 SelectionFields:[] Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:9001,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.248.42:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9001,10.0.255.15:9001,10.0.255.172:9001]} {UUID:1d70ad2a-93ac-439a-bf9b-e9d92a4a7694 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc009dd5880 SelectionFields:[] Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:9001,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9001,10.0.255.15:9001,10.0.255.172:9001]}]"

This happens as a result of RAFT cluster leadership changes where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.

Comment 1 Dan Williams 2022-01-18 16:55:20 UTC
Ilya says (of the general issue):

"The likely sequence of events:

 1. client sends a transaction.
 2. old leader starts execution: initiates a raft command, sends append requests to followers.
 3. old leader transfers the leadership: completes the in-progress raft command with a 'cluster error: lost leadership'
 4. old leader rejects append replies for the raft command from followers
 5. new leader has the transaction committed, because it received the append request previously and the other follower also has it.
 6. new leader replicates the result of the old transaction to the old leader.
 7. old leader checks that state of the old transaction is 'cluster error' and cluster errors are temporary -> retries the execution.
 8. Execution retry fails with a constraint violation.
"
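To make the sequence concrete, here is a toy model of steps 1-8. It is only an illustration, not OVSDB server code: run_transaction, apply_insert and the single shared list standing in for the replicated database are all hypothetical. It shows why the retry in step 7 ends in a constraint violation when the table enforces unique names, and why it silently produces a second row when it does not, which is what the duplicate load balancers in the description look like.

```python
class ConstraintViolation(Exception):
    """Raised when an insert would break a uniqueness constraint."""


def apply_insert(db, name, unique_names):
    # Apply one committed "insert row with this name" command to the database.
    if unique_names and name in db:
        raise ConstraintViolation(name)
    db.append(name)


def run_transaction(name, leadership_change, unique_names=False):
    """Toy replay of the sequence above; returns the resulting rows."""
    db = []  # stands in for the replicated database contents

    if not leadership_change:
        # Normal case: the command is committed and applied exactly once.
        apply_insert(db, name, unique_names)
        return db

    # Steps 2-4: the old leader sent the append requests, then gave up
    # leadership and completed the command with a temporary error.
    status = "cluster error: lost leadership"

    # Steps 5-6: the new leader still commits the command it already
    # received and replicates the result back to the old leader.
    apply_insert(db, name, unique_names)

    # Steps 7-8: the old leader treats the cluster error as temporary and
    # executes the same transaction a second time.
    if status.startswith("cluster error"):
        apply_insert(db, name, unique_names)
    return db
```

With unique_names=True the retry raises ConstraintViolation (step 8); with unique_names=False the same name ends up in the database twice.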

Comment 2 Jaime Caamaño Ruiz 2022-01-19 16:02:11 UTC
> where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.

I am not sure this makes sense to me. If we had a UUID that is not the real one, then the op

op, err := nbClient.Where(lb).Update(lb, fields...)

would return an error unless there was some other LB that had that UUID. Is this what we are talking about?

But double-checking the services controller, I think it is not using the cached UUIDs (a bug), so it is actually always searching by name. Indeed, we wouldn't see the above error otherwise, since that error indicates a search for an existing LB by name (the controller populates its cache at startup from the libovsdb cache).

Comment 3 Casey Callendrello 2022-01-19 16:56:51 UTC
This error message is coming from libovsdbops. Somehow, our cache got messed up. As a first step, I got rid of the separate cache that the code kept on top of libovsdb. I also added some code to clean up any duplicate LBs, just in case.

https://github.com/ovn-org/ovn-kubernetes/pull/2757
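For readers who want a picture of what cleaning up duplicate LBs could look like, here is a minimal sketch. It is not the code from the PR above: plan_duplicate_cleanup and the dict-shaped rows are hypothetical, and keeping the lexicographically smallest UUID is an arbitrary choice made only so the result is deterministic.

```python
from collections import defaultdict


def plan_duplicate_cleanup(lbs):
    """Return the UUIDs to delete so that each LB name is left with one row.

    lbs is a list of dicts with "uuid" and "name" keys (hypothetical shape).
    """
    by_name = defaultdict(list)
    for lb in lbs:
        by_name[lb["name"]].append(lb["uuid"])

    to_delete = []
    for name in sorted(by_name):
        uuids = sorted(by_name[name])
        # Keep the first UUID for this name, delete every other row.
        to_delete.extend(uuids[1:])
    return to_delete
```

Fed the two rows from the description (UUIDs a8f708c5-... and 1d70ad2a-..., same name), this would keep 1d70ad2a-... and schedule a8f708c5-... for deletion.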

Comment 6 Tim Rozet 2022-01-27 16:02:46 UTC
As dcbw mentioned, the root cause of this is really an OVSDB server bug: https://bugzilla.redhat.com/show_bug.cgi?id=2046340

The bug can happen to any resource type, not just load_balancers. The bug is present in past releases of OCP. It's just that in 4.10 we added some code in ovn-kubernetes that detects the problem and reports an error when it occurs. The workaround is to add an OVSDB wait operation before every creation attempt in OVSDB using libovsdb. This wait operation acts as a guard that prevents OVSDB from creating the same thing twice. Note that ACLs currently cannot be guarded this way and could still encounter the issue, but they will hopefully reconcile.
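As an illustration of the wait guard, the sketch below builds a "wait" + "insert" pair in the JSON shape that RFC 7047 defines for OVSDB operations, plus a toy evaluator so the effect can be seen. Assumptions: guarded_insert_ops and apply_ops are hypothetical helpers, the guard matches on the name column only, and the exact form ovn-kubernetes emits through libovsdb may differ.

```python
def guarded_insert_ops(table, name, row):
    """Build [wait, insert]: insert only if no row with this name exists."""
    wait = {
        "op": "wait",
        "timeout": 0,                      # do not block, fail immediately
        "table": table,
        "where": [["name", "==", name]],
        "columns": ["name"],
        "until": "!=",                     # selected rows must differ from "rows"
        "rows": [{"name": name}],
    }
    insert = {"op": "insert", "table": table, "row": row}
    return [wait, insert]


def _as_set(rows):
    # Rows are compared as sets, so duplicates collapse.
    return {tuple(sorted(r.items())) for r in rows}


def apply_ops(db_rows, ops):
    """Toy all-or-nothing evaluator for the two op types used above."""
    staged = list(db_rows)
    for op in ops:
        if op["op"] == "wait":
            column, _, value = op["where"][0]
            selected = [
                {c: r[c] for c in op["columns"]}
                for r in staged
                if r.get(column) == value
            ]
            equal = _as_set(selected) == _as_set(op["rows"])
            satisfied = equal if op["until"] == "==" else not equal
            if not satisfied:
                # The whole transaction is aborted, nothing is inserted.
                return db_rows, "timed out"
        elif op["op"] == "insert":
            staged.append(dict(op["row"]))
    return staged, None
```

Running the same guarded transaction twice against the toy evaluator inserts the row the first time and fails with "timed out" the second time, so a server-side retry of the transaction cannot create a second copy.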

For 4.9 we used a mixture of nbctl and go-ovn (not libovsdb). It's unclear whether or not nbctl gives us some protection against this problem when creating different resource types. We may need to examine this for potential backports and investigate how we could fix this in nbctl/go-ovn. Ideally we would rather have the root cause fixed in OVSDB server itself and backport that fix.

Comment 12 Ross Brattain 2022-02-03 05:45:41 UTC
Unable to find duplicate load-balancers on 4.10.0-0.nightly-2022-02-02-000921

Verified

sh-4.4# ovn-nbctl --no-leader-only --format=csv --data=bare --no-heading --columns=name  find Load_Balancer | sort | uniq -c
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_default/kubernetes_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-apiserver-operator/metrics_TCP_cluster
      1 Service_openshift-apiserver/api_TCP_cluster
      1 Service_openshift-apiserver/check-endpoints_TCP_cluster
      1 Service_openshift-authentication-operator/metrics_TCP_cluster
      1 Service_openshift-authentication/oauth-openshift_TCP_cluster
      1 Service_openshift-cloud-credential-operator/cco-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/cluster-storage-operator-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/csi-snapshot-webhook_TCP_cluster
      1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-config-operator/metrics_TCP_cluster
      1 Service_openshift-console-operator/metrics_TCP_cluster
      1 Service_openshift-console/console_TCP_cluster
      1 Service_openshift-console/downloads_TCP_cluster
      1 Service_openshift-controller-manager-operator/metrics_TCP_cluster
      1 Service_openshift-controller-manager/controller-manager_TCP_cluster
      1 Service_openshift-dns-operator/metrics_TCP_cluster
      1 Service_openshift-dns/dns-default_TCP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-dns/dns-default_UDP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-etcd-operator/metrics_TCP_cluster
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-etcd/etcd_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-image-registry/image-registry_TCP_cluster
      1 Service_openshift-ingress-canary/ingress-canary_TCP_cluster
      1 Service_openshift-ingress-operator/metrics_TCP_cluster
      1 Service_openshift-ingress/router-default_TCP_cluster
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-ingress/router-internal-default_TCP_cluster
      1 Service_openshift-insights/metrics_TCP_cluster
      1 Service_openshift-kube-apiserver-operator/metrics_TCP_cluster
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-controller-manager-operator/metrics_TCP_cluster
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-scheduler-operator/metrics_TCP_cluster
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-storage-version-migrator-operator/metrics_TCP_cluster
      1 Service_openshift-machine-api/cluster-autoscaler-operator_TCP_cluster
      1 Service_openshift-machine-api/cluster-baremetal-operator-service_TCP_cluster
      1 Service_openshift-machine-api/cluster-baremetal-webhook-service_TCP_cluster
      1 Service_openshift-machine-api/machine-api-controllers_TCP_cluster
      1 Service_openshift-machine-api/machine-api-operator-webhook_TCP_cluster
      1 Service_openshift-machine-api/machine-api-operator_TCP_cluster
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-marketplace/certified-operators_TCP_cluster
      1 Service_openshift-marketplace/community-operators_TCP_cluster
      1 Service_openshift-marketplace/marketplace-operator-metrics_TCP_cluster
      1 Service_openshift-marketplace/redhat-marketplace_TCP_cluster
      1 Service_openshift-marketplace/redhat-operators_TCP_cluster
      1 Service_openshift-monitoring/alertmanager-main_TCP_cluster
      1 Service_openshift-monitoring/grafana_TCP_cluster
      1 Service_openshift-monitoring/prometheus-adapter_TCP_cluster
      1 Service_openshift-monitoring/prometheus-k8s_TCP_cluster
      1 Service_openshift-monitoring/thanos-querier_TCP_cluster
      1 Service_openshift-multus/multus-admission-controller_TCP_cluster
      1 Service_openshift-network-diagnostics/network-check-target_TCP_cluster
      1 Service_openshift-oauth-apiserver/api_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/catalog-operator-metrics_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/olm-operator-metrics_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/packageserver-service_TCP_cluster
      1 Service_openshift-service-ca-operator/metrics_TCP_cluster
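On a larger cluster the listing above gets long, so a small filter over the `sort | uniq -c` output can help. This is only a sketch: duplicated_names is a hypothetical helper, and it assumes each line has the "count name" layout shown above.

```python
def duplicated_names(uniq_c_output):
    """Return {name: count} for every name that `uniq -c` counted more than once."""
    dupes = {}
    for line in uniq_c_output.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and anything that does not start with a count.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, name = int(parts[0]), parts[1].strip()
        if count > 1:
            dupes[name] = count
    return dupes
```

An empty result means every load balancer name is unique, which matches the verification above.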

Comment 14 errata-xmlrpc 2022-03-12 04:41:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056