Bug 2042001 - unexpectedly found multiple load balancers
Summary: unexpectedly found multiple load balancers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-18 16:51 UTC by Dan Williams
Modified: 2022-03-12 04:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:41:03 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 934 | 0 | None | Merged | Bug 2042001: Adds wait method for ovsdb operations that created named objects | 2022-01-28 03:34:28 UTC
Github | ovn-org libovsdb pull 287 | 0 | None | Merged | Fix json encoding of wait method timeout | 2022-01-27 16:04:45 UTC
Github | ovn-org libovsdb pull 288 | 0 | None | Merged | Implement support for wait method in server | 2022-01-27 16:04:45 UTC
Github | ovn-org ovn-kubernetes pull 2764 | 0 | None | Merged | Adds wait method for ovsdb operations that created named objects | 2022-01-27 19:51:10 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:41:20 UTC

Description Dan Williams 2022-01-18 16:51:15 UTC
/tmp/dbgeq1.txt:I0110 17:54:22.725955       1 services_controller.go:224] "Error syncing service, retrying" service="openshift-machine-config-operator/machine-config-daemon" err="failed to ensure service openshift-machine-config-operator/machine-config-daemon load balancers: unexpectedly found multiple load balancers: [{UUID:a8f708c5-8694-4fb8-b95e-d716a21f4367 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc00a28c010 SelectionFields:[] Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:9001,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.248.42:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9001,10.0.255.15:9001,10.0.255.172:9001]} {UUID:1d70ad2a-93ac-439a-bf9b-e9d92a4a7694 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc009dd5880 SelectionFields:[] Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:9001,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9001,10.0.255.15:9001,10.0.255.172:9001]}]"

This happens as a result of RAFT cluster leadership changes where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.

Comment 1 Dan Williams 2022-01-18 16:55:20 UTC
Ilya says (of the general issue):

"The likely sequence of events:

 1. client sends a transaction.
 2. old leader starts execution: initiates a raft command, sends append requests to followers.
 3. old leader transfers the leadership: completes the in-progress raft command with a 'cluster error: lost leadership'
 4. old leader rejects append replies for the raft command from followers
 5. new leader has the transaction committed, because it received the append request previously and the other follower also has it.
 6. new leader replicates the result of the old transaction to the old leader.
 7. old leader checks that state of the old transaction is 'cluster error' and cluster errors are temporary -> retries the execution.
 8. Execution retry fails with a constraint violation.
"
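A toy model of steps 5 to 8 may help show why the same retry ends in a constraint violation for some tables and in duplicate rows for others. This is only a sketch: Table, Insert and CommitThenRetry are hypothetical names, not OVSDB or libovsdb code, and it assumes the affected table has no unique index on the name column, which would be consistent with the duplicate load balancers in the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Row is a simplified stand-in for an OVSDB row; only the name matters here.
type Row struct {
	UUID string
	Name string
}

// Table is a toy in-memory table. UniqueName mimics a schema index on "name".
type Table struct {
	UniqueName bool
	Rows       []Row
	nextID     int
}

// ErrConstraint models the constraint violation from step 8.
var ErrConstraint = errors.New("constraint violation")

// Insert adds a row with a fresh UUID, enforcing uniqueness only if indexed.
func (t *Table) Insert(name string) error {
	if t.UniqueName {
		for _, r := range t.Rows {
			if r.Name == name {
				return ErrConstraint
			}
		}
	}
	t.nextID++
	t.Rows = append(t.Rows, Row{UUID: fmt.Sprintf("uuid-%d", t.nextID), Name: name})
	return nil
}

// CommitThenRetry models steps 5-8: the insert is committed through the new
// leader, then the old leader re-executes the same insert because it treated
// the "lost leadership" cluster error as temporary.
func CommitThenRetry(t *Table, name string) error {
	if err := t.Insert(name); err != nil { // committed via the new leader
		return err
	}
	return t.Insert(name) // retried on the old leader
}

func main() {
	indexed := &Table{UniqueName: true}
	fmt.Println(CommitThenRetry(indexed, "lb-a"), len(indexed.Rows)) // constraint violation 1

	plain := &Table{}
	fmt.Println(CommitThenRetry(plain, "lb-a"), len(plain.Rows)) // <nil> 2
}
```

In the unindexed case the retry succeeds silently and leaves two rows with the same name and different UUIDs.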

Comment 2 Jaime Caamaño Ruiz 2022-01-19 16:02:11 UTC
> where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.

I am not sure this makes sense to me. If we had a UUID that is not the real one, then the op

op, err := nbClient.Where(lb).Update(lb, fields...)

would return an error unless there was some other LB that had that UUID. Is this what we are talking about?

But double-checking the services controller, I think it is not using the cached UUIDs (a bug), so it is actually always searching by name. Indeed, we wouldn't see the above error otherwise: that error indicates that the controller is searching for an existing LB by name (the controller populates its cache at startup from the libovsdb cache).
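To illustrate the lookup-by-name behaviour described above, here is a minimal sketch of how such a search could surface the error from the description once two rows share a name. LoadBalancer and findByName are hypothetical stand-ins, not the actual ovn-kubernetes types or functions, and the cache is just a slice.

```go
package main

import "fmt"

// LoadBalancer is a hypothetical, trimmed-down stand-in for the NB row.
type LoadBalancer struct {
	UUID string
	Name string
}

// findByName returns the single LB with the given name, nil if none exists,
// and an error when more than one cached row carries that name.
func findByName(cache []LoadBalancer, name string) (*LoadBalancer, error) {
	var matches []LoadBalancer
	for _, lb := range cache {
		if lb.Name == name {
			matches = append(matches, lb)
		}
	}
	switch len(matches) {
	case 0:
		return nil, nil
	case 1:
		return &matches[0], nil
	default:
		return nil, fmt.Errorf("unexpectedly found multiple load balancers: %+v", matches)
	}
}

func main() {
	// Two rows with the same name but different UUIDs, as in the reported log.
	cache := []LoadBalancer{
		{UUID: "a8f708c5", Name: "Service_ns/svc_TCP_node_router_node1"},
		{UUID: "1d70ad2a", Name: "Service_ns/svc_TCP_node_router_node1"},
	}
	_, err := findByName(cache, "Service_ns/svc_TCP_node_router_node1")
	fmt.Println(err)
}
```

A lookup by UUID would never hit the multi-match branch, which is why the error points to a search by name.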

Comment 3 Casey Callendrello 2022-01-19 16:56:51 UTC
This error message is coming from libovsdbops. Somehow, our cache got messed up. As a first point, the code has a separate cache on top of libovsdb that I got rid of. I also added some code to clean up any dupe LBs just in case.

https://github.com/ovn-org/ovn-kubernetes/pull/2757
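As a rough idea of what cleaning up dupe LBs could look like, the sketch below picks out the rows to delete. It is not the code from the pull request: LoadBalancer and duplicateUUIDs are hypothetical, and keeping the first row seen for each name is an arbitrary assumption made for the example.

```go
package main

import "fmt"

// LoadBalancer is a hypothetical, trimmed-down stand-in for the NB row.
type LoadBalancer struct {
	UUID string
	Name string
}

// duplicateUUIDs keeps the first LB seen for each name and returns the UUIDs
// of every later LB that reuses a name, in input order.
func duplicateUUIDs(lbs []LoadBalancer) []string {
	seen := make(map[string]bool)
	var dupes []string
	for _, lb := range lbs {
		if seen[lb.Name] {
			dupes = append(dupes, lb.UUID)
			continue
		}
		seen[lb.Name] = true
	}
	return dupes
}

func main() {
	lbs := []LoadBalancer{
		{UUID: "a8f708c5", Name: "lb-a"},
		{UUID: "1d70ad2a", Name: "lb-a"},
		{UUID: "0c1b2d3e", Name: "lb-b"},
	}
	// The returned UUIDs are the candidates for deletion.
	fmt.Println(duplicateUUIDs(lbs))
}
```

A real cleanup would also need to decide which copy is still referenced by switches and routers before deleting anything; the sketch ignores that.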

Comment 6 Tim Rozet 2022-01-27 16:02:46 UTC
As dcbw mentioned, the root cause of this is really an OVSDB server bug: https://bugzilla.redhat.com/show_bug.cgi?id=2046340

The bug can happen to any resource type, not just load_balancers. The bug is present in past releases of OCP; it's just that in 4.10 we added some code in ovn-kubernetes that detects the problem and returns an error when it occurs. The workaround is to add an OVSDB wait method operation before every creation attempt in OVSDB using libovsdb. This wait method acts as a guard that prevents OVSDB from creating the same thing twice. Note, ACLs currently cannot be guarded and could still encounter the issue, but hopefully will reconcile.
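For readers unfamiliar with the wait operation, the sketch below builds a transaction in which a wait precedes the insert, using only the Go standard library. The field names follow the wait and insert operations in RFC 7047. The Operation struct, guardedInsert, the "!=" condition and the uuid-name value are assumptions chosen for illustration; this is not the libovsdb API or the exact operation the merged pull requests emit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Operation is a minimal subset of an OVSDB operation as described in RFC 7047.
// Timeout is a pointer so that an explicit 0 is still encoded.
type Operation struct {
	Op       string                   `json:"op"`
	Table    string                   `json:"table"`
	Timeout  *int                     `json:"timeout,omitempty"`
	Where    [][]interface{}          `json:"where,omitempty"`
	Columns  []string                 `json:"columns,omitempty"`
	Until    string                   `json:"until,omitempty"`
	Rows     []map[string]interface{} `json:"rows,omitempty"`
	Row      map[string]interface{}   `json:"row,omitempty"`
	UUIDName string                   `json:"uuid-name,omitempty"`
}

// guardedInsert returns a wait followed by an insert. The wait selects rows
// whose name equals the new name and requires that the result differ from a
// row with that name. With a timeout of 0 the check runs once, so the
// transaction fails instead of inserting when such a row already exists.
func guardedInsert(table, name string) []Operation {
	timeout := 0
	wait := Operation{
		Op:      "wait",
		Table:   table,
		Timeout: &timeout,
		Where:   [][]interface{}{{"name", "==", name}},
		Columns: []string{"name"},
		Until:   "!=",
		Rows:    []map[string]interface{}{{"name": name}},
	}
	insert := Operation{
		Op:       "insert",
		Table:    table,
		Row:      map[string]interface{}{"name": name},
		UUIDName: "new_row",
	}
	return []Operation{wait, insert}
}

// encode renders the operations as the JSON array sent in a transact request.
func encode(ops []Operation) (string, error) {
	b, err := json.Marshal(ops)
	return string(b), err
}

func main() {
	out, err := encode(guardedInsert("Load_Balancer", "lb-a"))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because both operations travel in one transaction, a retried execution re-evaluates the wait against the already committed row and aborts rather than inserting a second copy. A guard like this needs a column that identifies the object, such as name.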

For 4.9 we used a mixture of nbctl and go-ovn (not libovsdb). It's unclear whether or not nbctl gives us some protection against this problem when creating different resource types. We may need to examine this for potential backports and investigate how we could fix this in nbctl/go-ovn. Ideally we would rather have the root cause fixed in OVSDB server itself and backport that fix.

Comment 12 Ross Brattain 2022-02-03 05:45:41 UTC
Unable to find duplicate load-balancers on 4.10.0-0.nightly-2022-02-02-000921

Verified

sh-4.4# ovn-nbctl --no-leader-only --format=csv --data=bare --no-heading --columns=name  find Load_Balancer | sort | uniq -c
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_default/kubernetes_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-apiserver-operator/metrics_TCP_cluster
      1 Service_openshift-apiserver/api_TCP_cluster
      1 Service_openshift-apiserver/check-endpoints_TCP_cluster
      1 Service_openshift-authentication-operator/metrics_TCP_cluster
      1 Service_openshift-authentication/oauth-openshift_TCP_cluster
      1 Service_openshift-cloud-credential-operator/cco-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/cluster-storage-operator-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics_TCP_cluster
      1 Service_openshift-cluster-storage-operator/csi-snapshot-webhook_TCP_cluster
      1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-config-operator/metrics_TCP_cluster
      1 Service_openshift-console-operator/metrics_TCP_cluster
      1 Service_openshift-console/console_TCP_cluster
      1 Service_openshift-console/downloads_TCP_cluster
      1 Service_openshift-controller-manager-operator/metrics_TCP_cluster
      1 Service_openshift-controller-manager/controller-manager_TCP_cluster
      1 Service_openshift-dns-operator/metrics_TCP_cluster
      1 Service_openshift-dns/dns-default_TCP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-dns/dns-default_UDP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-etcd-operator/metrics_TCP_cluster
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-etcd/etcd_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-image-registry/image-registry_TCP_cluster
      1 Service_openshift-ingress-canary/ingress-canary_TCP_cluster
      1 Service_openshift-ingress-operator/metrics_TCP_cluster
      1 Service_openshift-ingress/router-default_TCP_cluster
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-ingress/router-internal-default_TCP_cluster
      1 Service_openshift-insights/metrics_TCP_cluster
      1 Service_openshift-kube-apiserver-operator/metrics_TCP_cluster
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-apiserver/apiserver_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-controller-manager-operator/metrics_TCP_cluster
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-scheduler-operator/metrics_TCP_cluster
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-kube-scheduler/scheduler_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-kube-storage-version-migrator-operator/metrics_TCP_cluster
      1 Service_openshift-machine-api/cluster-autoscaler-operator_TCP_cluster
      1 Service_openshift-machine-api/cluster-baremetal-operator-service_TCP_cluster
      1 Service_openshift-machine-api/cluster-baremetal-webhook-service_TCP_cluster
      1 Service_openshift-machine-api/machine-api-controllers_TCP_cluster
      1 Service_openshift-machine-api/machine-api-operator-webhook_TCP_cluster
      1 Service_openshift-machine-api/machine-api-operator_TCP_cluster
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-1-h7klb
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-2-277ht
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-3-gkrgw
      1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
      1 Service_openshift-marketplace/certified-operators_TCP_cluster
      1 Service_openshift-marketplace/community-operators_TCP_cluster
      1 Service_openshift-marketplace/marketplace-operator-metrics_TCP_cluster
      1 Service_openshift-marketplace/redhat-marketplace_TCP_cluster
      1 Service_openshift-marketplace/redhat-operators_TCP_cluster
      1 Service_openshift-monitoring/alertmanager-main_TCP_cluster
      1 Service_openshift-monitoring/grafana_TCP_cluster
      1 Service_openshift-monitoring/prometheus-adapter_TCP_cluster
      1 Service_openshift-monitoring/prometheus-k8s_TCP_cluster
      1 Service_openshift-monitoring/thanos-querier_TCP_cluster
      1 Service_openshift-multus/multus-admission-controller_TCP_cluster
      1 Service_openshift-network-diagnostics/network-check-target_TCP_cluster
      1 Service_openshift-oauth-apiserver/api_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/catalog-operator-metrics_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/olm-operator-metrics_TCP_cluster
      1 Service_openshift-operator-lifecycle-manager/packageserver-service_TCP_cluster
      1 Service_openshift-service-ca-operator/metrics_TCP_cluster
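The same check can be scripted so that only offending names are reported instead of scanning the counts by eye. The sketch below is a hypothetical helper, not part of any OVN tooling: it reads one load balancer name per line from standard input (for example the output of the ovn-nbctl find command above, without the sort and uniq stages) and prints the names that occur more than once.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// duplicateNames returns, in sorted order, every name that appears more than
// once. Blank entries are ignored and surrounding whitespace is trimmed.
func duplicateNames(names []string) []string {
	counts := make(map[string]int)
	for _, n := range names {
		n = strings.TrimSpace(n)
		if n != "" {
			counts[n]++
		}
	}
	var dupes []string
	for name, c := range counts {
		if c > 1 {
			dupes = append(dupes, name)
		}
	}
	sort.Strings(dupes)
	return dupes
}

func main() {
	var names []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		names = append(names, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(2)
	}
	dupes := duplicateNames(names)
	for _, d := range dupes {
		fmt.Println("duplicate:", d)
	}
	if len(dupes) > 0 {
		os.Exit(1) // non-zero exit so the check can gate a test run
	}
}
```

For the listing above, where every count is 1, the helper would print nothing and exit with status 0.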

Comment 14 errata-xmlrpc 2022-03-12 04:41:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

