/tmp/dbgeq1.txt:I0110 17:54:22.725955 1 services_controller.go:224] "Error syncing service, retrying" service="openshift-machine-config-operator/machine-config-daemon" err="failed to ensure service openshift-machine-config-operator/machine-config-daemon load balancers: unexpectedly found multiple load balancers: [{UUID:a8f708c5-8694-4fb8-b95e-d716a21f4367 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc00a28c010 SelectionFields:[] Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:90
01,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.248.42:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9001,10.0.255.15:9001,10.0.255.172:9001]} {UUID:1d70ad2a-93ac-439a-bf9b-e9d92a4a7694 ExternalIDs:map[k8s.ovn.org/kind:Service k8s.ovn.org/owner:openshift-machine-config-operator/machine-config-daemon] HealthCheck:[] IPPortMappings:map[] Name:Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal Options:map[event:false reject:true skip_snat:false] Protocol:0xc009dd5880 SelectionFields:[] 
Vips:map[172.30.146.245:9001:169.254.169.2:9001,10.0.129.43:9001,10.0.132.49:9001,10.0.133.149:9001,10.0.133.56:9001,10.0.134.135:9001,10.0.135.107:9001,10.0.135.166:9001,10.0.135.171:9001,10.0.135.225:9001,10.0.136.122:9001,10.0.138.44:9001,10.0.138.74:9001,10.0.139.225:9001,10.0.139.92:9001,10.0.140.133:9001,10.0.141.124:9001,10.0.141.188:9001,10.0.142.228:9001,10.0.144.39:9001,10.0.145.216:9001,10.0.145.49:9001,10.0.145.7:9001,10.0.147.146:9001,10.0.149.113:9001,10.0.149.16:9001,10.0.150.241:9001,10.0.151.120:9001,10.0.155.49:9001,10.0.157.36:9001,10.0.158.36:9001,10.0.160.28:9001,10.0.160.88:9001,10.0.162.186:9001,10.0.164.79:9001,10.0.165.226:9001,10.0.165.58:9001,10.0.166.13:9001,10.0.168.167:9001,10.0.168.212:9001,10.0.169.85:9001,10.0.170.7:9001,10.0.171.113:9001,10.0.172.255:9001,10.0.173.52:9001,10.0.173.8:9001,10.0.174.76:9001,10.0.174.99:9001,10.0.179.207:9001,10.0.179.62:9001,10.0.180.213:9001,10.0.180.31:9001,10.0.181.9:9001,10.0.183.134:9001,10.0.183.47:9001,10.0.184.192:9001,10.0.186.90:9001,10.0.188.28:9001,10.0.188.47:9001,10.0.189.136:9001,10.0.189.161:9001,10.0.190.23:9001,10.0.192.198:9001,10.0.192.239:9001,10.0.196.106:9001,10.0.197.13:9001,10.0.199.121:9001,10.0.199.122:9001,10.0.199.215:9001,10.0.201.104:9001,10.0.202.213:9001,10.0.203.85:9001,10.0.205.79:9001,10.0.205.96:9001,10.0.206.7:9001,10.0.207.202:9001,10.0.207.225:9001,10.0.208.121:9001,10.0.208.225:9001,10.0.208.232:9001,10.0.208.27:9001,10.0.209.0:9001,10.0.214.143:9001,10.0.216.111:9001,10.0.217.25:9001,10.0.217.93:9001,10.0.220.195:9001,10.0.225.118:9001,10.0.225.27:9001,10.0.227.221:9001,10.0.230.167:9001,10.0.230.176:9001,10.0.231.208:9001,10.0.234.187:9001,10.0.237.9:9001,10.0.238.28:9001,10.0.239.233:9001,10.0.240.39:9001,10.0.240.9:9001,10.0.241.56:9001,10.0.243.186:9001,10.0.243.65:9001,10.0.244.37:9001,10.0.245.134:9001,10.0.246.122:9001,10.0.246.216:9001,10.0.246.60:9001,10.0.247.97:9001,10.0.250.102:9001,10.0.252.134:9001,10.0.253.139:9001,10.0.253.141:9001,10.0.254.34:9
001,10.0.255.15:9001,10.0.255.172:9001]}]"
This happens as a result of RAFT cluster leadership changes where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.
Ilya says (of the general issue): "The likely sequence of events:
1. client sends a transaction.
2. old leader starts execution: initiates a raft command, sends append requests to followers.
3. old leader transfers the leadership: completes the in-progress raft command with a 'cluster error: lost leadership'.
4. old leader rejects append replies for the raft command from followers.
5. new leader has the transaction committed, because it received the append request previously and the other follower also has it.
6. new leader replicates the result of the old transaction to the old leader.
7. old leader checks that state of the old transaction is 'cluster error' and cluster errors are temporary -> retries the execution.
8. Execution retry fails with a constraint violation."
> where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.
I am not sure this makes sense to me. If we had a UUID that is not the real one, then the op
    op, err := nbClient.Where(lb).Update(lb, fields...)
would return an error unless there was some other LB that had that UUID. Is this what we are talking about? But double-checking the services controller, I think it is not using the cached UUIDs (bug), so it is actually always searching by name. Indeed, we wouldn't have the above error otherwise, as that error indicates that it is searching for an existing LB by name (the controller populates its cache at startup from the libovsdb cache).
This error message is coming from libovsdbops. Somehow, our cache got messed up. As a first step, I got rid of the separate cache that the code kept on top of libovsdb. I also added some code to clean up any duplicate LBs just in case. https://github.com/ovn-org/ovn-kubernetes/pull/2757
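As an illustration of the duplicate cleanup idea, here is a minimal sketch that groups load balancers by name and separates the extra copies. It is not the code from the pull request: the loadBalancer type and the splitDuplicates function are hypothetical, and keeping the first copy seen is an arbitrary choice made for this sketch. The two UUIDs are the ones from the error message above.

```go
package main

import "fmt"

// loadBalancer is a reduced, hypothetical view of a Load_Balancer row.
type loadBalancer struct {
	UUID string
	Name string
}

// splitDuplicates keeps the first load balancer seen for each name and
// returns every later one with the same name as a duplicate.
func splitDuplicates(lbs []loadBalancer) (keep, dupes []loadBalancer) {
	seen := make(map[string]bool)
	for _, lb := range lbs {
		if seen[lb.Name] {
			dupes = append(dupes, lb)
			continue
		}
		seen[lb.Name] = true
		keep = append(keep, lb)
	}
	return keep, dupes
}

func main() {
	name := "Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_ip-10-0-129-163.us-west-2.compute.internal"
	lbs := []loadBalancer{
		{UUID: "a8f708c5-8694-4fb8-b95e-d716a21f4367", Name: name},
		{UUID: "1d70ad2a-93ac-439a-bf9b-e9d92a4a7694", Name: name},
	}
	keep, dupes := splitDuplicates(lbs)
	fmt.Printf("keeping %d, removing %d\n", len(keep), len(dupes))
	for _, lb := range dupes {
		fmt.Println("duplicate:", lb.UUID)
	}
}
```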
As dcbw mentioned, the root cause of this is really an OVSDB server bug: https://bugzilla.redhat.com/show_bug.cgi?id=2046340 The bug can happen to any resource type, not just load_balancers. The bug is present in past releases of OCP. It is just that in 4.10 we added some code in ovn-kubernetes that detects this problem and reports an error when it occurs. The workaround is to add an OVSDB wait operation before every creation attempt made in OVSDB through libovsdb. This wait operation acts as a guard that prevents OVSDB from creating the same thing twice. Note that ACLs currently cannot be guarded and could still encounter the issue, but hopefully will reconcile. For 4.9 we used a mixture of nbctl and go-ovn (not libovsdb). It's unclear whether nbctl gives us some protection against this problem when creating different resource types. We may need to examine this for potential backports and investigate how we could fix this in nbctl/go-ovn. Ideally we would rather have the root cause fixed in the OVSDB server itself and backport that fix.
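One way such a guard could be expressed is sketched below as the raw operations of an OVSDB transaction, built with the Go standard library only. It follows the "wait" operation defined in RFC 7047; it is not the ovn-kubernetes or libovsdb implementation, and guardedInsert is a hypothetical helper. With "until" set to "!=" and a timeout of 0, the wait fails immediately if a row with that name already exists, so the whole transaction, including the insert, is aborted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// guardedInsert returns two operations for one transaction: a "wait" that
// only passes when no row named name exists, followed by the "insert".
func guardedInsert(table, name string) []map[string]interface{} {
	wait := map[string]interface{}{
		"op":      "wait",
		"timeout": 0, // do not block: fail at once if the condition is unmet
		"table":   table,
		"where":   [][]interface{}{{"name", "==", name}},
		"columns": []string{"name"},
		// Pass only if the selected rows differ from "rows" below, which is
		// the case when no row with this name exists yet.
		"until": "!=",
		"rows":  []map[string]interface{}{{"name": name}},
	}
	insert := map[string]interface{}{
		"op":    "insert",
		"table": table,
		"row":   map[string]interface{}{"name": name},
	}
	return []map[string]interface{}{wait, insert}
}

func main() {
	ops := guardedInsert("Load_Balancer", "Service_openshift-console/console_TCP_cluster")
	out, err := json.MarshalIndent(ops, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Because the guard and the insert travel in the same transaction, a retried execution of that transaction would hit the wait first and abort instead of creating a second row.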
Unable to find duplicate load-balancers on 4.10.0-0.nightly-2022-02-02-000921
Verified
sh-4.4# ovn-nbctl --no-leader-only --format=csv --data=bare --no-heading --columns=name find Load_Balancer | sort | uniq -c
1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_default/kubernetes_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-apiserver-operator/metrics_TCP_cluster
1 Service_openshift-apiserver/api_TCP_cluster
1 Service_openshift-apiserver/check-endpoints_TCP_cluster
1 Service_openshift-authentication-operator/metrics_TCP_cluster
1 Service_openshift-authentication/oauth-openshift_TCP_cluster
1 Service_openshift-cloud-credential-operator/cco-metrics_TCP_cluster
1 Service_openshift-cluster-storage-operator/cluster-storage-operator-metrics_TCP_cluster
1 Service_openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics_TCP_cluster
1 Service_openshift-cluster-storage-operator/csi-snapshot-webhook_TCP_cluster
1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-config-operator/metrics_TCP_cluster
1 Service_openshift-console-operator/metrics_TCP_cluster
1 Service_openshift-console/console_TCP_cluster
1 Service_openshift-console/downloads_TCP_cluster
1 Service_openshift-controller-manager-operator/metrics_TCP_cluster
1 Service_openshift-controller-manager/controller-manager_TCP_cluster
1 Service_openshift-dns-operator/metrics_TCP_cluster
1 Service_openshift-dns/dns-default_TCP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
1 Service_openshift-dns/dns-default_UDP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
1 Service_openshift-etcd-operator/metrics_TCP_cluster
1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-etcd/etcd_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-image-registry/image-registry_TCP_cluster
1 Service_openshift-ingress-canary/ingress-canary_TCP_cluster
1 Service_openshift-ingress-operator/metrics_TCP_cluster
1 Service_openshift-ingress/router-default_TCP_cluster
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-2-277ht
1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
1 Service_openshift-ingress/router-internal-default_TCP_cluster
1 Service_openshift-insights/metrics_TCP_cluster
1 Service_openshift-kube-apiserver-operator/metrics_TCP_cluster
1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-kube-apiserver/apiserver_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-kube-controller-manager-operator/metrics_TCP_cluster
1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-kube-scheduler-operator/metrics_TCP_cluster
1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-kube-scheduler/scheduler_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-kube-storage-version-migrator-operator/metrics_TCP_cluster
1 Service_openshift-machine-api/cluster-autoscaler-operator_TCP_cluster
1 Service_openshift-machine-api/cluster-baremetal-operator-service_TCP_cluster
1 Service_openshift-machine-api/cluster-baremetal-webhook-service_TCP_cluster
1 Service_openshift-machine-api/machine-api-controllers_TCP_cluster
1 Service_openshift-machine-api/machine-api-operator-webhook_TCP_cluster
1 Service_openshift-machine-api/machine-api-operator_TCP_cluster
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-1-h7klb
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-2-277ht
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-3-gkrgw
1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
1 Service_openshift-marketplace/certified-operators_TCP_cluster
1 Service_openshift-marketplace/community-operators_TCP_cluster
1 Service_openshift-marketplace/marketplace-operator-metrics_TCP_cluster
1 Service_openshift-marketplace/redhat-marketplace_TCP_cluster
1 Service_openshift-marketplace/redhat-operators_TCP_cluster
1 Service_openshift-monitoring/alertmanager-main_TCP_cluster
1 Service_openshift-monitoring/grafana_TCP_cluster
1 Service_openshift-monitoring/prometheus-adapter_TCP_cluster
1 Service_openshift-monitoring/prometheus-k8s_TCP_cluster
1 Service_openshift-monitoring/thanos-querier_TCP_cluster
1 Service_openshift-multus/multus-admission-controller_TCP_cluster
1 Service_openshift-network-diagnostics/network-check-target_TCP_cluster
1 Service_openshift-oauth-apiserver/api_TCP_cluster
1 Service_openshift-operator-lifecycle-manager/catalog-operator-metrics_TCP_cluster
1 Service_openshift-operator-lifecycle-manager/olm-operator-metrics_TCP_cluster
1 Service_openshift-operator-lifecycle-manager/packageserver-service_TCP_cluster
1 Service_openshift-service-ca-operator/metrics_TCP_cluster
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056