Bug 2042001
| Field | Value |
|---|---|
| Summary | unexpectedly found multiple load balancers |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Reporter | Dan Williams <dcbw> |
| Assignee | Tim Rozet <trozet> |
| QA Contact | Ross Brattain <rbrattai> |
| CC | anbhat, elevin, jcaamano, rbrattai, surya, trozet, vpickard |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Version | 4.10 |
| Target Release | 4.10.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Last Closed | 2022-03-12 04:41:03 UTC |
Description
Dan Williams
2022-01-18 16:51:15 UTC
Ilya says (of the general issue):

"The likely sequence of events:

1. client sends a transaction.
2. old leader starts execution: initiates a raft command, sends append requests to followers.
3. old leader transfers the leadership: completes the in-progress raft command with a 'cluster error: lost leadership'
4. old leader rejects append replies for the raft command from followers
5. new leader has the transaction committed, because it received the append request previously and the other follower also has it.
6. new leader replicates the result of the old transaction to the old leader.
7. old leader checks that state of the old transaction is 'cluster error' and cluster errors are temporary -> retries the execution.
8. Execution retry fails with a constraint violation."

> where we apparently cache a LB UUID in the service controller that does not correspond to the one in the libovsdb cache that was ultimately created.
I am not sure this makes sense to me. If we had a UUID that is not the real one, then the op
op, err := nbClient.Where(lb).Update(lb, fields...)
would return an error unless there was some other LB that had that UUID. Is this what we are talking about?
But double-checking the services controller, I think it is not using the cached UUIDs (a bug), so it is actually always searching by name. Indeed, we wouldn't see the above error otherwise, because that error indicates the controller is searching for an existing LB by name (the controller populates its cache at startup from the libovsdb cache).
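To make the lookup-by-name reasoning concrete, here is a minimal Go sketch of a name-based search that errors out when more than one row matches. It is not the ovn-kubernetes implementation: the `LoadBalancer` struct, the `findByName` helper, and the sample UUIDs and name are hypothetical, and only the error wording is taken from this bug's summary.

```go
package main

import (
	"errors"
	"fmt"
)

// LoadBalancer is a hypothetical, simplified stand-in for a row in the
// OVN Northbound Load_Balancer table.
type LoadBalancer struct {
	UUID string
	Name string
}

// errMultiple reuses the wording from this bug's summary.
var errMultiple = errors.New("unexpectedly found multiple load balancers")

// findByName returns the single row with the given name, nil if no row
// matches, or errMultiple if more than one row carries that name.
func findByName(rows []LoadBalancer, name string) (*LoadBalancer, error) {
	var found *LoadBalancer
	for i := range rows {
		if rows[i].Name != name {
			continue
		}
		if found != nil {
			// A second match: the name is no longer unique.
			return nil, errMultiple
		}
		found = &rows[i]
	}
	return found, nil
}

func main() {
	// Two rows with the same (made-up) name, as a duplicated insert would leave behind.
	rows := []LoadBalancer{
		{UUID: "uuid-1", Name: "Service_example/svc_TCP_cluster"},
		{UUID: "uuid-2", Name: "Service_example/svc_TCP_cluster"},
	}
	_, err := findByName(rows, "Service_example/svc_TCP_cluster")
	fmt.Println(err)
}
```

A lookup keyed on UUID could not hit this error path, since UUIDs are unique per row. That is consistent with the observation that the controller must be searching by name.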
This error message is coming from libovsdbops. Somehow, our cache got messed up. As a first step, I got rid of the separate cache that the code kept on top of libovsdb. I also added some code to clean up any duplicate LBs just in case. https://github.com/ovn-org/ovn-kubernetes/pull/2757

As dcbw mentioned, the root cause of this is really an OVSDB server bug: https://bugzilla.redhat.com/show_bug.cgi?id=2046340

The bug can happen to any resource type, not just load_balancers. It is present in past releases of OCP; it's just that in 4.10 we added some code in ovn-kubernetes that detects the problem and returns an error when it occurs. The workaround is to add an OVSDB wait operation before every creation attempt made in OVSDB using libovsdb. This wait operation acts as a guard that prevents OVSDB from creating the same thing twice. Note that ACLs currently cannot be guarded and could still encounter the issue, but hopefully they will reconcile.

For 4.9 we used a mixture of nbctl and go-ovn (not libovsdb). It's unclear whether nbctl gives us some protection against this problem when creating different resource types. We may need to examine this for potential backports and investigate how we could fix this in nbctl/go-ovn. Ideally we would rather have the root cause fixed in the OVSDB server itself and backport that fix.
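The wait-guard workaround described above can be sketched at the OVSDB wire level. The snippet below builds a two-operation transaction in the shape defined by RFC 7047: a `wait` with `timeout` 0 that fails if a row with the given name already exists, followed by the `insert`. It is an illustration only and uses just the Go standard library; the `Operation` struct, the `guardedCreate` helper, and the `lb-example` name are assumptions, not the libovsdb or ovn-kubernetes API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Operation models only the RFC 7047 operation members needed for this sketch.
type Operation struct {
	Op      string                   `json:"op"`
	Table   string                   `json:"table"`
	Timeout *int                     `json:"timeout,omitempty"`
	Where   [][]interface{}          `json:"where,omitempty"`
	Columns []string                 `json:"columns,omitempty"`
	Until   string                   `json:"until,omitempty"`
	Rows    []map[string]interface{} `json:"rows,omitempty"`
	Row     map[string]interface{}   `json:"row,omitempty"`
}

// guardedCreate (hypothetical helper) returns a wait op followed by an insert op.
// The wait selects rows whose name equals `name` and requires that the result
// differ ("!=") from a row with that name. With timeout 0 the wait fails
// immediately if such a row already exists, aborting the whole transaction.
func guardedCreate(table, name string) []Operation {
	timeout := 0
	wait := Operation{
		Op:      "wait",
		Table:   table,
		Timeout: &timeout,
		Where:   [][]interface{}{{"name", "==", name}},
		Columns: []string{"name"},
		Until:   "!=",
		Rows:    []map[string]interface{}{{"name": name}},
	}
	insert := Operation{
		Op:    "insert",
		Table: table,
		Row:   map[string]interface{}{"name": name},
	}
	return []Operation{wait, insert}
}

func main() {
	ops := guardedCreate("Load_Balancer", "lb-example")
	out, err := json.MarshalIndent(ops, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because both operations travel in one transaction, a retried execution of an already committed transaction would hit the failing wait instead of inserting a second row with the same name.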
Unable to find duplicate load-balancers on 4.10.0-0.nightly-2022-02-02-000921

Verified

    sh-4.4# ovn-nbctl --no-leader-only --format=csv --data=bare --no-heading --columns=name find Load_Balancer | sort | uniq -c
    1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_default/kubernetes_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_default/kubernetes_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-apiserver-operator/metrics_TCP_cluster
    1 Service_openshift-apiserver/api_TCP_cluster
    1 Service_openshift-apiserver/check-endpoints_TCP_cluster
    1 Service_openshift-authentication-operator/metrics_TCP_cluster
    1 Service_openshift-authentication/oauth-openshift_TCP_cluster
    1 Service_openshift-cloud-credential-operator/cco-metrics_TCP_cluster
    1 Service_openshift-cluster-storage-operator/cluster-storage-operator-metrics_TCP_cluster
    1 Service_openshift-cluster-storage-operator/csi-snapshot-controller-operator-metrics_TCP_cluster
    1 Service_openshift-cluster-storage-operator/csi-snapshot-webhook_TCP_cluster
    1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-cluster-version/cluster-version-operator_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-config-operator/metrics_TCP_cluster
    1 Service_openshift-console-operator/metrics_TCP_cluster
    1 Service_openshift-console/console_TCP_cluster
    1 Service_openshift-console/downloads_TCP_cluster
    1 Service_openshift-controller-manager-operator/metrics_TCP_cluster
    1 Service_openshift-controller-manager/controller-manager_TCP_cluster
    1 Service_openshift-dns-operator/metrics_TCP_cluster
    1 Service_openshift-dns/dns-default_TCP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
    1 Service_openshift-dns/dns-default_TCP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
    1 Service_openshift-dns/dns-default_UDP_node_router_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-2-277ht
    1 Service_openshift-dns/dns-default_UDP_node_switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
    1 Service_openshift-etcd-operator/metrics_TCP_cluster
    1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-etcd/etcd_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-etcd/etcd_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-image-registry/image-registry_TCP_cluster
    1 Service_openshift-ingress-canary/ingress-canary_TCP_cluster
    1 Service_openshift-ingress-operator/metrics_TCP_cluster
    1 Service_openshift-ingress/router-default_TCP_cluster
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-1-h7klb
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-2-277ht
    1 Service_openshift-ingress/router-default_TCP_node_router+switch_rbrattai-o410i31-7bx9q-worker-3-gkrgw
    1 Service_openshift-ingress/router-internal-default_TCP_cluster
    1 Service_openshift-insights/metrics_TCP_cluster
    1 Service_openshift-kube-apiserver-operator/metrics_TCP_cluster
    1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-kube-apiserver/apiserver_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-kube-apiserver/apiserver_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-kube-controller-manager-operator/metrics_TCP_cluster
    1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-kube-controller-manager/kube-controller-manager_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-kube-scheduler-operator/metrics_TCP_cluster
    1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-kube-scheduler/scheduler_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-kube-scheduler/scheduler_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-kube-storage-version-migrator-operator/metrics_TCP_cluster
    1 Service_openshift-machine-api/cluster-autoscaler-operator_TCP_cluster
    1 Service_openshift-machine-api/cluster-baremetal-operator-service_TCP_cluster
    1 Service_openshift-machine-api/cluster-baremetal-webhook-service_TCP_cluster
    1 Service_openshift-machine-api/machine-api-controllers_TCP_cluster
    1 Service_openshift-machine-api/machine-api-operator-webhook_TCP_cluster
    1 Service_openshift-machine-api/machine-api-operator_TCP_cluster
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-0
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-1
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-master-2
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-1-h7klb
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-2-277ht
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_router_rbrattai-o410i31-7bx9q-worker-3-gkrgw
    1 Service_openshift-machine-config-operator/machine-config-daemon_TCP_node_switch_rbrattai-o410i31-7bx9q-master-0_merged
    1 Service_openshift-marketplace/certified-operators_TCP_cluster
    1 Service_openshift-marketplace/community-operators_TCP_cluster
    1 Service_openshift-marketplace/marketplace-operator-metrics_TCP_cluster
    1 Service_openshift-marketplace/redhat-marketplace_TCP_cluster
    1 Service_openshift-marketplace/redhat-operators_TCP_cluster
    1 Service_openshift-monitoring/alertmanager-main_TCP_cluster
    1 Service_openshift-monitoring/grafana_TCP_cluster
    1 Service_openshift-monitoring/prometheus-adapter_TCP_cluster
    1 Service_openshift-monitoring/prometheus-k8s_TCP_cluster
    1 Service_openshift-monitoring/thanos-querier_TCP_cluster
    1 Service_openshift-multus/multus-admission-controller_TCP_cluster
    1 Service_openshift-network-diagnostics/network-check-target_TCP_cluster
    1 Service_openshift-oauth-apiserver/api_TCP_cluster
    1 Service_openshift-operator-lifecycle-manager/catalog-operator-metrics_TCP_cluster
    1 Service_openshift-operator-lifecycle-manager/olm-operator-metrics_TCP_cluster
    1 Service_openshift-operator-lifecycle-manager/packageserver-service_TCP_cluster
    1 Service_openshift-service-ca-operator/metrics_TCP_cluster

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056