Created attachment 1854634 [details]
test execution log with stack trace

Description of problem:
Platform - Metal

How reproducible:
Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/291/
2. Profile: OCP 4.10 - vm-disconnected-ipv4v6_ctlplane-ipv6_provisioning-ipsec-no_fips

Steps to Reproduce:
1. Identify the master that holds the API_VIP (master-0-2).
2. Pick another master and stop the kube-api container on it (master-0-1).
3. Poll the API_VIP hostname for 30 seconds (every 10 seconds).

Actual results:
The API_VIP switched from master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com.

Expected results:
After 30 seconds the API_VIP should stay on the same node (master-0-2).

Additional info:
The API_VIP should move only if the HAProxy service failed and cannot load-balance the traffic. The issue is sporadic and not 100% reproducible.
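A minimal sketch of the reproduction loop in steps 1-3, assuming ssh access as core@&lt;node&gt;, that crictl manages the kube-apiserver container, and that the VIP resolves via the api hostname; the hostname, node names, and helper function are assumptions for illustration, not the actual test code:

import subprocess
import time

API_VIP_HOSTNAME = "api.ocp-edge-cluster-0.qe.lab.redhat.com"   # assumed VIP hostname
MASTERS = ["master-0-0", "master-0-1", "master-0-2"]            # node names from the report

def vip_holder():
    # Resolve the VIP address and find which master currently has it assigned.
    vip = subprocess.run(["getent", "hosts", API_VIP_HOSTNAME],
                         capture_output=True, text=True).stdout.split()[0]
    for node in MASTERS:
        addrs = subprocess.run(["ssh", f"core@{node}", "ip -o addr"],
                               capture_output=True, text=True).stdout
        if vip in addrs:
            return node
    return None

original = vip_holder()                                  # step 1: e.g. master-0-2
victim = next(m for m in MASTERS if m != original)       # step 2: pick another master
subprocess.run(["ssh", f"core@{victim}",                 # stop its kube-apiserver container
                "sudo crictl stop $(sudo crictl ps --name kube-apiserver -q)"])

for _ in range(3):                                       # step 3: poll for 30s, every 10s
    time.sleep(10)
    current = vip_holder()
    assert current == original, f"API_VIP moved from {original} to {current}"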
The only thing I can think of that would cause this is if our haproxy health check got load-balanced to the stopped node multiple times, so it thinks that haproxy has failed. We may need to look at the timing of the failover vs. the health checks to make sure that if a backend drops out we detect it before we run enough failed health checks to trigger a failover.
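A back-of-the-envelope sketch of the race being described; every interval and threshold below is an assumption for illustration, not the cluster's actual keepalived/haproxy settings:

# All numbers are assumptions, not the real keepalived/haproxy configuration.
CHECK_INTERVAL = 1.0    # seconds between keepalived API health checks (assumed)
FALL_COUNT = 3          # consecutive failed checks before the VIP moves (assumed)
BACKEND_CHECK = 2.0     # haproxy backend check interval (assumed)
BACKEND_FALL = 2        # failed backend checks before haproxy drops the server (assumed)

time_to_failover = CHECK_INTERVAL * FALL_COUNT       # keepalived gives up after ~3s
time_to_drop_backend = BACKEND_CHECK * BACKEND_FALL  # haproxy keeps routing to the dead
                                                     # apiserver for ~4s

# If haproxy can keep load-balancing health checks to the stopped backend for longer
# than keepalived tolerates failures, an unlucky run of checks moves the VIP even
# though haproxy itself is healthy.
if time_to_drop_backend > time_to_failover:
    print("race window: a spurious failover is possible")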
We continue to hit this issue approximately once a week in different CI profiles:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/2301//testReport/junit/deployment.networking/test_lb_availability/test_lb_availability_when_non_api_vip_shut_down/

>       raise AssertionError(f"API_VIP should not {to_from_msg} if haproxy pod current id "
                             f"{current_haproxy_pod_id} is the same as original pod id "
                             f"{original_haproxy_pod_id}")
E       AssertionError: API_VIP should not switch from master-0-1 to master-0-0 if haproxy pod current id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4'] is the same as original pod id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4']
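A hedged sketch of the guard the failing assertion comes from (helper names are hypothetical, not the actual ocp-edge-auto-tests code): the VIP is only allowed to move if the haproxy container on the original holder was restarted, i.e. its container ID changed.

import subprocess

def haproxy_pod_id(node):
    # Container ID of the haproxy pod on the given master (assumed ssh + crictl access).
    out = subprocess.run(["ssh", f"core@{node}", "sudo crictl ps --name haproxy -q"],
                         capture_output=True, text=True)
    return out.stdout.split()

original_holder = "master-0-1"   # VIP holder before the disruption (from the report)
current_holder = "master-0-0"    # VIP holder observed after 30s of polling (from the report)

original_haproxy_pod_id = haproxy_pod_id(original_holder)
# ... kube-apiserver on another master is stopped here, then 30 seconds of polling ...
current_haproxy_pod_id = haproxy_pod_id(original_holder)

if current_holder != original_holder:
    # A VIP move is only legitimate if haproxy itself failed or restarted,
    # i.e. its container ID on the original holder changed.
    assert current_haproxy_pod_id != original_haproxy_pod_id, (
        f"API_VIP should not switch from {original_holder} to {current_holder} "
        f"if haproxy pod current id {current_haproxy_pod_id} is the same as "
        f"original pod id {original_haproxy_pod_id}")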
We still need logs from a failing environment in order to debug this. The fact that it only happens about once a week means this must be a fairly uncommon edge case and it's unlikely we're going to be able to guess what the problem is.
Verified by a number of CI runs; the issue does not happen anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069