Bug 2045559
| Summary: | API_VIP moved when kube-api container on another master node was stopped | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Victor Voronkov <vvoronko> |
| Component: | Machine Config Operator | Assignee: | Christoph Stäbler <cstabler> |
| Machine Config Operator sub component: | platform-baremetal | QA Contact: | Victor Voronkov <vvoronko> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, bnemec, tsedovic |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:43:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2092948 | | |
| Attachments: | | | |
Description
Victor Voronkov
2022-01-25 17:14:23 UTC
The only thing I can think of that would cause this is if our haproxy health check got load-balanced to the stopped node multiple times, so that it thinks haproxy has failed. We may need to look at the timing of the failover vs. the health checks to make sure that, if a backend drops out, we detect it before we run enough health checks to trigger a failover.

We continue to hit this issue approximately once a week in different CI profiles:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/2301//testReport/junit/deployment.networking/test_lb_availability/test_lb_availability_when_non_api_vip_shut_down/

    > raise AssertionError(f"API_VIP should not {to_from_msg} if haproxy pod current id " f"{current_haproxy_pod_id} is the same as original pod id " f"{original_haproxy_pod_id}")
    E AssertionError: API_VIP should not switch from master-0-1 to master-0-0 if haproxy pod current id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4'] is the same as original pod id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4']

We still need logs from a failing environment in order to debug this. The fact that it only happens about once a week means this must be a fairly uncommon edge case, and it's unlikely we're going to be able to guess what the problem is.

Verified by a number of CI runs; the issue no longer occurs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
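
The timing concern raised in the first comment (a dead backend should be dropped before enough failed health checks accumulate to move the VIP) can be written as a simple inequality. The sketch below is only an illustration of that reasoning, not the actual fix. The `CheckTiming` class, its field names, and every number are hypothetical and are not taken from the cluster configuration. The fields are loosely modeled on the "check interval" and "consecutive failures" settings that health checkers commonly expose. The model is deliberately pessimistic: it assumes every VIP health check is routed to the stopped backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckTiming:
    """Hypothetical health-check settings (not taken from the bug report)."""
    interval_s: float      # seconds between consecutive checks
    failures_needed: int   # consecutive failures required before acting

    def slowest_reaction_s(self) -> float:
        # The failure begins just after a passing check, so the threshold
        # is reached only after failures_needed full intervals.
        return self.interval_s * self.failures_needed

    def fastest_reaction_s(self) -> float:
        # The failure begins just before a check, so the first failed check
        # is immediate and only failures_needed - 1 further intervals pass.
        return self.interval_s * (self.failures_needed - 1)


def backend_dropped_before_failover(backend: CheckTiming, vip: CheckTiming) -> bool:
    """Pessimistic test: assume every VIP check lands on the dead backend."""
    return backend.slowest_reaction_s() < vip.fastest_reaction_s()


if __name__ == "__main__":
    backend = CheckTiming(interval_s=1.0, failures_needed=3)
    # VIP check every 2 s, 3 failures: the backend is dropped first.
    print(backend_dropped_before_failover(backend, CheckTiming(2.0, 3)))
    # VIP check every 1 s, 2 failures: the VIP could move first.
    print(backend_dropped_before_failover(backend, CheckTiming(1.0, 2)))
```

Under these assumptions, a spurious failover is ruled out only when the load balancer's slowest detection time is strictly shorter than the VIP check's fastest trip time. Real traffic is spread across several backends, so the actual risk is lower than this bound suggests, which fits the roughly once-a-week failure rate.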