Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2045559

Summary: API_VIP moved when kube-api container on another master node was stopped
Product: OpenShift Container Platform
Component: Machine Config Operator
Sub component: platform-baremetal
Reporter: Victor Voronkov <vvoronko>
Assignee: Christoph Stäbler <cstabler>
QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aos-bugs, bnemec, tsedovic
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:43:43 UTC
Type: Bug
Bug Blocks: 2092948
Attachments: test execution log with stack trace

Description Victor Voronkov 2022-01-25 17:14:23 UTC
Created attachment 1854634 [details]
test execution log with stack trace

Description of problem:

Platform - Metal

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/291/

2. Profile: OCP: 4:10 - vm-disconnected-ipv4v6_ctlplane-ipv6_provisioning-ipsec-no_fips

Steps to Reproduce:
1. Identify the master that holds the API_VIP (master-0-2)
2. Pick another master and stop the kube-api container on it (master-0-1)
3. Poll the API_VIP hostname for 30 seconds, every 10 seconds (a polling sketch follows)
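
A minimal Python sketch of the polling in step 3. The VIP address, ssh user, and interval values are placeholders for illustration, not the actual ocp-edge-auto-tests code:

    # Sketch only: poll which master currently answers on the API VIP by
    # ssh'ing to the VIP address and reading the hostname it reports.
    import subprocess
    import time

    API_VIP = "192.168.123.5"   # placeholder; use the cluster's real API VIP
    POLL_INTERVAL = 10          # seconds, per step 3
    POLL_DURATION = 30          # seconds, per step 3

    def vip_holder() -> str:
        out = subprocess.run(
            ["ssh", "-o", "StrictHostKeyChecking=no", f"core@{API_VIP}", "hostname"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()

    original = vip_holder()
    for _ in range(POLL_DURATION // POLL_INTERVAL):
        time.sleep(POLL_INTERVAL)
        current = vip_holder()
        assert current == original, f"API_VIP moved from {original} to {current}"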

Actual results:
API_VIP switched from master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com

Expected results:
After 30 seconds the API_VIP should remain on the same node (master-0-2)

Additional info:
The API_VIP should move only if the HAProxy service has failed and cannot load-balance the traffic.
The issue is sporadic and not 100% reproducible.

Comment 1 Ben Nemec 2022-01-25 21:38:57 UTC
The only thing I can think of that would cause this is if our haproxy health check got load-balanced to the stopped node multiple times, so it thinks that haproxy has failed. We may need to look at the timing of the failover vs. the health checks to make sure that if a backend drops out we will detect it before we run enough health checks to trigger a failover.
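
To picture the timing concern, a toy Python model (the threshold, probe count, and backend names are assumptions, not the real keepalived/haproxy configuration): if the health probes are themselves forwarded through the load balancer and several consecutive probes land on the stopped backend before it is marked down, the fall threshold is reached and the VIP fails over even though the local haproxy never failed.

    # Toy model of the suspected race, not the real configuration.
    import random

    FALL_THRESHOLD = 3                       # assumed consecutive failures before failover
    BACKENDS_UP = ["master-0-0", "master-0-2"]
    BACKENDS_DOWN = ["master-0-1"]           # kube-api stopped here

    def probe() -> bool:
        # Until the dead backend is marked down, a probe may be forwarded to it.
        target = random.choice(BACKENDS_UP + BACKENDS_DOWN)
        return target in BACKENDS_UP

    failures = 0
    for _ in range(10):
        failures = 0 if probe() else failures + 1
        if failures >= FALL_THRESHOLD:
            print("VIP failover triggered before the dead backend was removed")
            break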

Comment 4 Victor Voronkov 2022-03-27 15:23:22 UTC
We continue to hit this issue approximately once a week in different CI profiles:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/2301//testReport/junit/deployment.networking/test_lb_availability/test_lb_availability_when_non_api_vip_shut_down/
>                   raise AssertionError(f"API_VIP should not {to_from_msg} if haproxy pod current id "
                                         f"{current_haproxy_pod_id} is the same as original pod id "
                                         f"{original_haproxy_pod_id}")
E                   AssertionError: API_VIP should not switch from master-0-1 to master-0-0 if haproxy pod current id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4'] is the same as original pod id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4']
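
For context, the assertion above roughly corresponds to a check of the following shape; the ssh/crictl lookup and helper names are a hypothetical reconstruction, not the actual test-suite code:

    # Hypothetical reconstruction of the check behind the assertion above;
    # the run_on_node()/crictl usage is an assumption, not the real test API.
    import subprocess

    def run_on_node(node: str, cmd: str) -> str:
        out = subprocess.run(["ssh", "-o", "StrictHostKeyChecking=no",
                              f"core@{node}", cmd],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def haproxy_container_ids(node: str) -> list:
        # Ask cri-o on the node for the running haproxy container ID(s).
        return run_on_node(node, "sudo crictl ps --name haproxy -q").splitlines()

    original_node = "master-0-1"                        # VIP holder before the test
    original_ids = haproxy_container_ids(original_node)
    # ... kube-api stopped on another master, VIP holder re-checked ...
    current_node = "master-0-0"                         # VIP holder after the test
    current_ids = haproxy_container_ids(original_node)
    if current_ids == original_ids:
        # haproxy on the original holder never restarted, so the VIP had no
        # legitimate reason to move.
        assert current_node == original_node, (
            f"API_VIP should not switch from {original_node} to {current_node} "
            f"if haproxy pod current id {current_ids} is the same as original "
            f"pod id {original_ids}")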

Comment 5 Ben Nemec 2022-03-28 14:48:50 UTC
We still need logs from a failing environment in order to debug this.  The fact that it only happens about once a week means this must be a fairly uncommon edge case and it's unlikely we're going to be able to guess what the problem is.

Comment 9 Victor Voronkov 2022-06-15 09:53:07 UTC
Verified by a number of CI runs; the issue no longer occurs.

Comment 11 errata-xmlrpc 2022-08-10 10:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069