Bug 2045559 - API_VIP moved when kube-api container on another master node was stopped
Summary: API_VIP moved when kube-api container on another master node was stopped
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: Christoph Stäbler
QA Contact: Victor Voronkov
Depends On:
Blocks: 2092948
Reported: 2022-01-25 17:14 UTC by Victor Voronkov
Modified: 2022-08-10 10:44 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-08-10 10:43:43 UTC
Target Upstream Version:

Attachments
test execution log with stack trace (18.91 KB, text/plain)
2022-01-25 17:14 UTC, Victor Voronkov

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 3158 | 0 | None | open | Bug 2045559: Increase keepalived API check fall value to 3 | 2022-05-23 19:03:46 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:44:05 UTC

Description Victor Voronkov 2022-01-25 17:14:23 UTC
Created attachment 1854634
test execution log with stack trace

Description of problem:

Platform - Metal

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/291/

2. Profile: OCP: 4:10 - vm-disconnected-ipv4v6_ctlplane-ipv6_provisioning-ipsec-no_fips

Steps to Reproduce:
1. Identify the master that holds API_VIP (master-0-2)
2. Pick another master and stop the kube-api container on it (master-0-1)
3. Poll for the API_VIP hostname for 30 seconds (once every 10 seconds)
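
The polling in step 3 can be sketched roughly as follows. This is a minimal illustration, not the actual CI test: `get_vip_holder` is a hypothetical callable that returns the hostname currently holding API_VIP, and the helper names are assumptions made for this example.

```python
import time
from typing import Callable, List


def poll_vip_holder(
    get_vip_holder: Callable[[], str],  # hypothetical: returns the current VIP holder hostname
    duration_s: float = 30.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Sample the VIP holder once per interval across the polling window."""
    samples = []
    for _ in range(int(duration_s // interval_s)):
        sleep(interval_s)
        samples.append(get_vip_holder())
    return samples


def assert_vip_stayed(original: str, samples: List[str]) -> None:
    """Raise AssertionError if any sample shows the VIP on a different node."""
    moved = [s for s in samples if s != original]
    if moved:
        raise AssertionError(
            f"API_VIP should not switch from {original} to {moved[0]}"
        )
```

With the defaults this takes three samples, 10 seconds apart, which matches the 30-second window described in the steps. Passing a no-op `sleep` makes the helper easy to exercise without waiting.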

Actual results:
API_VIP switched from master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com

Expected results:
After 30 seconds, API_VIP should still be on the same node (master-0-2)

Additional info:
API_VIP should move only if the HAProxy service has failed and cannot load-balance the traffic.
The issue is sporadic and not 100% reproducible.

Comment 1 Ben Nemec 2022-01-25 21:38:57 UTC
The only thing I can think of that would cause this is if our haproxy health check got load-balanced to the stopped node multiple times, so it thinks that haproxy has failed. We may need to look at the timing of the failover vs. the health checks to make sure that if a backend drops out, we detect that before we run enough health checks to trigger a failover.
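
To make the timing concern concrete, here is a toy model of a consecutive-failure threshold, which is how the `fall` setting named in the linked pull request title is assumed to behave. The function names and the independence assumption in the second helper are illustrative only; this is not keepalived's implementation, and real load balancing is not random per check.

```python
from typing import Iterable


def script_fails(results: Iterable[bool], fall: int) -> bool:
    """Return True if `results` ever contains `fall` consecutive failed checks.

    Each item is one health check outcome: True = passed, False = failed.
    """
    consecutive = 0
    for ok in results:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= fall:
            return True
    return False


def chance_of_consecutive_misses(p_miss: float, fall: int) -> float:
    """Probability that `fall` checks in a row all miss.

    Assumes each check independently lands on the stopped backend with
    probability `p_miss` (a simplification for illustration).
    """
    return p_miss ** fall


# Two checks in a row happen to hit the stopped backend, then it is ejected.
checks = [True, False, False, True]
print(script_fails(checks, fall=2))
print(script_fails(checks, fall=3))
```

In this model, a run of two unlucky checks trips a threshold of 2 but not a threshold of 3, so a higher threshold gives the load balancer more time to drop the stopped backend before a failover is triggered.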

Comment 4 Victor Voronkov 2022-03-27 15:23:22 UTC
We continue to hit this issue approximately once a week in different CI profiles:
>                   raise AssertionError(f"API_VIP should not {to_from_msg} if haproxy pod current id "
                                         f"{current_haproxy_pod_id} is the same as original pod id "
E                   AssertionError: API_VIP should not switch from master-0-1 to master-0-0 if haproxy pod current id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4'] is the same as original pod id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4']

Comment 5 Ben Nemec 2022-03-28 14:48:50 UTC
We still need logs from a failing environment in order to debug this.  The fact that it only happens about once a week means this must be a fairly uncommon edge case and it's unlikely we're going to be able to guess what the problem is.

Comment 9 Victor Voronkov 2022-06-15 09:53:07 UTC
Verified over a number of CI runs; the issue no longer occurs.

Comment 11 errata-xmlrpc 2022-08-10 10:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

