Created attachment 1854634 [details]
test execution log with stack trace

Description of problem:
Platform - Metal

How reproducible:
Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/291/
2. Profile: OCP 4.10 - vm-disconnected-ipv4v6_ctlplane-ipv6_provisioning-ipsec-no_fips

Steps to Reproduce:
1. Identify the master that holds the API_VIP (master-0-2).
2. Pick another master and stop the kube-api container on it (master-0-1).
3. Poll the API_VIP hostname for 30 seconds (every 10 seconds).

Actual results:
The API_VIP switched from master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com.

Expected results:
After 30 seconds the API_VIP should stay on the same node (master-0-2).

Additional info:
The API_VIP should move only if the HAProxy service failed and cannot load-balance the traffic. The issue is sporadic and not 100% reproducible.
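A minimal sketch of the reproduction loop in steps 1-3, assuming ssh access as core@&lt;node&gt;, that crictl manages the kube-apiserver container, and that the VIP resolves via the api hostname; the hostname, node names, and helper function are assumptions for illustration, not the actual test code:

import subprocess
import time

API_VIP_HOSTNAME = "api.ocp-edge-cluster-0.qe.lab.redhat.com"   # assumed VIP hostname
MASTERS = ["master-0-0", "master-0-1", "master-0-2"]            # node names from the report

def vip_holder():
    # Resolve the VIP address and find which master currently has it assigned.
    vip = subprocess.run(["getent", "hosts", API_VIP_HOSTNAME],
                         capture_output=True, text=True).stdout.split()[0]
    for node in MASTERS:
        addrs = subprocess.run(["ssh", f"core@{node}", "ip -o addr"],
                               capture_output=True, text=True).stdout
        if vip in addrs:
            return node
    return None

original = vip_holder()                                  # step 1: e.g. master-0-2
victim = next(m for m in MASTERS if m != original)       # step 2: pick another master
subprocess.run(["ssh", f"core@{victim}",                 # stop its kube-apiserver container
                "sudo crictl stop $(sudo crictl ps --name kube-apiserver -q)"])

for _ in range(3):                                       # step 3: poll for 30s, every 10s
    time.sleep(10)
    current = vip_holder()
    assert current == original, f"API_VIP moved from {original} to {current}"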
The only thing I can think of that would cause this is if our haproxy health check got load-balanced to the stopped node multiple times, so it thinks that haproxy has failed. We may need to look at the timing of the failover vs. the health checks to make sure that if a backend drops out we detect it before we run enough failed health checks to trigger a failover.
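A back-of-the-envelope sketch of the race being described; every interval and threshold below is an assumption for illustration, not the cluster's actual keepalived/haproxy settings:

# All numbers are assumptions, not the real keepalived/haproxy configuration.
CHECK_INTERVAL = 1.0    # seconds between keepalived API health checks (assumed)
FALL_COUNT = 3          # consecutive failed checks before the VIP moves (assumed)
BACKEND_CHECK = 2.0     # haproxy backend check interval (assumed)
BACKEND_FALL = 2        # failed backend checks before haproxy drops the server (assumed)

time_to_failover = CHECK_INTERVAL * FALL_COUNT       # keepalived gives up after ~3s
time_to_drop_backend = BACKEND_CHECK * BACKEND_FALL  # haproxy keeps routing to the dead
                                                     # apiserver for ~4s

# If haproxy can keep load-balancing health checks to the stopped backend for longer
# than keepalived tolerates failures, an unlucky run of checks moves the VIP even
# though haproxy itself is healthy.
if time_to_drop_backend > time_to_failover:
    print("race window: a spurious failover is possible")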
We continue to hit this issue approximately once a week in different CI profiles:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-edge-auto-tests/2301//testReport/junit/deployment.networking/test_lb_availability/test_lb_availability_when_non_api_vip_shut_down/

>       raise AssertionError(f"API_VIP should not {to_from_msg} if haproxy pod current id "
                             f"{current_haproxy_pod_id} is the same as original pod id "
                             f"{original_haproxy_pod_id}")
E       AssertionError: API_VIP should not switch from master-0-1 to master-0-0 if haproxy pod current id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4'] is the same as original pod id ['4ef3c910024eccd5eaea9ef98fb2028635557377728fe3bd78867e0133e592d4']
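A hedged sketch of the guard the failing assertion comes from (helper names are hypothetical, not the actual ocp-edge-auto-tests code): the VIP is only allowed to move if the haproxy container on the original holder was restarted, i.e. its container ID changed.

import subprocess

def haproxy_pod_id(node):
    # Container ID of the haproxy pod on the given master (assumed ssh + crictl access).
    out = subprocess.run(["ssh", f"core@{node}", "sudo crictl ps --name haproxy -q"],
                         capture_output=True, text=True)
    return out.stdout.split()

original_holder = "master-0-1"   # VIP holder before the disruption (from the report)
current_holder = "master-0-0"    # VIP holder observed after 30s of polling (from the report)

original_haproxy_pod_id = haproxy_pod_id(original_holder)
# ... kube-apiserver on another master is stopped here, then 30 seconds of polling ...
current_haproxy_pod_id = haproxy_pod_id(original_holder)

if current_holder != original_holder:
    # A VIP move is only legitimate if haproxy itself failed or restarted,
    # i.e. its container ID on the original holder changed.
    assert current_haproxy_pod_id != original_haproxy_pod_id, (
        f"API_VIP should not switch from {original_holder} to {current_holder} "
        f"if haproxy pod current id {current_haproxy_pod_id} is the same as "
        f"original pod id {original_haproxy_pod_id}")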
We still need logs from a failing environment in order to debug this. The fact that it only happens about once a week means this must be a fairly uncommon edge case and it's unlikely we're going to be able to guess what the problem is.
Verified by a number of CI runs; the issue does not happen anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069