1949664 – Spurious keepalived liveness probe failures

Summary: Spurious keepalived liveness probe failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Ben Nemec
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1989751

Reported:	2021-04-14 18:15 UTC by Ben Nemec
Modified:	2021-08-04 15:38 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:00:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2528	0	None	open	Bug 1949664: [on-prem] Disable liveness probe until keepalived.conf exists	2021-04-14 21:54:43 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:01:03 UTC

Description Ben Nemec 2021-04-14 18:15:22 UTC

Description of problem: The keepalived liveness probe frequently fails during deployment with "/bin/bash: line 0: kill: `': not a pid or valid job spec". There's no indication in the logs that anything is actually wrong with keepalived, so I suspect there might be an issue with the liveness probe itself.


Version-Release number of selected component (if applicable): 4.8


How reproducible: Seems to be intermittent


Steps to Reproduce:
1. Deploy using dev-scripts
2. Check the journal on one of the nodes. Sometimes the message above will be present and the keepalived container will have been restarted.

Actual results: Liveness probe errors and unexpected keepalived restarts.


Expected results: Keepalived starts and runs normally.


Additional info: My working theory right now is that passing the pgrep output to kill in [0] occasionally gets tripped up. I want to try using pkill directly so the PID no longer has to be passed through shell command substitution.

0: https://github.com/openshift/machine-config-operator/blob/3fe1270f5d11040119b1f977d6a5604b4e9a80a2/templates/common/on-prem/files/keepalived.yaml#L128
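For reference, the error quoted in the description is what bash's kill builtin reports when it is handed an empty string, which is what a quoted pgrep substitution yields when no process matches. The following is only a sketch of that mechanism, not the probe from [0]: the function name and the process name nosuchproc_xyz are made up for illustration, and it assumes the PID is passed as a quoted argument.

```shell
#!/bin/bash
# Sketch only: reproduce the kill error with an empty pgrep result.
# "signal_oldest" and "nosuchproc_xyz" are hypothetical names.

signal_oldest() {
    local name="$1" pid
    # pgrep prints nothing (and exits non-zero) when no process matches.
    pid="$(pgrep -o "$name" 2>/dev/null)"
    # With no match this runs: kill -s USR1 ""
    kill -s USR1 "$pid"
}

# Capture stderr and the exit status of the failed kill.
err="$(signal_oldest nosuchproc_xyz 2>&1)"
status=$?

echo "exit status: $status"
echo "stderr: $err"
```

Under bash the captured stderr should contain "not a pid or valid job spec" and the exit status is non-zero. Switching to pkill would remove the substitution, but pkill also exits non-zero when nothing matches, so it would not help if the process simply is not running yet.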

Comment 1 Ben Nemec 2021-04-14 21:53:18 UTC

I was mistaken about the problem here. It's just that it takes too long to populate keepalived.conf on the first start. I have a patch proposed to fix that.
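One way to picture the fix described in the linked pull request title ("Disable liveness probe until keepalived.conf exists") is a guard in front of the real check. This is a rough sketch under assumptions, not the actual patch: the probe function, the temporary path, and the use of true/false as stand-ins for the real liveness check are all hypothetical, and testing for a non-empty file (-s) rather than mere existence is my own choice.

```shell
#!/bin/bash
# Sketch only: skip the liveness check until the config file is populated.
# "probe" and the paths below are hypothetical, not taken from the patch.

probe() {
    local conf="$1"; shift
    # No populated config yet: report success instead of failing the probe.
    [ -s "$conf" ] || return 0
    # Config is present: run the real check passed as the remaining arguments.
    "$@"
}

tmp="$(mktemp -d)"
conf="$tmp/keepalived.conf"

probe "$conf" false; before=$?     # config missing, check is skipped
echo "placeholder config" > "$conf"
probe "$conf" false; after=$?      # config present, failing check is reported
probe "$conf" true;  healthy=$?    # config present, passing check is reported

rm -rf "$tmp"
echo "before=$before after=$after healthy=$healthy"
```

With this guard the probe only starts reporting failures once the config has been written, so a slow first render of keepalived.conf no longer triggers a restart.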

Comment 3 Nataf Sharabi 2021-06-03 10:50:19 UTC

After the fix, the bug did not reproduce.

Verifying.

Comment 7 errata-xmlrpc 2021-07-27 23:00:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
