Bug 1986757

Summary: Keepalived fails with Liveness probe failed: command timed out
Product: OpenShift Container Platform Reporter: Sonigra Saurab <ssonigra>
Component: Networking Assignee: Ben Nemec <bnemec>
Networking sub component: runtime-cfg QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: ancollin, bnemec, bperkins, jrosenta, mharri, mmarkand, swasthan, yboaron
Version: 4.7 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:42:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1999531    

Description Sonigra Saurab 2021-07-28 09:47:49 UTC
Description of problem:

All the Keepalived pods fail with the Liveness probe error.

openshift-openstack-infra                         1h9m       Warning  Unhealthy                pod/keepalived-ocp-xlkqn-master-0                                    Liveness probe failed: command timed out
openshift-openstack-infra                         1h26m      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-master-1                                    Liveness probe failed: command timed out
openshift-openstack-infra                         6m24s      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-master-2                                    Liveness probe failed: command timed out
openshift-openstack-infra                         29m        Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-5zz4m                              Liveness probe failed: command timed out
openshift-openstack-infra                         19m        Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-9c65w                              Liveness probe failed: command timed out
openshift-openstack-infra                         4h47m      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-cq99b                              Liveness probe failed: command timed out
openshift-openstack-infra                         2m57s      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-fjlfd                              Liveness probe failed: command timed out
openshift-openstack-infra                         9m8s       Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-ksmp9                              Liveness probe failed: command timed out
openshift-openstack-infra                         1h40m      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-lhnkw                              Liveness probe failed: command timed out
openshift-openstack-infra                         14m        Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-ttt4m                              Liveness probe failed: command timed out
openshift-openstack-infra                         4h14m      Warning  Unhealthy                pod/keepalived-ocp-xlkqn-worker-0-zl8ff                              Liveness probe failed: command timed out
openshift-openstack-infra                         5h7m       Warning  Unhealthy                pod/mdns-publisher-ocp-xlkqn-worker-0-zl8ff                          Liveness probe failed: command timed out
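
A listing like the one above can be summarized per pod with a short script. This is a minimal sketch, not part of the original report: the column names are an assumption based on the usual `oc get events` layout (the pasted listing has no header row), `parse_event` and `liveness_failures` are hypothetical helper names, and the sample lines are copied from the listing with whitespace collapsed.

```python
# Column names are assumed from the usual `oc get events` layout;
# the listing in this report has no header row.
COLUMNS = ("namespace", "last_seen", "type", "reason", "object", "message")

# Three lines copied from the report (whitespace collapsed).
SAMPLE = """\
openshift-openstack-infra  1h9m   Warning  Unhealthy  pod/keepalived-ocp-xlkqn-master-0  Liveness probe failed: command timed out
openshift-openstack-infra  4h14m  Warning  Unhealthy  pod/keepalived-ocp-xlkqn-worker-0-zl8ff  Liveness probe failed: command timed out
openshift-openstack-infra  5h7m   Warning  Unhealthy  pod/mdns-publisher-ocp-xlkqn-worker-0-zl8ff  Liveness probe failed: command timed out
"""


def parse_event(line):
    """Split one event line into a dict, or return None if it has too few columns."""
    # The message is the last column and contains spaces, so cap the split count.
    parts = line.split(None, len(COLUMNS) - 1)
    if len(parts) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, parts))


def liveness_failures(text):
    """Map pod name -> last-seen age for every liveness probe failure event."""
    failures = {}
    for line in text.splitlines():
        event = parse_event(line)
        if event is None:
            continue
        if event["reason"] != "Unhealthy":
            continue
        if not event["message"].startswith("Liveness probe failed"):
            continue
        kind, _, name = event["object"].partition("/")
        if kind == "pod":
            failures[name] = event["last_seen"]
    return failures


if __name__ == "__main__":
    for pod, age in sorted(liveness_failures(SAMPLE).items()):
        print(f"{pod}: last seen {age} ago")
```

Grouping by pod makes it easier to see that the failures are spread across all masters and workers rather than confined to one node.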

How reproducible:

Install a 4.7 cluster on OpenStack using IPI.

Steps to Reproduce:

Install a 4.7 cluster on OpenStack using IPI.

Actual results:

Alerts are triggered with "Liveness probe failed" for the keepalived pods, but no actual error is seen in the cluster.

Expected results:

Alerts should not be triggered if there is no issue.
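
For context, the message "command timed out" reads as a probe command that did not finish within its timeout, which is different from a command that ran and reported a failure. The sketch below illustrates that distinction only. It is not the keepalived probe and not the kubelet's implementation; `run_probe` and the commands passed to it are hypothetical.

```python
import subprocess
import sys


def run_probe(command, timeout_seconds):
    """Run a probe command and return (healthy, detail).

    A timeout is reported separately from a non-zero exit code, so a slow
    but otherwise working check is distinguishable from a failing one.
    """
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The command was still running when the timeout expired.
        return False, "command timed out"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"


if __name__ == "__main__":
    # Hypothetical commands: one that sleeps past the timeout, one that exits at once.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    fast = [sys.executable, "-c", "pass"]
    print(run_probe(slow, timeout_seconds=0.5))
    print(run_probe(fast, timeout_seconds=10))
```

Under this reading, a probe can fail on timeout while the service it checks is healthy, which would match alerts firing with no visible error in the cluster.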

Additional info:

Comment 3 Ben Nemec 2021-08-03 22:21:35 UTC
Hmm, interesting. Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1949664, but not quite the same thing. I've gone ahead and backported that fix to 4.7, but I think this one may require a different fix because I still see the same behavior reported here on my local 4.8 cluster.

Comment 17 errata-xmlrpc 2021-10-18 17:42:49 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759