Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2057689

Summary: [OSP16.2] Since upgrading to 16.2.1 (from 16.1.5), Pacemaker resource monitors fail randomly on all controller nodes
Product: Red Hat OpenStack
Component: openvswitch
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: ggrimaux
Assignee: Luca Miccini <lmiccini>
QA Contact: Eran Kuris <ekuris>
CC: apevec, bwelterl, chrisw, dhill, fleitner, jveiraca, lmiccini, mlavalle, mpattric
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Last Closed: 2023-10-23 14:58:44 UTC
Type: Bug

Description ggrimaux 2022-02-23 20:38:25 UTC
Description of problem:
The client recently did a minor update from 16.1.5 to 16.2.1.
Since then, they have noticed that Pacemaker monitors fail randomly, on many resources and on all 3 controller nodes.
The services themselves don't fail; only the monitor does (at least from Pacemaker's point of view).

Controller nodes are physical with 32 cores and over 380GB RAM.

Load on the servers is very low (verified with SAR data).

Example of errors:
Failed Resource Actions:
  * stonith-fence_ipmilan-34800d457d9e_start_0 on controller1 'error' (1): call=940, status='complete', exitreason='', last-rc-change='2022-02-21 11:56:49 +01:00', queued=0ms, exec=32409ms
  * galera-bundle-podman-0_monitor_60000 on controller1 'not running' (7): call=251, status='complete', exitreason='', last-rc-change='2022-02-21 09:35:03 +01:00', queued=0ms, exec=0ms
  * ovn-dbs-bundle-podman-2_monitor_60000 on controller2 'not running' (7): call=739, status='complete', exitreason='', last-rc-change='2022-02-19 01:29:00 +01:00', queued=0ms, exec=0ms
  * openstack-manila-share-podman-0_monitor_60000 on controller2 'not running' (7): call=743, status='complete', exitreason='', last-rc-change='2022-02-20 01:18:04 +01:00', queued=0ms, exec=0ms
  * galera-bundle-podman-1_monitor_60000 on controller2 'not running' (7): call=735, status='complete', exitreason='', last-rc-change='2022-02-20 03:15:33 +01:00', queued=0ms, exec=0ms
  * rabbitmq-bundle-podman-2_monitor_60000 on controller3 'not running' (7): call=281, status='complete', exitreason='', last-rc-change='2022-02-21 12:26:42 +01:00', queued=0ms, exec=0ms

We can't tell where the problem is.

Here's an example:
Feb 19 01:29:00 controller2 pacemaker-controld[4084]: notice: Result of monitor operation for ovn-dbs-bundle-podman-2 on controller2: not running
Feb 19 01:29:00 controller2 pacemaker-attrd[4081]: notice: Setting fail-count-ovn-dbs-bundle-podman-2#monitor_60000[controller2]: (unset) -> 1
Feb 19 01:29:00 controller2 pacemaker-attrd[4081]: notice: Setting last-failure-ovn-dbs-bundle-podman-2#monitor_60000[controller2]: (unset) -> 1645230540
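For context, a simplified sketch of what that fail-count attribute feeds into: Pacemaker compares the per-node fail count against the resource's migration-threshold to decide whether to ban the resource from the node. This is an illustrative model only, not the actual scheduler logic; `should_migrate` is a hypothetical helper name.

```shell
# Hypothetical illustration of how the fail-count set above is consumed:
# once fail-count >= migration-threshold, Pacemaker bans the resource from
# that node (simplified; the real scheduler also weighs failure-timeout,
# resource-stickiness, and location constraints).
should_migrate() {
  failcount=$1
  threshold=$2
  if [ "$failcount" -ge "$threshold" ]; then
    echo "yes"   # resource would be moved off this node
  else
    echo "no"
  fi
}

# e.g. should_migrate <current fail-count> <migration-threshold>
```

With the default migration-threshold (INFINITY), a single monitor failure like the one above only records the failure and reruns recovery in place; it does not move the resource.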

Yet one minute earlier, we see podman executing something inside this same container:
Feb 19 01:27:59 controller2 podman[884934]: 2022-02-19 01:27:59.939433496 +0100 CET m=+0.188883340 container exec c2b798e3a3bcf7fd7f434ca4b454ab5abad20d9b98be3639e3697ae0eb0b2dad (image=cluster.common.tag/openstack-ovn-northd:pcmklatest, name=ovn-dbs-bundle-podman-2, build-date=2021-12-08T15:50:59.425286, io.k8s.description=Red Hat OpenStack Platform 16.2 ovn-northd, maintainer=OpenStack TripleO team, summary=Red Hat OpenStack Platform 16.2 ovn-northd, com.redhat.license_terms=https://www.redhat.com/agreements, description=Red Hat OpenStack Platform 16.2 ovn-northd, batch=16.2_20211202.1, distribution-scope=public, io.openshift.tags=rhosp osp openstack osp-16.2, vcs-ref=65751562764cdcf9b93bde8451dd0702dc2d1d7f, io.openshift.expose-services=, tcib_managed=true, vendor=Red Hat, Inc., com.redhat.build-host=cpt-1001.osbs.prod.upshift.rdu2.redhat.com, version=16.2.1, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhosp16/openstack-ovn-northd/images/16.2.1-9, architecture=x86_64, vcs-type=git, com.redhat.component=openstack-ovn-northd-container, release=9, io.k8s.display-name=Red Hat OpenStack Platform 16.2 ovn-northd, name=rhosp16/openstack-ovn-northd)

So how can it be dead, from Pacemaker's point of view, one minute later?!
I troubleshot this with the clusterha people, and the result does say "not running", so from their point of view, when Pacemaker ran the monitor, the resource wasn't running.
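Roughly speaking, the monitor's verdict comes down to whether the container reports itself as running. A simplified sketch of that decision (the real podman resource agent performs additional checks; `container_status` is a hypothetical helper, and the exit-code mapping matches the OCF codes seen in the failed actions above):

```shell
# Hypothetical sketch of the liveness decision behind "not running" (7):
container_status() {
  # $1 is the output of:
  #   podman inspect --format '{{.State.Running}}' <container-name>
  if [ "$1" = "true" ]; then
    echo "running"       # maps to OCF_SUCCESS (0)
  else
    echo "not running"   # maps to OCF_NOT_RUNNING (7)
  fi
}

# Usage on a controller, with the container from the failed action above:
# container_status "$(podman inspect --format '{{.State.Running}}' ovn-dbs-bundle-podman-2)"
```

So a transient moment where `podman inspect` reports `Running` as anything other than `true` (or fails outright) is enough for Pacemaker to record the resource as not running.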


Here is something we see a lot in /var/log/messages:
Feb 21 04:19:13 controller2 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
Feb 21 04:19:15 controller2 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
Feb 21 04:19:18 controller2 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action

If I zgrep for this across the sosreports from all 3 controller nodes, I get close to 2 million hits:
zgrep "drop recirc action" 00[123]*/*/var/log/messages*|wc -l
1945780
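A per-node breakdown can make the distribution clearer. A small sketch, assuming the same sosreport directory layout as the zgrep above (`count_recirc` is a hypothetical helper; zgrep handles both plain and rotated .gz messages files):

```shell
# Hypothetical helper: count "drop recirc action" hits in all messages
# files under one sosreport directory (plain or gzip-rotated).
count_recirc() {
  zgrep -h "drop recirc action" "$1"/*/var/log/messages* 2>/dev/null | wc -l
}

# Usage: one total per controller sosreport (directory names illustrative):
# for d in 001* 002* 003*; do echo "$d: $(count_recirc "$d")"; done
```

If one controller accounts for the bulk of the ~2 million hits, that would be a useful hint about where to look next.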

Is this the root cause? I don't know, but I'd like to rule it out.

Then we can move on to the next suspect (Pacemaker? The containers?).


If you need anything else, please let me know.


Version-Release number of selected component (if applicable):
OSP 16.2.1

How reproducible:
Happens right now

Steps to Reproduce:
1. Just wait

Actual results:
Pacemaker resource monitors are failing

Expected results:
Pacemaker resource monitors don't fail

Additional info:
We have several sosreports from the controller nodes.

Comment 3 ggrimaux 2022-02-28 17:11:20 UTC
*** Bug 2057688 has been marked as a duplicate of this bug. ***