Bug 2232244
| Summary: | Setting priority-fencing-delay results in delay in failover of resource in SAP ENSA2 configuration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | dennispadia <depadia> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.8 | CC: | cfeist, cluster-maint, kris.shawcross, ksatarin, radeltch |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-06 20:31:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (dennispadia, 2023-08-15 21:59:08 UTC)
Comment 1, dennispadia:

You can find the SOS report for both VMs in https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS

Comment 2, Ken Gaillot:

(In reply to dennispadia from comment #1)
> You can find SOS report for both VMs in
> https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS

Hi,

I don't see any files there. Do I need a special login or permissions?

Comment 3, dennispadia:

(In reply to Ken Gaillot from comment #2)
> I don't see any files there. Do I need a special login or permissions?

Hi,

It doesn't require a login; can you check with this link: https://1drv.ms/f/s!Aqy5CvX16plyvziczbnSgnxXM7K2?e=WGi6c2

Regards,
Dennis Padia

Comment 4, Ken Gaillot:

(In reply to dennispadia from comment #0)
> - As priority=10 is set on rsc_sap_RH8_ASCS00, the fence agent introduces a
>   delay of 15s, since priority-fencing-delay is configured as 15s.
>
>   Aug 15 20:13:42 rh8scs01l795 pacemaker-fenced[1621]: notice: Delaying
>   'reboot' action targeting rh8scs00l795 using rsc_app_azr_agt for 15s
>
> - After the delay of 15s, the cluster is expected to perform some action on
>   the resource. Instead, the cluster again checks the totem and does not
>   perform any action on the ASCS resource (which is in a stopped state at
>   this time).
>
>   Aug 15 20:14:25 rh8scs01l795 corosync[1479]: [TOTEM ] Token has not been
>   received in 22500 ms
>   Aug 15 20:14:32 rh8scs01l795 corosync[1479]: [TOTEM ] A processor failed,
>   forming new configuration: token timed out (30000ms), waiting 36000ms for
>   consensus.

Hi,

What happened here is that the node came back up and rejoined the cluster at 20:13:50, while fencing was still waiting out the delay. The Corosync membership was re-formed before the fencing succeeded at 20:14:18, so Corosync saw the fencing as a new node-loss event (the token timeouts quoted above). Coincidentally, the node was fenced a second time at the same moment the surviving node was updating the CIB status for its rejoin.
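For context, the behavior discussed in this report is driven by two settings: a fencing delay proportional to resource priority, and the priority on the resource itself. A minimal sketch of how such a configuration is typically created with the pcs CLI, using the resource name and values from the report (the commands assume a running Pacemaker cluster managed by pcs):

```shell
# Give fencing of the node running the higher-priority resource a head start
# in a fence race (value taken from this report)
pcs property set priority-fencing-delay=15s

# Mark the ASCS resource as higher priority so its node wins the fence race
# (resource name taken from this report)
pcs resource meta rsc_sap_RH8_ASCS00 priority=10
```

In a two-node ENSA2 setup, this makes the node hosting the ASCS instance survive a split-brain fence race, at the cost of the 15s delay before the other node is fenced.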
That caused the status updates to fail, since Corosync couldn't deliver the request to the other node:

Aug 15 20:14:36 rh8scs01l795 pacemaker-controld[993208]: error: Node update 89 failed: Timer expired (-62)
Aug 15 20:14:36 rh8scs01l795 pacemaker-controld[993208]: error: Node update 90 failed: Timer expired (-62)

That caused the controller on the surviving node to exit, and pacemakerd to respawn it. That is correct recovery, but because the other node was still down when the controller respawned, an election failure followed after 20 seconds, resulting in another controller exit and respawn. By that time the other node was back, so everything proceeded normally. This is just unfortunate timing, but it ultimately results in correct (if slow) recovery.

Some users get around this by putting a delay (longer than the maximum time fencing is expected to take) in each node's boot sequence before starting Pacemaker. That ensures that if a node leaves and rejoins, any fencing scheduled for it can complete before it tries to rejoin the cluster.
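One common way to implement the boot-sequence delay described above is a systemd drop-in for the pacemaker service. This is a sketch, not an officially documented procedure; the 90-second value is an assumption and should be replaced with something longer than your worst-case fencing time:

```shell
# Hypothetical example: delay Pacemaker startup at boot so any pending
# fencing of this node can complete before it rejoins the cluster.
mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/boot-delay.conf <<'EOF'
[Service]
# Sleep before pacemakerd starts; choose a value longer than the
# maximum time fencing is expected to take in this cluster (assumed 90s)
ExecStartPre=/usr/bin/sleep 90
EOF
systemctl daemon-reload
```

The delay only matters when Pacemaker is enabled to start at boot; if the cluster is started manually after a node comes back, an administrator can simply wait for fencing to complete instead.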