Bug 1953597

Summary: pacemaker_remoted shows "Error in the push function" if more than one resource is assigned to remote guest KVM VM
Product: Red Hat Enterprise Linux 7
Reporter: Juergen Schleich <jenginfo>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: unspecified
Version: 7.9
CC: admin, cluster-maint, sbradley
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2022-04-19 16:20:23 UTC
Type: Bug

Description Juergen Schleich 2021-04-26 13:18:23 UTC
Description of problem:
In a Pacemaker cluster with a remote guest node, pacemaker_remoted logs the following error messages in /var/log/messages on the guest node when the VirtualDomain resource is moved to another physical node:
pacemaker_remoted[992]:   error: Connection terminated: Error in the push function.
pacemaker_remoted[992]:   error: Connection terminated: The specified session has been invalidated for some reason.
pacemaker_remoted[992]:   error: Could not send remote message: Software caused connection abort

The issue occurs when two resources are assigned to the Pacemaker remote guest node and the VirtualDomain resource is live migrated to the other node.
The issue does not occur when only one resource is assigned to the remote guest node.


Version-Release number of selected component (if applicable):
physical nodes:
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.9 (Maipo)
# uname -a
Linux pnode03 3.10.0-1160.11.1.el7.x86_64 #1 SMP Tue Dec 15 11:58:45 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
# pacemakerd -$
Pacemaker 1.1.23-1.0.1.el7

remote guest node:
# uname -a
Linux vmguestremote5 3.10.0-1160.el7.x86_64 #1 SMP Thu Oct 1 17:21:35 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
# pacemakerd -$
Pacemaker 1.1.23-1.0.1.el7


How reproducible:
Always.

Simply create a Pacemaker remote guest node that is controlled by the VirtualDomain resource agent.
Then assign two resources to the remote guest node.
Afterwards, move (live migrate) the VirtualDomain resource that hosts the two resources.


Steps to Reproduce:
1. Create a KVM VM which can be used for live migration in pacemaker
2. Create the VirtualDomain resource for this KVM VM
   e.g:
   # pcs resource create vmguestremote5-rs VirtualDomain hypervisor="qemu:///system" config="/etc/pacemaker/vmguestremote5.xml" migration_transport=ssh meta allow-migrate="true" priority="100"
3. Add the VM as remote guest node to pacemaker
   [vmguestremote5]# yum install pacemaker-remote resource-agents pcs
   [vmguestremote5]# systemctl enable pcsd
   [vmguestremote5]# systemctl start pcsd 
   [vmguestremote5]# systemctl start pacemaker_remote
   [vmguestremote5]# systemctl enable pacemaker_remote
   [vmguestremote5]# passwd hacluster
   [pnode03]# pcs cluster auth vmguestremote5 -u hacluster
   [pnode03]# pcs cluster node add-guest vmguestremote5 vmguestremote5-rs
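    Before starting the reproduction, the guest node integration can be
    verified; a minimal check (assuming the node and resource names above):
    [pnode03]# pcs status
    vmguestremote5 should appear as an online guest node, with
    vmguestremote5-rs Started on one of the physical nodes.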
    optional, test live migration


Actual results:
4. Now starting with error reproduction:
   a) Start nfsserver and create directories in the remote guest node
      # systemctl start nfsserver
      # mkdir /export/data1
      # mkdir /export/data2
   b) Create two exportfs resources
      # pcs resource create nfsdata1 ocf:heartbeat:exportfs clientspec="*/24" options=rw,sync,no_root_squash directory=/export/data1 fsid=1
      # pcs constraint location nfsdata1 prefers vmguestremote5
      # pcs resource create nfsdata2 ocf:heartbeat:exportfs clientspec="*/24" options=rw,sync,no_root_squash directory=/export/data2 fsid=2
      # pcs constraint location nfsdata2 prefers vmguestremote5
   c) Do a live migration of the VirtualDomain resource 
      # pcs resource move vmguestremote5-rs pnode03

   Monitor the live migration with crm_mon and 
   in remote guest node the /var/log/messages file:

   In crm_mon you will see:
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): Started pnode04
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): FAILED pnode04   <<<<<<<<<<<
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): Started pnode03
   instead of:
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): Started pnode04
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): Migrating pnode04
   vmguestremote5-rs      (ocf::heartbeat:VirtualDomain): Started pnode03

   /var/log/messages file will show:
   Apr 23 12:41:33 vmguestremote5 pacemaker_remoted[1553]:   error: Connection terminated: Error in the push function.
   Apr 23 12:41:33 vmguestremote5 pacemaker_remoted[1553]:   error: Connection terminated: The specified session has been invalidated for some reason.
   Apr 23 12:41:33 vmguestremote5 pacemaker_remoted[1553]:   error: Could not send remote message: Software caused connection abort
   Apr 23 12:41:33 vmguestremote5 pacemaker_remoted[1553]: warning: Could not notify client remote-lrmd-vmguestremote5:3121/52219315-8c3a-4488-bb3e-e7e0db17472a: Software caused connection abort
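   To watch the failure as it happens, run crm_mon on one of the cluster
   nodes while following the log on the guest (standard commands; host
   names as in the steps above):
   [pnode03]# crm_mon
   [vmguestremote5]# tail -f /var/log/messages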


Expected results:
No error messages during a live migration,
and no "FAILED" state for the VirtualDomain resource in crm_mon output during a live migration.


Additional info:
A) The same issue occurs when using the ocf:heartbeat:Dummy resource agent.
B) The same issue occurs if you execute the command
     [vmguestremote5]# pcs resource move vsmgc5k-rs pnode03
   on the remote guest node itself. In this case only one resource needs to be configured
   to reproduce the error messages in /var/log/messages and the FAILED state in crm_mon.
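   For point A, a minimal reproduction with the Dummy agent could look like
   the following (the resource names dummy1/dummy2 are placeholders, not
   from the original report; node and resource names as in the steps above):
     [pnode03]# pcs resource create dummy1 ocf:heartbeat:Dummy
     [pnode03]# pcs constraint location dummy1 prefers vmguestremote5
     [pnode03]# pcs resource create dummy2 ocf:heartbeat:Dummy
     [pnode03]# pcs constraint location dummy2 prefers vmguestremote5
     [pnode03]# pcs resource move vmguestremote5-rs pnode03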

Comment 2 Ken Gaillot 2021-04-26 21:15:24 UTC
Hi,

Would it be possible for you to open a support case first? I would like to rule out other possible causes before focusing on pacemaker. There are multiple components involved in this issue, and support has better capabilities for narrowing that down.

You can initiate a case with Red Hat's Global Support Services group through one of the methods listed at the following link:

  https://access.redhat.com/start/how-to-engage-red-hat-support

Comment 3 Juergen Schleich 2021-04-28 14:11:05 UTC
Hi, thanks for the update. 
This issue happened in an early state of a project. I can open the SR in a couple of weeks when the contracts are in place...

Comment 4 Ken Gaillot 2021-04-29 20:02:39 UTC
(In reply to Juergen Schleich from comment #3)
> Hi, thanks for the update. 
> This issue happened in an early state of a project. I can open the SR in a
> couple of weeks when the contracts are in place...

Sounds good

Comment 6 Ken Gaillot 2022-04-19 16:20:23 UTC
If this is determined to be an issue in Pacemaker, we can reopen