Bug 1505909

Summary: Problems when cleaning up unmanaged bundle resources
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.4
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Damien Ciabrini <dciabrin>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cluster-maint, michele
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-11-06 17:11:03 UTC
Type: Bug
Bug Blocks: 1494455    
Attachments: galera configuration

Description Damien Ciabrini 2017-10-24 14:32:09 UTC
Created attachment 1342787 [details]
galera configuration

Description of problem:
I see unexpected behaviour with respect to cleaning up containerized resources
when they are unmanaged in pacemaker. Namely:

A - When a resource attribute is updated while the resource is
unmanaged, and a cleanup is requested afterwards:

    pcs resource unmanage galera       
    pcs resource update galera cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3;foo=foo'
    pcs resource cleanup galera

a stop operation is scheduled on the resource and executed, even though
the resource is still unmanaged.
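
(Illustrative only: the actions the cluster intends to run can be previewed
without executing them by simulating the next transition against the live
cluster, e.g.)

    crm_simulate --simulate --live-check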

B - When a resource bundle is unmanaged, and a cleanup is requested
on the bundle:

    pcs resource unmanage galera-bundle
    pcs resource cleanup galera-bundle

the monitor operations triggered on the galera resource are not able to
probe the real state of the resource; the galera server is still running,
but the resource shows as "Stopped" in pcs status.
Even subsequent "pcs resource cleanup galera" calls do not reprobe the
state of the resource.
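
(Illustrative only: the mismatch can be seen by comparing pcs status with the
processes inside the container. The container name below follows pacemaker's
<bundle>-docker-<N> naming; the replica index may differ per node.)

    pcs status | grep -A2 galera-bundle                # replica reported as Stopped
    docker top galera-bundle-docker-0 | grep mysqld    # mysqld is actually still running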

Version-Release number of selected component (if applicable):
1.1.16-12.el7_4.4

How reproducible:
Always

Steps to Reproduce:
For testing I used a containerized galera resource:

On a three-node cluster (ra1, ra2, ra3):

1. pull the container image on all nodes
  docker pull docker.io/tripleoupstream/centos-binary-mariadb:latest

2. prepare the hosts
  # install the attached galera.cnf in /etc/my.cnf.d/galera.cnf and adapt the host names

3. create a bundle
  pcs resource bundle create galera-bundle container docker image=docker.io/tripleoupstream/centos-binary-mariadb:latest replicas=3 masters=3 network=host options="--user=root --log-driver=journald" run-command="/usr/sbin/pacemaker_remoted" network control-port=3123 storage-map id=map0 source-dir=/dev/log target-dir=/dev/log storage-map id=map1 source-dir=/dev/zero target-dir=/etc/libqb/force-filesystem-sockets options=ro storage-map id=map2 source-dir=/etc/my.cnf.d/galera.cnf target-dir=/etc/my.cnf.d/galera.cnf options=ro storage-map id=map3 source-dir=/var/lib/mysql target-dir=/var/lib/mysql options=rw  --disabled

4. create the galera resource inside the bundle (adapt the host names)
  pcs resource create galera galera enable_creation=true wsrep_cluster_address='gcomm://ra1,ra2,ra3' cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3' op promote timeout=60 on-fail=block meta container-attribute-target=host bundle galera-bundle

5. start the bundle to bootstrap the galera cluster
  pcs resource enable galera-bundle
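
(Optional illustrative check: confirm that all three replicas have started and
been promoted before exercising the cleanup scenarios from the description.)

  pcs status | grep -A5 galera-bundle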

Actual results:
See description: sometimes the state of the resource cannot be recovered, and sometimes a resource stop operation is triggered when it shouldn't be.

Expected results:
Cleanup should reprobe the state of the resource and should never stop it while unmanaged.

Additional info:

Comment 2 Andrew Beekhof 2017-10-30 01:09:15 UTC
Stupid question... why are you doing a cleanup?

Comment 3 Andrew Beekhof 2017-10-31 02:59:55 UTC
A)

Looking at the logs in the (externally) supplied tarball, I don't see any stops (or starts), e.g.:

# grep -e crmd:.*stop  controller-2/corosync.log.extract.txt

I do see lots of:

Oct 20 07:10:10 [20322] controller-2       crmd:   notice: te_rsc_command:	Initiating notify operation rabbitmq_pre_notify_start_0 on rabbitmq-bundle-0 | action 181

But the actual start never happens.

I see a demote though, I guess for galera that means stop.

The reason for the start is:

(  allocate.c:348   galera-bundle-docker-0)    info: check_action_definition:	Parameters to galera-bundle-docker-0_start_0 on controller-0 changed: was c4193f3afb494a6a0ebfd33f930e7be4 vs. now ed7e84d41c6aedf055e8c2d9ca23355d (reload:3.0.12) 0:0;62:16:0:fed57637-c75b-4cee-90c7-c85f876c06c6

Which suggests either the 'manage' ran too soon or --cleanup didn't delete everything.

This:
Oct 20 07:11:37 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:37 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:38 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted

suggests the latter. Which exact pacemaker version?
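
(Illustrative only: whether the cleanup actually erased the recorded operation
history and digests for the resource can be checked against the status section
of the CIB, e.g.)

# cibadmin -Q -o status | grep -e galera -e op-digest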

Comment 4 Damien Ciabrini 2017-11-05 21:28:41 UTC
So for the bug report, I used 1.1.16-12.el7_4.4, which is the latest
available in RHEL 7.4.z and CentOS. With that version, I can always
reproduce the errors I mentioned in the description; apologies if I provided
a bad crm_report beforehand.

I also reproduced this setup with a pacemaker version which includes fixes
for https://bugzilla.redhat.com/show_bug.cgi?id=1499217.
With that version, I don't experience problem A or problem B anymore, but
I now hit another issue (henceforth C): once I run "pcs resource manage galera",
a restart operation is triggered.

Lastly, I've tested version 1.1.18-1.dabd094e0.git.el7.centos-dabd094e0 built from https://github.com/beekhof/pacemaker/commit/dabd094e02e746b760063bbbd1166485745636e7 , and I confirm that I don't see any of problem A, B or C anymore. I replaced "pcs resource cleanup" with "crm_resource --refresh" to cope with the slight semantic change in this version.
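
(For reference, the two invocations being compared are roughly as below; the
notes on the semantic change reflect the 1.1.18 behaviour as I understand it.)

    pcs resource cleanup galera                # pre-1.1.18 semantics: full reprobe of the resource
    crm_resource --resource galera --refresh   # 1.1.18: --refresh reprobes; --cleanup now only clears failed operations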

Comment 6 Ken Gaillot 2017-11-06 17:11:03 UTC
The build currently planned for 7.5 should behave the same as the 1.1.18 version you tested, so I think we're good to consider this a duplicate of Bug 1499217 as far as 7.5 goes.

That one also has a planned 7.4.z-stream, Bug 1509874. The z-stream will come with some other fixes, so that build may already address problem C. If not, we can address that on that bz.

*** This bug has been marked as a duplicate of bug 1499217 ***

Comment 7 Damien Ciabrini 2017-11-07 20:07:06 UTC
Ken, Andrew, I just tested a scratch build of https://bugzilla.redhat.com/show_bug.cgi?id=1499217. I can confirm it fixes both problems A and B; however, it still exhibits problem C, which I described in comment #4.

I'll cross-link info in bug 1499217 so that this unexpected cleanup behaviour can be properly tracked and fixed.

Comment 8 Ken Gaillot 2017-11-07 21:36:36 UTC
(In reply to Damien Ciabrini from comment #7)
> Ken, Andrew, I just tested a scratch build of
> https://bugzilla.redhat.com/show_bug.cgi?id=1499217. I can confirm it fixes
> both problems A and B; however, it still exhibits problem C, which I
> described in comment #4.
> 
> I'll cross-link info in bug 1499217 so that this unexpected cleanup
> behaviour can be properly tracked and fixed.

To clarify, the scratch build is for the z-stream Bug 1509874; the 1.1.18-5 build for Bug 1499217 should behave identically to the 1.1.18 version you tested previously.

I will comment on Bug 1509874 to move the discussion there.