Bug 1505909 - Problems when cleaning up unmanaged bundle resources
Summary: Problems when cleaning up unmanaged bundle resources
Keywords:
Status: CLOSED DUPLICATE of bug 1499217
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1494455
Reported: 2017-10-24 14:32 UTC by Damien Ciabrini
Modified: 2017-11-07 21:36 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-06 17:11:03 UTC
Target Upstream Version:


Attachments:
galera configuration (697 bytes, text/plain)
2017-10-24 14:32 UTC, Damien Ciabrini

Description Damien Ciabrini 2017-10-24 14:32:09 UTC
Created attachment 1342787
galera configuration

Description of problem:
I see unexpected behaviour when cleaning up containerized resources
that are unmanaged in pacemaker. Namely:

A - When a resource attribute is updated while the resource is
unmanaged, and a cleanup is requested afterwards:

    pcs resource unmanage galera       
    pcs resource update galera cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3;foo=foo'
    pcs resource cleanup galera

a stop operation is scheduled on the resource and executed, even though
the resource is still unmanaged.
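
For anyone who wants to replay sequence A repeatedly, here is a minimal sketch of a wrapper around the three commands above. The names run, reproduce_a and DRY_RUN are hypothetical helpers introduced for illustration, not part of pcs. By default the script only prints the commands; set DRY_RUN=0 to actually execute them on a cluster node.

```shell
#!/bin/sh
# Sketch only: replays sequence A from the description.
# DRY_RUN, run and reproduce_a are illustrative names, not pcs features.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Always print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reproduce_a() {
    rsc="$1"       # resource id, e.g. galera
    new_map="$2"   # new value for the cluster_host_map attribute
    run pcs resource unmanage "$rsc" &&
    run pcs resource update "$rsc" "cluster_host_map=$new_map" &&
    run pcs resource cleanup "$rsc"
}
```

Possible usage, with the values from the description:

    reproduce_a galera 'ra1:ra1;ra2:ra2;ra3:ra3;foo=foo'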

B - When a resource bundle is unmanaged, and a cleanup is requested
on the bundle:

    pcs resource unmanage galera-bundle
    pcs resource cleanup galera-bundle

the monitor operations triggered on the galera resource are not able to
probe the real state of the resource; the galera server is still running
but the resource shows as "Stopped" in pcs status.
Even subsequent "pcs resource cleanup galera" calls are not able to reprobe
the state of the resource.
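
The mismatch in B can be spotted with a small check that compares pacemaker's view with what is actually running. This is a rough sketch under several assumptions: that "crm_resource --resource galera --locate" reports "NOT running" for a resource pacemaker considers stopped, and that the database process is named mysqld and is visible to pgrep on the host. check_view, PACEMAKER_PROBE and SERVER_PROBE are hypothetical names; both probes can be overridden.

```shell
#!/bin/sh
# Sketch only: compare pacemaker's view of galera with the real process state.
# Both probe commands are assumptions and can be overridden from the environment.
PACEMAKER_PROBE="${PACEMAKER_PROBE:-crm_resource --resource galera --locate}"
SERVER_PROBE="${SERVER_PROBE:-pgrep -x mysqld}"

check_view() {
    # Pacemaker's view: assumed to contain "NOT running" when the resource is stopped.
    if $PACEMAKER_PROBE 2>&1 | grep -q 'NOT running'; then
        pcmk=stopped
    else
        pcmk=running
    fi
    # Real state: the probe exits 0 when the server process exists.
    if $SERVER_PROBE >/dev/null 2>&1; then
        real=running
    else
        real=stopped
    fi
    echo "pacemaker=$pcmk server=$real"
    # Return success only when both views agree.
    [ "$pcmk" = "$real" ]
}
```

Running check_view on each node after the cleanup should return non-zero when problem B is hit, i.e. when pacemaker reports the resource as stopped while the server is still up.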

Version-Release number of selected component (if applicable):
1.1.16-12.el7_4.4

How reproducible:
Always

Steps to Reproduce:
For testing I used a containerized galera resource:

On a three node cluster ra1, ra2, ra3

1. pull the container image on all nodes
  docker pull docker.io/tripleoupstream/centos-binary-mariadb:latest

2. prepare the hosts
  # install the attached galera.cnf in /etc/my.cnf.d/galera.cnf and adapt the host names

3. create a bundle
  pcs resource bundle create galera-bundle container docker image=docker.io/tripleoupstream/centos-binary-mariadb:latest replicas=3 masters=3 network=host options="--user=root --log-driver=journald" run-command="/usr/sbin/pacemaker_remoted" network control-port=3123 storage-map id=map0 source-dir=/dev/log target-dir=/dev/log storage-map id=map1 source-dir=/dev/zero target-dir=/etc/libqb/force-filesystem-sockets options=ro storage-map id=map2 source-dir=/etc/my.cnf.d/galera.cnf target-dir=/etc/my.cnf.d/galera.cnf options=ro storage-map id=map3 source-dir=/var/lib/mysql target-dir=/var/lib/mysql options=rw  --disabled

4. create the galera resource inside the bundle (adapt the host names)
  pcs resource create galera galera enable_creation=true wsrep_cluster_address='gcomm://ra1,ra2,ra3' cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3' op promote timeout=60 on-fail=block meta container-attribute-target=host bundle galera-bundle

5. start the bundle to bootstrap the galera cluster
  pcs resource enable galera-bundle

Actual results:
See description. Sometimes the state of the resource cannot be recovered; sometimes a resource stop operation is triggered when it shouldn't be.

Expected results:
Cleanup should reprobe the state of the resource and should never stop it while unmanaged.

Additional info:

Comment 2 Andrew Beekhof 2017-10-30 01:09:15 UTC
Stupid question... why are you doing a cleanup?

Comment 3 Andrew Beekhof 2017-10-31 02:59:55 UTC
A)

Looking at the logs in the (externally) supplied tarball, I don't see any stops (or starts). E.g.:

# grep -e crmd:.*stop  controller-2/corosync.log.extract.txt

I do see lots of:

Oct 20 07:10:10 [20322] controller-2       crmd:   notice: te_rsc_command:	Initiating notify operation rabbitmq_pre_notify_start_0 on rabbitmq-bundle-0 | action 181

But the actual start never happens.

I do see a demote, though; I guess for galera that means stop.
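
To get an overview of which operation types were initiated, rather than grepping for one at a time, something along these lines may help. It is a sketch that assumes all relevant lines follow the "te_rsc_command: Initiating <type> operation ..." format shown above; count_initiated_ops is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: summarize the operations initiated by crmd in a log extract.
count_initiated_ops() {
    # $1: log file. Prints "<count> <operation type>", one line per type.
    sed -n 's/.*te_rsc_command:[[:space:]]*Initiating \([a-z]*\) operation .*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Possible usage:

    count_initiated_ops controller-2/corosync.log.extract.txt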

The reason for the start is:

(  allocate.c:348   galera-bundle-docker-0)    info: check_action_definition:	Parameters to galera-bundle-docker-0_start_0 on controller-0 changed: was c4193f3afb494a6a0ebfd33f930e7be4 vs. now ed7e84d41c6aedf055e8c2d9ca23355d (reload:3.0.12) 0:0;62:16:0:fed57637-c75b-4cee-90c7-c85f876c06c6

Which suggests either the 'manage' ran too soon or --cleanup didn't delete everything.
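
To find every action whose parameter digest changed, the check_action_definition lines can be reduced to the interesting fields. This is a sketch that assumes the message format quoted above; changed_params is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list actions whose recorded parameter digest differs from the current one.
changed_params() {
    # Reads log lines on stdin; prints "<action> <node> <old digest> <new digest>".
    sed -n 's/.*Parameters to \([^ ]*\) on \([^ ]*\) changed: was \([0-9a-f]*\) vs\. now \([0-9a-f]*\).*/\1 \2 \3 \4/p'
}
```

Fed the line quoted above, this prints:

    galera-bundle-docker-0_start_0 controller-0 c4193f3afb494a6a0ebfd33f930e7be4 ed7e84d41c6aedf055e8c2d9ca23355d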

This:
Oct 20 07:11:37 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:37 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:38 [20322] controller-2       crmd:     info: notify_deleted:	Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted

suggests the latter. Which exact pacemaker version?
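
A quick way to see how often the same deletion was notified is to group the notify_deleted lines. This is a sketch assuming the message format shown above; repeated_deletions is a hypothetical helper name, and a count above one is only a hint, not proof of a problem.

```shell
#!/bin/sh
# Sketch only: report deletion notifications that appear more than once.
repeated_deletions() {
    # Reads log lines on stdin; prints "<count> <client id> <node> <resource>".
    sed -n 's/.*notify_deleted:[[:space:]]*Notifying \([^ ]*\) on \([^ ]*\) that \([^ ]*\) was deleted.*/\1 \2 \3/p' |
        sort | uniq -c | awk '$1 > 1 { print $1, $2, $3, $4 }'
}
```

For the three lines quoted above, this prints:

    3 958bfdf7-3a5b-402b-a00c-f1e74a86f429 controller-0 galera:1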

Comment 4 Damien Ciabrini 2017-11-05 21:28:41 UTC
For the bug report, I used 1.1.16-12.el7_4.4, which is the latest version available in RHEL 7.4.z and CentOS. With that version, I can always
reproduce the errors I mentioned in the description; apologies if I provided
a bad crm_report beforehand.

I also reproduced this setup with a pacemaker version which includes the fixes
for https://bugzilla.redhat.com/show_bug.cgi?id=1499217.
With that version, I no longer experience problem A or problem B, but
I now hit another issue (henceforth C): as soon as I run "pcs resource manage galera",
a restart operation is triggered.

Lastly, I've tested version 1.1.18-1.dabd094e0.git.el7.centos-dabd094e0 built from https://github.com/beekhof/pacemaker/commit/dabd094e02e746b760063bbbd1166485745636e7 , and I confirm that I no longer see any of problems A, B or C. I replaced "pcs resource cleanup" with "crm_resource --refresh" to cope with the slight semantic change in this version.
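
For test scripts that have to run against both versions, the choice of reprobe command could be made from the installed pacemaker version. This is a sketch: the 1.1.18 threshold is an assumption based only on the versions tested in this comment, version_ge and reprobe_command are hypothetical helper names, and sort -V is a GNU coreutils extension.

```shell
#!/bin/sh
# Sketch only: pick the reprobe command based on the pacemaker version.
version_ge() {
    # True when $1 >= $2 in version order (requires GNU sort -V).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

reprobe_command() {
    # $1: pacemaker version string, $2: resource id. Prints the command to run.
    # The 1.1.18 cut-off is an assumption, not a documented boundary.
    if version_ge "$1" 1.1.18; then
        echo "crm_resource --refresh --resource $2"
    else
        echo "pcs resource cleanup $2"
    fi
}
```

Possible usage:

    reprobe_command 1.1.16-12.el7_4.4 galera
    reprobe_command 1.1.18-1.dabd094e0.git.el7.centos galera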

Comment 6 Ken Gaillot 2017-11-06 17:11:03 UTC
The build currently planned for 7.5 should behave the same as the 1.1.18 version you tested, so I think we're good to consider this a duplicate of Bug 1499217 as far as 7.5 goes.

That one also has a planned 7.4.z-stream, Bug 1509874. The z-stream will come with some other fixes, so that build may already address problem C. If not, we can address that on that bz.

*** This bug has been marked as a duplicate of bug 1499217 ***

Comment 7 Damien Ciabrini 2017-11-07 20:07:06 UTC
Ken, Andrew, I just tested a scratch build of https://bugzilla.redhat.com/show_bug.cgi?id=1499217, and I can confirm it fixes both problems A and B. However, it still exhibits problem C, which I described in comment #4.

I'll cross-link this info in bug 1499217 so that this unexpected cleanup behaviour can be properly tracked and fixed.

Comment 8 Ken Gaillot 2017-11-07 21:36:36 UTC
(In reply to Damien Ciabrini from comment #7)
> Ken, Andrew, I just tested a scratch build of
> https://bugzilla.redhat.com/show_bug.cgi?id=1499217, I can confirm it fixes
> both problem A and B, however it still exhibits problem C that I described
> in comment #4.
> 
> I'll cross-link info in bug 1499217 so that this unexpected cleanup behaviour
> can be properly tracked and fixed.

To clarify, the scratch build is for the z-stream Bug 1509874; the 1.1.18-5 build for Bug 1499217 should behave identically to the 1.1.18 version you tested previously.

Will comment on Bug 1509874, to move the discussion there.

