Bug 1505909

Field | Value
---|---
Summary | Problems when cleaning up unmanaged bundle resources
Product | Red Hat Enterprise Linux 7
Component | pacemaker
Version | 7.4
Status | CLOSED DUPLICATE
Severity | high
Priority | unspecified
Reporter | Damien Ciabrini <dciabrin>
Assignee | Ken Gaillot <kgaillot>
QA Contact | cluster-qe <cluster-qe>
CC | abeekhof, cluster-maint, michele
Target Milestone | rc
Target Release | ---
Hardware | Unspecified
OS | Unspecified
Doc Type | If docs needed, set a value
Last Closed | 2017-11-06 17:11:03 UTC
Type | Bug
Regression | ---
Bug Blocks | 1494455
Stupid question... why are you doing a cleanup?

---

A) Looking at the logs in the (externally) supplied tarball, I don't see any stops (or starts). E.g.:

```
# grep -e 'crmd:.*stop' controller-2/corosync.log.extract.txt
```

I do see lots of:

```
Oct 20 07:10:10 [20322] controller-2 crmd: notice: te_rsc_command: Initiating notify operation rabbitmq_pre_notify_start_0 on rabbitmq-bundle-0 | action 181
```

but the actual start never happens. I see a demote though; I guess for galera that means stop.

The reason for the start is:

```
( allocate.c:348 galera-bundle-docker-0) info: check_action_definition: Parameters to galera-bundle-docker-0_start_0 on controller-0 changed: was c4193f3afb494a6a0ebfd33f930e7be4 vs. now ed7e84d41c6aedf055e8c2d9ca23355d (reload:3.0.12) 0:0;62:16:0:fed57637-c75b-4cee-90c7-c85f876c06c6
```

which suggests that either the 'manage' ran too soon or --cleanup didn't delete everything. This:

```
Oct 20 07:11:37 [20322] controller-2 crmd: info: notify_deleted: Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:37 [20322] controller-2 crmd: info: notify_deleted: Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
Oct 20 07:11:38 [20322] controller-2 crmd: info: notify_deleted: Notifying 958bfdf7-3a5b-402b-a00c-f1e74a86f429 on controller-0 that galera:1 was deleted
```

suggests the latter.

Which exact pacemaker version?

---

For the bug report, I used 1.1.16-12.el7_4.4, which is the latest version available in RHEL 7.4.z and CentOS. With that version, I can always reproduce the error I mentioned in the description; apologies if I provided a bad crm_report beforehand.

I also reproduced this setup with a pacemaker version that includes the fixes for https://bugzilla.redhat.com/show_bug.cgi?id=1499217. With that version, I no longer experience problem A or problem B, but I now hit another issue (henceforth C): once I run "pcs resource manage galera", a restart operation is triggered.
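As an editorial aside, the log check quoted above can be tried self-contained. This is a minimal sketch that runs the same grep against a two-line sample log written in the format of the excerpts above (the stop line is hypothetical, not taken from the tarball):

```shell
# Build a small sample log in the crmd message format quoted above.
cat > /tmp/corosync.sample.log <<'EOF'
Oct 20 07:10:10 [20322] controller-2 crmd: notice: te_rsc_command: Initiating notify operation rabbitmq_pre_notify_start_0 on rabbitmq-bundle-0 | action 181
Oct 20 07:12:00 [20322] controller-2 crmd: notice: te_rsc_command: Initiating stop operation galera-bundle-docker-0_stop_0 on controller-0 | action 90
EOF

# The same pattern as in the comment: only lines where crmd initiates a
# stop match; the notify/start line does not.
grep -e 'crmd:.*stop' /tmp/corosync.sample.log
```

Against a real crm_report tarball, the file argument would be the extracted corosync.log, as in the comment above.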
Lastly, I've tested version 1.1.18-1.dabd094e0.git.el7.centos-dabd094e0 built from https://github.com/beekhof/pacemaker/commit/dabd094e02e746b760063bbbd1166485745636e7, and I confirm that I no longer see any of problems A, B, or C. I replaced "pcs resource cleanup" with "crm_resource --refresh" to cope with the slight semantic change in this version.

---

The build currently planned for 7.5 should behave the same as the 1.1.18 version you tested, so I think we're good to consider this a duplicate of Bug 1499217 as far as 7.5 goes. That one also has a planned 7.4 z-stream, Bug 1509874. The z-stream will come with some other fixes, so that build may already address problem C. If not, we can address it on that bz.

*** This bug has been marked as a duplicate of bug 1499217 ***

---

Ken, Andrew, I just tested a scratch build of https://bugzilla.redhat.com/show_bug.cgi?id=1499217; I can confirm it fixes both problems A and B, but it still exhibits problem C that I described in comment #4.

I'll cross-link info in bug 1499217 so that this unexpected cleanup behaviour can be properly tracked and fixed.

---

(In reply to Damien Ciabrini from comment #7)
> Ken, Andrew, I just tested a scratch build of
> https://bugzilla.redhat.com/show_bug.cgi?id=1499217; I can confirm it fixes
> both problems A and B, but it still exhibits problem C that I described
> in comment #4.
>
> I'll cross-link info in bug 1499217 so that this unexpected cleanup
> behaviour can be properly tracked and fixed.

To clarify, the scratch build is for the z-stream Bug 1509874; the 1.1.18-5 build for Bug 1499217 should behave identically to the 1.1.18 version you tested previously. Will comment on Bug 1509874 to move the discussion there.
Created attachment 1342787 [details]
galera configuration

Description of problem:

I see unexpected behaviour w.r.t. cleaning up containerized resources when they are unmanaged in pacemaker. Namely:

A - When a resource attribute is updated while the resource is unmanaged, and a cleanup is requested afterwards:

```
pcs resource unmanage galera
pcs resource update galera cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3;foo=foo'
pcs resource cleanup galera
```

a stop operation is scheduled on the resource and executed, even though the resource is still unmanaged.

B - When a resource bundle is unmanaged, and a cleanup is requested on the bundle:

```
pcs resource unmanage galera-bundle
pcs resource cleanup galera-bundle
```

the monitor operations triggered on the galera resource are not able to probe the real state of the resource; the galera server is still running, but the resource shows as "Stopped" in pcs status. Even subsequent runs of "pcs resource cleanup galera" are not able to reprobe the state of the resource.

Version-Release number of selected component (if applicable):
1.1.16-12.el7_4.4

How reproducible:
Always

Steps to Reproduce:

For testing I used a containerized galera resource on a three-node cluster ra1, ra2, ra3.

1. pull the container image on all nodes

```
docker pull docker.io/tripleoupstream/centos-binary-mariadb:latest
```

2. prepare the hosts: install the attached galera.cnf in /etc/my.cnf.d/galera.cnf and adapt the host names

3. create a bundle

```
pcs resource bundle create galera-bundle \
    container docker image=docker.io/tripleoupstream/centos-binary-mariadb:latest \
        replicas=3 masters=3 network=host \
        options="--user=root --log-driver=journald" \
        run-command="/usr/sbin/pacemaker_remoted" \
    network control-port=3123 \
    storage-map id=map0 source-dir=/dev/log target-dir=/dev/log \
    storage-map id=map1 source-dir=/dev/zero target-dir=/etc/libqb/force-filesystem-sockets options=ro \
    storage-map id=map2 source-dir=/etc/my.cnf.d/galera.cnf target-dir=/etc/my.cnf.d/galera.cnf options=ro \
    storage-map id=map3 source-dir=/var/lib/mysql target-dir=/var/lib/mysql options=rw \
    --disabled
```

4. create the galera resource inside the bundle (adapt the host names)

```
pcs resource create galera galera enable_creation=true \
    wsrep_cluster_address='gcomm://ra1,ra2,ra3' \
    cluster_host_map='ra1:ra1;ra2:ra2;ra3:ra3' \
    op promote timeout=60 on-fail=block \
    meta container-attribute-target=host \
    bundle galera-bundle
```

5. start the bundle to bootstrap the galera cluster

```
pcs resource enable galera-bundle
```

Actual results:
See description; sometimes the state of the resource cannot be recovered, and sometimes a resource stop operation is triggered when it shouldn't be.

Expected results:
Cleanup should reprobe the state of the resource and should never stop it while unmanaged.

Additional info:
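As an editorial aside, the problem-B symptom (pcs reports "Stopped" while the container is still running) can be checked mechanically. This is a minimal sketch; `stale_stopped` is a hypothetical helper, and the sample pcs/docker output below is illustrative, not taken from this report:

```shell
# Hypothetical check: list resources that pcs shows as Stopped while a
# container whose name contains the resource name is still running.
# Both inputs are passed as text, so the check itself needs no live cluster:
#   pcs_out=$(pcs status resources)
#   containers=$(docker ps --format '{{.Names}}')
stale_stopped() {
    pcs_out="$1"; containers="$2"
    echo "$pcs_out" | awk '/Stopped/ { print $1 }' \
        | while read -r rsc; do
            echo "$containers" | grep -q "$rsc" && echo "$rsc"
        done
}

# Sample data in the shape of "pcs status resources" / "docker ps" output.
pcs_out='  galera     (ocf::heartbeat:galera):              Stopped
  rabbitmq   (ocf::heartbeat:rabbitmq-cluster):    Started'
containers='galera-bundle-docker-0'

stale_stopped "$pcs_out" "$containers"   # → galera
```

A non-empty result after a cleanup would indicate that the reprobe did not recover the real state, which is what the description calls problem B.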