| Summary: | pcs does not cleanup an old failed action | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Raoul Scarazzini <rscarazz> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 7.2 | CC: | abeekhof, cfeist, cluster-maint, michele, mnovacek, tojeline |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-27 16:46:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
"pcs resource cleanup nova-compute-clone" merely runs "crm_resource -C -r nova-compute-clone". Moving to pacemaker for further investigation. We may only be deleting ${resource}_last_failure_0
Focusing on the 04:55:02 cleanup in the logs, the resource's fail-count is correctly removed for all the compute nodes, but the resource's operation history (which "pcs status" uses to determine failed actions) is cleared on all except overcloud-novacompute-3. This is likely a bug, but exactly where eludes me. A couple of questions:

- The cleanup command will print messages like "Cleaning up nova-compute on overcloud-novacompute-0". Do you remember if it displayed such a message for overcloud-novacompute-3?
- Are you able to reproduce the issue?

IIRC the cleanup message was displayed for all the overcloud-novacompute-* resources. I think reproducing this specific issue is very hard. It should (not just could) be possible by using the old NovaCompute resource agent, but I don't have an environment available at the moment to do this. I'll keep an eye out in case it shows up in the tests I'm still running.

This will not be addressed in the 7.3 timeframe.

This will not be addressed in the 7.4 timeframe.

qa-ack+: comment #9

Due to time constraints, this will not make 7.5.

Realistically, without a reproducer we aren't going to be able to do anything with this. There have been multiple fixes relating to cleanup since then, so there's a good chance it's already been addressed.
When doing a "pcs resource cleanup nova-compute-clone" the following failed action never gets removed. This is an older failed action and somehow pcs does not manage to remove it: Failed Actions: * nova-compute_monitor_10000 on overcloud-novacompute-3 'not running' (7): call=1577, status=complete, exitreason='none', last-rc-change='Thu Dec 24 04:53:51 2015', queued=0ms, exec=0ms Note that the nova-compute-clone service running on overcloud-novacompute-3 is correctly started, we're only talking about a previous failed action that does not get cleaned up: [heat-admin@overcloud-controller-1 logs]$ sudo pcs status|grep -A1 nova-compute-clone Clone Set: nova-compute-clone [nova-compute] Started: [ overcloud-novacompute-0 overcloud-novacompute-1 overcloud-novacompute-2 overcloud-novacompute-3 ] And on the specific node: [root@overcloud-novacompute-3 ~]# /usr/lib/ocf/resource.d/openstack/NovaCompute monitor DEBUG: default monitor : 0 pacemaker-1.1.13-10.el7.x86_64 pcs-0.9.143-15.el7.x86_64 We attach the following files: 1) CIB 2) pcsd log from controller-0 where we ran the commands 3) corosync.log from all three nodes http://file.rdu.redhat.com/~rscarazz/20151224_failed_resource_cleanup/