When doing a "pcs resource cleanup nova-compute-clone" the following failed action never gets removed. This is an older failed action that pcs somehow does not manage to remove:

Failed Actions:
* nova-compute_monitor_10000 on overcloud-novacompute-3 'not running' (7): call=1577, status=complete, exitreason='none',
    last-rc-change='Thu Dec 24 04:53:51 2015', queued=0ms, exec=0ms

Note that the nova-compute-clone service running on overcloud-novacompute-3 is correctly started; we're only talking about a previous failed action that does not get cleaned up:

[heat-admin@overcloud-controller-1 logs]$ sudo pcs status | grep -A1 nova-compute-clone
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-novacompute-0 overcloud-novacompute-1 overcloud-novacompute-2 overcloud-novacompute-3 ]

And on the specific node:

[root@overcloud-novacompute-3 ~]# /usr/lib/ocf/resource.d/openstack/NovaCompute monitor
DEBUG: default monitor : 0

Versions:
pacemaker-1.1.13-10.el7.x86_64
pcs-0.9.143-15.el7.x86_64

We attach the following files:
1) CIB
2) pcsd log from controller-0 where we ran the commands
3) corosync.log from all three nodes

http://file.rdu.redhat.com/~rscarazz/20151224_failed_resource_cleanup/
"pcs resource cleanup nova-compute-clone" merely runs "crm_resource -C -r nova-compute-clone". Moving to pacemaker for further investigation.
We may only be deleting ${resource}_last_failure_0
Focusing on the 04:55:02 cleanup in the logs, the resource's fail-count is correctly removed for all the compute nodes, but the resource's operation history (which "pcs status" uses to determine failed actions) is cleared on all except overcloud-novacompute-3. This is likely a bug, but exactly where eludes me. A couple of questions: The cleanup command will print messages like "Cleaning up nova-compute on overcloud-novacompute-0". Do you remember if it displayed such a message for overcloud-novacompute-3? Are you able to reproduce the issue?
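For illustration, here is a minimal sketch of how failed actions are derived from the operation history in the CIB status section: "pcs status" effectively scans the lrm_rsc_op entries under each node_state and reports those with a non-zero rc-code. The attribute names below follow the Pacemaker CIB schema, but the XML fragment and the helper function are made up for this example, not taken from pcs source.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB status fragment. The lrm_rsc_op with rc-code="7"
# ('not running') is the kind of stale history entry that keeps
# showing up under "Failed Actions" until cleanup removes it.
CIB = """
<cib>
  <status>
    <node_state uname="overcloud-novacompute-3">
      <lrm>
        <lrm_resources>
          <lrm_resource id="nova-compute">
            <lrm_rsc_op id="nova-compute_last_failure_0"
                        operation="monitor" rc-code="7" call-id="1577"/>
            <lrm_rsc_op id="nova-compute_monitor_10000"
                        operation="monitor" rc-code="0" call-id="1600"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
</cib>
"""

def failed_ops(cib_xml):
    """Return (node, op-id, rc-code) for every operation whose rc-code
    is non-zero, i.e. the entries reported as failed actions."""
    root = ET.fromstring(cib_xml)
    failures = []
    for node in root.iter("node_state"):
        for op in node.iter("lrm_rsc_op"):
            if op.get("rc-code") != "0":
                failures.append((node.get("uname"), op.get("id"),
                                 op.get("rc-code")))
    return failures

print(failed_ops(CIB))
# -> [('overcloud-novacompute-3', 'nova-compute_last_failure_0', '7')]
```

If cleanup deletes only the ${resource}_last_failure_0 entry on some nodes while leaving other stale history behind, a scan like this would still report the old failure, which matches the observed symptom.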
IIRC the cleanup message was displayed for all the overcloud-novacompute-* nodes. I think reproducing this specific issue is very hard. It should (not just could) be possible by using the old NovaCompute resource agent, but I don't have an environment available at the moment to try this. I'll keep an eye out in case it shows up in the tests I'm still running.
This will not be addressed in the 7.3 timeframe.
This will not be addressed in the 7.4 timeframe.
qa-ack+: comment #9
Due to time constraints, this will not make 7.5
Realistically, without a reproducer we aren't going to be able to do anything with this. There have been multiple fixes relating to clean-up since then, so there's a good chance it has already been addressed.