Bug 1294055 - pcs does not cleanup an old failed action
Summary: pcs does not cleanup an old failed action
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-24 10:18 UTC by Raoul Scarazzini
Modified: 2019-03-27 16:46 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-27 16:46:49 UTC
Target Upstream Version:



Description Raoul Scarazzini 2015-12-24 10:18:49 UTC
When doing a "pcs resource cleanup nova-compute-clone", the following failed action
never gets removed. It is an older failed action that pcs somehow does not manage
to clear:

Failed Actions:
* nova-compute_monitor_10000 on overcloud-novacompute-3 'not running' (7): call=1577, status=complete, exitreason='none',
    last-rc-change='Thu Dec 24 04:53:51 2015', queued=0ms, exec=0ms

Note that the nova-compute-clone service running on overcloud-novacompute-3 is correctly
started; we're only talking about a previous failed action that does not get cleaned up:

[heat-admin@overcloud-controller-1 logs]$ sudo pcs status|grep -A1 nova-compute-clone
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-novacompute-0 overcloud-novacompute-1 overcloud-novacompute-2 overcloud-novacompute-3 ]

And on the specific node:

[root@overcloud-novacompute-3 ~]# /usr/lib/ocf/resource.d/openstack/NovaCompute monitor
DEBUG: default monitor : 0

pacemaker-1.1.13-10.el7.x86_64
pcs-0.9.143-15.el7.x86_64

We attach the following files:
1) CIB
2) pcsd log from controller-0 where we ran the commands
3) corosync.log from all three nodes

http://file.rdu.redhat.com/~rscarazz/20151224_failed_resource_cleanup/

Comment 2 Tomas Jelinek 2016-01-05 14:31:54 UTC
"pcs resource cleanup nova-compute-clone" merely runs "crm_resource -C -r nova-compute-clone". Moving to pacemaker for further investigation.

Comment 4 Andrew Beekhof 2016-01-11 00:36:10 UTC
We may only be deleting ${resource}_last_failure_0
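If that suspicion is right, the symptom can be sketched with a small, self-contained illustration. Everything below is invented for illustration: the entry ids in SAMPLE are modeled on the lrm_rsc_op naming seen in the reported failure, and `grep -v` merely stands in for the suspected deletion logic. The point is that a cleanup which drops only `${resource}_last_failure_0` would leave an older failed-monitor record (here `nova-compute_monitor_10000`) behind, which is exactly what "pcs status" would keep reporting.

```shell
# Hypothetical sketch of the suspected bug, not pacemaker source.
# SAMPLE mimics the op-history entry ids stored for one node's copy
# of the resource.
SAMPLE='nova-compute_last_failure_0
nova-compute_monitor_10000
nova-compute_start_0'

# Naive cleanup: delete only the _last_failure_0 entry.
remaining=$(printf '%s\n' "$SAMPLE" | grep -v '_last_failure_0$')

# The old failed monitor record survives:
printf '%s\n' "$remaining"
# prints:
#   nova-compute_monitor_10000
#   nova-compute_start_0
```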

Comment 6 Ken Gaillot 2016-01-28 00:07:56 UTC
Focusing on the 04:55:02 cleanup in the logs, the resource's fail-count is correctly removed for all the compute nodes, but the resource's operation history (which "pcs status" uses to determine failed actions) is cleared on all except overcloud-novacompute-3.

This is likely a bug, but exactly where eludes me. A couple of questions:

The cleanup command will print messages like "Cleaning up nova-compute on overcloud-novacompute-0". Do you remember if it displayed such a message for overcloud-novacompute-3?

Are you able to reproduce the issue?

Comment 7 Raoul Scarazzini 2016-01-28 07:47:33 UTC
IIRC the cleanup message was displayed for all the overcloud-novacompute-* nodes.

I think reproducing this specific issue will be very hard. It should (not just could) be possible using the old NovaCompute resource agent, but I don't have an environment available at the moment to try it. I'll keep an eye out in case it shows up in any of the tests I'm still running.

Comment 8 Ken Gaillot 2016-05-16 16:21:57 UTC
This will not be addressed in the 7.3 timeframe.

Comment 10 Ken Gaillot 2017-03-06 23:23:16 UTC
This will not be addressed in the 7.4 timeframe.

Comment 11 michal novacek 2017-08-04 11:03:09 UTC
qa-ack+: comment #9

Comment 13 Ken Gaillot 2017-10-09 17:22:13 UTC
Due to time constraints, this will not make 7.5

Comment 14 Ken Gaillot 2019-03-27 16:46:49 UTC
Realistically, without a reproducer we aren't going to be able to do anything with this. There have been multiple cleanup-related fixes since then, so there's a good chance it has already been addressed.

