Bug 1294055 - pcs does not cleanup an old failed action
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 7.4
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Reported: 2015-12-24 05:18 EST by Raoul Scarazzini
Modified: 2017-08-04 07:11 EDT
CC List: 7 users

Type: Bug


Attachments: None
Description Raoul Scarazzini 2015-12-24 05:18:49 EST
When running "pcs resource cleanup nova-compute-clone", the following failed action
is never removed. It is an older failed action, and pcs somehow fails
to remove it:

Failed Actions:
* nova-compute_monitor_10000 on overcloud-novacompute-3 'not running' (7): call=1577, status=complete, exitreason='none',
    last-rc-change='Thu Dec 24 04:53:51 2015', queued=0ms, exec=0ms

Note that the nova-compute-clone instance on overcloud-novacompute-3 is correctly
started; the problem is only that a previous failed action is not cleaned up:

[heat-admin@overcloud-controller-1 logs]$ sudo pcs status|grep -A1 nova-compute-clone
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ overcloud-novacompute-0 overcloud-novacompute-1 overcloud-novacompute-2 overcloud-novacompute-3 ]

And on the specific node:

[root@overcloud-novacompute-3 ~]# /usr/lib/ocf/resource.d/openstack/NovaCompute monitor
DEBUG: default monitor : 0

pacemaker-1.1.13-10.el7.x86_64
pcs-0.9.143-15.el7.x86_64

The following files are available at the link below:
1) CIB
2) pcsd log from controller-0, where we ran the commands
3) corosync.log from all three nodes

http://file.rdu.redhat.com/~rscarazz/20151224_failed_resource_cleanup/
Comment 2 Tomas Jelinek 2016-01-05 09:31:54 EST
"pcs resource cleanup nova-compute-clone" merely runs "crm_resource -C -r nova-compute-clone". Moving to pacemaker for further investigation.
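Since pcs is only a thin wrapper here, the cleanup can be reproduced directly with crm_resource. The minimal Python sketch below just builds (and prints, without executing) the command line pcs runs, plus a per-node variant using crm_resource's `-N` (node) option. Restricting the cleanup to overcloud-novacompute-3 is an assumed diagnostic step, not something the report says was tried; the helper name `cleanup_argv` is hypothetical.

```python
import shlex


def cleanup_argv(resource, node=None):
    """Build the crm_resource command line for a resource cleanup.

    With node=None this matches what "pcs resource cleanup <resource>" runs
    (crm_resource -C -r <resource>). Passing a node name appends -N <node>,
    a hypothetical diagnostic step that limits the cleanup to one node.
    """
    argv = ["crm_resource", "-C", "-r", resource]
    if node is not None:
        argv += ["-N", node]
    return argv


if __name__ == "__main__":
    # Cluster-wide cleanup, as issued by pcs
    print(shlex.join(cleanup_argv("nova-compute-clone")))
    # Cleanup limited to the node whose failed action persists
    print(shlex.join(cleanup_argv("nova-compute-clone", "overcloud-novacompute-3")))
```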
Comment 4 Andrew Beekhof 2016-01-10 19:36:10 EST
We may only be deleting ${resource}_last_failure_0
Comment 6 Ken Gaillot 2016-01-27 19:07:56 EST
Looking at the 04:55:02 cleanup in the logs, the resource's fail-count is correctly removed on all the compute nodes, but its operation history (which "pcs status" uses to determine failed actions) is cleared on all of them except overcloud-novacompute-3.
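Fail-counts are stored as transient node attributes, while failed actions come from the per-node operation history (lrm_rsc_op entries) in the CIB status section. A leftover history entry can therefore survive even after the fail-count is gone. The Python sketch below lists history entries with a non-zero rc-code for one resource, assuming the CIB XML was saved from `cibadmin -Q`. The sample status section is hypothetical and heavily trimmed, shaped after the failed action in the description and the `${resource}_last_failure_0` entry from Comment 4. The function name `leftover_failures` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB status section. Real lrm_rsc_op entries
# carry many more attributes (call-id, last-rc-change, op-digest, ...).
SAMPLE_STATUS = """<status>
  <node_state uname="overcloud-novacompute-0">
    <lrm>
      <lrm_resources>
        <lrm_resource id="nova-compute">
          <lrm_rsc_op id="nova-compute_last_0" operation_key="nova-compute_start_0" rc-code="0"/>
          <lrm_rsc_op id="nova-compute_monitor_10000" operation_key="nova-compute_monitor_10000" rc-code="0"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
  <node_state uname="overcloud-novacompute-3">
    <lrm>
      <lrm_resources>
        <lrm_resource id="nova-compute">
          <lrm_rsc_op id="nova-compute_last_0" operation_key="nova-compute_start_0" rc-code="0"/>
          <lrm_rsc_op id="nova-compute_last_failure_0" operation_key="nova-compute_monitor_10000" rc-code="7"/>
          <lrm_rsc_op id="nova-compute_monitor_10000" operation_key="nova-compute_monitor_10000" rc-code="0"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>"""


def leftover_failures(cib_xml, rsc_id):
    """Return (node, entry id, operation key, rc-code) tuples for every
    history entry of rsc_id whose rc-code is non-zero.

    Simplification: any non-zero rc-code counts as a failure, ignoring cases
    where a non-zero code is the expected result.
    """
    root = ET.fromstring(cib_xml)
    found = []
    for node_state in root.iter("node_state"):
        node = node_state.get("uname")
        for rsc in node_state.iter("lrm_resource"):
            if rsc.get("id") != rsc_id:
                continue
            for op in rsc.iter("lrm_rsc_op"):
                rc = int(op.get("rc-code", "0"))
                if rc != 0:
                    found.append((node, op.get("id"), op.get("operation_key"), rc))
    return found


if __name__ == "__main__":
    for entry in leftover_failures(SAMPLE_STATUS, "nova-compute"):
        print(entry)
```

In this sample, only overcloud-novacompute-3 still carries a `nova-compute_last_failure_0` entry. That matches the symptom described above: the resource is running, but a stale failure record remains in the history.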

This is likely a bug, but exactly where eludes me. A couple of questions:

The cleanup command will print messages like "Cleaning up nova-compute on overcloud-novacompute-0". Do you remember if it displayed such a message for overcloud-novacompute-3?

Are you able to reproduce the issue?
Comment 7 Raoul Scarazzini 2016-01-28 02:47:33 EST
IIRC, the cleanup message was displayed for all of the overcloud-novacompute-* nodes.

I think reproducing this specific issue will be very hard. It should (not just could) be possible using the old NovaCompute resource agent, but I don't have an environment available at the moment. I'll keep an eye out for it in the tests I'm still running.
Comment 8 Ken Gaillot 2016-05-16 12:21:57 EDT
This will not be addressed in the 7.3 timeframe.
Comment 10 Ken Gaillot 2017-03-06 18:23:16 EST
This will not be addressed in the 7.4 timeframe.
Comment 11 michal novacek 2017-08-04 07:03:09 EDT
qa-ack+: comment #9
