Bug 512512 - Fenced does not update cman status upon successful node fencing
Status: CLOSED UPSTREAM
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: fence
Version: 4
Platform: All Linux
Priority: low
Severity: medium
Assigned To: Marek Grac
QA Contact: Cluster QE
Reported: 2009-07-18 06:46 EDT by Giacomo Bagnoli
Modified: 2016-04-26 11:31 EDT

Doc Type: Bug Fix
Last Closed: 2009-10-30 12:51:37 EDT

Attachments:
Fix (663 bytes, patch), 2009-07-18 06:46 EDT, Giacomo Bagnoli
Description Giacomo Bagnoli 2009-07-18 06:46:14 EDT
Created attachment 354249 [details]
Fix

Description of problem:
Using the latest STABLE2 release, simulating a node failure in a two-node cluster results in successful fencing (using the ipmilan agent), but the cman status is not updated, so rgmanager keeps waiting for the fence operation to finish (and thus services are not relocated).
cman_tool -f nodes reports that the victim node has not been fenced since it went offline; clustat shows the node offline, but its services are still in the started state on that node. Meanwhile, in the logs:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

where <garbage_here> is a run of random characters.

I tried to trace the problem in the code, and found that in 
cluster-2.03.11/fence/fenced/agent.c

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

Then, on line 358, victim_nodename is freed:

357                 if (victim_nodename)
358                         free(victim_nodename);

and then update_cman() is called with "victim" as the node name; since victim may still point at the freed buffer, the node id cannot be retrieved (and garbage is printed to syslog):

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;

I admit I do not see why ccs_lookup_nodename() returns 0 here, but delaying the free() call until after the update_cman() call makes everything work: services relocate to the other node, and when node2 comes back and rejoins the cluster they migrate back to the original node, as expected.

A very simple patch (which fixes it all for me) is attached.


Version-Release number of selected component (if applicable): cluster-2.03.11


How reproducible:
Always

Steps to Reproduce:
1. simulate a failure on a node
2. wait for fence to finish
3.
  
Actual results:
Cman node status is not updated and services are not relocated

Expected results:
Cman node status is updated and services are relocated

Additional info:
Comment 1 Lon Hohberger 2009-10-30 12:51:37 EDT
Your patch is correct, but this was already fixed in the STABLE2 branch in March:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962
