Bug 512512

Summary: Fenced does not update cman status upon successful node fencing
Product: [Retired] Red Hat Cluster Suite
Component: fence
Version: 4
Hardware: All
OS: Linux
Status: CLOSED UPSTREAM
Severity: medium
Priority: low
Reporter: Giacomo Bagnoli <g.bagnoli>
Assignee: Marek Grac <mgrac>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato
Doc Type: Bug Fix
Last Closed: 2009-10-30 16:51:37 UTC
Attachments: Fix (flags: none)

Description Giacomo Bagnoli 2009-07-18 10:46:14 UTC
Created attachment 354249: Fix

Description of problem:
Using the latest STABLE2 release, simulating a node failure in a two-node cluster results in a successful fence operation (using the ipmilan agent), but the cman status is not updated, so rgmanager keeps waiting for the fence operation to finish (and services are therefore not relocated).
cman_tool -f nodes reports that the victim node has not been fenced since it went offline, and clustat shows the node offline while its services remain in the "started" state on that node. Meanwhile, the logs contain:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

where <garbage_here> is a string of random characters.

I tried to trace the problem in the code and found the following in
cluster-2.03.11/fence/fenced/agent.c:

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

Then, on line 358, victim_nodename is freed:

357                 if (victim_nodename)
358                         free(victim_nodename);

and then update_cman is called with "victim" as the node name, which fails
because the node id cannot be retrieved (and garbage is printed to syslog):

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;
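
To make the failure mode concrete, here is a minimal standalone C program showing the same dangling-pointer pattern (lookup_nodename and update_cman_stub are simplified stand-ins I wrote for ccs_lookup_nodename and update_cman, not the real fenced code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for ccs_lookup_nodename(): returns 0 on success
 * and hands back a heap-allocated copy of the resolved node name. */
static int lookup_nodename(const char *name, char **resolved)
{
        *resolved = strdup(name);
        return *resolved ? 0 : -1;
}

/* Simplified stand-in for update_cman(): just prints its argument. */
static void update_cman_stub(const char *victim)
{
        printf("updating cman for node \"%s\"\n", victim);
}

int main(void)
{
        const char *victim = "node2";
        char *victim_nodename = NULL;

        if (lookup_nodename(victim, &victim_nodename) == 0)
                victim = victim_nodename;  /* victim now aliases the heap buffer */

        free(victim_nodename);             /* mirrors lines 357-358 */

        update_cman_stub(victim);          /* mirrors line 362: victim is now
                                              dangling, so garbage may be read */
        return 0;
}

Run under valgrind, the final call is flagged as a read of already-freed memory, which would explain the random characters in the syslog message above.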

I admit I don't understand why ccs_lookup_nodename returns 0 here, but
delaying the free call until after the update_cman call makes everything
work: services relocate to the other node, and when node2 comes back and
rejoins the cluster they migrate back to the original node, as expected.

A very simple patch (which fixes everything for me) is attached.
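
The reordering amounts to something like this (a sketch of the idea only; see the attachment for the actual diff):

if (!error)
        update_cman(victim, good_device);  /* victim is still valid here */

if (victim_nodename)
        free(victim_nodename);             /* delayed until after update_cman */

if (!error)
        break;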


Version-Release number of selected component (if applicable): cluster-2.03.11


How reproducible:
Always

Steps to Reproduce:
1. Simulate a failure on a node.
2. Wait for fencing to finish.

Actual results:
Cman node status is not updated and services are not relocated

Expected results:
Cman node status is updated and services are relocated

Additional info:

Comment 1 Lon Hohberger 2009-10-30 16:51:37 UTC
Your patch is correct, but this was already fixed in the STABLE2 branch in March:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962