Bug 512512 - Fenced does not update cman status upon successful node fencing
Summary: Fenced does not update cman status upon successful node fencing
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: fence
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Marek Grac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-07-18 10:46 UTC by Giacomo Bagnoli
Modified: 2016-04-26 15:31 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-10-30 16:51:37 UTC
Embargoed:


Attachments
Fix (663 bytes, patch)
2009-07-18 10:46 UTC, Giacomo Bagnoli

Description Giacomo Bagnoli 2009-07-18 10:46:14 UTC
Created attachment 354249
Fix

Description of problem:
With the latest STABLE2 release, simulating a node failure in a two-node cluster results in a successful fence (using the ipmilan agent), but the cman status is not updated. rgmanager therefore keeps waiting for the fence operation to finish, so services are not relocated.
cman_tool -f nodes reports that the victim node has not been fenced since it went offline, and clustat shows the node as offline with its services still in the started state on that node. Meanwhile, the logs show:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

where <garbage_here> is a string of random characters.

I tried to trace the problem in the code, and found that in 
cluster-2.03.11/fence/fenced/agent.c

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

then on line 358 victim_nodename is freed 

357                 if (victim_nodename)
358                         free(victim_nodename);

and then update_cman is called with "victim" as the node name. Since victim
now points to the freed buffer, the call fails because the node ID cannot be
retrieved (and garbage is printed to syslog):

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;

I admit I don't understand why ccs_lookup_nodename returns 0, but moving the
free call to after the update_cman call makes everything work: services
relocate to the other node, and when node2 comes back and rejoins the
cluster they migrate back to the original node, as expected.
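
To illustrate the ordering issue, here is a minimal standalone sketch of the corrected sequence. It is not the actual fenced code or the attached patch: lookup_nodename and update_status are hypothetical stand-ins for ccs_lookup_nodename and update_cman, and the sketch assumes the lookup hands back a heap-allocated string that the caller must free (which the free call in agent.c suggests).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ccs_lookup_nodename(): on success (return 0)
 * it stores a heap-allocated copy of the name in *out; the caller frees it. */
static int lookup_nodename(const char *name, char **out)
{
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, len);
    *out = copy;
    return 0;
}

/* Hypothetical stand-in for update_cman(): records the name it was given. */
static char last_updated[64];

static void update_status(const char *victim)
{
    snprintf(last_updated, sizeof(last_updated), "%s", victim);
}

/* Corrected ordering: every use of victim happens before the free(). */
static int fence_victim(const char *victim)
{
    char *victim_nodename = NULL;
    int error = 0; /* assumption: pretend the fence agent succeeded */

    if (lookup_nodename(victim, &victim_nodename) == 0)
        victim = victim_nodename; /* victim now aliases the heap buffer */

    if (!error)
        update_status(victim);    /* buffer is still valid here */

    free(victim_nodename);        /* free(NULL) is a no-op */
    return error;
}
```

Calling fence_victim("node2") in this sketch leaves "node2" in last_updated. If the free() were moved above the update_status() call, as in the reported code, victim would be a dangling pointer and reading it would be undefined behavior, which would be consistent with the garbage seen in syslog.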

A very simple patch (which fixes everything for me) is attached.


Version-Release number of selected component (if applicable): cluster-2.03.11


How reproducible:
Always

Steps to Reproduce:
1. simulate a failure on a node
2. wait for fence to finish
  
Actual results:
Cman node status not updated and services not relocated

Expected results:
Cman node status updated and services relocated

Additional info:

Comment 1 Lon Hohberger 2009-10-30 16:51:37 UTC
Your patch is correct, but this was already fixed in the STABLE2 branch in March:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962

