Bug 512512

Summary: Fenced does not update cman status upon successful node fencing
Product: [Retired] Red Hat Cluster Suite
Component: fence
Version: 4
Hardware: All
OS: Linux
Status: CLOSED UPSTREAM
Severity: medium
Priority: low
Reporter: Giacomo Bagnoli <g.bagnoli>
Assignee: Marek Grac <mgrac>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato
Doc Type: Bug Fix
Last Closed: 2009-10-30 16:51:37 UTC
Attachments: Fix (flags: none)

Description Giacomo Bagnoli 2009-07-18 10:46:14 UTC
Created attachment 354249: Fix

Description of problem:
Using the latest STABLE2 release, simulating a node failure in a two-node cluster results in a successful fence operation (using the ipmilan agent), but the cman status is not updated, so rgmanager keeps waiting for the fence operation to finish (and services are therefore not relocated).
cman_tool -f nodes reports that the victim node has not been fenced since it went offline, and clustat shows the node offline while its services remain in the "started" state on that node. Meanwhile, the logs contain:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

where <garbage_here> is a string of random characters.

I tried to trace the problem in the code and found the following in
cluster-2.03.11/fence/fenced/agent.c:

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

Then, on line 358, victim_nodename is freed:

357                 if (victim_nodename)
358                         free(victim_nodename);

and then update_cman is called with "victim" as the node name, which fails
because the node id cannot be retrieved (and garbage is printed to syslog):

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;
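
To make the failure mode concrete, here is a minimal standalone C program showing the same dangling-pointer pattern (lookup_nodename and update_cman_stub are simplified stand-ins I wrote for ccs_lookup_nodename and update_cman, not the real fenced code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for ccs_lookup_nodename(): returns 0 on success
 * and hands back a heap-allocated copy of the resolved node name. */
static int lookup_nodename(const char *name, char **resolved)
{
        *resolved = strdup(name);
        return *resolved ? 0 : -1;
}

/* Simplified stand-in for update_cman(): just prints its argument. */
static void update_cman_stub(const char *victim)
{
        printf("updating cman for node \"%s\"\n", victim);
}

int main(void)
{
        const char *victim = "node2";
        char *victim_nodename = NULL;

        if (lookup_nodename(victim, &victim_nodename) == 0)
                victim = victim_nodename;  /* victim now aliases the heap buffer */

        free(victim_nodename);             /* mirrors lines 357-358 */

        update_cman_stub(victim);          /* mirrors line 362: victim is now
                                              dangling, so garbage may be read */
        return 0;
}

Run under valgrind, the final call is flagged as a read of already-freed memory, which would explain the random characters in the syslog message above.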

I admit I don't understand why ccs_lookup_nodename returns 0 here, but
delaying the free call until after the update_cman call makes everything
work: services relocate to the other node, and when node2 comes back and
rejoins the cluster they migrate back to the original node, as expected.

A very simple patch (which fixes everything for me) is attached.
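
The reordering amounts to something like this (a sketch of the idea only; see the attachment for the actual diff):

if (!error)
        update_cman(victim, good_device);  /* victim is still valid here */

if (victim_nodename)
        free(victim_nodename);             /* delayed until after update_cman */

if (!error)
        break;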


Version-Release number of selected component (if applicable): cluster-2.03.11


How reproducible:
Always

Steps to Reproduce:
1. Simulate a failure on a node.
2. Wait for fencing to finish.

Actual results:
Cman node status is not updated and services are not relocated

Expected results:
Cman node status is updated and services are relocated

Additional info:

Comment 1 Lon Hohberger 2009-10-30 16:51:37 UTC
Your patch is correct, but this was already fixed in the STABLE2 branch in March:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962