Bug 700139 - fence_xvm in virtual cluster stops working: need to restart fence_virtd
Summary: fence_xvm in virtual cluster stops working: need to restart fence_virtd
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-virt
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ryan McCabe
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 756082
 
Reported: 2011-04-27 15:07 UTC by Gianluca Cecchi
Modified: 2021-05-14 11:21 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 07:24:33 UTC
Target Upstream Version:
Flags: heinzm: needinfo-



Description Gianluca Cecchi 2011-04-27 15:07:45 UTC
Description of problem:
fence_xvm in a virtual cluster composed of RHEL 5.6 guests stops working after some time, and I need to restart the fence_virtd service on the host side to get it working again.
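For reference, the host-side /etc/fence_virt.conf is along these lines (a sketch rather than the exact file: the multicast address, port and key file mirror the values visible in the fence_xvm debug output below, while the bridge interface name br0 is only illustrative):

fence_virtd {
        listener = "multicast";
        backend = "checkpoint";
        module_path = "/usr/lib64/fence-virt";
}

listeners {
        multicast {
                # defaults, as also seen in the fence_xvm debug output below
                address = "225.0.0.12";
                port = "1229";
                family = "ipv4";
                interface = "br0";
                key_file = "/etc/cluster/rhev1.key";
        }
}

backends {
        checkpoint {
                # backend-specific options omitted
        }
}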

Version-Release number of selected component (if applicable):
fence-virtd-0.2.1-5.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. cluster of RHEL 6.0 hosts (rhev1 and rhev2) with the rhel-x86_64-server-ha-6-beta repository enabled and fence-virtd-checkpoint-0.2.1-7.el6.x86_64 installed
2. virtual cluster of RHEL 5.6 guests (vorastud1 and vorastud2) configured with the fence_xvm fencing agent (cman-2.0.115-68.el5_6.1); see the cluster.conf sketch after these steps
3. from a guest try
fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null
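For step 2, the guest-side fencing configuration is roughly the following cluster.conf fragment (a generic sketch that assumes the default shared key file; the setup described here actually uses one key per host, so the real device definition differs in that respect):

<clusternodes>
        <clusternode name="vorastud1" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="xvm" domain="vorastud1"/>
                        </method>
                </fence>
        </clusternode>
        <clusternode name="vorastud2" nodeid="2" votes="1">
                <fence>
                        <method name="1">
                                <device name="xvm" domain="vorastud2"/>
                        </method>
                </fence>
        </clusternode>
</clusternodes>
<fencedevices>
        <fencedevice name="xvm" agent="fence_xvm"/>
</fencedevices>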
  
Actual results:
the command stays in a waiting state:
Debugging threshold is now 3
-- args @ 0x7fff35f31ea0 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fff35f30e50 (4096 max size)
Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
...

Expected results:
something like:
Debugging threshold is now 3
-- args @ 0x7fffc6958d20 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fffc6957cd0 (4096 max size)
Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Issuing TCP challenge
Responding to TCP challenge
TCP Exchange + Authentication done... 
Waiting for return value from XVM host
Remote: Operation failed

Additional info:

When I have the problem, if I run
[root@rhev1 cluster]# service fence_virtd restart
Stopping fence_virtd:                                      [  OK  ]
Starting fence_virtd:                                      [  OK  ]

then the fence_xvm commands begin to work again.
From guest vorastud1 I can successfully run:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

From guest vorastud2 I can successfully run:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

But typically, after a few minutes, the commands stop working again...
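When it is in the stuck state, it is probably worth stopping the service and running the daemon in the foreground with full debugging, to see whether the multicast requests still reach it at all (a rough sketch; -F keeps fence_virtd in the foreground and -d sets the debug level):

# on the affected host, e.g. rhev1
service fence_virtd stop
fence_virtd -F -d99
# then repeat the fence_xvm -o null test from a guest and watch the daemon output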
 
Let me know if I have to explicitly open a support case for this to be followed up, or if I can open a Bugzilla report against a beta repository such as rhel-x86_64-server-ha-6-beta.

Comment 2 RHEL Program Management 2011-04-28 06:00:47 UTC
Since the RHEL 6.1 External Beta has begun and this bug remains
unresolved, it has been rejected, as it was not proposed as an
exception or a blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Lon Hohberger 2011-04-28 14:24:54 UTC
Interesting, so it is rejecting the request or unable to perform the operation.

I wonder why.

Comment 4 Lon Hohberger 2011-04-28 14:25:37 UTC
Oh, sorry, misread.  It's like it's no longer responding to requests, like you said.

Comment 5 Gianluca Cecchi 2011-04-28 14:33:08 UTC
Let me know if you need any configuration files from the guest and/or host side.
At this moment it seems that the problem happens more often with one particular host (rhev1).

From a firewall point of view, for now I inserted this line into the INPUT chain in /etc/sysconfig/iptables, just to allow all the relevant traffic:
-I INPUT -d 225.0.0.12 -j ACCEPT

I don't know if there is a better/more restrictive one...
There is no line in the FORWARD chain; I suppose it is not necessary?
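Something tighter than that catch-all rule would presumably be along these lines (an untested sketch, based on the defaults visible in the fence_xvm debug output, 225.0.0.12 and port 1229, and on my understanding that fence_virtd connects back over TCP to the waiting fence_xvm client):

# host side: accept the multicast fencing requests from the guests
-A INPUT -d 225.0.0.12 -p udp --dport 1229 -j ACCEPT
# guest side: accept the TCP connect-back from fence_virtd
-A INPUT -p tcp --dport 1229 -j ACCEPT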

A general question:
if guest1, which is running on host1, runs

# fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

is that supposed to generate strace output on both fence_virtd processes, or only on the host2 one?

Comment 6 Lon Hohberger 2011-05-12 14:17:15 UTC
You'll see output on the fence_virtd side as well.

Comment 7 Gianluca Cecchi 2011-05-12 14:27:29 UTC
Sorry, but I haven't understood your comment. Does it refer to the last question in my comment #5?
Perhaps I didn't explain that question well:

1) Suppose I run on host1, where fence_virtd has pid PID1:
strace -p PID1

2) Suppose I run on host2, where fence_virtd has pid PID2:
strace -p PID2

3) Then on guest1, which is on host1, I run:
# fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

Will 3) generate output in both strace commands, 1) and 2)?

Comment 8 Lon Hohberger 2011-05-31 21:53:02 UTC
(In reply to comment #7)
> Sorry, but I haven't understood your comment. Does it refer to the last
> question in my comment #5?
> Perhaps I didn't explain that question well:
> 
> 1) Suppose I run on host1, where fence_virtd has pid PID1:
> strace -p PID1
> 
> 2) Suppose I run on host2, where fence_virtd has pid PID2:
> strace -p PID2
> 
> 3) Then on guest1, which is on host1, I run:
> # fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null
> 
> Will 3) generate output in both strace commands, 1) and 2)?

It depends on how fence_xvm's forwarding is done.  You will likely see some processing of the multicast packet sent from fence_xvm on both hosts, but then the fence_virtd process (on *one* host) will connect back to the fence_xvm instance, so you will only see that part on one host.
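For what it's worth, the two legs can be watched separately with something like this (an illustrative sketch; br0 stands for whatever interface carries the cluster traffic, and 225.0.0.12/1229 are the defaults from the debug output above):

# leg 1: the multicast request from fence_xvm, seen by every host listening on the group
tcpdump -ni br0 udp and dst host 225.0.0.12 and dst port 1229
# leg 2: the TCP connect-back from fence_virtd, seen only on the host that handles the request
tcpdump -ni br0 tcp and port 1229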

Comment 11 Michael Jansen 2012-06-15 05:56:29 UTC
Hello there!

I am also looking at using the cluster suite with KVM virtual machines,
and I have encountered some problems with the fence_virtd daemon. My
setup is not multicast: I use the serial plugin to talk from the VM to
the underlying VM host, run cman (not rgmanager), and use the
checkpoint plugin to do fencing between VMs running on different
physical hosts.

The problem I run into is this: I write a loop which fences one of a
clustered pair of VMs (the fencing is just a fence_node command with
-o reboot, run from a VM that is not part of the cluster). The
fence_node command works for a while, but each time a fence request is
sent around the corosync cpg, the fence_virtd daemons open additional
sockets (which I think are due to connections to libvirt that are not
released), and eventually you hit the 1024 open file limit and
fence_virtd stops working. It takes quite a number of fence events, of
course, to reach this stage.

I was wondering whether the problem here is an open-file problem, too.
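For what it's worth, a crude way to check whether the same kind of leak is at play in this bug would be to watch the daemon's descriptor count across a few fence events, for example (a sketch, assuming a single fence_virtd process):

# print the open file descriptor count of fence_virtd once a minute
while sleep 60; do
        echo "$(date '+%F %T') $(ls /proc/$(pidof fence_virtd)/fd | wc -l) open fds"
done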

Comment 25 Chris Feist 2015-10-14 20:59:48 UTC
Closing since we have not been able to reproduce this issue. If this is still an issue with the current cluster packages, please feel free to reopen this bug.

Comment 26 Heinz Mauelshagen 2020-03-02 14:25:04 UTC
This is still present in RHEL 8 (fence-virtd-0.4.0-7.el8.x86_64)!

Just recently I had to restart fence_virtd to get it to work again on two RHEL 8 hosts.

Comment 27 Jamie Bainbridge 2020-03-17 06:12:42 UTC
Hi Heinz,

Thanks for your report. Are you able to provide reproducer steps that consistently make this happen? Ideally we would like them as detailed as possible, starting from a bare RHEL/CentOS install.

We did a significant amount of work trying to reproduce this when it was first logged, but we could never get it to fail consistently. Our package maintainer could not get it to fail at all. I was doing High Availability technical support at that time; I had it failing in my test environment one day, but the next day it all worked fine and the problem vanished, never to be seen again.

Setting needinfo on you for this. Unfortunately, I'm afraid that without consistent steps to get this to fail, it will be impossible to fix.

Jamie

Comment 29 RHEL Program Management 2020-12-15 07:24:33 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

