Bug 700139 - fence_xvm in virtual cluster stops working: need to restart fence_virtd
Summary: fence_xvm in virtual cluster stops working: need to restart fence_virtd
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-virt
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ryan McCabe
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 756082
 
Reported: 2011-04-27 15:07 UTC by Gianluca Cecchi
Modified: 2021-05-14 11:21 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 07:24:33 UTC
Target Upstream Version:
Flags: heinzm: needinfo-



Description Gianluca Cecchi 2011-04-27 15:07:45 UTC
Description of problem:
fence_xvm in a virtual cluster composed of RHEL 5.6 guests stops working after some time, and I need to restart the fence_virtd service on the host side to get it working again.
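For reference, the host-side /etc/fence_virt.conf is along these lines (a sketch rather than the exact file: the multicast address, port and key file mirror the values visible in the fence_xvm debug output below, while the bridge interface name br0 is only illustrative):

fence_virtd {
        listener = "multicast";
        backend = "checkpoint";
        module_path = "/usr/lib64/fence-virt";
}

listeners {
        multicast {
                # defaults, as also seen in the fence_xvm debug output below
                address = "225.0.0.12";
                port = "1229";
                family = "ipv4";
                interface = "br0";
                key_file = "/etc/cluster/rhev1.key";
        }
}

backends {
        checkpoint {
                # backend-specific options omitted
        }
}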

Version-Release number of selected component (if applicable):
fence-virtd-0.2.1-5.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. cluster of RHEL 6.0 hosts (rhev1 and rhev2) with the rhel-x86_64-server-ha-6-beta repository enabled and fence-virtd-checkpoint-0.2.1-7.el6.x86_64 installed
2. virtual cluster of RHEL 5.6 guests (vorastud1 and vorastud2) configured with the fence_xvm fencing agent (cman-2.0.115-68.el5_6.1); see the cluster.conf sketch after these steps
3. from a guest try
fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null
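For step 2, the guest-side fencing configuration is roughly the following cluster.conf fragment (a generic sketch that assumes the default shared key file; the setup described here actually uses one key per host, so the real device definition differs in that respect):

<clusternodes>
        <clusternode name="vorastud1" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="xvm" domain="vorastud1"/>
                        </method>
                </fence>
        </clusternode>
        <clusternode name="vorastud2" nodeid="2" votes="1">
                <fence>
                        <method name="1">
                                <device name="xvm" domain="vorastud2"/>
                        </method>
                </fence>
        </clusternode>
</clusternodes>
<fencedevices>
        <fencedevice name="xvm" agent="fence_xvm"/>
</fencedevices>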
  
Actual results:
the command stays in a waiting state:
Debugging threshold is now 3
-- args @ 0x7fff35f31ea0 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fff35f30e50 (4096 max size)
Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
...

Expected results:
something like:
Debugging threshold is now 3
-- args @ 0x7fffc6958d20 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fffc6957cd0 (4096 max size)
Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Issuing TCP challenge
Responding to TCP challenge
TCP Exchange + Authentication done... 
Waiting for return value from XVM host
Remote: Operation failed

Additional info:

When I have the problem, if I run
[root@rhev1 cluster]# service fence_virtd restart
Stopping fence_virtd:                                      [  OK  ]
Starting fence_virtd:                                      [  OK  ]

then the fence_xvm commands begin to work again.
From guest vorastud1 I can successfully run:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

From guest vorastud2 I can successfully run:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

But typically, after a few minutes, the commands stop working again...
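When it is in the stuck state, it is probably worth stopping the service and running the daemon in the foreground with full debugging, to see whether the multicast requests still reach it at all (a rough sketch; -F keeps fence_virtd in the foreground and -d sets the debug level):

# on the affected host, e.g. rhev1
service fence_virtd stop
fence_virtd -F -d99
# then repeat the fence_xvm -o null test from a guest and watch the daemon output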
 
Let me know if I have to explicitly open a support case for this to be followed up, or if I can open a Bugzilla report against a beta repository such as rhel-x86_64-server-ha-6-beta.

Comment 2 RHEL Program Management 2011-04-28 06:00:47 UTC
Since the RHEL 6.1 External Beta has begun and this bug remains
unresolved, it has been rejected, as it was not proposed as an
exception or a blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Lon Hohberger 2011-04-28 14:24:54 UTC
Interesting, so it is rejecting the request or unable to perform the operation.

I wonder why.

Comment 4 Lon Hohberger 2011-04-28 14:25:37 UTC
Oh, sorry, misread.  It's like it's no longer responding to requests, like you said.

Comment 5 Gianluca Cecchi 2011-04-28 14:33:08 UTC
Let me know if you need any configuration files from the guest and/or host side.
At this moment it seems that the problem happens more often with one particular host (rhev1).

From a firewall point of view, for now I inserted this line into the INPUT chain in /etc/sysconfig/iptables, just to allow all the relevant traffic:
-I INPUT -d 225.0.0.12 -j ACCEPT

I don't know if there is a better/more restrictive one...
There is no line in the FORWARD chain; I suppose it is not necessary?
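Something tighter than that catch-all rule would presumably be along these lines (an untested sketch, based on the defaults visible in the fence_xvm debug output, 225.0.0.12 and port 1229, and on my understanding that fence_virtd connects back over TCP to the waiting fence_xvm client):

# host side: accept the multicast fencing requests from the guests
-A INPUT -d 225.0.0.12 -p udp --dport 1229 -j ACCEPT
# guest side: accept the TCP connect-back from fence_virtd
-A INPUT -p tcp --dport 1229 -j ACCEPT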

A general question:
if guest1, which is running on host1, runs

# fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

is that supposed to generate strace output on both fence_virtd processes, or only on the host2 one?

Comment 6 Lon Hohberger 2011-05-12 14:17:15 UTC
You'll see output on the fence_virtd side as well.

Comment 7 Gianluca Cecchi 2011-05-12 14:27:29 UTC
Sorry, but I haven't understood your comment. Does it refer to the last question in my comment #5?
Perhaps I didn't explain that question well:

1) Suppose I run on host1, where fence_virtd has pid PID1:
strace -p PID1

2) Suppose I run on host2, where fence_virtd has pid PID2:
strace -p PID2

3) Then on guest1, which is on host1, I run:
# fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

Will 3) generate output in both strace commands, 1) and 2)?

Comment 8 Lon Hohberger 2011-05-31 21:53:02 UTC
(In reply to comment #7)
> Sorry, but I haven't understood your comment. Does it refer to the last
> question in my comment #5?
> Perhaps I didn't explain that question well:
> 
> 1) Suppose I run on host1, where fence_virtd has pid PID1:
> strace -p PID1
> 
> 2) Suppose I run on host2, where fence_virtd has pid PID2:
> strace -p PID2
> 
> 3) Then on guest1, which is on host1, I run:
> # fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null
> 
> Will 3) generate output in both strace commands, 1) and 2)?

It depends on how fence_xvm's forwarding is done.  You will likely see some processing of the multicast packet sent from fence_xvm on both hosts, but then the fence_virtd process (on *one* host) will connect back to the fence_xvm instance, so you will only see that part on one host.
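For what it's worth, the two legs can be watched separately with something like this (an illustrative sketch; br0 stands for whatever interface carries the cluster traffic, and 225.0.0.12/1229 are the defaults from the debug output above):

# leg 1: the multicast request from fence_xvm, seen by every host listening on the group
tcpdump -ni br0 udp and dst host 225.0.0.12 and dst port 1229
# leg 2: the TCP connect-back from fence_virtd, seen only on the host that handles the request
tcpdump -ni br0 tcp and port 1229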

Comment 11 Michael Jansen 2012-06-15 05:56:29 UTC
Hello there!

I am also looking at using the cluster suite with KVM virtual machines,
and I have encountered some problems with the fence_virtd daemon. My
setup is not multicast: I use the serial plugin to talk from the VM to
the underlying VM host, run cman (not rgmanager), and use the
checkpoint plugin to do fencing between VMs running on different
physical hosts.

The problem I run into is this: I write a loop which fences one of a
clustered pair of VMs (the fencing is just a fence_node command with
-o reboot, run from a VM that is not part of the cluster). The
fence_node command works for a while, but each time a fence request is
sent around the corosync cpg, the fence_virtd daemons open additional
sockets (which I think are due to connections to libvirt that are not
released), and eventually you hit the 1024 open file limit and
fence_virtd stops working. It takes quite a number of fence events, of
course, to reach this stage.

I was wondering whether the problem here is an open-file problem, too.
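For what it's worth, a crude way to check whether the same kind of leak is at play in this bug would be to watch the daemon's descriptor count across a few fence events, for example (a sketch, assuming a single fence_virtd process):

# print the open file descriptor count of fence_virtd once a minute
while sleep 60; do
        echo "$(date '+%F %T') $(ls /proc/$(pidof fence_virtd)/fd | wc -l) open fds"
done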

Comment 25 Chris Feist 2015-10-14 20:59:48 UTC
Closing since we have not been able to reproduce this issue. If this is still an issue with the current cluster packages, please feel free to reopen this bug.

Comment 26 Heinz Mauelshagen 2020-03-02 14:25:04 UTC
This is still present in RHEL 8 (fence-virtd-0.4.0-7.el8.x86_64)!

Just recently I had to restart fence_virtd to get it to work again on two RHEL 8 hosts.

Comment 27 Jamie Bainbridge 2020-03-17 06:12:42 UTC
Hi Heinz,

Thanks for your report. Are you able to provide reproducer steps that consistently make this happen? Ideally we would like them as detailed as possible, starting from a bare RHEL/CentOS install.

We did a significant amount of work trying to reproduce this when it was first logged, but we could never get it to fail consistently. Our package maintainer could not get it to fail at all. I was doing High Availability technical support at that time; I had it failing in my test environment one day, but the next day it all worked fine and the problem vanished, never to be seen again.

Setting needinfo on you for this. Unfortunately, I'm afraid that without consistent steps to get this to fail, it will be impossible to fix.

Jamie

Comment 29 RHEL Program Management 2020-12-15 07:24:33 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

