Description of problem:
fence_xvm in a virtual cluster composed of RHEL 5.6 guests stops working after some time, and I need to restart the fence_virtd service on the host side to get it working again.

Version-Release number of selected component (if applicable):
fence-virtd-0.2.1-5.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Cluster of RHEL 6.0 hosts (rhev1 and rhev2) with the rhel-x86_64-server-ha-6-beta repository enabled and fence-virtd-checkpoint-0.2.1-7.el6.x86_64 installed.
2. Virtual cluster of RHEL 5.6 guests (vorastud1 and vorastud2) configured with the fence_xvm fencing agent (cman-2.0.115-68.el5_6.1).
3. From a guest, try:
   fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
   fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

Actual results:
The command stays in a waiting state:

Debugging threshold is now 3
-- args @ 0x7fff35f31ea0 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fff35f30e50 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
...

Expected results:
Something like:

Debugging threshold is now 3
-- args @ 0x7fffc6958d20 --
  args->addr = 225.0.0.12
  args->domain = vorastud1
  args->key_file = /etc/cluster/rhev1.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/rhev1.key into 0x7fffc6957cd0 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.4.5.165
Sending to 225.0.0.12 via 10.4.4.52
Waiting for connection from XVM host daemon.
Issuing TCP challenge
Responding to TCP challenge
TCP Exchange + Authentication done...
Waiting for return value from XVM host
Remote: Operation failed

Additional info:
When I have the problem, if I run

[root@rhev1 cluster]# service fence_virtd restart
Stopping fence_virtd:                                      [  OK  ]
Starting fence_virtd:                                      [  OK  ]

then the fence_xvm commands begin to work again.

From guest vorastud1 I can run successfully:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

From guest vorastud2 I can run successfully:
# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
# fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -ddd -o null

But typically after some minutes the commands stop working again...

Let me know if I have to explicitly open a case for this to be followed up, or if I can open a Bugzilla against a beta repository such as rhel-x86_64-server-ha-6-beta.
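For completeness, this is roughly how I notice that fence_virtd has stopped answering (only a sketch: the sleep interval and log path are arbitrary, and I am assuming fence_xvm exits non-zero when no host daemon connects back within its 30-second timeout):

# run on a guest, e.g. vorastud1
while true; do
    fence_xvm -H vorastud2 -k /etc/cluster/rhev2.key -o null \
        || echo "$(date): no reply from fence_virtd" >> /tmp/fence_xvm_watch.log
    sleep 60
done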
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Interesting, so it is rejecting the request or unable to perform the operation. I wonder why.
Oh, sorry, misread. It's like it's no longer responding to requests, like you said.
Let me know if you need any configuration files on the guest and/or host side. At the moment it seems that the problem happens more with one particular host (rhev1).

From a firewall point of view, for now I inserted this line in the INPUT chain in /etc/sysconfig/iptables, just to allow all the multicast traffic:

-I INPUT -d 225.0.0.12 -j ACCEPT

Dunno if there is a better/more restrictive one (a sketch of what I mean is at the end of this comment)... There is no line in the FORWARD chain... I suppose it is not necessary?

General question: if guest1, which is running on host1, runs

# fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

is it supposed to generate "strace" output on both fence_virtd processes, or only on the host2 one?
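As a more restrictive alternative, this is what I have in mind (a sketch only, assuming the default fence_virt multicast address and port that show up in the debug output above, 225.0.0.12 and 1229; IGMP may also need to be allowed for multicast group membership):

# host side: fence_virtd receives the multicast fencing request
-I INPUT -p udp -d 225.0.0.12 --dport 1229 -j ACCEPT

# guest side: fence_xvm waits for the TCP connection back from fence_virtd
# ("Waiting for connection from XVM host daemon")
-I INPUT -p tcp --dport 1229 -j ACCEPT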
You'll see output on the fence_virtd side as well.
Sorry, but I've not understood your comment. Is it about the last question in my comment#5? Perhaps I didn't explain that question well:

1) Suppose I run on host1, where fence_virtd has pid PID1:
   strace -p PID1

2) Suppose I run on host2, where fence_virtd has pid PID2:
   strace -p PID2

3) Then on guest1, which is on host1, I run:
   # fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

Will 3) generate output in both strace commands 1) and 2)?
(In reply to comment #7)
> Sorry, but I've not understood your comment. Is it about the last question in
> my comment#5? Perhaps I didn't explain that question well:
>
> 1) Suppose I run on host1, where fence_virtd has pid PID1:
>    strace -p PID1
>
> 2) Suppose I run on host2, where fence_virtd has pid PID2:
>    strace -p PID2
>
> 3) Then on guest1, which is on host1, I run:
>    # fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null
>
> Will 3) generate output in both strace commands 1) and 2)?

It depends on how fence_xvm's forwarding is done. You will likely see some processing of the multicast packet sent from fence_xvm on both hosts, but then the fence_virtd process (on *one* host) will connect back to the fence_xvm instance, so you will only see that part on one host.
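In case it helps, attaching the two traces could look something like this (just a sketch; the -f / -tt / -e trace=network options are only there to cut down the noise, nothing fence-specific):

# on host1 (PID1 = pid of fence_virtd there)
strace -f -tt -e trace=network -p PID1

# on host2 (PID2 = pid of fence_virtd there)
strace -f -tt -e trace=network -p PID2

# then, from guest1
fence_xvm -H guest2 -k /etc/cluster/host2.key -ddd -o null

Per the above, you would expect to see the multicast packet being received in both traces, but the TCP connection back to fence_xvm in only one of them.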
Hello there! I am also looking at using the cluster suite with KVM virtual machines, and I have encountered some problems with the fence_virtd daemon.

My setup is not multicast. I use the serial plugin to talk from the VM to the underlying vmhost, run cman (not rgmanager), and use the checkpoint plugin to do fencing between VMs running on different physical hosts.

The problem I run into is this: I write a loop which fences one of a clustered pair of VMs (the fencing is just a fence_node command (-o reboot) from a VM that is not part of the cluster). The fence_node command works for a while, but each time a fence request is sent around the corosync cpg, the fence_virtd daemons open additional sockets (which I think are due to connections to libvirt that are not released), and eventually you hit the 1024 open file limit and fence_virtd stops working. It takes quite a number of fence events, of course, to reach this stage; a sketch of the loop and the file descriptor check is below.

I was wondering whether the problem reported here involves an open file problem, too.
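To make that concrete, the test driver is roughly the following (a sketch only; "clusternode1" and the sleep interval are placeholders, not my real setup):

# on the VM that is outside the cluster: repeatedly fence one member of the pair
while true; do
    fence_node clusternode1
    sleep 120
done

# on each physical host: watch fence_virtd's open file descriptor count grow
ls /proc/$(pidof fence_virtd)/fd | wc -l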
Closing since we have not been able to reproduce this issue. If this is still an issue with the current cluster packages, please feel free to re-open this bug.
This is still present in RHEL 8 (fence-virtd-0.4.0-7.el8.x86_64)! Just recently I had to restart fence_virtd to get it to work again on two RHEL 8 hosts.
Hi Heinz,

Thanks for your report. Are you able to provide reproducer steps which consistently make this happen? Ideally we would like them as detailed as possible, starting from a bare RHEL/CentOS install.

We did a significant amount of work trying to reproduce this when it was logged, but we could never get it to fail consistently. Our package maintainer could not get it to fail at all. I was doing High Availability technical support at that time; I had it failing in my test environment one day, but the next day it all worked fine and the problem vanished, never to be seen again.

Setting needinfo on you for this. Unfortunately, I'm afraid that without consistent steps to get this to fail, it will be impossible to fix.

Jamie
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.