Created attachment 1213753 [details]
qemu machine configuration xml

Description of problem:
I have a pacemaker cluster with sbd (storage based death) fencing configured on qemu nodes. SBD fencing means that a cluster node is rebooted as soon as it loses pacemaker quorum. All works correctly when inducing a kernel panic with 'echo c > /proc/sysrq-trigger' or when cutting the node off from the other nodes. However, running 'halt -fin' just leaves the node hanging.

Version-Release number of selected component (if applicable):
RHEL7.3 and RHEL6.7 (with the appropriate official versions)

How reproducible:
always

Steps to Reproduce:
1. configure a pacemaker cluster with sbd fencing (node reboot on quorum loss)
2. on one of the cluster nodes issue 'halt -fin'

Actual results:
the node does nothing

Expected results:
the node is rebooted, as with 'echo c > /proc/sysrq-trigger'

Additional info:
I'm logging this against qemu-kvm because it seems to me to be the most likely component to blame. I can easily provide a configured cluster to reproduce. The watchdog configuration in the qemu machines looks like this:

<devices>
  ...
  <watchdog model='i6300esb' action='reset'>
    <alias name='watchdog0'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
  </watchdog>
  ...
</devices>
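For context on the mechanism involved: sbd holds /dev/watchdog open and keeps feeding it while the node is healthy; when quorum is lost it stops feeding it, so the i6300esb timer expires and qemu performs the configured action ("reset" in the XML above). The minimal C sketch below is not sbd's actual code, just the standard Linux watchdog device pattern; node_has_quorum() is a hypothetical stand-in for the real quorum check.

/*
 * Minimal sketch of the Linux watchdog usage pattern (not sbd's code):
 * open /dev/watchdog, "pet" it while the node is healthy, and simply
 * stop petting it when quorum is lost so the timer expires and the
 * hypervisor resets the guest.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* hypothetical helper standing in for the real quorum check */
static int node_has_quorum(void)
{
    return 1;
}

int main(void)
{
    int timeout = 30;                           /* seconds */
    int fd = open("/dev/watchdog", O_WRONLY);   /* opening arms the watchdog */

    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    while (node_has_quorum()) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);          /* pet the watchdog */
        sleep(timeout / 3);
    }

    /* Quorum lost: stop petting and never close the device cleanly.
     * The timer expires and the configured action is taken. */
    pause();
    return 0;
}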
> Description of problem:
> I have a pacemaker cluster with sbd (storage based death) fencing
> configured on qemu nodes. SBD fencing means that a cluster node is
> rebooted as soon as it loses pacemaker quorum.

Is the clustering on the hosts or between guests? I'll assume hosts.

> All works correctly when inducing a kernel panic with
> 'echo c > /proc/sysrq-trigger' or when cutting the node off from the
> other nodes.

Is this on the host or in a guest? Guest?

> Version-Release number of selected component (if applicable):
> RHEL7.3 and RHEL6.7 (with the appropriate official versions)

Please distinguish between host and guest versions. RHEL7 host and RHEL6 guests?

For the host, please provide the exact package versions for kernel, qemu and libvirt. For the guests, please provide the exact kernel version.

Also, please provide the full qemu command line. The XML will give us a start. Interesting that the machine type is rhel6.3.0...

Thanks.
It's a cluster between guests. The host is not part of the cluster.

As for the versions, it happens in all the instances I tried (different or same host/guest versions, rhel6 and rhel7). As an example, I have the following configuration where it fails:

host (rhel6.7):

[root@big-01 ~]# rpm -q qemu libvirt kernel
package qemu is not installed
libvirt-0.10.2-54.el6_7.2.x86_64
kernel-2.6.32-573.8.1.el6.x86_64

guests (rhel6.9):

[root@virt-031 ~]# rpm -q qemu libvirt kernel
package qemu is not installed
package libvirt is not installed
kernel-2.6.32-671.el6.x86_64

[root@virt-031 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.9 Beta (Santiago)

Full command line used to run qemu-kvm:

/usr/libexec/qemu-kvm -name virt-031.cluster-qe.lab.eng.brq.redhat.com -S -M rhel6.3.0 -enable-kvm -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid d4c1b147-1303-3ddf-5568-35d7fccd1d3d -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/virt-031.cluster-qe.lab.eng.brq.redhat.com.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/storage-big/root-virt-031.cluster-qe.lab.eng.brq.redhat.com,if=none,id=drive-virtio-disk0,format=raw,cache=unsafe,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=140,id=hostnet0,vhost=on,vhostfd=141 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=1a:00:00:00:00:1f,bus=pci.0,addr=0x3,bootindex=1 -netdev tap,fd=144,id=hostnet1,vhost=on,vhostfd=145 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:01:00:1f,bus=pci.0,addr=0x5 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device i6300esb,id=watchdog0,bus=pci.0,addr=0x7 -watchdog-action reset -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
Did you succeed in reproducing the problem?
(In reply to michal novacek from comment #5)
> Did you succeed in reproducing the problem?

Sorry for the delay. I can reproduce using "halt -fin". By the way, what does the "-i" option do?

However, I am wondering what the expected behavior is. According to the watchdog man page, the watchdog daemon can be stopped without causing a reboot if /dev/watchdog is closed correctly. I am wondering whether halt (even with the -f flag) ends up closing /dev/watchdog cleanly. CONFIG_WATCHDOG_NOWAYOUT (which overrides this behavior) is not enabled in the RHEL kernel. I will try to build a kernel with that option enabled and see if it makes a difference.
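For reference, the "closed correctly" case is the watchdog API's "magic close" feature. A minimal sketch of it (assuming a standard driver built without CONFIG_WATCHDOG_NOWAYOUT; this is not taken from halt's source):

/*
 * "Magic close" sketch: writing 'V' before close() tells the driver
 * the close is intentional, so it disarms the timer and no reset
 * follows.  Closing without the 'V', or never closing at all, leaves
 * the watchdog armed.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);   /* arms the watchdog */
    if (fd < 0)
        return 1;

    write(fd, "\0", 1);                         /* any write counts as a keepalive */

    write(fd, "V", 1);                          /* magic character: expect close */
    close(fd);                                  /* watchdog disarmed, no reboot */
    return 0;
}

With CONFIG_WATCHDOG_NOWAYOUT enabled the driver ignores the 'V' and keeps the timer armed, which is what the rebuilt kernel in the next comment exercises.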
I compiled a guest kernel with CONFIG_WATCHDOG_NOWAYOUT and, sure enough, "halt -fin" now results in a reboot. This confirms my suspicion in comment 6 that "halt -fin" closes /dev/watchdog cleanly, which means the watchdog never fires.

Is this test procedure essential? More importantly, do you know how real watchdog hardware behaves with respect to "halt -fin"?
The procedure with 'halt -f' (the -in flags don't really add anything here) comes from another test we use to exercise cluster fencing.

It seems that the behavior is expected and not a bug. Thanks for the valuable feedback, it saved me a ton of time. Closing as NOTABUG.