Created attachment 1171527 [details]
engine_log

Description of problem:
Migrating a VM fails with "Bad file descriptor" in VDSM, regardless of the migration policy used. The migration fails and the host becomes non-responsive; the host recovers after a few minutes.

Version-Release number of selected component (if applicable):
RHEVM: 4.0.0.6-0.1.el7ev
Hosts:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.22.2.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.4-2.el7ev

How reproducible:
100%

Steps to Reproduce:
1. Create a VM
2. Migrate the VM under each migration policy

Actual results:
1. Migration fails
2. Host becomes non-responsive

Expected results:
VM migrates successfully

Additional info:
From the VDSM log:

Thread-377::ERROR::2016-06-23 16:29:34,932::migration::381::virt.vm::(run) vmId=`242d23fd-4226-4f9f-a83f-04e0d3f433ec`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 334, in run
    self._setupVdsConnection()
  File "/usr/share/vdsm/virt/migration.py", line 189, in _setupVdsConnection
    self._destServer.ping()
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 146, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 350, in call
    call = self.call_async(*reqs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 356, in call_async
    self.call_cb(call.callback, *reqs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 372, in call_cb
    self._transport.send(ctx.encode())
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 548, in send
    headers,
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 407, in send
    self._reactor.wakeup()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 234, in wakeup
    self._wakeupEvent.set()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 175, in set
    self._eventfd.write(1)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 104, in write
    self._verify_code(rv)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 9] Bad file descriptor

*************************************************

[root@virt-nested-vm09 ~]# ps -ef | grep vdsmd
root      8242  3518  0 16:40 pts/0    00:00:00 grep --color=auto vdsmd
[root@virt-nested-vm09 ~]# lsof -nPp 8242
[root@virt-nested-vm09 ~]# lsof -nPp 3518
COMMAND  PID USER   FD   TYPE DEVICE  SIZE/OFF   NODE NAME
bash    3518 root  cwd    DIR  253,3      4096 482385 /root
bash    3518 root  rtd    DIR  253,3      4096      2 /
bash    3518 root  txt    REG  253,3    960376 875607 /usr/bin/bash
bash    3518 root  mem    REG  253,3     61928 875554 /usr/lib64/libnss_files-2.17.so
bash    3518 root  mem    REG  253,3 106065056 909344 /usr/lib/locale/locale-archive
bash    3518 root  mem    REG  253,3   2112384 875536 /usr/lib64/libc-2.17.so
bash    3518 root  mem    REG  253,3     19520 875542 /usr/lib64/libdl-2.17.so
bash    3518 root  mem    REG  253,3    174528 875605 /usr/lib64/libtinfo.so.5.9
bash    3518 root  mem    REG  253,3    164440 879867 /usr/lib64/ld-2.17.so
bash    3518 root  mem    REG  253,3     26254 909342 /usr/lib64/gconv/gconv-modules.cache
bash    3518 root    0u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root    1u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root    2u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root  255u   CHR  136,0       0t0      3 /dev/pts/0

(Note: ps shows no running vdsmd process; the only match is the grep itself.)
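For context on the failure mode: the traceback bottoms out in vdsm's eventfd wrapper, i.e. the reactor's wakeup channel is being written to after its file descriptor has been closed, so every JSON-RPC send over the STOMP transport (including the simple ping() to the destination host) fails the same way. Below is a minimal sketch of that error class, using a plain pipe as a stand-in for vdsm's eventfd wrapper; the wrapper itself and vdsm's actual fd lifecycle are not reproduced here.

# Minimal sketch (not vdsm code): writing to a file descriptor that has
# already been closed raises OSError with errno.EBADF ("Bad file
# descriptor"), matching the traceback above. A plain pipe stands in
# for the reactor's eventfd-based wakeup channel.
import errno
import os

rfd, wfd = os.pipe()    # stand-in for the reactor's wakeup eventfd
os.write(wfd, b'\x01')  # normal wakeup: one byte unblocks the reactor loop
os.close(wfd)           # the fd gets closed, e.g. during a racy teardown

try:
    os.write(wfd, b'\x01')  # any later wakeup attempt now fails
except OSError as e:
    assert e.errno == errno.EBADF
    print("OSError: [Errno %d] %s" % (e.errno, e.strerror))
os.close(rfd)           # clean up the read end of the sketch's pipe

Once the wakeup fd is gone, every sender hits the same EBADF, which is consistent with the migration failing as early as _setupVdsConnection().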
Created attachment 1171528 [details]
source_vdsm_log
Did it happen on the first migration, or only after some time, i.e. after several successful migrations?
What happens if you downgrade VDSM to the previous release? Where is the destination VDSM log file? And the libvirt log file on the destination?
Does it reproduce on another setup? Piotr, can it be related to the latest fix? I guess we will know once Israel tries it on the previous environment.
It was already verified against vdsm-4.18.3; this is a regression in vdsm-4.18.4 introduced by https://gerrit.ovirt.org/#/c/59106/
Moved to 4.0.1, as the patch that caused the issue was reverted from oVirt 4.0 GA and RHEV 4.0 beta 1.
The proposed fix was verified by a couple hundred migrations over a few hours on Sefi's setup running 4.18.4-2 with patch 59720 applied. The code fix shouldn't affect the original hosted-engine HA broker area, so it should be safe, and bug 1343005 doesn't really need to be retested (though it never hurts, of course :)). These are the only two json-rpc clients at the moment, so hopefully we're covered.
*** Bug 1349885 has been marked as a duplicate of this bug. ***
Created attachment 1172448 [details]
destination_vdsm.log
Created attachment 1172450 [details]
dest_libvirt.log
Created attachment 1172451 [details]
src_libvirt.log
Verified with:
Engine: ovirt-engine-4.0.2-0.2.rc1.el7ev.noarch
Hosts:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.22.2.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev
SPICE Version: 0.12.4 - 15.el7_2.1

Steps:
With each migration policy: migrate the VM, then move the host to maintenance.

Results:
All cases PASS.
Since the problem described in this bug report should be resolved in oVirt 4.0.1, released on July 19th 2016, it has been closed with a resolution of CURRENT RELEASE. For information on the release, and how to update to it, follow the link below. If the solution does not work for you, open a new bug report.

http://www.ovirt.org/release/4.0.1/