Bug 1349461 - Migrate Failed with "Bad file descriptor" error in vdsm
Summary: Migrate Failed with "Bad file descriptor" error in vdsm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.18.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.0.1
Target Release: 4.18.5
Assignee: Piotr Kliczewski
QA Contact: Israel Pinto
URL:
Whiteboard:
Duplicates: 1349885
Depends On:
Blocks: 1343005
 
Reported: 2016-06-23 13:43 UTC by Israel Pinto
Modified: 2016-07-19 06:25 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 06:25:37 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.0.z+
pm-rhel: blocker+
ipinto: testing_plan_complete+
rule-engine: planning_ack+
michal.skrivanek: devel_ack+
gklein: testing_ack+


Attachments
engine_log (231.23 KB, application/zip)
2016-06-23 13:43 UTC, Israel Pinto
source_vdsm_log (279.82 KB, application/zip)
2016-06-23 13:44 UTC, Israel Pinto
destination_vdsm.log (18.77 MB, text/plain)
2016-06-26 07:10 UTC, Israel Pinto
dest_libvirt.log (11.98 MB, application/x-gzip)
2016-06-26 07:18 UTC, Israel Pinto
src_libvirt.log (11.98 MB, application/x-gzip)
2016-06-26 07:21 UTC, Israel Pinto


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 59720 0 master MERGED migration: usage of single reactor in vdsm 2020-03-31 15:18:51 UTC
oVirt gerrit 59921 0 ovirt-4.0 MERGED migration: usage of single reactor in vdsm 2020-03-31 15:18:51 UTC

Description Israel Pinto 2016-06-23 13:43:06 UTC
Created attachment 1171527 [details]
engine_log

Description of problem:
Migration fails with "Bad file descriptor" in vdsm when migrating a VM; this happens with all migration policies.
The migration fails and the host becomes non-responsive; the host recovers after a few minutes.
 
Version-Release number of selected component (if applicable):
RHEVM: 4.0.0.6-0.1.el7ev
HOSTS:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.22.2.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.4-2.el7ev


How reproducible:
100%

Steps to Reproduce:
1. Create VM
2. Migrate VM in each policy


Actual results:
1. Migration fails
2. Host becomes non-responsive

Expected results:
VM migrates successfully


Additional info:
From VDSM log:

Thread-377::ERROR::2016-06-23 16:29:34,932::migration::381::virt.vm::(run) vmId=`242d23fd-4226-4f9f-a83f-04e0d3f433ec`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 334, in run
    self._setupVdsConnection()
  File "/usr/share/vdsm/virt/migration.py", line 189, in _setupVdsConnection
    self._destServer.ping()
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 146, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 350, in call
    call = self.call_async(*reqs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 356, in call_async
    self.call_cb(call.callback, *reqs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 372, in call_cb
    self._transport.send(ctx.encode())
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 548, in send
    headers,
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 407, in send
    self._reactor.wakeup()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 234, in wakeup
    self._wakeupEvent.set()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 175, in set
    self._eventfd.write(1)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 104, in write
    self._verify_code(rv)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 9] Bad file descriptor
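The failure is in the reactor's wakeup path: `_wakeupEvent.set()` writes to an eventfd whose underlying file descriptor has already been closed, so the kernel returns EBADF. The minimal sketch below reproduces this failure mode using a plain pipe in place of vdsm's eventfd wrapper; the `WakeupEvent` class and its method names are illustrative only, not vdsm's actual API:

```python
import errno
import os


class WakeupEvent:
    """Minimal stand-in for an asyncore wakeup event backed by a pipe.

    vdsm's betterAsyncore uses a Linux eventfd; a pipe behaves the same
    way for this demonstration.
    """

    def __init__(self):
        self._rfd, self._wfd = os.pipe()

    def set(self):
        # Writing one byte wakes up the reactor's poll loop.
        os.write(self._wfd, b"\0")

    def close(self):
        os.close(self._rfd)
        os.close(self._wfd)


def demo():
    ev = WakeupEvent()
    ev.set()        # works while the descriptors are open
    ev.close()      # e.g. the owning reactor was shut down
    try:
        ev.set()    # a stale reference now hits EBADF, as in the bug
        return None
    except OSError as e:
        return e.errno


print(demo())  # -> 9 (errno.EBADF)
```

The traceback above matches this pattern: a client kept a reference to a reactor wakeup event whose descriptors had been closed underneath it.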

*************************************************
[root@virt-nested-vm09 ~]# ps -ef | grep vdsmd
root      8242  3518  0 16:40 pts/0    00:00:00 grep --color=auto vdsmd
[root@virt-nested-vm09 ~]# lsof -nPp 8242
[root@virt-nested-vm09 ~]# lsof -nPp 3518
COMMAND  PID USER   FD   TYPE DEVICE  SIZE/OFF   NODE NAME
bash    3518 root  cwd    DIR  253,3      4096 482385 /root
bash    3518 root  rtd    DIR  253,3      4096      2 /
bash    3518 root  txt    REG  253,3    960376 875607 /usr/bin/bash
bash    3518 root  mem    REG  253,3     61928 875554 /usr/lib64/libnss_files-2.17.so
bash    3518 root  mem    REG  253,3 106065056 909344 /usr/lib/locale/locale-archive
bash    3518 root  mem    REG  253,3   2112384 875536 /usr/lib64/libc-2.17.so
bash    3518 root  mem    REG  253,3     19520 875542 /usr/lib64/libdl-2.17.so
bash    3518 root  mem    REG  253,3    174528 875605 /usr/lib64/libtinfo.so.5.9
bash    3518 root  mem    REG  253,3    164440 879867 /usr/lib64/ld-2.17.so
bash    3518 root  mem    REG  253,3     26254 909342 /usr/lib64/gconv/gconv-modules.cache
bash    3518 root    0u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root    1u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root    2u   CHR  136,0       0t0      3 /dev/pts/0
bash    3518 root  255u   CHR  136,0       0t0      3 /dev/pts/0

Comment 1 Israel Pinto 2016-06-23 13:44:02 UTC
Created attachment 1171528 [details]
source_vdsm_log

Comment 2 Michal Skrivanek 2016-06-23 14:04:48 UTC
Did it happen on the first migration, or only after some time, after several successful migrations?

Comment 3 Yaniv Kaul 2016-06-23 14:13:13 UTC
What happens if you downgrade VDSM to the previous release?
Where is the destination VDSM log file?
Libvirt log file on destination?

Comment 6 Oved Ourfali 2016-06-23 17:50:40 UTC
Does it reproduce on another setup? 
Piotr, can it be related to the latest fix? 
I guess we will know once Israel tries it on the previous environment.

Comment 7 Michal Skrivanek 2016-06-23 18:33:05 UTC
It was already verified on vdsm-4.18.3; this is a regression introduced in vdsm-4.18.4 by https://gerrit.ovirt.org/#/c/59106/

Comment 8 Martin Perina 2016-06-24 11:33:09 UTC
Moved to 4.0.1, as the patch that caused the issue was reverted from oVirt 4.0 GA and RHEV 4.0 beta1.

Comment 9 Michal Skrivanek 2016-06-24 11:46:31 UTC
Proposed fix verified by a couple hundred migrations over a few hours on Sefi's setup, running 4.18.4-2 with patch 59720 applied.

The code fix shouldn't affect the original hosted-engine-ha broker area, so it should be safe, and bug 1343005 doesn't really need to be retested (though it never hurts, of course :)
These are the only two json-rpc clients at the moment, so hopefully we're covered.
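The merged fix ("migration: usage of single reactor in vdsm", gerrit 59720 in the links above) makes migration reuse vdsm's one long-lived reactor rather than creating a separate one, so no client ends up holding a reference to a reactor whose eventfd has been closed. A hypothetical sketch of that shared-instance idea (the names and structure here are illustrative, not vdsm's actual code):

```python
import threading


class Reactor:
    """Illustrative stand-in for yajsonrpc's stomp reactor."""

    def __init__(self):
        self.running = True


_reactor = None
_lock = threading.Lock()


def get_reactor():
    """Return the single shared reactor, creating it lazily.

    All json-rpc clients obtain the reactor through this accessor, so
    its wakeup eventfd stays open for the life of the process instead
    of being torn down with a short-lived per-client reactor.
    """
    global _reactor
    with _lock:
        if _reactor is None:
            _reactor = Reactor()
        return _reactor
```

With this pattern, every call returns the same instance, e.g. `get_reactor() is get_reactor()` is always true.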

Comment 10 Piotr Kliczewski 2016-06-24 13:29:30 UTC
*** Bug 1349885 has been marked as a duplicate of this bug. ***

Comment 11 Israel Pinto 2016-06-26 07:10:39 UTC
Created attachment 1172448 [details]
destination_vdsm.log

Comment 12 Israel Pinto 2016-06-26 07:18:26 UTC
Created attachment 1172450 [details]
dest_libvirt.log

Comment 13 Israel Pinto 2016-06-26 07:21:13 UTC
Created attachment 1172451 [details]
src_libvirt.log

Comment 14 Israel Pinto 2016-07-04 06:30:46 UTC
Verified with:
Engine:
ovirt-engine-4.0.2-0.2.rc1.el7ev.noarch
Hosts:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.22.2.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev
SPICE Version: 0.12.4 - 15.el7_2.1

Steps:
For each migration policy: migrate the VM, then put the host into maintenance.
Results:
All cases PASS

Comment 15 Sandro Bonazzola 2016-07-19 06:25:37 UTC
Since the problem described in this bug report should be resolved in oVirt 4.0.1, released on July 19th 2016, it has been closed with a resolution of CURRENT RELEASE.

For information on the release, and how to update to this release, follow the link below.

If the solution does not work for you, open a new bug report.

http://www.ovirt.org/release/4.0.1/

