Bug 1151863 - Error during successful migration: [Errno 9] Bad file descriptor
Summary: Error during successful migration: [Errno 9] Bad file descriptor
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ovirt-3.6.3
: 4.17.19
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-10-12 18:34 UTC by Nir Soffer
Modified: 2016-02-17 07:23 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-17 07:23:18 UTC
oVirt Team: Virt
Embargoed:
michal.skrivanek: ovirt-3.6.z?
rule-engine: planning_ack?
michal.skrivanek: devel_ack+
rule-engine: testing_ack?


Attachments (Terms of Use)
vdsm log from rhel 7 host (61.22 KB, application/x-xz)
2014-10-12 18:34 UTC, Nir Soffer
no flags Details
vdsm log from rhel 6 host (36.18 KB, application/x-xz)
2014-10-12 18:35 UTC, Nir Soffer
no flags Details
Vdsm log using debug log level with current master (72.24 KB, application/x-xz)
2015-11-28 21:01 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 47404 0 None None None Never
oVirt gerrit 52758 0 ovirt-3.6 MERGED guest-agent: desktopLock exception on EBADF should be ignored 2016-01-26 15:39:04 UTC

Description Nir Soffer 2014-10-12 18:34:36 UTC
Created attachment 946136 [details]
vdsm log from rhel 7 host

Description of problem:

During successful migration, we see several errors in vdsm log:

Thread-56::ERROR::2014-10-12 21:21:53,498::guestagent::356::vm.Vm::(desktopLock) vmId=`e7ae437a-e487-46c2-a293-95f7f0659925`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/guestagent.py", line 354, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/virt/guestagent.py", line 224, in _forward
    self._sock.send(message)
  File "/usr/lib64/python2.6/socket.py", line 167, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor

The guest is running Fedora 20 from tlv pxe, probably without
guest agent (I did not install one).

Version-Release number of selected component (if applicable):
master Oct 12

How reproducible:
Seen on both rhel 7 and 6.5, 2/2 migrations.

Steps to Reproduce:
1. Run vm on rhel 7 or 6.5 machine with one iscsi disk
2. Migrate to other machine
3. Migration succeeds

Actual results:
Unexpected errors in log

Expected results:
Clean log

Comment 1 Nir Soffer 2014-10-12 18:35:27 UTC
Created attachment 946137 [details]
vdsm log from rhel 6 host

Comment 2 Michal Skrivanek 2014-11-24 10:14:14 UTC
it is expected in this flow. Maybe make the err msg a bit more pretty / info level.

Comment 3 Michal Skrivanek 2015-06-05 13:19:50 UTC
still want to consider this "soon"

Comment 4 Red Hat Bugzilla Rules Engine 2015-10-21 18:28:00 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 6 Yaniv Lavi 2015-10-29 12:08:23 UTC
In oVirt testing is done on single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 7 Nir Soffer 2015-11-28 20:58:28 UTC
The previous logs in the bug used log level info, making the flow unclear.

I reproduce the issue again with current master and some patches for
improving logging:
https://gerrit.ovirt.org/49343

In the old logs, we had 4 different threads calling desktopLock and
failing in the same time. I could not reproduce this now, maybe because
the code was changed, or maybe something with the setup was different.
Now we see only one failed desktopLock call for each migration.

To reproduce:
1. Have 2 hosts
2. Start vm on one host
3. Open console
4. Migrate to other host

The attached log contains 3 instances of this error. Here is annotated
version of one of them:

1. Migration finished, in migration downtime thread (Thread-114)

2015-11-28 17:58:25,846 DEBUG [virt.vm] (Thread-114) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::stopping migration downtime thread [migration:454(stop)]
2015-11-28 17:58:25,846 INFO [virt.vm] (Thread-114) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::migration took 10 seconds to complete [migration:368(_startUnderlyingMigration)]
2015-11-28 17:58:25,848 DEBUG [virt.vm] (Thread-115) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::migration downtime thread exiting [migration:451(run)]
2015-11-28 17:58:25,849 INFO [virt.vm] (Thread-114) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::Changed state to Down: Migration succeeded (code=4) [vm:1293(setDownStatus)]

2. Guest agent stopped

2015-11-28 17:58:25,849 INFO [virt.vm] (Thread-114) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::Stopping connection [guestagent:336(stop)]
2015-11-28 17:58:25,849 DEBUG [vds] (Thread-114) Delete fileno 34 from listener. [vmchannels:214(unregister)]

At this point, guest agent socket is closed, and guest agent
_stopped variable is True.

3. 2 seconds later, another thread (Thread-117) call desktopLock

2015-11-28 17:58:27,448 DEBUG [virt.vm] (Thread-117) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::desktopLock called [guestagent:376(desktopLock)]
2015-11-28 17:58:27,448 ERROR [virt.vm] (Thread-117) vmId=`ecc85cb8-a80f-4822-8c53-a07423939499`::desktopLock failed [guestagent:379(desktopLock)]
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/guestagent.py", line 377, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/virt/guestagent.py", line 244, in _forward
    self._sock.sendall(message)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
  File "/usr/lib64/python2.7/socket.py", line 170, in _dummy
    raise error(EBADF, 'Bad file descriptor')

The call failed because the socket was closed by Thread-114.

The main issue seems to be incorrect usage of guest agent from 2 threads
One thread stop it, and the other try to send a message to a stopped
guest agent.

A secondary issue is failing in a more clear way when the guest agent 
is stopped. This can be hard to to with the current code, since it uses
the _stopped variable also for keeping the connection state
(stopped == disconnected). It seems that guest agent should keep more
states (stopped, connecting, connected).

Comment 8 Nir Soffer 2015-11-28 21:01:07 UTC
Created attachment 1100001 [details]
Vdsm log using debug log level with current master

Comment 9 Gil Klein 2016-02-17 07:23:18 UTC
This bug was fixed and is slated to be in the upcoming version. As we
are focusing our testing at this phase on severe bugs, this bug was
closed without going through its verification step. If you think this
bug should be verified by QE, please set its severity to high and move
it back to ON_QA


Note You need to log in before you can comment on or make changes to this bug.