Bug 732652

Summary: [vdsm][storage] validateStorageServerConnection fails when the mount already exists on the host
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.3
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Moran Goldboim <mgoldboi>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Moran Goldboim <mgoldboi>
CC: abaron, bazulay, danken, dmalcolm, iheim, ilvovsky, ykaul
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: vdsm-4.9-97.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 07:39:20 UTC

Attachments:
vdsm log

Description Moran Goldboim 2011-08-23 08:26:31 UTC
Created attachment 519406 [details]
vdsm log

Description of problem:
After restarting vdsm and initiating initVdsmOnUp, RHEV-M sends validateStorageServerConnection for the export domain (an FC DC in my case). I noticed that when the mount already exists, vdsm returns:
Thread-206::INFO::2011-08-23 10:38:20,222::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: validateStorageServerConnection, Return response: {'status': {'message': 'OK', 'code': 0}, 'statuslist': [{'status': 453, 'id': '3c2a6165-d61a-4a4a-b414-f973222019f3'}]}

This return comes after the following error:
Thread-206::ERROR::2011-08-23 10:38:20,217::storage_connection::268::Storage.ServerConnection::(__validateFileServer) Error during storage connection validation: Operation Stuck
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 248, in __validateFileServer
    rc = oop.fileUtils.mount(con['rp'], mountpoint, fsType)
  File "/usr/share/vdsm/storage/processPool.py", line 33, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 61, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

When the mount does not exist, vdsm returns success.

Version-Release number of selected component (if applicable):
vdsm-4.9-95.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1.have vdsm connected with FC/iscsi DC and nfs export (hsm)
2.restart vdsm
3.
  
Actual results:
validateStorageServerConnection for the export domain returns status 453; the mount attempt through the process pool times out with "Operation Stuck".

Expected results:
validateStorageServerConnection succeeds whether or not the mount already exists on the host.

Additional info:

Comment 2 Federico Simoncelli 2011-08-25 12:03:58 UTC
This is a ProcessPool issue.
Apparently ProcessPool.runExternally is sending the command but it's not picked up by the Helper.
I added some debug logging and this is the output:

send: (<function mount at 0x7f5667704f50>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
send: (<function umount at 0x7f566770e050>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
receive: (<function umount at 0x7f566770e050>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
return: (<function umount at 0x7f566770e050>, -1, None)

As we can see, the "mount" command wasn't picked up and timed out.
I still can't reproduce this on my host; it might be related to the load on the host.
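
For context, here is a minimal sketch of the kind of pipe-based dispatch described above. The names and structure are illustrative, not the actual processPool.py code: the parent sends the pickled call over a pipe and waits with a timeout, and if the forked Helper never picks the request up, poll() expires and the caller sees Timeout("Operation Stuck"), which matches the traceback in the description.

# Illustrative sketch only, not the actual vdsm processPool.py code.
# The parent thread sends (func, args, kwds) down a pipe; a forked Helper
# process is expected to receive it, run it, and send the result back.
# If the Helper never reads the request, poll() times out on the parent
# side and Timeout("Operation Stuck") is raised, as seen in the log.
import multiprocessing

class Timeout(Exception):
    pass

class MiniProcessPool(object):
    def __init__(self, timeout=60):
        self._timeout = timeout
        self._parentConn, self._childConn = multiprocessing.Pipe()
        self._helper = multiprocessing.Process(target=self._helperLoop)
        self._helper.daemon = True
        self._helper.start()

    def _helperLoop(self):
        # Runs in the forked Helper: receive a request, execute it, reply.
        while True:
            func, args, kwds = self._childConn.recv()
            try:
                self._childConn.send((func.__name__, func(*args, **kwds)))
            except Exception as e:
                self._childConn.send((func.__name__, e))

    def runExternally(self, func, *args, **kwds):
        self._parentConn.send((func, args, kwds))
        if not self._parentConn.poll(self._timeout):
            raise Timeout("Operation Stuck")
        return self._parentConn.recv()

A call such as runExternally(os.listdir, "/tmp") normally returns almost immediately; the bug is about cases where the Helper never dequeues the request at all.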

Comment 3 Federico Simoncelli 2011-08-31 15:34:56 UTC
The issue is that oop.fileUtils.mount uses execCmd, which tries to log some information. Sadly, if the logging lock was acquired before the oop fork, then we get stuck (see the links and the minimal reproduction sketched below):

 http://www.mail-archive.com/python-bugs-list@python.org/msg116242.html
 http://bugs.python.org/issue6721#msg141286

That said, we can take several approaches here:

 1) remove or make logging optional in execCmd
 2) move mount out of the oop (revert part of 7c20a19)
 3) reorganize the code so that only isStaleHandle is oop (race prone)
 4) fix the problem in python
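
For reference, a minimal reproduction of that fork-versus-logging race, along the lines of the reports linked above (illustrative only, not vdsm code):

# Minimal reproduction of the deadlock described in Python issue 6721,
# not vdsm code. One thread logs in a tight loop while the main thread
# forks; if the fork lands while the logging handler's internal lock is
# held, the child inherits that lock in the locked state and hangs the
# first time it tries to log.
import logging
import os
import threading
import time

logging.basicConfig(level=logging.DEBUG, filename="/dev/null")
log = logging.getLogger("repro")

def spam():
    while True:
        log.debug("spam")   # repeatedly takes and releases the handler lock

t = threading.Thread(target=spam)
t.daemon = True
t.start()
time.sleep(0.1)             # let the logging thread get going

pid = os.fork()
if pid == 0:
    # Child: only the forking thread exists here, but the handler lock may
    # still be recorded as held by the (no longer existing) spam thread.
    log.debug("child")      # can block forever: this is the deadlock
    os._exit(0)
os.waitpid(pid, 0)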

Comment 9 Federico Simoncelli 2011-09-01 09:44:23 UTC
commit ae8a9e17f4ccf35ef3a04f2460b906ead97b0e96
Author: Federico Simoncelli <fsimonce>
Date:   Thu Sep 1 09:39:04 2011 +0000

    BZ#732652 Avoid logging in the processPool Helper
    
    Logging can cause a deadlock in a multi-threaded, multi-process
    environment; therefore closing the logging file descriptors is
    not enough, we should also remove all the handlers.
    Reference: http://bugs.python.org/issue6721
    
    Change-Id: Ic70c2b390db23b48f96ba677054039d2773960ae

http://gerrit.usersys.redhat.com/887
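
In other words, the forked Helper now drops logging entirely, roughly along these lines (a sketch of the idea only, with a made-up function name, not the actual patch):

# Sketch of the idea behind the fix, not the actual vdsm patch. After the
# fork, the Helper removes and closes every logging handler, so nothing in
# the child ever tries to acquire a logging lock that may have been left
# locked by a thread in the parent process.
import logging

def _disableLoggingInHelper():
    loggers = [logging.getLogger()]
    loggers.extend(l for l in logging.Logger.manager.loggerDict.values()
                   if isinstance(l, logging.Logger))   # skip PlaceHolder entries
    for logger in loggers:
        for handler in list(logger.handlers):
            logger.removeHandler(handler)
            handler.close()
    logging.disable(logging.CRITICAL)   # mute anything that slips through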

Comment 15 Moran Goldboim 2011-09-22 09:55:52 UTC
verified on vdsm-4.9-97.el6

Comment 16 errata-xmlrpc 2011-12-06 07:39:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html