Created attachment 519406 [details]
vdsm log

Description of problem:
After restarting vdsm and initiating initVdsmOnUp, rhevm sends validateStorageServerConnection for the export domain (an FC DC in my case). I have noticed that when the mount already exists vdsm returns:

Thread-206::INFO::2011-08-23 10:38:20,222::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: validateStorageServerConnection, Return response: {'status': {'message': 'OK', 'code': 0}, 'statuslist': [{'status': 453, 'id': '3c2a6165-d61a-4a4a-b414-f973222019f3'}]}

after:

Thread-206::ERROR::2011-08-23 10:38:20,217::storage_connection::268::Storage.ServerConnection::(__validateFileServer) Error during storage connection validation: Operation Stuck
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 248, in __validateFileServer
    rc = oop.fileUtils.mount(con['rp'], mountpoint, fsType)
  File "/usr/share/vdsm/storage/processPool.py", line 33, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 61, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

When the mount does not already exist, vdsm returns successfully.

Version-Release number of selected component (if applicable):
vdsm-4.9-95.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have vdsm connected with an FC/iSCSI DC and an NFS export domain (HSM).
2. Restart vdsm.

Actual results:
validateStorageServerConnection for the export domain times out ("Operation Stuck") and returns status 453 when the mount already exists.

Expected results:
validateStorageServerConnection succeeds whether or not the mount already exists.

Additional info:
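For anyone unfamiliar with the oop (out-of-process) machinery in the traceback: the call is handed to a forked helper process and the caller waits a bounded time for the answer. A rough sketch of that pattern (illustrative only, not vdsm's actual processPool.py; the _helper and run_externally names are made up here):

import multiprocessing

class Timeout(Exception):
    pass

def _helper(conn):
    # Helper process: receive (func, args, kwargs), run it, send the result back.
    while True:
        func, args, kwargs = conn.recv()
        conn.send(func(*args, **kwargs))

def run_externally(conn, func, timeout, *args, **kwargs):
    conn.send((func, args, kwargs))
    if not conn.poll(timeout):   # helper never answered -> "Operation Stuck"
        raise Timeout("Operation Stuck")
    return conn.recv()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    multiprocessing.Process(target=_helper, args=(child_conn,), daemon=True).start()
    print(run_externally(parent_conn, max, 5, 1, 2))   # -> 2

If the helper hangs before reaching conn.recv(), the caller only sees the timeout, which matches the log above.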
This is a ProcessPool issue. Apparently ProcessPool.runExternally sends the command, but it is never picked up by the Helper. I added some debug output:

send: (<function mount at 0x7f5667704f50>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
send: (<function umount at 0x7f566770e050>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
receive: (<function umount at 0x7f566770e050>, ('rhev-i8c-03.mpc.lab.eng.bos.redhat.com:/home/export', '/tmp/tmpwCWYK2', 'nfs'), {})
return: (<function umount at 0x7f566770e050>, -1, None)

As we can see, the "mount" command was never picked up and timed out. I still can't reproduce this on my host; it might be related to the load present on the host.
The issue is that oop.fileUtils.mount uses execCmd, which tries to log some information. Sadly, if the logging lock was acquired before the oop fork, we get stuck:

http://www.mail-archive.com/python-bugs-list@python.org/msg116242.html
http://bugs.python.org/issue6721#msg141286

That said, we can take several approaches here:
1) remove or make logging optional in execCmd
2) move mount out of the oop (revert part of 7c20a19)
3) reorganize the code so that only isStaleHandle is oop (race prone)
4) fix the problem in python
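For reference, the failure mode described in issue6721 can be reproduced outside vdsm with a few lines of Python (an illustrative sketch, not vdsm code; the background_logger thread stands in for whatever holds logging's internal lock at fork time):

import os
import threading
import time

lock = threading.Lock()          # stands in for logging's internal handler/module lock

def background_logger():
    while True:
        with lock:               # another thread holds the lock most of the time
            time.sleep(0.01)

threading.Thread(target=background_logger, daemon=True).start()
time.sleep(0.1)                  # make it very likely the lock is held at fork time

pid = os.fork()                  # POSIX only; mirrors the oop Helper fork
if pid == 0:
    # Child: only the forking thread survives the fork, so the inherited lock
    # stays locked forever. This acquire can hang, exactly like execCmd's
    # logging call hanging inside the processPool Helper.
    lock.acquire()
    os._exit(0)
else:
    os.waitpid(pid, 0)           # never returns if the child deadlocked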
commit ae8a9e17f4ccf35ef3a04f2460b906ead97b0e96
Author: Federico Simoncelli <fsimonce>
Date:   Thu Sep 1 09:39:04 2011 +0000

    BZ#732652 Avoid logging in the processPool Helper

    Logging can cause a deadlock in a multi-threaded, multi-process
    environment; therefore closing the logging file descriptors is not
    enough, we should also remove all the handlers.

    Reference: http://bugs.python.org/issue6721

    Change-Id: Ic70c2b390db23b48f96ba677054039d2773960ae

http://gerrit.usersys.redhat.com/887
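In other words, the fix makes the forked Helper stop using logging altogether. A hedged sketch of that idea (not the literal patch from the gerrit change above; the function name disable_logging_in_child is made up here):

import logging

def disable_logging_in_child():
    # Called in the Helper right after fork(): drop every handler from the
    # root logger so nothing in the child ever touches the handler locks
    # (and their file descriptors) inherited from the parent.
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
        try:
            handler.close()
        except Exception:
            pass                 # a half-closed fd in the child is harmless
    # Belt and braces: drop any records that still reach the logging module.
    logging.disable(logging.CRITICAL)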
Verified on vdsm-4.9-97.el6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html