Created attachment 773789 [details]
vdsm logs

Description of problem:
VDSM logs show that vdsm fails on startup with:

MainThread::ERROR::2013-07-15 17:30:19,795::vdsm::104::vds::(run) Exception raised
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 102, in run
    serve_clients(log)
  File "/usr/share/vdsm/vdsm", line 59, in serve_clients
    cif = clientIF.getInstance(log)
  File "/usr/share/vdsm/clientIF.py", line 122, in getInstance
    cls._instance = clientIF(log)
  File "/usr/share/vdsm/clientIF.py", line 112, in __init__
    self._prepareBindings()
  File "/usr/share/vdsm/clientIF.py", line 150, in _prepareBindings
    self._loadBindingXMLRPC()
  File "/usr/share/vdsm/clientIF.py", line 136, in _loadBindingXMLRPC
    default_bridge)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 55, in __init__
    self.server = self._createXMLRPCServer()
  File "/usr/share/vdsm/BindingXMLRPC.py", line 149, in _createXMLRPCServer
    requestHandler=LoggingHandler)
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 141, in __init__
    self.server_bind()
  File "/usr/lib64/python2.6/SocketServer.py", line 413, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 89, in bind
    self.socket.bind(addr)
  File "<string>", line 1, in bind
error: [Errno 98] Address already in use

There is another vdsm process running with its own respawn process. This earlier vdsm process is not stopped by /etc/init.d/vdsmd stop, and it is still bound to port 54321, preventing the newly spawned vdsm from binding to that port.
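The "[Errno 98] Address already in use" above is the kernel refusing a second bind on a port that another socket already holds. A minimal, self-contained Python sketch (using an ephemeral loopback port, not the actual vdsm code or port 54321) reproduces the same error:

```python
import errno
import socket

# First socket takes the port, like the stale vdsm instance does.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))        # bind to an ephemeral port
port = s1.getsockname()[1]
s1.listen(1)

# Second socket plays the role of the newly spawned vdsm.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))  # fails with EADDRINUSE (errno 98 on Linux)
except socket.error as e:
    print("bind failed:", e)
finally:
    s2.close()
    s1.close()
```

Until the stale process holding the port exits, every restart of the service will hit this error, which is why the leftover respawn/vdsm pair had to be found and killed by hand.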
Any commands passed to vdsm locally fail with:

Traceback (most recent call last):
  File "/usr/share/vdsm/vdsClient.py", line 2482, in <module>
    code, message = commands[command][0](commandArgs)
  File "/usr/share/vdsm/vdsClient.py", line 419, in do_getCap
    return self.ExecAndExit(self.s.getVdsCapabilities())
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 172, in connect
    cert_reqs=self.cert_reqs)
  File "/usr/lib64/python2.6/ssl.py", line 338, in wrap_socket
    suppress_ragged_eofs=suppress_ragged_eofs)
  File "/usr/lib64/python2.6/ssl.py", line 120, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.6/ssl.py", line 279, in do_handshake
    self._sslobj.do_handshake()
SSLError: [Errno 1] _ssl.c:490: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Version-Release number of selected component (if applicable):
vdsm-4.11.0-121.git082925a.el6.x86_64

How reproducible:
?

Steps to Reproduce:
?

Actual results:
All commands passed to vdsm fail; even adding the host to a new engine fails during installation, as does restarting vdsm via /etc/init.d/vdsmd stop / start.

Additional info:
vdsm logs attached
We need a reproduction scenario for this phenomenon. How was the vdsm service brought down? Is it possible that you upgraded the vdsm package without restarting the service?
I do not know exactly how the host ended up in that state. As far as I know, the vdsm package was not upgraded without restarting the service; the host was used for several automated tests before entering this state.
So if we don't get a reproducer, this will be closed as INSUFFICIENT_DATA within a week.
We seem to have a reproducer in our jenkins environment, please check:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.3/job/3.3-storage_export_import_server-iscsi-sdk/19/consoleFull

Problematic host is: 10.35.160.154

[root@cheetah02 ~]# cat lsof_vdsm.txt | grep res
respawn  3952 vdsm cwd DIR 8,3    4096        2 /
respawn  3952 vdsm rtd DIR 8,3    4096        2 /
respawn  3952 vdsm txt REG 8,3  938768 11214899 /bin/bash
respawn  3952 vdsm mem REG 8,3  138280 24789098 /lib64/libtinfo.so.5.7
respawn  3952 vdsm mem REG 8,3  156872 24789442 /lib64/ld-2.12.so
respawn  3952 vdsm mem REG 8,3   22536 24789445 /lib64/libdl-2.12.so
respawn  3952 vdsm mem REG 8,3 1922152 24789443 /lib64/libc-2.12.so
respawn  3952 vdsm 0r  CHR 1,3     0t0     3698 /dev/null
respawn  3952 vdsm 1w  CHR 1,3     0t0     3698 /dev/null
respawn  3952 vdsm 2w  CHR 1,3     0t0     3698 /dev/null
vdsm     3956 vdsm mem REG 8,3  113952 24789470 /lib64/libresolv-2.12.so
python   5266 vdsm mem REG 8,3  113952 24789470 /lib64/libresolv-2.12.so
python   5887 vdsm mem REG 8,3  113952 24789470 /lib64/libresolv-2.12.so
python   5894 vdsm mem REG 8,3  113952 24789470 /lib64/libresolv-2.12.so
respawn 19932 vdsm cwd DIR 8,3    4096        2 /
respawn 19932 vdsm rtd DIR 8,3    4096        2 /
respawn 19932 vdsm txt REG 8,3  938768 11214899 /bin/bash
respawn 19932 vdsm mem REG 8,3  138280 24789098 /lib64/libtinfo.so.5.7
respawn 19932 vdsm mem REG 8,3  156872 24789442 /lib64/ld-2.12.so
respawn 19932 vdsm mem REG 8,3   22536 24789445 /lib64/libdl-2.12.so
respawn 19932 vdsm mem REG 8,3 1922152 24789443 /lib64/libc-2.12.so
respawn 19932 vdsm 0r  CHR 1,3     0t0     3698 /dev/null
respawn 19932 vdsm 1w  CHR 1,3     0t0     3698 /dev/null
respawn 19932 vdsm 2w  CHR 1,3     0t0     3698 /dev/null
The remaining instance of vdsm has been there since Jul 12, and was connected to 3 storage domains.

[root@cheetah02 ~]# ps aux | grep vdsm
vdsm  3952 0.0 0.0    9236   688 ?     S<  Jul12  0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm  3956 0.1 0.3 1920936 50660 ?     T<l Jul12 30:22 /usr/bin/python /usr/share/vdsm/vdsm
vdsm  5266 0.0 0.0  107572 10152 ?     S<  Jul12  0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 33 17
vdsm  5887 0.0 0.0  107576 10184 ?     S<  Jul12  0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 38 35
vdsm  5894 0.0 0.0  107572 10248 ?     S<  Jul12  0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 41 38
vdsm  8996 0.0 0.0    4068   488 ?     S<  15:38  0:00 sleep 900
root  9096 0.0 0.0  103248   844 pts/1 S+  15:44  0:00 grep vdsm
root 13129 0.0 0.0  191756  8128 ?     S   11:03  0:00 python /usr/bin/vdsmcodecoverage --daemon --conf /etc/vdsmcodecoverage.conf
root 19868 0.0 0.0   11336   696 ?     S   11:27  0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc /var/run/vdsm/svdsm.sock
root 19870 0.0 0.1  756028 25204 ?     Sl  11:27  0:04 /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc /var/run/vdsm/svdsm.sock
vdsm 19932 0.0 0.0    9232   720 ?     S<  11:27  0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm

From /var/log/yum.log: on that date (Jul 12) we installed vdsm-4.12.0-rc1.12, and today we installed vdsm-4.12.0-rc3. Since Jul 12 we have installed vdsm at least 3 times each day. Still trying to locate the reason why this specific old instance was not cleared.
And after looking at the yum log we see that vdsm was upgraded on a daily basis, yet the failure happened on jenkins only on the 28th. So the question of a reproduction scenario still remains.
After checking some init script races, we see that if the vdsmd service is restarted simultaneously more than once, we can end up with two or more instances of the respawn script. This leads to the scenario described in this bug. We assume that during the QA tests we hit this race while upgrading to a new version of vdsm: the upgrade performs a condrestart after the uninstall part, and the deploy process may perform "service vdsmd start" at the same time. Following:
http://fedoraproject.org/wiki/Packaging:SysVInitScript#Why_do_SysV_initscripts_with_resident_daemons_require_lock_files.3F
we added a lock file that indicates whether the daemon is running. This should prevent such races. The same fix should also be applied to the supervdsmd init script.
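The lock-file guard described above can be sketched as follows. This is an illustrative Python version of the check only; the actual fix lives in the SysV init scripts and uses a /var/lock/subsys file per the Fedora guideline linked above, so the path and messages here are assumptions, not the real vdsmd script:

```python
import errno
import os

# Hypothetical lock path for illustration; a real init script would use
# /var/lock/subsys/vdsmd.
LOCKFILE = "/tmp/demo_vdsmd.lock"

def start():
    try:
        # O_CREAT | O_EXCL makes check-and-create a single atomic step, so two
        # simultaneous "service vdsmd start" invocations cannot both pass the
        # "is another instance running?" check.
        fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            print("cannot start VDS daemon when another instance runs")
            return False
        raise
    os.close(fd)
    print("started")
    return True

def stop():
    # Removing the lock file marks the daemon as no longer running.
    if os.path.exists(LOCKFILE):
        os.remove(LOCKFILE)
        print("stopped")

start()  # first start succeeds and creates the lock
start()  # a concurrent/second start is refused while the lock exists
stop()
```

The key design point is the atomic create-if-absent: a plain "if file exists" test followed by a separate create would leave the same window that caused two respawn instances in the first place.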
Verified on is13 by restarting vdsmd from two shells simultaneously. It seems to solve the issue, but the first failure message is somewhat confusing:

# service vdsmd restart
Shutting down vdsm daemon:
vdsm during another start, cannot stop yet                 [FAILED]
cannot start VDS daemon when another instance runs         [FAILED]
Why confusing? Another instance runs... what would you prefer?

By the way, the patch http://gerrit.ovirt.org/#/c/18875 changes it to:
"cannot perform VDSM daemon operation when another instance runs" (almost the same :))

It hasn't been merged yet, but will be soon, so if you have a better suggestion I'll be glad to adopt it.
(In reply to Yaniv Bronhaim from comment #9)
> Why confusing? another instance runs.. what do you prefer?
>
> btw, the patch http://gerrit.ovirt.org/#/c/18875 modifies it to:
> "cannot perform VDSM daemon operation when another instance runs" (almost
> the same :))
>
> It wasn't merged it, but will be soon. so if you have any good suggestion
> I'll be glad to adopt

Almost the same, but much more understandable - can't wait to see that patch merged :)
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html