Bug 984616 - [vdsm] vdsm fails to bind to port due to another vdsm process running
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
3.3.0
All Linux
unspecified Severity high
: ---
: 3.3.0
Assigned To: Yaniv Bronhaim
Barak Dagan
infra
:
Depends On:
Blocks:
Reported: 2013-07-15 10:40 EDT by Gadi Ickowicz
Modified: 2016-02-10 14:00 EST
12 users

See Also:
Fixed In Version: is9.1
Doc Type: Bug Fix
Doc Text:
If the vdsmd service was restarted more than once simultaneously, multiple respawn scripts could be started, leaving several VDSM processes running. The earliest VDSM process was not stopped by the init script and remained bound to port 54321, preventing the newly spawned VDSM from binding to that port. Now, the init script holds a lock file from the time vdsmd starts until it is stopped, so the port-binding error no longer occurs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 11:28:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
vdsm logs (517.24 KB, application/x-gzip)
2013-07-15 10:40 EDT, Gadi Ickowicz


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 17662 None None None Never

Description Gadi Ickowicz 2013-07-15 10:40:49 EDT
Created attachment 773789 [details]
vdsm logs

Description of problem:
VDSM logs show that vdsm fails when starting due to:
MainThread::ERROR::2013-07-15 17:30:19,795::vdsm::104::vds::(run) Exception raised
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 102, in run
    serve_clients(log)
  File "/usr/share/vdsm/vdsm", line 59, in serve_clients
    cif = clientIF.getInstance(log)
  File "/usr/share/vdsm/clientIF.py", line 122, in getInstance
    cls._instance = clientIF(log)
  File "/usr/share/vdsm/clientIF.py", line 112, in __init__
    self._prepareBindings()
  File "/usr/share/vdsm/clientIF.py", line 150, in _prepareBindings
    self._loadBindingXMLRPC()
  File "/usr/share/vdsm/clientIF.py", line 136, in _loadBindingXMLRPC
    default_bridge)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 55, in __init__
    self.server = self._createXMLRPCServer()
  File "/usr/share/vdsm/BindingXMLRPC.py", line 149, in _createXMLRPCServer
    requestHandler=LoggingHandler)
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 141, in __init__
    self.server_bind()
  File "/usr/lib64/python2.6/SocketServer.py", line 413, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 89, in bind
    self.socket.bind(addr)
  File "<string>", line 1, in bind
error: [Errno 98] Address already in use
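The `Errno 98` above is EADDRINUSE: the new vdsm cannot bind port 54321 while the stale process still holds it. A minimal, self-contained Python sketch (not vdsm code) that reproduces the symptom with two competing listeners on an arbitrary free port:

```python
import errno
import socket

def bind_twice():
    # First listener grabs a port, standing in for the stale vdsm
    # process that is still bound to 54321.
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))   # 0 = pick any free port
    first.listen(1)
    port = first.getsockname()[1]

    # Second listener plays the newly spawned vdsm trying the same port.
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))
    except socket.error as e:
        return e.errno             # EADDRINUSE (errno 98 on Linux)
    finally:
        second.close()
        first.close()
    return None

print(bind_twice() == errno.EADDRINUSE)  # the second bind fails
```

The second `bind()` fails exactly as in the traceback, because no two sockets may listen on the same address without SO_REUSEADDR.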


There is another vdsm process running with its own respawn process. This earlier vdsm process is not stopped by /etc/init.d/vdsmd stop, and it remains bound to port 54321, preventing the newly spawned vdsm from binding to that port. Any commands passed to vdsm locally fail with:

Traceback (most recent call last):
  File "/usr/share/vdsm/vdsClient.py", line 2482, in <module>
    code, message = commands[command][0](commandArgs)
  File "/usr/share/vdsm/vdsClient.py", line 419, in do_getCap
    return self.ExecAndExit(self.s.getVdsCapabilities())
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 172, in connect
    cert_reqs=self.cert_reqs)
  File "/usr/lib64/python2.6/ssl.py", line 338, in wrap_socket
    suppress_ragged_eofs=suppress_ragged_eofs)
  File "/usr/lib64/python2.6/ssl.py", line 120, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.6/ssl.py", line 279, in do_handshake
    self._sslobj.do_handshake()
SSLError: [Errno 1] _ssl.c:490: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed


Version-Release number of selected component (if applicable):
vdsm-4.11.0-121.git082925a.el6.x86_64

How reproducible:
?

Steps to Reproduce:
?

Actual results:
All commands passed to vdsm fail; even adding the host to a new engine fails during installation, as does restarting vdsm with /etc/init.d/vdsmd stop / start.

Additional info:
vdsm logs attached
Comment 1 Barak 2013-07-21 11:26:11 EDT
We need a reproduction scenario for this phenomenon.
How was vdsm brought down?
Is it possible that you upgraded the vdsm package without restarting the service?
Comment 2 Gadi Ickowicz 2013-07-22 07:18:15 EDT
I do not know exactly how the host ended up in that state. As far as I know, the vdsm package was not upgraded without restarting the service - the host was used for several automated tests before entering this state.
Comment 3 Barak 2013-07-25 08:30:25 EDT
If we don't get a reproducer, this will be moved to CLOSED INSUFFICIENT_DATA within a week.
Comment 4 Haim 2013-07-30 08:00:10 EDT
We seem to have a reproducer in our Jenkins environment; please check:

http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.3/job/3.3-storage_export_import_server-iscsi-sdk/19/consoleFull

problematic host is: 10.35.160.154

[root@cheetah02 ~]# cat lsof_vdsm.txt | grep res
respawn    3952      vdsm  cwd       DIR                8,3     4096          2 /
respawn    3952      vdsm  rtd       DIR                8,3     4096          2 /
respawn    3952      vdsm  txt       REG                8,3   938768   11214899 /bin/bash
respawn    3952      vdsm  mem       REG                8,3   138280   24789098 /lib64/libtinfo.so.5.7
respawn    3952      vdsm  mem       REG                8,3   156872   24789442 /lib64/ld-2.12.so
respawn    3952      vdsm  mem       REG                8,3    22536   24789445 /lib64/libdl-2.12.so
respawn    3952      vdsm  mem       REG                8,3  1922152   24789443 /lib64/libc-2.12.so
respawn    3952      vdsm    0r      CHR                1,3      0t0       3698 /dev/null
respawn    3952      vdsm    1w      CHR                1,3      0t0       3698 /dev/null
respawn    3952      vdsm    2w      CHR                1,3      0t0       3698 /dev/null
vdsm       3956      vdsm  mem       REG                8,3   113952   24789470 /lib64/libresolv-2.12.so
python     5266      vdsm  mem       REG                8,3   113952   24789470 /lib64/libresolv-2.12.so
python     5887      vdsm  mem       REG                8,3   113952   24789470 /lib64/libresolv-2.12.so
python     5894      vdsm  mem       REG                8,3   113952   24789470 /lib64/libresolv-2.12.so
respawn   19932      vdsm  cwd       DIR                8,3     4096          2 /
respawn   19932      vdsm  rtd       DIR                8,3     4096          2 /
respawn   19932      vdsm  txt       REG                8,3   938768   11214899 /bin/bash
respawn   19932      vdsm  mem       REG                8,3   138280   24789098 /lib64/libtinfo.so.5.7
respawn   19932      vdsm  mem       REG                8,3   156872   24789442 /lib64/ld-2.12.so
respawn   19932      vdsm  mem       REG                8,3    22536   24789445 /lib64/libdl-2.12.so
respawn   19932      vdsm  mem       REG                8,3  1922152   24789443 /lib64/libc-2.12.so
respawn   19932      vdsm    0r      CHR                1,3      0t0       3698 /dev/null
respawn   19932      vdsm    1w      CHR                1,3      0t0       3698 /dev/null
respawn   19932      vdsm    2w      CHR                1,3      0t0       3698 /dev/null
Comment 5 Yaniv Bronhaim 2013-07-30 08:50:29 EDT
The remaining vdsm instance has been there since Jul 12 and was connected to 3 storage domains.

[root@cheetah02 ~]# ps aux | grep vdsm
vdsm      3952  0.0  0.0   9236   688 ?        S<   Jul12   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm      3956  0.1  0.3 1920936 50660 ?       T<l  Jul12  30:22 /usr/bin/python /usr/share/vdsm/vdsm
vdsm      5266  0.0  0.0 107572 10152 ?        S<   Jul12   0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 33 17
vdsm      5887  0.0  0.0 107576 10184 ?        S<   Jul12   0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 38 35
vdsm      5894  0.0  0.0 107572 10248 ?        S<   Jul12   0:00 /usr/bin/python /usr/share/vdsm/storage/remoteFileHandler.pyc 41 38
vdsm      8996  0.0  0.0   4068   488 ?        S<   15:38   0:00 sleep 900
root      9096  0.0  0.0 103248   844 pts/1    S+   15:44   0:00 grep vdsm
root     13129  0.0  0.0 191756  8128 ?        S    11:03   0:00 python /usr/bin/vdsmcodecoverage --daemon --conf /etc/vdsmcodecoverage.conf
root     19868  0.0  0.0  11336   696 ?        S    11:27   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc /var/run/vdsm/svdsm.sock
root     19870  0.0  0.1 756028 25204 ?        Sl   11:27   0:04 /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc /var/run/vdsm/svdsm.sock
vdsm     19932  0.0  0.0   9232   720 ?        S<   11:27   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm

From /var/log/yum.log: on that date we installed vdsm-4.12.0-rc1.12, and today we installed vdsm-4.12.0-rc3. Since Jul 12 we have installed vdsm at least 3 times each day.

I am trying to determine why this specific old instance was not cleared.
Comment 6 Barak 2013-07-30 12:41:06 EDT
After looking at the yum log, we see that vdsm was upgraded on a daily basis,
but the failure happened on Jenkins only on the 28th.
So the question of a reproduction scenario still remains.
Comment 7 Yaniv Bronhaim 2013-08-05 09:17:46 EDT
After checking some init script races, we found that if you restart the vdsmd service more than once simultaneously, you can end up with 2 or more instances of the respawn script. This leads to the scenario described in this bug.
We assume that during the QA tests we hit this race while upgrading to a new version of vdsm: the upgrade performs a condrestart after the uninstall step, and the deploy process may run "service vdsmd start" at the same time.

Following: http://fedoraproject.org/wiki/Packaging:SysVInitScript#Why_do_SysV_initscripts_with_resident_daemons_require_lock_files.3F

We added a lock file that indicates whether the daemon is running. This should prevent such races.
The fix should also be applied to the supervdsmd init script.
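The linked Fedora guideline describes the standard pattern: the init script takes an exclusive lock at start and releases it at stop, so a second start attempt fails fast instead of spawning a duplicate daemon. A rough Python sketch of that single-instance pattern, using a hypothetical lock path (the actual fix lives in the vdsmd SysV init script, not in Python):

```python
import fcntl
import os

def acquire_daemon_lock(path):
    """Return an open fd holding an exclusive lock on path, or None if
    another instance already holds it. The fd must stay open for the
    daemon's lifetime; the kernel releases the lock when the process
    exits, even after a crash."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Non-blocking exclusive flock: fails immediately if held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        os.close(fd)
        return None
    return fd

# Hypothetical demo path; a real daemon's lock file would live under
# /var/lock/subsys, as the Fedora guideline describes.
lock = acquire_daemon_lock("/tmp/vdsmd.demo.lock")
again = acquire_daemon_lock("/tmp/vdsmd.demo.lock")
print(lock is not None, again is None)  # only the first acquire succeeds
```

The second acquire fails immediately, which is exactly the behavior the verification in comment 8 shows: the competing restart reports a failure instead of leaving two respawn scripts running.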
Comment 8 Barak Dagan 2013-09-08 09:16:04 EDT
Verified on is13 - restarting vdsmd from two shells simultaneously:
It seems to solve the issue, but the first failure message is somewhat "confusing":

# service vdsmd restart
Shutting down vdsm daemon: 
vdsm during another start, cannot stop yet                 [FAILED]
cannot start VDS daemon when another instance runs         [FAILED]
Comment 9 Yaniv Bronhaim 2013-09-08 10:57:54 EDT
Why confusing? Another instance runs.. what do you prefer?

btw, the patch http://gerrit.ovirt.org/#/c/18875 modifies it to: 
"cannot perform VDSM daemon operation when another instance runs" (almost the same :))

It hasn't been merged yet, but will be soon. So if you have any good suggestion I'll be glad to adopt it.
Comment 10 Barak Dagan 2013-09-09 03:04:47 EDT
(In reply to Yaniv Bronhaim from comment #9)
> Why confusing? Another instance runs.. what do you prefer?
> 
> btw, the patch http://gerrit.ovirt.org/#/c/18875 modifies it to: 
> "cannot perform VDSM daemon operation when another instance runs" (almost
> the same :))
> 
> It hasn't been merged yet, but will be soon. So if you have any good
> suggestion I'll be glad to adopt it.

Almost the same, but much more understandable - can't wait to see that patch merged :)
Comment 11 Charlie 2013-11-27 19:31:21 EST
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 12 errata-xmlrpc 2014-01-21 11:28:29 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html
