Bug 829710 - 3.1 - vdsm: host become non-responsive when blocking nfs export domain
3.1 - vdsm: host become non-responsive when blocking nfs export domain
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm (Show other bugs)
6.3
x86_64 Linux
high Severity urgent
: rc
: ---
Assigned To: Saggi Mizrahi
Dafna Ron
infra
: Regression, TestBlocker, ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-06-07 07:31 EDT by Dafna Ron
Modified: 2012-12-04 13:59 EST (History)
10 users (show)

See Also:
Fixed In Version: vdsm-4.9.6-36.0
Doc Type: Bug Fix
Doc Text:
Previously, when iSCSI master domains were created with NFS export domains (where the export domains were located on different storage servers) and connectivity to the export domain was blocked by means of iptables, VDSM stopped responding and the host became non-responsive while storage remained up. An update to VDSM makes sure that now the export domain is deactivated when iptables is used to block connectivity to the export domain.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 13:59:05 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (598.94 KB, application/x-gzip)
2012-06-07 07:33 EDT, Dafna Ron
no flags Details
log (1.45 MB, application/x-xz)
2012-09-10 14:19 EDT, Dafna Ron
no flags Details

  None (edit)
Description Dafna Ron 2012-06-07 07:31:13 EDT
Description of problem:

after blocking my export domain from the host (only export)
it seems like we have vdsm Broken pipe without vdsm kill. 
getCapabilities will raise libvirt error: 

libvirtError: Cannot write data: Broken pipe

libvirt is up and so is vdsm (vdsm is in additional info). 

4 S root      2218     1  0  80   0 - 232202 poll_s 11:46 ?       00:00:04 /usr/sbin/libvirtd --listen


Version-Release number of selected component (if applicable):

vdsm-4.9.6-10.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create an iscsi master domain with nfs Export domain (export located on different storage server)
2. block connectivity to the Export domain from the host using iptables
3.
  
Actual results:

vdsm will stop to respond. 

host becomes non-responsive while storage is up. 

Expected results:

Export domain will become inactive.

Additional info: vdsm,engine and libvirt logs are attached 


0 T root      1268  2567  0  80   0 - 26360 signal 14:15 pts/0    00:00:00 less /var/log/vdsm/vdsm.log
1 S vdsm      7184 29199  0  75  -5 -  2475 wait   14:22 ?        00:00:00 /bin/bash /usr/libexec/vdsm/spmprotect.sh renew 14a1733b-45a0-49b9-ab46-c64230585b54 2 5 /dev/14a1733b-45a0-49b9-ab46-c64230585b54/leases 60000 10000 1338806085720108
0 S vdsm      7186 29199  0  75  -5 -  1016 hrtime 14:22 ?        00:00:00 sleep 1
0 S vdsm      7187  7184  0  75  -5 -  1016 hrtime 14:22 ?        00:00:00 sleep 59
0 S root      7191  2567  0  80   0 - 25811 pipe_w 14:22 pts/0    00:00:00 grep vdsm
0 S vdsm      9439 25223  0  75  -5 -  2337 pipe_w 13:48 ?        00:00:00 dd if=/rhev/data-center/b77871a5-0998-45f9-8a6c-9f21e23929c8/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000
0 Z vdsm      9451 25223  0  75  -5 -     0 exit   13:48 ?        00:00:00 [ifconfig] <defunct>
1 S vdsm     25220     1  0  75  -5 -  2308 wait   12:21 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
0 T vdsm     25223 25220  0  75  -5 - 664994 signal 12:21 ?       00:01:00 /usr/bin/python /usr/share/vdsm/vdsm
4 S root     25262 25223  0  75  -5 - 12796 poll_s 12:21 ?        00:00:00 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py cb461b22-1101-4f5b-bbab-403ed07c1a95 25223
4 S root     25263 25262  0  75  -5 - 186790 futex_ 12:21 ?       00:00:00 /usr/bin/python /usr/share/vdsm/supervdsmServer.py cb461b22-1101-4f5b-bbab-403ed07c1a95 25223
0 T root     28873  2567  0  80   0 - 26360 signal 13:32 pts/0    00:00:00 less /var/log/vdsm/vdsm.log
0 S vdsm     29199     1  0  75  -5 -  2475 wait   13:35 ?        00:00:20 /bin/bash /usr/libexec/vdsm/spmprotect.sh renew 14a1733b-45a0-49b9-ab46-c64230585b54 2 5 /dev/14a1733b-45a0-49b9-ab46-c64230585b54/leases 60000 10000 1338806085720108
1 S vdsm     29283 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29284 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29285 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29287 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29289 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29291 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29293 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29295 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29297 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29299 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm


Thread-3048::ERROR::2012-06-04 13:48:24,003::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.97.65', 33500)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
    self.send_response(200)
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 384, in send_response
    self.send_header('Server', self.version_string())
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
    self.wfile.write("%s: %s\r\n" % (keyword, value))
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe


Thread-3053::ERROR::2012-06-04 13:48:24,039::BindingXMLRPC::873::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 278, in getCapabilities
    ret = api.getCapabilities()
  File "/usr/share/vdsm/API.py", line 1038, in getCapabilities
    c = caps.get()
  File "/usr/share/vdsm/caps.py", line 209, in get
    caps.update(netinfo.get())
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 246, in get
    nets = networks()
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 69, in networks
    for name in conn.listNetworks():
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3240, in listNetworks
    if ret is None: raise libvirtError ('virConnectListNetworks() failed', conn=self)
libvirtError: Cannot write data: Broken pipe
Comment 2 Dafna Ron 2012-06-07 07:33:20 EDT
Created attachment 590168 [details]
logs
Comment 4 Barak 2012-07-15 12:04:16 EDT
Saggi,

Since this BZ is before the neww OOP code was introduced.

Can you please take a look and see whether this error can occure in the new code?
Comment 6 Saggi Mizrahi 2012-07-30 11:32:29 EDT
From looking at the logs it looks like:
* That OOP Failed with proper error (Connection was blocked so an IOError is properly fine)
* libvirt connection was stopped with the last log being
2012-06-04 11:06:05.279+0000: 2218: debug : virConnectClose:1462 : conn=0x7f4c34000c60

Which means libvirt was stopped about 3 hours before the error mentioned int he bug.
* Could not send response because HTTP connection broke in the middle of writing the response. (I guess because the requester timed out)

I really don't understand what you want me to fix here. It looks like vdsCaps() failed because of libvirt and not oop.

I would like to clarify that the host isn't really "non-responsive". It tries to respond with errors so it's not an OOP issue.

I don't understand what it is you want me to fix. Libvirt was closed by something because the logs show that the closing was intentional.
Everything else works like it should.
Comment 7 Barak 2012-08-02 05:57:01 EDT
Dafna,

Based on Saggi's response.

Was libvirt closed intentionally?

Can you please confirm this is the case or supply a different reproducer.
Comment 8 Saggi Mizrahi 2012-08-02 14:26:56 EDT
http://gerrit.ovirt.org/#/c/6867/
Comment 10 Barak 2012-09-10 06:45:49 EDT
Pending QE (dafan's) Verification
Comment 11 Dafna Ron 2012-09-10 14:18:12 EDT
the patch fixes the bug and the master domain remains in up state and not effected. 
but I can see that the export domain will not change status to inactive because of attr issue: 

AttributeError: 'DomainMonitorThread' object has no attribute 'onDomainConnectivityStateChange'

ERROR below and full log will be attached

Thread-22::ERROR::2012-09-10 21:04:10,093::domainMonitor::205::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 72ec1321-a114-451f-bee1-6790cbca1bc6 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 183, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 134, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 370, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 282, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler
Thread-22::DEBUG::2012-09-10 21:04:10,095::domainMonitor::213::Storage.DomainMonitorThread::(_monitorDomain) Domain 72ec1321-a114-451f-bee1-6790cbca1bc6 changed its status to Invalid
Thread-22::WARNING::2012-09-10 21:04:10,096::domainMonitor::220::Storage.DomainMonitorThread::(_monitorDomain) Could not emit domain state change event
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 216, in _monitorDomain
    self.onDomainConnectivityStateChange.emit(
AttributeError: 'DomainMonitorThread' object has no attribute 'onDomainConnectivityStateChange'
Comment 12 Dafna Ron 2012-09-10 14:19:57 EDT
Created attachment 611535 [details]
log
Comment 16 Dafna Ron 2012-09-27 10:54:05 EDT
verified on si19
only the export domain is deactivated
Comment 19 errata-xmlrpc 2012-12-04 13:59:05 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html

Note You need to log in before you can comment on or make changes to this bug.