Bug 829710 - 3.1 - vdsm: host becomes non-responsive when blocking nfs export domain
Summary: 3.1 - vdsm: host becomes non-responsive when blocking nfs export domain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Saggi Mizrahi
QA Contact: Dafna Ron
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2012-06-07 11:31 UTC by Dafna Ron
Modified: 2022-07-09 05:35 UTC
CC List: 10 users

Fixed In Version: vdsm-4.9.6-36.0
Doc Type: Bug Fix
Doc Text:
Previously, when an iSCSI master domain was used with an NFS export domain located on a different storage server and connectivity to the export domain was blocked by means of iptables, VDSM stopped responding and the host became non-responsive while storage remained up. With this update, VDSM deactivates the export domain when iptables is used to block connectivity to it.
Clone Of:
Environment:
Last Closed: 2012-12-04 18:59:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (598.94 KB, application/x-gzip), 2012-06-07 11:33 UTC, Dafna Ron
log (1.45 MB, application/x-xz), 2012-09-10 18:19 UTC, Dafna Ron


Links
Red Hat Product Errata RHSA-2012:1508 (normal, SHIPPED_LIVE): Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update (last updated 2012-12-04 23:48:05 UTC)

Description Dafna Ron 2012-06-07 11:31:13 UTC
Description of problem:

After blocking the export domain from the host (only the export domain), vdsm reports a Broken pipe without the vdsm process being killed.
getCapabilities raises a libvirt error:

libvirtError: Cannot write data: Broken pipe

libvirt is up and so is vdsm (the vdsm process listing is included in the additional info below).

4 S root      2218     1  0  80   0 - 232202 poll_s 11:46 ?       00:00:04 /usr/sbin/libvirtd --listen


Version-Release number of selected component (if applicable):

vdsm-4.9.6-10.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create an iSCSI master domain with an NFS export domain (the export domain located on a different storage server).
2. Block connectivity to the export domain from the host using iptables (see the sketch below).
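
A minimal sketch of step 2, for illustration only: the export domain's NFS server address below is a placeholder (the real addresses are not recorded in this bug), and the commands assume root privileges on the hypervisor.

import subprocess

# Placeholder address for the export domain's NFS server; substitute the real one.
EXPORT_NFS_IP = "192.0.2.10"

def block_export_domain():
    # Drop all outgoing traffic from the host to the export domain's storage server.
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-d", EXPORT_NFS_IP, "-j", "DROP"])

def restore_export_domain():
    # Remove the blocking rule once the test is done.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", EXPORT_NFS_IP, "-j", "DROP"])

if __name__ == "__main__":
    block_export_domain()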
  
Actual results:

vdsm stops responding.

host becomes non-responsive while storage is up. 

Expected results:

The export domain should become inactive.

Additional info: vdsm, engine, and libvirt logs are attached.


0 T root      1268  2567  0  80   0 - 26360 signal 14:15 pts/0    00:00:00 less /var/log/vdsm/vdsm.log
1 S vdsm      7184 29199  0  75  -5 -  2475 wait   14:22 ?        00:00:00 /bin/bash /usr/libexec/vdsm/spmprotect.sh renew 14a1733b-45a0-49b9-ab46-c64230585b54 2 5 /dev/14a1733b-45a0-49b9-ab46-c64230585b54/leases 60000 10000 1338806085720108
0 S vdsm      7186 29199  0  75  -5 -  1016 hrtime 14:22 ?        00:00:00 sleep 1
0 S vdsm      7187  7184  0  75  -5 -  1016 hrtime 14:22 ?        00:00:00 sleep 59
0 S root      7191  2567  0  80   0 - 25811 pipe_w 14:22 pts/0    00:00:00 grep vdsm
0 S vdsm      9439 25223  0  75  -5 -  2337 pipe_w 13:48 ?        00:00:00 dd if=/rhev/data-center/b77871a5-0998-45f9-8a6c-9f21e23929c8/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000
0 Z vdsm      9451 25223  0  75  -5 -     0 exit   13:48 ?        00:00:00 [ifconfig] <defunct>
1 S vdsm     25220     1  0  75  -5 -  2308 wait   12:21 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
0 T vdsm     25223 25220  0  75  -5 - 664994 signal 12:21 ?       00:01:00 /usr/bin/python /usr/share/vdsm/vdsm
4 S root     25262 25223  0  75  -5 - 12796 poll_s 12:21 ?        00:00:00 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py cb461b22-1101-4f5b-bbab-403ed07c1a95 25223
4 S root     25263 25262  0  75  -5 - 186790 futex_ 12:21 ?       00:00:00 /usr/bin/python /usr/share/vdsm/supervdsmServer.py cb461b22-1101-4f5b-bbab-403ed07c1a95 25223
0 T root     28873  2567  0  80   0 - 26360 signal 13:32 pts/0    00:00:00 less /var/log/vdsm/vdsm.log
0 S vdsm     29199     1  0  75  -5 -  2475 wait   13:35 ?        00:00:20 /bin/bash /usr/libexec/vdsm/spmprotect.sh renew 14a1733b-45a0-49b9-ab46-c64230585b54 2 5 /dev/14a1733b-45a0-49b9-ab46-c64230585b54/leases 60000 10000 1338806085720108
1 S vdsm     29283 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29284 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29285 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29287 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29289 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29291 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29293 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29295 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29297 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
1 S vdsm     29299 25223  0  75  -5 - 518741 poll_s 13:35 ?       00:00:00 /usr/bin/python /usr/share/vdsm/vdsm


Thread-3048::ERROR::2012-06-04 13:48:24,003::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.97.65', 33500)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
    self.send_response(200)
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 384, in send_response
    self.send_header('Server', self.version_string())
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
    self.wfile.write("%s: %s\r\n" % (keyword, value))
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe


Thread-3053::ERROR::2012-06-04 13:48:24,039::BindingXMLRPC::873::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 278, in getCapabilities
    ret = api.getCapabilities()
  File "/usr/share/vdsm/API.py", line 1038, in getCapabilities
    c = caps.get()
  File "/usr/share/vdsm/caps.py", line 209, in get
    caps.update(netinfo.get())
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 246, in get
    nets = networks()
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 69, in networks
    for name in conn.listNetworks():
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3240, in listNetworks
    if ret is None: raise libvirtError ('virConnectListNetworks() failed', conn=self)
libvirtError: Cannot write data: Broken pipe

Comment 2 Dafna Ron 2012-06-07 11:33:20 UTC
Created attachment 590168 [details]
logs

Comment 4 Barak 2012-07-15 16:04:16 UTC
Saggi,

Since this BZ predates the introduction of the new OOP code,

can you please take a look and see whether this error can still occur in the new code?

Comment 6 Saggi Mizrahi 2012-07-30 15:32:29 UTC
From looking at the logs, it looks like:
* OOP failed with a proper error (the connection was blocked, so an IOError is probably fine).
* The libvirt connection was stopped, with the last log entry being
2012-06-04 11:06:05.279+0000: 2218: debug : virConnectClose:1462 : conn=0x7f4c34000c60

which means libvirt was stopped about 3 hours before the error mentioned in the bug.
* The response could not be sent because the HTTP connection broke in the middle of writing it (I guess because the requester timed out).

I really don't understand what you want me to fix here. It looks like vdsCaps() failed because of libvirt and not OOP.

I would like to clarify that the host isn't really "non-responsive". It tries to respond with errors, so it's not an OOP issue.

I don't understand what it is you want me to fix. libvirt was closed by something, and the logs show that the closing was intentional.
Everything else works like it should.
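
For context (not part of the original comment): the getCapabilities traceback above shows a reused libvirt connection failing with "Cannot write data: Broken pipe" after libvirtd closed its side, so any later call such as conn.listNetworks() hits the dead socket. A minimal sketch, assuming the libvirt-python bindings; the helper below is hypothetical and is not vdsm's actual code:

import libvirt

_cached_conn = None

def get_connection(uri="qemu:///system"):
    # Return a usable libvirt connection, reopening it if the cached one is dead.
    global _cached_conn
    if _cached_conn is not None:
        try:
            if _cached_conn.isAlive():
                return _cached_conn
        except libvirt.libvirtError:
            pass  # the cached connection is already broken; fall through and reopen
    _cached_conn = libvirt.open(uri)
    return _cached_conn

# Usage: nets = get_connection().listNetworks()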

Comment 7 Barak 2012-08-02 09:57:01 UTC
Dafna,

Based on Saggi's response.

Was libvirt closed intentionally?

Can you please confirm this is the case or supply a different reproducer?

Comment 8 Saggi Mizrahi 2012-08-02 18:26:56 UTC
http://gerrit.ovirt.org/#/c/6867/

Comment 10 Barak 2012-09-10 10:45:49 UTC
Pending QE (Dafna's) verification

Comment 11 Dafna Ron 2012-09-10 18:18:12 UTC
The patch fixes the bug: the master domain remains in the up state and is not affected.
However, I can see that the export domain does not change status to inactive because of an attribute issue:

AttributeError: 'DomainMonitorThread' object has no attribute 'onDomainConnectivityStateChange'

The error is below and the full log will be attached.

Thread-22::ERROR::2012-09-10 21:04:10,093::domainMonitor::205::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 72ec1321-a114-451f-bee1-6790cbca1bc6 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 183, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 134, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 370, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 282, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler
Thread-22::DEBUG::2012-09-10 21:04:10,095::domainMonitor::213::Storage.DomainMonitorThread::(_monitorDomain) Domain 72ec1321-a114-451f-bee1-6790cbca1bc6 changed its status to Invalid
Thread-22::WARNING::2012-09-10 21:04:10,096::domainMonitor::220::Storage.DomainMonitorThread::(_monitorDomain) Could not emit domain state change event
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 216, in _monitorDomain
    self.onDomainConnectivityStateChange.emit(
AttributeError: 'DomainMonitorThread' object has no attribute 'onDomainConnectivityStateChange'
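
For context, not from the bug itself: this AttributeError is the generic Python failure raised when emit() is looked up on an object that never defined the event attribute (here the per-domain thread, while the event presumably lives elsewhere). A hypothetical, self-contained sketch of that pattern; the class and event names mirror the traceback, but the code is illustrative only and is not vdsm's:

class Event(object):
    # Tiny stand-in for an observable event with subscribers.
    def __init__(self):
        self._subscribers = []

    def register(self, callback):
        self._subscribers.append(callback)

    def emit(self, *args):
        for callback in self._subscribers:
            callback(*args)

class DomainMonitor(object):
    def __init__(self):
        # The event is defined here, on the monitor...
        self.onDomainConnectivityStateChange = Event()

class DomainMonitorThread(object):
    def _monitorDomain(self):
        # ...so looking it up on the thread raises AttributeError,
        # matching the traceback above.
        self.onDomainConnectivityStateChange.emit("sdUUID", False)

DomainMonitorThread()._monitorDomain()  # raises AttributeError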

Comment 12 Dafna Ron 2012-09-10 18:19:57 UTC
Created attachment 611535 [details]
log

Comment 16 Dafna Ron 2012-09-27 14:54:05 UTC
Verified on si19; only the export domain is deactivated.

Comment 19 errata-xmlrpc 2012-12-04 18:59:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html

