Bug 838602 - 3.2 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment.
3.2 - VDSM [Scalability]: On scale environment refreshStoragePool fails on ti...
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm (Show other bugs)
3.1.0
x86_64 Linux
high Severity high
: ---
: 3.2.0
Assigned To: Eduardo Warszawski
Yuri Obshansky
storage
: TestBlocker
Depends On:
Blocks: rhev_scalability
  Show dependency treegraph
 
Reported: 2012-07-09 10:49 EDT by Omri Hochman
Modified: 2016-02-10 11:39 EST (History)
10 users (show)

See Also:
Fixed In Version: vdsm-4.10.2-17.0.el6ev
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-03-16 10:00:25 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
scohen: Triaged+


Attachments (Terms of Use)
full_logs (1.59 MB, application/octet-stream)
2012-07-09 10:56 EDT, Omri Hochman
no flags Details

  None (edit)
Description Omri Hochman 2012-07-09 10:49:38 EDT
- 3.1 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment. 

Description :
**************
I'm working on Scale environment:40 hosts and 100 iSCSI Storage Domains (EMC machine) - No VM's ,No Virtual Disks! 
I Attempted to Activate Several Storage Domains at once - the action took long time and failed.  after few SPM Fail-overs, some of the hosts (17 out of 40) got into 'rhevm auto-recovery loop' - that means their statuses switched from 'Non-Operational' --> to  'Unassigned' --> to 'Up' -->to 'Non-Operational'
-- > back to 'Unassigned', and so on.. 

Looking at the vdsm.log (on the 'Unassigned' hosts), it looks that the HSM Command 'refreshStoragePool' over and over took long time (over 6 minutes) and kept falling on SSL timeout. (broken pipe error)  

Note:
********
A) The bug report is for the slowness of the 'refreshStoragePool command', it  took over 6 minutes.

B) Follow Thread-38098  'Run and protect: refreshStoragePool'  ---> see that the action that took most of the time (4 minutes out of 6)is the SamplingMethod::(__call__) .


VDSM.log
*********

##Begins:

Thread-38098::DEBUG::2012-07-09 08:28:51,498::BindingXMLRPC::164::vds::(wrapper) [10.35.160.91]
Thread-38098::DEBUG::2012-07-09 08:28:51,499::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state init -> state preparing
Thread-38098::INFO::2012-07-09 08:28:51,500::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='50d275f3-2029-4fd9-bc86-20333f3cb24f', msdUUID='35459266-07dd-4f15-92c1-f6e36e9b851a', masterVersion=1, options=None)


##Calls:   SamplingMethod:

Thread-38098::DEBUG::2012-07-09 08:31:24,330::misc::1088::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads)
..
..
..
##4 minutes after..

Thread-38098::DEBUG::2012-07-09 08:35:23,295::misc::1090::SamplingMethod::(__call__) Got in to sampling method
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::464::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::466::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::475::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::487::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::349::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex

## fails on timeout with Broken pipe :
hread-38098::ERROR::2012-07-09 08:35:34,911::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.160.91', 37831)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
    self.send_response(200)
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 385, in send_response
    self.send_header('Date', self.date_time_string())
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
    self.wfile.write("%s: %s\r\n" % (keyword, value))
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe
Comment 2 Omri Hochman 2012-07-09 10:56:23 EDT
Created attachment 597096 [details]
full_logs

Adding - VDSM Logs from two different hosts which had refreshStoragePool that failed on timeout + RHEVM Logs.
Comment 8 Omri Hochman 2012-11-07 07:50:39 EST
[RHEVM 3.1] [vdsm-4.9.6-40.0.el6_3.x86_64] - bug still reproduce. 
This bug preventing us from continuing with scale storage tests -
when having storage pool with more than 50 Storage Domains and attempting to activate additional storage domains, refreshStoragePool fails on timeout(3:30 minutes) and hosts are being switched to non-operational.
Comment 10 Allon Mureinik 2013-04-10 05:03:47 EDT
This will probably be solved by same solution as bug 912158, pending finilization of that patch to verify.
Comment 11 Ayal Baron 2013-05-01 15:21:43 EDT
(In reply to comment #10)
> This will probably be solved by same solution as bug 912158, pending
> finilization of that patch to verify.

Allon, if the patch indeed solved this bug then please mark it appropriately.

Note You need to log in before you can comment on or make changes to this bug.