838602 – 3.2 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment.

Bug 838602 - 3.2 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment.

Summary: 3.2 - VDSM [Scalability]: On scale environment refreshStoragePool fails on ti...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.1.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.2.0
Assignee:	Eduardo Warszawski
QA Contact:	Yuri Obshansky
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:	rhev_scalability
TreeView+	depends on / blocked

Reported:	2012-07-09 14:49 UTC by Omri Hochman
Modified:	2016-02-10 16:39 UTC (History)
CC List:	10 users (show)
Fixed In Version:	vdsm-4.10.2-17.0.el6ev
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-03-16 14:00:25 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	scohen: Triaged+

Attachments	(Terms of Use)
full_logs (1.59 MB, application/octet-stream) 2012-07-09 14:56 UTC, Omri Hochman	no flags	Details
View All

Description Omri Hochman 2012-07-09 14:49:38 UTC

- 3.1 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment. 

Description :
**************
I'm working on Scale environment:40 hosts and 100 iSCSI Storage Domains (EMC machine) - No VM's ,No Virtual Disks! 
I Attempted to Activate Several Storage Domains at once - the action took long time and failed.  after few SPM Fail-overs, some of the hosts (17 out of 40) got into 'rhevm auto-recovery loop' - that means their statuses switched from 'Non-Operational' --> to  'Unassigned' --> to 'Up' -->to 'Non-Operational'
-- > back to 'Unassigned', and so on.. 

Looking at the vdsm.log (on the 'Unassigned' hosts), it looks that the HSM Command 'refreshStoragePool' over and over took long time (over 6 minutes) and kept falling on SSL timeout. (broken pipe error)  

Note:
********
A) The bug report is for the slowness of the 'refreshStoragePool command', it  took over 6 minutes.

B) Follow Thread-38098  'Run and protect: refreshStoragePool'  ---> see that the action that took most of the time (4 minutes out of 6)is the SamplingMethod::(__call__) .


VDSM.log
*********

##Begins:

Thread-38098::DEBUG::2012-07-09 08:28:51,498::BindingXMLRPC::164::vds::(wrapper) [10.35.160.91]
Thread-38098::DEBUG::2012-07-09 08:28:51,499::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state init -> state preparing
Thread-38098::INFO::2012-07-09 08:28:51,500::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='50d275f3-2029-4fd9-bc86-20333f3cb24f', msdUUID='35459266-07dd-4f15-92c1-f6e36e9b851a', masterVersion=1, options=None)


##Calls:   SamplingMethod:

Thread-38098::DEBUG::2012-07-09 08:31:24,330::misc::1088::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads)
..
..
..
##4 minutes after..

Thread-38098::DEBUG::2012-07-09 08:35:23,295::misc::1090::SamplingMethod::(__call__) Got in to sampling method
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::464::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::466::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::475::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::487::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::349::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex

## fails on timeout with Broken pipe :
hread-38098::ERROR::2012-07-09 08:35:34,911::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.160.91', 37831)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
    self.send_response(200)
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 385, in send_response
    self.send_header('Date', self.date_time_string())
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
    self.wfile.write("%s: %s\r\n" % (keyword, value))
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe

Comment 2 Omri Hochman 2012-07-09 14:56:23 UTC

Created attachment 597096 [details]
full_logs

Adding - VDSM Logs from two different hosts which had refreshStoragePool that failed on timeout + RHEVM Logs.

Comment 8 Omri Hochman 2012-11-07 12:50:39 UTC

[RHEVM 3.1] [vdsm-4.9.6-40.0.el6_3.x86_64] - bug still reproduce. 
This bug preventing us from continuing with scale storage tests -
when having storage pool with more than 50 Storage Domains and attempting to activate additional storage domains, refreshStoragePool fails on timeout(3:30 minutes) and hosts are being switched to non-operational.

Comment 10 Allon Mureinik 2013-04-10 09:03:47 UTC

This will probably be solved by same solution as bug 912158, pending finilization of that patch to verify.

Comment 11 Ayal Baron 2013-05-01 19:21:43 UTC

(In reply to comment #10)
> This will probably be solved by same solution as bug 912158, pending
> finilization of that patch to verify.

Allon, if the patch indeed solved this bug then please mark it appropriately.

Note You need to log in before you can comment on or make changes to this bug.