Bug 838602

Summary: 3.2 - VDSM [Scalability]: On a scale environment, refreshStoragePool fails on timeout and causes hosts to get stuck in auto-recovery.
Product: Red Hat Enterprise Virtualization Manager Reporter: Omri Hochman <ohochman>
Component: vdsmAssignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED CURRENTRELEASE QA Contact: Yuri Obshansky <yobshans>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.0 CC: abaron, acathrow, amureini, bazulay, iheim, jkt, lpeer, mgoldboi, scohen, yeylon
Target Milestone: --- Keywords: TestBlocker
Target Release: 3.2.0 Flags: scohen: Triaged+
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: vdsm-4.10.2-17.0.el6ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-03-16 14:00:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 882647    
Attachments:
Description Flags
full_logs none

Description Omri Hochman 2012-07-09 14:49:38 UTC
- 3.1 - VDSM [Scalability]: On a scale environment, refreshStoragePool fails on timeout and causes hosts to get stuck in auto-recovery.

Description :
**************
I'm working on a scale environment: 40 hosts and 100 iSCSI storage domains (EMC machine) - no VMs, no virtual disks!
I attempted to activate several storage domains at once; the action took a long time and failed. After a few SPM fail-overs, some of the hosts (17 out of 40) entered an RHEV-M auto-recovery loop, meaning their statuses kept switching from 'Non-Operational' --> 'Unassigned' --> 'Up' --> 'Non-Operational'
--> back to 'Unassigned', and so on.

Looking at vdsm.log on the 'Unassigned' hosts, the HSM command 'refreshStoragePool' repeatedly took a long time (over 6 minutes) and kept failing with an SSL timeout (broken pipe error).

Note:
********
A) This bug report is about the slowness of the 'refreshStoragePool' command; it took over 6 minutes.

B) Following Thread-38098 ('Run and protect: refreshStoragePool') shows that the step that took most of the time (4 minutes out of 6) is SamplingMethod::(__call__).
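For illustration only - the sketch below is not the actual vdsm code, but it shows the pattern behind the two log lines in the excerpt further down ('Trying to enter sampling method' and, four minutes later, 'Got in to sampling method'): a SamplingMethod-style wrapper serializes concurrent callers, so when each sample (e.g. scanning the VGs/LVs of ~100 storage domains) is slow, a caller can wait minutes just to enter the method.

import threading

class SamplingMethod(object):
    """Illustrative stand-in for vdsm's SamplingMethod decorator (sketch only).

    Concurrent callers are serialized: while one thread runs the wrapped
    sampling function, other callers block on the same lock. With many
    storage domains each sample is slow, so 'Trying to enter sampling
    method' and 'Got in to sampling method' can be minutes apart.
    """
    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Corresponds to the 'Trying to enter sampling method' log line.
        with self._lock:
            # Corresponds to the 'Got in to sampling method' log line.
            return self._func(*args, **kwargs)

@SamplingMethod
def updateMonitoringThreads():
    # Placeholder for the real per-domain monitoring refresh work.
    pass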


VDSM.log
*********

##Begins:

Thread-38098::DEBUG::2012-07-09 08:28:51,498::BindingXMLRPC::164::vds::(wrapper) [10.35.160.91]
Thread-38098::DEBUG::2012-07-09 08:28:51,499::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state init -> state preparing
Thread-38098::INFO::2012-07-09 08:28:51,500::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='50d275f3-2029-4fd9-bc86-20333f3cb24f', msdUUID='35459266-07dd-4f15-92c1-f6e36e9b851a', masterVersion=1, options=None)


##Calls:   SamplingMethod:

Thread-38098::DEBUG::2012-07-09 08:31:24,330::misc::1088::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads)
..
..
..
## 4 minutes later:

Thread-38098::DEBUG::2012-07-09 08:35:23,295::misc::1090::SamplingMethod::(__call__) Got in to sampling method
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::464::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::466::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::475::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::487::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::349::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex

## Fails on timeout with broken pipe:
Thread-38098::ERROR::2012-07-09 08:35:34,911::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.160.91', 37831)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
    self.send_response(200)
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 385, in send_response
    self.send_header('Date', self.date_time_string())
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
    self.wfile.write("%s: %s\r\n" % (keyword, value))
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe
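For illustration only: the traceback shows vdsm failing while writing the HTTP response, which is what happens when the XML-RPC client has already timed out and closed its side of the connection. A minimal sketch of that interaction from a caller's point of view, assuming a hypothetical host name and an illustrative ~3:30 timeout (the method name is copied from the log and may not match the exact XML-RPC verb the engine uses):

import socket
import xmlrpclib  # Python 2.6, matching the traceback above

# Assumed client-side timeout, shorter than the ~6 minutes the verb takes.
socket.setdefaulttimeout(210)

# Placeholder host; 54321 is vdsm's XML-RPC port.
proxy = xmlrpclib.ServerProxy("https://vdsm-host.example.com:54321")
try:
    proxy.refreshStoragePool("50d275f3-2029-4fd9-bc86-20333f3cb24f",
                             "35459266-07dd-4f15-92c1-f6e36e9b851a", 1)
except (socket.timeout, socket.error):
    # The caller gives up and closes the socket here; when vdsm later
    # calls send_response(200), its write fails with
    # "error: [Errno 32] Broken pipe", as in the traceback above.
    pass

The caller treating the timed-out verb as a failure is consistent with the hosts then being moved to Non-Operational and cycling through the auto-recovery loop described above.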

Comment 2 Omri Hochman 2012-07-09 14:56:23 UTC
Created attachment 597096 [details]
full_logs

Adding VDSM logs from two different hosts on which refreshStoragePool failed on timeout, plus the RHEV-M logs.

Comment 8 Omri Hochman 2012-11-07 12:50:39 UTC
[RHEVM 3.1] [vdsm-4.9.6-40.0.el6_3.x86_64] - the bug still reproduces.
This bug is preventing us from continuing with scale storage tests:
with a storage pool containing more than 50 storage domains, attempting to activate additional storage domains makes refreshStoragePool fail on timeout (3:30 minutes), and the hosts are switched to Non-Operational.

Comment 10 Allon Mureinik 2013-04-10 09:03:47 UTC
This will probably be solved by the same solution as bug 912158, pending finalization of that patch to verify.

Comment 11 Ayal Baron 2013-05-01 19:21:43 UTC
(In reply to comment #10)
> This will probably be solved by the same solution as bug 912158, pending
> finalization of that patch to verify.

Allon, if the patch indeed solved this bug then please mark it appropriately.