- 3.1 - VDSM [Scalability]: On scale environment refreshStoragePool fails on timeout and caused hosts to stuck into Auto-recovery treatment. Description : ************** I'm working on Scale environment:40 hosts and 100 iSCSI Storage Domains (EMC machine) - No VM's ,No Virtual Disks! I Attempted to Activate Several Storage Domains at once - the action took long time and failed. after few SPM Fail-overs, some of the hosts (17 out of 40) got into 'rhevm auto-recovery loop' - that means their statuses switched from 'Non-Operational' --> to 'Unassigned' --> to 'Up' -->to 'Non-Operational' -- > back to 'Unassigned', and so on.. Looking at the vdsm.log (on the 'Unassigned' hosts), it looks that the HSM Command 'refreshStoragePool' over and over took long time (over 6 minutes) and kept falling on SSL timeout. (broken pipe error) Note: ******** A) The bug report is for the slowness of the 'refreshStoragePool command', it took over 6 minutes. B) Follow Thread-38098 'Run and protect: refreshStoragePool' ---> see that the action that took most of the time (4 minutes out of 6)is the SamplingMethod::(__call__) . VDSM.log ********* ##Begins: Thread-38098::DEBUG::2012-07-09 08:28:51,498::BindingXMLRPC::164::vds::(wrapper) [10.35.160.91] Thread-38098::DEBUG::2012-07-09 08:28:51,499::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state init -> state preparing Thread-38098::INFO::2012-07-09 08:28:51,500::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='50d275f3-2029-4fd9-bc86-20333f3cb24f', msdUUID='35459266-07dd-4f15-92c1-f6e36e9b851a', masterVersion=1, options=None) ##Calls: SamplingMethod: Thread-38098::DEBUG::2012-07-09 08:31:24,330::misc::1088::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads) .. .. .. ##4 minutes after.. Thread-38098::DEBUG::2012-07-09 08:35:23,295::misc::1090::SamplingMethod::(__call__) Got in to sampling method Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::464::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex Thread-38098::DEBUG::2012-07-09 08:35:23,297::lvm::466::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::475::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::487::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex Thread-38098::DEBUG::2012-07-09 08:35:23,298::lvm::349::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex ## fails on timeout with Broken pipe : hread-38098::ERROR::2012-07-09 08:35:34,911::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.160.91', 37831) Traceback (most recent call last): File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread self.finish_request(request, client_address) File "/usr/lib/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request client_address) File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request self.RequestHandlerClass(request, client_address, self) File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ self.handle() File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle self.handle_one_request() File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request method() File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST self.send_response(200) File "/usr/lib64/python2.6/BaseHTTPServer.py", line 385, in send_response self.send_header('Date', self.date_time_string()) File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header self.wfile.write("%s: %s\r\n" % (keyword, value)) File "/usr/lib64/python2.6/socket.py", line 324, in write self.flush() File "/usr/lib64/python2.6/socket.py", line 303, in flush self._sock.sendall(buffer(data, write_offset, buffer_size)) File "/usr/lib64/python2.6/ssl.py", line 203, in sendall v = self.send(data[count:]) File "/usr/lib64/python2.6/ssl.py", line 174, in send v = self._sslobj.write(data) error: [Errno 32] Broken pipe
Created attachment 597096 [details] full_logs Adding - VDSM Logs from two different hosts which had refreshStoragePool that failed on timeout + RHEVM Logs.
[RHEVM 3.1] [vdsm-4.9.6-40.0.el6_3.x86_64] - bug still reproduce. This bug preventing us from continuing with scale storage tests - when having storage pool with more than 50 Storage Domains and attempting to activate additional storage domains, refreshStoragePool fails on timeout(3:30 minutes) and hosts are being switched to non-operational.
This will probably be solved by same solution as bug 912158, pending finilization of that patch to verify.
(In reply to comment #10) > This will probably be solved by same solution as bug 912158, pending > finilization of that patch to verify. Allon, if the patch indeed solved this bug then please mark it appropriately.