Description of problem:
On a performance environment, adding a direct LUN disk to a VM makes the monitor fail on a timeout and the VM become unresponsive.
Start and end of adding the direct LUN:
Started: Nov 26, 2017, 11:15:46 AM
Completed: Nov 26, 2017, 11:16:55 AM
First failure in the attached VDSM log:
2017-11-26 11:17:22,158+0200 WARN (jsonrpc/0) [virt.vm] (vmId='ee2898c6-4f68-4354-850a-e39218c37249') monitor became unresponsive (command timeout, age=62.6400000006) (vm:5798)

Version-Release number of selected component (if applicable):
4.20.7

How reproducible:
Always

Steps to Reproduce:
1. Deploy the performance lab as defined below
2. Start all VMs
3. Add a direct LUN disk to a VM on host #3

System topology (using bare metal hosts):
1 DC
1 cluster
3 hosts:
  host 1: SPM + 50 VMs
  host 2: 150 VMs
  host 3: 100 VMs
40 SDs
300 VMs, residing in pools, each with 2 thin-provisioned QCOW2 disks
The DB resides on the engine server, with data on an iSCSI LUN connected to the XtremIO storage.
Each host is connected with 4 iSCSI paths.

Actual results:
The VM becomes unresponsive several times and then goes back online.

Expected results:
The VM stays up.

Additional info:
Logs attached.
Just to make it clear: the VMs are on the non-SPM host, the one you add the direct LUN to, right?
Anything interesting in /var/log/messages from that time? (I'm also wondering whether we'll need libvirt in debug mode to understand what's going on.) This is with the VDSM watermark-event (and not polling) code, right?
So a monitor call took over 60 seconds, and that is the usual reason for a VM to transition to non-responsive. The question is: if we are using EVENTS, why is it still behaving like this? Francesco, if we use BLOCK_THRESHOLD then it doesn't make sense to mark the VM unresponsive. It is probably a bug in when we update the VM's sample stats_age. If scanning LVs is stalling the stats_age write, then it is wrong to conclude that the VM is non-responsive. Anyhow, this is from a high-level reading of the code; I may be missing something.
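For reference, the decision being questioned here is essentially a staleness check on the last stats sample. A minimal sketch of that logic (hypothetical names and threshold, not the actual vdsm code):

```python
import time

# Sketch of the "monitor became unresponsive" decision: a VM is
# flagged unresponsive when the age of its newest stats sample
# exceeds a timeout. Names and values below are illustrative.
TIMEOUT = 60.0  # seconds

def is_unresponsive(last_sample_time, now=None):
    """Return (unresponsive, age): the VM is considered unresponsive
    when its newest stats sample is older than TIMEOUT seconds."""
    if now is None:
        now = time.monotonic()
    age = now - last_sample_time
    return age > TIMEOUT, age

# If anything stalls the thread that updates last_sample_time (e.g.
# an LVM scan), the age keeps growing even though the guest itself is
# healthy -- which is the mis-classification suggested above.
unresponsive, age = is_unresponsive(last_sample_time=0.0, now=62.64)
```

With an age of 62.64 seconds, as in the log quoted in the description, such a check would flag the VM even if the stall was entirely on the host side.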
yes, the VMs on the non-SPM host.
With the help of the storage team, we need to understand the meaning of those warnings:

2017-11-27 11:00:20,394+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/6c89b067-629f-4a1b-82a5-a113cfa78f92/metadata running next_check=4882238.55 at 0x39a1550>> delayed by 7.97 seconds (asyncutils:138)
2017-11-27 11:00:20,462+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/a9bbc4ed-4f73-4bab-be49-e17cf3db8f86/metadata running next_check=4882238.74 at 0x2e0be10>> delayed by 7.85 seconds (asyncutils:138)
2017-11-27 11:00:20,530+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/ac5502b5-2925-44e1-8165-14654d2d3133/metadata running next_check=4882239.12 at 0x3b1e990>> delayed by 7.54 seconds (asyncutils:138)
2017-11-27 11:00:20,611+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/714a22ae-a47e-47ac-8194-3388e69c648c/metadata running next_check=4882240.98 at 0x3b63450>> delayed by 5.76 seconds (asyncutils:138)
2017-11-27 11:00:20,678+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/4a2cb174-b9bb-4d75-a336-a06dd84ede0c/metadata running next_check=4882241.03 at 0x3b83bd0>> delayed by 5.77 seconds (asyncutils:138)
2017-11-27 11:00:20,744+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/858a9c96-0676-43dd-a5a9-611ac71fb4d2/metadata running next_check=4882241.38 at 0x3b0ced0>> delayed by 5.49 seconds (asyncutils:138)
2017-11-27 11:00:20,865+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/7a70f116-f90c-4d1e-a11a-790c874dc76d/metadata running next_check=4882241.42 at 0x3b0f390>> delayed by 5.57 seconds (asyncutils:138)
2017-11-27 11:00:21,013+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/ca66ed15-529b-4c75-98d8-1f3be915de47/metadata running next_check=4882241.43 at 0x3b0fc10>> delayed by 5.71 seconds (asyncutils:138)
2017-11-27 11:00:21,083+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/9a28add6-ed7f-4572-a992-7bd6ec057bd8/metadata running next_check=4882241.46 at 0x3b0ff10>> delayed by 5.75 seconds (asyncutils:138)
2017-11-27 11:00:21,154+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/a3ac3dd0-f3ff-447f-8ed2-55e6ba0fa969/metadata running next_check=4882241.48 at 0x2de0250>> delayed by 5.80 seconds (asyncutils:138)
2017-11-27 11:00:21,327+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/689dd9cf-fa90-4057-8b8d-aeef77658123/metadata running next_check=4882241.50 at 0x2de0550>> delayed by 5.95 seconds (asyncutils:138)
2017-11-27 11:00:21,430+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/5533fa3b-7b56-4508-80a8-bb292ff50ae4/metadata running next_check=4882241.51 at 0x2de0850>> delayed by 6.05 seconds (asyncutils:138)
2017-11-27 11:00:21,499+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e13f8b37-7e30-49eb-8734-b0bdf6e89c88/metadata running next_check=4882241.53 at 0x2de0cd0>> delayed by 6.10 seconds (asyncutils:138)
2017-11-27 11:00:21,568+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/9b56cfd5-89c5-49f1-9e63-9adc04d6bb5d/metadata running next_check=4882241.55 at 0x2deb2d0>> delayed by 6.14 seconds (asyncutils:138)
2017-11-27 11:00:21,636+0200 WARN (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/c4e654e2-e36e-47ab-a602-90478be338bf/metadata running next_check=4882241.57 at 0x2deb410>> delayed by 6.19 seconds (asyncutils:138)

We first need to understand whether the storage subsystem is fine or whether there are already some issues here. Nir, could you please help us understand the meaning of the warnings above?
(In reply to Francesco Romani from comment #14)
> With the help of the storage team, we need to understand the meaning of
> those warnings:
>
> 2017-11-27 11:00:20,394+0200 WARN (check/loop) [storage.asyncutils] Call
> <bound method DirectioChecker._check of <DirectioChecker
> /dev/6c89b067-629f-4a1b-82a5-a113cfa78f92/metadata running
> next_check=4882238.55
> at 0x39a1550>> delayed by 7.97 seconds (asyncutils:138)

This means that the storage check loop thread was blocked by some call, and running the checker for domain 6c89b067-629f-4a1b-82a5-a113cfa78f92 was delayed by 7.97 seconds.

This delay is not critical and should not cause any issue, but it should never happen; the system is designed so that such a delay never occurs. There are several options:
- There is a bug in the check loop thread, calling something that may block. We do not call anything that may block for seconds.
- Writing logs blocks. We don't have infrastructure to drop logs if writing them becomes too slow.
- The system is so overloaded that starting a child process takes several seconds.
- A bad Python extension is doing a blocking call without releasing the GIL.

This does not say anything about a storage issue; the monitoring system is not affected by inaccessible shared storage.
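The delay reported in the warning is simply the gap between the check's scheduled deadline and when the loop actually got to run it. A simplified model of that measurement (illustrative only, not the actual storage.asyncutils implementation):

```python
# Simplified model of how an event-loop scheduler detects that a
# periodic check ran late: compare the scheduled deadline with the
# time the loop actually reached it. The threshold is an assumption
# for illustration, not vdsm's configured value.
DELAY_THRESHOLD = 0.5  # warn if a check runs this many seconds late

def check_delay(next_check, now):
    """Return the delay in seconds past the scheduled deadline,
    or 0.0 if the check ran on time."""
    return max(0.0, now - next_check)

def should_warn(next_check, now, threshold=DELAY_THRESHOLD):
    # Any significant delay means the single check-loop thread was
    # blocked; since everything in that loop is supposed to be
    # non-blocking, a delay points at a bug or an overloaded host,
    # as explained above.
    return check_delay(next_check, now) > threshold
```

For example, a checker with next_check=4882238.55 that actually ran at 4882246.52 would report a 7.97-second delay, matching the first warning quoted above.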
Francesco, any insights from Nir's comment?
(In reply to Tal Nisan from comment #16)
> Francesco, any insights from Nir's comment?

I agree with Nir's analysis. Also, I think it is worth checking whether we can still reproduce this with a recent 4.20 release; we recently reached 4.20.33 (!!), with a constant stream of fixes and performance optimizations.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
(In reply to Francesco Romani from comment #17)
> (In reply to Tal Nisan from comment #16)
> > Francesco, any insights from Nir's comment?
>
> I agree with Nir's analysis. Also, I think it is worth checking whether we
> can still reproduce this with a recent 4.20 release; we recently reached
> 4.20.33 (!!), with a constant stream of fixes and performance optimizations.

Mordechai, can this be tested with 4.2.8 to see if the issue still reproduces?
We moved our setups to 4.3 and, like all of QE, would like to focus on 4.3. Would it be OK if we test on the latest 4.3?
(In reply to Daniel Gur from comment #20) > We moved our setups to 4.3 and would as all qe would like to be focused on > 4.3 > Would it be ok if we test on 4.3 latest? Yes
Created attachment 1543759 [details] vdsm log from reproduction
Reproduces in 4.3.1, vdsm-4.30.9-1.el7ev.x86_64.

The environment has 5k+ VMs and 500 hosts. The test was performed in a DC with 2 hosts (1st host: 248 VMs, 2nd host: 254 VMs) and 61 FC domains. The VMs in this DC are virt-enabled and used to install nested hosts.

On the non-SPM host, using a VM called 'vm_virt_enabled__335__L1_Group', attempting to add a direct LUN disk resulted in vm_virt_enabled__335__L1_Group going from the up state to the down state in the engine UI and then back up. Additional VMs went to "not responding" afterwards.

There are multiple exceptions in the vdsm and messages logs, some possibly related to the high number of VMs:
- vdsm.virt.sampling.VMBulkstatsMonitor
- vdsm.virt.sampling.HostMonitor

Adding the direct LUN resulted in:

2019-03-13 18:13:00,245+0000 WARN (jsonrpc/5) [virt.vm] (vmId='c6658921-5b6f-4378-ae34-06629884554a') monitor became unresponsive (command timeout, age=65.4900000002) (vm:6024)
2019-03-13 18:13:00,261+0000 INFO (jsonrpc/5) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,51778 (api:54)
2019-03-13 18:13:00,357+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.48 seconds (__init__:312)
2019-03-13 18:13:00,433+0000 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:10.12.69.30,43870 (api:48)
2019-03-13 18:13:00,499+0000 WARN (jsonrpc/5) [virt.vm] (vmId='dbf080b6-5aba-4720-9388-67c7bb26bce3') monitor became unresponsive (command timeout, age=65.7400000002) (vm:6024)
2019-03-13 18:13:00,523+0000 INFO (jsonrpc/6) [api.host] START getAllVmStats() from=::1,51778 (api:48)
2019-03-13 18:13:00,596+0000 WARN (jsonrpc/6) [virt.vm] (vmId='dbf080b6-5aba-4720-9388-67c7bb26bce3') monitor became unresponsive (command timeout, age=65.8399999999) (vm:6024)
2019-03-13 18:13:00,834+0000 WARN (jsonrpc/6) [virt.vm] (vmId='c6658921-5b6f-4378-ae34-06629884554a') monitor became unresponsive (command timeout, age=66.0800000001) (vm:6024)

VDSM/messages logs attached; see previous comment.
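To quantify how widespread the stalls are across VMs, warnings like the above can be pulled out of vdsm.log with a short script. This is a triage sketch (not part of vdsm); the regex matches the log lines quoted in this comment:

```python
import re

# Extract (vmId, age) pairs from "monitor became unresponsive" lines
# in vdsm.log, in the format quoted above.
PATTERN = re.compile(
    r"vmId='(?P<vm>[0-9a-f-]+)'\) monitor became unresponsive "
    r"\(command timeout, age=(?P<age>[\d.]+)\)"
)

def unresponsive_events(lines):
    """Yield (vm_id, age_in_seconds) for every matching log line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group("vm"), float(m.group("age"))

# Example run against one of the lines quoted in this comment:
sample = [
    "2019-03-13 18:13:00,245+0000 WARN (jsonrpc/5) [virt.vm] "
    "(vmId='c6658921-5b6f-4378-ae34-06629884554a') monitor became "
    "unresponsive (command timeout, age=65.4900000002) (vm:6024)",
]
events = list(unresponsive_events(sample))
```

Feeding the whole vdsm.log through this would show how many distinct VMs were flagged and how the ages cluster around the add-disk operation.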
We haven't gotten to this bug in more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I'm suggesting to close it. If you feel this needs to be addressed and want to work on it, please remove the cond nack and target accordingly.
Closing old bug. Please reopen if still relevant/you want to work on it.