Bug 1517535 - [scale] - VM monitor timeout when adding direct LUN disk to a VM
Summary: [scale] - VM monitor timeout when adding direct LUN disk to a VM
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.20.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nir Soffer
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-26 13:21 UTC by guy chen
Modified: 2020-06-26 16:38 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-01 14:48:23 UTC
oVirt Team: Storage
Embargoed:
ylavi: ovirt-4.3+
sbonazzo: ovirt-4.5?


Attachments
vdsm log from reproduction (11.45 MB, application/zip)
2019-03-13 22:07 UTC, mlehrer

Description guy chen 2017-11-26 13:21:53 UTC
Description of problem:
On a performance environment, when adding a direct LUN disk to a VM, the monitor fails on timeout and the VM becomes unresponsive. Start and end of the direct LUN add:

Started: Nov 26, 2017, 11:15:46 AM
Completed: Nov 26, 2017, 11:16:55 AM

First failure in the attached VDSM log:
2017-11-26 11:17:22,158+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='ee2898c6-4f68-4354-850a-e39218c37249') monitor became unresponsive (command timeout, age=62.6400000006) (vm:5798)

Version-Release number of selected component (if applicable):
4.20.7

How reproducible:
Always

Steps to Reproduce:
1. Deploy the performance lab as defined below
2. Start all VMs
3. Add a direct LUN disk to a VM on host 3

System topology (using bare metal hosts):
1 DC
1 cluster
3 hosts:
host 1: SPM + 50 VMs
host 2: 150 VMs
host 3: 100 VMs
40 SDs
300 VMs - reside in pools, each with 2 thin provisioned QCOW2 disks
DB resides on the engine server, with data on an iSCSI LUN connected to the XtremIO storage
Hosts connected with 4 iSCSI paths


Actual results:
VM becomes unresponsive several times and then goes back online

Expected results:
VM stays up

Additional info:
logs attached

Comment 4 Roy Golan 2017-11-27 08:34:49 UTC
Just to make it clear: the VMs are on the non-SPM host, which is the one you add the direct LUN to, right?

Comment 5 Yaniv Kaul 2017-11-27 08:54:26 UTC
Anything interesting on /var/log/messages from that time? (I'm also wondering if we'll need libvirt in debug to understand what's going on).
This is with the VDSM watermark event (and not polling) code, right?

Comment 6 Roy Golan 2017-11-27 09:06:40 UTC
So a monitor took over 60 seconds, and that is the usual reason for a VM to transition to non-responsive. The question is: if we are using EVENTS, why is it still behaving like that?

Francesco, if we use BLOCK_THRESHOLD then it doesn't make sense to mark the VM unresponsive. It is probably a bug in how we mark the VM sample stats_age?
If scanning LVs is stalling the stats_age write, then it is a wrong call to say that the VM is non-responsive. Anyhow, this is from a high-level reading of the code; I may be missing something.
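
For context, here is a minimal sketch (not vdsm's actual code) of the kind of responsiveness check being discussed: a VM is flagged unresponsive when the age of its last successful stats sample exceeds a timeout. The names and the threshold below are illustrative assumptions.

import time

MONITOR_TIMEOUT = 60  # seconds; illustrative threshold matching the "age" in the warnings

def is_monitor_responsive(last_stats_timestamp, now=None):
    # A VM is flagged unresponsive when the last successful stats sample
    # is older than the timeout; "age" is what the warning prints.
    now = time.monotonic() if now is None else now
    age = now - last_stats_timestamp
    return age <= MONITOR_TIMEOUT, age

# Example: a sample that is 62.64 seconds old would correspond to a warning like
# "monitor became unresponsive (command timeout, age=62.64)".
responsive, age = is_monitor_responsive(time.monotonic() - 62.64)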

Comment 7 guy chen 2017-11-27 09:36:13 UTC
Yes, the VMs are on the non-SPM host.

Comment 14 Francesco Romani 2017-11-27 12:58:12 UTC
With the help of the storage team, we need to understand the meaning of those warnings:

2017-11-27 11:00:20,394+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/6c89b067-629f-4a1b-82a5-a113cfa78f92/metadata running next_check=4882238.55 at 0x39a1550>> delayed by 7.97 seconds (asyncutils:138)
2017-11-27 11:00:20,462+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/a9bbc4ed-4f73-4bab-be49-e17cf3db8f86/metadata running next_check=4882238.74 at 0x2e0be10>> delayed by 7.85 seconds (asyncutils:138)
2017-11-27 11:00:20,530+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/ac5502b5-2925-44e1-8165-14654d2d3133/metadata running next_check=4882239.12 at 0x3b1e990>> delayed by 7.54 seconds (asyncutils:138)
2017-11-27 11:00:20,611+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/714a22ae-a47e-47ac-8194-3388e69c648c/metadata running next_check=4882240.98 at 0x3b63450>> delayed by 5.76 seconds (asyncutils:138)
2017-11-27 11:00:20,678+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/4a2cb174-b9bb-4d75-a336-a06dd84ede0c/metadata running next_check=4882241.03 at 0x3b83bd0>> delayed by 5.77 seconds (asyncutils:138)
2017-11-27 11:00:20,744+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/858a9c96-0676-43dd-a5a9-611ac71fb4d2/metadata running next_check=4882241.38 at 0x3b0ced0>> delayed by 5.49 seconds (asyncutils:138)
2017-11-27 11:00:20,865+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/7a70f116-f90c-4d1e-a11a-790c874dc76d/metadata running next_check=4882241.42 at 0x3b0f390>> delayed by 5.57 seconds (asyncutils:138)
2017-11-27 11:00:21,013+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/ca66ed15-529b-4c75-98d8-1f3be915de47/metadata running next_check=4882241.43 at 0x3b0fc10>> delayed by 5.71 seconds (asyncutils:138)
2017-11-27 11:00:21,083+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/9a28add6-ed7f-4572-a992-7bd6ec057bd8/metadata running next_check=4882241.46 at 0x3b0ff10>> delayed by 5.75 seconds (asyncutils:138)
2017-11-27 11:00:21,154+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/a3ac3dd0-f3ff-447f-8ed2-55e6ba0fa969/metadata running next_check=4882241.48 at 0x2de0250>> delayed by 5.80 seconds (asyncutils:138)
2017-11-27 11:00:21,327+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/689dd9cf-fa90-4057-8b8d-aeef77658123/metadata running next_check=4882241.50 at 0x2de0550>> delayed by 5.95 seconds (asyncutils:138)
2017-11-27 11:00:21,430+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/5533fa3b-7b56-4508-80a8-bb292ff50ae4/metadata running next_check=4882241.51 at 0x2de0850>> delayed by 6.05 seconds (asyncutils:138)
2017-11-27 11:00:21,499+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/e13f8b37-7e30-49eb-8734-b0bdf6e89c88/metadata running next_check=4882241.53 at 0x2de0cd0>> delayed by 6.10 seconds (asyncutils:138)
2017-11-27 11:00:21,568+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/9b56cfd5-89c5-49f1-9e63-9adc04d6bb5d/metadata running next_check=4882241.55 at 0x2deb2d0>> delayed by 6.14 seconds (asyncutils:138)
2017-11-27 11:00:21,636+0200 WARN  (check/loop) [storage.asyncutils] Call <bound method DirectioChecker._check of <DirectioChecker /dev/c4e654e2-e36e-47ab-a602-90478be338bf/metadata running next_check=4882241.57 at 0x2deb410>> delayed by 6.19 seconds (asyncutils:138)

We first need to understand whether the storage subsystem is fine or there are already some issues here.

Nir, could you please help understand the meaning of the warnings above?

Comment 15 Nir Soffer 2017-11-27 16:20:41 UTC
(In reply to Francesco Romani from comment #14)
> With the help of the storage team, we need to understand the meaning of
> those warnings:
> 
> 2017-11-27 11:00:20,394+0200 WARN  (check/loop) [storage.asyncutils] Call
> <bound method DirectioChecker._check of <DirectioChecker
> /dev/6c89b067-629f-4a1b-82a5-a113cfa78f92/metadata running
> next_check=4882238.55
>  at 0x39a1550>> delayed by 7.97 seconds (asyncutils:138)

This means that the storage check loop thread was blocked by some call, and running the checker for domain 6c89b067-629f-4a1b-82a5-a113cfa78f92 was delayed by 7.97 seconds.

This delay is not critical and should not cause any issue, but it should never happen; the system is designed so that such a delay does not occur.
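
As an illustration only (not vdsm's actual asyncutils code): the delay in the warning can be thought of as the gap between the scheduled next_check deadline and the moment the loop actually ran the checker. The function and parameter names below are assumptions.

import logging
import time

log = logging.getLogger("storage.asyncutils")

def run_scheduled_check(check, next_check, threshold=1.0):
    # If the event loop was blocked, we reach the checker after its deadline;
    # the gap is the "delayed by N seconds" value reported in the warnings.
    delay = time.monotonic() - next_check
    if delay > threshold:
        log.warning("Call %r delayed by %.2f seconds", check, delay)
    check()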

There are several options:
- There is a bug in the check loop thread, calling something that may block; we do not call anything that should block for seconds.
- Writing logs blocks; we don't have infrastructure to drop logs if writing logs becomes too slow.
- The system is so overloaded that starting a child process takes several seconds.
- There is a bad Python extension doing a blocking call without releasing the GIL.

This does not say anything about a storage issue; the monitoring system is not affected by inaccessible shared storage.
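
To illustrate that last design point only, here is a hypothetical sketch of a checker that offloads the direct-I/O read to a child process, so a slow or inaccessible LUN can never block the loop that polls it. This is a sketch under stated assumptions, not vdsm's actual DirectioChecker.

import subprocess

class DirectReadCheck:
    # Start a direct-I/O read of the first 4 KiB in a child process;
    # the event loop only polls for completion and is never blocked.
    def __init__(self, path):
        self._proc = subprocess.Popen(
            ["dd", "if=" + path, "of=/dev/null",
             "bs=4096", "count=1", "iflag=direct"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    def poll(self):
        # True = check succeeded, False = failed, None = still running
        # (for example, because the storage path is hung).
        rc = self._proc.poll()
        if rc is None:
            return None
        return rc == 0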

Comment 16 Tal Nisan 2018-07-16 09:03:48 UTC
Francesco, any insights from Nir's comment?

Comment 17 Francesco Romani 2018-07-16 09:13:47 UTC
(In reply to Tal Nisan from comment #16)
> Francesco, any insights from Nir's comment?

I agree with Nir's analysis. Also, I think it is worth checking whether we can still reproduce this using a recent 4.20 release; we reached 4.20.33 (!!) recently, with a constant stream of fixes and performance optimizations.

Comment 18 Sandro Bonazzola 2019-01-28 09:41:00 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 19 Tal Nisan 2019-03-01 03:45:26 UTC
(In reply to Francesco Romani from comment #17)
> (In reply to Tal Nisan from comment #16)
> > Francesco, any insights from Nir's comment?
> 
> I agree with Nir's analysis. Also, I think is worth checking if we still can
> reproduce using a recent 4.20 release, we reached 4.20.33 (!!) recently,
> with a constant stream of fixes and performance optimizations.

Mordechai, can this be tested with 4.2.8 to see if the issue still reproduces?

Comment 20 Daniel Gur 2019-03-03 09:51:11 UTC
We moved our setups to 4.3 and, like all of QE, would like to be focused on 4.3.
Would it be OK if we test on the latest 4.3?

Comment 21 Tal Nisan 2019-03-13 16:53:06 UTC
(In reply to Daniel Gur from comment #20)
> We moved our setups to 4.3 and would as all qe would like to be focused on
> 4.3
> Would it be ok if we test on 4.3 latest?

Yes

Comment 22 mlehrer 2019-03-13 22:07:22 UTC
Created attachment 1543759 [details]
vdsm log from reproduction

Comment 23 mlehrer 2019-03-13 22:14:25 UTC
Reproduces in 4.3.1 with vdsm-4.30.9-1.el7ev.x86_64.
ENV has 5k+ 500 Hosts.

Test performed in a DC with 2 hosts (host 1: 248 VMs, host 2: 254 VMs) and 61 FC domains. VMs in this DC are virt-enabled and used to install nested hosts.
On the non-SPM host, attempting to add a direct LUN disk to the VM called 'vm_virt_enabled__335__L1_Group' resulted in the VM going from Up to Down in the Engine UI and then back up. Additional VMs went to Not Responding afterwards.

Multiple exceptions in the vdsm and messages logs, some possibly related to the high number of VMs:
- vdsm.virt.sampling.VMBulkstatsMonitor
- vdsm.virt.sampling.HostMonitor

Adding the direct LUN resulted in:

2019-03-13 18:13:00,245+0000 WARN  (jsonrpc/5) [virt.vm] (vmId='c6658921-5b6f-4378-ae34-06629884554a') monitor became unresponsive (command timeout, age=65.4900000002) (vm:6024)
2019-03-13 18:13:00,261+0000 INFO  (jsonrpc/5) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,51778 (api:54)
2019-03-13 18:13:00,357+0000 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.48 seconds (__init__:312)
2019-03-13 18:13:00,433+0000 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:10.12.69.30,43870 (api:48)
2019-03-13 18:13:00,499+0000 WARN  (jsonrpc/5) [virt.vm] (vmId='dbf080b6-5aba-4720-9388-67c7bb26bce3') monitor became unresponsive (command timeout, age=65.7400000002) (vm:6024)
2019-03-13 18:13:00,523+0000 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::1,51778 (api:48)
2019-03-13 18:13:00,596+0000 WARN  (jsonrpc/6) [virt.vm] (vmId='dbf080b6-5aba-4720-9388-67c7bb26bce3') monitor became unresponsive (command timeout, age=65.8399999999) (vm:6024)
2019-03-13 18:13:00,834+0000 WARN  (jsonrpc/6) [virt.vm] (vmId='c6658921-5b6f-4378-ae34-06629884554a') monitor became unresponsive (command timeout, age=66.0800000001) (vm:6024)

Vdsm/messages logs attached; see previous comment.

Comment 25 Michal Skrivanek 2020-03-19 15:42:15 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it, please remove the cond nack and target accordingly.

Comment 26 Michal Skrivanek 2020-04-01 14:48:23 UTC
Closing old bug. Please reopen if still relevant/you want to work on it.

