Bug 1758094 - [downstream clone - 4.3.7] [scale] VMs unresponsive due to delayed getVolumeSize
Summary: [downstream clone - 4.3.7] [scale] VMs unresponsive due to delayed getVolumeSize
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.11
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.3.7
Target Release: 4.3.7
Assignee: Nir Soffer
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On: 1598266
Blocks:
 
Reported: 2019-10-03 09:06 UTC by RHV bug bot
Modified: 2023-12-15 16:48 UTC (History)
CC List: 23 users

Fixed In Version: vdsm-4.30.34
Doc Type: Bug Fix
Doc Text:
Previously, when a system had many Fibre Channel (FC) LUNs with many paths per LUN and a high I/O load, scanning of FC devices became slow. This caused timeouts in monitoring virtual machine disk size and made virtual machines non-responsive. The current release mitigates this issue by optimizing FC scans for speed, so VMs are much less likely to become non-responsive.
Clone Of: 1598266
Environment:
Last Closed: 2019-12-12 10:36:52 UTC
oVirt Team: Scale
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44355 0 None None None 2021-12-10 18:37:57 UTC
Red Hat Knowledge Base (Solution) 3939711 0 None None None 2019-10-03 09:08:19 UTC
Red Hat Product Errata RHBA-2019:4230 0 None None None 2019-12-12 10:37:13 UTC
oVirt gerrit 103382 0 'None' MERGED fc-scan: Allow disabling device scanning 2020-12-23 22:56:04 UTC

Description RHV bug bot 2019-10-03 09:06:00 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1598266 +++
======================================================================

Description of problem:
Several VMs get unresponsive at the same time, regularly.

2018-07-04 08:27:11,986+0200 vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60' age=62.8300000001
2018-07-04 08:27:11,988+0200 vmId='def11d1f-9cb0-47be-9c45-7b961d275c40' age=62.8300000001
2018-07-04 08:27:11,990+0200 vmId='d3ffccf2-7743-463c-bd03-99cff6106ff0' age=62.8399999999
2018-07-04 08:27:11,992+0200 vmId='c8f91bed-1e71-4408-a76b-2bf02d55aa48' age=62.8399999999
2018-07-04 08:27:11,993+0200 vmId='01b4911b-d588-435e-a86a-a5a476da6362' age=62.8399999999
2018-07-04 08:27:11,995+0200 vmId='bd2dd808-af9a-40e9-b997-316f05d06bb9' age=62.8399999999
2018-07-04 08:27:11,997+0200 vmId='d7982c76-8d6b-496a-85cb-548478816e54' age=62.8399999999
2018-07-04 08:27:11,998+0200 vmId='ed10e819-cb6b-4f36-bedf-41fa0c89cab4' age=62.8399999999
2018-07-04 08:27:12,000+0200 vmId='247144df-6844-4f35-a728-80f25ad72971' age=62.8499999996
2018-07-04 08:27:12,001+0200 vmId='331d4145-d02c-4bfa-a136-1018839162be' age=62.8499999996
2018-07-04 08:27:12,003+0200 vmId='e7945585-5304-47b8-a533-89350ddac7b7' age=62.8499999996
2018-07-04 08:27:12,006+0200 vmId='73d41cb9-6564-44ca-bf94-a5aeb7c76a1c' age=62.8499999996
2018-07-04 08:27:12,007+0200 vmId='52589290-0033-4a97-b393-3650d7af7ef2' age=62.8499999996
2018-07-04 08:27:12,009+0200 vmId='8e7a575b-3f7d-4453-be96-190d0226d077' age=62.8599999994
2018-07-04 08:27:12,010+0200 vmId='61da1520-21b2-439d-8ae6-38c1ce9127df' age=62.8599999994
2018-07-04 08:27:12,012+0200 vmId='7701dfda-a3c6-456f-bd20-8db0f6f2373a' age=62.8599999994
2018-07-04 08:27:12,013+0200 vmId='89012d87-e8c3-466f-83a1-d673d1a7537b' age=62.8599999994
2018-07-04 08:27:12,015+0200 vmId='1ded83c2-4879-42d9-9bcd-ced3f14bcf03' age=62.8599999994
2018-07-04 08:27:12,017+0200 vmId='f6ff4904-5a50-40fd-bbae-e2ad8d6de1b7' age=62.8599999994
2018-07-04 08:27:12,018+0200 vmId='edd902e4-0e98-400e-9879-a320080bd148' age=62.8599999994
2018-07-04 08:27:12,020+0200 vmId='bdef5563-53d6-473b-968b-b1b279176682' age=62.870000001
2018-07-04 08:27:12,021+0200 vmId='4529e6fa-70a2-4233-abe5-4718090117a3' age=62.870000001
2018-07-04 08:27:12,023+0200 vmId='9174301d-fa75-4ad7-84a2-91b3b22db987' age=62.870000001
2018-07-04 08:27:12,025+0200 vmId='17f21001-8dfa-4786-89e6-27e0b157be9f' age=62.870000001
2018-07-04 08:27:12,026+0200 vmId='1247742f-7726-458f-8294-fbaa33ffce77' age=62.870000001
2018-07-04 08:27:12,028+0200 vmId='4c7d3aff-dad1-4108-883d-069ba0a4a4fb' age=62.870000001
2018-07-04 08:27:12,029+0200 vmId='7ac88b8f-db4c-4456-a1a4-310ff0bf5b39' age=62.8800000008
2018-07-04 08:27:12,031+0200 vmId='abacdc84-c107-42ef-abf6-13803e118a40' age=62.8800000008
2018-07-04 08:27:12,032+0200 vmId='3a42f097-4a4f-4c91-bd32-85e0d48bedf0' age=62.8800000008
2018-07-04 08:27:12,034+0200 vmId='60b2d3f0-9b38-4d6b-93f9-fe0d97857582' age=62.8800000008
2018-07-04 08:27:12,035+0200 vmId='1e693cd3-4cae-4a60-b914-2daaa0f40791' age=62.8800000008
2018-07-04 08:27:12,037+0200 vmId='bc822a5a-631d-4f1b-ac18-c760baaf657e' age=62.8800000008
2018-07-04 08:27:12,038+0200 vmId='bdfbc05b-5b8c-4be5-aedd-78e987c21d00' age=62.8800000008
2018-07-04 08:27:12,039+0200 vmId='e46492df-4da3-45a0-b49f-82245e0173a5' age=62.8900000006
2018-07-04 08:27:12,041+0200 vmId='288908d5-6379-466b-88a0-2c4cca2d3900' age=62.8900000006
2018-07-04 08:27:12,042+0200 vmId='87441f46-d891-44c6-8483-4516bfeac757' age=62.8900000006

Let's pick the first one, 86dfa0fc-a4ed-4362-83a4-5bec1ca98d60

These are its disks:
  SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG ab3c2cda-c54b-497a-b7a7-342e658324db
  SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG caa85e31-02c9-40cb-84ba-0693b1182314
  DLUN 360001440000000107071b5b4d8ebc59f

When the mass unresponsiveness occurs, we can see getVolumeSize() called from UpdateVolumes() does not run on time (every 1 min). See:

2018-07-04 08:21:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:21:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:22:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:22:10,045+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:23:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:23:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:24:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:24:10,477+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:25:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:25:10,021+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:26:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:26:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

2018-07-04 08:27:31,694+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:27:31,717+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

2018-07-04 08:28:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:28:10,022+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

When this happens, we usually see UpdateVolumes() stuck in the executor.

2018-07-04 08:26:40,030+0200 DEBUG (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/4796 waiting task#=0 at 0x7f7480b85990>, <Worker name=periodic/4792 running <Task discardable <UpdateVolumes vm=61da1520-21b2-439d-8ae6-38c1ce9127df at 0x7f7468140650> timeout=30.0, duration=29 at 0x1299790> task#=34 at 0x28509d0>, <Worker name=periodic/4794 running <Task discardable <UpdateVolumes vm=bdef5563-53d6-473b-968b-b1b279176682 at 0x7f7400106450> timeout=30.0, duration=29 at 0x7f7400106090> task#=46 at 0x7f746c2e6e50>, <Worker name=periodic/4795 running <Task discardable <UpdateVolumes vm=01b4911b-d588-435e-a86a-a5a476da6362 at 0x7f746c4040d0> timeout=30.0, duration=30 at 0x22720d0> discarded task#=40 at 0x2381750>, <Worker name=periodic/4793 running <Task discardable <UpdateVolumes vm=e7945585-5304-47b8-a533-89350ddac7b7 at 0x7f7440ceeb50> timeout=30.0, duration=29 at 0x7f7468140b90> task#=49 at 0x7f7440524bd0>]) (executor:205)

Which means some of the getVolumeSize() calls at 08:26:10 did not complete and are probably hanging everything, including VM monitoring, leading to the delayed getVolumeSize() at 08:27:31 (which should have happened at 08:27:10?).

Let's check VM e7945585-5304-47b8-a533-89350ddac7b7 from above; these are its disks:
  SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 9b3af596-d8e8-4e89-aa2c-cbefd2a8908a
  SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 6328070d-649d-4c57-8db0-0220c9cb4470

1. getVolumeSize for the first disk (9b3af596)

2018-07-04 08:26:10,069+0200 INFO  (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'9b3af596-d8e8-4e89-aa2c-cbefd2a8908a', volUUID=u'8caa2d9e-71d7-4533-90f2-ba9ab955afce', options=None) from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:46)

2. It is discarded at 08:26:40 (as per the logs above), but still goes ahead and takes 50 seconds to complete.

2018-07-04 08:27:00,667+0200 INFO  (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '53687091200', 'apparentsize': '53687091200'} from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:52)

3. Then it proceeds to the second disk, which completes immediately. We are 50 seconds late.

2018-07-04 08:27:00,669+0200 INFO  (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'6328070d-649d-4c57-8db0-0220c9cb4470', volUUID=u'd04beb3a-db86-4dc1-af24-7139a0095942', options=None) from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:46)
2018-07-04 08:27:00,670+0200 INFO  (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '826781204480', 'apparentsize': '826781204480'} from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:52)

I'm not entirely sure, but I think these delays end up causing the discarded UpdateVolumes and ultimately the monitors going unresponsive, as VM monitoring is getting starved in the executor by these long-standing getVolumeSize tasks.
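To illustrate the starvation (a toy model only, not vdsm code; the worker count
and timings are made up), a few UpdateVolumes calls blocked on a slow storage
refresh are enough to leave no periodic workers free for VM monitoring:

    # Toy model: 5 shared "periodic" workers, UpdateVolumes blocked on a refresh.
    import concurrent.futures
    import threading
    import time

    refresh_done = threading.Event()
    start = time.time()

    def update_volumes(vm):
        refresh_done.wait()        # stands in for a blocked getVolumeSize()

    def monitor_vm(vm):
        print("%s monitored after %.1fs" % (vm, time.time() - start))

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for i in range(5):             # blocked UpdateVolumes tasks fill all workers
        pool.submit(update_volumes, "vm%d" % i)
    for i in range(5):             # VM monitoring is queued behind them
        pool.submit(monitor_vm, "vm%d" % i)

    time.sleep(5)                  # a (scaled down) slow fc-scan / storage refresh
    refresh_done.set()             # monitoring finally runs, ~5 seconds late
    pool.shutdown(wait=True)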

It seems these stuck getVolumeSize() calls on storage domain eff51393-6316-4f17-a4dc-83ad85a0b765 are blocked by the domain refresh below, which finishes at 08:27:00, after which all the stuck getVolumeSize() calls return.

2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.Monitor] Refreshing domain eff51393-6316-4f17-a4dc-83ad85a0b765 (monitor:487)
2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (storage.sdc.refreshStorage) (misc:467)

2018-07-04 08:26:26,071+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.iscsi.rescan) (misc:467)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.ISCSI] Performing SCSI scan, this will take up to 30 seconds (iscsi:457)
2018-07-04 08:26:26,073+0200 DEBUG (monitor/eff5139) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-119 /usr/bin/sudo -n /sbin/iscsiadm -m session -R (cwd None) (commands:69)
2018-07-04 08:26:26,446+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.rescan) (misc:467)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,448+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)

2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:49,986+0200 DEBUG (monitor/eff5139) [root] /usr/bin/taskset --cpu-list 0-119 /sbin/udevadm settle --timeout=5 (cwd None) (commands:69)
2018-07-04 08:26:50,048+0200 DEBUG (monitor/eff5139) [root] SUCCESS: <err> = ''; <rc> = 0 (commands:93)
2018-07-04 08:26:50,616+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)

<lvm commands take around 2s each>

2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 01_img_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:758)
2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 02_vol_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:767)
2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 03_lvm_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (blockSD:989)

So most of the time is spent on hba scan and the first "trying to enter sampling method".

Interestingly enough, most of the VMs that get stuck on UpdateVolumes have Direct LUNs (20 out of 21). The one I used as an example above is the only one without a Direct LUN. So maybe this is related: https://bugzilla.redhat.com/show_bug.cgi?id=1517535

Should this slower storage refreshing trigger unresponsive VMs? The VMs are running fine (confirmed by reading the monotonic clock from inside the guests).

Version-Release number of selected component (if applicable):
vdsm-4.19.50-1.el7ev.x86_64

How reproducible:
Unknown

(Originally by Germano Veit Michel)

Comment 3 RHV bug bot 2019-10-03 09:06:07 UTC
Maor, you are this week's QE contact, please scrub

(Originally by Tal Nisan)

Comment 4 RHV bug bot 2019-10-03 09:06:09 UTC
I saw the following error in the VDSM logs:

2018-07-04 08:05:11,680+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
2018-07-04 08:05:11,682+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
....

Can you please add some more indicators, such as cpu usage, load, sar, iotop of the host while the VMs get unresponsive.
Can you also add libvirt, qemu and engine logs

Thanks

(Originally by mlipchuk)

Comment 6 RHV bug bot 2019-10-03 09:06:13 UTC
Hi Maor,

Thanks for taking a look.

(In reply to Maor from comment #4)
> I saw the following error in the VDSM logs:
> 
> 2018-07-04 08:05:11,680+0200 WARN  (jsonrpc/0) [virt.vm]
> (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> 2018-07-04 08:05:11,682+0200 WARN  (jsonrpc/0) [virt.vm]
> (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> ....

Yes, as in comment #0.

> Can you please add some more indicators, such as cpu usage, load, sar, iotop
> of the host while the VMs get unresponsive.

Sure. There are plenty of resources available, there is no overcommit. It seems the delays on entering the sampling method happen in several places, and fc-scan is always slow (there is a large number of LUNs presented). Looks like the lockup is in VDSM to me. All VMs are running just fine, there is no detected performance issue anywhere else.

> Can you also add libvirt, qemu and engine logs

RHV doesn't run with libvirt or qemu logging by default, so I'll add engine logs.

> It might also be related to this bug https://bugzilla.redhat.com/1419856
> which libvirt version is the customer using?
libvirt-daemon-3.2.0-14.el7_4.9.x86_64

Please let me know if you need anything else.

(Originally by Germano Veit Michel)

Comment 16 RHV bug bot 2019-10-03 09:06:33 UTC
Similar case: UpdateVolumes for each VM is stuck because getVolumeSize is delayed/not-running/blocked. In this case it happens when adding a new SD.

vdsm-4.19.45-1.el7ev.x86_64

(Originally by Germano Veit Michel)

Comment 18 RHV bug bot 2019-10-03 09:06:38 UTC
From the data of comment #17:

2018-11-08 11:15:19,064+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:16:19,034+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:17:19,048+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:19:20,278+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 2m01s delayed
2018-11-08 11:19:24,344+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 5s delayed
2018-11-08 11:20:19,040+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:21:19,076+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:22:19,046+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',

+

2018-11-08 11:19:17,412+0100 WARN  (jsonrpc/6) [virt.vm] (vmId='a4ab6989-274b-4d50-9bec-1680e45442a1') monitor became unresponsive (command timeout, age=65.4700000007) (vm:5132)

So it looks like UpdateVolumes was stuck from ~11:18 to 11:19:20, and the monitor goes unresponsive as getVolumeSize for the VM's volumes is stuck?
Tons of getVolumeSize calls were delayed/blocked while the user was adding a new SD to the Pool.

Is there any improvement on this area in vdsm-4.20?

(Originally by Germano Veit Michel)

Comment 19 RHV bug bot 2019-10-03 09:06:40 UTC
I'm positive we landed a few fixes in 4.2 to reduce the number of rescans we perform, after gchen found out we repeat that about 4 times for every path.


Germano, how many paths are configured there?


Meanwhile I'll try to dig up the relevant BZ.

(Originally by Roy Golan)

Comment 20 RHV bug bot 2019-10-03 09:06:42 UTC
(In reply to Roy Golan from comment #19)
> Germano, how many paths are configured there?

This one has 8 paths ;)

Indeed should be better in 4.2 with https://gerrit.ovirt.org/#/c/81681/

Still, can't this be improved so GetVolumeSize() does not pile up while the SD is refreshing/producing? It seems all GetVolumeSize() calls are waiting for the refresh; with enough VMs that means a ton of blocked UpdateVolumes tasks.

(Originally by Germano Veit Michel)

Comment 21 RHV bug bot 2019-10-03 09:06:44 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

(Originally by Sandro Bonazzola)

Comment 23 RHV bug bot 2019-10-03 09:06:49 UTC
I think this issue should be solved by moving volume size monitoring to 
storage domain monitors.

When an image is prepared, we can register the leaf volume with the
storage domain monitor.

Then the storage domain monitor can check volume size every N seconds
and cache the value.

When the virt API needs the value, it will get the cached value from
repoStats.

When an image is torn down, the storage monitor will stop monitoring the leaf
volume.

When taking a snapshot, replacing the active leaf volume, the new
leaf volume should be registered with the storage domain monitor,
replacing the old leaf volume (that will never change).

If storage is not responsive, the relevant volume sizes will be stale, but
nothing in the system will block.

The UpdateVolume periodic task will be removed.
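To make the proposal concrete, here is a rough sketch of what such a cache could
look like (illustration only, not vdsm code; the class name, registration hooks
and locking are made up):

    import threading

    class VolumeSizeCache:
        """Hypothetical per-domain cache owned by the storage domain monitor."""

        def __init__(self, dom):
            self._dom = dom                # exposes getVSize(imgUUID, volUUID)
            self._lock = threading.Lock()
            self._volumes = set()          # leaf volumes registered on prepare
            self._sizes = {}               # (imgUUID, volUUID) -> size in bytes

        def register(self, imgUUID, volUUID):
            with self._lock:
                self._volumes.add((imgUUID, volUUID))

        def unregister(self, imgUUID, volUUID):
            with self._lock:
                self._volumes.discard((imgUUID, volUUID))
                self._sizes.pop((imgUUID, volUUID), None)

        def get(self, imgUUID, volUUID):
            # Never blocks; may return a stale value if storage is unresponsive.
            with self._lock:
                return self._sizes.get((imgUUID, volUUID))

        def _monitor_cycle(self):
            # Would run from the existing domain monitor loop every N seconds.
            with self._lock:
                volumes = list(self._volumes)
            for img, vol in volumes:
                try:
                    size = self._dom.getVSize(img, vol)
                except Exception:
                    continue               # keep the last known (stale) value
                with self._lock:
                    self._sizes[(img, vol)] = size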

(Originally by Nir Soffer)

Comment 26 RHV bug bot 2019-10-03 09:06:55 UTC
sync2jira

(Originally by Daniel Gur)

Comment 27 RHV bug bot 2019-10-03 09:06:57 UTC
sync2jira

(Originally by Daniel Gur)

Comment 29 RHV bug bot 2019-10-03 09:07:01 UTC
Normally getting volume size is quick. We need to find the root cause making
these calls slow.

Updating volume size calls this every 60 seconds:

3842     def updateDriveVolume(self, vmDrive):
3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
3844             return
3845 
3846         volSize = self._getVolumeSize(
3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
3848             vmDrive.volumeID)
3849 
3850         vmDrive.truesize = volSize.truesize
3851         vmDrive.apparentsize = volSize.apparentsize

getVolumeSize() is:

3067     @public
3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
...
3086         dom = sdCache.produce(sdUUID=sdUUID)
3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
3089         return dict(apparentsize=apparentsize, truesize=truesize)

On block storage getVSize and getVAllocSize are the same, so we have
2 calls to:

 421     def getVSize(self, imgUUUID, volUUID):
 422         """ Return the block volume size in bytes. """
 423         try:
 424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
 425         except IOError as e:
 426             if e.errno != errno.ENOENT:
 427                 raise
 428             # Inactive volume has no /dev entry. Fallback to lvm way.
 429             size = lvm.getLV(self.sdUUID, volUUID).size
 430 
 431         return int(size)

Because the LVs are active (this is called for running VMs), we know we don't use
lvm to get the size but call fsutils.size(), which is:

26 def size(filename):
 27     """
 28     Return actual file size, should work with both file and block device.
 29     """
 30     with io.open(filename, "rb") as f:
 31         f.seek(0, os.SEEK_END)
 32         return f.tell()

Which should be fast, but maybe in these cases it is not.

An easy optimization would be to call getVSize() only once on block
storage instead of twice. In the case when seeking to the end of the
device is slow, this may improve the situation.
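For example (a rough sketch of that optimization, not a merged patch;
is_block_domain() is a stand-in for whatever check the real code would use):

    def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
        dom = sdCache.produce(sdUUID=sdUUID)
        apparentsize = dom.getVSize(imgUUID, volUUID)
        if is_block_domain(dom):
            # On block storage both values are identical - skip the second
            # seek to the end of the device.
            truesize = apparentsize
        else:
            truesize = dom.getVAllocSize(imgUUID, volUUID)
        return dict(apparentsize=str(apparentsize), truesize=str(truesize))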

But we really want to understand why this operation is slow on
certain systems.

Germano, can you share all WARN logs from vdsm.log and sanlock.log?

(Originally by Nir Soffer)

Comment 30 RHV bug bot 2019-10-03 09:07:03 UTC
Hi Nir,

Thanks for taking a look at this.

(In reply to Nir Soffer from comment #29)
> Normally getting volume size is quick. we need to find the root cause causing
> these calls to be slow.
> 
> Updating volume size call this every 60 seconds:
> 
> 3842     def updateDriveVolume(self, vmDrive):
> 3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
> 3844             return
> 3845 
> 3846         volSize = self._getVolumeSize(
> 3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
> 3848             vmDrive.volumeID)
> 3849 
> 3850         vmDrive.truesize = volSize.truesize
> 3851         vmDrive.apparentsize = volSize.apparentsize
> 
> getVolumeSize() is:
> 
> 3067     @public
> 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID,
> options=None):
> ...
> 3086         dom = sdCache.produce(sdUUID=sdUUID)
> 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> 3089         return dict(apparentsize=apparentsize, truesize=truesize)
> 

I might be wrong, but I think it's sdCache.produce above that slows down monitoring.
getVolumeSize is stuck on the executor until a storage refresh completes.
And on larger environments (many LUNs), the HBA rescan can take longer too,
enough to make VM monitoring take more time.

I think having a few paths down also slows down the HBA rescan.

Once the storage refresh completes, all stuck getVolumeSize() calls complete
immediately after.

> On block storage getVSize and getVAllocSize are the same, so we have
> 2 calls to:
> 
>  421     def getVSize(self, imgUUUID, volUUID):
>  422         """ Return the block volume size in bytes. """
>  423         try:
>  424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
>  425         except IOError as e:
>  426             if e.errno != errno.ENOENT:
>  427                 raise
>  428             # Inactive volume has no /dev entry. Fallback to lvm way.
>  429             size = lvm.getLV(self.sdUUID, volUUID).size
>  430 
>  431         return int(size)
> 
> Because the LVs are active (this is called for running VMs), we know we don't use
> lvm to get the size but call fsutils.size(), which is:
> 
> 26 def size(filename):
>  27     """
>  28     Return actual file size, should work with both file and block device.
>  29     """
>  30     with io.open(filename, "rb") as f:
>  31         f.seek(0, os.SEEK_END)
>  32         return f.tell()
> 
> Which should be fast, but maybe in these cases it is not.

Yes, these should be fast. If this was slow, then the problem would not
be in VDSM I believe, and probably the VMs would be suffering too.

> An easy optimization would be to call getVSize() only once on block
> storage instead of twice. In the case when seeking to the end of the
> device is slow, this may improve the situation.
> 
> But we really want to understand why this operation is slow on
> certain systems.
> 
> Germano, can you share all WARN logs from vdsm.log and sanlock.log?

To open this bug I collected some debug logs from the initial customer;
aren't the ones attached initially good enough? They are more informative
than the latest ones without debug, but I can get more if you wish.

Occasionally there are some HBA rescan timeouts too.

From what I remember, in these cases sanlock is fine, and so is
IO to storage. The common scenario is that they all have lots of SDs
and LUNs.

I don't have debug logs from latest 4.3 on this. Let me know if you need them
and I'll collect a new round, along with sanlock (which was fine when I checked).

(Originally by Germano Veit Michel)

Comment 31 RHV bug bot 2019-10-03 09:07:06 UTC
(In reply to Germano Veit Michel from comment #30)
...
> > getVolumeSize() is:
> > 
> > 3067     @public
> > 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID,
> > options=None):
> > ...
> > 3086         dom = sdCache.produce(sdUUID=sdUUID)
> > 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> > 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> > 3089         return dict(apparentsize=apparentsize, truesize=truesize)
> > 
> 
> I might be wrong, but I think it's sdCache.produce above that slows down
> monitoring.
> getVolumeSize is stuck on the executor until a storage refresh completes.
> And on larger environments (many LUNs), HBA rescan can take longer time too, 
> enough to make VM monitoring take more time.

Right, storage refreshes are a known issue that may block any flow in 
vdsm accessing a storage domain method, even if the storage domain is not
related (e.g. accessing the ISO domain during a storage refresh blocked
by a SCSI scan).

But we don't expect a lot of storage refreshes. We need to understand why we
have many refreshes, and why they are slow.

> I think having a few paths down also slow down HBA rescan.
> 
> Once the storage refresh completes, all stuck getVolumeSize() complete
> immediately after.

This makes sense.

> > Germano, can you share all WARN logs vdsm log and sanlock.log?
> 
> To open this bug I collected some debug logs from the initial customer, 
> aren't the ones attached initially good enough? They are more informative
> then the latest ones without debug, but I can get more if you wish.

We need to look at them again, hopefully we will have enough info.

> I don't have debug logs from latest 4.3 on this. Let me know if you need them
> then I'll collect a new round, along with sanlock (which was fine when I
> checked).

If sanlock did not have any delays, then this is most likely an issue with 
storage refreshes blocking other operations.

It would help to time a manual verbose fc-scan:

    time /usr/libexec/vdsm/fc-scan -v

on a host that shows this issue, and share the output of the command. The log
includes the time it took for every host/device scan.

Since we support device resize detection, this tool now scans not only
the scsi hosts (/sys/class/scsi_host/%s/scan) but also all the scsi devices
(/sys/class/scsi_device/*).

This was added as part of bug 609689 in https://gerrit.ovirt.org/c/43388/.
Since vdsm does not manage devices, and can miss a request to resize devices
when a host is offline or disconnected from engine temporarily, vdsm scans
all fc_hosts and all devices connected to these hosts whenever we refresh
storage.


Looking at the refresh code in hsm.py:

 78     @misc.samplingmethod
 79     def refreshStorage(self, resize=True):
 80         self.__staleStatus = self.STORAGE_REFRESHING
 81 
 82         multipath.rescan()
 83         if resize:
 84             multipath.resize_devices()
 85         lvm.invalidateCache()

We don't need to scan devices if resize is False.


Checking all the places we call refreshStorage:


2796     def getStorageDomainsList(
2797             self, spUUID=None, domainClass=None, storageType=None,
2798             remotePath=None, options=None):
...
2814         sdCache.refreshStorage()

Looks like we can refresh with resize=False here.


2872     def getVGList(self, storageType=None, options=None):
...
2882         sdCache.refreshStorage()

Same, resize is not needed.


 411         def storageRefresh():
 412             sdCache.refreshStorage()

This is done during startup - since we don't know if a device was
resized on other hosts when vdsm was down, we really need to use
resize=True here.


1993     def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
1994         sdCache.refreshStorage()

Here we need to resize since the user expects to see the real device size after
this call.


2406     def connectStorageServer(self, domType, spUUID, conList, options=None):
...
2445         # In case there were changes in devices size
2446         # while the VDSM was not connected, we need to
2447         # call refreshStorage.
2448         if domType in (sd.FCP_DOMAIN, sd.ISCSI_DOMAIN):
2449             sdCache.refreshStorage()

Here we must resize, but rescanning devices is not needed if we just connected to 
storage.


2495     def disconnectStorageServer(self, domType, spUUID, conList, options=None):
...
2530         # Disconnecting a device may change the visible storage domain list
2531         # so invalidate the caches
2532         sdCache.refreshStorage(resize=False)

Here we don't do any resize, so we don't need to scan anything.


Refreshes in sdc.py:

127         try:
128             # If multiple calls reach this point and the storage is not
129             # updated the refreshStorage() sampling method is called
130             # serializing (and eventually grouping) the requests.
131             if self.__staleStatus != self.STORAGE_UPDATED:
132                 self.refreshStorage()

We don't call multipath.resize here so there is no need to rescan devices.


Looks like we should:

- Add argument to fc-scan to disable device rescan
- Add rescan_devices=False argument to refreshStorage
- Call refreshStorage(rescan_devices=True) only in the cases that must
  rescan devices and resize them (a rough sketch follows below).
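A rough sketch of what that could look like in hsm.py (illustration only; the
rescan_devices keyword and the multipath.rescan() plumbing are assumptions, the
merged patches may differ):

    @misc.samplingmethod
    def refreshStorage(self, resize=True, rescan_devices=True):
        self.__staleStatus = self.STORAGE_REFRESHING

        # Hypothetical flag: only walk the scsi devices when the caller
        # actually needs updated device sizes (getDeviceList, startup, resize).
        multipath.rescan(rescan_devices=rescan_devices)
        if resize:
            multipath.resize_devices()
        lvm.invalidateCache()

    # Callers that only need to pick up new/removed LUNs would then use:
    #     sdCache.refreshStorage(resize=False, rescan_devices=False)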

(Originally by Nir Soffer)

Comment 32 RHV bug bot 2019-10-03 09:07:08 UTC
Germano, I posted a quick untested patch for fc-scan to allow disabling 
device scanning.

It would help if you can share the output of these commands on a system
showing this issue.

    time /usr/libexec/vdsm/fc-scan -v

    time /usr/libexec/vdsm/fc-scan -v --no-devices

(Originally by Nir Soffer)

Comment 33 RHV bug bot 2019-10-03 09:07:10 UTC
(In reply to Nir Soffer from comment #31)
> Looks like we should:
> 
> - Add argument to fc-scan to disable device rescan
> - Add rescan_devices=False argument to refreshStorage
> - Call refreshStorage(rescan_device=True) only in the case that must
>   to rescan devices and resize them.

Makes a lot of sense to me, should save some time.

(In reply to Nir Soffer from comment #32)
> Germano, I posted a quick untested patch for fc-scan to allow disabling 
> device scanning.
> 
> It would help if you can get share the output of these commands on a system
> showing this issue.
> 
>     time /usr/libexec/vdsm/fc-scan -v
> 
>     time /usr/libexec/vdsm/fc-scan -v --no-devices

I've requested it from the customer that was facing these issues, along with a 
fresh sosreport after fixing other problems that could also have been contributing
to the slowdowns.

Will let you know once we have results.

Thanks!

(Originally by Germano Veit Michel)

Comment 34 RHV bug bot 2019-10-03 09:07:12 UTC
Looking in vdsm.log from attachment 1456628 [details]:


## FC scans

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | wc -l
34

$ head -1 vdsm.log.3 
2018-07-04 08:01:06,269+0200 ...

$ tail -1 vdsm.log.3 
2018-07-04 09:01:01,772+0200 ...

34 scans in 1 hour - does not make sense. We don't expect so many storage refreshes.


$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep monitor/ | wc -l
33

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep jsonrpc/ | wc -l
1

One scan was triggered by API call, all other by storage domain monitors.


Looking at the fc-scan calls, they take about 25 seconds, and they are performed in
groups - back to back:

1. 2 scans

2018-07-04 08:36:41,861+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:05,450+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:37:06,137+0200 DEBUG (monitor/62b40d0) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:29,223+0200 DEBUG (monitor/62b40d0) [storage.HBA] Scan finished (hba:64)

2. After a minute, 3 scans

2018-07-04 08:41:23,481+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:41:45,052+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:41:45,592+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:09,433+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:42:10,136+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:33,887+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)

3. After 3:30 minutes, 4 scans

2018-07-04 08:46:00,457+0200 DEBUG (monitor/632f710) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:22,423+0200 DEBUG (monitor/632f710) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:23,010+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:46,789+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:47,317+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:10,096+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:47:10,774+0200 DEBUG (monitor/c9faf72) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:34,787+0200 DEBUG (monitor/c9faf72) [storage.HBA] Scan finished (hba:64)


## Warnings

$ grep WARN vdsm.log.3 | wc -l
979

$ egrep 'WARN.+is blocked for' vdsm.log.3 | head -10
2018-07-04 08:01:11,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 320.00 seconds (check:279)
2018-07-04 08:01:21,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 330.00 seconds (check:279)
2018-07-04 08:01:31,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 340.00 seconds (check:279)
2018-07-04 08:01:41,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 350.00 seconds (check:279)
2018-07-04 08:01:51,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 360.00 seconds (check:279)
2018-07-04 08:02:01,282+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 370.00 seconds (check:279)
2018-07-04 08:02:11,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 380.00 seconds (check:279)
2018-07-04 08:02:21,287+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 390.00 seconds (check:279)
2018-07-04 08:02:31,285+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 400.00 seconds (check:279)
2018-07-04 08:02:41,287+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 410.00 seconds (check:279)

There are serious issues with the NFS domain dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22;
it was blocked for 840 seconds:

2018-07-04 08:55:31,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 840.00 seconds (check:279)


$ egrep 'WARN.+is blocked for' vdsm.log.3 | grep -v dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22

No other domain had issues.


## Storage cache invalidations

I think the issue is the storage domain cache. Every access to a storage domain looks up
an internal cache. If the domain is in the cache there is no delay and the code continues.

But if the domain is not in the cache, and the cache was invalidated, the caller tries to 
refresh storage. This operation does:

- scsi rescan
- fc rescan
- resize multipath devices

Because of the way the cache is built, all this is typically done twice when multiple
storage domains try to enter the domain cache.

In our case, an fc scan takes about 25 seconds. So code accessing a storage domain
may typically be blocked for at least 25-50 seconds on the cache only because of slow fc-scans.

Why would a domain not be in the cache? Every 300 seconds, storage domain monitors remove
the domain from the cache so the next access to the domain will recreate the domain
object and refresh the domain metadata.

The cache is invalidated in:

- connectStorageServer()
- StoragePool.connect(), called from connectStoragePool()
- StoragePool.attachSD(), called from attachStorageDomain()
- StoragePool.refresh(), called from connectStoragePool()

Looking at the relevant calls:

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:04:05,141+0200
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:23:40,594+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | wc -l
15

So we had 15 cache invalidations in one hour.

$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}' | wc -l
13

Almost all the calls are connectStorageServer. I guess this is the engine trying to recover
the inaccessible NFS domain by trying to reconnect to the server.

Let's try to mix the cache invalidations with the rescans:

$ egrep '\[storage.HBA\] Starting scan' vdsm.log.3 | awk '{print $1,$2,"rescan"}' > rescan.log
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2,"invalidate cache"}' > invalidate.log
$ cat invalidate.log rescan.log | sort
2018-07-04 08:05:00,203+0200 invalidate cache
2018-07-04 08:10:00,235+0200 invalidate cache
2018-07-04 08:15:00,434+0200 invalidate cache
2018-07-04 08:16:18,808+0200 rescan
2018-07-04 08:16:41,230+0200 rescan
2018-07-04 08:17:05,614+0200 rescan
2018-07-04 08:20:00,284+0200 invalidate cache
2018-07-04 08:21:24,552+0200 rescan
2018-07-04 08:21:46,828+0200 rescan
2018-07-04 08:22:10,778+0200 rescan
2018-07-04 08:22:44,676+0200 invalidate cache
2018-07-04 08:23:40,685+0200 rescan
2018-07-04 08:24:03,157+0200 rescan
2018-07-04 08:24:27,217+0200 rescan
2018-07-04 08:24:44,517+0200 invalidate cache
2018-07-04 08:26:03,833+0200 rescan
2018-07-04 08:26:26,448+0200 rescan
2018-07-04 08:26:50,643+0200 rescan
2018-07-04 08:30:00,290+0200 invalidate cache
2018-07-04 08:31:09,170+0200 rescan
2018-07-04 08:31:31,503+0200 rescan
2018-07-04 08:31:55,619+0200 rescan
2018-07-04 08:32:21,896+0200 rescan
2018-07-04 08:35:00,166+0200 invalidate cache
2018-07-04 08:36:16,809+0200 rescan
2018-07-04 08:36:41,861+0200 rescan
2018-07-04 08:37:06,137+0200 rescan
2018-07-04 08:40:00,148+0200 invalidate cache
2018-07-04 08:41:23,481+0200 rescan
2018-07-04 08:41:45,592+0200 rescan
2018-07-04 08:42:10,136+0200 rescan
2018-07-04 08:45:00,261+0200 invalidate cache
2018-07-04 08:46:00,457+0200 rescan
2018-07-04 08:46:23,010+0200 rescan
2018-07-04 08:46:47,317+0200 rescan
2018-07-04 08:47:10,774+0200 rescan
2018-07-04 08:50:00,183+0200 invalidate cache
2018-07-04 08:51:08,685+0200 rescan
2018-07-04 08:51:30,834+0200 rescan
2018-07-04 08:51:54,451+0200 rescan
2018-07-04 08:52:17,360+0200 rescan
2018-07-04 08:55:00,205+0200 invalidate cache
2018-07-04 08:56:16,675+0200 rescan
2018-07-04 08:56:39,366+0200 rescan
2018-07-04 08:57:03,588+0200 rescan
2018-07-04 08:57:26,343+0200 rescan
2018-07-04 09:00:00,232+0200 invalidate cache


The picture is pretty clear now:

1. The NFS storage domain is not accessible
2. The engine tries to recover the domain by calling connectStorageServer every ~5 minutes
3. connectStorageServer invalidates the storage domain cache
4. The next access to any domain triggers 2-5 fc-scans, each taking 25 seconds
5. Calls to storage domain methods are blocked for at least 25-50 seconds during the scans
6. The virt executor gets stuck with blocked getVolumeSize calls
7. VMs become non-responsive

Germano, did you see the same picture in other cases?

(Originally by Nir Soffer)

Comment 35 RHV bug bot 2019-10-03 09:07:14 UTC
(In reply to Nir Soffer from comment #34)
> The picture is pretty clear now:
> 
> 1. NFS storage domain is not accessible
> 2. Engine try to recover the domain by calling connectStorageServer every ~5
> minutes
> 3. connectStorgeServer invalidate storage domain cache
> 4. The next access to any domain trigger 2-5 fc-scans, each taking 25 seconds
> 5. calls to storage domain methods are blocked for at leat 25-50 seconds
> during the scans
> 6. virt executor get stuck with blocked getVolumeSize calls
> 7. vms become non-responsive
> 
> Germano, did you see the same picture in other cases?

Brilliant, I think this connects all the dots.

Yes, one case I remember was when adding new storage domains. The current case also 
has a broken ISO domain.

(Originally by Germano Veit Michel)

Comment 36 RHV bug bot 2019-10-03 09:07:16 UTC
In connectStorageServer() vdsm invalidates the caches no matter what connection is being made.
Is this optimal? For example, connecting an ISO/Export Domain invalidates caches of everything. Is it really necessary?

IIUC, the problem can still happen without the extra cache invalidations. If getVolumeSize gets blocked for long the monitor will be non-responsive.
Not having the NFS problem makes it less frequent though.

(Originally by Germano Veit Michel)

Comment 37 RHV bug bot 2019-10-03 09:07:18 UTC
(In reply to Germano Veit Michel from comment #36)
> In connectStorageServer() vdsm invalidates the caches no matter what
> connection is being made.
> Is this optimal? For example, connecting an ISO/Export Domain invalidates
> caches of everything. Is it really necessary?

I don't think so, and I suggested to change this in the past:
https://gerrit.ovirt.org/c/45824/

But there were concerns about the safety of this change. I think we should
rethink this.

> IIUC, the problem can still happen without the extra cache invalidations. If
> getVolumeSize gets blocked for long the monitor will be non-responsive.
> Not having the NFS problem makes it less frequent though.

Yes, but this is real storage issue, and in this case libvirt and qemu will
get stuck anyway.

(Originally by Nir Soffer)

Comment 38 RHV bug bot 2019-10-03 09:07:20 UTC
(In reply to Nir Soffer from comment #37)
> > IIUC, the problem can still happen without the extra cache invalidations. If
> > getVolumeSize gets blocked for long the monitor will be non-responsive.
> > Not having the NFS problem makes it less frequent though.
> 
> Yes, but this is real storage issue, and in this case libvirt and qemu will
> get stuck anyway.

But if 1 domain has problems and is blocking refresh, then all vms will be
non-responsive, even VMs that have no disks on that domain, right?

Also, there might be other reasons the refresh takes longer while the VMs
are all running happily, like a high number of devices, or a problem with a few 
devices/paths which are not even being used.

No reply for the fc-scan yet, will post here as soon as I get it.

(Originally by Germano Veit Michel)

Comment 43 RHV bug bot 2019-10-03 09:07:31 UTC
(In reply to Germano Veit Michel from comment #38)
> (In reply to Nir Soffer from comment #37)
> > > IIUC, the problem can still happen without the extra cache invalidations. If
> > > getVolumeSize gets blocked for long the monitor will be non-responsive.
> > > Not having the NFS problem makes it less frequent though.
> > 
> > Yes, but this is real storage issue, and in this case libvirt and qemu will
> > get stuck anyway.
> 
> But if 1 domain has problems and is blocking refresh, then all vms will be
> non-responsive, even VMs that have no disks on that domain, right?

Right, this is exactly the case here. The problem is lack of storage domain
isolation.

(Originally by Nir Soffer)

Comment 44 RHV bug bot 2019-10-03 09:07:33 UTC
Created attachment 1616201 [details]
fc-scan with devices

(Originally by Nir Soffer)

Comment 45 RHV bug bot 2019-10-03 09:07:35 UTC
Created attachment 1616205 [details]
fc-scan without devices

(Originally by Nir Soffer)

Comment 46 RHV bug bot 2019-10-03 09:07:37 UTC
Created attachment 1616207 [details]
fc-scan with devices

Uploading again with fixed whitespace.

(Originally by Nir Soffer)

Comment 47 RHV bug bot 2019-10-03 09:07:40 UTC
I extracted the text from the pdf in attachment 1615732 [details] and uploaded:
- attachment 1616207 [details] - fc-scan with devices
- attachment 1616205 [details] - fc-scan without devices

Comparing total runtime:

- with devices: 4.6s
- without devices: 0.07s (65x faster)

Checking how time was spent checking devices:

We scanned 486 devices:
$ egrep 'Rescanned device .+:' fc-scan-with-devices.log | wc -l
486

Slowest scans:
$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | head
0.04
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03

Fastest scans
$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | tail
0.01
0.01
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00

Average scan:
$ awk '/Rescanned device .+:/ {sum+=$5; count+=1}; END {print sum/count}' fc-scan-with-devices.log
0.0179218

Total time spent in scanning:
$ awk '/Rescanned device .+:/ {sum+=$5}; END {print sum}' fc-scan-with-devices.log
8.71

We have some parallelism, using a thread per host.

There is also time spent in finding devices for every host, but we cannot tell
how much time was spent on that from the log.

98.5% of the time is spent scanning devices. Skipping device scanning when it is not needed
will make slow scans a non-issue, avoiding delays in other parts of the system.

(Originally by Nir Soffer)

Comment 48 RHV bug bot 2019-10-03 09:07:42 UTC
Germano, it would be nice if we can get text instead of pdf next time.

An easy way to collect the data is:

    $ (time fc-scan -v) > fc-scan-with-devices.log 2>&1
    $ (time fc-scan -v --no-devices) > fc-scan-without-devices.log 2>&1

(Originally by Nir Soffer)

Comment 49 RHV bug bot 2019-10-03 09:07:44 UTC
Checking how devices are distributed across fc_hosts and rports:

$ egrep 'Rescanning device .+/host2/' fc-scan-with-devices.log | wc -l
243
$ egrep 'Rescanning device .+/host4/' fc-scan-with-devices.log | wc -l
243

$ egrep 'Rescanning device .+/host4/rport-4:0-2' fc-scan-with-devices.log | wc -l
3
$ egrep 'Rescanning device .+/host4/rport-4:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-4' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host4/rport-4:0-6' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host2/rport-2:0-2' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-4' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-6' fc-scan-with-devices.log | wc -l
3

We use only 2 threads - one per fc_host with devices. Can we get better parallelism
by scanning with a thread per rport-target-N combination?

The relevant code is here:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/storage/fc-scan;h=7da6105971c7c253b20626cb2cd87509b0815778;hb=61404df72b75dc329a70c745b9e0af88ff7027a2

The relevant part is:

 106                     fd = os.open(path, os.O_WRONLY)
 107                     try:
 108                         os.write(fd, "1")
 109                     finally:
 110                         os.close(fd)

Can this work in parallel for multiple devices on different host-rport-target-N, like:

/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan

If it can work, we can speed up device scanning significantly.
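For concreteness, the kind of parallelism being asked about would look roughly
like this (hypothetical sketch, not the vdsm patch; the path grouping and thread
pool are illustrative only):

    import concurrent.futures
    import glob
    import os
    import re

    def device_rescan_paths():
        return glob.glob("/sys/class/scsi_device/*/device/rescan")

    def target_key(path):
        # ".../host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan" -> "target2:0:3"
        m = re.search(r"(target\d+:\d+:\d+)", os.path.realpath(path))
        return m.group(1) if m else path

    def rescan_target(paths):
        for path in paths:
            fd = os.open(path, os.O_WRONLY)
            try:
                os.write(fd, b"1")
            finally:
                os.close(fd)

    def rescan_all():
        groups = {}
        for path in device_rescan_paths():
            groups.setdefault(target_key(path), []).append(path)
        # One thread per host-rport-target instead of one thread per fc_host.
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
            list(pool.map(rescan_target, groups.values()))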

Ben, I don't know if you can answer this, but maybe you can tell me who can 
answer this?

(Originally by Nir Soffer)

Comment 50 RHV bug bot 2019-10-03 09:07:46 UTC
Summary of actions items (in order of importance):

1. Disable device scanning when it is not needed
   I think most refreshes do not need device scanning

2. Optimize device scanning when it is needed (e.g. better parallelism)
   Important in getDeviceList and startup, when we must resize devices

3. Never trigger rescans when handling file based domains. It is never right to rescan
   hosts and/or devices when handling an NFS domain.

4. Never trigger fc_host scan when handling iSCSI domain or iscsi_host scan when handling
   FC domain (currently we always invoke both).

5. Minimize cache invalidations (sdCache.invalidateStorage)
   Invalidations may trigger a slow rescan blocking other flows. I think we can remove all calls.

6. Minimize cache refreshes (sdCache.refreshStorage)
   Refreshes may trigger a slow rescan blocking other flows. I think we can remove some calls.

7. Consider changing the sampling method to reuse the current operation result
   instead of grouping waiters and performing a new operation for all waiters;
   this can cut blocking time by half (see the sketch below).
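An illustrative sketch of item 7 (not vdsm's misc.samplingmethod; error handling
and the exact semantics are simplified) - late callers wait for the in-flight run
and reuse its result instead of triggering another full run:

    import functools
    import threading

    def samplingmethod(func):
        lock = threading.Lock()
        state = {"event": None, "result": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                event = state["event"]
                runner = event is None
                if runner:
                    event = state["event"] = threading.Event()
            if runner:
                try:
                    result = func(*args, **kwargs)
                    state["result"] = result
                finally:
                    with lock:
                        state["event"] = None
                    event.set()
                return result
            # Piggyback on the run that is already in progress.
            event.wait()
            return state["result"]

        return wrapper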

(Originally by Nir Soffer)

Comment 51 RHV bug bot 2019-10-03 09:07:48 UTC
(In reply to Nir Soffer from comment #49)
> We use only 2 threads - one per fc_host with devices. Can we get better parallelism
> by scanning with thread per rport-target-N combinations?

> Can this work in parallel for multiple devices on different host-rport-target-N, like:
> 
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan
> 
> If it can work, we can speed up device scanning significantly.
> 
> Ben, I don't know if you can answer this, but maybe you can tell me who can 
> answer this?

Looking over the relevant kernel code briefly, I don't see any obvious reason why not. However, I really don't know the kernel scsi-layer code well, so Ewan might be a better person to ask.

(Originally by Benjamin Marzinski)

Comment 52 RHV bug bot 2019-10-03 09:07:50 UTC
It depends upon what you are doing.

There is a mutex on each SCSI Host object taken around the scan operation when
rescanning a host, or a target.  So if you use the "scan" sysfs node with wildcards
on the host object, you will wait on the mutex if multiple threads are performing
an operation on the same host.

Rescanning an already-discovered device, though, is different and does not take
the mutex.  So there is some potential for parallelism there.

However, one thing to be aware of is that scanning devices can take more time,
sometimes a lot more time, if there is also I/O in progress to the device.  I have
seen cases where it can take on the order of seconds for a device to respond to
inquiry commands under I/O load, because it apparently services the I/O with higher
priority.  Rescanning a device sends several such inquiry commands to it to get
all the VPD pages and attributes.

It is probably better to cache the information, and perhaps update it on a periodic
rotating basis (rather than scanning everything all at the same time).

Note also that since 7.5, if a device reports a unit attention ASC/ASCQ 3F 03
INQUIRY DATA HAS CHANGED, the kernel may perform the rescan automatically.
(A unit attention is only received if there is a command that can receive the
status, so if there is no I/O to the device the reporting can be delayed or not
received.  If multipath is used, typically the path checker will receive the UA.)

This does not happen for ASC/ASCQ 2A 09 CAPACITY DATA HAS CHANGED, though.
This had to do with multipath requiring the capacity on the underlying paths
to match, so if some paths are seen to change but not others, the paths would
not be used.  (I think multipath may have been changed to not check this, at some
point but I'm not certain.)


static void scsi_report_sense(struct scsi_device *sdev,
                              struct scsi_sense_hdr *sshdr)
{
        enum scsi_device_event evt_type = SDEV_EVT_MAXBITS;     /* i.e. none */

        if (sshdr->sense_key == UNIT_ATTENTION) {
                if (sshdr->asc == 0x3f && sshdr->ascq == 0x03) {
                        evt_type = SDEV_EVT_INQUIRY_CHANGE_REPORTED;
                        sdev_printk(KERN_WARNING, sdev,
                                    "Inquiry data has changed");

...

static void scsi_evt_emit(struct scsi_device *sdev, struct scsi_event *evt)
{
        int idx = 0;
        char *envp[3];

        switch (evt->evt_type) {
        case SDEV_EVT_MEDIA_CHANGE:
                envp[idx++] = "SDEV_MEDIA_CHANGE=1";
                break;
        case SDEV_EVT_INQUIRY_CHANGE_REPORTED:
                scsi_rescan_device(&sdev->sdev_gendev);                <=====
                envp[idx++] = "SDEV_UA=INQUIRY_DATA_HAS_CHANGED";
                break;

...

static ssize_t
store_rescan_field (struct device *dev, struct device_attribute *attr,
                    const char *buf, size_t count)
{
        scsi_rescan_device(dev);                                       <=====
        return count;
}
static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field);

(Originally by Ewan Milne)

Comment 53 RHV bug bot 2019-10-03 09:07:52 UTC
(In reply to Ewan D. Milne from comment #52)
> It depends upon what you are doing.

I'll try to describe RHV use case.

We have multiple hosts in a cluster (.e.g 50), all connected to the same
LUNs. The reason is we can try to run a VM using this storage on any of these
hosts quickly, without changing the storage (e.g. map LUN to some host).

All the LUNs use multipath. We create a LVM volume group with one or more LUNs,
and create LVM logical volumes used as VM virtual disks.

When we run a VM on some host, or perform some storage operation (e.g. cloning
a virtual disk), we activate the logical volume on this host.

When the volume group becomes full, users can go and resize the LUN on the server.
Since we don't manage the resize, we need to rescan the relevant device on
all hosts so all hosts are on the same page and see the new size of the device.

Once we rescan devices on all hosts, we need to resize the LVM physical volume.
This is done on a master host managing LVM metadata. Other hosts will see this
update when they reload this volume group.

Optimally we would rescan only the device that was resized, but since we don't
manage the resize, we don't know which device was resized. We could ask the
user to select the device in the RHV UI, but that is a pretty hard task when you
have a lot of devices. Also, the user resizing the device on the server is typically
not the same user managing the RHV setup.

So instead of making the user's life hard, we rescan all the devices when a user
asks to update device sizes. Then we display the devices that may be extended
(device size > LVM physical volume size). The user can select the devices
and trigger a resize of the physical volume.

Since we manage a cluster of hosts, and a host can be disconnected from the
cluster at any time, we need to make sure a host will rescan devices before
connecting back to the cluster, in case there was an update when the host
was not connected.

So we need an efficient way to scan large amount of devices. I think we
test up to 300 LUNs, with 4 paths per LUNs, total 1200 devices.
 
> Rescanning an already-discovered device, though, is different and does not
> take
> the mutex.  So there is some potential for parallelism there.

Great.

> However, one thing to be aware of is that scanning devices can take more
> time,
> sometimes a lot more time, if there is also I/O in progress to the device. 
> I have
> seen cases where it can take on the order of seconds for a device to respond
> to
> inquiry commands under I/O load, because it apparently services the I/O with
> higher
> priority.  Rescanning a device sends several such inquiry commands to it to
> get
> all the VPD pages and attributes.

Yes, we learned this the hard way: scanning 400 devices took about 25 seconds
with 2 threads (a thread per host).

I hope that we can cut the time by 4 by having one thread per host-rport-target-N.

> It is probably better to cache the information, and perhaps update it on a
> periodic
> rotating basis (rather than scanning everything all at the same time).

The goal is to rescan devices only when:

- when a user needs to perform an operation with devices, like creating
  a new LVM volume group in the management system. At this point the user
  wants to see the current size of the device.

- when a user perform a resize of the LVM physical volume, on all hosts.

- connecting back to the cluster after a host was in maintenance mode or after
  restarting the management program on a host.

(Originally by Nir Soffer)

Comment 55 RHV bug bot 2019-10-03 09:07:56 UTC
Daniel, this is mostly a scale issue, affecting users with large setups
(many hosts, many vms, many LUNs, many storage domains).

We have initial patch for speeding up device scanning:
https://gerrit.ovirt.org/c/103455/

To test this patch I need access to a 1-2 hosts connected to FC storage
with many LUNs. I think that 300 LUNs, 4 paths per LUN, total 1200 SCSI
devices can be a good test.

We need to test:
- time to scan devices on idle host
- same when scanning from multiple hosts in the same time
- same when there is I/O on some devices
  (e.g. copying disks while scanning)

(Originally by Nir Soffer)

Comment 61 RHV bug bot 2019-10-03 09:08:08 UTC
Created attachment 1619485 [details]
graph: fc-scan with 222 devices

The graph shows that we get about 10x speedup with 20 threads and more threads do not improve results significantly.

Previously we were using a thread per fc_host (typically 1 thread), so the current
patches can give ~5x improvement.

We did not test yet with larger setups and with I/O load.

(Originally by Nir Soffer)

Comment 62 Nir Soffer 2019-10-26 16:14:08 UTC
Mordechai, see original bug 1598266 for testing/verification instructions.

Comment 63 David Vaanunu 2019-11-05 12:47:55 UTC
Verified on 4.3.7-3

redhat-release-server-7.7-10.el7.x86_64
rhv-release-4.3.7-3-001.noarch
vdsm-4.30.34-1.el7ev.x86_64


Flow:
Using 1 host + running 60VMs
Have 300 LUNs
Move 8 raw disks (500GB) in parallel
Storage Domains activities (5 SD in parallel to move disks):
   Maintenance
   Detach
   Activate
   Attach

Comment 66 errata-xmlrpc 2019-12-12 10:36:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4230

