Bug 1598266

Summary: [scale] VMs unresponsive due to delayed getVolumeSize
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: vdsm    Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA QA Contact: mlehrer
Severity: medium Docs Contact:
Priority: high    
Version: 4.1.11    CC: aefrat, bmarzins, dagur, dvaanunu, emarcus, eshenitz, fgarciad, fromani, gveitmic, lsurette, michal.skrivanek, mkalinin, mlehrer, mzamazal, nsoffer, pelauter, pvilayat, rdlugyhe, rgolan, sfroemer, srevivo, tnisan, ycui
Target Milestone: ovirt-4.4.0    Keywords: Performance, ZStream
Target Release: 4.3.0    Flags: mlehrer: needinfo-
lsvaty: testing_plan_complete-
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When a system had many FC LUNs with many paths per LUN, and a high I/O load, scanning of FC devices became slow, causing timeouts in monitoring VM disk size, and making VMs non-responsive. In this release, FC scans have been optimized for speed, and VMs are much less likely to become non-responsive.
Story Points: ---
Clone Of:
: 1758094 (view as bug list) Environment:
Last Closed: 2020-08-04 13:26:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Scale RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1736852, 1547768, 1758094    
Attachments (all flags: none):
- fc-scan with devices
- fc-scan without devices
- fc-scan with devices
- graph: fc-scan with 222 devices
- graph: fc-scan with 1311 devices
- script for creating I/O load
- Output of fc-scan -v on idle system - current version
- Output of fc-scan -v with I/O load - current version
- Output of fc-scan -v on idle system - fix
- Output of fc-scan -v with I/O load - fix
- Output of lsblk -s on the test system
- Output of multipath -ll on test system

Description Germano Veit Michel 2018-07-05 01:08:08 UTC
Description of problem:
Several VMs get unresponsive at the same time, regularly.

2018-07-04 08:27:11,986+0200 vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60' age=62.8300000001
2018-07-04 08:27:11,988+0200 vmId='def11d1f-9cb0-47be-9c45-7b961d275c40' age=62.8300000001
2018-07-04 08:27:11,990+0200 vmId='d3ffccf2-7743-463c-bd03-99cff6106ff0' age=62.8399999999
2018-07-04 08:27:11,992+0200 vmId='c8f91bed-1e71-4408-a76b-2bf02d55aa48' age=62.8399999999
2018-07-04 08:27:11,993+0200 vmId='01b4911b-d588-435e-a86a-a5a476da6362' age=62.8399999999
2018-07-04 08:27:11,995+0200 vmId='bd2dd808-af9a-40e9-b997-316f05d06bb9' age=62.8399999999
2018-07-04 08:27:11,997+0200 vmId='d7982c76-8d6b-496a-85cb-548478816e54' age=62.8399999999
2018-07-04 08:27:11,998+0200 vmId='ed10e819-cb6b-4f36-bedf-41fa0c89cab4' age=62.8399999999
2018-07-04 08:27:12,000+0200 vmId='247144df-6844-4f35-a728-80f25ad72971' age=62.8499999996
2018-07-04 08:27:12,001+0200 vmId='331d4145-d02c-4bfa-a136-1018839162be' age=62.8499999996
2018-07-04 08:27:12,003+0200 vmId='e7945585-5304-47b8-a533-89350ddac7b7' age=62.8499999996
2018-07-04 08:27:12,006+0200 vmId='73d41cb9-6564-44ca-bf94-a5aeb7c76a1c' age=62.8499999996
2018-07-04 08:27:12,007+0200 vmId='52589290-0033-4a97-b393-3650d7af7ef2' age=62.8499999996
2018-07-04 08:27:12,009+0200 vmId='8e7a575b-3f7d-4453-be96-190d0226d077' age=62.8599999994
2018-07-04 08:27:12,010+0200 vmId='61da1520-21b2-439d-8ae6-38c1ce9127df' age=62.8599999994
2018-07-04 08:27:12,012+0200 vmId='7701dfda-a3c6-456f-bd20-8db0f6f2373a' age=62.8599999994
2018-07-04 08:27:12,013+0200 vmId='89012d87-e8c3-466f-83a1-d673d1a7537b' age=62.8599999994
2018-07-04 08:27:12,015+0200 vmId='1ded83c2-4879-42d9-9bcd-ced3f14bcf03' age=62.8599999994
2018-07-04 08:27:12,017+0200 vmId='f6ff4904-5a50-40fd-bbae-e2ad8d6de1b7' age=62.8599999994
2018-07-04 08:27:12,018+0200 vmId='edd902e4-0e98-400e-9879-a320080bd148' age=62.8599999994
2018-07-04 08:27:12,020+0200 vmId='bdef5563-53d6-473b-968b-b1b279176682' age=62.870000001
2018-07-04 08:27:12,021+0200 vmId='4529e6fa-70a2-4233-abe5-4718090117a3' age=62.870000001
2018-07-04 08:27:12,023+0200 vmId='9174301d-fa75-4ad7-84a2-91b3b22db987' age=62.870000001
2018-07-04 08:27:12,025+0200 vmId='17f21001-8dfa-4786-89e6-27e0b157be9f' age=62.870000001
2018-07-04 08:27:12,026+0200 vmId='1247742f-7726-458f-8294-fbaa33ffce77' age=62.870000001
2018-07-04 08:27:12,028+0200 vmId='4c7d3aff-dad1-4108-883d-069ba0a4a4fb' age=62.870000001
2018-07-04 08:27:12,029+0200 vmId='7ac88b8f-db4c-4456-a1a4-310ff0bf5b39' age=62.8800000008
2018-07-04 08:27:12,031+0200 vmId='abacdc84-c107-42ef-abf6-13803e118a40' age=62.8800000008
2018-07-04 08:27:12,032+0200 vmId='3a42f097-4a4f-4c91-bd32-85e0d48bedf0' age=62.8800000008
2018-07-04 08:27:12,034+0200 vmId='60b2d3f0-9b38-4d6b-93f9-fe0d97857582' age=62.8800000008
2018-07-04 08:27:12,035+0200 vmId='1e693cd3-4cae-4a60-b914-2daaa0f40791' age=62.8800000008
2018-07-04 08:27:12,037+0200 vmId='bc822a5a-631d-4f1b-ac18-c760baaf657e' age=62.8800000008
2018-07-04 08:27:12,038+0200 vmId='bdfbc05b-5b8c-4be5-aedd-78e987c21d00' age=62.8800000008
2018-07-04 08:27:12,039+0200 vmId='e46492df-4da3-45a0-b49f-82245e0173a5' age=62.8900000006
2018-07-04 08:27:12,041+0200 vmId='288908d5-6379-466b-88a0-2c4cca2d3900' age=62.8900000006
2018-07-04 08:27:12,042+0200 vmId='87441f46-d891-44c6-8483-4516bfeac757' age=62.8900000006

Let's pick the first one, 86dfa0fc-a4ed-4362-83a4-5bec1ca98d60

These are its disks:
  SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG ab3c2cda-c54b-497a-b7a7-342e658324db
  SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG caa85e31-02c9-40cb-84ba-0693b1182314
  DLUN 360001440000000107071b5b4d8ebc59f

When the mass unresponsiveness occurs, we can see that getVolumeSize(), called from UpdateVolumes(), does not run on time (every 1 min). See:

2018-07-04 08:21:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:21:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:22:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:22:10,045+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:23:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:23:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:24:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:24:10,477+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:25:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:25:10,021+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:26:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:26:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

2018-07-04 08:27:31,694+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:27:31,717+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

2018-07-04 08:28:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:28:10,022+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

When this happens, we usually see UpdateVolumes() stuck in the executor.

2018-07-04 08:26:40,030+0200 DEBUG (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/4796 waiting task#=0 at 0x7f7480b85990>, <Worker name=periodic/4792 running <Task discardable <UpdateVolumes vm=61da1520-21b2-439d-8ae6-38c1ce9127df at 0x7f7468140650> timeout=30.0, duration=29 at 0x1299790> task#=34 at 0x28509d0>, <Worker name=periodic/4794 running <Task discardable <UpdateVolumes vm=bdef5563-53d6-473b-968b-b1b279176682 at 0x7f7400106450> timeout=30.0, duration=29 at 0x7f7400106090> task#=46 at 0x7f746c2e6e50>, <Worker name=periodic/4795 running <Task discardable <UpdateVolumes vm=01b4911b-d588-435e-a86a-a5a476da6362 at 0x7f746c4040d0> timeout=30.0, duration=30 at 0x22720d0> discarded task#=40 at 0x2381750>, <Worker name=periodic/4793 running <Task discardable <UpdateVolumes vm=e7945585-5304-47b8-a533-89350ddac7b7 at 0x7f7440ceeb50> timeout=30.0, duration=29 at 0x7f7468140b90> task#=49 at 0x7f7440524bd0>]) (executor:205)

Which means some of the getVolumeSize() calls at 08:26:10 did not complete and are probably hanging everything, including VM monitoring, leading to the delayed getVolumeSize() at 08:27:31 (which should have happened at 08:27:10?). 

Let's check VM e7945585-5304-47b8-a533-89350ddac7b7 from above; these are its disks:
  SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 9b3af596-d8e8-4e89-aa2c-cbefd2a8908a
  SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 6328070d-649d-4c57-8db0-0220c9cb4470

1. getVolumeSize for the first disk (9b3af596)

2018-07-04 08:26:10,069+0200 INFO  (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'9b3af596-d8e8-4e89-aa2c-cbefd2a8908a', volUUID=u'8caa2d9e-71d7-4533-90f2-ba9ab955afce', options=None) from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:46)

2. It is discarded at 08:27:40 (as per the logs above), but still goes ahead and takes 50 seconds to complete.

2018-07-04 08:27:00,667+0200 INFO  (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '53687091200', 'apparentsize': '53687091200'} from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:52)

3. Then it proceeds to the second disk, which completes immediately. We are 50 seconds late.

2018-07-04 08:27:00,669+0200 INFO  (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'6328070d-649d-4c57-8db0-0220c9cb4470', volUUID=u'd04beb3a-db86-4dc1-af24-7139a0095942', options=None) from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:46)
2018-07-04 08:27:00,670+0200 INFO  (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '826781204480', 'apparentsize': '826781204480'} from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:52)

I'm not entirely sure, but I think these delays end up causing the discarded UpdateVolumes tasks and, ultimately, the monitors going unresponsive, as VM monitoring is being starved in the executor by these long-standing getVolumeSize tasks.

It seems the stuck getVolumeSize() calls on storage domain eff51393-6316-4f17-a4dc-83ad85a0b765 are blocked by this refresh, which finishes at 08:27:00, after which all the stuck getVolumeSize() calls return.

2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.Monitor] Refreshing domain eff51393-6316-4f17-a4dc-83ad85a0b765 (monitor:487)
2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (storage.sdc.refreshStorage) (misc:467)

2018-07-04 08:26:26,071+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.iscsi.rescan) (misc:467)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.ISCSI] Performing SCSI scan, this will take up to 30 seconds (iscsi:457)
2018-07-04 08:26:26,073+0200 DEBUG (monitor/eff5139) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-119 /usr/bin/sudo -n /sbin/iscsiadm -m session -R (cwd None) (commands:69)
2018-07-04 08:26:26,446+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.rescan) (misc:467)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,448+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)

2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:49,986+0200 DEBUG (monitor/eff5139) [root] /usr/bin/taskset --cpu-list 0-119 /sbin/udevadm settle --timeout=5 (cwd None) (commands:69)
2018-07-04 08:26:50,048+0200 DEBUG (monitor/eff5139) [root] SUCCESS: <err> = ''; <rc> = 0 (commands:93)
2018-07-04 08:26:50,616+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)

<lvm commands take around 2s each>

2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 01_img_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:758)
2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 02_vol_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:767)
2018-07-04 08:27:00,636+0200 INFO  (monitor/eff5139) [storage.StorageDomain] Resource namespace 03_lvm_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (blockSD:989)

So most of the time is spent on the HBA scan and the first "trying to enter sampling method".

Interestingly enough, most of the VMs that get stuck on UpdateVolumes have Direct LUNs (20 out of 21). The one I used as an example above is the only one without a Direct LUN. So maybe this is related: https://bugzilla.redhat.com/show_bug.cgi?id=1517535

Should this slower storage refreshing trigger unresponsive VMs? The VMs are running fine (confirmed by reading the monotonic clock from inside).

Version-Release number of selected component (if applicable):
vdsm-4.19.50-1.el7ev.x86_64

How reproducible:
Unknown

Comment 3 Tal Nisan 2018-07-05 12:56:11 UTC
Maor, you are this week's QE contact, please scrub

Comment 4 Maor 2018-07-05 14:02:05 UTC
I saw the following error in the VDSM logs:

2018-07-04 08:05:11,680+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
2018-07-04 08:05:11,682+0200 WARN  (jsonrpc/0) [virt.vm] (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
....

Can you please add some more indicators, such as cpu usage, load, sar, iotop of the host while the VMs get unresponsive.
Can you also add libvirt, qemu and engine logs

Thanks

Comment 6 Germano Veit Michel 2018-07-06 01:26:31 UTC
Hi Maor,

Thanks for taking a look.

(In reply to Maor from comment #4)
> I saw the following error in the VDSM logs:
> 
> 2018-07-04 08:05:11,680+0200 WARN  (jsonrpc/0) [virt.vm]
> (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> 2018-07-04 08:05:11,682+0200 WARN  (jsonrpc/0) [virt.vm]
> (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> ....

Yes, as in comment #0.

> Can you please add some more indicators, such as cpu usage, load, sar, iotop
> of the host while the VMs get unresponsive.

Sure. There are plenty of resources available and no overcommit. It seems the delays on entering the sampling method happen in several places, and fc-scan is always slow (there is a large number of LUNs presented). The lockup looks like it is in VDSM to me. All VMs are running just fine; there is no detected performance issue anywhere else.

> Can you also add libvirt, qemu and engine logs

RHV doesn't enable libvirt or qemu logging by default, so I'll add the engine logs.

> It might also be related to this bug https://bugzilla.redhat.com/1419856
> which libvirt version is the customer using?
libvirt-daemon-3.2.0-14.el7_4.9.x86_64

Please let me know if you need anything else.

Comment 16 Germano Veit Michel 2018-12-11 04:24:03 UTC
Similar case: UpdateVolumes for each VM is stuck because getVolumeSize is delayed/not running/blocked. In this case it happens when adding a new SD.

vdsm-4.19.45-1.el7ev.x86_64

Comment 18 Germano Veit Michel 2018-12-11 04:30:15 UTC
From the data of comment #17:

2018-11-08 11:15:19,064+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:16:19,034+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:17:19,048+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:19:20,278+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 2m01s delayed
2018-11-08 11:19:24,344+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 5s delayed
2018-11-08 11:20:19,040+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:21:19,076+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:22:19,046+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',

+

2018-11-08 11:19:17,412+0100 WARN  (jsonrpc/6) [virt.vm] (vmId='a4ab6989-274b-4d50-9bec-1680e45442a1') monitor became unresponsive (command timeout, age=65.4700000007) (vm:5132)

So it looks like UpdateVolumes was stuck from ~11:18 to 11:19:20, and the monitor goes unresponsive because getVolumeSize for the VM's volumes is stuck?
Tons of getVolumeSize calls were delayed/blocked while the user was adding a new SD to the pool.

Is there any improvement on this area in vdsm-4.20?

Comment 19 Roy Golan 2018-12-19 07:22:28 UTC
I'm positive we landed a few fixes in 4.2 to reduce the number of rescans we perform, after gchen found out we repeat them about 4 times for every path.


Germano, how many paths are configured there?


Meanwhile, I'll try to dig up the relevant BZ.

Comment 20 Germano Veit Michel 2018-12-20 01:02:59 UTC
(In reply to Roy Golan from comment #19)
> Germano, how many paths are configured there?

This one has 8 paths ;)

Indeed, it should be better in 4.2 with https://gerrit.ovirt.org/#/c/81681/

Still, can't this be improved so GetVolumeSize() calls don't pile up while the SD is refreshing/producing? It seems all GetVolumeSize() calls wait for the refresh; with enough VMs that means a ton of blocked UpdateVolumes tasks.

Comment 21 Sandro Bonazzola 2019-01-28 09:44:25 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 23 Nir Soffer 2019-02-26 16:09:49 UTC
I think this issue should be solved by moving volume size monitoring to 
storage domain monitors.

When an image is prepared, we can register the leaf volume with the
storage domain monitor.

Then the storage domain monitor can check volume size every N seconds
and cache the value.

When the virt APIs need the value, they will get the cached value from
repoStats.

When an image is torn down, the storage monitor will stop monitoring the leaf
volume.

When taking a snapshot, replacing the active leaf volume, the new
leaf volume should be registered with the storage domain monitor,
replacing the old leaf volume (that will never change).

If storage is not responsive, the relevant volume sizes will be stale, but
nothing in the system will block.

The UpdateVolume periodic task will be removed.
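
For illustration only, a rough sketch of this idea (not actual vdsm code; all names
here are hypothetical): the domain monitor owns a small registry of leaf volumes and
a cache of their sizes, and virt code reads the cache without ever blocking on storage.

    import threading

    class LeafVolumeSizeCache(object):
        """Hypothetical per-domain cache of leaf volume sizes."""

        def __init__(self, read_size):
            self._read_size = read_size   # callable doing the actual (slow) size lookup
            self._lock = threading.Lock()
            self._leaves = set()          # (img_uuid, vol_uuid) of prepared images
            self._sizes = {}              # (img_uuid, vol_uuid) -> size in bytes

        def register(self, img_uuid, vol_uuid):
            # Called when an image is prepared, or when a snapshot replaces the leaf.
            with self._lock:
                self._leaves.add((img_uuid, vol_uuid))

        def unregister(self, img_uuid, vol_uuid):
            # Called when the image is torn down.
            with self._lock:
                self._leaves.discard((img_uuid, vol_uuid))
                self._sizes.pop((img_uuid, vol_uuid), None)

        def volume_size(self, img_uuid, vol_uuid):
            # Read path for virt/repoStats: returns the cached value, never blocks on storage.
            with self._lock:
                return self._sizes.get((img_uuid, vol_uuid))

        def refresh(self):
            # Called from the domain monitor every N seconds; only this thread waits
            # on storage, so the worst case under slow storage is stale values.
            with self._lock:
                leaves = list(self._leaves)
            for key in leaves:
                size = self._read_size(*key)
                with self._lock:
                    self._sizes[key] = size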

Comment 26 Daniel Gur 2019-08-28 13:13:15 UTC
sync2jira

Comment 27 Daniel Gur 2019-08-28 13:17:28 UTC
sync2jira

Comment 29 Nir Soffer 2019-09-15 23:54:24 UTC
Normally getting the volume size is quick. We need to find the root cause that makes
these calls slow.

Updating the volume size calls this every 60 seconds:

3842     def updateDriveVolume(self, vmDrive):
3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
3844             return
3845 
3846         volSize = self._getVolumeSize(
3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
3848             vmDrive.volumeID)
3849 
3850         vmDrive.truesize = volSize.truesize
3851         vmDrive.apparentsize = volSize.apparentsize

getVolumeSize() is:

3067     @public
3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
...
3086         dom = sdCache.produce(sdUUID=sdUUID)
3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
3089         return dict(apparentsize=apparentsize, truesize=truesize)

On block storage getVSize and getVAllocSize are the same, so we have
2 calls to:

 421     def getVSize(self, imgUUUID, volUUID):
 422         """ Return the block volume size in bytes. """
 423         try:
 424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
 425         except IOError as e:
 426             if e.errno != errno.ENOENT:
 427                 raise
 428             # Inactive volume has no /dev entry. Fallback to lvm way.
 429             size = lvm.getLV(self.sdUUID, volUUID).size
 430 
 431         return int(size)

Because the LVs are active (these calls are for running VMs), we know we don't use
lvm to get the size but call fsutils.size(), which is:

26 def size(filename):
 27     """
 28     Return actual file size, should work with both file and block device.
 29     """
 30     with io.open(filename, "rb") as f:
 31         f.seek(0, os.SEEK_END)
 32         return f.tell()

Which should be fast, but maybe in these cases is not.

An easy optimization would be to call getVSize() only once on block
storage instead of twice. In the case when seeking to the end of the
device is slow, this may improve the situation.
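
For illustration, a hedged sketch of that optimization (simplified from the code
quoted above; the block-domain check shown here is a hypothetical helper, not the
actual vdsm API):

    def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
        dom = sdCache.produce(sdUUID=sdUUID)
        # On block storage getVSize() and getVAllocSize() return the same value,
        # so seek the device once instead of twice.
        apparentsize = dom.getVSize(imgUUID, volUUID)
        if is_block_domain(dom):  # hypothetical helper, for illustration only
            truesize = apparentsize
        else:
            truesize = dom.getVAllocSize(imgUUID, volUUID)
        return dict(apparentsize=str(apparentsize), truesize=str(truesize))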

But we really want to understand why this operation is slow on
certain systems.

Germano, can you share all WARN logs vdsm log and sanlock.log?

Comment 30 Germano Veit Michel 2019-09-16 00:20:48 UTC
Hi Nir,

Thanks for taking a look at this.

(In reply to Nir Soffer from comment #29)
> Normally getting volume size is quick. we need to find the root cause causing
> these calls to be slow.
> 
> Updating volume size call this every 60 seconds:
> 
> 3842     def updateDriveVolume(self, vmDrive):
> 3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
> 3844             return
> 3845 
> 3846         volSize = self._getVolumeSize(
> 3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
> 3848             vmDrive.volumeID)
> 3849 
> 3850         vmDrive.truesize = volSize.truesize
> 3851         vmDrive.apparentsize = volSize.apparentsize
> 
> getVolumeSize() is:
> 
> 3067     @public
> 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID,
> options=None):
> ...
> 3086         dom = sdCache.produce(sdUUID=sdUUID)
> 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> 3089         return dict(apparentsize=apparentsize, truesize=truesize)
> 

I might be wrong, but I think it's sdCache.produce above that slows down monitoring.
getVolumeSize is stuck on the executor until a storage refresh completes.
And on larger environments (many LUNs), HBA rescan can take longer time too, 
enough to make VM monitoring take more time.

I think having a few paths down also slows down the HBA rescan.

Once the storage refresh completes, all stuck getVolumeSize() complete immediately
after.

> On block storage getVSize and getVAllocSize are the same, so we have
> 2 calls to:
> 
>  421     def getVSize(self, imgUUUID, volUUID):
>  422         """ Return the block volume size in bytes. """
>  423         try:
>  424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
>  425         except IOError as e:
>  426             if e.errno != errno.ENOENT:
>  427                 raise
>  428             # Inactive volume has no /dev entry. Fallback to lvm way.
>  429             size = lvm.getLV(self.sdUUID, volUUID).size
>  430 
>  431         return int(size)
> 
> Because the LVs are active (called for running VMs, we know we don't use
> lvm to get the size but call fsutils.size(), which is:
> 
> 26 def size(filename):
>  27     """
>  28     Return actual file size, should work with both file and block device.
>  29     """
>  30     with io.open(filename, "rb") as f:
>  31         f.seek(0, os.SEEK_END)
>  32         return f.tell()
> 
> Which should be fast, but maybe in these cases is not.
> 
> An easy optimization would be to call getVSize() only once on block
> storage instead of twice. In the case when seeking to the end of the
> device is slow, this may improve the situation.
> 
> But we really want to understand why this operation is slow on the 
> certain systems.
> 
> Germano, can you share all WARN logs vdsm log and sanlock.log?

To open this bug I collected some debug logs from the initial customer, 
aren't the ones attached initially good enough?

Occasionally some HBA rescans time out too.

From what I remember from the calls, in these cases sanlock is fine, and so is
I/O to storage. The common scenario is that they all have lots of SDs
and LUNs.

>  28     Return actual file size, should work with both file and block device.
>  29     """
>  30     with io.open(filename, "rb") as f:
>  31         f.seek(0, os.SEEK_END)
>  32         return f.tell()
> 
> Which should be fast, but maybe in these cases is not.

Yes, these should be fast. If this was slow, then the problem would not
be in VDSM I believe, and probably the VMs would be suffering too.

> 
> An easy optimization would be to call getVSize() only once on block
> storage instead of twice. In the case when seeking to the end of the
> device is slow, this may improve the situation.
> 
> But we really want to understand why this operation is slow on the 
> certain systems.
> 
> Germano, can you share all WARN logs vdsm log and sanlock.log?

To open this bug I collected some debug logs from the initial customer, 
aren't the ones attached initially good enough? They are more informative
than the latest ones without debug, but I can get more if you wish.

I don't have debug logs from latest 4.3 on this. Let me know if you need them
then I'll collect a new round, along with sanlock (which was fine when I checked).

Comment 31 Nir Soffer 2019-09-16 21:06:00 UTC
(In reply to Germano Veit Michel from comment #30)
...
> > getVolumeSize() is:
> > 
> > 3067     @public
> > 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID,
> > options=None):
> > ...
> > 3086         dom = sdCache.produce(sdUUID=sdUUID)
> > 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> > 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> > 3089         return dict(apparentsize=apparentsize, truesize=truesize)
> > 
> 
> I might be wrong, but I think it's sdCache.produce above that slows down
> monitoring.
> getVolumeSize is stuck on the executor until a storage refresh completes.
> And on larger environments (many LUNs), HBA rescan can take longer time too, 
> enough to make VM monitoring take more time.

Right, storage refreshes are a known issue that may block any flow in 
vdsm accessing a storage domain method, even if the storage domain is not
related (e.g. accessing an ISO domain during a storage refresh blocked
by a SCSI scan).

But we don't expect a lot of storage refreshes. We need to understand why we
have so many refreshes, and why they are slow.

> I think having a few paths down also slow down HBA rescan.
> 
> Once the storage refresh completes, all stuck getVolumeSize() complete
> immediately after.

This makes sense.

> > Germano, can you share all WARN logs vdsm log and sanlock.log?
> 
> To open this bug I collected some debug logs from the initial customer, 
> aren't the ones attached initially good enough? They are more informative
> then the latest ones without debug, but I can get more if you wish.

We need to look at them again, hopefully we will have enough info.

> I don't have debug logs from latest 4.3 on this. Let me know if you need them
> then I'll collect a new round, along with sanlock (which was fine when I
> checked).

If sanlock did not have any delays, then this is most likely an issue with 
storage refreshes blocking other operations.

It would help to time a manual verbose fc-scan:

    time /usr/libexec/vdsm/fc-scan -v

Run it on a host that shows this issue, and share the output of the command. It
logs the time it took for every host/device scan.

Since we support device resize detection, this tool now scans not only
the scsi hosts (/sys/class/scsi_host/%s/scan) but also all the scsi devices
(/sys/class/scsi_device/*).

This was added as part of bug 609689 in https://gerrit.ovirt.org/c/43388/.
Since vdsm does not manage devices, and can miss a request to resize devices
when a host is offline or temporarily disconnected from the engine, vdsm scans
all fc_hosts and all devices connected to these hosts whenever we refresh
storage.


Looking at the refresh code in hsm.py:

 78     @misc.samplingmethod
 79     def refreshStorage(self, resize=True):
 80         self.__staleStatus = self.STORAGE_REFRESHING
 81 
 82         multipath.rescan()
 83         if resize:
 84             multipath.resize_devices()
 85         lvm.invalidateCache()

We don't need to scan devices if resize is False.


Checking all the places we call refreshStorage:


2796     def getStorageDomainsList(
2797             self, spUUID=None, domainClass=None, storageType=None,
2798             remotePath=None, options=None):
...
2814         sdCache.refreshStorage()

Looks like we can refresh with resize=False here.


2872     def getVGList(self, storageType=None, options=None):
...
2882         sdCache.refreshStorage()

Same, resize is not needed.


 411         def storageRefresh():
 412             sdCache.refreshStorage()

This is done during startup - since we don't know if a device was
resized on other hosts when vdsm was down, we really need to use
resize=True here.


1993     def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
1994         sdCache.refreshStorage()

Here we need to resize since the user expects to see the real device size after
this call.


2406     def connectStorageServer(self, domType, spUUID, conList, options=None):
...
2445         # In case there were changes in devices size
2446         # while the VDSM was not connected, we need to
2447         # call refreshStorage.
2448         if domType in (sd.FCP_DOMAIN, sd.ISCSI_DOMAIN):
2449             sdCache.refreshStorage()

Here we must resize, but rescanning devices is not needed if we just connected to
the storage.


2495     def disconnectStorageServer(self, domType, spUUID, conList, options=None):
...
2530         # Disconnecting a device may change the visible storage domain list
2531         # so invalidate the caches
2532         sdCache.refreshStorage(resize=False)

Here we don't do any resize, so we don't need to scan anything.


Refreshes in sdc.py:

127         try:
128             # If multiple calls reach this point and the storage is not
129             # updated the refreshStorage() sampling method is called
130             # serializing (and eventually grouping) the requests.
131             if self.__staleStatus != self.STORAGE_UPDATED:
132                 self.refreshStorage()

We don't call multipath.resize here so there is no need to rescan devices.


Looks like we should:

- Add an argument to fc-scan to disable device rescan
- Add a rescan_devices=False argument to refreshStorage
- Call refreshStorage(rescan_devices=True) only in the cases that must
  rescan devices and resize them (see the sketch below).
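
As a rough sketch of that direction (not actual vdsm code: multipath.rescan() does not
currently take this argument, and the plumbing down to fc-scan --no-devices is
hypothetical):

    @misc.samplingmethod
    def refreshStorage(self, resize=True, rescan_devices=True):
        self.__staleStatus = self.STORAGE_REFRESHING

        # Only scan individual SCSI devices (fc-scan without --no-devices)
        # when we need to detect a LUN resize.
        multipath.rescan(rescan_devices=rescan_devices)
        if resize:
            multipath.resize_devices()
        lvm.invalidateCache()

Call sites would then follow the analysis above, for example:

    sdCache.refreshStorage(resize=False, rescan_devices=False)  # getStorageDomainsList, getVGList
    sdCache.refreshStorage()                                     # vdsm startup, _getDeviceList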

Comment 32 Nir Soffer 2019-09-16 21:12:41 UTC
Germano, I posted a quick untested patch for fc-scan to allow disabling 
device scanning.

It would help if you can share the output of these commands on a system
showing this issue.

    time /usr/libexec/vdsm/fc-scan -v

    time /usr/libexec/vdsm/fc-scan -v --no-devices

Comment 33 Germano Veit Michel 2019-09-16 23:18:34 UTC
(In reply to Nir Soffer from comment #31)
> Looks like we should:
> 
> - Add argument to fc-scan to disable device rescan
> - Add rescan_devices=False argument to refreshStorage
> - Call refreshStorage(rescan_device=True) only in the case that must
>   to rescan devices and resize them.

Makes a lot of sense to me, should save some time.

(In reply to Nir Soffer from comment #32)
> Germano, I posted a quick untested patch for fc-scan to allow disabling 
> device scanning.
> 
> It would help if you can get share the output of these commands on a system
> showing this issue.
> 
>     time /usr/libexec/vdsm/fc-scan -v
> 
>     time /usr/libexec/vdsm/fc-scan -v --no-devices

I've requested it to the customer that was facing these issues, along with a 
fresh sosreport after fixing other problems that could also have been contributing
to the slowdowns.

Will let you know once we have results.

Thanks!

Comment 34 Nir Soffer 2019-09-17 00:24:38 UTC
Looking in vdsm.log from attachment 1456628 [details]:


## FC scans

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | wc -l
34

$ head -1 vdsm.log.3 
2018-07-04 08:01:06,269+0200 ...

$ tail -1 vdsm.log.3 
2018-07-04 09:01:01,772+0200 ...

34 scans in 1 hour - does not make sense. We don't expect so many storage refreshes.


$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep monitor/ | wc -l
33

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep jsonrpc/ | wc -l
1

One scan was triggered by an API call, all others by storage domain monitors.


Looking at the fc-scan calls, they take about 25 seconds each, and they are performed in
groups, back to back:

1. 2 scans

2018-07-04 08:36:41,861+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:05,450+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:37:06,137+0200 DEBUG (monitor/62b40d0) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:29,223+0200 DEBUG (monitor/62b40d0) [storage.HBA] Scan finished (hba:64)

2. After a minute, 3 scans

2018-07-04 08:41:23,481+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:41:45,052+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:41:45,592+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:09,433+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:42:10,136+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:33,887+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)

3. After 3:30 minutes, 4 scans

2018-07-04 08:46:00,457+0200 DEBUG (monitor/632f710) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:22,423+0200 DEBUG (monitor/632f710) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:23,010+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:46,789+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:47,317+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:10,096+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:47:10,774+0200 DEBUG (monitor/c9faf72) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:34,787+0200 DEBUG (monitor/c9faf72) [storage.HBA] Scan finished (hba:64)


## Warnings

$ grep WARN vdsm.log.3 | wc -l
979

$ egrep 'WARN.+is blocked for' vdsm.log.3 | head -10
2018-07-04 08:01:11,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 320.00 seconds (check:279)
2018-07-04 08:01:21,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 330.00 seconds (check:279)
2018-07-04 08:01:31,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 340.00 seconds (check:279)
2018-07-04 08:01:41,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 350.00 seconds (check:279)
2018-07-04 08:01:51,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 360.00 seconds (check:279)
2018-07-04 08:02:01,282+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 370.00 seconds (check:279)
2018-07-04 08:02:11,278+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 380.00 seconds (check:279)
2018-07-04 08:02:21,287+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 390.00 seconds (check:279)
2018-07-04 08:02:31,285+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 400.00 seconds (check:279)
2018-07-04 08:02:41,287+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 410.00 seconds (check:279)

There are serious issues with the NFS domain dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22;
it was blocked for 840 seconds:

2018-07-04 08:55:31,279+0200 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 840.00 seconds (check:279)


$ egrep 'WARN.+is blocked for' vdsm.log.3 | grep -v dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22

No other domain had issues.


## Storage cache invalidations

I think the issue is the storage domain cache. Every access to a storage domain looks up
an internal cache. If the domain is in the cache there is no delay and the code continues.

But if the domain is not in the cache, because the cache was invalidated, the caller tries
to refresh storage. This operation does:

- scsi rescan
- fc rescan
- resize multipath devices

Because of the way the cache is built, all this is typically done twice when multiple
storage domains try to enter the domain cache.

In our case, an fc scan takes about 25 seconds. So code accessing a storage domain may
typically be blocked for at least 25-50 seconds on the cache, only because of slow fc-scans.

Why would a domain not be in the cache? Every 300 seconds, storage domain monitors remove
the domain from the cache so the next access to the domain recreates the domain
object and refreshes the domain metadata.

The cache is invalidated in:

- connectStorageServer()
- StoragePool.connect(), called from connectStoragePool()
- StoragePool.attachSD(), called from attachStorageDomain()
- StoragePool.refresh(), called from connectStoragePool()

Looking at the relevant calls:

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:04:05,141+0200
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:23:40,594+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | wc -l
15

So we had 15 cache invalidations in one hour.

$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}' | wc -l
13

Almost all the calls are connectStorageServer. I guess this is the engine trying to recover
the inaccessible NFS domain by reconnecting to the server.

Let's try to mix the cache invalidations with the rescans:

$ egrep '\[storage.HBA\] Starting scan' vdsm.log.3 | awk '{print $1,$2,"rescan"}' > rescan.log
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2,"invalidate cache"}' > invalidate.log
$ cat invalidate.log rescan.log | sort
2018-07-04 08:05:00,203+0200 invalidate cache
2018-07-04 08:10:00,235+0200 invalidate cache
2018-07-04 08:15:00,434+0200 invalidate cache
2018-07-04 08:16:18,808+0200 rescan
2018-07-04 08:16:41,230+0200 rescan
2018-07-04 08:17:05,614+0200 rescan
2018-07-04 08:20:00,284+0200 invalidate cache
2018-07-04 08:21:24,552+0200 rescan
2018-07-04 08:21:46,828+0200 rescan
2018-07-04 08:22:10,778+0200 rescan
2018-07-04 08:22:44,676+0200 invalidate cache
2018-07-04 08:23:40,685+0200 rescan
2018-07-04 08:24:03,157+0200 rescan
2018-07-04 08:24:27,217+0200 rescan
2018-07-04 08:24:44,517+0200 invalidate cache
2018-07-04 08:26:03,833+0200 rescan
2018-07-04 08:26:26,448+0200 rescan
2018-07-04 08:26:50,643+0200 rescan
2018-07-04 08:30:00,290+0200 invalidate cache
2018-07-04 08:31:09,170+0200 rescan
2018-07-04 08:31:31,503+0200 rescan
2018-07-04 08:31:55,619+0200 rescan
2018-07-04 08:32:21,896+0200 rescan
2018-07-04 08:35:00,166+0200 invalidate cache
2018-07-04 08:36:16,809+0200 rescan
2018-07-04 08:36:41,861+0200 rescan
2018-07-04 08:37:06,137+0200 rescan
2018-07-04 08:40:00,148+0200 invalidate cache
2018-07-04 08:41:23,481+0200 rescan
2018-07-04 08:41:45,592+0200 rescan
2018-07-04 08:42:10,136+0200 rescan
2018-07-04 08:45:00,261+0200 invalidate cache
2018-07-04 08:46:00,457+0200 rescan
2018-07-04 08:46:23,010+0200 rescan
2018-07-04 08:46:47,317+0200 rescan
2018-07-04 08:47:10,774+0200 rescan
2018-07-04 08:50:00,183+0200 invalidate cache
2018-07-04 08:51:08,685+0200 rescan
2018-07-04 08:51:30,834+0200 rescan
2018-07-04 08:51:54,451+0200 rescan
2018-07-04 08:52:17,360+0200 rescan
2018-07-04 08:55:00,205+0200 invalidate cache
2018-07-04 08:56:16,675+0200 rescan
2018-07-04 08:56:39,366+0200 rescan
2018-07-04 08:57:03,588+0200 rescan
2018-07-04 08:57:26,343+0200 rescan
2018-07-04 09:00:00,232+0200 invalidate cache


The picture is pretty clear now:

1. NFS storage domain is not accessible
2. Engine tries to recover the domain by calling connectStorageServer every ~5 minutes
3. connectStorageServer invalidates the storage domain cache
4. The next access to any domain triggers 2-5 fc-scans, each taking 25 seconds
5. Calls to storage domain methods are blocked for at least 25-50 seconds during the scans
6. The virt executor gets stuck with blocked getVolumeSize calls
7. VMs become non-responsive

Germano, did you see the same picture in other cases?

Comment 35 Germano Veit Michel 2019-09-17 00:35:56 UTC
(In reply to Nir Soffer from comment #34)
> The picture is pretty clear now:
> 
> 1. NFS storage domain is not accessible
> 2. Engine tries to recover the domain by calling connectStorageServer every ~5
> minutes
> 3. connectStorageServer invalidates the storage domain cache
> 4. The next access to any domain triggers 2-5 fc-scans, each taking 25 seconds
> 5. Calls to storage domain methods are blocked for at least 25-50 seconds
> during the scans
> 6. The virt executor gets stuck with blocked getVolumeSize calls
> 7. VMs become non-responsive
> 
> Germano, did you see the same picture in other cases?

Brilliant, I think this connects all the dots.

Yes, one case I remember was when adding new storage domains. The current case also 
has a broken ISO domain.

Comment 36 Germano Veit Michel 2019-09-17 01:02:49 UTC
In connectStorageServer() vdsm invalidates the caches no matter what connection is being made.
Is this optimal? For example, connecting an ISO/Export Domain invalidates caches of everything. Is it really necessary?

IIUC, the problem can still happen without the extra cache invalidations. If getVolumeSize gets blocked for long the monitor will be non-responsive.
Not having the NFS problem makes it less frequent though.

Comment 37 Nir Soffer 2019-09-17 01:23:49 UTC
(In reply to Germano Veit Michel from comment #36)
> In connectStorageServer() vdsm invalidates the caches no matter what
> connection is being made.
> Is this optimal? For example, connecting an ISO/Export Domain invalidates
> caches of everything. Is it really necessary?

I don't think so, and I suggested to change this in the past:
https://gerrit.ovirt.org/c/45824/

But there were concerns about the safety of this change. I think we should
rethink this.

> IIUC, the problem can still happen without the extra cache invalidations. If
> getVolumeSize gets blocked for long the monitor will be non-responsive.
> Not having the NFS problem makes it less frequent though.

Yes, but this is real storage issue, and in this case libvirt and qemu will
get stuck anyway.

Comment 38 Germano Veit Michel 2019-09-17 01:47:53 UTC
(In reply to Nir Soffer from comment #37)
> > IIUC, the problem can still happen without the extra cache invalidations. If
> > getVolumeSize gets blocked for long the monitor will be non-responsive.
> > Not having the NFS problem makes it less frequent though.
> 
> Yes, but this is real storage issue, and in this case libvirt and qemu will
> get stuck anyway.

But if 1 domain has problems and is blocking refresh, then all vms will be
non-responsive, even VMs that have no disks on that domain, right?

Also, there might be other reasons the refresh takes longer while the VMs
are all running happily, like a high number of devices, or a problem with a few
devices/paths which are not even being used.

No reply for the fc-scan yet, will post here as soon as I get it.

Comment 43 Nir Soffer 2019-09-18 11:38:29 UTC
(In reply to Germano Veit Michel from comment #38)
> (In reply to Nir Soffer from comment #37)
> > > IIUC, the problem can still happen without the extra cache invalidations. If
> > > getVolumeSize gets blocked for long the monitor will be non-responsive.
> > > Not having the NFS problem makes it less frequent though.
> > 
> > Yes, but this is real storage issue, and in this case libvirt and qemu will
> > get stuck anyway.
> 
> But if 1 domain has problems and is blocking refresh, then all vms will be
> non-responsive, even VMs that have no disks on that domain, right?

Right, this is exactly the case here. The problem is lack of storage domain
isolation.

Comment 44 Nir Soffer 2019-09-18 12:00:38 UTC
Created attachment 1616201 [details]
fc-scan with devices

Comment 45 Nir Soffer 2019-09-18 12:01:10 UTC
Created attachment 1616205 [details]
fc-scan without devices

Comment 46 Nir Soffer 2019-09-18 12:08:10 UTC
Created attachment 1616207 [details]
fc-scan with devices

Uploading again with fixed whitespace.

Comment 47 Nir Soffer 2019-09-18 12:27:23 UTC
I extracted the text from the pdf in attachment 1615732 [details] and uploaded:
- attachment 1616207 [details] - fc-scan with devices
- attachment 1616205 [details] - fc-scan without devices

Comparing total runtime:

- with devices: 4.6s
- without devices: 0.07s (65x faster)

Checking how time was spent checking devices:

We scanned 486 devices:
$ egrep 'Rescanned device .+:' fc-scan-with-devices.log | wc -l
486

Slowest scans:
$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | head
0.04
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03

Fastest scans
$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | tail
0.01
0.01
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00

Average scan:
$ awk '/Rescanned device .+:/ {sum+=$5; count+=1}; END {print sum/count}' fc-scan-with-devices.log
0.0179218

Total time spent in scanning:
$ awk '/Rescanned device .+:/ {sum+=$5}; END {print sum}' fc-scan-with-devices.log
8.71

We have some parallelism, using a thread per host.

There is also time spent in finding the devices for every host, but we cannot tell
how much from the log.

98.5% of the time is spent in scanning devices. Skipping device scanning when it is not needed
will make slow scans a non-issue, avoiding delays in other parts of the system.

Comment 48 Nir Soffer 2019-09-18 12:32:22 UTC
Germano, it would be nice if we can get text instead of pdf next time.

An easy way to collect the data is:

    $ (time fc-scan -v) > fc-scan-with-devices.log 2>&1
    $ (time fc-scan -v --no-devices) > fc-scan-without-devices.log 2>&1

Comment 49 Nir Soffer 2019-09-18 12:59:27 UTC
Checking how devices are distributed across fc_hosts and rports:

$ egrep 'Rescanning device .+/host2/' fc-scan-with-devices.log | wc -l
243
$ egrep 'Rescanning device .+/host4/' fc-scan-with-devices.log | wc -l
243

$ egrep 'Rescanning device .+/host4/rport-4:0-2' fc-scan-with-devices.log | wc -l
3
$ egrep 'Rescanning device .+/host4/rport-4:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-4' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host4/rport-4:0-6' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host2/rport-2:0-2' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-4' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-6' fc-scan-with-devices.log | wc -l
3

We use only 2 threads - one per fc_host with devices. Can we get better parallelism
by scanning with thread per rport-target-N combinations?

The relevant code is here:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/storage/fc-scan;h=7da6105971c7c253b20626cb2cd87509b0815778;hb=61404df72b75dc329a70c745b9e0af88ff7027a2

The relevant part is:

 106                     fd = os.open(path, os.O_WRONLY)
 107                     try:
 108                         os.write(fd, "1")
 109                     finally:
 110                         os.close(fd)

Can this work in parallel for multiple devices on different host-rport-target-N, like:

/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan

If it can work, we can speed up device scanning significantly.
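
For illustration, a hedged sketch of what scanning with a thread per host/rport/target
group could look like (this is not the current fc-scan code; the grouping regex and
pool size are assumptions):

    import os
    import re
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    # Matches .../hostH/rport-H:B-R/targetH:B:T/... in a device rescan path.
    TARGET_RE = re.compile(r"/(host\d+)/(rport-[\d:-]+)/(target[\d:]+)/")

    def rescan_devices(rescan_paths, max_workers=16):
        # Group rescan nodes so each host/rport/target is written serially,
        # while different targets are rescanned concurrently.
        groups = defaultdict(list)
        for path in rescan_paths:
            match = TARGET_RE.search(path)
            groups[match.groups() if match else path].append(path)

        def scan_group(paths):
            for path in paths:
                fd = os.open(path, os.O_WRONLY)
                try:
                    os.write(fd, b"1")
                finally:
                    os.close(fd)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(scan_group, paths) for paths in groups.values()]
            for future in futures:
                future.result()  # propagate any errors from the workers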

Ben, I don't know if you can answer this, but maybe you can tell me who can 
answer this?

Comment 50 Nir Soffer 2019-09-18 13:21:17 UTC
Summary of action items (in order of importance):

1. Disable device scanning when it is not needed
   I think most refreshes do not need device scanning

2. Optimize device scanning when it is needed (e.g. better parallelism)
   Important in getDeviceList and startup, when we must resize devices

3. Never trigger rescans when handling file based domains. It is never right to rescan
   hosts and/or devices when handling an NFS domain.

4. Never trigger fc_host scan when handling iSCSI domain or iscsi_host scan when handling
   FC domain (currently we always invoke both).

5. Minimize cache invalidations (sdCache.invalidateStorage)
   These may trigger a slow rescan blocking other flows. I think we can remove all calls.

6. Minimize cache refreshes (sdCache.refreshStorage)
   These may trigger a slow rescan blocking other flows. I think we can remove some calls.

7. Consider changing the sampling method to use the current operation's result
   instead of grouping waiters and performing a new operation for all waiters;
   this can cut blocking time by half (see the sketch below).
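
To illustrate item 7 with a simplified model (this is not vdsm's misc.samplingmethod,
just a sketch of the proposed behavior): a caller that arrives while an operation is
already running waits for and reuses that operation's result instead of queuing a
second run, so the worst-case wait drops from roughly two operations to one.

    import threading

    class SharedOperation(object):
        """Sketch: late callers reuse the in-flight result instead of rerunning."""

        def __init__(self, func):
            self._func = func
            self._lock = threading.Lock()
            self._event = None     # set when the in-flight run completes
            self._result = None

        def __call__(self):
            with self._lock:
                if self._event is not None:
                    # A run is already in progress: just wait for its result.
                    event, owner = self._event, False
                else:
                    event = self._event = threading.Event()
                    owner = True
            if owner:
                try:
                    self._result = self._func()
                finally:
                    with self._lock:
                        self._event = None
                    event.set()
            else:
                event.wait()
            return self._result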

Comment 51 Ben Marzinski 2019-09-18 16:15:45 UTC
(In reply to Nir Soffer from comment #49)
> We use only 2 threads - one per fc_host with devices. Can we get better parallelism
> by scanning with thread per rport-target-N combinations?

> Can this work in parallel for multiple devices on different host-rport-target-N, like:
> 
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan
> 
> If it can work, we can speed up device scanning significantly.
> 
> Ben, I don't know if you can answer this, but maybe you can tell me who can 
> answer this?

Looking over the relevant kernel code briefly, I don't see any obvious reason why not. However, I really don't know the kernel scsi-layer code well, so Ewan might be a better person to ask.

Comment 52 Ewan D. Milne 2019-09-18 17:20:31 UTC
It depends upon what you are doing.

There is a mutex on each SCSI Host object taken around the scan operation when
rescanning a host, or a target.  So if you use the "scan" sysfs node with wildcards
on the host object, you will wait on the mutex if multiple threads are performing
an operation on the same host.

Rescanning an already-discovered device, though, is different and does not take
the mutex.  So there is some potential for parallelism there.

However, one thing to be aware of is that scanning devices can take more time,
sometimes a lot more time, if there is also I/O in progress to the device.  I have
seen cases where it can take on the order of seconds for a device to respond to
inquiry commands under I/O load, because it apparently services the I/O with higher
priority.  Rescanning a device sends several such inquiry commands to it to get
all the VPD pages and attributes.

It is probably better to cache the information, and perhaps update it on a periodic
rotating basis (rather than scanning everything all at the same time).

Note also that since 7.5, if a device reports a unit attention ASC/ASCQ 3F 03
INQUIRY DATA HAS CHANGED, the kernel may perform the rescan automatically.
(A unit attention is only received if there is a command that can receive the
status, so if there is no I/O to the device the reporting can be delayed or not
received.  If multipath is used, typically the path checker will receive the UA.)

This does not happen for ASC/ASCQ 2A 09 CAPACITY DATA HAS CHANGED, though.
This had to do with multipath requiring the capacity on the underlying paths
to match, so if some paths are seen to change but not others, the paths would
not be used.  (I think multipath may have been changed to not check this, at some
point but I'm not certain.)


static void scsi_report_sense(struct scsi_device *sdev,
                              struct scsi_sense_hdr *sshdr)
{
        enum scsi_device_event evt_type = SDEV_EVT_MAXBITS;     /* i.e. none */

        if (sshdr->sense_key == UNIT_ATTENTION) {
                if (sshdr->asc == 0x3f && sshdr->ascq == 0x03) {
                        evt_type = SDEV_EVT_INQUIRY_CHANGE_REPORTED;
                        sdev_printk(KERN_WARNING, sdev,
                                    "Inquiry data has changed");

...

static void scsi_evt_emit(struct scsi_device *sdev, struct scsi_event *evt)
{
        int idx = 0;
        char *envp[3];

        switch (evt->evt_type) {
        case SDEV_EVT_MEDIA_CHANGE:
                envp[idx++] = "SDEV_MEDIA_CHANGE=1";
                break;
        case SDEV_EVT_INQUIRY_CHANGE_REPORTED:
                scsi_rescan_device(&sdev->sdev_gendev);                <=====
                envp[idx++] = "SDEV_UA=INQUIRY_DATA_HAS_CHANGED";
                break;

...

static ssize_t
store_rescan_field (struct device *dev, struct device_attribute *attr,
                    const char *buf, size_t count)
{
        scsi_rescan_device(dev);                                       <=====
        return count;
}
static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field);

Comment 53 Nir Soffer 2019-09-18 18:12:05 UTC
(In reply to Ewan D. Milne from comment #52)
> It depends upon what you are doing.

I'll try to describe RHV use case.

We have multiple hosts in a cluster (e.g. 50), all connected to the same
LUNs. The reason is so we can quickly run a VM that uses this storage on any of
these hosts, without changing the storage configuration (e.g. mapping a LUN to a host).

All the LUNs use multipath. We create an LVM volume group from one or more LUNs,
and create LVM logical volumes that are used as VM virtual disks.

When we run a VM on some host, or perform a storage operation (e.g. cloning
a virtual disk), we activate the logical volume on that host.

When the volume group becomes full, users can resize the LUN on the storage server.
Since we don't manage the resize, we need to rescan the relevant device on
all hosts so that all hosts are on the same page and see the new size of the device.

Once we rescan devices on all hosts, we need to resize the LVM physical volume.
This is done on a master host managing LVM metadata. Other hosts will see this
update when they reload this volume group.

Optimally we would rescan only the device that was resized, but since we don't
manage the resize, we don't know which device it was. We could ask the
user to select the device in the RHV UI, but that is a pretty hard task when you
have lots of devices. Also, the user resizing the device on the storage server is
typically not the same user managing the RHV setup.

So instead of making the user's life hard, we rescan all the devices when a user
asks to update device sizes. Then we display the devices that may be extended
(device size > LVM physical volume size). The user can select the devices
and trigger a resize of the physical volume.
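The check itself is conceptually simple; a rough sketch (a hypothetical helper
using the lvm2 and util-linux command line tools, not the actual vdsm/engine code):

import subprocess

def pv_sizes():
    # Report each LVM physical volume with its size in bytes.
    out = subprocess.check_output(
        ["pvs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "pv_name,pv_size"]).decode()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            yield fields[0], int(float(fields[1]))

def device_size(dev):
    # Size of the underlying block device in bytes, as seen by the kernel.
    return int(subprocess.check_output(["blockdev", "--getsize64", dev]))

for pv, pv_size in pv_sizes():
    size = device_size(pv)
    # In practice some headroom for LVM metadata and extent alignment must be
    # allowed before flagging a device as resized.
    if size > pv_size:
        print("%s: device size %d > PV size %d, may be extended" % (pv, size, pv_size))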

Since we manage a cluster of hosts, and a host can be disconnected from the
cluster at any time, we need to make sure a host will rescan devices before
connecting back to the cluster, in case there was an update when the host
was not connected.

So we need an efficient way to scan a large number of devices. I think we
test with up to 300 LUNs, with 4 paths per LUN, 1200 devices in total.
 
> Rescanning an already-discovered device, though, is different and does not take
> the mutex.  So there is some potential for parallelism there.

Great.

> However, one thing to be aware of is that scanning devices can take more time,
> sometimes a lot more time, if there is also I/O in progress to the device.
> I have seen cases where it can take on the order of seconds for a device to
> respond to inquiry commands under I/O load, because it apparently services
> the I/O with higher priority.  Rescanning a device sends several such inquiry
> commands to it to get all the VPD pages and attributes.

Yes, we learned this the hard way: scanning 400 devices took about 25 seconds
with 2 threads (one thread per host).

I hope that we can cut the time by a factor of 4 by having one thread per
host-rport-target-N.
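A simplified sketch of parallelizing the per-device rescans (hypothetical code,
not the actual fc-scan patch):

import glob
from concurrent.futures import ThreadPoolExecutor

def rescan(path):
    # Writing "1" to a device's "rescan" node asks the kernel to rescan it.
    with open(path, "w") as f:
        f.write("1")

# One rescan node per SCSI device under every FC host/rport/target
# (the same layout as the example paths quoted at the top of this discussion).
paths = glob.glob("/sys/class/fc_host/host*/device/rport-*/target*/*/rescan")

with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(rescan, paths))

With many devices, a pool of a few dozen threads keeps the slowest LUNs from
dominating the total scan time, which is what the measurements later in this bug show.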

> It is probably better to cache the information, and perhaps update it on a
> periodic rotating basis (rather than scanning everything all at the same time).

The goal is to rescan devices only when:

- a user needs to perform an operation with devices, like creating
  a new LVM volume group in the management system. At this point the user
  wants to see the current size of the devices.

- a user performs a resize of the LVM physical volume, on all hosts.

- a host connects back to the cluster after maintenance mode or after
  a restart of the management program on the host.

Comment 55 Nir Soffer 2019-09-21 01:03:36 UTC
Daniel, this is mostly a scale issue, affecting users with large setups
(many hosts, many VMs, many LUNs, many storage domains).

We have an initial patch for speeding up device scanning:
https://gerrit.ovirt.org/c/103455/

To test this patch I need access to 1-2 hosts connected to FC storage
with many LUNs. I think that 300 LUNs with 4 paths per LUN, 1200 SCSI
devices in total, would be a good test.

We need to test:
- time to scan devices on an idle host
- the same when scanning from multiple hosts at the same time
- the same when there is I/O on some devices
  (e.g. copying disks while scanning)

Comment 61 Nir Soffer 2019-09-26 12:01:14 UTC
Created attachment 1619485 [details]
graph: fc-scan with 222 devices

The graph shows that we get about a 10x speedup with 20 threads; more threads do not improve results significantly.

Previously we were using one thread per fc_host (typically 1 thread), so the current
patches can give a ~5x improvement.

We have not yet tested with larger setups or with I/O load.

Comment 63 Nir Soffer 2019-10-23 19:03:08 UTC
I tested again with a bigger setup: 327 multipath FC LUNs with
4 paths per LUN, 1311 paths in total.

# lsblk -s | grep mpath | wc -l
327

# lsblk -s | grep disk | wc -l
1311

Tested with current fc-scan from 4.3:

# rpm -q vdsm
vdsm-4.30.30-1.el7ev.x86_64

And with fc-scan from this patch:
https://gerrit.ovirt.org/c/103455/

fc-scan from 4.3 uses one thread per port, so it gets 2 threads.

First I reproduced the issue of slow fc-scan by running the scan
while copying ten 100 GiB disks using qemu-img.

When the system was idle, fc-scan took 10.971 seconds. Under load,
it took 26.097 seconds.

With the new fc-scan, using a thread pool of 64 threads, run time improved
to 1.081 seconds (10x faster) when the system is idle and 1.356 seconds
under load (19x faster).

We get most of the benefit with a smaller thread pool (e.g. 32), but a bigger
thread pool makes fc-scan more resilient when scanning some LUNs is
significantly slower.


Here are the complete results (times in seconds):

workers  current/idle  current/load  fix/idle    fix/load
---------------------------------------------------------
  1      -             -             19.926      41.250 
  2      10.971        26.097        10.796      21.453
  4      -             -              5.93       11.538
  8      -             -              3.467       6.168
 16      -             -              2.134       3.387
 32      -             -              1.434       2.087
 64      -             -              1.081       1.356
128      -             -              0.855       0.965
256      -             -              0.819       0.917
512      -             -              0.880       0.963

This graph makes the numbers clearer:
https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vQk37hO7sn-v6eLAHFwo7-3Z1W-5XAVQ8j-EUoGvZjfbIfEToWhK3Sf6RWQxu7BdqOHwoLC69bf22Oe/pubchart?oid=1308885970&format=image

Times are the average of 10 runs, measured like this:

# (for i in $(seq 10); do PYTHONPATH=lib /usr/bin/time -f "%e" lib/vdsm/storage/fc-scan 2>&1; done) | awk '{sum+=$1} END{print sum/NR}'

Comment 64 Nir Soffer 2019-10-23 19:14:52 UTC
Created attachment 1628588 [details]
graph: fc-scan with 1311 devices

Comment 65 Nir Soffer 2019-10-23 19:16:23 UTC
Created attachment 1628589 [details]
script for creating I/O load

Comment 66 Nir Soffer 2019-10-23 19:17:46 UTC
Created attachment 1628590 [details]
Output of fc-scan -v on idle system - current version

Comment 67 Nir Soffer 2019-10-23 19:19:58 UTC
Created attachment 1628592 [details]
Output of fc-scan -v with I/O load - current version

Comment 68 Nir Soffer 2019-10-23 19:21:26 UTC
Created attachment 1628593 [details]
Output of fc-scan -v on idle system - fix

Comment 69 Nir Soffer 2019-10-23 19:22:05 UTC
Created attachment 1628595 [details]
Output of fc-scan -v with I/O load - fix

Comment 70 Nir Soffer 2019-10-23 19:23:34 UTC
Created attachment 1628596 [details]
Output of lsblk -s on the test system

Comment 71 Nir Soffer 2019-10-23 19:24:31 UTC
Created attachment 1628597 [details]
Output of multipath -ll on test system

Comment 72 Nir Soffer 2019-10-23 20:30:20 UTC
How to reproduce:

1. Use a system with one host and many devices (this slows down fc-scan)
2. Start several VMs on the host
3. Add an ISO domain on a temporary NFS server that you will be able
   to block during the test. No need to attach any ISO to the VMs,
   just have the domain in the DC.
4. Simulate high I/O load by moving 10 big raw disks (e.g. 500 GiB)
   concurrently (a load generator is sketched below)
5. Block access to the ISO domain. After a couple of minutes the system
   will try to recover the domain by refreshing and invalidating caches.
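For step 4, a rough sketch of such a load generator (hypothetical volume
paths; this is not the attached script):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source/destination logical volumes; replace with real disk paths.
copies = [("/dev/vg0/src-%d" % i, "/dev/vg0/dst-%d" % i) for i in range(10)]

def copy(pair):
    src, dst = pair
    # Copy with the page cache bypassed on both ends so the I/O hits the FC devices.
    subprocess.check_call(
        ["qemu-img", "convert", "-f", "raw", "-O", "raw",
         "-T", "none", "-t", "none", src, dst])

with ThreadPoolExecutor(max_workers=len(copies)) as pool:
    list(pool.map(copy, copies))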

Expected results:
- Some VMs flip between Up and Non-responsive states
- Warnings in the vdsm log about blocked workers
- Possible fc-scan timeouts

Verifying the fix:

Run the same flow with a version that includes the attached patch. None of
the above results are expected with the fix.

Comment 73 Germano Veit Michel 2019-10-23 22:21:57 UTC
(In reply to Nir Soffer from comment #72)
> How to reproduce:
> 
> 1. Use system with one host and many devices (slows down fc-scan)
> 2. Start several VMs on the host
> 3. Add ISO domain on a temporary NFS server, that you will be able
>    to block during the test. No need to attach any ISO to the VMs,
>    just to have the domain in the DC.
> 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> 5. Block access to the ISO domain. After couple of minutes the system will
>    try to recover the domain by refreshing and invalidating caches.

Just to complement: doing steps 1, 2, and 4 and then adding a new Storage Domain
should also reproduce it. This is the use case of SFDC02251126, which is also
attached to this BZ.

Comment 74 Nir Soffer 2019-10-24 14:04:51 UTC
Merged upstream in:

commit feabd12a224bab87dd2d205c9034be35ee51c636
Author: Nir Soffer <nsoffer>
Date:   Sun Sep 22 14:05:20 2019 +0300

    fc-scan: Speed up device scanning

Comment 75 Nir Soffer 2019-10-25 16:12:57 UTC
Mordechai, please see comment 72 and comment 73 for testing/verifying.

Comment 76 RHV bug bot 2019-11-19 11:53:26 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 77 RHV bug bot 2019-11-19 12:03:28 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 78 Eli Marcus 2019-12-05 18:58:22 UTC
Hi Germano, please review the doc text that I updated: 

"When a system had many FC LUNs with many paths per LUN, and a high I/O load, scanning of FC devices became slow, causing timeouts in monitoring VM disk size, and making VMs non-responsive.
In this release, FC scans have been optimized for speed, and VMs are much less likely to become non-responsive."

Comment 81 RHV bug bot 2019-12-13 13:16:53 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 82 RHV bug bot 2019-12-20 17:46:15 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 83 RHV bug bot 2020-01-08 14:50:33 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 84 RHV bug bot 2020-01-08 15:19:02 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 85 RHV bug bot 2020-01-24 19:52:13 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 86 David Vaanunu 2020-03-18 12:08:55 UTC
Tested on versions:

redhat-release-8.1-3.3
rhv-release-4.4.0-13
vdsm-4.40.0-164


Preparation steps:

Create an additional 300 LUNs ("L0_Group_1" cluster) with 4 paths each.
Create 10 disks (4 x 500 GB, 6 x 200 GB, raw) on a new Storage Domain ("L0_Group_1" cluster).

Define an ISO domain.
Verify the JMeter script works OK (moving disks between storage domains).

[root@f01-h06-000-r620 ~]# multipath -ll | grep -c LUN
304

[root@f01-h06-000-r620 vdsm]# lsblk -s | grep mpath | wc -l
397

[root@f01-h06-000-r620 vdsm]# lsblk -s | grep disk | wc -l
1591



Testing:

50 VMs running on 1 host
JMeter running - moving disks between storage domains (on the same cluster), 10 disks in parallel

On the host, running:
    rescan-scsi-bus.sh
    multipath -ll

Create & Manage Storage Domain (loads all the LUN IDs)

Verify log files:

No "monitor became unresponsive" messages in vdsm.log
No "ERROR" messages in engine.log or vdsm.log

Comment 87 Nir Soffer 2020-03-18 12:55:39 UTC
(In reply to David Vaanunu from comment #86)
> Tested on versions:
> 
> redhat-release-8.1-3.3

We should test on 8.2 now.

...
> Testing:
> 
> 50 VMs running on 1 host
> JMeter running - move disks between storage domains (on the same cluster) -
> 10 disks in parallel
> 
> On the host running:
>     rescan-scsi-bus.sh

Why? This should never be used.

>     multipath -ll

Also not related to this test.

> Create & Manage Storage Domain (Load all the LUNs ID)

Good, this invalidates the caches, which will trigger fc-scan,
which was the slow part.

> Verify Log files:
> 
> No "monitor became unresponsive" messages in vdsm.log

Good, but this is useful only if we had an inaccessible ISO domain.

> No "ERROR" messages in engine.log & vdsm.log

This is not enough.

I described how to test this in comment 72:

> How to reproduce:
>
> 1. Use system with one host and many devices (slows down fc-scan)
> 2. Start several VMs on the host
> 3. Add ISO domain on a temporary NFS server, that you will be able
>    to block during the test. No need to attach any ISO to the VMs,
>    just to have the domain in the DC.
> 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> 5. Block access to the ISO domain. After couple of minutes the system will
>    try to recover the domain by refreshing and invalidating caches.
> 
> Expected result:
> - Some VMs will flip between Up and Non-responsive state
> - Warnings in vdsm log about blocked workers
> - Possible fc-scan timeouts
> 
> Verifying the fix:
>
> Same flow with version including the attached patch. None of the results
> are expected with the fix.

Did you reproduce the issue without the fix? If you did not, how do you
know that your verification was correct? Maybe you did not trigger the 
case leading to the issue.

Did you block access to the ISO domain during the test?

Did you check warnings in vdsm logs about blocked workers?
You need to grep for ' WARN ' in vdsm log.

Please attach the vdsm log from one of the hosts during the tests.
I would like to see the stats from the "Scanning FC devices:" INFO log.

Comment 88 David Vaanunu 2020-03-18 13:50:33 UTC
(In reply to Nir Soffer from comment #87)
> (In reply to David Vaanunu from comment #86)
> > Tested on versions:
> > 
> > redhat-release-8.1-3.3
> 
> We should test on 8.2 now.
> 
> ...
> > Testing:
> > 
> > 50 VMs running on 1 host
> > JMeter running - move disks between storage domains (on the same cluster) -
> > 10 disks in parallel
> > 
> > On the host running:
> >     rescan-scsi-bus.sh
> 
> Why? this should never be used.
> 
> >     multipath -ll
> 
> Also not related to this test.
> 
> > Create & Manage Storage Domain (Load all the LUNs ID)
> 
> Good, this invalidates the caches, which will trigger fc-recan,
> which was the slow part.
> 
> > Verify Log files:
> > 
> > No "monitor became unresponsive" messages in vdsm.log
> 
> Good, but this is useful only if we had inaccessible ISO domain.
> 
> > No "ERROR" messages in engine.log & vdsm.log
> 
> This is not enough.
> 
> I described how test this in comment 72:
> 
> > How to reproduce:
> >
> > 1. Use system with one host and many devices (slows down fc-scan)
> > 2. Start several VMs on the host
> > 3. Add ISO domain on a temporary NFS server, that you will be able
> >    to block during the test. No need to attach any ISO to the VMs,
> >    just to have the domain in the DC.
> > 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> > 5. Block access to the ISO domain. After couple of minutes the system will
> >    try to recover the domain by refreshing and invalidating caches.
> > 
> > Expected result:
> > - Some VMs will flip between Up and Non-responsive state
> > - Warnings in vdsm log about blocked workers
> > - Possible fc-scan timeouts
> > 
> > Verifying the fix:
> >
> > Same flow with version including the attached patch. None of the results
> > are expected with the fix.

Try to reproduce it according to comment #73 (Without ISO domain)


> 
> Did you reproduce the issue without the fix? If you did not, how do you
> know that your verification was correct? Maybe you did not trigger the 
> case leading to the issue.

Regarding the vdsm version - it should be in 4.4 (per your mail)

> 
> Did you block access to the ISO domain during the test?
> 
> Did you check warnings in vdsm logs about blocked workers?
> You need to grep for ' WARN ' in vdsm log.
> 
> Please attach vdsm log from one of the hosts during the tests.
> I would like to see the stats from "Scanning FC devices:" INFO log.


vdsm log files (test run 04:00-05:00 EDT):

https://drive.google.com/open?id=1WDtWUVFOZjF-_rjcLLnMEvQ6AW1XaFP5

Comment 92 errata-xmlrpc 2020-08-04 13:26:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246