+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1598266 +++

======================================================================

Description of problem:

Several VMs become unresponsive at the same time, regularly.

2018-07-04 08:27:11,986+0200 vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60' age=62.8300000001
2018-07-04 08:27:11,988+0200 vmId='def11d1f-9cb0-47be-9c45-7b961d275c40' age=62.8300000001
2018-07-04 08:27:11,990+0200 vmId='d3ffccf2-7743-463c-bd03-99cff6106ff0' age=62.8399999999
2018-07-04 08:27:11,992+0200 vmId='c8f91bed-1e71-4408-a76b-2bf02d55aa48' age=62.8399999999
2018-07-04 08:27:11,993+0200 vmId='01b4911b-d588-435e-a86a-a5a476da6362' age=62.8399999999
2018-07-04 08:27:11,995+0200 vmId='bd2dd808-af9a-40e9-b997-316f05d06bb9' age=62.8399999999
2018-07-04 08:27:11,997+0200 vmId='d7982c76-8d6b-496a-85cb-548478816e54' age=62.8399999999
2018-07-04 08:27:11,998+0200 vmId='ed10e819-cb6b-4f36-bedf-41fa0c89cab4' age=62.8399999999
2018-07-04 08:27:12,000+0200 vmId='247144df-6844-4f35-a728-80f25ad72971' age=62.8499999996
2018-07-04 08:27:12,001+0200 vmId='331d4145-d02c-4bfa-a136-1018839162be' age=62.8499999996
2018-07-04 08:27:12,003+0200 vmId='e7945585-5304-47b8-a533-89350ddac7b7' age=62.8499999996
2018-07-04 08:27:12,006+0200 vmId='73d41cb9-6564-44ca-bf94-a5aeb7c76a1c' age=62.8499999996
2018-07-04 08:27:12,007+0200 vmId='52589290-0033-4a97-b393-3650d7af7ef2' age=62.8499999996
2018-07-04 08:27:12,009+0200 vmId='8e7a575b-3f7d-4453-be96-190d0226d077' age=62.8599999994
2018-07-04 08:27:12,010+0200 vmId='61da1520-21b2-439d-8ae6-38c1ce9127df' age=62.8599999994
2018-07-04 08:27:12,012+0200 vmId='7701dfda-a3c6-456f-bd20-8db0f6f2373a' age=62.8599999994
2018-07-04 08:27:12,013+0200 vmId='89012d87-e8c3-466f-83a1-d673d1a7537b' age=62.8599999994
2018-07-04 08:27:12,015+0200 vmId='1ded83c2-4879-42d9-9bcd-ced3f14bcf03' age=62.8599999994
2018-07-04 08:27:12,017+0200 vmId='f6ff4904-5a50-40fd-bbae-e2ad8d6de1b7' age=62.8599999994
2018-07-04 08:27:12,018+0200 vmId='edd902e4-0e98-400e-9879-a320080bd148' age=62.8599999994
2018-07-04 08:27:12,020+0200 vmId='bdef5563-53d6-473b-968b-b1b279176682' age=62.870000001
2018-07-04 08:27:12,021+0200 vmId='4529e6fa-70a2-4233-abe5-4718090117a3' age=62.870000001
2018-07-04 08:27:12,023+0200 vmId='9174301d-fa75-4ad7-84a2-91b3b22db987' age=62.870000001
2018-07-04 08:27:12,025+0200 vmId='17f21001-8dfa-4786-89e6-27e0b157be9f' age=62.870000001
2018-07-04 08:27:12,026+0200 vmId='1247742f-7726-458f-8294-fbaa33ffce77' age=62.870000001
2018-07-04 08:27:12,028+0200 vmId='4c7d3aff-dad1-4108-883d-069ba0a4a4fb' age=62.870000001
2018-07-04 08:27:12,029+0200 vmId='7ac88b8f-db4c-4456-a1a4-310ff0bf5b39' age=62.8800000008
2018-07-04 08:27:12,031+0200 vmId='abacdc84-c107-42ef-abf6-13803e118a40' age=62.8800000008
2018-07-04 08:27:12,032+0200 vmId='3a42f097-4a4f-4c91-bd32-85e0d48bedf0' age=62.8800000008
2018-07-04 08:27:12,034+0200 vmId='60b2d3f0-9b38-4d6b-93f9-fe0d97857582' age=62.8800000008
2018-07-04 08:27:12,035+0200 vmId='1e693cd3-4cae-4a60-b914-2daaa0f40791' age=62.8800000008
2018-07-04 08:27:12,037+0200 vmId='bc822a5a-631d-4f1b-ac18-c760baaf657e' age=62.8800000008
2018-07-04 08:27:12,038+0200 vmId='bdfbc05b-5b8c-4be5-aedd-78e987c21d00' age=62.8800000008
2018-07-04 08:27:12,039+0200 vmId='e46492df-4da3-45a0-b49f-82245e0173a5' age=62.8900000006
2018-07-04 08:27:12,041+0200 vmId='288908d5-6379-466b-88a0-2c4cca2d3900' age=62.8900000006
2018-07-04 08:27:12,042+0200 vmId='87441f46-d891-44c6-8483-4516bfeac757' age=62.8900000006

Let's pick the first one,
86dfa0fc-a4ed-4362-83a4-5bec1ca98d60

These are its disks:

SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG ab3c2cda-c54b-497a-b7a7-342e658324db
SD 7b0d671a-fc1a-4f33-8da6-29a76d227ec8 IMG caa85e31-02c9-40cb-84ba-0693b1182314
DLUN 360001440000000107071b5b4d8ebc59f

When the mass unresponsiveness occurs, we can see that getVolumeSize(), called from UpdateVolumes(), does not run on time (every 1 min). See:

2018-07-04 08:21:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:21:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:22:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:22:10,045+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:23:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:23:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:24:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:24:10,477+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:25:10,018+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:25:10,021+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:26:10,013+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:26:10,016+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:27:31,694+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:27:31,717+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',
2018-07-04 08:28:10,019+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'ab3c2cda-c54b-497a-b7a7-342e658324db',
2018-07-04 08:28:10,022+0200 getVolumeSize(sdUUID=u'7b0d671a-fc1a-4f33-8da6-29a76d227ec8', imgUUID=u'caa85e31-02c9-40cb-84ba-0693b1182314',

When this happens, we usually see UpdateVolumes() stuck in the executor.
2018-07-04 08:26:40,030+0200 DEBUG (vdsm.Scheduler) [Executor] executor state: count=5 workers=set([<Worker name=periodic/4796 waiting task#=0 at 0x7f7480b85990>, <Worker name=periodic/4792 running <Task discardable <UpdateVolumes vm=61da1520-21b2-439d-8ae6-38c1ce9127df at 0x7f7468140650> timeout=30.0, duration=29 at 0x1299790> task#=34 at 0x28509d0>, <Worker name=periodic/4794 running <Task discardable <UpdateVolumes vm=bdef5563-53d6-473b-968b-b1b279176682 at 0x7f7400106450> timeout=30.0, duration=29 at 0x7f7400106090> task#=46 at 0x7f746c2e6e50>, <Worker name=periodic/4795 running <Task discardable <UpdateVolumes vm=01b4911b-d588-435e-a86a-a5a476da6362 at 0x7f746c4040d0> timeout=30.0, duration=30 at 0x22720d0> discarded task#=40 at 0x2381750>, <Worker name=periodic/4793 running <Task discardable <UpdateVolumes vm=e7945585-5304-47b8-a533-89350ddac7b7 at 0x7f7440ceeb50> timeout=30.0, duration=29 at 0x7f7468140b90> task#=49 at 0x7f7440524bd0>]) (executor:205)

This means some of the getVolumeSize() calls at 08:26:10 did not complete and are probably hanging everything, including VM monitoring, and leading to the delayed getVolumeSize() at 08:27:31 (which should have happened at 08:27:10?).

Let's check VM e7945585-5304-47b8-a533-89350ddac7b7 from above. These are its disks:

SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 9b3af596-d8e8-4e89-aa2c-cbefd2a8908a
SD eff51393-6316-4f17-a4dc-83ad85a0b765 IMG 6328070d-649d-4c57-8db0-0220c9cb4470

1. getVolumeSize for the first disk (9b3af596) starts:

2018-07-04 08:26:10,069+0200 INFO (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'9b3af596-d8e8-4e89-aa2c-cbefd2a8908a', volUUID=u'8caa2d9e-71d7-4533-90f2-ba9ab955afce', options=None) from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:46)

2. It is discarded at 08:27:40 (as per logs above), but still goes ahead and takes 50 seconds to complete:

2018-07-04 08:27:00,667+0200 INFO (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '53687091200', 'apparentsize': '53687091200'} from=internal, task_id=26bf9721-b856-4032-86bd-3f1de2c88190 (api:52)

3. Then it proceeds to the second disk, which completes immediately. We are 50 seconds late:

2018-07-04 08:27:00,669+0200 INFO (periodic/4793) [vdsm.api] START getVolumeSize(sdUUID=u'eff51393-6316-4f17-a4dc-83ad85a0b765', spUUID=u'00000002-0002-0002-0002-0000000003b7', imgUUID=u'6328070d-649d-4c57-8db0-0220c9cb4470', volUUID=u'd04beb3a-db86-4dc1-af24-7139a0095942', options=None) from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:46)
2018-07-04 08:27:00,670+0200 INFO (periodic/4793) [vdsm.api] FINISH getVolumeSize return={'truesize': '826781204480', 'apparentsize': '826781204480'} from=internal, task_id=7abb24b2-1550-4b8a-b66a-499742ed857b (api:52)

I'm not entirely sure, but I think these delays end up causing the discarded UpdateVolumes tasks and, ultimately, the monitors going unresponsive, since VM monitoring is starved in the executor by these long-running getVolumeSize tasks.

It seems these stuck getVolumeSize() calls on storage domain eff51393-6316-4f17-a4dc-83ad85a0b765 are blocked by the domain refresh shown below, which finishes at 08:27:00, right after which all the stuck getVolumeSize() calls return.
2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.Monitor] Refreshing domain eff51393-6316-4f17-a4dc-83ad85a0b765 (monitor:487)
2018-07-04 08:26:04,526+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (storage.sdc.refreshStorage) (misc:467)
2018-07-04 08:26:26,071+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.iscsi.rescan) (misc:467)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,072+0200 DEBUG (monitor/eff5139) [storage.ISCSI] Performing SCSI scan, this will take up to 30 seconds (iscsi:457)
2018-07-04 08:26:26,073+0200 DEBUG (monitor/eff5139) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-119 /usr/bin/sudo -n /sbin/iscsiadm -m session -R (cwd None) (commands:69)
2018-07-04 08:26:26,446+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.rescan) (misc:467)
2018-07-04 08:26:26,447+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Got in to sampling method (misc:470)
2018-07-04 08:26:26,448+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:26:49,985+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
2018-07-04 08:26:49,986+0200 DEBUG (monitor/eff5139) [root] /usr/bin/taskset --cpu-list 0-119 /sbin/udevadm settle --timeout=5 (cwd None) (commands:69)
2018-07-04 08:26:50,048+0200 DEBUG (monitor/eff5139) [root] SUCCESS: <err> = ''; <rc> = 0 (commands:93)
2018-07-04 08:26:50,616+0200 DEBUG (monitor/eff5139) [storage.SamplingMethod] Returning last result (misc:477)
<lvm commands take around 2s each>
2018-07-04 08:27:00,636+0200 INFO (monitor/eff5139) [storage.StorageDomain] Resource namespace 01_img_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:758)
2018-07-04 08:27:00,636+0200 INFO (monitor/eff5139) [storage.StorageDomain] Resource namespace 02_vol_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (sd:767)
2018-07-04 08:27:00,636+0200 INFO (monitor/eff5139) [storage.StorageDomain] Resource namespace 03_lvm_eff51393-6316-4f17-a4dc-83ad85a0b765 already registered (blockSD:989)

So most of the time is spent on the HBA scan and on the first "trying to enter sampling method".

Interestingly enough, most of the VMs that get stuck on UpdateVolumes have Direct LUNs (20 out of 21). The one I used as an example above is the only one without a Direct LUN. So maybe this is related: https://bugzilla.redhat.com/show_bug.cgi?id=1517535

Should this slower storage refreshing really trigger unresponsive VMs? The VMs are running fine (confirmed by reading the monotonic clock from inside).

Version-Release number of selected component (if applicable):
vdsm-4.19.50-1.el7ev.x86_64

How reproducible:
Unknown

(Originally by Germano Veit Michel)
Maor, you are this week's QE contact, please scrub (Originally by Tal Nisan)
I saw the following error in the VDSM logs:

2018-07-04 08:05:11,680+0200 WARN (jsonrpc/0) [virt.vm] (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
2018-07-04 08:05:11,682+0200 WARN (jsonrpc/0) [virt.vm] (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
....

Can you please add some more indicators, such as CPU usage, load, sar and iotop output from the host while the VMs are unresponsive?

Can you also add libvirt, qemu and engine logs?

Thanks

(Originally by mlipchuk)
Hi Maor,

Thanks for taking a look.

(In reply to Maor from comment #4)
> I saw the following error in the VDSM logs:
>
> 2018-07-04 08:05:11,680+0200 WARN (jsonrpc/0) [virt.vm]
> (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> 2018-07-04 08:05:11,682+0200 WARN (jsonrpc/0) [virt.vm]
> (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> ....

Yes, as in comment #0.

> Can you please add some more indicators, such as cpu usage, load, sar, iotop
> of the host while the VMs get unresponsive.

Sure. There are plenty of resources available; there is no overcommit. It seems the delays on entering the sampling method happen in several places, and fc-scan is always slow (there is a large number of LUNs presented). The lockup looks like it is in VDSM to me. All VMs are running just fine, and there is no detected performance issue anywhere else.

> Can you also add libvirt, qemu and engine logs

RHV doesn't run with libvirt or qemu logging by default, so I'll add the engine log.

> It might also be related to this bug https://bugzilla.redhat.com/1419856
> which libvirt version is the customer using?

libvirt-daemon-3.2.0-14.el7_4.9.x86_64

Please let me know if you need anything else.

(Originally by Germano Veit Michel)
Similar case: UpdateVolumes for each VM is stuck because getVolumeSize is delayed/not running/blocked. In this case it happens when adding a new SD.

vdsm-4.19.45-1.el7ev.x86_64

(Originally by Germano Veit Michel)
From the data of comment #17:

2018-11-08 11:15:19,064+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:16:19,034+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:17:19,048+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:19:20,278+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 2m01s delayed
2018-11-08 11:19:24,344+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 5s delayed
2018-11-08 11:20:19,040+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:21:19,076+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:22:19,046+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',

+

2018-11-08 11:19:17,412+0100 WARN (jsonrpc/6) [virt.vm] (vmId='a4ab6989-274b-4d50-9bec-1680e45442a1') monitor became unresponsive (command timeout, age=65.4700000007) (vm:5132)

So it looks like UpdateVolumes was stuck from ~11:18 to 11:19:20, and the monitor goes unresponsive because getVolumeSize for the VM's volumes is stuck?

Tons of getVolumeSize calls were delayed/blocked while the user was adding a new SD to the Pool.

Is there any improvement in this area in vdsm-4.20?

(Originally by Germano Veit Michel)
I'm positive we landed a few fixes in 4.2 to reduce the number of rescans we perform, after gchen found out we repeat that about 4 times for every path.

Germano, how many paths are configured there?

Meanwhile I'll try to dig up the relevant BZ.

(Originally by Roy Golan)
(In reply to Roy Golan from comment #19) > Germano, how many paths are configured there? This one has 8 paths ;) Indeed should be better in 4.2 with https://gerrit.ovirt.org/#/c/81681/ Still, can't this be improved to not pile up GetVolumeSize() while the SD is refreshing/producing? It seems all GetVolumeSize() are waiting for the refresh, if one has enough VMs it does mean a ton of blocked UpdateVolumes tasks. (Originally by Germano Veit Michel)
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1. (Originally by Sandro Bonazzola)
I think this issue should be solved by moving volume size monitoring to the storage domain monitors.

When an image is prepared, we can register the leaf volume with the storage domain monitor. Then the storage domain monitor can check the volume size every N seconds and cache the value. When the virt API needs the value, it will get the cached value from repoStats. When an image is torn down, the storage monitor will stop monitoring the leaf volume.

When taking a snapshot, replacing the active leaf volume, the new leaf volume should be registered with the storage domain monitor, replacing the old leaf volume (which will never change again).

If storage is not responsive, the relevant volume sizes will be stale, but nothing in the system will block.

The UpdateVolumes periodic task will be removed.

(Originally by Nir Soffer)
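To make the proposal above concrete, here is a minimal sketch of such a per-domain volume size cache. It is illustrative only: the class and method names (VolumeSizeMonitor, register, last_size) are hypothetical and not actual vdsm API.

```python
import threading


class VolumeSizeMonitor:
    """Illustrative sketch: polls registered leaf volumes periodically and
    caches the last known size, so virt callers never block on storage."""

    def __init__(self, get_size):
        self._get_size = get_size   # callable(img_id, vol_id) -> int
        self._sizes = {}            # (img_id, vol_id) -> last known size
        self._lock = threading.Lock()

    def register(self, img_id, vol_id):
        with self._lock:
            self._sizes.setdefault((img_id, vol_id), None)

    def unregister(self, img_id, vol_id):
        with self._lock:
            self._sizes.pop((img_id, vol_id), None)

    def last_size(self, img_id, vol_id):
        # May return a stale value if storage is slow; never blocks.
        with self._lock:
            return self._sizes.get((img_id, vol_id))

    def refresh(self):
        # Called by the domain monitor loop every N seconds.
        with self._lock:
            keys = list(self._sizes)
        for key in keys:
            try:
                size = self._get_size(*key)
            except Exception:
                continue            # keep the stale value on errors
            with self._lock:
                if key in self._sizes:
                    self._sizes[key] = size
```

The design point is that the slow path (querying storage) runs only inside the domain monitor, while consumers read a cached value that can go stale but never blocks them.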
sync2jira (Originally by Daniel Gur)
Normally getting the volume size is quick. We need to find the root cause making these calls slow.

Updating the volume size calls this every 60 seconds:

3842     def updateDriveVolume(self, vmDrive):
3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
3844             return
3845
3846         volSize = self._getVolumeSize(
3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
3848             vmDrive.volumeID)
3849
3850         vmDrive.truesize = volSize.truesize
3851         vmDrive.apparentsize = volSize.apparentsize

getVolumeSize() is:

3067     @public
3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
...
3086         dom = sdCache.produce(sdUUID=sdUUID)
3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
3089         return dict(apparentsize=apparentsize, truesize=truesize)

On block storage getVSize and getVAllocSize are the same, so we have 2 calls to:

421     def getVSize(self, imgUUUID, volUUID):
422         """ Return the block volume size in bytes. """
423         try:
424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
425         except IOError as e:
426             if e.errno != errno.ENOENT:
427                 raise
428             # Inactive volume has no /dev entry. Fallback to lvm way.
429             size = lvm.getLV(self.sdUUID, volUUID).size
430
431         return int(size)

Because the LVs are active (this is called for running VMs), we know we don't use lvm to get the size but call fsutils.size(), which is:

26 def size(filename):
27     """
28     Return actual file size, should work with both file and block device.
29     """
30     with io.open(filename, "rb") as f:
31         f.seek(0, os.SEEK_END)
32         return f.tell()

This should be fast, but maybe in these cases it is not.

An easy optimization would be to call getVSize() only once on block storage instead of twice. In the case where seeking to the end of the device is slow, this may improve the situation.

But we really want to understand why this operation is slow on certain systems.

Germano, can you share all the WARN lines from the vdsm log, and sanlock.log?

(Originally by Nir Soffer)
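A rough sketch of that single-call optimization, built only from the excerpt above. The isBlockDomain() helper is hypothetical and stands in for whatever block/file check the real code would use; this is not the actual vdsm implementation.

```python
@public
def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
    dom = sdCache.produce(sdUUID=sdUUID)
    apparentsize = dom.getVSize(imgUUID, volUUID)
    # On block storage getVSize() and getVAllocSize() return the same LV
    # size, so skip the second seek to the end of the device.
    # isBlockDomain() is a hypothetical helper used here for illustration.
    if dom.isBlockDomain():
        truesize = apparentsize
    else:
        truesize = dom.getVAllocSize(imgUUID, volUUID)
    return dict(apparentsize=str(apparentsize), truesize=str(truesize))
```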
Hi Nir,

Thanks for taking a look at this.

(In reply to Nir Soffer from comment #29)
> Normally getting volume size is quick. We need to find the root cause causing
> these calls to be slow.
>
> Updating volume size calls this every 60 seconds:
>
> 3842     def updateDriveVolume(self, vmDrive):
> 3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
> 3844             return
> 3845
> 3846         volSize = self._getVolumeSize(
> 3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
> 3848             vmDrive.volumeID)
> 3849
> 3850         vmDrive.truesize = volSize.truesize
> 3851         vmDrive.apparentsize = volSize.apparentsize
>
> getVolumeSize() is:
>
> 3067     @public
> 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
> ...
> 3086         dom = sdCache.produce(sdUUID=sdUUID)
> 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> 3089         return dict(apparentsize=apparentsize, truesize=truesize)

I might be wrong, but I think it's sdCache.produce above that slows down monitoring. getVolumeSize is stuck on the executor until a storage refresh completes. And on larger environments (many LUNs), the HBA rescan can take longer too, enough to make VM monitoring take more time. I think having a few paths down also slows down the HBA rescan.

Once the storage refresh completes, all stuck getVolumeSize() calls complete immediately after.

> On block storage getVSize and getVAllocSize are the same, so we have
> 2 calls to:
>
> 421     def getVSize(self, imgUUUID, volUUID):
> 422         """ Return the block volume size in bytes. """
> 423         try:
> 424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
> 425         except IOError as e:
> 426             if e.errno != errno.ENOENT:
> 427                 raise
> 428             # Inactive volume has no /dev entry. Fallback to lvm way.
> 429             size = lvm.getLV(self.sdUUID, volUUID).size
> 430
> 431         return int(size)
>
> Because the LVs are active (called for running VMs), we know we don't use
> lvm to get the size but call fsutils.size(), which is:
>
> 26 def size(filename):
> 27     """
> 28     Return actual file size, should work with both file and block device.
> 29     """
> 30     with io.open(filename, "rb") as f:
> 31         f.seek(0, os.SEEK_END)
> 32         return f.tell()
>
> Which should be fast, but maybe in these cases is not.

Yes, these should be fast. If this was slow, then the problem would not be in VDSM I believe, and probably the VMs would be suffering too.

> An easy optimization would be to call getVSize() only once on block
> storage instead of twice. In the case when seeking to the end of the
> device is slow, this may improve the situation.
>
> But we really want to understand why this operation is slow on
> certain systems.
>
> Germano, can you share all WARN logs from the vdsm log, and sanlock.log?

To open this bug I collected some debug logs from the initial customer; aren't the ones attached initially good enough? They are more informative than the latest ones without debug, but I can get more if you wish.

Occasionally there are some HBA rescan timeouts too. From what I remember from the calls, in these cases sanlock is fine, and so is IO to storage. The common scenario is they all have lots of SDs and LUNs.

I don't have debug logs from the latest 4.3 on this. Let me know if you need them and I'll collect a new round, along with sanlock (which was fine when I checked).

(Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #30)
...
> > getVolumeSize() is:
> >
> > 3067     @public
> > 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
> > ...
> > 3086         dom = sdCache.produce(sdUUID=sdUUID)
> > 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> > 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> > 3089         return dict(apparentsize=apparentsize, truesize=truesize)
>
> I might be wrong, but I think it's sdCache.produce above that slows down
> monitoring.
> getVolumeSize is stuck on the executor until a storage refresh completes.
> And on larger environments (many LUNs), HBA rescan can take longer time too,
> enough to make VM monitoring take more time.

Right, storage refreshes are a known issue that may block any flow in vdsm accessing a storage domain method, even if the storage domain is not related (e.g. accessing an ISO domain during a storage refresh blocked by a SCSI scan).

But we don't expect a lot of storage refreshes. We need to understand why we have many refreshes, and why they are slow.

> I think having a few paths down also slows down HBA rescan.
>
> Once the storage refresh completes, all stuck getVolumeSize() complete
> immediately after.

This makes sense.

> > Germano, can you share all WARN logs from the vdsm log, and sanlock.log?
>
> To open this bug I collected some debug logs from the initial customer,
> aren't the ones attached initially good enough? They are more informative
> than the latest ones without debug, but I can get more if you wish.

We need to look at them again; hopefully we will have enough info.

> I don't have debug logs from latest 4.3 on this. Let me know if you need them
> then I'll collect a new round, along with sanlock (which was fine when I
> checked).

If sanlock did not have any delays, then this is most likely an issue with storage refreshes blocking other operations.

It would help to time a manual verbose fc-scan:

    time /usr/libexec/vdsm/fc-scan -v

on a host that shows this issue, and share the output of the command. The log records the time it took for every host/device scan.

Since we support device resize detection, this tool now scans not only the scsi hosts (/sys/class/scsi_host/%s/scan) but also all the scsi devices (/sys/class/scsi_device/*). This was added as part of bug 609689 in https://gerrit.ovirt.org/c/43388/.

Since vdsm does not manage the devices, and can miss a request to resize devices when a host is offline or temporarily disconnected from engine, vdsm rescans all fc_hosts and all devices connected to these hosts whenever we refresh storage.

Looking at the refresh code in hsm.py:

 78     @misc.samplingmethod
 79     def refreshStorage(self, resize=True):
 80         self.__staleStatus = self.STORAGE_REFRESHING
 81
 82         multipath.rescan()
 83         if resize:
 84             multipath.resize_devices()
 85         lvm.invalidateCache()

We don't need to scan devices if resize is False. Checking all the places we call refreshStorage:

2796     def getStorageDomainsList(
2797             self, spUUID=None, domainClass=None, storageType=None,
2798             remotePath=None, options=None):
...
2814         sdCache.refreshStorage()

Looks like we can refresh with resize=False here.

2872     def getVGList(self, storageType=None, options=None):
...
2882         sdCache.refreshStorage()

Same, resize is not needed.

411 def storageRefresh():
412     sdCache.refreshStorage()

This is done during startup - since we don't know if a device was resized on other hosts while vdsm was down, we really need to use resize=True here.
1993     def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
1994         sdCache.refreshStorage()

Here we need to resize, since the user expects to see the real device size after this call.

2406     def connectStorageServer(self, domType, spUUID, conList, options=None):
...
2445         # In case there were changes in devices size
2446         # while the VDSM was not connected, we need to
2447         # call refreshStorage.
2448         if domType in (sd.FCP_DOMAIN, sd.ISCSI_DOMAIN):
2449             sdCache.refreshStorage()

Here we must resize, but rescanning devices is not needed if we just connected to storage.

2495     def disconnectStorageServer(self, domType, spUUID, conList, options=None):
...
2530         # Disconnecting a device may change the visible storage domain list
2531         # so invalidate the caches
2532         sdCache.refreshStorage(resize=False)

Here we don't do any resize, so we don't need to scan anything.

Refreshes in sdc.py:

127         try:
128             # If multiple calls reach this point and the storage is not
129             # updated the refreshStorage() sampling method is called
130             # serializing (and eventually grouping) the requests.
131             if self.__staleStatus != self.STORAGE_UPDATED:
132                 self.refreshStorage()

We don't call multipath.resize here, so there is no need to rescan devices.

Looks like we should:

- Add an argument to fc-scan to disable device rescan
- Add a rescan_devices=False argument to refreshStorage
- Call refreshStorage(rescan_devices=True) only in the cases that must rescan devices and resize them

(Originally by Nir Soffer)
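As a rough illustration of the refreshStorage() change proposed in that list - not the actual patch - here is one way it could look. The rescan_devices argument and the multipath.rescan(rescan_devices=...) signature are assumptions made only for this sketch:

```python
@misc.samplingmethod
def refreshStorage(self, resize=True, rescan_devices=True):
    # rescan_devices is a hypothetical new argument: callers that only need
    # to discover new LUNs can skip the slow per-device rescan, which is
    # only required for detecting resized LUNs.
    self.__staleStatus = self.STORAGE_REFRESHING

    multipath.rescan(rescan_devices=rescan_devices)
    if resize:
        multipath.resize_devices()
    lvm.invalidateCache()
```

Under this sketch, callers like getStorageDomainsList() and disconnectStorageServer() would pass resize=False, rescan_devices=False, while getDeviceList() and startup would keep the full scan.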
Germano, I posted a quick untested patch for fc-scan to allow disabling device scanning.

It would help if you could share the output of these commands on a system showing this issue:

time /usr/libexec/vdsm/fc-scan -v

time /usr/libexec/vdsm/fc-scan -v --no-devices

(Originally by Nir Soffer)
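The patch itself is not shown in this bug. Purely as an illustration of the idea, a --no-devices option could be wired up with argparse roughly like this (option and variable names are assumed, not taken from the real fc-scan patch):

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Rescan FC hosts and, optionally, their SCSI devices")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="log the time spent in each scan")
    # Hypothetical flag: skip the per-device rescan, which is only needed
    # to detect resized LUNs; scanning the fc_hosts alone is enough to
    # discover newly mapped LUNs.
    parser.add_argument("--no-devices", dest="scan_devices",
                        action="store_false", default=True,
                        help="do not rescan existing SCSI devices")
    return parser.parse_args()
```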
(In reply to Nir Soffer from comment #31)
> Looks like we should:
>
> - Add argument to fc-scan to disable device rescan
> - Add rescan_devices=False argument to refreshStorage
> - Call refreshStorage(rescan_devices=True) only in the cases that must
>   rescan devices and resize them.

Makes a lot of sense to me, should save some time.

(In reply to Nir Soffer from comment #32)
> Germano, I posted a quick untested patch for fc-scan to allow disabling
> device scanning.
>
> It would help if you could share the output of these commands on a system
> showing this issue.
>
> time /usr/libexec/vdsm/fc-scan -v
>
> time /usr/libexec/vdsm/fc-scan -v --no-devices

I've requested this from the customer that was facing these issues, along with a fresh sosreport after fixing other problems that could also have been contributing to the slowdowns. Will let you know once we have results.

Thanks!

(Originally by Germano Veit Michel)
Looking in vdsm.log from attachment 1456628 [details]:

## FC scans

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | wc -l
34

$ head -1 vdsm.log.3
2018-07-04 08:01:06,269+0200 ...

$ tail -1 vdsm.log.3
2018-07-04 09:01:01,772+0200 ...

34 scans in 1 hour - this does not make sense. We don't expect so many storage refreshes.

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep monitor/ | wc -l
33

$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep jsonrpc/ | wc -l
1

One scan was triggered by an API call, all the others by storage domain monitors.

Looking at the fc-scan calls, they take about 25 seconds, and they are performed in groups - back to back:

1. 2 scans

2018-07-04 08:36:41,861+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:05,450+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:37:06,137+0200 DEBUG (monitor/62b40d0) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:29,223+0200 DEBUG (monitor/62b40d0) [storage.HBA] Scan finished (hba:64)

2. After a minute, 3 scans

2018-07-04 08:41:23,481+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:41:45,052+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:41:45,592+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:09,433+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:42:10,136+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:33,887+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)

3. After 3:30 minutes, 4 scans

2018-07-04 08:46:00,457+0200 DEBUG (monitor/632f710) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:22,423+0200 DEBUG (monitor/632f710) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:23,010+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:46,789+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:47,317+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:10,096+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:47:10,774+0200 DEBUG (monitor/c9faf72) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:34,787+0200 DEBUG (monitor/c9faf72) [storage.HBA] Scan finished (hba:64)

## Warnings

$ grep WARN vdsm.log.3 | wc -l
979

$ egrep 'WARN.+is blocked for' vdsm.log.3 | head -10
2018-07-04 08:01:11,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 320.00 seconds (check:279)
2018-07-04 08:01:21,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 330.00 seconds (check:279)
2018-07-04 08:01:31,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 340.00 seconds (check:279)
2018-07-04 08:01:41,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 350.00 seconds (check:279)
2018-07-04 08:01:51,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 360.00 seconds (check:279)
2018-07-04 08:02:01,282+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 370.00 seconds (check:279)
2018-07-04 08:02:11,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 380.00 seconds (check:279)
2018-07-04 08:02:21,287+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 390.00 seconds (check:279)
2018-07-04 08:02:31,285+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 400.00 seconds (check:279)
2018-07-04 08:02:41,287+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 410.00 seconds (check:279)

There are serious issues with the NFS domain dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22; it was blocked for 840 seconds:

2018-07-04 08:55:31,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 840.00 seconds (check:279)

$ egrep 'WARN.+is blocked for' vdsm.log.3 | grep -v dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22

No other domain had issues.

## Storage cache invalidations

I think the issue is the storage domain cache. Every access to a storage domain looks up an internal cache. If the domain is in the cache there is no delay and the code continues. But if the domain is not in the cache, because the cache was invalidated, the caller tries to refresh storage. This operation does:

- scsi rescan
- fc rescan
- resize multipath devices

Because of the way the cache is built, all of this is typically done twice when multiple storage domains try to enter the domain cache.

In our case, an fc scan takes about 25 seconds. So code accessing a storage domain may typically be blocked for at least 25-50 seconds on the cache, only because of slow fc-scans.

Why would a domain not be in the cache? Every 300 seconds, the storage domain monitors remove the domain from the cache so that the next access to the domain will recreate the domain object and refresh the domain metadata.

The cache is invalidated in:

- connectStorageServer()
- StoragePool.connect(), called from connectStoragePool()
- StoragePool.attachSD(), called from attachStorageDomain()
- StoragePool.refresh(), called from connectStoragePool()

Looking at the relevant calls:

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:04:05,141+0200
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:23:40,594+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | wc -l
15

So we had 15 cache invalidations in one hour.
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200

$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}' | wc -l
13

Almost all the calls are connectStorageServer. I guess this is engine trying to recover the inaccessible NFS domain by reconnecting to the server.

Let's correlate the cache invalidations with the rescans:

$ egrep '\[storage.HBA\] Starting scan' vdsm.log.3 | awk '{print $1,$2,"rescan"}' > rescan.log
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2,"invalidate cache"}' > invalidate.log
$ cat invalidate.log rescan.log | sort
2018-07-04 08:05:00,203+0200 invalidate cache
2018-07-04 08:10:00,235+0200 invalidate cache
2018-07-04 08:15:00,434+0200 invalidate cache
2018-07-04 08:16:18,808+0200 rescan
2018-07-04 08:16:41,230+0200 rescan
2018-07-04 08:17:05,614+0200 rescan
2018-07-04 08:20:00,284+0200 invalidate cache
2018-07-04 08:21:24,552+0200 rescan
2018-07-04 08:21:46,828+0200 rescan
2018-07-04 08:22:10,778+0200 rescan
2018-07-04 08:22:44,676+0200 invalidate cache
2018-07-04 08:23:40,685+0200 rescan
2018-07-04 08:24:03,157+0200 rescan
2018-07-04 08:24:27,217+0200 rescan
2018-07-04 08:24:44,517+0200 invalidate cache
2018-07-04 08:26:03,833+0200 rescan
2018-07-04 08:26:26,448+0200 rescan
2018-07-04 08:26:50,643+0200 rescan
2018-07-04 08:30:00,290+0200 invalidate cache
2018-07-04 08:31:09,170+0200 rescan
2018-07-04 08:31:31,503+0200 rescan
2018-07-04 08:31:55,619+0200 rescan
2018-07-04 08:32:21,896+0200 rescan
2018-07-04 08:35:00,166+0200 invalidate cache
2018-07-04 08:36:16,809+0200 rescan
2018-07-04 08:36:41,861+0200 rescan
2018-07-04 08:37:06,137+0200 rescan
2018-07-04 08:40:00,148+0200 invalidate cache
2018-07-04 08:41:23,481+0200 rescan
2018-07-04 08:41:45,592+0200 rescan
2018-07-04 08:42:10,136+0200 rescan
2018-07-04 08:45:00,261+0200 invalidate cache
2018-07-04 08:46:00,457+0200 rescan
2018-07-04 08:46:23,010+0200 rescan
2018-07-04 08:46:47,317+0200 rescan
2018-07-04 08:47:10,774+0200 rescan
2018-07-04 08:50:00,183+0200 invalidate cache
2018-07-04 08:51:08,685+0200 rescan
2018-07-04 08:51:30,834+0200 rescan
2018-07-04 08:51:54,451+0200 rescan
2018-07-04 08:52:17,360+0200 rescan
2018-07-04 08:55:00,205+0200 invalidate cache
2018-07-04 08:56:16,675+0200 rescan
2018-07-04 08:56:39,366+0200 rescan
2018-07-04 08:57:03,588+0200 rescan
2018-07-04 08:57:26,343+0200 rescan
2018-07-04 09:00:00,232+0200 invalidate cache

The picture is pretty clear now:

1. The NFS storage domain is not accessible.
2. Engine tries to recover the domain by calling connectStorageServer every ~5 minutes.
3. connectStorageServer invalidates the storage domain cache.
4. The next access to any domain triggers 2-5 fc-scans, each taking 25 seconds.
5. Calls to storage domain methods are blocked for at least 25-50 seconds during the scans.
6. The virt executor gets stuck with blocked getVolumeSize calls.
7. VMs become non-responsive.

Germano, did you see the same picture in other cases?

(Originally by Nir Soffer)
(In reply to Nir Soffer from comment #34) > The picture is pretty clear now: > > 1. NFS storage domain is not accessible > 2. Engine try to recover the domain by calling connectStorageServer every ~5 > minutes > 3. connectStorgeServer invalidate storage domain cache > 4. The next access to any domain trigger 2-5 fc-scans, each taking 25 seconds > 5. calls to storage domain methods are blocked for at leat 25-50 seconds > during the scans > 6. virt executor get stuck with blocked getVolumeSize calls > 7. vms become non-responsive > > Germano, did you see the same picture in other cases? Brilliant, I think this connects all the dots. Yes, one case I remember was when adding new storage domains. The current case also has a broken ISO domain. (Originally by Germano Veit Michel)
In connectStorageServer() vdsm invalidates the caches no matter what connection is being made. Is this optimal? For example, connecting an ISO/Export Domain invalidates caches of everything. Is it really necessary? IIUC, the problem can still happen without the extra cache invalidations. If getVolumeSize gets blocked for long the monitor will be non-responsive. Not having the NFS problem makes it less frequent though. (Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #36)
> In connectStorageServer() vdsm invalidates the caches no matter what
> connection is being made.
> Is this optimal? For example, connecting an ISO/Export Domain invalidates
> caches of everything. Is it really necessary?

I don't think so, and I suggested changing this in the past:
https://gerrit.ovirt.org/c/45824/

But there were concerns about the safety of this change. I think we should rethink this.

> IIUC, the problem can still happen without the extra cache invalidations. If
> getVolumeSize gets blocked for long the monitor will be non-responsive.
> Not having the NFS problem makes it less frequent though.

Yes, but this is a real storage issue, and in this case libvirt and qemu will get stuck anyway.

(Originally by Nir Soffer)
(In reply to Nir Soffer from comment #37) > > IIUC, the problem can still happen without the extra cache invalidations. If > > getVolumeSize gets blocked for long the monitor will be non-responsive. > > Not having the NFS problem makes it less frequent though. > > Yes, but this is real storage issue, and in this case libvirt and qemu will > get stuck anyway. But if 1 domain has problems and is blocking refresh, then all vms will be non-responsive, even VMs that have no disks on that domain, right? Also, there might be other reasons the refresh might take longer and the VMs are all running happily, like high number of devices, a problem with a few devices/paths which are not even being used. No reply for the fc-scan yet, will post here as soon as I get it. (Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #38) > (In reply to Nir Soffer from comment #37) > > > IIUC, the problem can still happen without the extra cache invalidations. If > > > getVolumeSize gets blocked for long the monitor will be non-responsive. > > > Not having the NFS problem makes it less frequent though. > > > > Yes, but this is real storage issue, and in this case libvirt and qemu will > > get stuck anyway. > > But if 1 domain has problems and is blocking refresh, then all vms will be > non-responsive, even VMs that have no disks on that domain, right? Right, this is exactly the case here. The problem is lack of storage domain isolation. (Originally by Nir Soffer)
Created attachment 1616201 [details] fc-scan with devices (Originally by Nir Soffer)
Created attachment 1616205 [details] fc-scan without devices (Originally by Nir Soffer)
Created attachment 1616207 [details] fc-scan with devices Uploading again with fixed whitespace. (Originally by Nir Soffer)
I extracted the text from the pdf in attachment 1615732 [details] and uploaded:

- attachment 1616207 [details] - fc-scan with devices
- attachment 1616205 [details] - fc-scan without devices

Comparing total runtime:

- with devices: 4.6s
- without devices: 0.07s (65X faster)

Checking how the time was spent scanning devices:

We scanned 486 devices:

$ egrep 'Rescanned device .+:' fc-scan-with-devices.log | wc -l
486

Slowest scans:

$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | head
0.04
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03

Fastest scans:

$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | tail
0.01
0.01
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00

Average scan:

$ awk '/Rescanned device .+:/ {sum+=$5; count+=1}; END {print sum/count}' fc-scan-with-devices.log
0.0179218

Total time spent in scanning:

$ awk '/Rescanned device .+:/ {sum+=$5}; END {print sum}' fc-scan-with-devices.log
8.71

We have some parallelism, using a thread per host. There is also time spent in finding the devices for every host, but we cannot tell from the log how much.

98.5% of the time is spent in scanning devices. Skipping device scanning when it is not needed will make slow scans a non-issue, avoiding delays in other parts of the system.

(Originally by Nir Soffer)
Germano, it would be nice if we could get text instead of pdf next time. An easy way to collect the data is:

$ (time fc-scan -v) > fc-scan-with-devices.log 2>&1
$ (time fc-scan -v --no-devices) > fc-scan-without-devices.log 2>&1

(Originally by Nir Soffer)
Checking how the devices are distributed across fc_hosts and rports:

$ egrep 'Rescanning device .+/host2/' fc-scan-with-devices.log | wc -l
243

$ egrep 'Rescanning device .+/host4/' fc-scan-with-devices.log | wc -l
243

$ egrep 'Rescanning device .+/host4/rport-4:0-2' fc-scan-with-devices.log | wc -l
3

$ egrep 'Rescanning device .+/host4/rport-4:0-3' fc-scan-with-devices.log | wc -l
82

$ egrep 'Rescanning device .+/host4/rport-4:0-4' fc-scan-with-devices.log | wc -l
82

$ egrep 'Rescanning device .+/host4/rport-4:0-5' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host4/rport-4:0-6' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host2/rport-2:0-2' fc-scan-with-devices.log | wc -l
82

$ egrep 'Rescanning device .+/host2/rport-2:0-3' fc-scan-with-devices.log | wc -l
82

$ egrep 'Rescanning device .+/host2/rport-2:0-4' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host2/rport-2:0-5' fc-scan-with-devices.log | wc -l
38

$ egrep 'Rescanning device .+/host2/rport-2:0-6' fc-scan-with-devices.log | wc -l
3

We use only 2 threads - one per fc_host with devices. Can we get better parallelism by scanning with a thread per rport-target-N combination?

The relevant code is here:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/storage/fc-scan;h=7da6105971c7c253b20626cb2cd87509b0815778;hb=61404df72b75dc329a70c745b9e0af88ff7027a2

The relevant part is:

106         fd = os.open(path, os.O_WRONLY)
107         try:
108             os.write(fd, "1")
109         finally:
110             os.close(fd)

Can this work in parallel for multiple devices on different host-rport-target-N paths, like:

/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan

If it can work, we can speed up device scanning significantly.

Ben, I don't know if you can answer this, but maybe you can tell me who can?

(Originally by Nir Soffer)
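For illustration only, this is the shape of the parallel rescan being asked about, built around the write shown above; the helper names and the thread-pool size are assumptions, not the eventual vdsm patch:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def rescan_device(path):
    # Writing "1" to a device's .../rescan node asks the kernel to re-read
    # its attributes, including the size.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, b"1")
    finally:
        os.close(fd)


def rescan_devices(paths, max_workers=20):
    # If rescans of already-discovered devices do not serialize on the host
    # scan mutex, running them from a small thread pool lets the per-device
    # waits overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(rescan_device, paths))
```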
Summary of action items (in order of importance):

1. Disable device scanning when it is not needed.
   I think most refreshes do not need device scanning.

2. Optimize device scanning when it is needed (e.g. better parallelism).
   Important in getDeviceList and startup, when we must resize devices.

3. Never trigger rescans when handling file based domains.
   It is never right to rescan hosts and/or devices when handling an NFS domain.

4. Never trigger an fc_host scan when handling an iSCSI domain, or an iscsi_host scan when handling an FC domain (currently we always invoke both).

5. Minimize cache invalidations (sdCache.invalidateStorage) - they may trigger a slow rescan blocking other flows. I think we can remove all calls.

6. Minimize cache refreshes (sdCache.refreshStorage) - they may trigger a slow rescan blocking other flows. I think we can remove some calls.

7. Consider changing the sampling method to reuse the result of the operation currently in flight instead of grouping waiters and performing a new operation for all of them; this can cut the blocking time by half.

(Originally by Nir Soffer)
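For item 7, a toy sketch (not vdsm's actual misc.samplingmethod) of the idea that a caller arriving while an operation is already running simply waits for that run's result instead of queuing another full operation; error propagation is omitted for brevity:

```python
import threading


def share_inflight_result(func):
    """Illustrative decorator: callers that arrive while a call is already
    in flight wait for it and reuse its result, instead of grouping up and
    running the operation again afterwards."""
    lock = threading.Lock()
    state = {"event": None, "result": None}

    def wrapper(*args, **kwargs):
        with lock:
            event = state["event"]
            runner = event is None
            if runner:
                event = state["event"] = threading.Event()
        if runner:
            try:
                state["result"] = func(*args, **kwargs)
            finally:
                with lock:
                    state["event"] = None
                event.set()
            return state["result"]
        # A call was already running - just wait for its result.
        event.wait()
        return state["result"]

    return wrapper
```

Compared to the grouping behavior described in the sdc.py comment quoted earlier, a waiter here is released as soon as the in-flight refresh finishes, rather than waiting for a second refresh performed on its behalf.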
(In reply to Nir Soffer from comment #49) > We use only 2 threads - one per fc_host with devices. Can we get better parallelism > by scanning with thread per rport-target-N combinations? > Can this work in parallel for multiple devices on different host-rport-target-N, like: > > /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan > /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan > /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan > /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan > /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan > > If it can work, we can speed up device scanning significantly. > > Ben, I don't know if you can answer this, but maybe you can tell me who can > answer this? Looking over the relevant kernel code briefly, I don't see any obvious reason why not. However, I really don't know the kernel scsi-layer code well, so Ewan might be a better person to ask. (Originally by Benjamin Marzinski)
It depends upon what you are doing. There is a mutex on each SCSI Host object taken around the scan operation when rescanning a host, or a target. So if you use the "scan" sysfs node with wildcards on the host object, you will wait on the mutex if multiple threads are performing an operation on the same host.

Rescanning an already-discovered device, though, is different and does not take the mutex. So there is some potential for parallelism there.

However, one thing to be aware of is that scanning devices can take more time, sometimes a lot more time, if there is also I/O in progress to the device. I have seen cases where it can take on the order of seconds for a device to respond to inquiry commands under I/O load, because it apparently services the I/O with higher priority. Rescanning a device sends several such inquiry commands to it to get all the VPD pages and attributes.

It is probably better to cache the information, and perhaps update it on a periodic rotating basis (rather than scanning everything all at the same time).

Note also that since 7.5, if a device reports a unit attention ASC/ASCQ 3F 03 INQUIRY DATA HAS CHANGED, the kernel may perform the rescan automatically. (A unit attention is only received if there is a command that can receive the status, so if there is no I/O to the device the reporting can be delayed or not received. If multipath is used, typically the path checker will receive the UA.)

This does not happen for ASC/ASCQ 2A 09 CAPACITY DATA HAS CHANGED, though. This had to do with multipath requiring the capacity on the underlying paths to match, so if some paths are seen to change but not others, the paths would not be used. (I think multipath may have been changed to not check this at some point, but I'm not certain.)

static void scsi_report_sense(struct scsi_device *sdev,
                              struct scsi_sense_hdr *sshdr)
{
        enum scsi_device_event evt_type = SDEV_EVT_MAXBITS;     /* i.e. none */

        if (sshdr->sense_key == UNIT_ATTENTION) {
                if (sshdr->asc == 0x3f && sshdr->ascq == 0x03) {
                        evt_type = SDEV_EVT_INQUIRY_CHANGE_REPORTED;
                        sdev_printk(KERN_WARNING, sdev,
                                    "Inquiry data has changed");
...

static void scsi_evt_emit(struct scsi_device *sdev, struct scsi_event *evt)
{
        int idx = 0;
        char *envp[3];

        switch (evt->evt_type) {
        case SDEV_EVT_MEDIA_CHANGE:
                envp[idx++] = "SDEV_MEDIA_CHANGE=1";
                break;
        case SDEV_EVT_INQUIRY_CHANGE_REPORTED:
                scsi_rescan_device(&sdev->sdev_gendev);             <=====
                envp[idx++] = "SDEV_UA=INQUIRY_DATA_HAS_CHANGED";
                break;
...

static ssize_t
store_rescan_field (struct device *dev, struct device_attribute *attr,
                    const char *buf, size_t count)
{
        scsi_rescan_device(dev);                                    <=====
        return count;
}
static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field);

(Originally by Ewan D. Milne)
(In reply to Ewan D. Milne from comment #52)
> It depends upon what you are doing.

I'll try to describe the RHV use case.

We have multiple hosts in a cluster (e.g. 50), all connected to the same LUNs. The reason is that we can quickly try to run a VM using this storage on any of these hosts, without changing the storage (e.g. mapping a LUN to some host). All the LUNs use multipath.

We create an LVM volume group with one or more LUNs, and create LVM logical volumes used as VM virtual disks. When we run a VM on some host, or perform some storage operation (e.g. cloning a virtual disk), we activate the logical volume on this host.

When the volume group becomes full, users can go and resize the LUN on the server. Since we don't manage the resize, we need to rescan the relevant device on all hosts so all hosts are on the same page and see the new size of the device. Once we rescan the devices on all hosts, we need to resize the LVM physical volume. This is done on a master host managing the LVM metadata. Other hosts will see this update when they reload this volume group.

Optimally we would rescan only the device that was resized, but since we don't manage the resize, we don't know which device was resized. We could ask the user to select the device in the RHV UI, but that is a pretty hard task when you have lots of devices. Also, the user resizing the device on the server is typically not the same user managing the RHV setup. So instead of making the user's life hard, we rescan all the devices when a user asks to update device sizes. Then we display the devices that may be extended (device size > LVM physical volume size). The user can select the devices and trigger a resize of the physical volume.

Since we manage a cluster of hosts, and a host can be disconnected from the cluster at any time, we need to make sure a host will rescan devices before connecting back to the cluster, in case there was an update while the host was not connected.

So we need an efficient way to scan a large number of devices. I think we test up to 300 LUNs, with 4 paths per LUN, total 1200 devices.

> Rescanning an already-discovered device, though, is different and does not take
> the mutex. So there is some potential for parallelism there.

Great.

> However, one thing to be aware of is that scanning devices can take more time,
> sometimes a lot more time, if there is also I/O in progress to the device. I have
> seen cases where it can take on the order of seconds for a device to respond to
> inquiry commands under I/O load, because it apparently services the I/O with higher
> priority. Rescanning a device sends several such inquiry commands to it to get
> all the VPD pages and attributes.

Yes, we learned this the hard way; scanning 400 devices took about 25 seconds with 2 threads (thread per host). I hope that we can cut the time by 4 by having one thread per host-rport-target-N.

> It is probably better to cache the information, and perhaps update it on a periodic
> rotating basis (rather than scanning everything all at the same time).

The goal is to rescan devices only:

- when a user needs to perform an operation with devices, like creating a new LVM volume group in the management system. At this point the user wants to see the current size of the devices.
- when a user performs a resize of the LVM physical volume, on all hosts.
- when connecting back to the cluster after a host was in maintenance mode, or after restarting the management program on a host.

(Originally by Nir Soffer)
Daniel, this is mostly a scale issue, affecting users with large setups (many hosts, many VMs, many LUNs, many storage domains).

We have an initial patch for speeding up device scanning:
https://gerrit.ovirt.org/c/103455/

To test this patch I need access to 1-2 hosts connected to FC storage with many LUNs. I think that 300 LUNs, 4 paths per LUN, total 1200 SCSI devices, can be a good test.

We need to test:
- time to scan devices on an idle host
- same when scanning from multiple hosts at the same time
- same when there is I/O on some devices (e.g. copying disks while scanning)

(Originally by Nir Soffer)
Created attachment 1619485 [details]
graph: fc-scan with 222 devices

The graph shows that we get about a 10x speedup with 20 threads, and more threads do not improve the results significantly. Previously we were using a thread per fc_host (typically 1 thread), so the current patches can give a ~5x improvement.

We have not yet tested with larger setups or with I/O load.

(Originally by Nir Soffer)
Mordechai, see original bug 1598266 for testing/verification instructions.
Verified on 4.3.7-3

redhat-release-server-7.7-10.el7.x86_64
rhv-release-4.3.7-3-001.noarch
vdsm-4.30.34-1.el7ev.x86_64

Flow:
Using 1 host + running 60 VMs
Have 300 LUNs
Move 8 raw disks (500GB) in parallel
Storage Domain activities (5 SDs in parallel to move disks):
Maintenance
Detach
Activate
Attach
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4230