Bug 1598266
Summary: [scale] VMs unresponsive due to delayed getVolumeSize
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 4.1.11
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Germano Veit Michel <gveitmic>
Assignee: Nir Soffer <nsoffer>
QA Contact: mlehrer
CC: aefrat, bmarzins, dagur, dvaanunu, emarcus, eshenitz, fgarciad, fromani, gveitmic, lsurette, michal.skrivanek, mkalinin, mlehrer, mzamazal, nsoffer, pelauter, pvilayat, rdlugyhe, rgolan, sfroemer, srevivo, tnisan, ycui
Target Milestone: ovirt-4.4.0
Target Release: 4.3.0
Keywords: Performance, ZStream
Flags: mlehrer: needinfo-; lsvaty: testing_plan_complete-
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
    When a system had many FC LUNs with many paths per LUN, and a high I/O load, scanning of FC devices became slow, causing timeouts in monitoring VM disk size, and making VMs non-responsive.
    In this release, FC scans have been optimized for speed, and VMs are much less likely to become non-responsive.
Clones: 1758094 (view as bug list)
Last Closed: 2020-08-04 13:26:06 UTC
Type: Bug
oVirt Team: Scale
Bug Blocks: 1736852, 1547768, 1758094
Description (Germano Veit Michel, 2018-07-05 01:08:08 UTC)
Maor, you are this week's QE contact, please scrub.

I saw the following error in the VDSM logs:

2018-07-04 08:05:11,680+0200 WARN (jsonrpc/0) [virt.vm] (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
2018-07-04 08:05:11,682+0200 WARN (jsonrpc/0) [virt.vm] (vmId='def11d1f-9cb0-47be-9c45-7b961d275c40') monitor became unresponsive (command timeout, age=62.6100000013) (vm:5135)
....

Can you please add some more indicators, such as cpu usage, load, sar, iotop of the host while the VMs get unresponsive. Can you also add libvirt, qemu and engine logs. Thanks.

Hi Maor,

Thanks for taking a look.

(In reply to Maor from comment #4)
> I saw the following error in the VDSM logs:
>
> 2018-07-04 08:05:11,680+0200 WARN (jsonrpc/0) [virt.vm]
> (vmId='86dfa0fc-a4ed-4362-83a4-5bec1ca98d60') monitor became unresponsive
> (command timeout, age=62.6100000013) (vm:5135)
> ....

Yes, as in comment #0.

> Can you please add some more indicators, such as cpu usage, load, sar, iotop
> of the host while the VMs get unresponsive.

Sure. There are plenty of resources available; there is no overcommit. The delays on entering the sampling method appear in several places, and fc-scan is always slow (there is a large number of LUNs presented). The lockup looks like it is in VDSM to me. All VMs are running just fine, and there is no detected performance issue anywhere else.

> Can you also add libvirt, qemu and engine logs

RHV doesn't run with libvirt or qemu logging by default, so I'll add engine logs.

> It might also be related to this bug https://bugzilla.redhat.com/1419856
> which libvirt version is the customer using?

libvirt-daemon-3.2.0-14.el7_4.9.x86_64

Please let me know if you need anything else.
Similar case: UpdateVolumes for each VM is stuck because getVolumeSize is delayed/not-running/blocked. In this case it happens when adding a new SD. vdsm-4.19.45-1.el7ev.x86_64.

From the data of comment #17:

2018-11-08 11:15:19,064+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:16:19,034+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:17:19,048+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:19:20,278+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 2m01s delayed
2018-11-08 11:19:24,344+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',  <--- 5s delayed
2018-11-08 11:20:19,040+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:21:19,076+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',
2018-11-08 11:22:19,046+0100 getVolumeSize volUUID=u'284ae5e7-da22-4a51-8fb7-554c3d953ced',

plus:

2018-11-08 11:19:17,412+0100 WARN (jsonrpc/6) [virt.vm] (vmId='a4ab6989-274b-4d50-9bec-1680e45442a1') monitor became unresponsive (command timeout, age=65.4700000007) (vm:5132)

So it looks like UpdateVolumes was stuck from ~11:18 to 11:19:20, and the monitor goes unresponsive as getVolumeSize for the VM's volumes is stuck? Tons of getVolumeSize calls were delayed/blocked while the user was adding a new SD to the pool. Is there any improvement in this area in vdsm-4.20?

I'm positive we landed a few fixes in 4.2 to improve the number of rescans we perform, after gchen found out we repeat that about 4 times for every path. Germano, how many paths are configured there? Meanwhile I'll try to dig up the relevant BZ.

(In reply to Roy Golan from comment #19)
> Germano, how many paths are configured there?

This one has 8 paths ;)

Indeed it should be better in 4.2 with https://gerrit.ovirt.org/#/c/81681/

Still, can't this be improved to not pile up GetVolumeSize() while the SD is refreshing/producing? It seems all GetVolumeSize() calls are waiting for the refresh; with enough VMs that means a ton of blocked UpdateVolumes tasks.

This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

I think this issue should be solved by moving volume size monitoring to the storage domain monitors. When an image is prepared, we can register the leaf volume with the storage domain monitor. Then the storage domain monitor can check the volume size every N seconds and cache the value. When the virt API needs the value, it will get the cached value from repoStats. When an image is torn down, the storage monitor will stop monitoring the leaf volume.

When taking a snapshot, replacing the active leaf volume, the new leaf volume should be registered with the storage domain monitor, replacing the old leaf volume (that will never change). If storage is not responsive, the relevant volume sizes will be stale, but nothing in the system will block. The UpdateVolume periodic task will be removed.

Normally getting volume size is quick; we need to find the root cause making these calls slow.

Updating volume size calls this every 60 seconds:

3842     def updateDriveVolume(self, vmDrive):
3843         if not vmDrive.device == 'disk' or not isVdsmImage(vmDrive):
3844             return
3845
3846         volSize = self._getVolumeSize(
3847             vmDrive.domainID, vmDrive.poolID, vmDrive.imageID,
3848             vmDrive.volumeID)
3849
3850         vmDrive.truesize = volSize.truesize
3851         vmDrive.apparentsize = volSize.apparentsize

getVolumeSize() is:

3067     @public
3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
...
3086         dom = sdCache.produce(sdUUID=sdUUID)
3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
3089         return dict(apparentsize=apparentsize, truesize=truesize)

On block storage getVSize and getVAllocSize are the same, so we have 2 calls to:

421     def getVSize(self, imgUUUID, volUUID):
422         """ Return the block volume size in bytes. """
423         try:
424             size = fsutils.size(lvm.lvPath(self.sdUUID, volUUID))
425         except IOError as e:
426             if e.errno != errno.ENOENT:
427                 raise
428             # Inactive volume has no /dev entry. Fallback to lvm way.
429             size = lvm.getLV(self.sdUUID, volUUID).size
430
431         return int(size)

Because the LVs are active (called for running VMs), we know we don't use lvm to get the size but call fsutils.size(), which is:

26  def size(filename):
27      """
28      Return actual file size, should work with both file and block device.
29      """
30      with io.open(filename, "rb") as f:
31          f.seek(0, os.SEEK_END)
32          return f.tell()

Which should be fast, but maybe in these cases it is not.

An easy optimization would be to call getVSize() only once on block storage instead of twice. In the case when seeking to the end of the device is slow, this may improve the situation. But we really want to understand why this operation is slow on certain systems.

Germano, can you share all WARN logs from vdsm.log and sanlock.log?

Hi Nir,

Thanks for taking a look at this.

(In reply to Nir Soffer from comment #29)
> Normally getting volume size is quick. we need to find the root cause causing
> these calls to be slow.
> Updating volume size calls this every 60 seconds:
>
> 3842     def updateDriveVolume(self, vmDrive):
> ...
>
> getVolumeSize() is:
>
> 3067     @public
> 3068     def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID, options=None):
> ...
> 3086         dom = sdCache.produce(sdUUID=sdUUID)
> 3087         apparentsize = str(dom.getVSize(imgUUID, volUUID))
> 3088         truesize = str(dom.getVAllocSize(imgUUID, volUUID))
> 3089         return dict(apparentsize=apparentsize, truesize=truesize)

I might be wrong, but I think it is sdCache.produce above that slows down monitoring. getVolumeSize is stuck on the executor until a storage refresh completes. And on larger environments (many LUNs), an HBA rescan can take longer too, enough to make VM monitoring take more time. I think having a few paths down also slows down the HBA rescan.

Once the storage refresh completes, all stuck getVolumeSize() calls complete immediately after.

> On block storage getVSize and getVAllocSize are the same, so we have
> 2 calls to:
>
> 421     def getVSize(self, imgUUUID, volUUID):
> 422         """ Return the block volume size in bytes. """
> ...
>
> Because the LVs are active (called for running VMs), we know we don't use
> lvm to get the size but call fsutils.size(), which is:
>
> 26  def size(filename):
> 27      """
> 28      Return actual file size, should work with both file and block device.
> 29      """
> 30      with io.open(filename, "rb") as f:
> 31          f.seek(0, os.SEEK_END)
> 32          return f.tell()
>
> Which should be fast, but maybe in these cases is not.

Yes, these should be fast. If this was slow, then the problem would not be in VDSM I believe, and probably the VMs would be suffering too.

> An easy optimization would be to call getVSize() only once on block
> storage instead of twice. In the case when seeking to the end of the
> device is slow, this may improve the situation.
>
> But we really want to understand why this operation is slow on
> certain systems.
>
> Germano, can you share all WARN logs vdsm log and sanlock.log?

To open this bug I collected some debug logs from the initial customer; aren't the ones attached initially good enough? They are more informative than the latest ones without debug, but I can get more if you wish. Occasionally there are some HBA rescan timeouts too. From what I remember of these cases, sanlock is fine, and so is IO to storage. The common scenario is that they all have lots of SDs and LUNs.

I don't have debug logs from the latest 4.3 on this. Let me know if you need them and I'll collect a new round, along with sanlock (which was fine when I checked).

(In reply to Germano Veit Michel from comment #30)
...
> I might be wrong, but I think it's sdCache.produce above that slows down
> monitoring. getVolumeSize is stuck on the executor until a storage refresh
> completes. And on larger environments (many LUNs), HBA rescan can take
> longer time too, enough to make VM monitoring take more time.

Right, storage refreshes are a known issue that may block any flow in vdsm accessing a storage domain method, even if the storage domain is not related (e.g. accessing an ISO domain during a storage refresh blocked by a SCSI scan). But we don't expect a lot of storage refreshes. We need to understand why we have many refreshes, and why they are slow.

> I think having a few paths down also slow down HBA rescan.
>
> Once the storage refresh completes, all stuck getVolumeSize() complete
> immediately after.

This makes sense.

> > Germano, can you share all WARN logs vdsm log and sanlock.log?
>
> To open this bug I collected some debug logs from the initial customer,
> aren't the ones attached initially good enough?

We need to look at them again; hopefully we will have enough info.

> I don't have debug logs from latest 4.3 on this. Let me know if you need
> them then I'll collect a new round, along with sanlock (which was fine
> when I checked).

If sanlock did not have any delays, then this is most likely an issue with storage refreshes blocking other operations.

It would help to time a manual verbose fc-scan:

    time /usr/libexec/vdsm/fc-scan -v

on a host that shows this issue, and share the output of the command. We have a log with the time it took for every host/device scan.
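The design proposed earlier in this thread (moving leaf-volume size monitoring into the storage domain monitors and serving cached values) could be sketched roughly as below. All class and method names here are hypothetical illustrations, not the actual vdsm implementation:

```python
import threading


class VolumeSizeCache:
    """
    Sketch: a storage domain monitor refreshes the sizes of registered
    leaf volumes in the background; the virt monitoring path reads the
    cached value without ever blocking on a storage refresh.
    """

    def __init__(self, read_size):
        # read_size: callable vol_id -> size in bytes (may block on storage)
        self._read_size = read_size
        self._lock = threading.Lock()
        self._volumes = set()
        self._sizes = {}

    def register(self, vol_id):
        # Called when an image is prepared: start tracking its leaf volume.
        size = self._read_size(vol_id)
        with self._lock:
            self._volumes.add(vol_id)
            self._sizes[vol_id] = size

    def unregister(self, vol_id):
        # Called on image teardown (or when a snapshot replaces the leaf).
        with self._lock:
            self._volumes.discard(vol_id)
            self._sizes.pop(vol_id, None)

    def refresh(self):
        # Run by the domain monitor every N seconds. If storage is not
        # responsive this blocks, but readers keep getting the stale value.
        with self._lock:
            volumes = list(self._volumes)
        for vol_id in volumes:
            size = self._read_size(vol_id)
            with self._lock:
                if vol_id in self._volumes:
                    self._sizes[vol_id] = size

    def size(self, vol_id):
        # Non-blocking read used by the virt monitoring path (repoStats-like).
        with self._lock:
            return self._sizes[vol_id]
```

The key property is that `size()` only touches in-memory state, so a slow fc-scan or an unresponsive domain can at worst make the reported sizes stale, never block VM monitoring.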
Since we support device resize detection, this tool now scans not only the scsi hosts (/sys/class/scsi_host/%s/scan) but also all the scsi devices (/sys/class/scsi_device/*). This was added as part of bug 609689 in https://gerrit.ovirt.org/c/43388/.

Since vdsm does not manage devices, and can miss a request to resize devices when a host is offline or disconnected from engine temporarily, vdsm scans all fc_hosts and all devices connected to these hosts whenever we refresh storage.

Looking at the refresh code in hsm.py:

78     @misc.samplingmethod
79     def refreshStorage(self, resize=True):
80         self.__staleStatus = self.STORAGE_REFRESHING
81
82         multipath.rescan()
83         if resize:
84             multipath.resize_devices()
85         lvm.invalidateCache()

We don't need to scan devices if resize is False. Checking all the places we call refreshStorage:

2796     def getStorageDomainsList(
2797             self, spUUID=None, domainClass=None, storageType=None,
2798             remotePath=None, options=None):
...
2814         sdCache.refreshStorage()

Looks like we can refresh with resize=False here.

2872     def getVGList(self, storageType=None, options=None):
...
2882         sdCache.refreshStorage()

Same, resize is not needed.

411     def storageRefresh():
412         sdCache.refreshStorage()

This is done during startup - since we don't know if a device was resized on other hosts while vdsm was down, we really need to use resize=True here.

1993     def _getDeviceList(self, storageType=None, guids=(), checkStatus=True):
1994         sdCache.refreshStorage()

Here we need to resize since the user expects to see the real device size after this call.

2406     def connectStorageServer(self, domType, spUUID, conList, options=None):
...
2445         # In case there were changes in devices size
2446         # while the VDSM was not connected, we need to
2447         # call refreshStorage.
2448         if domType in (sd.FCP_DOMAIN, sd.ISCSI_DOMAIN):
2449             sdCache.refreshStorage()

Here we must resize, but rescanning devices is not needed if we just connected to storage.

2495     def disconnectStorageServer(self, domType, spUUID, conList, options=None):
...
2530         # Disconnecting a device may change the visible storage domain list
2531         # so invalidate the caches
2532         sdCache.refreshStorage(resize=False)

Here we don't do any resize, so we don't need to scan anything.

Refreshes in sdc.py:

127         try:
128             # If multiple calls reach this point and the storage is not
129             # updated the refreshStorage() sampling method is called
130             # serializing (and eventually grouping) the requests.
131             if self.__staleStatus != self.STORAGE_UPDATED:
132                 self.refreshStorage()

We don't call multipath.resize here so there is no need to rescan devices.

Looks like we should:

- Add an argument to fc-scan to disable device rescan
- Add a rescan_devices=False argument to refreshStorage
- Call refreshStorage(rescan_devices=True) only in the cases that must rescan devices and resize them.

Germano, I posted a quick untested patch for fc-scan to allow disabling device scanning. It would help if you can share the output of these commands on a system showing this issue:

    time /usr/libexec/vdsm/fc-scan -v

    time /usr/libexec/vdsm/fc-scan -v --no-devices

(In reply to Nir Soffer from comment #31)
> Looks like we should:
>
> - Add argument to fc-scan to disable device rescan
> - Add rescan_devices=False argument to refreshStorage
> - Call refreshStorage(rescan_devices=True) only in the cases that must
>   rescan devices and resize them.

Makes a lot of sense to me; it should save some time.

(In reply to Nir Soffer from comment #32)
> Germano, I posted a quick untested patch for fc-scan to allow disabling
> device scanning.
>
> It would help if you can share the output of these commands on a system
> showing this issue.

I've requested it from the customer that was facing these issues, along with a fresh sosreport after fixing other problems that could also have been contributing to the slowdowns.
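A sketch of what the proposed fc-scan option could look like, with the flag name taken from the commands quoted in this bug; the sysfs paths are the ones mentioned above, and the actual patch posted on gerrit may differ:

```python
import argparse
import glob


def parse_args(argv=None):
    # --no-devices is the flag name used in the commands quoted in this
    # bug; this is a sketch, not the actual fc-scan patch.
    parser = argparse.ArgumentParser(
        description="Rescan FC hosts and, optionally, their SCSI devices")
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="log scan timing per host/device")
    parser.add_argument(
        "--no-devices", action="store_true",
        help="scan only /sys/class/scsi_host/*/scan, skipping the "
             "per-device rescan that is only needed for resize detection")
    return parser.parse_args(argv)


def scan_targets(args):
    # Hosts are always scanned; per-device rescan files are included only
    # when resize detection is wanted, since they dominate the runtime.
    paths = glob.glob("/sys/class/scsi_host/host*/scan")
    if not args.no_devices:
        paths += glob.glob("/sys/class/scsi_device/*/device/rescan")
    return paths
```

With such a flag, callers that only need to discover new LUNs (e.g. cache refreshes that do not resize) could pass `--no-devices` and skip the expensive part of the scan.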
Will let you know once we have results. Thanks!

Looking in vdsm.log from attachment 1456628 [details]:
## FC scans
$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | wc -l
34
$ head -1 vdsm.log.3
2018-07-04 08:01:06,269+0200 ...
$ tail -1 vdsm.log.3
2018-07-04 09:01:01,772+0200 ...
34 scans in 1 hour - does not make sense. We don't expect so many storage refreshes.
$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep monitor/ | wc -l
33
$ grep '\[storage.HBA\] Starting scan' vdsm.log.3 | grep jsonrpc/ | wc -l
1
One scan was triggered by an API call, all the others by storage domain monitors.
Looking at the fc-scan calls, they take about 25 seconds each and are performed in
groups, back to back:
1. 2 scans
2018-07-04 08:36:41,861+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:05,450+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:37:06,137+0200 DEBUG (monitor/62b40d0) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:37:29,223+0200 DEBUG (monitor/62b40d0) [storage.HBA] Scan finished (hba:64)
2. After a minute, 3 scans
2018-07-04 08:41:23,481+0200 DEBUG (monitor/3f31203) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:41:45,052+0200 DEBUG (monitor/3f31203) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:41:45,592+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:09,433+0200 DEBUG (monitor/c59bd9b) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:42:10,136+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:42:33,887+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)
3. After 3:30 minutes, 4 scans
2018-07-04 08:46:00,457+0200 DEBUG (monitor/632f710) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:22,423+0200 DEBUG (monitor/632f710) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:23,010+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:46:46,789+0200 DEBUG (monitor/dbf3c27) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:46:47,317+0200 DEBUG (monitor/eff5139) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:10,096+0200 DEBUG (monitor/eff5139) [storage.HBA] Scan finished (hba:64)
2018-07-04 08:47:10,774+0200 DEBUG (monitor/c9faf72) [storage.HBA] Starting scan (hba:58)
2018-07-04 08:47:34,787+0200 DEBUG (monitor/c9faf72) [storage.HBA] Scan finished (hba:64)
## Warnings
$ grep WARN vdsm.log.3 | wc -l
979
$ egrep 'WARN.+is blocked for' vdsm.log.3 | head -10
2018-07-04 08:01:11,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 320.00 seconds (check:279)
2018-07-04 08:01:21,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 330.00 seconds (check:279)
2018-07-04 08:01:31,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 340.00 seconds (check:279)
2018-07-04 08:01:41,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 350.00 seconds (check:279)
2018-07-04 08:01:51,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 360.00 seconds (check:279)
2018-07-04 08:02:01,282+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 370.00 seconds (check:279)
2018-07-04 08:02:11,278+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 380.00 seconds (check:279)
2018-07-04 08:02:21,287+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 390.00 seconds (check:279)
2018-07-04 08:02:31,285+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 400.00 seconds (check:279)
2018-07-04 08:02:41,287+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 410.00 seconds (check:279)
There are serious issues with the NFS domain dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22;
it was blocked for 840 seconds:
2018-07-04 08:55:31,279+0200 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/10.252.100.47:_appl/dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22/dom_md/metadata' is blocked for 840.00 seconds (check:279)
$ egrep 'WARN.+is blocked for' vdsm.log.3 | grep -v dbf3c278-6ec5-45f0-9c09-0f92e6c7ab22
No other domain had issues.
## Storage cache invalidations
I think the issue is the storage domain cache. Every access to a storage domain looks up
an internal cache. If the domain is in the cache there is no delay and the code continues.
But if the domain is not in the cache, and the cache was invalidated, the caller tries to
refresh storage. This operation does:
- scsi rescan
- fc rescan
- resize multipath devices
Because of the way the cache is built, all this is typically done twice when multiple
storage domains try to enter the domain cache.
In our case, fc-scan takes about 25 seconds, so code accessing a storage domain may
typically be blocked for at least 25-50 seconds on the cache alone because of slow fc-scans.
Why would a domain not be in the cache? Every 300 seconds, storage domain monitors remove
the domain from the cache so that the next access to the domain will recreate the domain
object and refresh the domain metadata.
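The "done twice" behavior comes from the sampling-method serialization quoted from sdc.py earlier ("serializing (and eventually grouping) the requests"). A minimal sketch of those assumed semantics, not vdsm's actual misc.samplingmethod code: while one refresh runs, all later callers queue behind at most one follow-up run, so a burst of N concurrent callers triggers at most two executions.

```python
import functools
import threading


def samplingmethod(func):
    """
    Sketch of sampling-method grouping semantics (assumed, not the real
    vdsm decorator): callers arriving while func() runs wait for a single
    shared follow-up run instead of each running func() themselves.
    Queued callers return None; their sampling is covered by that run,
    which is fine for side-effect-only methods like refreshStorage().
    """
    barrier = threading.Lock()  # serializes actual executions
    gate = threading.Lock()     # admits at most one pending sampler

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if gate.acquire(blocking=False):
            # We are the next sampler: run once the current run finishes.
            with barrier:
                gate.release()  # let one more caller queue a future run
                return func(*args, **kwargs)
        # A follow-up run is already queued; wait for it to start and
        # finish, then return without running func again.
        with gate:
            pass
        with barrier:
            pass
        return None

    return wrapper
```

This explains the observed pattern: the first caller refreshes, every caller that piled up during that refresh is served by one more refresh, hence groups of rescans rather than one per caller.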
The cache is invalidated in:
- connectStorageServer()
- StoragePool.connect(), called from connectStoragePool()
- StoragePool.attachSD(), called from attachStorageDomain()
- StoragePool.refresh(), called from connectStoragePool()
Looking at the relevant calls:
$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:04:05,141+0200
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:23:40,594+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200
$ egrep 'START (connectStoragePool|connectStorageServer|attachStorageDomain)' vdsm.log.3 | wc -l
15
So we had 15 cache invalidations in one hour.
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}'
2018-07-04 08:05:00,203+0200
2018-07-04 08:10:00,235+0200
2018-07-04 08:15:00,434+0200
2018-07-04 08:20:00,284+0200
2018-07-04 08:22:44,676+0200
2018-07-04 08:24:44,517+0200
2018-07-04 08:30:00,290+0200
2018-07-04 08:35:00,166+0200
2018-07-04 08:40:00,148+0200
2018-07-04 08:45:00,261+0200
2018-07-04 08:50:00,183+0200
2018-07-04 08:55:00,205+0200
2018-07-04 09:00:00,232+0200
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2}' | wc -l
13
Almost all the calls are connectStorageServer. I guess this is engine trying to recover
the inaccessible NFS domain by trying to reconnect to the server.
Let's try to correlate the cache invalidations with the rescans:
$ egrep '\[storage.HBA\] Starting scan' vdsm.log.3 | awk '{print $1,$2,"rescan"}' > rescan.log
$ egrep 'START connectStorageServer' vdsm.log.3 | awk '{print $1,$2,"invalidate cache"}' > invalidate.log
$ cat invalidate.log rescan.log | sort
2018-07-04 08:05:00,203+0200 invalidate cache
2018-07-04 08:10:00,235+0200 invalidate cache
2018-07-04 08:15:00,434+0200 invalidate cache
2018-07-04 08:16:18,808+0200 rescan
2018-07-04 08:16:41,230+0200 rescan
2018-07-04 08:17:05,614+0200 rescan
2018-07-04 08:20:00,284+0200 invalidate cache
2018-07-04 08:21:24,552+0200 rescan
2018-07-04 08:21:46,828+0200 rescan
2018-07-04 08:22:10,778+0200 rescan
2018-07-04 08:22:44,676+0200 invalidate cache
2018-07-04 08:23:40,685+0200 rescan
2018-07-04 08:24:03,157+0200 rescan
2018-07-04 08:24:27,217+0200 rescan
2018-07-04 08:24:44,517+0200 invalidate cache
2018-07-04 08:26:03,833+0200 rescan
2018-07-04 08:26:26,448+0200 rescan
2018-07-04 08:26:50,643+0200 rescan
2018-07-04 08:30:00,290+0200 invalidate cache
2018-07-04 08:31:09,170+0200 rescan
2018-07-04 08:31:31,503+0200 rescan
2018-07-04 08:31:55,619+0200 rescan
2018-07-04 08:32:21,896+0200 rescan
2018-07-04 08:35:00,166+0200 invalidate cache
2018-07-04 08:36:16,809+0200 rescan
2018-07-04 08:36:41,861+0200 rescan
2018-07-04 08:37:06,137+0200 rescan
2018-07-04 08:40:00,148+0200 invalidate cache
2018-07-04 08:41:23,481+0200 rescan
2018-07-04 08:41:45,592+0200 rescan
2018-07-04 08:42:10,136+0200 rescan
2018-07-04 08:45:00,261+0200 invalidate cache
2018-07-04 08:46:00,457+0200 rescan
2018-07-04 08:46:23,010+0200 rescan
2018-07-04 08:46:47,317+0200 rescan
2018-07-04 08:47:10,774+0200 rescan
2018-07-04 08:50:00,183+0200 invalidate cache
2018-07-04 08:51:08,685+0200 rescan
2018-07-04 08:51:30,834+0200 rescan
2018-07-04 08:51:54,451+0200 rescan
2018-07-04 08:52:17,360+0200 rescan
2018-07-04 08:55:00,205+0200 invalidate cache
2018-07-04 08:56:16,675+0200 rescan
2018-07-04 08:56:39,366+0200 rescan
2018-07-04 08:57:03,588+0200 rescan
2018-07-04 08:57:26,343+0200 rescan
2018-07-04 09:00:00,232+0200 invalidate cache
The picture is pretty clear now:
1. NFS storage domain is not accessible
2. Engine tries to recover the domain by calling connectStorageServer every ~5 minutes
3. connectStorageServer invalidates the storage domain cache
4. The next access to any domain triggers 2-5 fc-scans, each taking 25 seconds
5. Calls to storage domain methods are blocked for at least 25-50 seconds during the scans
6. The virt executor gets stuck with blocked getVolumeSize calls
7. VMs become non-responsive
Germano, did you see the same picture in other cases?
(In reply to Nir Soffer from comment #34)
> The picture is pretty clear now:
> ...
> Germano, did you see the same picture in other cases?

Brilliant, I think this connects all the dots. Yes, one case I remember was when adding new storage domains. The current case also has a broken ISO domain.

In connectStorageServer() vdsm invalidates the caches no matter what connection is being made. Is this optimal? For example, connecting an ISO/Export domain invalidates the caches of everything. Is it really necessary?

IIUC, the problem can still happen without the extra cache invalidations: if getVolumeSize gets blocked for long, the monitor will be non-responsive. Not having the NFS problem makes it less frequent, though.

(In reply to Germano Veit Michel from comment #36)
> In connectStorageServer() vdsm invalidates the caches no matter what
> connection is being made.
> Is this optimal? For example, connecting an ISO/Export Domain invalidates
> caches of everything. Is it really necessary?

I don't think so, and I suggested changing this in the past:
https://gerrit.ovirt.org/c/45824/

But there were concerns about the safety of this change. I think we should rethink it.

> IIUC, the problem can still happen without the extra cache invalidations. If
> getVolumeSize gets blocked for long the monitor will be non-responsive.
> Not having the NFS problem makes it less frequent though.

Yes, but that is a real storage issue, and in that case libvirt and qemu will get stuck anyway.
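The change discussed above (skipping cache invalidation for connections that cannot change the set of visible LUNs) could be sketched as follows. The domain-type names are modeled on the vdsm identifiers quoted in this bug (sd.FCP_DOMAIN, sd.ISCSI_DOMAIN, sdCache.refreshStorage); the constants and helper here are hypothetical, a sketch of the idea rather than the shipped fix:

```python
# Hypothetical domain-type constants for illustration only.
NFS_DOMAIN = 1
FCP_DOMAIN = 2
ISCSI_DOMAIN = 3

# Only block storage connections can change which LUNs are visible.
BLOCK_DOMAIN_TYPES = {FCP_DOMAIN, ISCSI_DOMAIN}


def should_invalidate_cache(dom_type):
    """
    Connecting file-based storage (e.g. an NFS ISO/export domain) cannot
    change the visible LUNs, so skip the costly rescan and cache
    invalidation that would block unrelated getVolumeSize calls.
    """
    return dom_type in BLOCK_DOMAIN_TYPES


def connect_storage_server(dom_type, connect, cache):
    # connect: callable performing the actual connection.
    # cache: object exposing refreshStorage(), like vdsm's sdCache.
    connect()
    if should_invalidate_cache(dom_type):
        cache.refreshStorage()
```

Under this scheme, engine's periodic reconnect attempts to a broken NFS domain (step 2 of the picture above) would no longer trigger FC rescans at all.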
(In reply to Nir Soffer from comment #37)
> > IIUC, the problem can still happen without the extra cache invalidations. If
> > getVolumeSize gets blocked for long the monitor will be non-responsive.
> > Not having the NFS problem makes it less frequent though.
>
> Yes, but this is real storage issue, and in this case libvirt and qemu will
> get stuck anyway.

But if one domain has problems and is blocking the refresh, then all VMs will be non-responsive, even VMs that have no disks on that domain, right?

Also, there might be other reasons the refresh takes longer while the VMs are all running happily, like a high number of devices, or a problem with a few devices/paths which are not even being used.

No reply on the fc-scan yet; I will post here as soon as I get it.

(In reply to Germano Veit Michel from comment #38)
> But if 1 domain has problems and is blocking refresh, then all vms will be
> non-responsive, even VMs that have no disks on that domain, right?

Right, this is exactly the case here. The problem is lack of storage domain isolation.

Created attachment 1616201 [details]
fc-scan with devices
Created attachment 1616205 [details]
fc-scan without devices
Created attachment 1616207 [details]
fc-scan with devices
Uploading again with fixed whitespace.
I extracted the text from the pdf in attachment 1615732 [details] and uploaded:
- attachment 1616207 [details] - fc-scan with devices
- attachment 1616205 [details] - fc-scan without devices

Comparing total runtime:
- with devices: 4.6s
- without devices: 0.07s (65X faster)

Checking how time was spent checking devices:

We scanned 486 devices:

$ egrep 'Rescanned device .+:' fc-scan-with-devices.log | wc -l
486

Slowest scans:

$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | head
0.04
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03
0.03

Fastest scans:

$ awk '/Rescanned device .+:/ {print $5}' fc-scan-with-devices.log | sort -rn | tail
0.01
0.01
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00

Average scan:

$ awk '/Rescanned device .+:/ {sum+=$5; count+=1}; END {print sum/count}' fc-scan-with-devices.log
0.0179218

Total time spent in scanning:

$ awk '/Rescanned device .+:/ {sum+=$5}; END {print sum}' fc-scan-with-devices.log
8.71

We have some parallelism, using a thread per host. There is also time spent finding devices for every host, but we cannot tell how much from the log.

98.5% of the time is spent in scanning devices. Skipping device scanning when it is not needed will make slow scans a non-issue, avoiding delays in other parts of the system.

Germano, it would be nice if we can get text instead of pdf next time.
An easy way to collect the data is:

$ (time fc-scan -v) > fc-scan-with-devices.log 2>&1
$ (time fc-scan -v --no-devices) > fc-scan-without-devices.log 2>&1

Checking how devices are distributed across fc_hosts and rports:

$ egrep 'Rescanning device .+/host2/' fc-scan-with-devices.log | wc -l
243
$ egrep 'Rescanning device .+/host4/' fc-scan-with-devices.log | wc -l
243
$ egrep 'Rescanning device .+/host4/rport-4:0-2' fc-scan-with-devices.log | wc -l
3
$ egrep 'Rescanning device .+/host4/rport-4:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-4' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host4/rport-4:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host4/rport-4:0-6' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-2' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-3' fc-scan-with-devices.log | wc -l
82
$ egrep 'Rescanning device .+/host2/rport-2:0-4' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-5' fc-scan-with-devices.log | wc -l
38
$ egrep 'Rescanning device .+/host2/rport-2:0-6' fc-scan-with-devices.log | wc -l
3

We use only 2 threads - one per fc_host with devices. Can we get better parallelism by scanning with a thread per rport-target-N combination?
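The per-rport distribution above suggests splitting the work by the host/rport/target components of each device path instead of by fc_host alone. A minimal sketch of that grouping, assuming the sysfs path layout shown in this thread (the name group_by_target is ours, not fc-scan's):

```python
# Hypothetical sketch: group SCSI "rescan" node paths by their
# (host, rport, target) components, so a scanner could use one
# worker per target instead of one per fc_host.
import re
from collections import defaultdict

TARGET_RE = re.compile(r"/(host\d+)/(rport-[\d:.-]+)/(target[\d:]+)/")

def group_by_target(rescan_paths):
    """Map (host, rport, target) -> list of rescan node paths."""
    groups = defaultdict(list)
    for path in rescan_paths:
        m = TARGET_RE.search(path)
        if m:
            groups[m.groups()].append(path)
    return groups

# Paths taken from the examples quoted in this bug.
paths = [
    "/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan",
    "/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan",
    "/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:1/rescan",
]
groups = group_by_target(paths)
# Each group could then be handed to its own worker thread.
```

With a grouping like this, the number of workers scales with the number of targets rather than the number of fc_hosts, which is the parallelism question raised above.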
The relevant code is here:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/storage/fc-scan;h=7da6105971c7c253b20626cb2cd87509b0815778;hb=61404df72b75dc329a70c745b9e0af88ff7027a2

The relevant part is:

106         fd = os.open(path, os.O_WRONLY)
107         try:
108             os.write(fd, "1")
109         finally:
110             os.close(fd)

Can this work in parallel for multiple devices on different host-rport-target-N, like:

/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan

If it can work, we can speed up device scanning significantly.

Ben, I don't know if you can answer this, but maybe you can tell me who can answer this?

Summary of action items (in order of importance):

1. Disable device scanning when it is not needed.
   I think most refreshes do not need device scanning.
2. Optimize device scanning when it is needed (e.g. better parallelism).
   Important in getDeviceList and startup, when we must resize devices.
3. Never trigger rescans when handling file based domains. It is never right
   to rescan hosts and/or devices when handling an NFS domain.
4. Never trigger an fc_host scan when handling an iSCSI domain, or an
   iscsi_host scan when handling an FC domain (currently we always invoke both).
5. Minimize cache invalidations (sdCache.invalidateStorage); they may trigger
   a slow rescan blocking other flows. I think we can remove all calls.
6. Minimize cache refreshes (sdCache.refreshStorage); they may trigger a slow
   rescan blocking other flows. I think we can remove some calls.
7.
Consider changing the sampling method to use the current operation result instead of grouping waiters and performing a new operation with all waiters; this can cut blocking time by half.

(In reply to Nir Soffer from comment #49)
> We use only 2 threads - one per fc_host with devices. Can we get better parallelism
> by scanning with thread per rport-target-N combinations?

> Can this work in parallel for multiple devices on different host-rport-target-N, like:
>
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-2/target2:0:2/2:0:2:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-3/target2:0:3/2:0:3:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-4/target2:0:4/2:0:4:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-5/target2:0:5/2:0:5:0/rescan
> /sys/devices/pci0000:40/0000:40:02.0/0000:41:00.1/host2/rport-2:0-6/target2:0:6/2:0:6:0/rescan
>
> If it can work, we can speed up device scanning significantly.
>
> Ben, I don't know if you can answer this, but maybe you can tell me who can
> answer this?

Looking over the relevant kernel code briefly, I don't see any obvious reason why not. However, I really don't know the kernel scsi-layer code well, so Ewan might be a better person to ask.

It depends upon what you are doing. There is a mutex on each SCSI Host object taken around the scan operation when rescanning a host or a target. So if you use the "scan" sysfs node with wildcards on the host object, you will wait on the mutex if multiple threads are performing an operation on the same host.

Rescanning an already-discovered device, though, is different and does not take the mutex. So there is some potential for parallelism there.

However, one thing to be aware of is that scanning devices can take more time, sometimes a lot more time, if there is also I/O in progress to the device.
I have seen cases where it can take on the order of seconds for a device to respond to inquiry commands under I/O load, because it apparently services the I/O with higher priority. Rescanning a device sends several such inquiry commands to it to get all the VPD pages and attributes.

It is probably better to cache the information, and perhaps update it on a periodic rotating basis (rather than scanning everything all at the same time).

Note also that since 7.5, if a device reports a unit attention ASC/ASCQ 3F 03 INQUIRY DATA HAS CHANGED, the kernel may perform the rescan automatically. (A unit attention is only received if there is a command that can receive the status, so if there is no I/O to the device the reporting can be delayed or not received. If multipath is used, typically the path checker will receive the UA.)

This does not happen for ASC/ASCQ 2A 09 CAPACITY DATA HAS CHANGED, though. This had to do with multipath requiring the capacity on the underlying paths to match, so if some paths are seen to change but not others, the paths would not be used. (I think multipath may have been changed to not check this at some point, but I'm not certain.)

static void scsi_report_sense(struct scsi_device *sdev,
                              struct scsi_sense_hdr *sshdr)
{
        enum scsi_device_event evt_type = SDEV_EVT_MAXBITS;     /* i.e. none */

        if (sshdr->sense_key == UNIT_ATTENTION) {
                if (sshdr->asc == 0x3f && sshdr->ascq == 0x03) {
                        evt_type = SDEV_EVT_INQUIRY_CHANGE_REPORTED;
                        sdev_printk(KERN_WARNING, sdev,
                                    "Inquiry data has changed");
...

static void scsi_evt_emit(struct scsi_device *sdev, struct scsi_event *evt)
{
        int idx = 0;
        char *envp[3];

        switch (evt->evt_type) {
        case SDEV_EVT_MEDIA_CHANGE:
                envp[idx++] = "SDEV_MEDIA_CHANGE=1";
                break;
        case SDEV_EVT_INQUIRY_CHANGE_REPORTED:
                scsi_rescan_device(&sdev->sdev_gendev);    <=====
                envp[idx++] = "SDEV_UA=INQUIRY_DATA_HAS_CHANGED";
                break;
...
static ssize_t
store_rescan_field (struct device *dev, struct device_attribute *attr,
                    const char *buf, size_t count)
{
        scsi_rescan_device(dev);    <=====
        return count;
}
static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field);

(In reply to Ewan D. Milne from comment #52)
> It depends upon what you are doing.

I'll try to describe the RHV use case.

We have multiple hosts in a cluster (e.g. 50), all connected to the same LUNs. The reason is that we can try to run a VM using this storage on any of these hosts quickly, without changing the storage (e.g. mapping a LUN to some host). All the LUNs use multipath.

We create an LVM volume group with one or more LUNs, and create LVM logical volumes used as VM virtual disks. When we run a VM on some host, or perform some storage operation (e.g. cloning a virtual disk), we activate the logical volume on that host.

When the volume group becomes full, users can go and resize the LUN on the server. Since we don't manage the resize, we need to rescan the relevant device on all hosts so all hosts are on the same page and see the new size of the device. Once we rescan the devices on all hosts, we need to resize the LVM physical volume. This is done on a master host managing LVM metadata. Other hosts will see this update when they reload this volume group.

Optimally we would rescan only the device that was resized, but since we don't manage the resize, we don't know which device was resized. We could ask the user to select the device in the RHV UI, but that is a pretty hard task when you have a lot of devices. Also, the user resizing the device on the server is typically not the same user managing the RHV setup.

So instead of making users' lives hard, we rescan all the devices when a user asks to update device sizes. Then we display the devices that may be extended (device size > LVM physical volume size). The user can select the devices and trigger a resize of the physical volume.
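The "may be extended" check described above (device size > LVM physical volume size) can be sketched as a simple comparison. This is a hypothetical illustration with made-up names and sizes, not vdsm's actual API:

```python
# Hypothetical sketch of the resize-candidate check: a device may be
# extended when its current size exceeds the size the LVM physical
# volume was created with. All names and numbers here are ours.
def extendable_devices(device_sizes, pv_sizes):
    """Return names of devices whose size exceeds their PV size."""
    return sorted(
        name for name, size in device_sizes.items()
        if size > pv_sizes.get(name, size)
    )

# mpatha was grown on the storage server from 100G to 200G;
# mpathb was not touched.
devices = {"mpatha": 200 << 30, "mpathb": 100 << 30}
pvs = {"mpatha": 100 << 30, "mpathb": 100 << 30}
candidates = extendable_devices(devices, pvs)
```

Only the grown device shows up as a candidate, which is what lets the UI offer the pvresize to the user without knowing in advance which LUN was resized on the server.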
Since we manage a cluster of hosts, and a host can be disconnected from the cluster at any time, we need to make sure a host will rescan devices before connecting back to the cluster, in case there was an update while the host was not connected.

So we need an efficient way to scan a large number of devices. I think we test up to 300 LUNs, with 4 paths per LUN, total 1200 devices.

> Rescanning an already-discovered device, though, is different and does not take
> the mutex. So there is some potential for parallelism there.

Great.

> However, one thing to be aware of is that scanning devices can take more time,
> sometimes a lot more time, if there is also I/O in progress to the device. I have
> seen cases where it can take on the order of seconds for a device to respond to
> inquiry commands under I/O load, because it apparently services the I/O with higher
> priority. Rescanning a device sends several such inquiry commands to it to get
> all the VPD pages and attributes.

Yes, we learned this the hard way; scanning 400 devices took about 25 seconds with 2 threads (thread per host). I hope that we can cut the time by 4 by having one thread per host-rport-target-N.

> It is probably better to cache the information, and perhaps update it on a periodic
> rotating basis (rather than scanning everything all at the same time).

The goal is to rescan devices only:
- when a user needs to perform an operation with devices, like creating a new LVM
  volume group in the management system. At this point the user wants to see the
  current size of the devices.
- when a user performs a resize of the LVM physical volume, on all hosts.
- when connecting back to the cluster after a host was in maintenance mode, or
  after restarting the management program on a host.

Daniel, this is mostly a scale issue, affecting users with large setups (many hosts, many VMs, many LUNs, many storage domains).
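The parallel rescan idea discussed above (write "1" to many rescan nodes concurrently instead of one thread per fc_host) can be sketched with a thread pool. Real sysfs rescan nodes are root-only, so this demo writes to temp files standing in for them; the write logic mirrors the fc-scan snippet quoted earlier, and the pool size is an arbitrary choice for illustration:

```python
# Hypothetical sketch of parallel device rescans using a thread pool.
# Temp files stand in for /sys/.../rescan nodes so the demo is runnable
# without root; on a real host the paths would be the sysfs nodes.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def rescan_device(path):
    # Same write that triggers a SCSI device rescan via sysfs.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, b"1")
    finally:
        os.close(fd)

def rescan_all(paths, max_workers=32):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so worker exceptions propagate.
        list(pool.map(rescan_device, paths))

# Demo: 8 fake rescan nodes in a temp directory.
tmpdir = tempfile.mkdtemp()
nodes = [os.path.join(tmpdir, "rescan-%d" % i) for i in range(8)]
for n in nodes:
    open(n, "w").close()
rescan_all(nodes)
```

Since each rescan is dominated by waiting on the device (inquiry commands under I/O load), a pool much larger than the CPU count still pays off, which matches the later measurements in this bug where run time kept dropping up to 128-256 workers.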
We have an initial patch for speeding up device scanning:
https://gerrit.ovirt.org/c/103455/

To test this patch I need access to 1-2 hosts connected to FC storage with many LUNs. I think that 300 LUNs, 4 paths per LUN, total 1200 SCSI devices can be a good test.

We need to test:
- time to scan devices on an idle host
- same when scanning from multiple hosts at the same time
- same when there is I/O on some devices (e.g. copying disks while scanning)

Created attachment 1619485 [details]
graph: fc-scan with 222 devices
The graph shows that we get about a 10x speedup with 20 threads; more threads do not improve results significantly.
Previously we were using a thread per fc_host (typically 1 thread), so the current
patches can give ~5x improvement.
We did not test yet with larger setups and with I/O load.
I tested again with a bigger setup with 327 multipath FC LUNs with 4 paths per LUN, total 1311 paths.

# lsblk -s | grep mpath | wc -l
327
# lsblk -s | grep disk | wc -l
1311

Tested with the current fc-scan from 4.3:

# rpm -q vdsm
vdsm-4.30.30-1.el7ev.x86_64

And with fc-scan from this patch:
https://gerrit.ovirt.org/c/103455/

fc-scan from 4.3 uses one thread per port, so we get 2 threads.

First I reproduced the issue of slow fc-scan by running the scan while copying 10 100 GiB disks using qemu-img. When the system is idle, fc-scan took 10.971 seconds. With load, it took 26.097 seconds.

With the new fc-scan, using a thread pool of 64 threads, run time improved to 1.081 seconds (10x faster) when the system is idle, and 1.356 seconds under load (19x faster). We get most of the benefit using a smaller thread pool (e.g. 32), but a bigger thread pool makes fc-scan more resilient if scanning some LUNs is significantly slower.

Here are the complete results:

workers   current/idle   current/load   fix/idle   fix/load
-----------------------------------------------------------
1         -              -              19.926     41.250
2         10.971         26.097         10.796     21.453
4         -              -              5.930      11.538
8         -              -              3.467      6.168
16        -              -              2.134      3.387
32        -              -              1.434      2.087
64        -              -              1.081      1.356
128       -              -              0.855      0.965
256       -              -              0.819      0.917
512       -              -              0.880      0.963

This graph makes the numbers clearer:
https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vQk37hO7sn-v6eLAHFwo7-3Z1W-5XAVQ8j-EUoGvZjfbIfEToWhK3Sf6RWQxu7BdqOHwoLC69bf22Oe/pubchart?oid=1308885970&format=image

Times are the average of 10 runs, measured like this:

# (for i in $(seq 10); do PYTHONPATH=lib /usr/bin/time -f "%e" lib/vdsm/storage/fc-scan 2>&1; done) | awk '{sum+=$1} END{print sum/NR}'

Created attachment 1628588 [details]
graph: fc-scan with 1311 devices
Created attachment 1628589 [details]
script for creating I/O load
Created attachment 1628590 [details]
Output of fc-scan -v on idle system - current version
Created attachment 1628592 [details]
Output of fc-scan -v with I/O load - current version
Created attachment 1628593 [details]
Output of fc-scan -v on idle system - fix
Created attachment 1628595 [details]
Output of fc-scan -v with I/O load - fix
Created attachment 1628596 [details]
Output of lsblk -s on the test system
Created attachment 1628597 [details]
Output of multipath -ll on test system
How to reproduce:

1. Use a system with one host and many devices (slows down fc-scan)
2. Start several VMs on the host
3. Add an ISO domain on a temporary NFS server that you will be able
   to block during the test. No need to attach any ISO to the VMs,
   just have the domain in the DC.
4. Simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
5. Block access to the ISO domain. After a couple of minutes the system will
   try to recover the domain by refreshing and invalidating caches.

Expected result:
- Some VMs will flip between Up and Non-responsive state
- Warnings in vdsm log about blocked workers
- Possible fc-scan timeouts

Verifying the fix:

Same flow with a version including the attached patch. None of the results are expected with the fix.

(In reply to Nir Soffer from comment #72)
> How to reproduce:
>
> 1. Use system with one host and many devices (slows down fc-scan)
> 2. Start several VMs on the host
> 3. Add ISO domain on a temporary NFS server, that you will be able
>    to block during the test. No need to attach any ISO to the VMs,
>    just to have the domain in the DC.
> 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> 5. Block access to the ISO domain. After couple of minutes the system will
>    try to recover the domain by refreshing and invalidating caches.

Just to complement: doing 1, 2, 4 and then adding a new storage domain should also reproduce. It's the use case of SFDC02251126, also attached to this BZ.

Merged upstream in:

commit feabd12a224bab87dd2d205c9034be35ee51c636
Author: Nir Soffer <nsoffer>
Date:   Sun Sep 22 14:05:20 2019 +0300

    fc-scan: Speed up device scanning

Mordechai, please see comment 72 and comment 73 for testing/verifying.
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops

Hi Germano, please review the doc text that I updated:

"When a system had many FC LUNs with many paths per LUN, and a high I/O load, scanning of FC devices became slow, causing timeouts in monitoring VM disk size, and making VMs non-responsive. In this release, FC scans have been optimized for speed, and VMs are much less likely to become non-responsive."

WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops

Tested on versions:
redhat-release-8.1-3.3
rhv-release-4.4.0-13
vdsm-4.40.0-164

Preparation steps:
Create an additional 300 LUNs ("L0_Group_1" cluster) with 4 paths each.
Create 10 disks (4 x 500GB, 6 x 200GB, raw) on a new storage domain ("L0_Group_1" cluster).
Define an ISO domain.
Verify the JMeter script is working OK (move disk between storage domains).

[root@f01-h06-000-r620 ~]# multipath -ll | grep -c LUN
304
[root@f01-h06-000-r620 vdsm]# lsblk -s | grep mpath | wc -l
397
[root@f01-h06-000-r620 vdsm]# lsblk -s | grep disk | wc -l
1591

Testing:
50 VMs running on 1 host
JMeter running - move disks between storage domains (on the same cluster) - 10 disks in parallel

On the host running:
rescan-scsi-bus.sh
multipath -ll
Create & Manage Storage Domain (load all the LUN IDs)

Verify log files:
No "monitor became unresponsive" messages in vdsm.log
No "ERROR" messages in engine.log & vdsm.log

(In reply to David Vaanunu from comment #86)
> Tested on versions:
>
> redhat-release-8.1-3.3

We should test on 8.2 now.

...

> Testing:
>
> 50 VMs running on 1 host
> JMeter running - move disks between storage domains (on the same cluster) -
> 10 disks in parallel
>
> On the host running:
> rescan-scsi-bus.sh

Why? This should never be used.

> multipath -ll

Also not related to this test.

> Create & Manage Storage Domain (Load all the LUNs ID)

Good, this invalidates the caches, which will trigger an fc-rescan, which was the slow part.

> Verify Log files:
>
> No "monitor became unresponsive" messages in vdsm.log

Good, but this is useful only if we had an inaccessible ISO domain.
> No "ERROR" messages in engine.log & vdsm.log

This is not enough.

I described how to test this in comment 72:

> How to reproduce:
>
> 1. Use system with one host and many devices (slows down fc-scan)
> 2. Start several VMs on the host
> 3. Add ISO domain on a temporary NFS server, that you will be able
>    to block during the test. No need to attach any ISO to the VMs,
>    just to have the domain in the DC.
> 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> 5. Block access to the ISO domain. After couple of minutes the system will
>    try to recover the domain by refreshing and invalidating caches.
>
> Expected result:
> - Some VMs will flip between Up and Non-responsive state
> - Warnings in vdsm log about blocked workers
> - Possible fc-scan timeouts
>
> Verifying the fix:
>
> Same flow with version including the attached patch. None of the results
> are expected with the fix.

Did you reproduce the issue without the fix? If you did not, how do you know that your verification was correct? Maybe you did not trigger the case leading to the issue.

Did you block access to the ISO domain during the test?

Did you check for warnings in vdsm logs about blocked workers? You need to grep for ' WARN ' in the vdsm log.

Please attach the vdsm log from one of the hosts during the tests. I would like to see the stats from the "Scanning FC devices:" INFO log.

(In reply to Nir Soffer from comment #87)
> (In reply to David Vaanunu from comment #86)
> > Tested on versions:
> >
> > redhat-release-8.1-3.3
>
> We should test on 8.2 now.
>
> ...
> > Testing:
> >
> > 50 VMs running on 1 host
> > JMeter running - move disks between storage domains (on the same cluster) -
> > 10 disks in parallel
> >
> > On the host running:
> > rescan-scsi-bus.sh
>
> Why? this should never be used.
>
> > multipath -ll
>
> Also not related to this test.
> > Create & Manage Storage Domain (Load all the LUNs ID)
>
> Good, this invalidates the caches, which will trigger fc-recan,
> which was the slow part.
>
> > Verify Log files:
> >
> > No "monitor became unresponsive" messages in vdsm.log
>
> Good, but this is useful only if we had inaccessible ISO domain.
>
> > No "ERROR" messages in engine.log & vdsm.log
>
> This is not enough.
>
> I described how test this in comment 72:
>
> > How to reproduce:
> >
> > 1. Use system with one host and many devices (slows down fc-scan)
> > 2. Start several VMs on the host
> > 3. Add ISO domain on a temporary NFS server, that you will be able
> >    to block during the test. No need to attach any ISO to the VMs,
> >    just to have the domain in the DC.
> > 4. simulate high I/O load by moving 10 big raw disks (e.g. 500G) concurrently
> > 5. Block access to the ISO domain. After couple of minutes the system will
> >    try to recover the domain by refreshing and invalidating caches.
> >
> > Expected result:
> > - Some VMs will flip between Up and Non-responsive state
> > - Warnings in vdsm log about blocked workers
> > - Possible fc-scan timeouts
> >
> > Verifying the fix:
> >
> > Same flow with version including the attached patch. None of the results
> > are expected with the fix.

Tried to reproduce it according to comment #73 (without an ISO domain).

> Did you reproduce the issue without the fix? If you did not, how do you
> know that your verification was correct? Maybe you did not trigger the
> case leading to the issue.

Regarding the vdsm version - it should be in 4.4 (mail from you).

> Did you block access to the ISO domain during the test?
>
> Did you check warnings in vdsm logs about blocked workers?
> You need to grep for ' WARN ' in vdsm log.
>
> Please attach vdsm log from one of the hosts during the tests.
> I would like to see the stats from "Scanning FC devices:" INFO log.
vdsm log files (test run 04:00-05:00 EDT):
https://drive.google.com/open?id=1WDtWUVFOZjF-_rjcLLnMEvQ6AW1XaFP5

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246