Description of problem:
Virtual machines sometimes do not resume after a short storage outage.

Version-Release number of selected component (if applicable):
vdsm-4.16.26-1

How reproducible:
Sometimes, depends on timing

Steps to Reproduce:
1. Cause a storage failure (FC/iSCSI)
2. Wait until the VMs are paused
3. Immediately recover the storage

Actual results:
VMs remain in the paused state.

Expected results:
VMs resume when the storage recovers.

Additional info:
Currently the VMs are resumed by emitting the contEIOVms event in onDomainStateChange. This event is not based on the VM state, but on the storage domain state. That means that VMs go to the paused state independently of the storage domain. The state check is triggered by a storage related issue detected from the SD monitoring thread, which checks the SD every 10 sec. So in some rare cases the storage can go down and come back up between two checks. That means the VMs are paused because they are writing all the time, while the monitoring is not, so it does not notice the outage.

Multipathd checks the paths every 20 sec, and if a storage issue is detected the polling interval is changed to 5 sec, so the paths can come back up within 5 seconds. So if a short outage happens on all of the paths up to 5 sec after the last storage check, and multipath checks the storage at the same time, then the storage outage is not detected by vdsm and the event is not emitted.

Example:

Last data is read from the problematic SD at 2016-04-13 00:15:30

Thread-16::DEBUG::2016-04-13 00:15:30,020::blockSD::336::Storage.Misc.excCmd::(readlines) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2048 bytes (2.0 kB) copied, 8.41896 s, 0.2 kB/s\n'; <rc> = 0

Then the VMs are paused at 00:15:39, and the next storage operation is triggered at 00:15:44 and finishes at 00:15:45:

libvirtEventLoop::INFO::2016-04-13 00:15:39,478::vm::4906::vm.Vm::(_onIOError) vmId=`836814fc-d624-4e76-8733-f941a7a04473`::abnormal vm stop device virtio-disk1 error eio
...
Thread-16::DEBUG::2016-04-13 00:15:44,958::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs ....
Thread-16::DEBUG::2016-04-13 00:15:45,358::lvm::291::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0

So:
00:15:30    => successful operation
00:15:3?    => multipath detects all paths down. The polling interval can reach up to 20 sec if there was no issue before: it checks the storage every 5, 10, 15 and finally 20 sec as long as there is no issue. As soon as there is an issue, the interval is reset to 5 sec.
00:15:39    => VM paused
00:15:39-44 => multipath detects at least one path up again. The polling interval is now 5 sec due to the detected issue, so the first possible time when the paths can be set up again is 5 sec after the paths are detected down.
00:15:44    => next successful SD monitoring IO operation
If the storage becomes unresponsive (= pausing the VMs) and responsive again between monitoring cycles, the VMs won't be unpaused. This is a known limitation, and I doubt we'd invest in fixing it, let alone for a z-stream. Need to dig deeper into the logs, though, and confirm this is actually the issue.
(In reply to Allon Mureinik from comment #8)
> If the storage becomes unresponsive (= pausing the VMs) and responsive again
> between monitoring cycles, the VMs won't be unpaused. This is a known
> limitation, and I doubt we'd invest in fixing it, let alone for a z-stream.
>
> Need to dig deeper into the logs, though, and confirm this is actually the
> issue.

Note the usual suspects for such an issue to happen in real life:
1. No HA (for example, the FC or iSCSI switch) - and it reboots itself.
2. NDU (non-disruptive upgrade) of the storage firmware - in most of them, there's a short 'change world' action, where all user-mode components or whatnot have to be upgraded at once, and a brief IO error is expected.

Usually this is dealt with by the lower SCSI layers (retransmission), so it should not even be a visible issue to the clients.
(In reply to Roman Hodain from comment #0)

Storage domain monitors are doing only reads from storage, and only from a single volume (metadata). So if there is a short storage outage that causes a write to fail, the monitor cannot notice this outage.

To mitigate this issue we can try to use higher values for the multipath no_path_retry option - the default is "no_path_retry fail", which is optimal for vdsm. We can try a setting like "no_path_retry 20", which will queue io for 20 seconds when all paths are down. If the outage is shorter, vms will not freeze, so we don't have to resume them. If the outage is longer and they freeze, there is a higher chance that the domain monitor will detect the outage, and frozen vms will be resumed when the next storage check succeeds.

To try this solution, you should add the VDSM PRIVATE tag to multipath.conf, and modify the no_path_retry value. Here is an example multipath.conf:

# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

devices {
    device {
        all_devs                yes
        no_path_retry           20
    }
}

The downside of this setting is that commands run by vdsm, such as lvm, may now block for an additional 20 seconds during a storage outage, which may cause delays in various flows, including storage domain monitoring.

It is clear that we cannot reliably detect a short storage outage, and this is also true for multipath, which monitors paths in a similar way (at least for the direct checker).

What we can do is introduce a new mode in the storage domain monitor, entered when a vm using that storage domain was paused because of an IO error. While the monitor is in this mode, it will submit up events on each cycle, so even if the domain did not detect an outage, the next successful check will resume the vm. Once all vms are resumed, we return to the normal mode where up events are submitted only when storage has changed from the down state to the up state.

I would try first to tune the multipath configuration so we can sustain a short outage.
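A minimal sketch of the "new mode" idea above, only to illustrate the intended behavior - the class and callback names are hypothetical and are not vdsm's actual monitoring code:

class DomainMonitor(object):
    # Illustrative monitor for one storage domain (hypothetical, not vdsm code).

    def __init__(self, sd_uuid, check_storage, emit_domain_up, has_paused_vms):
        self.sd_uuid = sd_uuid
        self._check_storage = check_storage    # callable: True if the read check succeeded
        self._emit_domain_up = emit_domain_up  # callable: triggers contEIOVms for this domain
        self._has_paused_vms = has_paused_vms  # callable: True if a vm on this domain is paused on EIO
        self._was_valid = True

    def cycle(self):
        valid = self._check_storage()
        if valid:
            # Normal mode: emit an up event only on a down -> up transition.
            # Forced mode: while some vm is still paused on EIO, emit the up
            # event on every successful cycle, so an outage that the monitor
            # missed still resumes the vms on the next good check.
            if not self._was_valid or self._has_paused_vms(self.sd_uuid):
                self._emit_domain_up(self.sd_uuid)
        self._was_valid = valid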
Ben, do you have anything to add on multipath tuning that can help here? Do you think we can improve detection of short outages by querying multipathd status?
Increasing the no_path_retry option makes sense to me.

Using something like the following could help (you would probably want to use "multipathd show maps format raw" instead for easier parsing):

# multipathd show maps format "%n %N %3 %4"
name   paths total_q_time q_timeouts
mpathc 2     0            0
mpathb 2     0            0
mpathf 1     0            0
mpathe 1     0            0
mpathi 1     0            0

%n is the device name (you could also use uuid, or whatever is easiest)
%N shows you the number of active paths.
%3 shows the total number of seconds that this device has spent queuing. It is not reset when paths are restored.
%4 shows you the number of times that the device has switched to queuing mode. This is probably the only information you really need besides the device name/uuid/major:minor/etc. identifier you want to use. Unfortunately, this will always be zero unless you set no_path_retry to enable some length of queuing. I could change %4 to report the number of times that you would have started queuing if multipath had been set up to do so. This would tell you how many times you had lost all your paths.

There is no way to directly figure out what your current queuing state is. This should probably be added. Also, in playing around with this, I noticed that trying to get your state by examining the features line with the %f wildcard doesn't really work right, since it is only accurate as of the start of the last path check (which means that when you disable queuing, it won't display that for 5 more seconds). This definitely needs to get fixed. You can obviously grab the current features from "multipath -ll", or you can use "multipathd show maps topology" (which resyncs the multipathd state with the kernel state before displaying it, as it should be whenever you run a "show maps" command), but it would be better if you could grab all the information at one point in time, with one command.

multipathd will never be able to tell you if there was an IO error on a multipath device. But, short of that, I assume what you want is to know if multipathd ever had an all-paths-down situation, where an IO error could have happened due to connectivity issues. Right now, this is only possible if you have some sort of queueing enabled, but even no_path_retry 1 would allow you to check the number of times that queueing has been enabled (with the %4 format wildcard), which would let you know if you'd lost all paths since the last time you checked. Like I said, I can change multipath to report this information even if no_path_retry is set to fail, if that would be helpful to you.
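For what it's worth, a small sketch of how a monitoring process could consume that output; it assumes the format string and column order shown above, and the function name is only an illustration:

import subprocess

def maps_that_queued():
    """Return names of multipath maps whose q_timeouts counter is non-zero,
    i.e. maps that entered queueing mode (all paths down) at some point."""
    out = subprocess.check_output(
        ["multipathd", "show", "maps", "format", "%n %N %3 %4"])
    queued = []
    for line in out.decode().splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) != 4:
            continue
        name, active_paths, total_q_time, q_timeouts = fields
        if int(q_timeouts) > 0:
            queued.append(name)
    return queued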
I wonder if we should touch the defaults, or just leave this to the documentation?
Regarding no_path_retry: I suppose that this could cause more issues than the undetected SD outage. We know from the past that the monitoring is highly affected by delays from the storage. I am not sure if we should go this way. We also have to consider the fact that the monitoring thread does not run exactly every 10 sec; it can be more due to the storage refresh. In that case the 20 sec does not have to be enough. Obviously the probability is much lower.
(In reply to Roman Hodain from comment #14)
> Regarding no_path_retry: I suppose that this could cause more issues than
> the undetected SD outage. We know from the past that the monitoring is
> highly affected by delays from the storage. I am not sure if we should go
> this way. We also have to consider the fact that the monitoring thread does
> not run exactly every 10 sec; it can be more due to the storage refresh. In
> that case the 20 sec does not have to be enough. Obviously the probability
> is much lower.

We have improved the monitoring considerably for 4.0.
I'd be happy if we can look at the solution from multipath and below - short outages should be handled by the SCSI layer, from my perspective.
(In reply to Roman Hodain from comment #14)
> Regarding no_path_retry: I suppose that this could cause more issues than
> the undetected SD outage. We know from the past that the monitoring is
> highly affected by delays from the storage. I am not sure if we should go
> this way. We also have to consider the fact that the monitoring thread does
> not run exactly every 10 sec; it can be more due to the storage refresh. In
> that case the 20 sec does not have to be enough. Obviously the probability
> is much lower.

We have two issues here:
- vms get paused after a short outage
- monitoring does not detect a short outage

With 3.6, using higher no_path_retry values may prevent pausing, so not detecting the outage is not interesting.

With 4.0, the chance that we miss an outage is much smaller, since storage checks are not affected by the rest of the system (e.g. storage refresh). Storage read checks are performed exactly every 10 seconds. The only reason a storage check is delayed is slow or inaccessible storage. See bug 1081962.

Currently we do not test short storage outages. If this is a common problem, we should probably start testing it regularly.

I think the next step is to reproduce this issue in the lab, and test if multipath tuning helps.
(In reply to Allon Mureinik from comment #13)
> I wonder if we should touch the defaults, or just leave this to the
> documentation?
This requires a lot of testing; I would not touch the defaults at this point.
We have made improvements in 4.0 to this use case. We can consider additional improvements, but this is not material for a z-stream. Moving to 4.1 to consider what can be enhanced.
(In reply to Nir Soffer from comment #18)
> (In reply to Roman Hodain from comment #14)
> > Regarding no_path_retry: I suppose that this could cause more issues than
> > the undetected SD outage. We know from the past that the monitoring is
> > highly affected by delays from the storage. I am not sure if we should go
> > this way. We also have to consider the fact that the monitoring thread
> > does not run exactly every 10 sec; it can be more due to the storage
> > refresh. In that case the 20 sec does not have to be enough. Obviously the
> > probability is much lower.
>
> We have two issues here:
> - vms get paused after a short outage
> - monitoring does not detect a short outage
>
> With 3.6, using higher no_path_retry values may prevent pausing, so not
> detecting the outage is not interesting.

As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14, is it safe to increase the no_path_retry parameter? I am asking especially because of

https://gerrit.ovirt.org/#/c/45735/

If we increase this we can get into trouble in 3.6 in case there really is an outage on one of the storages, or just a problem with one of the LUNs.

> With 4.0, the chance that we miss an outage is much smaller, since storage
> checks are not affected by the rest of the system (e.g. storage refresh).
> Storage read checks are performed exactly every 10 seconds. The only reason
> a storage check is delayed is slow or inaccessible storage.
> See bug 1081962.

This does not seem to be very good. This is something that has to be prevented, not just saying that the probability is lower.

> Currently we do not test short storage outages. If this is a common problem,
> we should probably start testing it regularly.

Yes, we should test that, as this has already happened in more environments, and repeatedly in one specific environment.

> I think the next step is to reproduce this issue in the lab, and test if
> multipath tuning helps.

Sure. Can QE handle this?
(In reply to Roman Hodain from comment #21)
> As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14,
> is it safe to increase the no_path_retry parameter? I am asking especially
> because of
>
> https://gerrit.ovirt.org/#/c/45735/
>
> If we increase this we can get into trouble in 3.6 in case there really is
> an outage on one of the storages, or just a problem with one of the LUNs.

It is safe to increase the value of no_path_retry, but of course if you increase it too much you will add unwanted delays. Basically this is the only thing we can do at this point in 3.6.

> > With 4.0, the chance that we miss an outage is much smaller, since storage
> > checks are not affected by the rest of the system (e.g. storage refresh).
> > Storage read checks are performed exactly every 10 seconds. The only
> > reason a storage check is delayed is slow or inaccessible storage.
> > See bug 1081962.
>
> This does not seem to be very good. This is something that has to be
> prevented, not just saying that the probability is lower.

Preventing this means redesigning the way we detect outages, as I described in comment 10. There is no way we can do this for 3.6, maybe for 4.0.

> > Currently we do not test short storage outages. If this is a common
> > problem, we should probably start testing it regularly.
>
> Yes, we should test that, as this has already happened in more environments,
> and repeatedly in one specific environment.
>
> > I think the next step is to reproduce this issue in the lab, and test if
> > multipath tuning helps.
>
> Sure. Can QE handle this?

Gil, can we make progress with this?
>> As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14,
>> is it safe to increase the no_path_retry parameter? I am asking especially
>> because of
>>
>> https://gerrit.ovirt.org/#/c/45735/
>>
>> If we increase this we can get into trouble in 3.6 in case there really is
>> an outage on one of the storages, or just a problem with one of the LUNs.

> It is safe to increase the value of no_path_retry, but of course if you
> increase it too much you will add unwanted delays.

Just to make sure: as we run the check every 10 sec (I do not count the storage refresh, which we will not be able to address by this), we need to set no_path_retry to 2, as the polling interval is set to 5 sec, so that we are at least aligned with the 10 sec interval. This means that in the worst case a storage operation can take up to 30 sec:

4 * polling interval in case of no previous issue
2 * polling interval to fail the scsi commands after the failure

So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec. I would go with something like 60 sec, as we are rather interested in keeping the env up than in reporting issues.

What do you think?
(In reply to Roman Hodain from comment #23)
> Just to make sure: as we run the check every 10 sec (I do not count the
> storage refresh, which we will not be able to address by this), we need to
> set no_path_retry to 2, as the polling interval is set to 5 sec, so that we
> are at least aligned with the 10 sec interval.

I don't see a need to match the 10 second interval. It is ok if the storage monitor blocks on inaccessible storage for more time. Engine will detect this using the lastCheck value. If you don't exceed the limit (30 seconds by default), engine does not take any action.

If you exceed the lastCheck limit, engine will start a 5 minute timer, and if the domain did not recover during these 5 minutes, the domain (if faulty on all hosts) or the host (if the domain is ok on other hosts) will be deactivated.

The problem with timeouts is delaying refresh operations. In the worst case, all domains have a short outage, they refresh at the same time, and the refresh takes more than 5 minutes for some of the domains. In 3.6, because monitoring is blocked by refresh operations, this will cause engine to think a domain is in problem for more than 5 minutes, which can cause the domain or the host to be deactivated.

The timeout should be long enough to hide a short outage on the storage, preventing vms from pausing, and short enough to prevent storage refresh operations from taking more than 5 minutes, causing a domain to be deactivated.

> This means that in the worst case a storage operation can take up to 30 sec:
>
> 4 * polling interval in case of no previous issue

What is this value?

> 2 * polling interval to fail the scsi commands after the failure

I would start with 20 seconds (4 retries), and see how it affects the system.

You should keep the timeout lower if you have many storage domains, and can increase the timeout if you have only a few storage domains.

> So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec.
> I would go with something like 60 sec, as we are rather interested in
> keeping the env up than in reporting issues.

In 3.6 this is probably a good idea.

Can you give more details about the setup: how many storage domains are there, how many storage servers, and when they have a short outage, does it affect a single storage domain or all storage domains on the same server?

We need this info to reproduce this issue in the lab.
(In reply to Nir Soffer from comment #24)
> (In reply to Roman Hodain from comment #23)
> > Just to make sure: as we run the check every 10 sec (I do not count the
> > storage refresh, which we will not be able to address by this), we need to
> > set no_path_retry to 2, as the polling interval is set to 5 sec, so that
> > we are at least aligned with the 10 sec interval.
>
> I don't see a need to match the 10 second interval. It is ok if the storage
> monitor blocks on inaccessible storage for more time. Engine will detect
> this using the lastCheck value. If you don't exceed the limit (30 seconds by
> default), engine does not take any action.
>
> If you exceed the lastCheck limit, engine will start a 5 minute timer, and
> if the domain did not recover during these 5 minutes, the domain (if faulty
> on all hosts) or the host (if the domain is ok on other hosts) will be
> deactivated.
>
> The problem with timeouts is delaying refresh operations. In the worst case,
> all domains have a short outage, they refresh at the same time, and the
> refresh takes more than 5 minutes for some of the domains. In 3.6, because
> monitoring is blocked by refresh operations, this will cause engine to think
> a domain is in problem for more than 5 minutes, which can cause the domain
> or the host to be deactivated.
>
> The timeout should be long enough to hide a short outage on the storage,
> preventing vms from pausing, and short enough to prevent storage refresh
> operations from taking more than 5 minutes, causing a domain to be
> deactivated.
>
> > This means that in the worst case a storage operation can take up to 30
> > sec:
> >
> > 4 * polling interval in case of no previous issue
>
> What is this value?

polling_interval 5
max_polling_interval 20

So when the device is marked up again, the polling_interval grows 5, 10, 15, 20.

> > 2 * polling interval to fail the scsi commands after the failure
>
> I would start with 20 seconds (4 retries), and see how it affects the
> system.
>
> You should keep the timeout lower if you have many storage domains, and
> can increase the timeout if you have only a few storage domains.
>
> > So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec.
> > I would go with something like 60 sec, as we are rather interested in
> > keeping the env up than in reporting issues.
>
> In 3.6 this is probably a good idea.
>
> Can you give more details about the setup: how many storage domains are
> there, how many storage servers, and when they have a short outage, does it
> affect a single storage domain or all storage domains on the same server?
>
> We need this info to reproduce this issue in the lab.

20 FC SDs
29 LUNs
1 ISO
1 export

Most probably all of the SDs were affected. Here is what we know:

Regarding the main cause of the storage event, our storage team says that NetApp detected in a log an error on the Fibre Channel port caused by a driver failure, and they are working to solve the issue.

So I assume that all of them were affected, as all of the SDs are connected to the same storage.
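To spell out the worst-case arithmetic from comment 23 with these values (purely an illustration, using the numbers discussed in this bug):

polling_interval = 5       # seconds (multipath.conf)
max_polling_interval = 20  # interval grows 5, 10, 15, 20 while the paths are healthy
no_path_retry = 2          # checker intervals to keep queueing before failing io

# Worst case: all paths go down right after a check done at the maximum
# interval, then io is queued for no_path_retry more checker intervals
# before it fails.
worst_case = max_polling_interval + no_path_retry * polling_interval
print(worst_case)  # 30 seconds, hence the suggestion to raise MaxStorageVdsTimeoutCheckSec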
The attached patch should avoid the first issue, vms pausing after a short storage outage. It does not fix the issue of not detecting a short outage that did cause some vms to pause. Fixing this requires a redesign of the vm unpausing mechanism.

We need to learn how to simulate a short storage outage, and test this patch.
When testing this patch, we should make sure there is no regression with unzoned devices - bug 880738. I expect this change to add a 20 second delay the first time an unzoned device is detected; after that, any access to this device will fail immediately.
Executed the following test plan (which includes the flows in https://bugzilla.redhat.com/show_bug.cgi?id=880738#c60):
https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/testrun?id=3_5_Storage_Dynamic_detection_FC_devices_run_28_07_16

Used Nir's patch https://gerrit.ovirt.org/#/c/61281/

All passed.
Do we want that patch in?
(In reply to Yaniv Dary from comment #32) > Do we want that patch in? It needs a review cycle first, but the idea looks more or less right.
(In reply to Allon Mureinik from comment #33) > (In reply to Yaniv Dary from comment #32) > > Do we want that patch in? > It needs a review cycle first, but the idea looks more or less right. Should we move this bug back to 4.1 for tracking?
(In reply to Yaniv Dary from comment #34) > (In reply to Allon Mureinik from comment #33) > > (In reply to Yaniv Dary from comment #32) > > > Do we want that patch in? > > It needs a review cycle first, but the idea looks more or less right. > > Should we move this bug back to 4.1 for tracking? Sure, might as well.
The attached patch may help to avoid vm pause in case of a short storage outage, but it will not help to resume paused vms. Our current mechanism to unpause vms is broken and cannot be fixed in 4.1. The issue is explained in another bug; I cannot find that bug now.
Nir, patch https://gerrit.ovirt.org/#/c/61281/ is merged. Is there anything else missing, or can this be moved to MODIFIED?
The patch was reverted in https://gerrit.ovirt.org/77902 because of bug 1459370.
(In reply to Nir Soffer from comment #38) > The patch was reverted in https://gerrit.ovirt.org/77902 because of bug > 1459370. The above BZ is verified for 7.5 and its 7.4.z (https://bugzilla.redhat.com/show_bug.cgi?id=1510837 ) is in VERIFIED and in REL_PREP. Can we look at this again?
(In reply to Yaniv Kaul from comment #39)
> The above BZ is verified for 7.5 and its 7.4.z
> (https://bugzilla.redhat.com/show_bug.cgi?id=1510837) is in VERIFIED and in
> REL_PREP. Can we look at this again?
Once the fix is available, we will require it and use it in our multipath.conf.
The package is not available yet, according to the errata, it will be available on 2018-Jan-23.
Still not available, according to the errata, it will be available on 2018-Jan-25.
(In reply to Nir Soffer from comment #43) > Still not available, according to the errata, it will be available on > 2018-Jan-25. The platform fix is now available in device-mapper-multipath-0.4.9-111.el7.1 (see bug 1510837 for details). Can we move forward with this?
I posted the configuration change again.
Partly fixed upstream by:

- Using "no_path_retry 4" in multipath.conf - this gives vms 20 seconds of grace time (assuming polling_interval=5) before failing io when all paths are down. Multipath enters recovery mode, checking the paths every 5 seconds. If any path is restored during the recovery time, the io may succeed and the vm will not be paused.

- Failing storage domains on read timeout - when the vdsm monitor read times out, the storage domain is moved to the INVALID state. Previously the storage domain was moved to this state only when the read failed. The next successful read will move the domain to the VALID state, and at this point we try to resume paused vms.

Not fixed:

- VM paused because of an issue with a non-metadata LUN. Vdsm monitors only the first (metadata) LUN, so issues with other LUNs will not be detected. This can be fixed by using multipath events: when any LUN belonging to a storage domain goes down (no path is available), we can switch the domain to the INVALID state.

- VM paused because of a write() error, while the vdsm monitor uses only read(). It is not clear yet how we can fix this. Maybe create a canary lv on every PV when adding a PV to the system, and monitor this using both read and write, instead of the metadata lv.

I think we can file RFEs for the issues we did not fix, since the fixes we did make should be useful now.
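For illustration only, the defaults section with the new value would look roughly like the earlier example in this bug (the exact multipath.conf shipped by vdsm may differ):

defaults {
    polling_interval            5
    # Queue io for up to no_path_retry * polling_interval = 20 seconds
    # when all paths are down, instead of failing it immediately.
    no_path_retry               4
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}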
(In reply to Nir Soffer from comment #46)
> Partly fixed upstream by:
>
> - Using "no_path_retry 4" in multipath.conf - this gives vms 20 seconds of
> grace time (assuming polling_interval=5) before failing io when all paths
> are down. Multipath enters recovery mode, checking the paths every 5
> seconds. If any path is restored during the recovery time, the io may
> succeed and the vm will not be paused.
>
> - Failing storage domains on read timeout - when the vdsm monitor read
> times out, the storage domain is moved to the INVALID state. Previously the
> storage domain was moved to this state only when the read failed. The next
> successful read will move the domain to the VALID state, and at this point
> we try to resume paused vms.
>
> Not fixed:
>
> - VM paused because of an issue with a non-metadata LUN. Vdsm monitors only
> the first (metadata) LUN, so issues with other LUNs will not be detected.
> This can be fixed by using multipath events: when any LUN belonging to a
> storage domain goes down (no path is available), we can switch the domain
> to the INVALID state.
>
> - VM paused because of a write() error, while the vdsm monitor uses only
> read(). It is not clear yet how we can fix this. Maybe create a canary lv
> on every PV when adding a PV to the system, and monitor this using both
> read and write, instead of the metadata lv.
>
> I think we can file RFEs for the issues we did not fix, since the fixes we
> did make should be useful now.

Agreed. Nir, please file RFEs for those two issues and backport the current work to the 4.2 branch.
(In reply to Allon Mureinik from comment #47) Patches backported to 4.2: https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:ovirt-4.2+topic:backport/4.2/multipath-timeout2 RFEs for the rest of the issues: bug 1548011, bug 1548017
Last patch (https://gerrit.ovirt.org/#/c/88174/) fixes the documentation in multipath.conf. It is important to add it so multipath.conf version 1.5 contains the correct documentation.
Nir, can you add some doctext explaining the improvement?
Verified with the following code:
------------------------------------------
ovirt-engine-4.2.2.2-0.1.el7.noarch
vdsm-4.20.20-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------
Steps to Reproduce:
1. Cause a storage failure (FC/iSCSI)
2. Wait until the VMs are paused
3. Immediately recover the storage - the VM is resumed again

Moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489