Description of problem:
Virtual machines sometimes do not resume after a short storage outage.

Version-Release number of selected component (if applicable):
vdsm-4.16.26-1

How reproducible:
Sometimes, depends on timing

Steps to Reproduce:
1. Cause a storage failure (FC/iSCSI)
2. Wait until the VMs are paused
3. Immediately recover the storage

Actual results:
VMs remain in the paused state.

Expected results:
VMs resume when the storage recovers.

Additional info:
Currently the VMs are resumed by emitting the contEIOVms event in onDomainStateChange. This event is not based on the VM state, but on the storage domain state. That means that VMs go to the paused state independently of the storage domain. The state check is triggered by a storage related issue detected from the SD monitoring thread, which checks the SD every 10 sec. So in some rare cases the storage can go down and come back up between two checks. That means the VMs are paused because they are writing all the time, while the monitoring is not, so it does not notice the outage.

Multipathd checks the paths every 20 sec, and if a storage issue is detected the polling interval is changed to 5 sec, so the paths can come back up within 5 seconds. So if a short outage happens on all of the paths up to 5 sec after the last storage check, and multipath checks the storage at the same time, then the storage outage is not detected by vdsm and the event is not emitted.

Example:

Last data is read from the problematic SD at 2016-04-13 00:15:30

Thread-16::DEBUG::2016-04-13 00:15:30,020::blockSD::336::Storage.Misc.excCmd::(readlines) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2048 bytes (2.0 kB) copied, 8.41896 s, 0.2 kB/s\n'; <rc> = 0

Then the VMs are paused at 00:15:39, and the next storage operation is triggered at 00:15:44 and finishes at 00:15:45:

libvirtEventLoop::INFO::2016-04-13 00:15:39,478::vm::4906::vm.Vm::(_onIOError) vmId=`836814fc-d624-4e76-8733-f941a7a04473`::abnormal vm stop device virtio-disk1 error eio
...
Thread-16::DEBUG::2016-04-13 00:15:44,958::lvm::291::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs ....
Thread-16::DEBUG::2016-04-13 00:15:45,358::lvm::291::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0

So:
00:15:30    => successful operation
00:15:3?    => multipath detects all paths down. The polling interval can reach up to 20 sec if there was no issue before: it checks the storage every 5, 10, 15 and finally 20 sec as long as there is no issue. As soon as there is an issue, the interval is reset to 5 sec.
00:15:39    => VM paused
00:15:39-44 => multipath detects at least one path up again. The polling interval is now 5 sec due to the detected issue, so the first possible time when the paths can be set up again is 5 sec after the paths are detected down.
00:15:44    => next successful SD monitoring IO operation
If the storage becomes unresponsive (= pausing the VMs) and responsive again between monitoring cycles, the VMs won't be unpaused. This is a known limitation, and I doubt we'd invest in fixing it, let alone for a z-stream. Need to dig deeper into the logs, though, and confirm this is actually the issue.
(In reply to Allon Mureinik from comment #8)
> If the storage becomes unresponsive (= pausing the VMs) and responsive again
> between monitoring cycles, the VMs won't be unpaused. This is a known
> limitation, and I doubt we'd invest in fixing it, let alone for a z-stream.
>
> Need to dig deeper into the logs, though, and confirm this is actually the
> issue.

Note the usual suspects for such an issue to happen in real life:
1. No HA (for example, the FC or iSCSI switch) - and it reboots itself.
2. NDU (non-disruptive upgrade) of the storage firmware - in most of them, there's a short 'change world' action, where all user-mode components or whatnot have to be upgraded at once, and a brief IO error is expected.

Usually this is dealt with by the lower SCSI layers (retransmission), so it should not even be a visible issue to the clients.
(In reply to Roman Hodain from comment #0)

Storage domain monitors are doing only reads from storage, and only from a single volume (metadata). So if there is a short storage outage that causes a write to fail, the monitor cannot notice this outage.

To mitigate this issue we can try to use higher values for the multipath no_path_retry option - the default is "no_path_retry fail", which is optimal for vdsm. We can try a setting like "no_path_retry 20", which will queue io for 20 seconds when all paths are down. If the outage is shorter, vms will not freeze, so we don't have to resume them. If the outage is longer and they freeze, there is a higher chance that the domain monitor will detect the outage, and frozen vms will be resumed when the next storage check succeeds.

To try this solution, you should add the VDSM PRIVATE tag to multipath.conf, and modify the no_path_retry value. Here is an example multipath.conf:

# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

devices {
    device {
        all_devs                yes
        no_path_retry           20
    }
}

The downside of this setting is that commands run by vdsm, such as lvm, may now block for an additional 20 seconds during a storage outage, which may cause delays in various flows, including storage domain monitoring.

It is clear that we cannot reliably detect a short storage outage, and this is also true for multipath, which monitors paths in a similar way (at least for the direct checker).

What we can do is introduce a new mode in the storage domain monitor, entered when a vm using that storage domain was paused because of an IO error. While the monitor is in this mode, it will submit up events on each cycle, so even if the domain did not detect an outage, the next successful check will resume the vm. Once all vms are resumed, we return to the normal mode where up events are submitted only when storage has changed from the down state to the up state.

I would try first to tune the multipath configuration so we can sustain a short outage.
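A minimal sketch of the "new mode" idea above, only to illustrate the intended behavior - the class and callback names are hypothetical and are not vdsm's actual monitoring code:

class DomainMonitor(object):
    # Illustrative monitor for one storage domain (hypothetical, not vdsm code).

    def __init__(self, sd_uuid, check_storage, emit_domain_up, has_paused_vms):
        self.sd_uuid = sd_uuid
        self._check_storage = check_storage    # callable: True if the read check succeeded
        self._emit_domain_up = emit_domain_up  # callable: triggers contEIOVms for this domain
        self._has_paused_vms = has_paused_vms  # callable: True if a vm on this domain is paused on EIO
        self._was_valid = True

    def cycle(self):
        valid = self._check_storage()
        if valid:
            # Normal mode: emit an up event only on a down -> up transition.
            # Forced mode: while some vm is still paused on EIO, emit the up
            # event on every successful cycle, so an outage that the monitor
            # missed still resumes the vms on the next good check.
            if not self._was_valid or self._has_paused_vms(self.sd_uuid):
                self._emit_domain_up(self.sd_uuid)
        self._was_valid = valid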
Ben, do you have anything to add on multipath tuning that can help here? Do you think we can improve detection of short outages by querying multipathd status?
Increasing the no_path_retry option makes sense to me.

Using something like the following could help (you would probably want to use "multipathd show maps format raw" instead for easier parsing):

# multipathd show maps format "%n %N %3 %4"
name   paths total_q_time q_timeouts
mpathc 2     0            0
mpathb 2     0            0
mpathf 1     0            0
mpathe 1     0            0
mpathi 1     0            0

%n is the device name (you could also use uuid, or whatever is easiest)
%N shows you the number of active paths.
%3 shows the total number of seconds that this device has spent queuing. It is not reset when paths are restored.
%4 shows you the number of times that the device has switched to queuing mode. This is probably the only information you really need besides the device name/uuid/major:minor/etc. identifier you want to use. Unfortunately, this will always be zero unless you set no_path_retry to enable some length of queuing. I could change %4 to report the number of times that you would have started queuing if multipath had been set up to do so. This would tell you how many times you had lost all your paths.

There is no way to directly figure out what your current queuing state is. This should probably be added. Also, in playing around with this, I noticed that trying to get your state by examining the features line with the %f wildcard doesn't really work right, since it is only accurate as of the start of the last path check (which means that when you disable queuing, it won't display that for 5 more seconds). This definitely needs to get fixed. You can obviously grab the current features from "multipath -ll", or you can use "multipathd show maps topology" (which resyncs the multipathd state with the kernel state before displaying it, as it should be whenever you run a "show maps" command), but it would be better if you could grab all the information at one point in time, with one command.

multipathd will never be able to tell you if there was an IO error on a multipath device. But, short of that, I assume what you want is to know if multipathd ever had an all-paths-down situation, where an IO error could have happened due to connectivity issues. Right now, this is only possible if you have some sort of queueing enabled, but even no_path_retry 1 would allow you to check the number of times that queueing has been enabled (with the %4 format wildcard), which would let you know if you'd lost all paths since the last time you checked. Like I said, I can change multipath to report this information even if no_path_retry is set to fail, if that would be helpful to you.
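For what it's worth, a small sketch of how a monitoring process could consume that output; it assumes the format string and column order shown above, and the function name is only an illustration:

import subprocess

def maps_that_queued():
    """Return names of multipath maps whose q_timeouts counter is non-zero,
    i.e. maps that entered queueing mode (all paths down) at some point."""
    out = subprocess.check_output(
        ["multipathd", "show", "maps", "format", "%n %N %3 %4"])
    queued = []
    for line in out.decode().splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) != 4:
            continue
        name, active_paths, total_q_time, q_timeouts = fields
        if int(q_timeouts) > 0:
            queued.append(name)
    return queued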
I wonder if we should touch the defaults, or just leave this to the documentation?
Regarding no_path_retry: I suppose that this could cause more issues than the undetected SD outage. We know from the past that the monitoring is highly affected by delays from the storage. I am not sure if we should go this way. We also have to consider the fact that the monitoring thread does not run exactly every 10 sec; it can be more due to the storage refresh. In that case the 20 sec does not have to be enough. Obviously the probability is much lower.
(In reply to Roman Hodain from comment #14)
> Regarding no_path_retry: I suppose that this could cause more issues than
> the undetected SD outage. We know from the past that the monitoring is
> highly affected by delays from the storage. I am not sure if we should go
> this way. We also have to consider the fact that the monitoring thread does
> not run exactly every 10 sec; it can be more due to the storage refresh. In
> that case the 20 sec does not have to be enough. Obviously the probability
> is much lower.

We have improved the monitoring considerably for 4.0.
I'd be happy if we can look at the solution from multipath and below - short outages should be handled by the SCSI layer, from my perspective.
(In reply to Roman Hodain from comment #14)
> Regarding no_path_retry: I suppose that this could cause more issues than
> the undetected SD outage. We know from the past that the monitoring is
> highly affected by delays from the storage. I am not sure if we should go
> this way. We also have to consider the fact that the monitoring thread does
> not run exactly every 10 sec; it can be more due to the storage refresh. In
> that case the 20 sec does not have to be enough. Obviously the probability
> is much lower.

We have two issues here:
- vms get paused after a short outage
- monitoring does not detect a short outage

With 3.6, using higher no_path_retry values may prevent pausing, so not detecting the outage is not interesting.

With 4.0, the chance that we miss an outage is much smaller, since storage checks are not affected by the rest of the system (e.g. storage refresh). Storage read checks are performed exactly every 10 seconds. The only reason a storage check is delayed is slow or inaccessible storage. See bug 1081962.

Currently we do not test short storage outages. If this is a common problem, we should probably start testing it regularly.

I think the next step is to reproduce this issue in the lab, and test if multipath tuning helps.
(In reply to Allon Mureinik from comment #13)
> I wonder if we should touch the defaults, or just leave this to the
> documentation?
This requires a lot of testing; I would not touch the defaults at this point.
We have made improvements in 4.0 to this use case. We can consider additional improvements, but this is not material for a z-stream. Moving to 4.1 to consider what can be enhanced.
(In reply to Nir Soffer from comment #18)
> (In reply to Roman Hodain from comment #14)
> > Regarding no_path_retry: I suppose that this could cause more issues than
> > the undetected SD outage. We know from the past that the monitoring is
> > highly affected by delays from the storage. I am not sure if we should go
> > this way. We also have to consider the fact that the monitoring thread
> > does not run exactly every 10 sec; it can be more due to the storage
> > refresh. In that case the 20 sec does not have to be enough. Obviously the
> > probability is much lower.
>
> We have two issues here:
> - vms get paused after a short outage
> - monitoring does not detect a short outage
>
> With 3.6, using higher no_path_retry values may prevent pausing, so not
> detecting the outage is not interesting.

As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14, is it safe to increase the no_path_retry parameter? I am asking especially because of

https://gerrit.ovirt.org/#/c/45735/

If we increase this we can get into trouble in 3.6 in case there really is an outage on one of the storages, or just a problem with one of the LUNs.

> With 4.0, the chance that we miss an outage is much smaller, since storage
> checks are not affected by the rest of the system (e.g. storage refresh).
> Storage read checks are performed exactly every 10 seconds. The only reason
> a storage check is delayed is slow or inaccessible storage.
> See bug 1081962.

This does not seem to be very good. This is something that has to be prevented, not just saying that the probability is lower.

> Currently we do not test short storage outages. If this is a common problem,
> we should probably start testing it regularly.

Yes, we should test that, as this has already happened in more environments, and repeatedly in one specific environment.

> I think the next step is to reproduce this issue in the lab, and test if
> multipath tuning helps.

Sure. Can QE handle this?
(In reply to Roman Hodain from comment #21)
> As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14,
> is it safe to increase the no_path_retry parameter? I am asking especially
> because of
>
> https://gerrit.ovirt.org/#/c/45735/
>
> If we increase this we can get into trouble in 3.6 in case there really is
> an outage on one of the storages, or just a problem with one of the LUNs.

It is safe to increase the value of no_path_retry, but of course if you increase it too much you will add unwanted delays. Basically this is the only thing we can do at this point in 3.6.

> > With 4.0, the chance that we miss an outage is much smaller, since storage
> > checks are not affected by the rest of the system (e.g. storage refresh).
> > Storage read checks are performed exactly every 10 seconds. The only
> > reason a storage check is delayed is slow or inaccessible storage.
> > See bug 1081962.
>
> This does not seem to be very good. This is something that has to be
> prevented, not just saying that the probability is lower.

Preventing this means redesigning the way we detect outages, as I described in comment 10. There is no way we can do this for 3.6, maybe for 4.0.

> > Currently we do not test short storage outages. If this is a common
> > problem, we should probably start testing it regularly.
>
> Yes, we should test that, as this has already happened in more environments,
> and repeatedly in one specific environment.
>
> > I think the next step is to reproduce this issue in the lab, and test if
> > multipath tuning helps.
>
> Sure. Can QE handle this?

Gil, can we make progress with this?
>> As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c14,
>> is it safe to increase the no_path_retry parameter? I am asking especially
>> because of
>>
>> https://gerrit.ovirt.org/#/c/45735/
>>
>> If we increase this we can get into trouble in 3.6 in case there really is
>> an outage on one of the storages, or just a problem with one of the LUNs.

> It is safe to increase the value of no_path_retry, but of course if you
> increase it too much you will add unwanted delays.

Just to make sure: as we run the check every 10 sec (I do not count the storage refresh, which we will not be able to address by this), we need to set no_path_retry to 2, as the polling interval is set to 5 sec, so that we are at least aligned with the 10 sec interval. This means that in the worst case a storage operation can take up to 30 sec:

4 * polling interval in case of no previous issue
2 * polling interval to fail the scsi commands after the failure

So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec. I would go with something like 60 sec, as we are rather interested in keeping the env up than in reporting issues.

What do you think?
(In reply to Roman Hodain from comment #23)
> Just to make sure: as we run the check every 10 sec (I do not count the
> storage refresh, which we will not be able to address by this), we need to
> set no_path_retry to 2, as the polling interval is set to 5 sec, so that we
> are at least aligned with the 10 sec interval.

I don't see a need to match the 10 second interval. It is ok if the storage monitor blocks on inaccessible storage for more time. Engine will detect this using the lastCheck value. If you don't exceed the limit (30 seconds by default), engine does not take any action.

If you exceed the lastCheck limit, engine will start a 5 minute timer, and if the domain did not recover during these 5 minutes, the domain (if faulty on all hosts) or the host (if the domain is ok on other hosts) will be deactivated.

The problem with timeouts is delaying refresh operations. In the worst case, all domains have a short outage, they refresh at the same time, and the refresh takes more than 5 minutes for some of the domains. In 3.6, because monitoring is blocked by refresh operations, this will cause engine to think a domain is in problem for more than 5 minutes, which can cause the domain or the host to be deactivated.

The timeout should be long enough to hide a short outage on the storage, preventing vms from pausing, and short enough to prevent storage refresh operations from taking more than 5 minutes, causing a domain to be deactivated.

> This means that in the worst case a storage operation can take up to 30 sec:
>
> 4 * polling interval in case of no previous issue

What is this value?

> 2 * polling interval to fail the scsi commands after the failure

I would start with 20 seconds (4 retries), and see how it affects the system.

You should keep the timeout lower if you have many storage domains, and can increase the timeout if you have only a few storage domains.

> So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec.
> I would go with something like 60 sec, as we are rather interested in
> keeping the env up than in reporting issues.

In 3.6 this is probably a good idea.

Can you give more details about the setup: how many storage domains are there, how many storage servers, and when they have a short outage, does it affect a single storage domain or all storage domains on the same server?

We need this info to reproduce this issue in the lab.
(In reply to Nir Soffer from comment #24)
> (In reply to Roman Hodain from comment #23)
> > Just to make sure: as we run the check every 10 sec (I do not count the
> > storage refresh, which we will not be able to address by this), we need to
> > set no_path_retry to 2, as the polling interval is set to 5 sec, so that
> > we are at least aligned with the 10 sec interval.
>
> I don't see a need to match the 10 second interval. It is ok if the storage
> monitor blocks on inaccessible storage for more time. Engine will detect
> this using the lastCheck value. If you don't exceed the limit (30 seconds by
> default), engine does not take any action.
>
> If you exceed the lastCheck limit, engine will start a 5 minute timer, and
> if the domain did not recover during these 5 minutes, the domain (if faulty
> on all hosts) or the host (if the domain is ok on other hosts) will be
> deactivated.
>
> The problem with timeouts is delaying refresh operations. In the worst case,
> all domains have a short outage, they refresh at the same time, and the
> refresh takes more than 5 minutes for some of the domains. In 3.6, because
> monitoring is blocked by refresh operations, this will cause engine to think
> a domain is in problem for more than 5 minutes, which can cause the domain
> or the host to be deactivated.
>
> The timeout should be long enough to hide a short outage on the storage,
> preventing vms from pausing, and short enough to prevent storage refresh
> operations from taking more than 5 minutes, causing a domain to be
> deactivated.
>
> > This means that in the worst case a storage operation can take up to 30
> > sec:
> >
> > 4 * polling interval in case of no previous issue
>
> What is this value?

polling_interval 5
max_polling_interval 20

So when the device is marked up again, the polling_interval grows 5, 10, 15, 20.

> > 2 * polling interval to fail the scsi commands after the failure
>
> I would start with 20 seconds (4 retries), and see how it affects the
> system.
>
> You should keep the timeout lower if you have many storage domains, and
> can increase the timeout if you have only a few storage domains.
>
> > So we should also update MaxStorageVdsTimeoutCheckSec to more than 30 sec.
> > I would go with something like 60 sec, as we are rather interested in
> > keeping the env up than in reporting issues.
>
> In 3.6 this is probably a good idea.
>
> Can you give more details about the setup: how many storage domains are
> there, how many storage servers, and when they have a short outage, does it
> affect a single storage domain or all storage domains on the same server?
>
> We need this info to reproduce this issue in the lab.

20 FC SDs
29 LUNs
1 ISO
1 export

Most probably all of the SDs were affected. Here is what we know:

Regarding the main cause of the storage event, our storage team says that NetApp detected in a log an error on the Fibre Channel port caused by a driver failure, and they are working to solve the issue.

So I assume that all of them were affected, as all of the SDs are connected to the same storage.
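To spell out the worst-case arithmetic from comment 23 with these values (purely an illustration, using the numbers discussed in this bug):

polling_interval = 5       # seconds (multipath.conf)
max_polling_interval = 20  # interval grows 5, 10, 15, 20 while the paths are healthy
no_path_retry = 2          # checker intervals to keep queueing before failing io

# Worst case: all paths go down right after a check done at the maximum
# interval, then io is queued for no_path_retry more checker intervals
# before it fails.
worst_case = max_polling_interval + no_path_retry * polling_interval
print(worst_case)  # 30 seconds, hence the suggestion to raise MaxStorageVdsTimeoutCheckSec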
The attached patch should avoid the first issue, vms pausing after a short storage outage. It does not fix the issue of not detecting a short outage that did cause some vms to pause. Fixing this requires a redesign of the vm unpausing mechanism.

We need to learn how to simulate a short storage outage, and test this patch.
When testing this patch, we should make sure there is no regression with unzoned devices - bug 880738. I expect this change to add a 20 second delay the first time an unzoned device is detected; after that, any access to this device will fail immediately.
Executed the following test plan (which includes the flows in https://bugzilla.redhat.com/show_bug.cgi?id=880738#c60):
https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/testrun?id=3_5_Storage_Dynamic_detection_FC_devices_run_28_07_16

Used Nir's patch https://gerrit.ovirt.org/#/c/61281/

All passed.
Do we want that patch in?
(In reply to Yaniv Dary from comment #32) > Do we want that patch in? It needs a review cycle first, but the idea looks more or less right.
(In reply to Allon Mureinik from comment #33) > (In reply to Yaniv Dary from comment #32) > > Do we want that patch in? > It needs a review cycle first, but the idea looks more or less right. Should we move this bug back to 4.1 for tracking?
(In reply to Yaniv Dary from comment #34) > (In reply to Allon Mureinik from comment #33) > > (In reply to Yaniv Dary from comment #32) > > > Do we want that patch in? > > It needs a review cycle first, but the idea looks more or less right. > > Should we move this bug back to 4.1 for tracking? Sure, might as well.
The attached patch may help to avoid vm pause in case of a short storage outage, but it will not help to resume paused vms. Our current mechanism to unpause vms is broken and cannot be fixed in 4.1. The issue is explained in another bug; I cannot find that bug now.
Nir, patch https://gerrit.ovirt.org/#/c/61281/ is merged. Is there anything else missing, or can this be moved to MODIFIED?
The patch was reverted in https://gerrit.ovirt.org/77902 because of bug 1459370.
(In reply to Nir Soffer from comment #38) > The patch was reverted in https://gerrit.ovirt.org/77902 because of bug > 1459370. The above BZ is verified for 7.5 and its 7.4.z (https://bugzilla.redhat.com/show_bug.cgi?id=1510837 ) is in VERIFIED and in REL_PREP. Can we look at this again?
(In reply to Yaniv Kaul from comment #39)
> The above BZ is verified for 7.5 and its 7.4.z
> (https://bugzilla.redhat.com/show_bug.cgi?id=1510837) is in VERIFIED and in
> REL_PREP. Can we look at this again?
Once the fix is available, we will require it and use it in our multipath.conf.
The package is not available yet, according to the errata, it will be available on 2018-Jan-23.
Still not available, according to the errata, it will be available on 2018-Jan-25.
(In reply to Nir Soffer from comment #43) > Still not available, according to the errata, it will be available on > 2018-Jan-25. The platform fix is now available in device-mapper-multipath-0.4.9-111.el7.1 (see bug 1510837 for details). Can we move forward with this?
I posted the configuration change again.
Partly fixed upstream by:

- Using "no_path_retry 4" in multipath.conf - this gives vms 20 seconds of grace time (assuming polling_interval=5) before failing io when all paths are down. Multipath enters recovery mode, checking the paths every 5 seconds. If any path is restored during the recovery time, the io may succeed and the vm will not be paused.

- Failing storage domains on read timeout - when the vdsm monitor read times out, the storage domain is moved to the INVALID state. Previously the storage domain was moved to this state only when the read failed. The next successful read will move the domain to the VALID state, and at this point we try to resume paused vms.

Not fixed:

- VM paused because of an issue with a non-metadata LUN. Vdsm monitors only the first (metadata) LUN, so issues with other LUNs will not be detected. This can be fixed by using multipath events: when any LUN belonging to a storage domain goes down (no path is available), we can switch the domain to the INVALID state.

- VM paused because of a write() error, while the vdsm monitor uses only read(). It is not clear yet how we can fix this. Maybe create a canary lv on every PV when adding a PV to the system, and monitor this using both read and write, instead of the metadata lv.

I think we can file RFEs for the issues we did not fix, since the fixes we did make should be useful now.
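For illustration only, the defaults section with the new value would look roughly like the earlier example in this bug (the exact multipath.conf shipped by vdsm may differ):

defaults {
    polling_interval            5
    # Queue io for up to no_path_retry * polling_interval = 20 seconds
    # when all paths are down, instead of failing it immediately.
    no_path_retry               4
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}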
(In reply to Nir Soffer from comment #46)
> Partly fixed upstream by:
>
> - Using "no_path_retry 4" in multipath.conf - this gives vms 20 seconds of
> grace time (assuming polling_interval=5) before failing io when all paths
> are down. Multipath enters recovery mode, checking the paths every 5
> seconds. If any path is restored during the recovery time, the io may
> succeed and the vm will not be paused.
>
> - Failing storage domains on read timeout - when the vdsm monitor read
> times out, the storage domain is moved to the INVALID state. Previously the
> storage domain was moved to this state only when the read failed. The next
> successful read will move the domain to the VALID state, and at this point
> we try to resume paused vms.
>
> Not fixed:
>
> - VM paused because of an issue with a non-metadata LUN. Vdsm monitors only
> the first (metadata) LUN, so issues with other LUNs will not be detected.
> This can be fixed by using multipath events: when any LUN belonging to a
> storage domain goes down (no path is available), we can switch the domain
> to the INVALID state.
>
> - VM paused because of a write() error, while the vdsm monitor uses only
> read(). It is not clear yet how we can fix this. Maybe create a canary lv
> on every PV when adding a PV to the system, and monitor this using both
> read and write, instead of the metadata lv.
>
> I think we can file RFEs for the issues we did not fix, since the fixes we
> did make should be useful now.

Agreed. Nir, please file RFEs for those two issues and backport the current work to the 4.2 branch.
(In reply to Allon Mureinik from comment #47) Patches backported to 4.2: https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:ovirt-4.2+topic:backport/4.2/multipath-timeout2 RFEs for the rest of the issues: bug 1548011, bug 1548017
Last patch (https://gerrit.ovirt.org/#/c/88174/) fixes the documentation in multipath.conf. It is important to add it so multipath.conf version 1.5 contains the correct documentation.
Nir, can you add some doctext explaining the improvement?
Verified with the following code:
------------------------------------------
ovirt-engine-4.2.2.2-0.1.el7.noarch
vdsm-4.20.20-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------
Steps to Reproduce:
1. Cause a storage failure (FC/iSCSI)
2. Wait until the VMs are paused
3. Immediately recover the storage - the VM is resumed again

Moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489