Created attachment 1030111 [details]
multipath.conf and iscsi.conf

Description of problem:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

Version-Release number of selected component (if applicable):
oVirt Node - 3.5 - 0.999.201504280931.el7.centos

How reproducible:
100% of the time

Steps to Reproduce:
1. Have all nodes and storage domains activated
2. Have all VMs shut down
3. Select all VMs and hit start

Actual results:
Nodes start becoming unresponsive due to losing contact with the storage domains. Warnings in engine.log look like:

2015-05-19 03:09:57,782 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld

High "last check" maximum values from vdsm.log:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
  delay      avg: 0.000856 min: 0.000000 max: 0.001168
  last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
  delay      avg: 0.008358 min: 0.000000 max: 0.040269
  last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
  delay      avg: 0.007793 min: 0.000819 max: 0.041316
  last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
  delay      avg: 0.000493 min: 0.000374 max: 0.000698
  last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
  delay      avg: 0.002080 min: 0.000000 max: 0.040142
  last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
  delay      avg: 0.004798 min: 0.000000 max: 0.041006
  last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
  delay      avg: 0.001002 min: 0.000000 max: 0.001199
  last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
  delay      avg: 0.003748 min: 0.000000 max: 0.040903
  last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
  delay      avg: 0.000963 min: 0.000000 max: 0.001209
  last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
  delay      avg: 0.000881 min: 0.000000 max: 0.001227
  last check avg: 11.086667 min: 0.100000 max: 63.200000

As far as I know, our network and Compellent are working properly.

Expected results:
oVirt would stay up and running.

Additional info:
This is using the default multipath and iSCSI configs.
I've applied the changes described in https://gerrit.ovirt.org/#/c/41244/3/lib/vdsm/tool/configurators/multipath.py and http://lists.ovirt.org/pipermail/users/2015-May/032973.html to my multipath.conf and iscsi.conf, along with Dell's recommended settings. I've attached those configs to this message. While the cluster is no longer falling apart due to losing contact with storage, I still see the same warnings in engine.log. I don't know if I need to be concerned with those or not.

Related email thread: http://lists.ovirt.org/pipermail/users/2015-May/032936.html
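For reference, the style of override being discussed looks roughly like the stanza below. This is a sketch only - the attached configs and the linked patch are authoritative; the COMPELNT device block here matches the one later quoted from the 3.5 host in comment 16:

defaults {
    # fail I/O instead of queueing when all paths are down
    no_path_retry    fail
}

devices {
    device {
        vendor          "COMPELNT"
        product         "Compellent Vol"
        no_path_retry   fail
    }
}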
Nir, looking at the gerrit patch and the ml thread, it looks like this is something which can be fixed in vdsm, right?
(In reply to Fabian Deutsch from comment #1)
> Nir, looking at the gerrit patch and the ml thread, it looks like this is
> something which can be fixed in vdsm, right?

I think that the multipath configuration vdsm uses is wrong, missing some settings for this storage server. However, the settings I recommended, taken from multipath.conf.defaults and modified for vdsm, did not eliminate the problem.

So yes, vdsm needs to ship a better configuration for this server, but the multipath builtin setup probably also needs to be fixed.

I posted https://gerrit.ovirt.org/41244, but we cannot ship this without testing with this storage server, or getting an ack from the storage vendor.
Chris,

Did you check with the vendor about the storage server configuration?

Can you upload some logs so we can investigate this further? We need:
- /var/log/messages
- /var/log/vdsm/vdsm.log
- engine.log
Just got this figured out yesterday. The underlying issue was a bad disk in the Compellent's tier 2 storage. This disk was causing read latency issues. The default multipath.conf and iscsi.conf were causing oVirt to fail when combined with the high latency.

So the good news is that the updated configs made oVirt more resilient. The bad news is that the Compellent reports the disk as "healthy" despite it having been logging this latency the entire time.

With the inclusion of the updated configs in future oVirt versions, this can be marked as solved. Thank you again for all the help.
(In reply to Chris Jones from comment #4)
> Just got this figured out yesterday. The underlying issue was a bad disk in
> the Compellent's tier 2 storage. This disk was causing read latency issues.
> The default multipath.conf and iscsi.conf were causing oVirt to fail when
> combined with the high latency.
>
> So the good news is that the updated configs made oVirt more resilient. The
> bad news is that the Compellent reports the disk as "healthy" despite it
> having been logging this latency the entire time.
>
> With the inclusion of the updated configs in future oVirt versions, this
> can be marked as solved. Thank you again for all the help.

Forgot to mention: we have no more storage domain errors in the logs now that the bad disk has been removed.
Lowering severity and priority as the root cause was the storage server. Keeping this open since we need to improve the related multipath configuration.
Patch was abandoned, moving back to NEW
Fixed by removing the incorrect configuration for COMPELNT/Compellent Vol, exposing the builtin configuration for this device. Vdsm now overrides only the "no_path_retry" option, using "fail". This value is required by vdsm to prevent long delays in storage-related threads that would otherwise block on lvm or multipath operations while I/O to one device is queued.
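For context, a simplified gloss of the two no_path_retry modes being contrasted here (per multipath.conf(5); exact queueing behavior varies with the multipath version):

    no_path_retry fail     # fail I/O immediately once all paths to the device are down
    no_path_retry queue    # queue I/O until a path comes back; callers block in the meantime

With "queue", a vdsm thread issuing lvm or multipath commands can block indefinitely behind a single misbehaving device, which is why vdsm forces "fail".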
Nir, is there anything worthwhile documenting here?
(In reply to Allon Mureinik from comment #9)
> Nir, is there anything worthwhile documenting here?

See comment 8. I'm not sure that adding doc text ("we had a bad configuration, fixed it") has any value.
(In reply to Nir Soffer from comment #10)
> (In reply to Allon Mureinik from comment #9)
> > Nir, is there anything worthwhile documenting here?
>
> See comment 8. I'm not sure that adding doc text ("we had a bad
> configuration, fixed it") has any value.

IMHO, there isn't. Thanks for the clarification. Setting requires-doctext-.
The 'devices' section in multipath.conf has 'all_devs yes' and 'no_path_retry fail':

devices {
    device {
        all_devs        yes
        no_path_retry   fail
    }
}

Nir, can I move this to VERIFIED based on that?
(In reply to Elad from comment #12)
> Nir, can I move this to VERIFIED based on that?

No, you should check the effect of this configuration. The best way is to check with the actual storage, but we cannot do that unless you find a way to make your server look like this product.

What you can do is check the effective settings for this device, using multipathd:

$ multipathd show config | less

Now locate the device by vendor/product - in this case the vendor name is "COMPELNT":

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

Check:

    features "0"
    no_path_retry "fail"

Compare these settings with ovirt-3.5.
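(A convenient shortcut for locating the stanza in that output, assuming GNU grep - the context-line count is arbitrary, so adjust it to cover the whole device block:

$ multipathd show config | grep -A 12 'vendor "COMPELNT"'
)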
On latest vdsm (vdsm-4.17.10.1-0.el7ev.noarch), under the Dell vendor "COMPELNT", the following is configured:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

(features "0", no_path_retry "fail")

On 3.5 (vdsm-4.16.29-1.el7ev.x86_64), it's configured as follows:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

(features "0", no_path_retry "fail")

Nir, it looks the same on both.
(In reply to Elad from comment #14)
> Nir, it looks the same on both.

Can you post the multipath.conf from the 3.5 setup? In 3.5, I expect to find this multipath configuration:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

This is incorrect because it overrides all the settings other than no_path_retry back to the defaults. It is possible that on the system you tested, the multipath defaults are the same for this device, which would explain why the effective settings are the same.

Anyway, our configuration works as expected, so I think we are done with this.
On 3.5 - 7.1 it is as I pasted in comment #14. On the other hand, on 3.5 - 6.7, it's as follows:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy multibus
    getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
    path_selector "round-robin 0"
    path_checker tur
    features "0"
    hardware_handler "0"
    prio const
    failback immediate
    rr_weight uniform
    no_path_retry fail
    rr_min_io 1000
    rr_min_io_rq 1
}

multipath.conf from the 3.5 - 7.1 host:

[root@camel-vdsb ~]# cat /etc/multipath.conf
# RHEV REVISION 1.1

defaults {
    polling_interval        5
    getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}

devices {
    device {
        vendor              "HITACHI"
        product             "DF.*"
        getuid_callout      "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    }
    device {
        vendor              "COMPELNT"
        product             "Compellent Vol"
        no_path_retry       fail
    }
    device {
        # multipath.conf.default
        vendor              "DGC"
        product             ".*"
        product_blacklist   "LUNZ"
        path_grouping_policy "group_by_prio"
        path_checker        "emc_clariion"
        hardware_handler    "1 emc"
        prio                "emc"
        failback            immediate
        rr_weight           "uniform"
        # vdsm required configuration
        getuid_callout      "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        features            "0"
        no_path_retry       fail
    }
}
(In reply to Elad from comment #16)

The effective configuration on 7.1 is strange; I would expect the same values as seen on 6.7. I will check this with the multipath maintainer - maybe there were changes in the multipath defaults, or some patch is missing in multipath on 7.
Hi Nir, did you get an answer?
(In reply to Elad from comment #18)

This seems to be a difference in the way RHEL 6.7 and 7.2 are configured.

When this bug was reported, the behavior with the following configuration:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

was different from the current configuration:

device {
    all_devs yes
    no_path_retry fail
}

But now they seem to be the same for this device.

Anyway, on our side, the new configuration supports all devices and behaves as expected. This can be safely closed.
Moving to VERIFIED based on comment #19.
oVirt 3.6.0 has been released, closing current release