Created attachment 1030111 [details]
multipath.conf and iscsi.conf

Description of problem:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

Version-Release number of selected component (if applicable):
oVirt Node - 3.5 - 0.999.201504280931.el7.centos

How reproducible:
100% of the time

Steps to Reproduce:
1. Have all nodes and storage domains activated
2. Have all VMs shut down
3. Select all VMs and hit start

Actual results:
Nodes start becoming unresponsive due to losing contact with the storage domains. Warnings in engine.log look like:

2015-05-19 03:09:57,782 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld

High "last check" maximum values from vdsm.log:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
  delay      avg: 0.000856 min: 0.000000 max: 0.001168
  last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
  delay      avg: 0.008358 min: 0.000000 max: 0.040269
  last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
  delay      avg: 0.007793 min: 0.000819 max: 0.041316
  last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
  delay      avg: 0.000493 min: 0.000374 max: 0.000698
  last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
  delay      avg: 0.002080 min: 0.000000 max: 0.040142
  last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
  delay      avg: 0.004798 min: 0.000000 max: 0.041006
  last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
  delay      avg: 0.001002 min: 0.000000 max: 0.001199
  last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
  delay      avg: 0.003748 min: 0.000000 max: 0.040903
  last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
  delay      avg: 0.000963 min: 0.000000 max: 0.001209
  last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
  delay      avg: 0.000881 min: 0.000000 max: 0.001227
  last check avg: 11.086667 min: 0.100000 max: 63.200000

As far as I know, our network and Compellent are working properly.

Expected results:
oVirt would stay up and running.

Additional info:
This is using the default multipath and iSCSI configs.
I've applied the changes described in https://gerrit.ovirt.org/#/c/41244/3/lib/vdsm/tool/configurators/multipath.py and http://lists.ovirt.org/pipermail/users/2015-May/032973.html to my multipath.conf and iscsi.conf, along with Dell's recommended settings. I've attached those configs to this message. While the cluster is no longer falling apart due to losing contact with storage, I still see the same warnings in engine.log. I don't know if I need to be concerned with those or not.

Related email thread: http://lists.ovirt.org/pipermail/users/2015-May/032936.html
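For reference, the style of override being discussed looks roughly like the stanza below. This is a sketch only - the attached configs and the linked patch are authoritative; the COMPELNT device block here matches the one later quoted from the 3.5 host in comment 16:

defaults {
    # fail I/O instead of queueing when all paths are down
    no_path_retry    fail
}

devices {
    device {
        vendor          "COMPELNT"
        product         "Compellent Vol"
        no_path_retry   fail
    }
}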
Nir, looking at the gerrit patch and the ml thread, it looks like this is something which can be fixed in vdsm, right?
(In reply to Fabian Deutsch from comment #1)
> Nir, looking at the gerrit patch and the ml thread, it looks like this is
> something which can be fixed in vdsm, right?

I think that the multipath configuration vdsm uses is wrong, missing some settings for this storage server. However, the settings I recommended, taken from multipath.conf.defaults and modified for vdsm, did not eliminate the problem.

So yes, vdsm needs to ship a better configuration for this server, but the multipath builtin setup probably also needs to be fixed.

I posted https://gerrit.ovirt.org/41244, but we cannot ship this without testing with this storage server, or getting an ack from the storage vendor.
Chris,

Did you check with the vendor about the storage server configuration?

Can you upload some logs so we can investigate this further? We need:
- /var/log/messages
- /var/log/vdsm/vdsm.log
- engine.log
Just got this figured out yesterday. The underlying issue was a bad disk in the Compellent's tier 2 storage. This disk was causing read latency issues. The default multipath.conf and iscsi.conf were causing oVirt to fail when combined with the high latency.

So the good news is that the updated configs made oVirt more resilient. The bad news is that the Compellent reports the disk as "healthy" despite it having been logging this latency the entire time.

With the inclusion of the updated configs in future oVirt versions, this can be marked as solved. Thank you again for all the help.
(In reply to Chris Jones from comment #4)
> Just got this figured out yesterday. The underlying issue was a bad disk in
> the Compellent's tier 2 storage. This disk was causing read latency issues.
> The default multipath.conf and iscsi.conf were causing oVirt to fail when
> combined with the high latency.
>
> So the good news is that the updated configs made oVirt more resilient. The
> bad news is that the Compellent reports the disk as "healthy" despite it
> having been logging this latency the entire time.
>
> With the inclusion of the updated configs in future oVirt versions, this
> can be marked as solved. Thank you again for all the help.

Forgot to mention: we have no more storage domain errors in the logs now that the bad disk has been removed.
Lowering severity and priority as the root cause was the storage server. Keeping this open since we need to improve the related multipath configuration.
Patch was abandoned, moving back to NEW
Fixed by removing the incorrect configuration for COMPELNT/Compellent Vol, exposing the builtin configuration for this device. Vdsm now overrides only the "no_path_retry" option, using "fail". This value is required by vdsm to prevent long delays in storage-related threads that would otherwise block on lvm or multipath operations while I/O to one device is queued.
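For context, a simplified gloss of the two no_path_retry modes being contrasted here (per multipath.conf(5); exact queueing behavior varies with the multipath version):

    no_path_retry fail     # fail I/O immediately once all paths to the device are down
    no_path_retry queue    # queue I/O until a path comes back; callers block in the meantime

With "queue", a vdsm thread issuing lvm or multipath commands can block indefinitely behind a single misbehaving device, which is why vdsm forces "fail".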
Nir, is there anything worthwhile documenting here?
(In reply to Allon Mureinik from comment #9)
> Nir, is there anything worthwhile documenting here?

See comment 8. I'm not sure that adding doc text ("we had a bad configuration, fixed it") has any value.
(In reply to Nir Soffer from comment #10)
> (In reply to Allon Mureinik from comment #9)
> > Nir, is there anything worthwhile documenting here?
>
> See comment 8. I'm not sure that adding doc text ("we had a bad
> configuration, fixed it") has any value.

IMHO, there isn't. Thanks for the clarification. Setting requires-doctext-.
The 'devices' section in multipath.conf has 'all_devs yes' and 'no_path_retry fail':

devices {
    device {
        all_devs        yes
        no_path_retry   fail
    }
}

Nir, can I move this to VERIFIED based on that?
(In reply to Elad from comment #12)
> Nir, can I move this to VERIFIED based on that?

No, you should check the effect of this configuration. The best way is to check with the actual storage, but we cannot do that unless you find a way to make your server look like this product.

What you can do is check the effective settings for this device, using multipathd:

$ multipathd show config | less

Now locate the device by vendor/product - in this case the vendor name is "COMPELNT":

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

Check:

    features "0"
    no_path_retry "fail"

Compare these settings with ovirt-3.5.
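(A convenient shortcut for locating the stanza in that output, assuming GNU grep - the context-line count is arbitrary, so adjust it to cover the whole device block:

$ multipathd show config | grep -A 12 'vendor "COMPELNT"'
)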
On latest vdsm (vdsm-4.17.10.1-0.el7ev.noarch), under the Dell vendor "COMPELNT", the following is configured:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

(features "0", no_path_retry "fail")

On 3.5 (vdsm-4.16.29-1.el7ev.x86_64), it's configured as follows:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback immediate
    rr_weight "uniform"
    no_path_retry "fail"
}

(features "0", no_path_retry "fail")

Nir, it looks the same on both.
(In reply to Elad from comment #14)
> Nir, it looks the same on both.

Can you post the multipath.conf from the 3.5 setup? In 3.5, I expect to find this multipath configuration:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

This is incorrect because it overrides all the settings other than no_path_retry back to the defaults. It is possible that on the system you tested, the multipath defaults are the same for this device, which would explain why the effective settings are the same.

Anyway, our configuration works as expected, so I think we are done with this.
On 3.5 - 7.1 it is as I pasted in comment #14. On the other hand, on 3.5 - 6.7, it's as follows:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy multibus
    getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
    path_selector "round-robin 0"
    path_checker tur
    features "0"
    hardware_handler "0"
    prio const
    failback immediate
    rr_weight uniform
    no_path_retry fail
    rr_min_io 1000
    rr_min_io_rq 1
}

multipath.conf from the 3.5 - 7.1 host:

[root@camel-vdsb ~]# cat /etc/multipath.conf
# RHEV REVISION 1.1

defaults {
    polling_interval        5
    getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}

devices {
    device {
        vendor              "HITACHI"
        product             "DF.*"
        getuid_callout      "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    }
    device {
        vendor              "COMPELNT"
        product             "Compellent Vol"
        no_path_retry       fail
    }
    device {
        # multipath.conf.default
        vendor              "DGC"
        product             ".*"
        product_blacklist   "LUNZ"
        path_grouping_policy "group_by_prio"
        path_checker        "emc_clariion"
        hardware_handler    "1 emc"
        prio                "emc"
        failback            immediate
        rr_weight           "uniform"
        # vdsm required configuration
        getuid_callout      "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        features            "0"
        no_path_retry       fail
    }
}
(In reply to Elad from comment #16)

The effective configuration on 7.1 is strange; I would expect the same values as seen on 6.7. I will check this with the multipath maintainer - maybe there were changes in the multipath defaults, or some patch is missing in multipath on 7.
Hi Nir, did you get an answer?
(In reply to Elad from comment #18)

This seems to be a difference in the way RHEL 6.7 and 7.2 are configured.

When this bug was reported, the behavior with the following configuration:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

was different from the current configuration:

device {
    all_devs yes
    no_path_retry fail
}

But now they seem to be the same for this device.

Anyway, on our side, the new configuration supports all devices and behaves as expected. This can be safely closed.
Moving to VERIFIED based on comment #19.
oVirt 3.6.0 has been released, closing current release