Bug 1225162
| Field | Value |
|---|---|
| Summary | oVirt Instability with Dell Compellent via iSCSI/Multipath with default configs |
| Product | [oVirt] vdsm |
| Component | General |
| Version | 4.16.30 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Whiteboard | storage |
| oVirt Team | Storage |
| Reporter | Chris Jones <chris.jones> |
| Assignee | Nir Soffer <nsoffer> |
| QA Contact | Elad <ebenahar> |
| CC | acanan, amureini, bazulay, bugs, chris.jones, daniel.helgenberger, ebenahar, ecohen, gklein, greartes, lsurette, mgoldboi, nsoffer, ovirt-bugs, rbalakri, s.kieske, tnisan, yeylon, ylavi |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 4.17.8 |
| Flags | rule-engine: ovirt-3.6.0+, ylavi: planning_ack+, amureini: devel_ack+, rule-engine: testing_ack+ |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-01-13 14:39:43 UTC |
Description (Chris Jones, 2015-05-26 18:06:37 UTC):
---

Comment #1 (Fabian Deutsch):

Nir, looking at the gerrit patch and the ml thread, it looks like this is something that can be fixed in vdsm, right?

---

(In reply to Fabian Deutsch from comment #1)
> Nir, looking at the gerrit patch and the ml thread, it looks like this is
> something which can be fixed in vdsm, right?

I think the multipath configuration that vdsm uses is wrong, missing some settings for this storage server. However, the settings I recommended, taken from multipath.conf.defaults and modified for vdsm, did not eliminate the problem. So yes, vdsm needs to ship a better configuration for this server, but the multipath builtin setup probably also needs to be fixed.

I posted https://gerrit.ovirt.org/41244, but we cannot ship this without testing with this storage server, or getting an ack from the storage vendor.

Chris, did you check with the vendor about the storage server configuration? Can you upload some logs so we can investigate this further? We need:

- /var/log/messages
- /var/log/vdsm/vdsm.log
- engine.log

---

Comment #4 (Chris Jones):

Just got this figured out yesterday. The underlying issue was a bad disk in the Compellent's tier 2 storage. This disk was causing read latency issues. The default multipath.conf and iscsi.conf were causing it to fail when combined with the high latency.

So the good news is that the updated configs made oVirt more resilient. The bad news is that the Compellent reports the disk as "healthy" despite it having been logging this latency the entire time.

With the inclusion of the updated configs in future oVirt versions, this can be marked as solved. Thank you again for all the help.

---

(In reply to Chris Jones from comment #4)

Forgot to mention: we have no more storage domain errors in the logs now that the bad disk has been removed.

---

Lowering severity and priority, as the root cause was the storage server. Keeping this open since we need to improve the related multipath configuration.

---

Patch was abandoned, moving back to NEW.

---

Comment #8:

Fixed by removing the incorrect configuration for COMPELNT/"Compellent Vol", exposing the builtin configuration for this device. Vdsm now overrides only the "no_path_retry" option, using "fail". This value is required by vdsm to prevent long delays in storage-related threads blocked on LVM or multipath operations because I/O to one device is queued.

---

Comment #9 (Allon Mureinik):

Nir, is there anything worthwhile documenting here?

---

Comment #10 (Nir Soffer):

(In reply to Allon Mureinik from comment #9)
> Nir, is there anything worth while documenting here?

See comment 8. Not sure that adding doc text (we had a bad configuration, fixed) has any value.

---

(In reply to Nir Soffer from comment #10)
> See comment 8. Not sure that adding doc text (we had a bad configuration,
> fixed) has any value.

IMHO, there isn't, thanks for the clarification. Setting requires-doctext-.
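---

An aside on comment #8: no_path_retry "fail" makes a multipath map return I/O errors as soon as all paths are down, whereas queueing (queue_if_no_path) would hold I/O and can leave vdsm's LVM and multipath threads blocked. A minimal sketch of how one might check which behaviour is in effect on a host; multipath -ll is standard device-mapper-multipath tooling, not a command taken from this report, and the exact output layout varies by version:

    # List the multipath maps and their features column.  With vdsm's
    # "no_path_retry fail" the Compellent maps should show features='0';
    # 'queue_if_no_path' would mean I/O is queued when all paths are lost.
    multipath -ll | grep -E 'COMPELNT|features'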
---

Comment #12 (Elad):

The 'devices' section in multipath.conf has 'all_devs=yes' and 'no_path_retry=fail':

    devices {
        device {
            all_devs        yes
            no_path_retry   fail
        }
    }

Nir, can I move to VERIFIED based on that?

---

(In reply to Elad from comment #12)
> Nir, can I move to VERIFIED based on that?

No, you should check the effect of this configuration. The best way is to check with the actual storage, but we cannot do that unless you find a way to make your server look like this product.

What you can do is check the effective settings for this device, using multipath:

    $ multipathd show config | less

Now locate the device by vendor/product; in this case it is the vendor name, "COMPELNT":

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback immediate
        rr_weight "uniform"
        no_path_retry "fail"
    }

Check:

    features "0"
    no_path_retry "fail"

Compare these settings with ovirt-3.5.

---

Comment #14 (Elad):

On the latest vdsm (vdsm-4.17.10.1-0.el7ev.noarch), under the Dell vendor "COMPELNT", the following is configured:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback immediate
        rr_weight "uniform"
        no_path_retry "fail"
    }

(features "0", no_path_retry "fail")

On 3.5 (vdsm-4.16.29-1.el7ev.x86_64), it's configured as follows:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback immediate
        rr_weight "uniform"
        no_path_retry "fail"
    }

(features "0", no_path_retry "fail")

Nir, it looks the same on both.

---

(In reply to Elad from comment #14)
> Nir, it looks the same on both.

Can you post the multipath.conf from the 3.5 setup?

In 3.5, I expect to find this multipath configuration:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        no_path_retry fail
    }

which is incorrect, because it overrides all the settings other than no_path_retry with the defaults. It is possible that on the system you tested, the multipath defaults are the same for this device, which would explain why the effective settings are the same.

Anyway, our configuration works as expected, so I think we are done with this.
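---

An aside on the verification steps above: instead of paging through multipathd show config and searching for the vendor by hand, the Compellent stanza can be filtered directly. A minimal sketch; the -A 12 context length is an assumption and may need adjusting if the stanza is longer on a given multipath version:

    # Print only the effective stanza for the Compellent device.
    multipathd show config | grep -A 12 'vendor "COMPELNT"'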
---

Comment #16 (Elad):

On 3.5 - 7.1 it is as I pasted in comment #14. On the other hand, on 3.5 - 6.7, it's as follows:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_selector "round-robin 0"
        path_checker tur
        features "0"
        hardware_handler "0"
        prio const
        failback immediate
        rr_weight uniform
        no_path_retry fail
        rr_min_io 1000
        rr_min_io_rq 1
    }

multipath.conf from the 3.5 - 7.1 host:

    [root@camel-vdsb ~]# cat /etc/multipath.conf
    # RHEV REVISION 1.1

    defaults {
        polling_interval        5
        getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        no_path_retry           fail
        user_friendly_names     no
        flush_on_last_del       yes
        fast_io_fail_tmo        5
        dev_loss_tmo            30
        max_fds                 4096
    }

    devices {
        device {
            vendor                  "HITACHI"
            product                 "DF.*"
            getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        }

        device {
            vendor                  "COMPELNT"
            product                 "Compellent Vol"
            no_path_retry           fail
        }

        device {
            # multipath.conf.default
            vendor                  "DGC"
            product                 ".*"
            product_blacklist       "LUNZ"
            path_grouping_policy    "group_by_prio"
            path_checker            "emc_clariion"
            hardware_handler        "1 emc"
            prio                    "emc"
            failback                immediate
            rr_weight               "uniform"
            # vdsm required configuration
            getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
            features                "0"
            no_path_retry           fail
        }
    }

---

(In reply to Elad from comment #16)

The effective configuration on RHEL 7 is strange; I would expect the same values as seen on 6.7. I will check this with the multipath maintainer. Maybe there were changes in the multipath defaults, or some patch is missing in multipath on 7.

---

Comment #18 (Elad):

Hi Nir, did you get an answer?

---

Comment #19:

(In reply to Elad from comment #18)

This seems to be a difference in the way RHEL 6.7 and 7.2 are configured. When this bug was reported, the behavior with the following configuration:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        no_path_retry fail
    }

was different from the current configuration:

    device {
        all_devs yes
        no_path_retry fail
    }

But now they seem to be the same for this device. Anyway, on our side, the new configuration supports all devices and behaves as expected. This can be safely closed.

---

Moving to VERIFIED based on comment #19.

---

oVirt 3.6.0 has been released, closing current release.
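---

A closing aside: the comparison done by hand in comments #14 and #16 could also be scripted. This is a sketch only; the host names rhel6-host and rhel7-host are placeholders, not systems from this report:

    # Capture the effective multipath configuration on each host and diff
    # only the Compellent stanzas.
    ssh rhel6-host 'multipathd show config' > mp-el6.conf
    ssh rhel7-host 'multipathd show config' > mp-el7.conf
    diff <(grep -A 14 'vendor "COMPELNT"' mp-el6.conf) \
         <(grep -A 14 'vendor "COMPELNT"' mp-el7.conf)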