Bug 1427183
| Summary: | [downstream clone - 4.0.7] [SCALE] VMs becoming non-responsive due to LVM slowdown caused by too many active devices | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | eberman |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | amarchuk, amureini, bazulay, bugs, eberman, eedri, emarcian, gklein, guchen, gveitmic, lsurette, nicolas, nsoffer, ratamir, srevivo, tnisan, ycui, ykaul, ylavi |
| Target Milestone: | ovirt-4.0.7 | Keywords: | Performance, TestOnly, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1331978 | Environment: | |
| Last Closed: | 2017-03-20 13:53:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1331978 | | |
| Bug Blocks: | | | |
Description
rhev-integ 2017-02-27 14:35:14 UTC
Warning: the cleanup script mentioned in the description is safe only if the host is in maintenance. If the host is up, restarting vdsm will clean stale devices. (Originally by Nir Soffer)

- Bond mode 6 is not supported in oVirt (see 'bonding modes' in http://www.ovirt.org/documentation/admin-guide/administration-guide/ ); only modes 1, 2, 4 and 5 are, unless you've configured the network as non-VM (so it doesn't have a bridge on top).
- Why not use iSCSI multipathing instead of bonding? You should get slightly better performance (most likely in latency and throughput). I assume it's due to the storage HW? (Originally by Yaniv Kaul)

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone. (Originally by rule-engine)

- Indeed, the storage network is configured as non-VM (as is the motion network).
- We originally decided to use bonding for scalability: we knew the number of VMs was going to be very large, and to avoid storage latency problems in the future we chose ALB bonding. In any case, I don't think this issue is a network performance problem, as the current average network transfer rate is below 200 Mb/s. (Originally by nicolas)

Did you have a chance to test the patch? http://gerrit.ovirt.org/56877 We would like to know whether it fixes the issue with stale lvs in your environment, and whether fixing the stale lvs also fixes the sporadic VM issues, which we have not investigated yet. (Originally by Nir Soffer)

I was able to apply this patch on both affected nodes at the time we were debugging this (hosts 5 and 6). The issue has not happened since then, and the machines have not become unresponsive again, as the VG operations' execution time is now reasonably low (~2-3 seconds). (Originally by nicolas)

This change is too intrusive for a z-stream, pushing out to 4. (Originally by Allon Mureinik)

The recent LVM fixes reduce the probability of hitting this, and I don't want to risk 4.0.2 for this fix, which doesn't have consensus yet. Pushing out to 4.0.4. (Originally by Allon Mureinik)

*** Bug 1412900 has been marked as a duplicate of this bug. *** (Originally by Germano Veit Michel)

4.0.6 has been the last oVirt 4.0 release, please re-target this bug. (Originally by Sandro Bonazzola)

Should be solved in 4.1.1. (Originally by Nir Soffer)

But on FC we need an lvm filter, moving back to POST. (Originally by Nir Soffer)

We can test this with iSCSI (a verification sketch follows this comment):
1. Create a lot of domains (30?)
2. Create many disks in each domain (300?)
3. Put the host into maintenance
4. Activate the host

After activation, vdsm connects to storage, and all LUNs will be discovered by lvmetad and become active. On 4.1, with lvmetad disabled, no lv should be active except the lvs vdsm activates for running VMs. I don't know if you can reproduce the same issue the user had, since lvm has fixed a scale issue since then, but even if this works on both 4.0.z and 4.1, you can compare the time vdsm spends running lvm commands (if you enable debug-level logs).

With FC storage, vdsm deactivates all lvs during startup. From a scale point of view we are OK, but we may have lvs that cannot be deactivated because there are guest lvs on top of oVirt raw volumes.

Yaniv, since this is mostly about scale, and the main issue was auto-activation by lvmetad, I think we can move this to modified and test. (Originally by Nir Soffer)
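For step 4 of the procedure above, one quick way to see what got auto-activated is to list the active LVs on the host right after activation. This is a minimal sketch, not taken from the bug; it assumes it is run as root on the hypervisor and relies only on standard lvm2 reporting (the fifth character of lv_attr is 'a' for an active LV):

```bash
# List active LVs as VG/LV; the 5th lv_attr character is 'a' when the LV is active.
lvs --noheadings -o vg_name,lv_name,lv_attr | awk '$3 ~ /^....a/ {print $1 "/" $2}'

# Count active LVs. On 4.1 (lvmetad disabled) only the LVs vdsm activated for
# running VMs should show up; on 4.0.z every LV discovered by lvmetad may be active.
lvs --noheadings -o lv_attr | awk '$1 ~ /^....a/' | wc -l
```

Running this before putting the host into maintenance and again after activation makes the comparison between 4.0.z and 4.1 hosts concrete.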
We believe this is solved by disabling lvmetad, preventing auto-activation of oVirt lvs. See the original bug for instructions on testing this.
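Since the fix hinges on lvmetad being disabled on the host, a spot-check along the following lines can confirm that. This is a hedged sketch, not part of the bug's test instructions; the unit names assume a RHEL 7 hypervisor with systemd:

```bash
# Effective lvm.conf setting; expected output on a fixed host: use_lvmetad=0
lvmconfig global/use_lvmetad

# lvmetad socket and service should not be running; expected: inactive
systemctl is-active lvm2-lvmetad.socket
systemctl is-active lvm2-lvmetad.service
```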