Bug 1331978
| Summary: | [SCALE] VMs becoming non-responsive due to LVM slowdown caused by too many active devices | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | nicolas |
| Component: | Core | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | guy chen <guchen> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.17.26 | CC: | amureini, bugs, dfodor, eberman, eedri, emarcian, gveitmic, nicolas, nsoffer, tnisan, ylavi |
| Target Milestone: | ovirt-4.1.1 | Keywords: | Performance, TestOnly, ZStream |
| Target Release: | 4.19.8 | Flags: | rule-engine: ovirt-4.1+, ylavi: planning_ack+, tnisan: devel_ack+, acanan: testing_ack+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1427183 (view as bug list) | Environment: | |
| Last Closed: | 2017-04-21 09:36:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1374545 | | |
| Bug Blocks: | 1063871, 1163890, 1427183 | | |
| Attachments: | | | |
Description (nicolas, 2016-04-30 21:22:41 UTC)
Warning: the cleanup script mentioned in the description is safe only if the host is in maintenance. If the host is up, restarting vdsm will clean stale devices.

- Bond mode 6 is not supported in oVirt (see 'bonding modes' in http://www.ovirt.org/documentation/admin-guide/administration-guide/ ); only modes 1, 2, 4 and 5 are, unless you've configured the network as non-VM (so it doesn't have a bridge on top).
- Why not use iSCSI multipathing instead of bonding? You should get slightly better performance (latency and throughput, most likely). I assume it's due to the storage HW?

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.

- Indeed, the storage network is configured as non-VM (and so is the motion network).
- We originally decided to use bonding for scalability reasons. We knew the number of VMs was going to be very large, and to avoid future storage latency problems we chose an ALB bond. In any case, I don't think this issue is a network performance problem, as the current average network throughput is below 200 Mb/s.

Did you have a chance to test the patch? http://gerrit.ovirt.org/56877 We'd like to know if it fixes the issue with stale LVs in your environment, and if fixing the stale LVs issue also fixes the sporadic VM issues, which we have not investigated yet.

I was able to apply this patch on both affected nodes at the time we were debugging this (hosts 5 and 6). The issue has not happened since then, and the machines have not become unresponsive again, as the VG operations' execution time is now reasonably low (~2-3 seconds).

This change is too intrusive for a z-stream, pushing out to 4. The recent LVM fixes reduce the probability of hitting this, and I don't want to risk 4.0.2 for this fix, which doesn't have a consensus yet.

Pushing out to 4.0.4.

*** Bug 1412900 has been marked as a duplicate of this bug. ***

4.0.6 has been the last oVirt 4.0 release; please re-target this bug.

Should be solved in 4.1.1. But on FC we need an LVM filter, moving back to POST.

We can test this with iSCSI:
1. Create a lot of storage domains (30?).
2. Create many disks in each domain (300?).
3. Put the host into maintenance.
4. Activate the host.

After activation, vdsm connects to storage, and all LUNs will be discovered by lvmetad and become active. On 4.1, with lvmetad disabled, no LV should be active, except the LVs vdsm activates for running VMs (a helper sketch for checking this is included at the end of this comment log). I don't know if you can reproduce the same issue the user had, since LVM fixed a scale issue since then, but even if this works with both 4.0.z and 4.1, you can compare the time vdsm spends running LVM commands (if you enable debug-level logs).

With FC storage, vdsm deactivates all LVs during startup. From a scale point of view we are OK, but we may have LVs that cannot be deactivated because there are guest LVs on top of oVirt raw volumes.

Yaniv, since this is mostly about scale, and the main issue was auto-activation by lvmetad, I think we can move this to MODIFIED and test.

Based on comment 18, moving to MODIFIED.

The status change MODIFIED->ON_QA was accidental.

I can only see the master patch attached; where is the 4.1/4.1.1 patch?

(In reply to Eyal Edri from comment #22)
> I can only see master patch attached, where is 4.1/4.1.1?

This bug should be fixed by the changes in bug 1374545, which is why we marked it as MODIFIED.

OK, please make sure to move it to ON_QA when relevant; it can't be done automatically.
Or close it as a duplicate if another bug has the fix.

Moving to ON_QA based on comment 24.

Tested on 4.1.1.4 and verified.
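For the test procedure described above, the key check after re-activating the host is which LVs end up active. Below is a minimal stand-alone sketch, not part of vdsm and not referenced in the original report, that counts active LVs per VG by reading the `lvs` report (the fifth character of `lv_attr` is the LV state, 'a' meaning active). The script name and output format are made up for illustration; it only reads LVM state and assumes it is run as root on the hypervisor.

```python
#!/usr/bin/env python
"""
Count active LVs per VG on a hypervisor (illustrative helper, not part of vdsm).

Run it before putting the host into maintenance and again after activating
the host, then compare the counts.
"""

from __future__ import print_function

import subprocess
from collections import defaultdict

# lv_attr is a fixed-width attribute string; character 5 (index 4)
# is the LV state, 'a' meaning the LV is active.
LVS_CMD = [
    "lvs",
    "--noheadings",
    "--separator", "|",
    "-o", "vg_name,lv_name,lv_attr",
]


def list_lvs():
    """Yield (vg_name, lv_name, lv_attr) for every LV reported by lvs."""
    out = subprocess.check_output(LVS_CMD)
    if isinstance(out, bytes):
        out = out.decode("utf-8")
    for line in out.splitlines():
        line = line.strip()
        if not line:
            continue
        vg_name, lv_name, lv_attr = line.split("|")
        yield vg_name, lv_name, lv_attr


def main():
    active = defaultdict(list)
    total = 0
    for vg_name, lv_name, lv_attr in list_lvs():
        total += 1
        if lv_attr[4] == "a":
            active[vg_name].append(lv_name)

    print("total LVs: %d" % total)
    for vg_name in sorted(active):
        print("%s: %d active LVs" % (vg_name, len(active[vg_name])))
        for lv_name in active[vg_name]:
            print("    %s" % lv_name)


if __name__ == "__main__":
    main()
```

Per comment 14, on a 4.1 host with lvmetad disabled the active set after activation should be limited to the LVs vdsm activates for running VMs; a large number of active image LVs right after activation would point to the auto-activation behavior this bug describes.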