Bug 1331978 - [SCALE] VMs becoming non-responsive due to LVM slowdown caused by too many active devices
Summary: [SCALE] VMs becoming non-responsive due to LVM slowdown caused by too many ac...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.26
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.19.8
Assignee: Nir Soffer
QA Contact: guy chen
URL:
Whiteboard:
Duplicates: 1412900 (view as bug list)
Depends On: 1374545
Blocks: 1063871 deactivate_lv_on_domain_deactivation 1427183
 
Reported: 2016-04-30 21:22 UTC by nicolas
Modified: 2018-07-31 22:11 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1427183 (view as bug list)
Environment:
Last Closed: 2017-04-21 09:36:30 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
ylavi: planning_ack+
tnisan: devel_ack+
acanan: testing_ack+


Attachments
All tests and time measures (16.77 MB, application/zip)
2016-04-30 21:22 UTC, nicolas
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 56876 0 master ABANDONED blockSD: Storage domain life cycle management 2021-02-10 13:19:43 UTC
oVirt gerrit 56877 0 ovirt-3.6 ABANDONED blockSD: Avoid stale lvs 2021-02-10 13:19:42 UTC

Description nicolas 2016-04-30 21:22:41 UTC
Created attachment 1152639 [details]
All tests and time measures

This issue has been discussed on the users list in [1].

We're running oVirt 3.6.5.3-1 and lately we've been experiencing issues with some VMs being paused because they're marked as non-responsive. Usually they recover after a few seconds, but we want to debug this problem precisely so we can fix it for good.

Our scenario is the following:

~495 VMs, of which ~120 are constantly up
3 datastores, all of them iSCSI-based:
  * ds1: 2T, currently has 276 disks
  * ds2: 2T, currently has 179 disks
  * ds3: 500G, currently has 65 disks
7 hosts: all with mostly the same hardware; CPU and memory usage is currently very low (< 10%).

ds1 and ds2 are physically the same backend (HP LeftHand P4000 G2) which exports two 2TB volumes. ds3 is a different storage backend where we're currently migrating some disks from ds1 and ds2.

Usually, when VMs become unresponsive, the whole host they run on becomes unresponsive too, which gives a hint about the problem: my bet is that the culprit is on the host side rather than on the VM side. When that happens, the host itself becomes non-responsive and is only recoverable after a reboot, since it's unable to reconnect. I must say this is not specific to this oVirt version: the same happened when we were using v3.6.4. It's also worth mentioning that we've made no configuration changes and everything had been working quite well for a long time.

Less commonly, we have situations where the host doesn't become unresponsive but the VMs on it do, and they never become responsive again, so we have to forcibly power them off and start them on a different host. In this case the connection with the host is never lost (so basically the host is Up, but any VM running on it is unresponsive).

We were monitoring the physical backend behind ds1 and ds2 and we suspect we've run out of IOPS, since we're reaching the maximum specified by the manufacturer; probably at certain times the host cannot complete a storage operation within some time limit and marks the VMs as unresponsive. That's why we've set up ds3 and are migrating disks from ds1 and ds2 to it. When we run out of space on ds3 we'll create more, smaller volumes and keep migrating.

On the host side, when this happens, we've run repoplot on the vdsm log. There is clearly a *huge* LVM response time (~30 secs.). Our host storage network is correctly configured on a 1G interface, with no errors on the host itself, the switches, etc. We have separate networks for management, storage and motion; storage and motion have 1G each (and for storage we use a bond of 2 interfaces in ALB mode (6)). Currently no host has more than 30 VMs at a time. At the time of this report, all hosts were working correctly except host5 and host6, which had very high LVM response times, so the relevant tests were done on these 2 hosts.

We've also limited storage QoS to 10MB/s and 40 IOPS, as mentioned in [2], but the issue still happens, which makes me doubt this is just an IOPS problem; each host handles about 600 LVs. Note that LVM response times are low (~1-2 seconds) under normal conditions.

With this scenario, the following tests were done based on Nir Soffer's suggestions (all tests are attached to this BZ; the assumed invocations are sketched after the list):

* Running both the vgck and vgs commands took ~30 secs. on hosts 5 and 6, and ~1-2 secs. on the other hosts.
* dmsetup: the output on hosts 5 and 6 is 10 times the size of the other hosts'.
* The vdsm-tool dump-volume-chains command was run on one of the problematic hosts, but it neither finished nor produced any output within about 1 hour, so I assume it was hung and stopped it.
* iotop was run on one of the problematic hosts, and I could see 2 vgck and vgs processes reading at a rate of ~500Kb/s each.
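
A minimal sketch of the invocations behind these tests (assumed forms; the actual runs also passed the LVM --config filter used by vdsm, omitted here):

    # Time the VG scans; the -vvvv trace goes to stderr, so capture it in a log.
    time vgs  -vvvv > vgs-host5.log  2>&1
    time vgck -vvvv > vgck-host5.log 2>&1

    # Dump the device-mapper tables and open counts for comparison between hosts.
    dmsetup table -v
    dmsetup info -c -o open,name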

Based on the above tests, these are Nir Soffer's conclusions (copied literally):

---
    I think the issue causing slow vgs and vgck commands is stale LVs on hosts 5 and 6. This may be the root cause of the paused VMs, but I cannot explain how it is related yet.

    Comparing vgck on host1 and host5, we can see that on host1 vgck opens 213 /dev/mapper/* devices, actually 53 unique devices, most of them opened 4 times. On host5, vgck opens 2489 devices (622 unique). This explains why the operation takes about 10 times longer.

    Checking the dmsetup table output, we can see that host1 has 53 devices and host5 has 622 devices.

    Checking the device open count, host1 has 15 stale devices (Open count: 0), but host5 has 597 stale devices.

    Leaving stale devices behind is a known issue, but we never had evidence that it causes trouble beyond warnings in LVM commands.

    Please open an ovirt bug about this issue, and include all the files you sent so far in this thread, and all the vdsm logs on host5.

    To remove the stale devices, you can do:

    for name in `dmsetup info -c -o open,name | awk '/ 0 / {print $2}'`; do
        echo "removing stale device: $name"
        dmsetup remove $name
    done
---
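
For reference, a quick way to count the stale devices (same "Open count: 0" criterion as above) before running the cleanup loop:

    # Count device-mapper devices that nothing holds open.
    dmsetup info -c -o name,open --noheadings | awk '$2 == 0' | wc -l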
	
Attached zip file contains 5 directories where:

* "1.first-logs": These are the original logs sent to the list. Includes the vdsm.log of one of the problematic host, engine.log (in the same time frame) and the repoplot PDF report where LV commands show ~30sec. response times.

* "2.dmsetup-output": 'dmsetup table -v' command run on a sane host (1) and on the problematic hosts (5, 6). You can appreciate the size of this 2 latter are ten times the size of the sane one.

* "3.time-measuring": Both vgs and vgck commands run with the -vvvv flag and measuring execution time (time vgs|vgck -vvvv --config ...) on all hosts. Also, a TIMES file is included with the commands' execution times. Once again, you can appreciate the size and execution times of problematic hosts are 10 times the values of the sane ones. It also includes the iostat output for host5, which is problematic.

* "4.whole-day-vdsm-logs": Full untouched vdsm logs of day Mar. 30 2016. Please note that Nir Soffer's suggested temporary solution for stale devices has been run at 20:54 of local log time, which indeed fixed the issue, and after that vgs and vgck execution times became normal (~1-2 secs.).

* "5.vdsm-tool": The previous 'vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413' command which initially staled was successful after running the stale devices cleanup loop. It took ~1.30 min. and this directory contains the result of the execution.

[1]: http://lists.ovirt.org/pipermail/users/2016-April/039501.html
[2]: http://www.ovirt.org/documentation/sla/network-qos/

Comment 1 Nir Soffer 2016-04-30 22:25:31 UTC
Warning: the cleanup script mentioned in the description is safe only if 
the host is in maintenance.

If the host is up, restarting vdsm will clean stale devices.
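
A sketch of that safer path on an Up host (assuming a systemd-based host):

    # Per the comment above, restarting vdsm cleans up the stale devices it left behind.
    systemctl restart vdsmd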

Comment 2 Yaniv Kaul 2016-05-01 05:56:30 UTC
- Bond mode 6 is not supported in oVirt (see 'bonding modes' in http://www.ovirt.org/documentation/admin-guide/administration-guide/ ), only 1, 2, 4, 5 are - unless you've configured the network as non-VM (so it doesn't have a bridge on top).
- Why not use iSCSI multipathing instead of bonding? You should get slightly better performance (latency and throughput most likely). I assume it's due to the storage HW?

Comment 3 Red Hat Bugzilla Rules Engine 2016-05-01 08:23:06 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 4 nicolas 2016-05-01 12:41:30 UTC
- Indeed, the storage network is configured as non-VM (neither is the motion network).
- We originally chose bonding for scalability. We knew the number of VMs was going to be large, so to avoid storage latency problems in the future we went with ALB bonding. In any case, I don't think this is a network performance problem, as the current average network transfer is below 200Mb/s.

Comment 5 Nir Soffer 2016-05-15 12:42:20 UTC
Did you have a chance to test the patch?
http://gerrit.ovirt.org/56877

We would like to know if this fixes the issue with stale LVs in your environment, and
whether fixing the stale LVs also fixes the sporadic VM issues, which we have not
investigated yet.

Comment 6 nicolas 2016-05-16 08:22:43 UTC
I was able to apply this patch on both problematic nodes at the time we were debugging this (hosts 5 and 6). The issue has not happened since, and no machines have become unresponsive again, as the VG operations' execution time is now reasonably low (~2-3 secs).

Comment 9 Allon Mureinik 2016-07-06 16:53:20 UTC
This change is too intrusive for a z-stream, pushing out to 4.

Comment 12 Allon Mureinik 2016-07-27 12:52:05 UTC
The recent LVM fixes reduce the probability of hitting this, and I don't want to risk 4.0.2 for a fix that doesn't have consensus yet. Pushing out to 4.0.4.

Comment 13 Germano Veit Michel 2017-01-16 01:59:56 UTC
*** Bug 1412900 has been marked as a duplicate of this bug. ***

Comment 14 Sandro Bonazzola 2017-01-25 07:55:06 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 15 Nir Soffer 2017-02-14 11:40:40 UTC
Should be solved in 4.1.1.

Comment 16 Nir Soffer 2017-02-14 11:42:22 UTC
But on FC we need an LVM filter, so moving back to POST.
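
For context, the LVM filter referred to here is a devices/filter entry in /etc/lvm/lvm.conf that restricts scanning and auto-activation to the host's own local devices, so guest/oVirt LVs on the SAN LUNs are never activated by the host itself. A hypothetical example (device paths are illustrative only):

    # In the devices { } section of /etc/lvm/lvm.conf:
    # accept only the host's own PV, reject everything else.
    filter = [ "a|^/dev/sda2$|", "r|.*|" ]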

Comment 18 Nir Soffer 2017-02-14 20:39:44 UTC
We can test this with iSCSI:

1. Create a lot of domains (30?)
2. Create many disks in each domain (300?)
3. Put host to maintenance
4. Activate host

After activation, vdsm connects to storage, and all LUNs will be discovered
by lvmetad and become active.

On 4.1, with lvmetad disabled, no LV should be active, except the LVs vdsm
activates for running VMs.
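
A sketch of how that expectation could be checked after activating the host (assumed check, using the "active" flag in lv_attr):

    # List active LVs; on 4.1 only the LVs vdsm activated for running VMs should
    # show up, while on 4.0.z the LVs auto-activated by lvmetad will appear too.
    lvs --noheadings -o vg_name,lv_name,lv_attr | awk 'substr($3, 5, 1) == "a"'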

I don't know if you can reproduce the same issue the user had, since LVM has fixed
a scale issue since then, but even if this works with both 4.0.z and 4.1, you
can compare the time vdsm spends running LVM commands (if you enable debug
level logs).

With FC storage, vdsm deactivates all LVs during startup. From a scale point of
view we are OK, but we may have LVs that cannot be deactivated because there are
guest LVs on top of oVirt raw volumes.
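
A sketch for spotting those (assumed criterion: a non-zero device-mapper open count means something, such as a guest LV or a mount, still holds the device):

    # LVs whose dm device is still held open and therefore cannot be deactivated.
    dmsetup info -c -o name,open --noheadings | awk '$2 > 0'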

Yaniv, since this is mostly about scale, and the main issue was auto-activation
by lvmetad, I think we can move this to modified and test.

Comment 20 Nir Soffer 2017-02-27 15:02:02 UTC
Based on comment 18, moving to modified.

Comment 21 Dusan Fodor 2017-03-04 23:42:46 UTC
Status change Modified->ON_QA was accidental

Comment 22 Eyal Edri 2017-03-19 08:01:35 UTC
I can only see master patch attached, where is 4.1/4.1.1?

Comment 23 Nir Soffer 2017-03-19 14:02:01 UTC
(In reply to Eyal Edri from comment #22)
> I can only see master patch attached, where is 4.1/4.1.1?

This bug should be fixed by the changes in bug 1374545, which is why we marked it
as modified.

Comment 24 Eyal Edri 2017-03-19 14:29:55 UTC
OK, please make sure to move it to ON_QA when relevant; it can't be done automatically.

Or close it as a duplicate if another bug has the fix.

Comment 25 Nir Soffer 2017-03-19 14:34:32 UTC
Moving to ON_QA based on comment 24.

Comment 26 guy chen 2017-03-30 10:48:37 UTC
Tested on 4.1.1.4 and verified

