Bug 1427183
| Summary: | [downstream clone - 4.0.7] [SCALE] VMs becoming non-responsive due to LVM slowdown caused by too many active devices | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | eberman |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | amarchuk, amureini, bazulay, bugs, eberman, eedri, emarcian, gklein, guchen, gveitmic, lsurette, nicolas, nsoffer, ratamir, srevivo, tnisan, ycui, ykaul, ylavi |
| Target Milestone: | ovirt-4.0.7 | Keywords: | Performance, TestOnly, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1331978 | Environment: | |
| Last Closed: | 2017-03-20 13:53:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1331978 | | |
| Bug Blocks: | | | |
Description
rhev-integ 2017-02-27 14:35:14 UTC
Warning: the cleanup script mentioned in the description is safe only if the host is in maintenance. If the host is up, restarting vdsm will clean stale devices. (Originally by Nir Soffer)

- Bond mode 6 is not supported in oVirt (see 'bonding modes' in http://www.ovirt.org/documentation/admin-guide/administration-guide/ ); only modes 1, 2, 4 and 5 are, unless you've configured the network as non-VM (so it doesn't have a bridge on top).
- Why not use iSCSI multipathing instead of bonding? You should get slightly better performance (most likely in latency and throughput). I assume it's due to the storage HW? (Originally by Yaniv Kaul)

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone. (Originally by rule-engine)

- Indeed, the storage network is configured as non-VM (as is the motion network).
- We originally decided to use bonding for scalability: we knew the number of VMs was going to be very large, and to avoid storage latency problems in the future we chose ALB bonding. In any case, I don't think this issue is a network performance problem, as the current average network transfer rate is below 200 Mb/s. (Originally by nicolas)

Did you have a chance to test the patch? http://gerrit.ovirt.org/56877 We would like to know whether it fixes the issue with stale lvs in your environment, and whether fixing the stale lvs also fixes the sporadic VM issues, which we have not investigated yet. (Originally by Nir Soffer)

I was able to apply this patch on both affected nodes at the time we were debugging this (hosts 5 and 6). The issue has not happened since then, and the machines have not become unresponsive again, as the VG operations' execution time is now reasonably low (~2-3 seconds). (Originally by nicolas)

This change is too intrusive for a z-stream, pushing out to 4. (Originally by Allon Mureinik)

The recent LVM fixes reduce the probability of hitting this, and I don't want to risk 4.0.2 for this fix, which doesn't have consensus yet. Pushing out to 4.0.4. (Originally by Allon Mureinik)

*** Bug 1412900 has been marked as a duplicate of this bug. *** (Originally by Germano Veit Michel)

4.0.6 has been the last oVirt 4.0 release, please re-target this bug. (Originally by Sandro Bonazzola)

Should be solved in 4.1.1. (Originally by Nir Soffer)

But on FC we need an lvm filter, moving back to POST. (Originally by Nir Soffer)

We can test this with iSCSI (a verification sketch follows this comment):
1. Create a lot of domains (30?)
2. Create many disks in each domain (300?)
3. Put the host into maintenance
4. Activate the host

After activation, vdsm connects to storage, and all LUNs will be discovered by lvmetad and become active. On 4.1, with lvmetad disabled, no lv should be active except the lvs vdsm activates for running VMs. I don't know if you can reproduce the same issue the user had, since lvm has fixed a scale issue since then, but even if this works on both 4.0.z and 4.1, you can compare the time vdsm spends running lvm commands (if you enable debug-level logs).

With FC storage, vdsm deactivates all lvs during startup. From a scale point of view we are OK, but we may have lvs that cannot be deactivated because there are guest lvs on top of oVirt raw volumes.

Yaniv, since this is mostly about scale, and the main issue was auto-activation by lvmetad, I think we can move this to modified and test. (Originally by Nir Soffer)
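For step 4 of the procedure above, one quick way to see what got auto-activated is to list the active LVs on the host right after activation. This is a minimal sketch, not taken from the bug; it assumes it is run as root on the hypervisor and relies only on standard lvm2 reporting (the fifth character of lv_attr is 'a' for an active LV):

```bash
# List active LVs as VG/LV; the 5th lv_attr character is 'a' when the LV is active.
lvs --noheadings -o vg_name,lv_name,lv_attr | awk '$3 ~ /^....a/ {print $1 "/" $2}'

# Count active LVs. On 4.1 (lvmetad disabled) only the LVs vdsm activated for
# running VMs should show up; on 4.0.z every LV discovered by lvmetad may be active.
lvs --noheadings -o lv_attr | awk '$1 ~ /^....a/' | wc -l
```

Running this before putting the host into maintenance and again after activation makes the comparison between 4.0.z and 4.1 hosts concrete.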
We believe this is solved by disabling lvmetad, preventing auto-activation of oVirt lvs. See the original bug for instructions on testing this.
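Since the fix hinges on lvmetad being disabled on the host, a spot-check along the following lines can confirm that. This is a hedged sketch, not part of the bug's test instructions; the unit names assume a RHEL 7 hypervisor with systemd:

```bash
# Effective lvm.conf setting; expected output on a fixed host: use_lvmetad=0
lvmconfig global/use_lvmetad

# lvmetad socket and service should not be running; expected: inactive
systemctl is-active lvm2-lvmetad.socket
systemctl is-active lvm2-lvmetad.service
```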