Created attachment 1152639 [details]
All tests and time measures

This issue has been discussed on the users list in [1]. We're running oVirt 3.6.5.3-1 and lately we're experiencing issues with some VMs being paused because they're marked as non-responsive. Mostly they recover after a few seconds, but we want to debug this precisely so we can fix it for good.

Our scenario is the following:

~495 VMs, of which ~120 are constantly up.

3 datastores, all of them iSCSI-based:
* ds1: 2T, currently has 276 disks
* ds2: 2T, currently has 179 disks
* ds3: 500G, currently has 65 disks

7 hosts, all with mostly the same hardware. CPU and memory usage is currently very low (< 10%).

ds1 and ds2 live on the same physical backend (HP LeftHand P4000 G2), which exports two 2TB volumes. ds3 is a different storage backend, to which we're currently migrating some disks from ds1 and ds2.

Usually, when VMs become unresponsive, the whole host they run on becomes unresponsive too, which hints that the culprit is on the host side and not on the VM side. When that happens, the host is only recoverable after a reboot, since it's unable to reconnect. I must say this is not specific to this oVirt version: the same happened when we were using 3.6.4. It's also worth mentioning that we haven't made any configuration changes, and everything had been working quite well for a long time.

Less commonly, the host doesn't become unresponsive but the VMs on it do, and they never become responsive again, so we have to forcibly power them off and start them on a different host. In this case the connection with the host is never lost (the host stays Up, but any VM running on it is unresponsive).
We were monitoring the physical backend behind ds1 and ds2 and we suspect we've run out of IOPS: we're reaching the maximum specified by the manufacturer, so at certain times a host probably cannot complete a storage operation within some time limit and marks its VMs as unresponsive. That's why we set up ds3 and are migrating disks from ds1 and ds2 to it. When we run out of space on ds3, we'll create more, smaller volumes to keep migrating.

On the host side, when this happens, we've run repoplot on the vdsm log. It clearly shows a *huge* LVM response time (~30 secs.). Our host storage network is correctly configured on a 1G interface, with no errors on the host itself, the switches, etc. We have separate networks for management, storage and migration. Storage and migration have 1G each (for storage we use a bond of 2 interfaces in ALB mode (6)). Currently no host runs more than 30 VMs at a time.

At the time of this report, all hosts were working correctly except host5 and host6, which had very high LVM response times, so the relevant tests were done on these 2 hosts. We've also limited storage QoS to 10MB/s and 40 IOPS, as mentioned in [2], but the issue still happens, which makes me doubt this is purely an IOPS problem; each host handles about 600 LVs. Note that LVM response times are low under normal conditions (~1-2 seconds).

With this scenario, the following tests have been done based on Nir Soffer's suggestions (all tests are attached to this BZ):

* Ran both vgck and vgs, which took ~30 secs. on hosts 5 and 6, and ~1-2 secs. on the other hosts.
* Ran dmsetup; the output on hosts 5 and 6 is 10 times the size of the other hosts'.
* Ran 'vdsm-tool dump-volume-chains' on one of the problematic hosts, but after about 1 hour it had neither finished nor produced any output, so I assume it was hung and stopped it.
* Ran iotop on one of the problematic hosts; I could see 2 vgck and vgs processes, each reading at a rate of ~500Kb/s.

Based on the above tests, these are Nir Soffer's conclusions (copied literally):

---
I think the issue causing slow vgs and vgck commands is stale lvs on hosts 5 and 6. This may be the root cause for paused vms, but I cannot explain how it is related yet.

Comparing vgck on host1 and host5 - we can see that vgck opens 213 /dev/mapper/* devices, actually 53 unique devices, most of them opened 4 times. On host5, vgck opens 2489 devices (622 unique). This explains why the operation takes about 10 times longer.

Checking dmsetup table output, we can see that host1 has 53 devices, and host5 622 devices. Checking the device open count, host1 has 15 stale devices (Open count: 0), but host5 has 597 stale devices.

Leaving stale devices is a known issue, but we never had evidence that it causes trouble except warnings in lvm commands.

Please open an ovirt bug about this issue, and include all the files you sent so far in this thread, and all the vdsm logs on host5.

To remove the stale devices, you can do:

    for name in `dmsetup info -c -o open,name | awk '/ 0 / {print $2}'`; do
        echo "removing stale device: $name"
        dmsetup remove "$name"
    done
---

The attached zip file contains 5 directories:

* "1.first-logs": The original logs sent to the list. Includes the vdsm.log of one of the problematic hosts, the engine.log (for the same time frame) and the repoplot PDF report where LVM commands show ~30 sec. response times.
* "2.dmsetup-output": 'dmsetup table -v' run on a sane host (1) and on the problematic hosts (5, 6). You can see the output of the latter two is ten times the size of the sane one's.
* "3.time-measuring": Both vgs and vgck run with the -vvvv flag, measuring execution time (time vgs|vgck -vvvv --config ...) on all hosts. A TIMES file is also included with the commands' execution times.
Once again, the output sizes and execution times of the problematic hosts are 10 times those of the sane ones. It also includes the iostat output for host5, which is problematic.
* "4.whole-day-vdsm-logs": Full, untouched vdsm logs for Mar. 30 2016. Please note that Nir Soffer's suggested temporary workaround for stale devices was run at 20:54 local log time; it indeed fixed the issue, and afterwards vgs and vgck execution times returned to normal (~1-2 secs.).
* "5.vdsm-tool": The 'vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413' command, which initially stalled, succeeded after running the stale-device cleanup loop. It took ~1.30 min. and this directory contains the result of the execution.

[1]: http://lists.ovirt.org/pipermail/users/2016-April/039501.html
[2]: http://www.ovirt.org/documentation/sla/network-qos/
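For anyone reproducing the diagnosis, the stale-device detection behind the cleanup loop in the description can be exercised against canned output. This is a sketch only: the device names below are made up, and on a real host the input would come from 'dmsetup info -c -o open,name --noheadings' (column order assumed here to be open count first, then name).

```shell
# Illustrative dmsetup output; on a real host use:
#   dmsetup info -c -o open,name --noheadings
sample='1 36000eb3a4f1acbc20000000000000043
0 36000eb3a4f1acbc20000000000000044
0 36000eb3a4f1acbc20000000000000045'

# Select names whose open count is 0 -- these are the "stale" devices
# that the cleanup loop would pass to 'dmsetup remove'.
stale=$(printf '%s\n' "$sample" | awk '$1 == "0" {print $2}')
printf 'stale: %s\n' $stale
```

Counting the stale devices this way (rather than removing them) is a safe first step to compare a suspect host against a sane one, mirroring the 15-vs-597 comparison above.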
Warning: the cleanup script mentioned in the description is safe only if the host is in maintenance. If the host is up, restarting vdsm will clean stale devices.
- Bond mode 6 is not supported in oVirt (see 'bonding modes' in http://www.ovirt.org/documentation/admin-guide/administration-guide/ ); only modes 1, 2, 4 and 5 are - unless you've configured the network as non-VM (so it doesn't have a bridge on top).
- Why not use iSCSI multipathing instead of bonding? You would most likely get slightly better performance (latency and throughput). I assume it's due to the storage HW?
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
- Indeed, the storage network is configured as non-VM (and so is the migration network).
- We originally chose bonding for scalability: we knew the number of VMs was going to be very large, and to avoid future storage latency problems we went with ALB bonding. In any case, I don't think this issue is a network performance problem, as the current average network transfer rate is below 200Mb/s.
Did you have a chance to test the patch? http://gerrit.ovirt.org/56877 We'd like to know if this fixes the stale-lvs issue in your environment, and whether fixing the stale lvs also fixes the sporadic VM issues, which we have not investigated yet.
I was able to apply this patch on both problematic nodes at the time we were debugging this (hosts 5 and 6). The issue has not happened since then, and the machines have not become unresponsive again, as the VG operations' execution time is reasonably low right now (~2-3 secs).
This change is too intrusive for a z-stream, pushing out to 4.
The recent LVM fixes reduce the probability of hitting this, and I don't want to risk 4.0.2 for this fix, which doesn't have consensus yet. Pushing out to 4.0.4.
*** Bug 1412900 has been marked as a duplicate of this bug. ***
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
Should be solved in 4.1.1.
But on FC we need an lvm filter, moving back to POST.
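For context, an LVM filter on an FC host lives in /etc/lvm/lvm.conf and restricts which devices the host's own LVM scans, so shared LUNs are left to vdsm. The fragment below is illustrative only; the device path is a placeholder for the host's local PV, not a recommendation for any specific setup.

```
# /etc/lvm/lvm.conf (illustrative; device paths are placeholders)
devices {
    # accept the host's own local PV, reject everything else
    filter = [ "a|^/dev/sda2$|", "r|.*|" ]
}
```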
We can test this with iSCSI:

1. Create a lot of domains (30?)
2. Create many disks in each domain (300?)
3. Put the host into maintenance
4. Activate the host

After activation, vdsm connects to storage, and all LUNs will be discovered by lvmetad and become active. On 4.1, with lvmetad disabled, no lv should be active except the lvs vdsm activates for running vms.

I don't know if you can reproduce the same issue the user had, since lvm has fixed a scale issue since then, but even if this works on both 4.0.z and 4.1, you can compare the time vdsm spends running lvm commands (if you enable debug-level logs).

With FC storage, vdsm deactivates all lvs during startup. From a scale point of view we are OK, but we may have lvs that cannot be deactivated because there are guest lvs on top of ovirt raw volumes.

Yaniv, since this is mostly about scale, and the main issue was auto-activation by lvmetad, I think we can move this to MODIFIED and test.
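A quick way to verify the expectation in step 4 (no lvs active except those vdsm activated) is to check the 5th character of lv_attr, which is 'a' for active volumes. This is a sketch: the vg/lv names below are made up, and on a real host the input would come from 'lvs --noheadings -o vg_name,lv_name,lv_attr' run as root.

```shell
# Illustrative lvs output; on a real host use:
#   lvs --noheadings -o vg_name,lv_name,lv_attr
sample='vg0 lv-meta -wi-a-----
vg0 lv-data -wi-------
vg0 lv-ids  -wi-ao----'

# The 5th character of the attribute field is 'a' when the lv is active.
printf '%s\n' "$sample" | awk 'substr($3, 5, 1) == "a" {print $2}'
# prints lv-meta and lv-ids (the two active volumes in the sample)
```

Running this before and after host activation would show whether any unexpected lvs were auto-activated.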
Based on comment 18, moving to modified.
Status change Modified->ON_QA was accidental
I can only see master patch attached, where is 4.1/4.1.1?
(In reply to Eyal Edri from comment #22) > I can only see master patch attached, where is 4.1/4.1.1? This bug should be fixed by the changes in bug 1374545, this is why we marked it as modified.
OK, please make sure to move it to ON_QA when relevant; it can't be done automatically. Or close it as a duplicate if another bug has the fix.
Moving to ON_QA based on comment 24.
Tested on 4.1.1.4 and verified