Description of problem:
I run a 4.1 system with 1 host, 1 engine, and 200 VMs with 5 disks each - a total of 1000 disks. When all VMs are down, all snapshot operations work. After I started up 100 VMs, creating a snapshot (with or without memory) as well as deleting a snapshot fails.

Version-Release number of selected component (if applicable):
4.1

How reproducible:
All the time

Steps to Reproduce:
1. Build a 4.1 system with 1 host, 1 engine, and 200 VMs with 5 disks each
2. Start up 100 VMs
3. Try any snapshot operation

Actual results:
Snapshot operation fails

Expected results:
Snapshot operation succeeds
Nir - the LVM deactivation issue is the reason I've raised comments on https://gerrit.ovirt.org/#/c/66932/
Created attachment 1224909 [details] lvmlocal.conf file for testing
(In reply to Nir Soffer from comment #4)
> Created attachment 1224909 [details]
> lvmlocal.conf file for testing

Yep, that's the next item on our list. Currently, 100 VMs, each with 5 disks. Each VM has 1 disk with an OS on it (not sure how many LVs are on there), and this is what we see:

[root@ucs1-b420-2 qemu]# vgs
  VG                                   #PV  #LV #SN Attr   VSize   VFree
  70e79239-983a-4376-9cde-d1744488334c   1 1020   0 wz--n-   2.00t 1014.88g
  7a7e988c-553c-45a9-b612-c4f753e72c36   1    8   0 wz--n-   5.00t    5.00t
  ca0ed831-83ac-4057-8a86-0a23a6f15159   1   40   0 wz--n-   5.00t    4.96t
  vg0                                     1    3   0 wz--n- 277.97g   60.00m

So filtering should probably cut the number of LVs scanned nicely.

Guy - per our discussion today with Eyal, the lvm filter is next and hopefully should improve the situation here.
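For reference, a quick way to count how many of those lvs are actually active per vg (a minimal sketch; it assumes the standard lv_attr layout, where the 5th character is 'a' for an active lv):

# count active lvs per vg
lvs --noheadings -o vg_name,lv_attr | awk '$2 ~ /^....a/ {n[$1]++} END {for (vg in n) print vg, n[vg]}'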
Guy, can you test this again with lvmlocal.conf from attachment 1224909 [details]?

Without this filter, lvm operations scan all the devices under /dev/mapper/.*. These devices include both the multipath devices used by a storage domain and the ovirt lvs. In your case, you have 1000 active lvs, so each lvm operation may need to scan 1000 devices, and this takes time.

With this filter, lvm operations will never scan ovirt lvs, only multipath devices. If the issue is slow lvm operations, this can fix the issue.

To test this configuration, copy the file to /etc/lvm/lvmlocal.conf after you provision the hosts.

After you test this, I would also like to test (separately) this patch:
https://gerrit.ovirt.org/66932/

This patch limits the number of concurrent lvm operations. This may slow down lvm operations, have no effect, or make them faster. I believe it will be a small improvement, and it also limits the load on the machine.

Your use case with 1000 active disks is exactly the situation we should test, because it makes lvm commands much slower.
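For illustration only - this is not the contents of attachment 1224909, and the device name patterns (multipath user_friendly_names, ovirt lvs appearing under /dev/mapper) are assumptions - a host-level global_filter of this general shape is what keeps lvm commands from scanning the ovirt lvs:

# Sketch only: adjust the accept pattern to match your multipath naming.
cat > /etc/lvm/lvmlocal.conf <<'EOF'
devices {
    # Accept multipath devices, reject everything else under /dev/mapper
    # (i.e. the ovirt lvs); devices outside /dev/mapper are not affected.
    global_filter = [ "a|^/dev/mapper/mpath.*|", "r|^/dev/mapper/.*|" ]
}
EOF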
(In reply to Yaniv Kaul from comment #5)
> (In reply to Nir Soffer from comment #4)
> > Created attachment 1224909 [details]
> > lvmlocal.conf file for testing

The filter is not only for guest lvs on ovirt raw lvs, but for ignoring the ovirt lvs themselves, since scanning them is a waste of time and probably a scale issue. We have a bug that we could never reproduce about a timeout when starting a vm with block storage because of slow lvm commands. Maybe this is the same issue.
I checked the vdsm lvm code again, and vdsm already ignores ovirt lvs in the lvm operations it runs, by using a filter that includes only the multipath devices (either the devices used by some storage domain, or all multipath devices on a host). So the lvmlocal.conf configuration will not affect vdsm lvm operations, only other lvm operations on the host.

Regardless, it is a good idea to test this configuration for this scale scenario.

We will have to spend time with the logs to understand the failures reported here.

Yaniv, is this bug really urgent?
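Side note, in case it helps while testing: you can check which filter host-level lvm commands will actually use (this is separate from the filter vdsm passes on each command line). lvmconfig is assumed to be available on a reasonably recent lvm2; older hosts have "lvm dumpconfig" instead:

lvmconfig --type full devices/global_filter
lvmconfig --type full devices/filter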
It's not urgent for feature freeze, but certainly for 4.1 as we can't declare support for 1000 disks without understanding this or providing a workaround.
Guy, it would be great to re-test with a newer 4.1 and lvmetad disabled.
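In case it saves a round trip, a minimal sketch of disabling lvmetad, assuming RHEL/CentOS 7 service names (the lvm.conf change is shown as a comment rather than automated):

systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
systemctl disable lvm2-lvmetad.service lvm2-lvmetad.socket
# then set in /etc/lvm/lvm.conf, section "global":
#   use_lvmetad = 0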
*** This bug has been marked as a duplicate of bug 1408977 ***