Description of problem:
The hostdev list_by_caps method can run for a very long time if the number of storage devices on a host is high (~1000).

Version-Release number of selected component (if applicable):
VDSM tag v4.18.15.3

How reproducible:
100%

Steps to Reproduce:
1. Acquire a host with ~1000+ storage devices (or mock such an environment).
2. Run vdsm-restore-net-config, or any action that requires a refresh of host devices (a minimal timing sketch follows below).

Actual results:
The call takes hours to finish.

Expected results:
The call is executed within a reasonable timeframe.

Additional info:
Caused by an inefficient algorithm for storage device construction.
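For step 2, a minimal timing sketch of the slow path. This is illustrative only: the import path and the zero-argument call to list_by_caps are assumptions based on the method named in this report, and may differ between VDSM versions.

import time

from vdsm import hostdev

start = time.time()
# list_by_caps is the method named in this report; calling it with no
# arguments is an assumption and may need adjusting per VDSM version.
devices = hostdev.list_by_caps()
print('parsed %d host devices in %.2f seconds'
      % (len(devices), time.time() - start))

On an affected host the printed time should grow to minutes or hours as the storage device count approaches ~1000.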
Can you attach the logs showing how much time is eaten by _restore_sriov_numvfs()? Do you suggest that the problem is in libvirt's listAllDevices()? It was introduced in 3.6. At worst, we can add a configuration option to disable it for people who do not care about SR-IOV.
Sorry, no exact figure for that specific function; that said, the problem is in the list_by_caps logic, and it was introduced by the addition of proper SCSI parsing. Currently, the whole tree is parsed with O(n) libvirt calls and O(n^2) passes over the tree. It wasn't really designed for such a large number of disks.

To back up my statement, I've added a VDSM test that parses ~3000 devices and off-line tested ~30000 devices (most of them storage, leading to worst-case performance). Without any code improvements, I interrupted the parsing at 8 minutes:

Ran 1 test in 506.604s
^C

real    8m26.956s
user    8m26.572s
sys     0m0.385s

as that seemed to easily reproduce what was happening in this bug. After a few perf optimizations, bringing the complexity down to O(1) libvirt calls and O(n) tree passes, the same number of devices can be parsed in ~0.35 seconds:

Ran 1 test in 0.337s

OK

real    0m0.696s
user    0m0.640s
sys     0m0.056s

and 30000 devices can be parsed in ~3.2 seconds:

Ran 1 test in 3.210s

OK

real    0m3.586s
user    0m3.484s
sys     0m0.100s
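For illustration, a sketch of the complexity change described above. This is not the actual VDSM patch: the libvirt calls used here are real, but the tree construction is simplified to show only the shape of the fix.

from xml.etree import ElementTree

def parse_devices_slow(conn):
    # Old shape of the problem: one libvirt lookup per device name,
    # i.e. O(n) libvirt calls, plus a linear scan over the whole name
    # list for every parent resolution -> O(n^2) passes.
    names = [dev.name() for dev in conn.listAllDevices()]
    tree = {}
    for name in names:
        dev = conn.nodeDeviceLookupByName(name)  # libvirt call per device
        xml = ElementTree.fromstring(dev.XMLDesc(0))
        parent = xml.findtext('parent')
        tree[name] = parent if parent in names else None  # linear scan
    return tree

def parse_devices_fast(conn):
    # New shape: a single listAllDevices() call, one dict keyed by
    # device name, then one pass to wire up parents -> O(1) libvirt
    # calls and O(n) passes over the tree.
    parsed = {}
    for dev in conn.listAllDevices():  # single libvirt call
        xml = ElementTree.fromstring(dev.XMLDesc(0))
        parsed[dev.name()] = xml
    return {name: (xml.findtext('parent')
                   if xml.findtext('parent') in parsed else None)
            for name, xml in parsed.items()}

Here conn would be e.g. libvirt.open('qemu:///system'). With ~1000 storage devices, dropping the per-device lookup and replacing the list scan with a dict membership test is what moves the runtime from hours to sub-second.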
More patches under https://gerrit.ovirt.org/#/q/topic:hostdev-caching; all of them are in.
Reassigning for backport to 4.0.
This optimization doesn't require doc_text: same behaviour as before, just faster.
If we need this in 4.0.7, then this is not MODIFIED.
(In reply to Francesco Romani from comment #8)
> if we need this in 4.0.7, then this is not MODIFIED

It is; this is a 4.1 bug.
I have tested this on the latest 4.1 build 4 with 1000 storage devices - fixed:

[root@ucs1-b420-2 ~]# time python /usr/share/vdsm/vdsm-restore-net-config

real    0m0.228s
user    0m0.151s
sys     0m0.077s

[root@ucs1-b420-2 ~]# time python /usr/share/vdsm/vdsm-restore-net-config

real    0m0.237s
user    0m0.161s
sys     0m0.076s

[root@ucs1-b420-2 ~]# time python /usr/share/vdsm/vdsm-restore-net-config

real    0m0.256s
user    0m0.180s
sys     0m0.077s