Bug 1424853 - RHEV-H host activation fails after reboot on setup with large number of lvs
Summary: RHEV-H host activation fails after reboot on setup with large number of lvs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.0
Assignee: Dan Kenigsberg
QA Contact: guy chen
URL:
Whiteboard:
Depends On: 1428637 1429203
Blocks: 1451240
 
Reported: 2017-02-19 21:27 UTC by Greg Scott
Modified: 2020-05-14 15:39 UTC
CC List: 25 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1451240 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:50:23 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
Script to get lvm stats from sosreport (2.20 KB, text/plain)
2017-03-03 18:07 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1400446 0 unspecified CLOSED RHV-H starts very slowly when too many LUNs are connected to the host (lvm filter?) 2021-06-10 11:49:00 UTC
Red Hat Bugzilla 1404563 0 unspecified CLOSED [scale] Register RHVH to RHVM failed when too many LUNs (100) are connected to the host 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1435198 0 urgent CLOSED Udev/haldaemon consume 100 percent of the CPU for more than an hour when booting with lots of LVM objects 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2018:1489 0 None None None 2018-05-15 17:52:06 UTC

Internal Links: 1400446 1404563 1435198

Description Greg Scott 2017-02-19 21:27:06 UTC
Description of problem:

Note, this is with RHEV 3.5.8, but the bugzilla form has no selection for 3.5.8.

In a RHEV environment with around 2000 pooled, stateless VMs, host activations slow down and eventually stop working reliably. On the problem hosts, we see lsblk | wc -l showing tens of thousands of LVM objects. They appear to be orphan LVM objects and, so far, we aren't able to find where they come from.

We also find udevd and hald consuming most of the hosts' CPU capacity trying to enumerate all those LVM objects.  This wreaks havoc when trying to activate RHEV-H systems into the environment. All these bogus LVM objects also cause numerous other performance issues.


Version-Release number of selected component (if applicable):
RHEVM 3.5.8 with RHEV-H-6 2016-0104.  It may also be true on newer versions.


How reproducible:
Always


Steps to Reproduce:
Build a RHEV 3.5.8 environment with a few RHEVH-6-2016-0104 hosts. Make a Fibre Channel datacenter with a few LUNs and 11 storage domains.  Use the EMC XIO storage in TLV.

Create a Windows VM template. Make sure its virtual disk is thin provisioned.  Set it for Wipe After Delete. Create 11 pools, one pool per storage domain, each with 170 stateless VMs based on this template.  All those VMs should have WAD set, flowing from the template.  Each VM will consume 2 logical volumes (LV).  One LV for the VM, the other for its snapshot.  So each storage domain should have 340 LVs.

Use automation to boot all those VMs, run them for a few minutes, and shut them down.  Keep an eye on the lsblk | wc -l count on all the hosts.  Delete the VMs and create new ones, while watching lsblk | wc -l.

Repeat the above paragraph a few times.  My hunch is that the lsblk count will keep growing.  It should drop after deleting the VMs, but I suspect it won't.

Reboot one of the hosts in this environment and try to activate it. In an ssh session on that host, watch what happens with udevd and hald, and run lsblk | wc -l in a loop every 10 seconds.
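
For example, a simple watch loop like this (a sketch; any equivalent monitoring works) shows both the block device count and the udevd/hald CPU usage every 10 seconds:

    while true
    do
        lsblk | wc -l
        # one batch-mode top snapshot, filtered to the two daemons of interest
        top -b -n 1 | egrep 'udevd|hald'
        sleep 10
    done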

Upgrade RHEVM to 3.6 and hosts to RHEVH-7 for 3.6.  Try the above steps again.

Upgrade the environment to RHV4 and try again.


Actual results:
Tens of thousands of LVM objects show up that cannot be accounted for by normal operation.


Expected results:
The number of LVM objects in the system should make sense based on the number of VMs.  It should not be an order of magnitude too large, as we see in this environment.


Additional info:

During activation, running lsblk | wc -l in a loop shows the LV count dropping after restarting vdsmd while activating a host.  We captured the output below showing the sequence of events from one host today:

The system is going down for halt NOW!
login as: root
root@host0022's password:
Last login: Sun Feb 19 20:16:29 2017 from 10.3.19.39
[root@host0022 ~]# bash
[root@host0022 ~]#
[root@host0022 ~]# while true
> do
> lsblk |wc -l
> sleep 10
> done
15858
15858
15858
15858
^C
[root@host0022 ~]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop                                         [  OK  ]
vdsm: Running run_final_hooks                              [  OK  ]
vdsm stop                                                  [  OK  ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
Upgrading to unified persistence if needed
Upgrading to v3.x networking if needed
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]
[root@host0022 ~]# bash
[root@host0022 ~]#
[root@host0022 ~]# while true
> do
> lsblk |wc -l
> sleep 10
> done
15858
15858
^C
[root@host0022 ~]# # activating in RHEVM
[root@host0022 ~]# bash
[root@host0022 ~]#
[root@host0022 ~]# while true
> do
> lsblk |wc -l
> sleep 10
> done
15858
15858
15858
15538
^C
[root@host0022 ~]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop                                         [  OK  ]
vdsm: Running run_final_hooks                              [  OK  ]
vdsm stop                                                  [  OK  ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
Upgrading to unified persistence if needed
Upgrading to v3.x networking if needed
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]
[root@host0022 ~]# while true; do lsblk |wc -l; sleep 10; done
15538
15538
lsblk: dm-2816: failed to get device path
lsblk: dm-2899: failed to get device path
15452
lsblk: dm-114: failed to get device path
lsblk: dm-115: failed to get device path
lsblk: dm-192: failed to get device path
lsblk: dm-55: failed to get device path
lsblk: dm-56: failed to get device path
lsblk: dm-57: failed to get device path
lsblk: dm-58: failed to get device path
lsblk: dm-59: failed to get device path
lsblk: dm-60: failed to get device path
lsblk: dm-61: failed to get device path
lsblk: dm-62: failed to get device path
lsblk: dm-63: failed to get device path
lsblk: dm-64: failed to get device path
lsblk: dm-65: failed to get device path
lsblk: dm-66: failed to get device path
lsblk: dm-67: failed to get device path
lsblk: dm-68: failed to get device path
lsblk: dm-69: failed to get device path
lsblk: dm-70: failed to get device path
lsblk: dm-71: failed to get device path
lsblk: dm-72: failed to get device path
lsblk: dm-73: failed to get device path
lsblk: dm-74: failed to get device path
lsblk: dm-75: failed to get device path
lsblk: dm-76: failed to get device path
lsblk: dm-77: failed to get device path
lsblk: dm-78: failed to get device path
lsblk: dm-79: failed to get device path
lsblk: dm-80: failed to get device path
lsblk: dm-81: failed to get device path
lsblk: dm-82: failed to get device path
lsblk: dm-83: failed to get device path
lsblk: dm-84: failed to get device path
lsblk: dm-85: failed to get device path
lsblk: dm-86: failed to get device path
lsblk: dm-87: failed to get device path
lsblk: dm-88: failed to get device path
lsblk: dm-89: failed to get device path
lsblk: dm-90: failed to get device path
lsblk: dm-91: failed to get device path
lsblk: dm-92: failed to get device path
lsblk: dm-93: failed to get device path
lsblk: dm-94: failed to get device path
lsblk: dm-95: failed to get device path
lsblk: dm-96: failed to get device path
lsblk: dm-97: failed to get device path
lsblk: dm-98: failed to get device path
lsblk: dm-99: failed to get device path
lsblk: dm-100: failed to get device path
lsblk: dm-101: failed to get device path
lsblk: dm-102: failed to get device path
lsblk: dm-103: failed to get device path
lsblk: dm-104: failed to get device path
lsblk: dm-105: failed to get device path
lsblk: dm-106: failed to get device path
lsblk: dm-107: failed to get device path
lsblk: dm-108: failed to get device path
lsblk: dm-109: failed to get device path
lsblk: dm-110: failed to get device path
lsblk: dm-111: failed to get device path
lsblk: dm-112: failed to get device path
lsblk: dm-113: failed to get device path
lsblk: dm-114: failed to get device path
lsblk: dm-115: failed to get device path
lsblk: dm-116: failed to get device path
lsblk: dm-117: failed to get device path
lsblk: dm-118: failed to get device path
lsblk: dm-119: failed to get device path
lsblk: dm-120: failed to get device path
lsblk: dm-121: failed to get device path
lsblk: dm-122: failed to get device path
lsblk: dm-123: failed to get device path
lsblk: dm-124: failed to get device path
lsblk: dm-125: failed to get device path
lsblk: dm-126: failed to get device path
lsblk: dm-127: failed to get device path
lsblk: dm-128: failed to get device path
lsblk: dm-129: failed to get device path
lsblk: dm-130: failed to get device path
lsblk: dm-131: failed to get device path
lsblk: dm-132: failed to get device path
lsblk: dm-133: failed to get device path
lsblk: dm-134: failed to get device path
lsblk: dm-135: failed to get device path
lsblk: dm-136: failed to get device path
lsblk: dm-137: failed to get device path
lsblk: dm-138: failed to get device path
lsblk: dm-139: failed to get device path
lsblk: dm-140: failed to get device path
lsblk: dm-141: failed to get device path
lsblk: dm-142: failed to get device path
lsblk: dm-143: failed to get device path
lsblk: dm-144: failed to get device path
lsblk: dm-145: failed to get device path
lsblk: dm-146: failed to get device path
lsblk: dm-147: failed to get device path
lsblk: dm-148: failed to get device path
lsblk: dm-149: failed to get device path
lsblk: dm-150: failed to get device path
lsblk: dm-151: failed to get device path
lsblk: dm-152: failed to get device path
lsblk: dm-153: failed to get device path
lsblk: dm-154: failed to get device path
lsblk: dm-155: failed to get device path
lsblk: dm-156: failed to get device path
lsblk: dm-157: failed to get device path
lsblk: dm-158: failed to get device path
lsblk: dm-159: failed to get device path
lsblk: dm-160: failed to get device path
lsblk: dm-161: failed to get device path
lsblk: dm-162: failed to get device path
lsblk: dm-163: failed to get device path
lsblk: dm-164: failed to get device path
lsblk: dm-165: failed to get device path
lsblk: dm-166: failed to get device path
lsblk: dm-167: failed to get device path
lsblk: dm-168: failed to get device path
lsblk: dm-169: failed to get device path
lsblk: dm-170: failed to get device path
lsblk: dm-171: failed to get device path
lsblk: dm-172: failed to get device path
lsblk: dm-173: failed to get device path
lsblk: dm-174: failed to get device path
lsblk: dm-175: failed to get device path
lsblk: dm-176: failed to get device path
lsblk: dm-177: failed to get device path
lsblk: dm-178: failed to get device path
lsblk: dm-179: failed to get device path
lsblk: dm-180: failed to get device path
lsblk: dm-181: failed to get device path
lsblk: dm-182: failed to get device path
lsblk: dm-183: failed to get device path
lsblk: dm-184: failed to get device path
lsblk: dm-185: failed to get device path
lsblk: dm-186: failed to get device path
lsblk: dm-187: failed to get device path
lsblk: dm-188: failed to get device path
lsblk: dm-189: failed to get device path
lsblk: dm-190: failed to get device path
lsblk: dm-191: failed to get device path
lsblk: dm-192: failed to get device path
lsblk: dm-193: failed to get device path
lsblk: dm-194: failed to get device path
lsblk: dm-195: failed to get device path
lsblk: dm-196: failed to get device path
lsblk: dm-197: failed to get device path
lsblk: dm-198: failed to get device path
lsblk: dm-199: failed to get device path
lsblk: dm-200: failed to get device path
lsblk: dm-201: failed to get device path
lsblk: dm-202: failed to get device path
lsblk: dm-203: failed to get device path
lsblk: dm-204: failed to get device path
lsblk: dm-205: failed to get device path
lsblk: dm-206: failed to get device path
lsblk: dm-207: failed to get device path
lsblk: dm-208: failed to get device path
lsblk: dm-209: failed to get device path
lsblk: dm-210: failed to get device path
lsblk: dm-211: failed to get device path
lsblk: dm-212: failed to get device path
lsblk: dm-213: failed to get device path
14371
lsblk: dm-3337: failed to get device path
lsblk: dm-3416: failed to get device path
13389
lsblk: dm-405: failed to get device path
lsblk: dm-348: failed to get device path
lsblk: dm-349: failed to get device path
lsblk: dm-350: failed to get device path
lsblk: dm-351: failed to get device path
lsblk: dm-352: failed to get device path
lsblk: dm-353: failed to get device path
lsblk: dm-354: failed to get device path
lsblk: dm-355: failed to get device path
lsblk: dm-356: failed to get device path
lsblk: dm-357: failed to get device path
lsblk: dm-358: failed to get device path
lsblk: dm-359: failed to get device path
lsblk: dm-360: failed to get device path
lsblk: dm-361: failed to get device path
lsblk: dm-362: failed to get device path
lsblk: dm-363: failed to get device path
lsblk: dm-364: failed to get device path
lsblk: dm-365: failed to get device path
lsblk: dm-366: failed to get device path
lsblk: dm-367: failed to get device path
lsblk: dm-368: failed to get device path
lsblk: dm-369: failed to get device path
lsblk: dm-370: failed to get device path
lsblk: dm-371: failed to get device path
lsblk: dm-372: failed to get device path
lsblk: dm-373: failed to get device path
lsblk: dm-374: failed to get device path
lsblk: dm-375: failed to get device path
lsblk: dm-376: failed to get device path
lsblk: dm-377: failed to get device path
lsblk: dm-378: failed to get device path
lsblk: dm-379: failed to get device path
lsblk: dm-380: failed to get device path
lsblk: dm-381: failed to get device path
lsblk: dm-382: failed to get device path
lsblk: dm-383: failed to get device path
lsblk: dm-384: failed to get device path
lsblk: dm-385: failed to get device path
lsblk: dm-386: failed to get device path
lsblk: dm-387: failed to get device path
lsblk: dm-388: failed to get device path
lsblk: dm-389: failed to get device path
lsblk: dm-390: failed to get device path
lsblk: dm-391: failed to get device path
lsblk: dm-392: failed to get device path
lsblk: dm-393: failed to get device path
lsblk: dm-394: failed to get device path
lsblk: dm-395: failed to get device path
lsblk: dm-396: failed to get device path
lsblk: dm-397: failed to get device path
lsblk: dm-398: failed to get device path
lsblk: dm-399: failed to get device path
lsblk: dm-400: failed to get device path
lsblk: dm-401: failed to get device path
lsblk: dm-402: failed to get device path
lsblk: dm-403: failed to get device path
lsblk: dm-404: failed to get device path
lsblk: dm-405: failed to get device path
lsblk: dm-406: failed to get device path
lsblk: dm-407: failed to get device path
lsblk: dm-408: failed to get device path
lsblk: dm-409: failed to get device path
lsblk: dm-410: failed to get device path
12168
lsblk: dm-683: failed to get device path
lsblk: dm-669: failed to get device path
lsblk: dm-670: failed to get device path
lsblk: dm-671: failed to get device path
lsblk: dm-672: failed to get device path
lsblk: dm-673: failed to get device path
lsblk: dm-674: failed to get device path
lsblk: dm-675: failed to get device path
lsblk: dm-676: failed to get device path
lsblk: dm-677: failed to get device path
lsblk: dm-678: failed to get device path
lsblk: dm-679: failed to get device path
lsblk: dm-680: failed to get device path
lsblk: dm-681: failed to get device path
lsblk: dm-682: failed to get device path
lsblk: dm-683: failed to get device path
lsblk: dm-684: failed to get device path
lsblk: dm-685: failed to get device path
lsblk: dm-686: failed to get device path
lsblk: dm-687: failed to get device path
lsblk: dm-688: failed to get device path
lsblk: dm-689: failed to get device path
10951
lsblk: dm-967: failed to get device path
lsblk: dm-919: failed to get device path
lsblk: dm-920: failed to get device path
lsblk: dm-921: failed to get device path
lsblk: dm-922: failed to get device path
lsblk: dm-923: failed to get device path
lsblk: dm-924: failed to get device path
lsblk: dm-925: failed to get device path
lsblk: dm-926: failed to get device path
lsblk: dm-927: failed to get device path
lsblk: dm-928: failed to get device path
lsblk: dm-929: failed to get device path
lsblk: dm-930: failed to get device path
lsblk: dm-931: failed to get device path
lsblk: dm-932: failed to get device path
lsblk: dm-933: failed to get device path
lsblk: dm-934: failed to get device path
lsblk: dm-935: failed to get device path
lsblk: dm-936: failed to get device path
lsblk: dm-937: failed to get device path
lsblk: dm-938: failed to get device path
lsblk: dm-939: failed to get device path
lsblk: dm-940: failed to get device path
lsblk: dm-941: failed to get device path
lsblk: dm-942: failed to get device path
lsblk: dm-943: failed to get device path
lsblk: dm-944: failed to get device path
lsblk: dm-945: failed to get device path
lsblk: dm-946: failed to get device path
lsblk: dm-947: failed to get device path
lsblk: dm-948: failed to get device path
lsblk: dm-949: failed to get device path
lsblk: dm-950: failed to get device path
lsblk: dm-951: failed to get device path
lsblk: dm-952: failed to get device path
lsblk: dm-953: failed to get device path
lsblk: dm-954: failed to get device path
lsblk: dm-955: failed to get device path
lsblk: dm-956: failed to get device path
lsblk: dm-957: failed to get device path
lsblk: dm-958: failed to get device path
lsblk: dm-959: failed to get device path
lsblk: dm-960: failed to get device path
lsblk: dm-961: failed to get device path
lsblk: dm-962: failed to get device path
lsblk: dm-963: failed to get device path
lsblk: dm-964: failed to get device path
lsblk: dm-965: failed to get device path
lsblk: dm-966: failed to get device path
lsblk: dm-967: failed to get device path
lsblk: dm-968: failed to get device path
lsblk: dm-969: failed to get device path
lsblk: dm-970: failed to get device path
lsblk: dm-971: failed to get device path
9941
lsblk: dm-3643: failed to get device path
lsblk: dm-3644: failed to get device path
9135
lsblk: dm-1763: failed to get device path
lsblk: dm-1812: failed to get device path
lsblk: dm-1863: failed to get device path
lsblk: dm-1909: failed to get device path
lsblk: dm-1910: failed to get device path
lsblk: dm-1737: failed to get device path
lsblk: dm-1738: failed to get device path
lsblk: dm-1739: failed to get device path
lsblk: dm-1740: failed to get device path
lsblk: dm-1741: failed to get device path
lsblk: dm-1742: failed to get device path
lsblk: dm-1743: failed to get device path
lsblk: dm-1744: failed to get device path
lsblk: dm-1745: failed to get device path
lsblk: dm-1746: failed to get device path
lsblk: dm-1747: failed to get device path
lsblk: dm-1748: failed to get device path
lsblk: dm-1749: failed to get device path
lsblk: dm-1750: failed to get device path
lsblk: dm-1751: failed to get device path
lsblk: dm-1752: failed to get device path
lsblk: dm-1753: failed to get device path
lsblk: dm-1754: failed to get device path
lsblk: dm-1755: failed to get device path
lsblk: dm-1756: failed to get device path
lsblk: dm-1757: failed to get device path
lsblk: dm-1758: failed to get device path
lsblk: dm-1759: failed to get device path
lsblk: dm-1760: failed to get device path
lsblk: dm-1761: failed to get device path
lsblk: dm-1762: failed to get device path
lsblk: dm-1763: failed to get device path
lsblk: dm-1764: failed to get device path
lsblk: dm-1765: failed to get device path
lsblk: dm-1766: failed to get device path
lsblk: dm-1767: failed to get device path
lsblk: dm-1768: failed to get device path
lsblk: dm-1769: failed to get device path
lsblk: dm-1770: failed to get device path
lsblk: dm-1771: failed to get device path
lsblk: dm-1772: failed to get device path
lsblk: dm-1773: failed to get device path
lsblk: dm-1774: failed to get device path
lsblk: dm-1775: failed to get device path
lsblk: dm-1776: failed to get device path
lsblk: dm-1777: failed to get device path
lsblk: dm-1778: failed to get device path
lsblk: dm-1779: failed to get device path
lsblk: dm-1780: failed to get device path
lsblk: dm-1781: failed to get device path
lsblk: dm-1782: failed to get device path
lsblk: dm-1783: failed to get device path
lsblk: dm-1784: failed to get device path
lsblk: dm-1785: failed to get device path
lsblk: dm-1786: failed to get device path
lsblk: dm-1787: failed to get device path
lsblk: dm-1788: failed to get device path
lsblk: dm-1789: failed to get device path
lsblk: dm-1790: failed to get device path
lsblk: dm-1791: failed to get device path
lsblk: dm-1792: failed to get device path
lsblk: dm-1793: failed to get device path
lsblk: dm-1794: failed to get device path
lsblk: dm-1795: failed to get device path
lsblk: dm-1796: failed to get device path
lsblk: dm-1797: failed to get device path
lsblk: dm-1798: failed to get device path
lsblk: dm-1799: failed to get device path
lsblk: dm-1800: failed to get device path
lsblk: dm-1801: failed to get device path
lsblk: dm-1802: failed to get device path
lsblk: dm-1803: failed to get device path
lsblk: dm-1804: failed to get device path
lsblk: dm-1805: failed to get device path
lsblk: dm-1806: failed to get device path
lsblk: dm-1807: failed to get device path
lsblk: dm-1808: failed to get device path
lsblk: dm-1809: failed to get device path
lsblk: dm-1810: failed to get device path
lsblk: dm-1811: failed to get device path
lsblk: dm-1812: failed to get device path
lsblk: dm-1926: failed to get device path
8006
lsblk: dm-3826: failed to get device path
lsblk: dm-3827: failed to get device path
lsblk: dm-3828: failed to get device path
6986
^C
....

[root@host0022 ~]# ^C
[root@host0022 ~]# while true; do lsblk |wc -l; sleep 10; done
698

Comment 2 Sandro Bonazzola 2017-02-22 10:44:34 UTC
Does this reproduce on RHEL too? It doesn't seem to be RHEV-H related.

Comment 3 Greg Scott 2017-02-22 12:33:31 UTC
Probably does reproduce on RHEL. But we've only seen it on rhevh.

Comment 5 Greg Scott 2017-02-25 02:39:51 UTC
I owe a better answer than what I can tap out on a teeny tiny cell phone keyboard.  The unique part of all this is the use case.  The customer operates around 2000 pooled, stateless VMs with thin provisioned storage at each site, using an XIO Fibre Channel storage array for back-end storage.  Every one of those VMs consumes an LV; running VMs consume 2 LVs each, one for the VM, the other for its snapshot.

Those VMs are Windows 7, and the customer does patch management by updating the templates, deleting old VMs and storage domains, and creating new VMs and storage domains.  The customer also takes advantage of an XIO feature around treating zeroed blocks by setting all objects to "Wipe After Delete."

With WAD set, deletions go all the way back to the SAN, the SAN declares the space free, and the customer no longer needs to delete old LUNs and create new LUNs to free up SAN space.

The WAD method of operation has been in place for the past 2 months. We found this massive number of LVs just in the past 3 weeks.  I suspect that WAD interacts with the XIO storage and somehow leaves remnants of LVM metadata that RHEV hosts subsequently pick up.

I hope we can demonstrate that SCSI discards with RHVH-7 make all this moot.  But we need to demonstrate it at scale.

So, would this reproduce on RHEL? I'm sure it would, but I don't know of any relevant use case.

Comment 7 Nir Soffer 2017-03-03 16:18:02 UTC
(In reply to Greg Scott from comment #0)
> Description of problem:
> 
> Note, this is with RHEV 3.5.8, but the bugzilla form has no selection for
> 3.5.8.
> 
> In a RHEV environment with around 2000 pooled, stateless VMs, host
> activations slow down and eventually stop working reliably. On the problem
> hosts, we see lsblk | wc -l showing tens of thousands of LVM objects. They
> appear to be orphan LVM objects and, so far, we aren't able to find where
> they come from.

On this setup we have 11 "current" storage domains and 11 "old" storage domains;
the customer rotates the storage domains regularly. On each set of storage
domains there are about 2000 stateless vms, each with one disk.

The running vms from the "current" storage domains consume 2 lvs per vm (one for
the disk, one for the temporary disk while the vm is running). The vms on the old
storage domains consume 1 lv per vm.

So we have a total of about 6000 lvs in this setup.

These machines have 4 paths to storage, so lsblk shows each lv 4 times (once under
each path). This explains why we see about 24,000 entries (6000 lvs x 4 paths)
instead of the expected 6000.
 
> We also find udevd and hald consuming most of the hosts' CPU capacity trying
> to enumerate all those LVM objects.

We need data about this, showing the cpu usage of udev and hald during boot.

> Steps to Reproduce:
> Build a RHEV 3.5.8 environment with a few RHEVH-6-2016-0104 hosts. Make a
> fiberchannel datacenter with a few LUNs and 11 storage domains.  Use the EMC
> XIO storage in TLV.

We also need 4 paths to storage to simulate this issue.

The actual setup also has 11 old storage domains with the same vms, so we need
22 storage domains, not 11.

> Create a Windows VM template. Make sure its virtual disk is thin
> provisioned.  Set it for Wipe After Delete. Create 11 pools, one pool per
> storage domain, each with 170 stateless VMs based on this template.  All
> those VMs should have WAD set, flowing from the template.  Each VM will
> consume 2 logical volumes (LV).  One LV for the VM, the other for its
> snapshot.  So each storage domain should have 340 LVs.

340 lvs when all the vms are running, I assume?

> Use automation to boot all those VMs, run them for a few minutes, and shut
> them down.  Keep an eye on the lsblk | wc -l count on all the hosts.  Delete
> the VMs and create new ones, while watching lsblk | wc -l.

Watching lsblk for counting lvs is not useful.

> Repeat the above paragraph a few times.  My hunch is, that lsblk count will
> keep growing.  It should drop after deleting the VMs, but I have a hunch it
> won't.

I did not find any evidence that we have an issue with deleting lvs, so I don't think
this test is useful for this case.

> Reboot one of the hosts in this environment and try to activate it. In an
> ssh session on that host watch what happens with udevd and hald.  And do a
> loop running lsblk | wc -l every 10 seconds.

Except for the lsblk part, this is the interesting test - trying to simulate a host that
cannot be activated after reboot.

> Upgrade RHEVM to 3.6 and hosts to RHEVH-7 for 3.6.  Try the above steps
> again.
> 
> Upgrade the environment to RHV4 and try again.

I think we want to test only with RHV 4; 3.6 will be EOL soon.

> Actual results:
> Tens of thousands of LVM objects show up that cannot be accounted for from
> normal business.

The issue is host not activating after reboot.

> Expected results:
> The number of LVM objects in the system should make sense based on the
> number of VMs.  It should not be an order of magnitude too large, as we see
> in this environment.

The expected result is the host activating after reboot without manually
restarting vdsm.

> Additional info:
> 
> During activation, running lsblk | wc -l in a loop shows the LV count
> dropping after restarting vdsmd while activating a host.  We captured the
> output below showing the sequence of events from one host today:

What we see here is the result of vdsm deactivating all lvs during startup.

The mystery is why this process hangs during boot but works later.

Greg, do we have a sosreport covering the timeframe from reboot, through the time
vdsm hung trying to deactivate lvs, was restarted, and finally activated?

All the sosreports I have seen were taken just after reboot, so I don't have
enough info.

Also, if we plan to reboot the hosts in this cluster, I would like to log
the cpu usage of the system during this timeframe, maybe using top -b.
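
For example, something like this (a sketch; the log path is arbitrary) could be started as soon as ssh access is available after the reboot and left running during activation:

    # log a cpu snapshot every 10 seconds while the host activates
    nohup top -b -d 10 > /var/log/top-activation.log 2>&1 &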

Comment 8 Nir Soffer 2017-03-03 17:14:51 UTC
I found this issue in jabphx0008:

Vdsm startup:

MainThread::INFO::2017-02-12 05:49:16,959::vdsm::132::vds::(run) (PID: 16816) I am the actual vdsm 4.16.32-1.el6ev jabphx0008.statestr.com (2.6.32-573.12.1.el6.x86_64)

Performing udevadm settle after scsi rescan - with huge timeout

storageRefresh::DEBUG::2017-02-12 05:49:18,640::utils::755::root::(execCmd) /sbin/udevadm settle --timeout=480 (cwd None)

udevadm settle fails after 480 seconds:

storageRefresh::DEBUG::2017-02-12 05:57:18,695::utils::775::root::(execCmd) FAILED: <err> = ''; <rc> = 1
storageRefresh::ERROR::2017-02-12 05:57:18,696::udevadm::60::root::(settle) Process failed with rc=1 out='\nudevadm settle - timeout of 480 seconds reached, the event queue contains:\n  /sys/devices/virtual/net/eth0.2000 (24009)\n  /sys/devices/virtual/net/eth0.2000 (24010)\n  /sys/devices/virtual/net/eth0.2000/queues/rx-0 (24011)\n  /sys/devices/virtual/net/eth0.2000/queues/tx-0 (24012)\n' err=''

In /etc/vdsm/vdsm.conf we see:

$ cat etc/vdsm/vdsm.conf 
[addresses]
management_port = 54321

[vars]
ssl = true

[irs]
scsi_settle_timeout = 480

I have also seen this on other hosts.

This explains the unexpected timeout. Using such a value is not a
good idea, and may cause large timeouts during vdsm flows, which may
be a reason why a host did not activate.

Greg, when was this value set in this cluster? Please make sure they
return it to the default.

If we see timeouts in udevadm settle in normal use, maybe use a bigger
value, like 10-15 seconds. If we still have timeouts we need to check
why udev times out.
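
For example, the override could be removed or reduced in /etc/vdsm/vdsm.conf (a sketch; dropping the scsi_settle_timeout line entirely falls back to the built-in default):

    [irs]
    # was: scsi_settle_timeout = 480
    scsi_settle_timeout = 15

followed by a vdsm restart (service vdsmd restart) to pick up the change.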

Comment 9 Greg Scott 2017-03-03 17:28:24 UTC
We've been fighting this problem of hosts not activating for several months and one suggestion was to increase the scsi_settle_timeout value. The theory at the time - which we now know is wrong - was to give udev enough time to do its thing so hosts could activate.  I put a comment in one of the support cases last night to put scsi_settle_timeout back to its default on all hosts and don't go any higher than 10-15 seconds with it.

Comment 10 Greg Scott 2017-03-03 17:33:48 UTC
Oh yes - I don't have any recordings of top output, but we have had people watching top once they can get in via ssh.  We see CPU maxed, split roughly evenly between udevd and hald.  I'll see what I can do about getting one of these recorded.

- Greg

Comment 11 Nir Soffer 2017-03-03 18:03:14 UTC
Dumping the total lvs we see on different hosts:


$ python lvs-stats.py < sosreport-jabphx0002.*.01779057-20170206210250/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     352 475f3971-83e8-48d4-af55-957b84c5b42e      341        0        0
     350 a09ec488-fc29-462f-9319-fd13d1335131      350        0        0
     349 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47      349        0        0
     349 eba89272-6603-440d-8662-579882e4c566      349        0        0
     340 34c280f6-ea97-4ac1-af2e-241c83d106ad      339        0        0
     337 4b1948cb-8500-4155-834c-48a3b1e39abb      336        0        0
     315 a4f0eb6c-4904-41fc-b12b-84b053c98dc1      313        0        0
     296 bb273403-5489-4306-bdae-636afdc2689b      296        0        0
     276 dacd9668-6ca6-4c60-bffa-1efecdfb3595      275        0        0
     228 3a5fb6b8-74b1-47c8-857f-aa3a799f8699      227        0        0
     218 95231fba-5724-4063-8492-0866f4c74de8      218        0        0
     217 a214b043-c558-433a-8a00-9c12b41d590f      217        0        0
     209 482fb94d-b6a4-4fae-8728-d6204f1fd273      209        0        0
     191 f0a154ad-c212-4702-808a-7f371b4f8c8c      191        0        0
     191 375bfc4e-0747-44b4-9f16-662436445797      191        0        0
     190 b38d008f-b93f-4650-ba91-2f41b3ad4b41      190        0        0
     182 fc988f63-4051-44a6-988c-09a1ca9b1157      182        0        0
     179 c6e07169-a202-4a44-b2ae-440110ce0323      179        0        0
     176 018fd3a0-1513-4f80-81f0-ede47686cb8e      176        0        0
     144 a962de39-4d11-4f45-b3e0-a3e6da75796e      144        0        0
     134 f8e8526c-02fa-4906-b593-3821f99cb392      134        0        0
     112 5696ccc8-d0cd-4d35-9e3b-30c214408213      112        0        0
     107 9e436258-191d-411e-9c8f-3d7165682e90      107        0        0
      78 203e766c-a3fe-4b80-abc8-f5217a16d787       78        0        0
      62 b1bfa173-1748-4ae4-99b9-e6174528b03d       62        0        0
      56 1ca68870-23b3-4f02-b185-9a469ad14a90       56        0        0
       8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e        6        0        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  5651
    active:  5631
      open:  4

We see that we have 5651 lvs, and practically all of them are active.


$ python lvs-stats.py < sosreport-jabphx0004.*.01797770-20170301175143/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     355 475f3971-83e8-48d4-af55-957b84c5b42e       31       26        0
     353 bb273403-5489-4306-bdae-636afdc2689b        6        1        0
     349 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47        6        1        0
     349 eba89272-6603-440d-8662-579882e4c566        6        1        0
     349 4b1948cb-8500-4155-834c-48a3b1e39abb        6        1        0
     348 a09ec488-fc29-462f-9319-fd13d1335131        6        1        0
     348 34c280f6-ea97-4ac1-af2e-241c83d106ad        6        1        0
     325 a4f0eb6c-4904-41fc-b12b-84b053c98dc1        6        1        0
     292 b6c183ce-a020-4442-b7d8-596c8cf042a9        6        1        0
     283 5458edbd-1f29-4960-b67c-31644d53cac4        6        1        0
     274 dacd9668-6ca6-4c60-bffa-1efecdfb3595        6        1        0
     268 efb361c3-7cab-4601-a35f-f859c017ca11        6        1        0
     254 9b5c609b-0165-412a-9bcb-05e9bd8f802d        6        1        0
     242 3a5fb6b8-74b1-47c8-857f-aa3a799f8699       27       22        0
     216 95231fba-5724-4063-8492-0866f4c74de8       31       26        0
     200 a3ba1656-9567-4a6a-a939-e5ce0d2282f6        6        1        0
     179 2711b96f-b7d3-4d1c-98a3-c12046496c97        6        1        0
     163 e7ca2cca-94c2-45ef-ae12-a60d138c0faf        9        1        0
     112 5696ccc8-d0cd-4d35-9e3b-30c214408213       18       13        0
     104 a8a8d55f-d042-4f81-b8c2-cd676dacb214        6        1        0
      78 203e766c-a3fe-4b80-abc8-f5217a16d787        6        1        0
      62 b1bfa173-1748-4ae4-99b9-e6174528b03d        6        1        0
      56 1ca68870-23b3-4f02-b185-9a469ad14a90        6        1        0
       9 ef0b8570-f345-4187-9645-49ac484ffaba        6        1        0
       9 9d7ed78c-ceb2-4cb4-a52c-28c21ba320cf        6        1        0
       9 608a69c8-cf0b-49e3-a8b1-cd568e0b4e0c        6        1        0
       8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e        6        0        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  5599
    active:  252
      open:  113

On this host only 252 lvs are active, probably after vdsm deactivated
all unused lvs.


$ python lvs-stats.py < sosreport-jabphx0008.*-20170212122136/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     340 475f3971-83e8-48d4-af55-957b84c5b42e      338        0        0
     331 34c280f6-ea97-4ac1-af2e-241c83d106ad      331        0        0
     327 4b1948cb-8500-4155-834c-48a3b1e39abb      327        0        0
     326 a09ec488-fc29-462f-9319-fd13d1335131        6        0        0
     324 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47      324        0        0
     322 eba89272-6603-440d-8662-579882e4c566      322        0        0
     297 a4f0eb6c-4904-41fc-b12b-84b053c98dc1      297        0        0
     295 bb273403-5489-4306-bdae-636afdc2689b      295        0        0
     254 dacd9668-6ca6-4c60-bffa-1efecdfb3595      253        0        0
     218 3a5fb6b8-74b1-47c8-857f-aa3a799f8699      217        0        0
     205 95231fba-5724-4063-8492-0866f4c74de8      205        0        0
     112 5696ccc8-d0cd-4d35-9e3b-30c214408213      112        0        0
      78 203e766c-a3fe-4b80-abc8-f5217a16d787       78        0        0
      62 b1bfa173-1748-4ae4-99b9-e6174528b03d       62        0        0
      56 1ca68870-23b3-4f02-b185-9a469ad14a90       56        0        0
       8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e        8        0        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  3560
    active:  3235
      open:  4

I think this shows the state after the old storage domains were removed, but
most lvs are still active; maybe vdsm had started to deactivate some lvs when
this sosreport was created.


$ python lvs-stats.py < sosreport-jabphx0010.*.01779057-20170211203520/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     348 475f3971-83e8-48d4-af55-957b84c5b42e      348        0        0
     341 eba89272-6603-440d-8662-579882e4c566      341        0        0
     340 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47      340        0        0
     337 4b1948cb-8500-4155-834c-48a3b1e39abb      337        0        0
     336 34c280f6-ea97-4ac1-af2e-241c83d106ad      336        0        0
     335 a09ec488-fc29-462f-9319-fd13d1335131        6        0        0
     303 a4f0eb6c-4904-41fc-b12b-84b053c98dc1      303        0        0
     295 bb273403-5489-4306-bdae-636afdc2689b      295        0        0
     271 dacd9668-6ca6-4c60-bffa-1efecdfb3595      271        0        0
     231 3a5fb6b8-74b1-47c8-857f-aa3a799f8699      231        0        0
     219 95231fba-5724-4063-8492-0866f4c74de8      219        0        0
     112 5696ccc8-d0cd-4d35-9e3b-30c214408213      112        0        0
      78 203e766c-a3fe-4b80-abc8-f5217a16d787       78        0        0
      62 b1bfa173-1748-4ae4-99b9-e6174528b03d       62        0        0
      56 1ca68870-23b3-4f02-b185-9a469ad14a90       56        0        0
       8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e        8        0        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  3677
    active:  3347
      open:  4


Similar to jabphx0008.


$ python lvs-stats.py < sosreport-jabphx0014.*.01779057-20170211203907/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     348 475f3971-83e8-48d4-af55-957b84c5b42e        6        1        0
     341 eba89272-6603-440d-8662-579882e4c566        6        1        0
     340 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47        6        1        0
     337 4b1948cb-8500-4155-834c-48a3b1e39abb        6        1        0
     336 34c280f6-ea97-4ac1-af2e-241c83d106ad        6        1        0
     335 a09ec488-fc29-462f-9319-fd13d1335131        6        1        0
     303 a4f0eb6c-4904-41fc-b12b-84b053c98dc1        6        1        0
     295 bb273403-5489-4306-bdae-636afdc2689b        6        1        0
     271 dacd9668-6ca6-4c60-bffa-1efecdfb3595        6        1        0
     231 3a5fb6b8-74b1-47c8-857f-aa3a799f8699        6        1        0
     219 95231fba-5724-4063-8492-0866f4c74de8        6        1        0
     112 5696ccc8-d0cd-4d35-9e3b-30c214408213        6        1        0
      78 203e766c-a3fe-4b80-abc8-f5217a16d787        6        1        0
      62 b1bfa173-1748-4ae4-99b9-e6174528b03d        6        1        0
      56 1ca68870-23b3-4f02-b185-9a469ad14a90        6        1        0
       8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e        6        0        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  3677
    active:  100
      open:  19

This looks like normal operation mode, vdsm deactivated most lvs.


$ python lvs-stats.py < sosreport-gdcphx0027.*.01797770-20170224184848/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0 
     #lv vg name                              #active  #open    #removed
     534 08fcdfc2-e31b-4fb0-83ae-2c6e55dcd772       21       14        0
     502 132791ee-3fef-4945-9279-cf2bbddd85f2        9        4        0
     488 2c66ca06-c83f-4dd0-ae8b-8c72cdd29035       21       12        0
     351 76afadf1-7b90-4f41-8601-f766fd4a7ada       17       12        0
     327 3377e9e7-63bc-4313-bdda-47cb38487860       19       14        0
     320 7b7daebc-982a-4ba3-ba7d-ddf418468ebc       35       28        0
     315 a475c700-a749-4b6e-9e3b-53c08af5e155        9        4        0
     265 e93af66a-cdcb-46a4-b52c-cf888902d354       16        6        0
     246 c6a49433-ae49-4d29-93ff-0b6534b35758        6        1        0
     193 b1c45380-e5e4-42e9-90f4-ff6732b57a50        6        1        0
     158 436af53a-1ac5-4a27-b78b-9387aa0977e6        6        1        0
      92 40d64e44-18d4-4cef-9e91-e03ca206f64c        6        1        0
      72 dc25e385-332e-4feb-8ea2-78a7036ac8d1        6        2        0
      56 41010432-a8ff-47c9-a538-46b37f0188ac       12        7        0
      48 d6d7afa9-5356-4c12-9659-5817f6767c5e        6        1        0
       8 f6ae30f1-9d4f-48f8-9e66-6f77cd906aa9        6        1        0
       8 b630389e-2d3b-4905-9e16-eaedd4b59581        6        1        0
       8 a7edf3cc-4812-497d-bf91-0f53a9450d52        6        1        0
       8 8a03fffa-b81e-4528-9a7e-c812850ceed0        6        1        0
       8 62cd036f-66c9-4c91-a80c-08fa29bf9de9        6        1        0
       8 4cb204e9-e799-400b-b449-97e3fd3c59dd        6        1        0
       8 3133b55f-6fb6-4c39-90f2-e18e1828eea6        6        1        0
       8 21d2912d-b267-4726-9077-f7cd0f5dc028        6        1        0
       8 1728c225-e465-4832-85e0-ded99280b4c0        6        1        0
       8 0d22631c-d0eb-4581-acda-9b56a3933594        6        1        0
       8 009bb777-0cac-433c-b292-b6869fe32914        6        1        0
       4 HostVG                                      4        4        0
       1 VG                                          0        0        0

totals
        lv:  4060
    active:  265
      open:  123

Here we seem to have new storage domains with no vms created yet.


Important note - we don't see any partly removed lvs in the cluster, which means
that there is no issue of aborted or failed wipe after delete operations.
Before we delete an lv, we add a "remove_me" tag to the lv.
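
A quick way to check a host for such partly removed lvs (a sketch) is:

    lvs -o vg_name,lv_name,lv_tags | grep remove_me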

Comment 12 Nir Soffer 2017-03-03 18:07:24 UTC
Created attachment 1259616 [details]
Script to get lvm stats from sosreport

This script gives a useful overview of sosreport/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0

This will be part of vdsm later, posting it here so people can reuse it.

Comment 13 Nir Soffer 2017-03-03 18:31:31 UTC
Checking multipath configuration - on all hosts I see this warning:

MainThread::WARNING::2017-02-12 05:49:17,433::multipath::162::Storage.Multipath::(isEnabled) This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives

This means the multipath.conf file is based on an old version, and should be
updated to the new version.

This is the version used on the hosts:

# RHEV REVISION 1.0
# RHEV PRIVATE

defaults {
    polling_interval        5
    getuid_callout          "/sbin/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}
devices {
  device {
    vendor XtremIO
    product XtremApp
    path_selector "queue-length 0"
    rr_min_io_rq 1
    path_grouping_policy multibus
    path_checker tur
    failback immediate
    fast_io_fail_tmo 15
  }
}

The version shipped in 3.5.8 is 1.1. It seems that the defaults
used on the customer machines are updated from version 1.1, so
it would be a good idea to update the version of the file.
This will avoid the warning about a "downrevved template".

One thing that I'm not sure about is not having "no_path_retry"
in the XtremIO device configuration. This may use the default
"no_path_retry fail", or some other value. It is a good idea to
define *all* device values in each device section.

In upstream we want to change no_path_retry from "fail" to "4",
see https://gerrit.ovirt.org/61281
This may also be a good setting for this setup, but it is best to
consult the storage vendor about this.
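
For illustration, the device section with no_path_retry defined explicitly might look like this (a sketch based on the upstream change above; the actual value should be confirmed with the storage vendor):

  device {
    vendor XtremIO
    product XtremApp
    path_selector "queue-length 0"
    rr_min_io_rq 1
    path_grouping_policy multibus
    path_checker tur
    failback immediate
    fast_io_fail_tmo 15
    no_path_retry 4
  }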

I'm not sure about the value of fast_io_fail_tmo; the default
is 5 seconds, and they are using 15 seconds.

Greg, do you know why they are using this value?

Comment 14 Greg Scott 2017-03-04 02:40:19 UTC
We talked about multipath.conf a couple weeks ago with EMC while troubleshooting the multipath issue during the SAN firmware upgrade.  After looking it over and talking about it, nobody had any objections to any settings.  The settings here have been in place for a few years.

Thinking about it now though, I wonder if a longer fast_io_fail_tmo contributed to the multipath issue? I'm not sure exactly what that parameter regulates. Is it a timeout to declare a path down, or the time to declare a previously down path back up?  I wonder if an issue came up years ago when paths flapped up and down and they set it this way to make it less sensitive?

- Greg

Comment 15 Yaniv Lavi 2017-03-06 12:34:48 UTC
This is from BZ #1428637 :

(In reply to Nir Soffer from comment #13)
> This seems to be fixed in rhel 6.8:
> 
> # lvs -vvvv --config 'devices { filter = ["a|/dev/vda2|", "r|.*|"] }' vg0
> 2>&1 | grep Open
> #device/dev-io.c:559         Opened /dev/vda2 RO O_DIRECT
> #device/dev-io.c:559         Opened /dev/vda2 RO O_DIRECT
> #device/dev-io.c:559         Opened /dev/vda2 RO O_DIRECT
> #device/dev-io.c:559         Opened /dev/vda2 RO O_DIRECT
> #device/dev-io.c:559         Opened /etc/lvm/lvm.conf RO
> 
> # rpm -qa | grep lvm2
> lvm2-2.02.143-7.el6.x86_64
> lvm2-libs-2.02.143-7.el6.x86_64
> 
> # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 6.8 (Santiago)
> 
> Customers affected by this should upgrade to 6.8 or 7.3.


Please recommend an upgrade to resolve this issue. 6.8 was released to the 3.5 stream as well.

Comment 16 Nir Soffer 2017-03-06 19:24:16 UTC
Yaniv, the fact that the lvm filter issue (bug 1428637) was fixed in 6.8 is not
enough to close this bug.

We assume that the lvm filter issue is related, but we do not yet understand the
root cause of this bug. We need help from the lvm team to understand this.

Reopening this bug to complete the investigation.

Comment 17 Nir Soffer 2017-03-08 18:36:35 UTC
I think the root cause of this bug is bug 1429203 - stuck lvm command during boot.

Comment 18 Greg Scott 2017-03-13 04:48:54 UTC
Also, the assumption in this bz is that RHEVH-6.8 does the filtering automatically, even with boot from SAN scenarios.  But it's apparently not that simple. From what I understand, RHEVH-6.8 will allow LVM filtering, but we need to figure out the individual boot block devices in each boot-from-SAN host and then put in the filters by hand.

- Greg

Comment 19 Nir Soffer 2017-03-14 18:13:16 UTC
(In reply to Greg Scott from comment #18)
> Also, the assumption in this bz is that RHEVH-6.8 does the filtering
> automatically, even with boot from SAN scenarios.

There is no such assumption; rhel 6.8 and 7 do not solve this issue, they only
provide the lvm filter.

> But it's apparently not
> that simple. From what I understand, RHEVH-6.8 will allow LVM filtering, but
> we need to figure out the individual boot block devices in each
> boot-from-SAN host and then put in the filters by hand.

Correct, this is the recommended way to configure a host for booting from SAN,
and these hosts are not configured in this way.

The configuration will be the same on 6.8 or 7.

1. Find the devices needed by the host. You can find it using:

    vgs -o pv_name HostVG

Assuming rhev-h, where there is one vg named HostVG.

If you have other vgs needed by the host you need to include them in the command.

2. Create a filter from the found devices in /etc/lvm/lvm.conf.

    filter = [ "a|^/dev/mapper/xxxyyy$|", "r|.*|" ]

3. Persist lvm.conf - on rhev-h it is not enough to modify the file; the changes will
   be lost unless the modified file is persisted.

4. Finally, rebuild initramfs so lvm.conf is updated.

Adding Douglas to add more info on how to persist lvm.conf and rebuild the initramfs
on rhev-h.
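
As a rough outline, steps 1-4 as shell commands might look like this (a sketch; the filter device is a placeholder, and the rebuild command shown here only exists on newer rhev-h - see the following comments for the options on older versions):

    # 1. find the devices backing HostVG
    vgs -o pv_name HostVG

    # 2. edit /etc/lvm/lvm.conf and add a filter for those devices, e.g.
    #    filter = [ "a|^/dev/mapper/xxxyyy$|", "r|.*|" ]

    # 3. persist the modified file so it survives reboot (rhev-h only)
    persist /etc/lvm/lvm.conf

    # 4. rebuild the initramfs so it picks up the new lvm.conf
    /usr/sbin/ovirt-node-rebuild-initramfs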

Comment 20 Yaniv Kaul 2017-03-14 18:19:03 UTC
The rebuilding of the initramfs is the heavy lifting. What happens when a rhevh update arrives?

Comment 21 Greg Scott 2017-03-14 18:30:35 UTC
What about disabling haldaemon?  That should be easy enough to do. Would it do any good? How do you persist chkconfig --level 345 haldaemon off?

I ran across another idea too:

Add the following to /etc/rc.local to re-scan after booting:
echo 512 > /sys/module/scsi_mod/parameters/max_report_luns

- Greg

Comment 22 Ryan Barry 2017-03-14 19:02:21 UTC
We have an abstraction for this under ovirt.node.utils.system.Initramfs

This is called from ovirtnode.Install.ovirt_boot_setup() (which is always part of ovirt-node-upgrade.py)

In this sense, we'll always catch it on updates.

Douglas, can you please find an invocation which works here?

It will probably be something like:

python -c "from ovirt.node.utils.system import Initramfs; Initramfs().rebuild()"

It's possible that 'dracut -f' will work directly, but I haven't tested...

Comment 23 Douglas Schilling Landgraf 2017-03-14 21:59:12 UTC
(In reply to Ryan Barry from comment #22)
> We have an abstraction for this under ovirt.ndode.utils.system.Initramfs
> 
> This is called from ovirtnode.Install.ovirt_boot_setup() (which is always
> part of ovirt-node-upgrade.py)
> 
> In this sense. we'll always catch it on updates.
> 
> Douglas, can you please find an invocation which works here?
> 
> It will probably be something like:
> 
> python -c "from ovirt.node.utils.system import Initramfs;
> Initramfs().rebuild()"
> 
> It's possible that 'dracut -f' will work directly, but I haven't tested...

Agree with Ryan, but I would first give /usr/sbin/ovirt-node-rebuild-initramfs a try:

# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.3 (20170118.0.el7ev)

# uname -a
Linux localhost 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

# /usr/sbin/ovirt-node-rebuild-initramfs
INFO - Preparing to regenerate the initramfs
INFO - The regenreation will overwrite the existing
INFO - Rebuilding for kver: 3.10.0-514.6.1.el7.x86_64
INFO - Mounting '/liveos' to /boot
INFO - Generating new initramfs '/var/tmp/initrd0.img.new' for kver 3.10.0-514.6.1.el7.x86_64 (this can take a while)
INFO - Installing the new initramfs '/var/tmp/initrd0.img.new' to '/boot/initrd0.img'
INFO - Successfully unmounted /liveos and /boot
INFO - Initramfs regenration completed successfully
[root@localhost ~]#

Comment 24 Nir Soffer 2017-03-14 23:00:17 UTC
(In reply to Greg Scott from comment #21)
> What about disabling haldaemon?

According to:
https://access.redhat.com/solutions/27571
https://access.redhat.com/solutions/16609
https://bugzilla.redhat.com/737755
https://bugzilla.redhat.com/515734

It may not be needed on a server, and is not designed for handling 6000
devices.

> echo 512 > /sys/module/scsi_mod/parameters/max_report_luns

You have 20-30 luns, how is this going to help?

Comment 25 Greg Scott 2017-03-14 23:13:12 UTC
Grabbing at straws I guess. I need to offer something before their maintenance window this weekend.  You're right, dealing with LUNs does no good.

On disabling haldaemon - how do I persist chkconfig --level 345 haldaemon off, so it stays that way after a reboot on RHEVH-6? I looked over all those articles and BZs - all of them are about the real RHEL6.  

And re: Doug - that nifty rebuild-initramfs utility isn't there in this version. Is it there with 6.8?

[root@rhevhtest admin]# /usr/sbin/ovirt-node-rebuild-initramfs
bash: /usr/sbin/ovirt-node-rebuild-initramfs: No such file or directory
[root@rhevhtest admin]#
[root@rhevhtest admin]# more /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160104.0.el6ev)
[root@rhevhtest admin]#

- Greg

Comment 26 Nir Soffer 2017-03-14 23:37:47 UTC
Richard,

Greg reports that hald consumes a huge amount of cpu during boot, which takes more
than 30 minutes on these systems that have about 6000 logical volumes.

Can you confirm that hal is not required on rhel 6 server?

Comment 27 Nir Soffer 2017-03-14 23:40:34 UTC
(In reply to Greg Scott from comment #25)
> And re: Doug - that nifty rebuild-initramfs utility isn't there in this
> version. Is it there with 6.8?

I suspect that Douglas is talking about the shiny rhev-h 4.1 that is very much regular
rhel, but we need instructions for good old rhev-h 6.8.

Comment 28 Ryan Barry 2017-03-15 00:11:39 UTC
Well, the shiny new 4.1/4.0 doesn't need a utility to generate the initramfs, since dracut works (mostly -- a lot of system utilities don't actually seem to put their output in the same path as BOOT_IMAGE, but we have a workaround for this...)

ovirt-node-rebuild-initramfs landed in RHEV 3.6, along with the abstraction in system.Initramfs(), unfortunately.

That abstraction isn't terribly complex here, but I suspect that the customer will still need to F2 out to a shell to rebuild the initramfs.

It's been a long time since I looked at 3.5, unfortunately.

chkconfig makes symlinks in /etc/rc${X}.d, or removes them. If we're lucky, all of /etc is persisted, and we can nuke them in /config/etc/rc3.d/XXhaldaemon.

If not, the customer will probably have to create an empty file or noop script, then persist that.

Persisting/unpersisting init scripts is always race-y, though, since we're depending on hald (for example) to start *after* readonly-root and the other scripts which bind mount /config/etc over /etc. They might, but might not...
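
A sketch of the noop-script variant (assuming the persist tool behaves as in the comments below):

    # replace the hald init script with a script that does nothing
    mv /etc/rc.d/init.d/haldaemon /etc/rc.d/init.d/haldaemon.orig
    printf '#!/bin/sh\nexit 0\n' > /etc/rc.d/init.d/haldaemon
    chmod +x /etc/rc.d/init.d/haldaemon
    persist /etc/rc.d/init.d/haldaemon
    persist /etc/rc.d/init.d/haldaemon.orig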

Comment 29 Greg Scott 2017-03-15 02:11:41 UTC
The question for Richard is, what breaks if I disable haldaemon on RHEVH-6.7?  There must have been a good reason to turn on haldaemon in the first place, right?

I have a challenge coming up this weekend when they'll need to reboot 34 hypervisors.  They have maybe 48 hours and each reboot cycle now takes about 2 hours.  The numbers add up badly.  What can we do with what's in place right now to make this better?

I put this in the support case yesterday. I should have also posted it here. Persisting/unpersisting symlinks in RHEVH-6 doesn't seem to work the same as real files.  If manipulating chkconfig doesn't work, I'll just rename or edit /etc/rc.d/init.d/haldaemon.

From support case comment #181, created by Greg Scott (3/13/2017 5:07 PM):

I was trying to figure out how to persist chkconfig --level 345 haldaemon off in RHEVH-6 today. We can always edit /etc/rc.d/init.d/haldaemon and comment out the start portion of the init script, but I was hoping we could come up with a chkconfig method that works.

[root@rhevhtest admin]# chkconfig --list | grep hald
haldaemon       0:off   1:off   2:off   3:on    4:on    5:on    6:off
[root@rhevhtest admin]#
[root@rhevhtest admin]#
[root@rhevhtest admin]#
[root@rhevhtest admin]# chkconfig --level 345 haldaemon off
[root@rhevhtest admin]# chkconfig --list | grep hald
haldaemon       0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@rhevhtest admin]# reboot

And it's right back the way it was.

[root@rhevhtest rc3.d]#
[root@rhevhtest rc3.d]# chkconfig --list | grep hald
haldaemon       0:off   1:off   2:off   3:on    4:on    5:on    6:off
[root@rhevhtest rc3.d]#


Unpersisting and persisting the run level 3 symlink doesn't work.

[root@rhevhtest admin]# ls /etc/rc.d/rc3.d -al | grep hald
lrwxrwxrwx.  1 root root   19 2016-01-04 21:07 S26haldaemon -> ../init.d/haldaemon
[root@rhevhtest admin]#
[root@rhevhtest admin]# unpersist /etc/rc.d/rc3.d/S26haldaemon
File not explicitly persisted: /etc/rc.d/rc3.d/S26haldaemon
[root@rhevhtest admin]# persist /etc/rc.d/rc3.d/S26haldaemon
Failed to persist "/etc/rc.d/rc3.d/S26haldaemon"
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 425, in persist
  File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 536, in _persist_symlink
  File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 442, in copy_attributes
RuntimeError: Cannot proceed, check if paths exist!
Cannot persist: /etc/rc.d/rc3.d/S26haldaemon
[root@rhevhtest admin]#

Comment 30 Greg Scott 2017-03-15 02:26:40 UTC
This looks promising.  I did this on my test rhevh-6.7 system - itself a VM. You can see how I disabled hald and it still finds its boot LVs.  But if I have, say, 23000 block devices, will udev still go nuts finding them all?

[root@rhevhtest admin]#
[root@rhevhtest admin]# ls -al /etc/rc.d/init.d | grep hald
-rwxr-xr-x.  1 root root  1801 2014-06-19 13:01 haldaemon
[root@rhevhtest admin]#
[root@rhevhtest admin]# mv /etc/rc.d/init.d/haldaemon /etc/rc.d/init.d/haldaemon-greg
[root@rhevhtest admin]# touch /etc/rc.d/init.d/haldaemon
[root@rhevhtest admin]# persist /etc/rc.d/init.d/haldaemon
Successfully persisted: /etc/rc.d/init.d/haldaemon
[root@rhevhtest admin]# persist /etc/rc.d/init.d/haldaemon-greg
Successfully persisted: /etc/rc.d/init.d/haldaemon-greg
[root@rhevhtest admin]#
[root@rhevhtest admin]#
[root@rhevhtest admin]# lvs
  LV      VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  Config  HostVG -wi-ao----  8.00m                                              
  Data    HostVG -wi-ao---- 43.38g                                              
  Logging HostVG -wi-ao----  2.00g                                              
  Swap    HostVG -wi-ao----  3.87g                                              
[root@rhevhtest admin]# vgs
  VG     #PV #LV #SN Attr   VSize  VFree
  HostVG   1   4   0 wz--n- 49.28g 28.00m
[root@rhevhtest admin]# pvs
  PV         VG     Fmt  Attr PSize  PFree
  /dev/sda4  HostVG lvm2 a--  49.28g 28.00m
[root@rhevhtest admin]# reboot

Broadcast message from admin.local
        (/dev/pts/0) at 2:15 ...

The system is going down for reboot NOW!
[root@rhevhtest admin]#

Logged back in after a reboot

login as: admin
admin.10.114's password:
Last login: Wed Mar 15 02:13:56 2017 from tinatinylenovo.infrasupport.local
[root@rhevhtest admin]# ps ax | grep hald
14111 pts/1    S+     0:00 grep hald
[root@rhevhtest admin]#
[root@rhevhtest admin]# lvs
  LV      VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  Config  HostVG -wi-ao----  8.00m                                              
  Data    HostVG -wi-ao---- 43.38g                                              
  Logging HostVG -wi-ao----  2.00g                                              
  Swap    HostVG -wi-ao----  3.87g                                              
[root@rhevhtest admin]# vgs
  VG     #PV #LV #SN Attr   VSize  VFree
  HostVG   1   4   0 wz--n- 49.28g 28.00m
[root@rhevhtest admin]# pvs
  PV         VG     Fmt  Attr PSize  PFree
  /dev/sda4  HostVG lvm2 a--  49.28g 28.00m
[root@rhevhtest admin]#
[root@rhevhtest admin]# more /etc/rc.d/init.d/haldaemon
[root@rhevhtest admin]#

Comment 31 Nir Soffer 2017-03-16 19:51:26 UTC
Greg, here is another idea that may improve or even eliminate the issues
with autoactivation in 6.7.

An LVM filter is the best option, since it avoids the unneeded scanning of all the
devices (including active lvs), but the lvm filter in 6.7 is broken, and setting
a filter is harder since you need a different filter for each host.

The next thing is auto_activation_volume_list - this allows selecting the
vgs that will be auto-activated. Since all the hosts are rhev-h, we know that
they should activate only the HostVG vg.

So you can do:

1. Edit lvm.conf

    auto_activation_volume_list = [ "HostVG" ]

2. Persist lvm.conf

I think that you don't have to regenerate initramfs for this, since all activation
is done after you switch to the root fs, via /etc/rc.d/rc.sysinit.

Or, as another solution, the lvm guys tell me that setting the kernel command line
parameter rd.lvm.vg will have the same effect, so you can set:

   rd.lvm.vg=HostVG

This may be easier to deploy on rhev-h, and I think Ryan already implemented this
for another bug recently.

Comment 32 Greg Scott 2017-03-16 20:06:14 UTC
This looks promising.  I'll pitch this.  If this works it should be in all rhvh systems.

Comment 33 Nir Soffer 2017-03-16 20:19:18 UTC
(In reply to Greg Scott from comment #32)
> This looks promising.  I'll pitch this.  If this works it should be in all
> rhvh systems.

Similar bugs:
- bug 1342786 - rhev-h 6.6
- bug 1400446 - rhev-h 7

For rhev-h 3.5, this may work for any host. rhev-h 4 is not as closed as rhev-h 3.5,
and I'm not sure we can assume the names of all the vgs on any given system.

Comment 34 Greg Scott 2017-03-20 03:13:12 UTC
Go figure.  The summary of the weekend work - disabling haldaemon made a difference; doing the LVM filter did not help. With haldaemon disabled, boots now take around 15 minutes and activation from maintenance mode happens in less than a minute.  This compares to boots taking around 1 - 2 hours with haldaemon turned on, and activations not working until restarting vdsmd by hand.

Booting a host with the LVM filter in place and haldaemon enabled - that host did not behave any differently than before.  Haldaemon is our bottleneck.

- Greg

Comment 35 Yaniv Lavi 2017-03-20 11:31:44 UTC
(In reply to Greg Scott from comment #34)
> Go figure.  The summary of the weekend work - disabling haldaemon made a
> difference; doing the LVM filter did not help. With haldaemon disabled,
> boots now take around 15 minutes and activation from maintenance mode
> happens in less than a minute.  This compares to boots taking around 1 - 2
> hours with haldaemon turned on, and activations not working until restarting
> vdsmd by hand.
> 
> Booting a host with the LVM filter in place and haldaemon enabled - that
> host did not behave any differently than before.  Haldaemon is our
> bottleneck.
> 
> - Greg

Please open a bug on RHEL about this use case and issue.

Comment 36 Greg Scott 2017-03-20 14:56:13 UTC
No.  I'm not opening yet another bug on RHEL.  This is a RHEV thing.  The use case is RHEV-H in a RHEV cluster with 2000+ active VMs.  The workaround we found from this weekend is disabling haldaemon at RHEV-H startup.

There is already a RHEL LVM bug about not handling LVM filters properly.  Since the hypervisors here are RHEVH-6.7, maybe they're subject to that bug.

Comment 40 Nir Soffer 2017-03-20 15:11:54 UTC
(In reply to Greg Scott from comment #36)
> No.  I'm not opening yet another bug on RHEL.  This is a RHEV thing.  The
> use case is RHEV-H in a RHEV cluster with 2000+ active VMs.  The workaround
> we found from this weekend is disabling haldaemon at RHEV-H startup.

Greg, we depend on RHEL, and hald is not a RHEV thing. The maintainer of this
package should check why it gets stuck for 30-60 minutes, delaying boot and
probably causing lvm commands to get stuck. In RHEV we will not touch this
daemon without a blessing from the maintainer.

> There is already a RHEL LVM bug about not handling LVM filters properly. 
> Since the hypervisors here are RHEVH-6.7, maybe they're subject to that bug.

This bug may depend on the hald issue; we know that lvm waits for udev events,
and if hald is causing timeouts in udev, lvm commands may get stuck.

Comment 41 Nir Soffer 2017-03-20 15:28:42 UTC
(In reply to Greg Scott from comment #34)
> Go figure.  The summary of the weekend work - disabling haldaemon made a
> difference; doing the LVM filter did not help. 

What do you mean by lvm filter?

I recommended:
- adding auto_activation_volume_list in lvm.conf
- or using rd.lvm.vg=HostVG

Which solution did you try?

If you tried the lvm.conf change, did you check that the changes persisted
correctly after the reboot? If not, this is a RHEV-H bug; it should support
modifying lvm.conf in some way.

> With haldaemon disabled,
> boots now take around 15 minutes and activation from maintenance mode
> happens in less than a minute.  This compares to boots taking around 1 - 2
> hours with haldaemon turned on, and activations not working until restarting
> vdsmd by hand.

This sounds very good, hopefully disabling hald does not break anything.

> Booting a host with the LVM filter in place and haldaemon enabled - that
> host did not behave any differently than before.  Haldaemon is our
> bottleneck.

I suspect that the mysterious "lvm filter" was not applied correctly.

Comment 48 Greg Scott 2017-03-20 19:55:19 UTC
From our phone call - I was messed up on how many of those LVM objects were LVs.  My mistake.  I think Yaniv has a copy of lsblk output now that details it all out.

re: Nir and a copy of lvm.conf - the lvm.conf on all the hosts right now is all virgin.  They may have a staging copy still available.

They tried the modified lvm.conf on host 0001 and top showed haldaemon pegged, as before, and the host would not activate.  So they disabled hald, per the instructions above, and rebooted, leaving the modified lvm.conf in place.  I watched the shutdown via Webex screen sharing - the shutdown deactivated a bunch of LVM objects with UUID names before finally finishing its shutdown.  There were no VMs because the host was still in maintenance mode.  So the lvm filter could not have filtered.

After disabling haldaemon, it booted in about 15 minutes and activated from maintenance mode in less than 60 seconds.

On host 0002, they left lvm.conf virgin and only disabled hald.  This host also booted within 15 minutes and activated in less than 60 seconds from maintenance mode.

Perhaps they ran into the bug where lvm doesn't honor filtering?  Hard to say, but they had a short maintenance window and 34 production hosts to deal with.

So, based on the results we found, they just continued with disabling hald.

Comment 49 Nir Soffer 2017-03-20 21:00:18 UTC
(In reply to Greg Scott from comment #48)
> Perhaps they ran into the bug where lvm doesn't honor filtering?

The lvm filter issue should not be related; if auto_activation_volume_list
works correctly, there should be no active lvs except the host lvs, and
hald should not cause any trouble.

Maybe the issue is that the initramfs was not regenerated; the lvm guys thought
that this is not needed for this configuration.

It is also possible that rhev-h has a twisted boot procedure that does not use
the lvm.conf you modified, so lvs are activated regardless of this setting.

We will have to test this in our lab with rhev-h 6.7 images.
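
One thing worth checking in that test is whether the initramfs on the image
carries its own copy of lvm.conf; a hedged sketch, assuming dracut's lsinitrd
is present and the usual /boot path (the RHEV-H layout may differ):

    # list the initramfs contents and look for an embedded lvm.conf
    lsinitrd /boot/initramfs-$(uname -r).img | grep lvm.conf
    # if one is embedded, the auto_activation_volume_list change would also have
    # to land there (i.e. the initramfs would need a rebuild) to affect anything
    # activated before switch-root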

Comment 50 Greg Scott 2017-03-23 11:47:45 UTC
I did your RHEL bugzilla.  See https://bugzilla.redhat.com/show_bug.cgi?id=1435198



- Greg

Comment 59 Greg Scott 2017-04-05 14:15:47 UTC
I just reopened this so we have a mechanism to track that scale testing with RHVH-7.

Comment 73 Nir Soffer 2017-05-03 14:33:37 UTC
Guy, can you repeat the test with lvm filter?

Comment 99 Nir Soffer 2017-05-15 09:07:27 UTC
(In reply to guy chen from comment #94)
> I have tested the 12K LVS on top of rhvh 4.1 release, with and without
> filters.

Thanks Guy, great work!

Can you attach vdsm logs from these runs? I want to understand where
time is spent in vdsm without a filter.

Comment 104 Nir Soffer 2017-05-15 09:13:23 UTC
Greg, can you confirm that this bug was tested properly? It seems that RHV 4.1 is
performing much better, especially when using an lvm filter.

Comment 105 Nir Soffer 2017-05-15 09:20:14 UTC
Guy, please see my request in comment 99; it seems that the needinfo was removed
by mistake.

Comment 109 Nir Soffer 2017-05-15 17:44:16 UTC
Guy, can you attach the output of:

journalctl -b
systemd-analyze time
systemd-analyze blame
systemd-analyze plot

If you perform more reboots with or without filter, please attach the output for
each boot.

Comment 110 Roy Golan 2017-05-15 18:15:12 UTC
> systemd-analyze plot

systemd-analyze plot > system_analyze_plot.svg

will be easier to view :)

Comment 120 Nir Soffer 2017-05-23 11:09:46 UTC
Summary of output uploaded by Guy:

Boot without filter, according to comment 113:

$ head systemd-analyze_blame_log 
   10min 24.711s kdump.service
    1min 44.053s vdsm-network.service
         36.314s lvm2-activation.service
         35.084s lvm2-activation-net.service
         32.021s lvm2-monitor.service
         27.805s lvm2-activation-early.service
          9.092s glusterd.service
          8.888s dev-mapper-rhvh_buri01\x2drhvh\x2d\x2d4.1\x2d\x2d0.20170506.0\x2b1.device
          7.869s network.service
          7.693s systemd-udev-settle.service

$ cat systemd-analyze_time_log 
Startup finished in 5.988s (kernel) + 17.706s (initrd) + 13min 8.717s (userspace) = 13min 32.412s

This run did not include a vdsm log, so I looked at the vdsm logs from a
previous boot (see comment 107):

1. Vdsm started:
2017-05-14 19:17:11,737+0300 INFO  (MainThread) [vds] (PID: 6318) I am the actual vdsm 4.19.12-1.el7ev buri01.eng.lab.tlv.redhat.com (3.10.0-514.16.1.el7.x86_64) (vdsm:145)

2. Vdsm starting deactivation of active lvs:
2017-05-14 19:17:31,484+0300 INFO  (hsm/init) [storage.LVM] Deactivating lvs: ...

3. Vdsm ready (this succeeds only when deactivation finish):
2017-05-14 19:25:06,730+0300 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities succeeded in 2.59 seconds (__init__:533)

So vdsm needed 7:30 minutes to deactivate all 12,000 lvs. We can probably
shorten this by running the deactivation in parallel, but the real solution
is an lvm filter, which eliminates those 7:30 minutes from vdsm startup time.

So the system is slowed down by having 12k lvs, but this is a huge improvement
compared with the reported issue on rhev-h 6.7.
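
For reference, the kind of lvm filter meant here is a whitelist in
/etc/lvm/lvm.conf that accepts only the host's own PVs and rejects everything
else; a hedged sketch with a placeholder device - every host needs its own
entry, ideally via a stable /dev/disk/by-id path:

    # devices { } section of /etc/lvm/lvm.conf; /dev/sda2 is illustrative
    filter = [ "a|^/dev/sda2$|", "r|.*|" ]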

Comment 121 Nir Soffer 2017-05-23 11:20:29 UTC
(In reply to guy chen from comment #94)
> I have tested the 12K LVS on top of rhvh 4.1 release, with and without
> filters.
...
> Test without Filter :
> From boot till vdsm was up took 17 minutes and 25 seconds
> ...
> Test with Filter :
> From boot till vdsm was up took 9 minutes and 7 seconds
> ...

According to the vdsm log, 7:30 minutes were spent deactivating the lvs when not
using a filter. Looking at systemd-analyze blame (comment 120), lvm spent about
1:30 minutes in activation/monitoring.

Using a filter shortens host boot time (time until vdsm is ready) by about
8:30 minutes, from 17:25 down to 9:07, roughly a 50% improvement.

Comment 131 errata-xmlrpc 2018-05-15 17:50:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 132 Franta Kust 2019-05-16 13:09:19 UTC
BZ<2>Jira Resync

