Bug 1424853
| Field | Value |
|---|---|
| Summary | RHEV-H host activation fails after reboot on setup with large number of lvs |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.5.7 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Greg Scott <gscott> |
| Assignee | Dan Kenigsberg <danken> |
| QA Contact | guy chen <guchen> |
| CC | cshao, dougsland, gscott, guchen, huzhao, jvmarinelli, lsurette, mgoldboi, mkalinin, nsoffer, pstehlik, qiyuan, ratamir, rbarry, rgolan, rhughes, sbonazzo, srevivo, tnisan, weiwang, yaniwang, ycui, ykaul, ylavi, yzhao |
| Target Milestone | ovirt-4.2.0 |
| Keywords | TestOnly, ZStream |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | No Doc Update |
| oVirt Team | Storage |
| Type | Bug |
| Clones | 1451240 (view as bug list) |
| Bug Depends On | 1428637, 1429203 |
| Bug Blocks | 1451240 |
| Last Closed | 2018-05-15 17:50:23 UTC |
Description (Greg Scott, 2017-02-19 21:27:06 UTC)
Does this reproduce on RHEL too? It doesn't seem to be RHEV-H related.

Probably does reproduce on RHEL, but we've only seen it on rhevh. I owe a better answer than what I can tap out on a teeny tiny cell phone keyboard. The unique part of all this is the use case. The customer operates around 2000 pooled, stateless VMs with thin provisioned storage at each site, using an XIO fiberchannel storage array for back end storage. Every one of those VMs consumes an LV; running VMs consume 2 LVs each, one for the VM, the other for its snapshot. Those VMs are Windows 7, and the customer does patch management by updating the templates, deleting old VMs and storage domains, and creating new VMs and storage domains. The customer also takes advantage of an XIO feature around treating zeroed blocks by setting all objects to "Wipe After Delete." With WAD set, deletions go all the way back to the SAN, the SAN declares the space free, and the customer no longer needs to delete old LUNs and create new LUNs to free up SAN space. The WAD method of operation has been in place for the past 2 months. We found this massive number of LVs just in the past 3 weeks. I suspect that WAD interacts with the XIO storage and somehow leaves legacies of LVM metadata that RHEV hosts subsequently pick up. I hope we can demonstrate that SCSI discards with RHVH-7 make all this moot. But we need to demonstrate it at scale. So, would this reproduce on RHEL? I'm sure it would, but I don't know of any relevant use case.

(In reply to Greg Scott from comment #0)

> Description of problem:
>
> Note, this is with RHEV 3.5.8, but the bugzilla form has no selection for 3.5.8.
>
> In a RHEV environment with around 2000 pooled, stateless VMs, host activations slow down and eventually stop working reliably. On the problem hosts, we see lsblk | wc -l showing tens of thousands of LVM objects. They appear to be orphan LVM objects and, so far, we aren't able to find where they come from.

On this setup we have 11 "current" storage domains and 11 "old" storage domains; the customer is rotating the storage domains regularly. On each set of storage domains there are about 2000 stateless vms with one disk. The running vms from the "current" storage domains consume 2 lvs per vm (one for the disk, one for the temporary disk while the vm is running). The vms on the old storage domains consume 1 lv per vm, so we have a total of 6000 lvs in this setup. These machines have 4 paths to storage, so lsblk shows each lv 4 times (one under each path). This explains why we see 24,000 lvs instead of the expected 6000.

> We also find udevd and hald consuming most of the hosts' CPU capacity trying to enumerate all those LVM objects.

We need data about this, showing the cpu usage of udev and hald during boot.

> Steps to Reproduce:
> Build a RHEV 3.5.8 environment with a few RHEVH-6-2016-0104 hosts. Make a fiberchannel datacenter with a few LUNs and 11 storage domains. Use the EMC XIO storage in TLV.

We also need 4 paths to storage to simulate this issue. The actual setup also has 11 old storage domains with the same vms, so we need 22 storage domains, not 11.

> Create a Windows VM template. Make sure its virtual disk is thin provisioned. Set it for Wipe After Delete. Create 11 pools, one pool per storage domain, each with 170 stateless VMs based on this template. All those VMs should have WAD set, flowing from the template. Each VM will consume 2 logical volumes (LV).
> One LV for the VM, the other for its snapshot. So each storage domain should have 340 LVs.

340 lvs when all the vms are running, I assume?

> Use automation to boot all those VMs, run them for a few minutes, and shut them down. Keep an eye on the lsblk | wc -l count on all the hosts. Delete the VMs and create new ones, while watching lsblk | wc -l.

Watching lsblk for counting lvs is not useful.

> Repeat the above paragraph a few times. My hunch is, that lsblk count will keep growing. It should drop after deleting the VMs, but I have a hunch it won't.

I did not find any evidence that we have an issue with deleting lvs, so I don't think this test is useful for this case.

> Reboot one of the hosts in this environment and try to activate it. In an ssh session on that host watch what happens with udevd and hald. And do a loop running lsblk | wc -l every 10 seconds.

Except for the lsblk loop, this is the interesting test - trying to simulate a host that cannot be activated after reboot.

> Upgrade RHEVM to 3.6 and hosts to RHEVH-7 for 3.6. Try the above steps again.
>
> Upgrade the environment to RHV4 and try again.

I think we want to test only with RHV4; 3.6 will be EOL soon.

> Actual results:
> Tens of thousands of LVM objects show up that cannot be accounted for from normal business.

The issue is the host not activating after reboot.

> Expected results:
> The number of LVM objects in the system should make sense based on the number of VMs. It should not be an order of magnitude too large, as we see in this environment.

The expected result is the host activating after reboot without manually restarting vdsm.

> Additional info:
>
> During activation, running lsblk | wc -l in a loop shows the LV count dropping after restarting vdsmd while activating a host. We captured the output below showing the sequence of events from one host today:

What we see here is the result of vdsm deactivating all lvs during startup. The mystery is why this process hangs during boot but works later. Greg, do we have a sosreport showing the timeframe from reboot, the time vdsm hung trying to deactivate lvs, restarted, and finally activated? All the sosreports I have seen were taken just after reboot, so I don't have enough info. Also, if we plan to reboot the hosts in this cluster, I would like to log the cpu usage of the system during this timeframe, maybe using top -b.

I found this issue in jabphx0008.

Vdsm startup:

    MainThread::INFO::2017-02-12 05:49:16,959::vdsm::132::vds::(run) (PID: 16816) I am the actual vdsm 4.16.32-1.el6ev jabphx0008.statestr.com (2.6.32-573.12.1.el6.x86_64)

Performing udevadm settle after scsi rescan - with a huge timeout:

    storageRefresh::DEBUG::2017-02-12 05:49:18,640::utils::755::root::(execCmd) /sbin/udevadm settle --timeout=480 (cwd None)

udevadm settle fails after 480 seconds:

    storageRefresh::DEBUG::2017-02-12 05:57:18,695::utils::775::root::(execCmd) FAILED: <err> = ''; <rc> = 1
    storageRefresh::ERROR::2017-02-12 05:57:18,696::udevadm::60::root::(settle) Process failed with rc=1 out='\nudevadm settle - timeout of 480 seconds reached, the event queue contains:\n  /sys/devices/virtual/net/eth0.2000 (24009)\n  /sys/devices/virtual/net/eth0.2000 (24010)\n  /sys/devices/virtual/net/eth0.2000/queues/rx-0 (24011)\n  /sys/devices/virtual/net/eth0.2000/queues/tx-0 (24012)\n' err=''

In /etc/vdsm/vdsm.conf we see:

    $ cat etc/vdsm/vdsm.conf
    [addresses]
    management_port = 54321

    [vars]
    ssl = true

    [irs]
    scsi_settle_timeout = 480

I have also seen this on other hosts. This explains the unexpected timeout.
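To make the effect of that setting concrete, here is a minimal, hypothetical sketch (not vdsm's actual implementation) of a settle call driven by a configurable timeout: with scsi_settle_timeout = 480, one stuck udev event queue blocks the storage refresh for a full eight minutes before it fails the way the log above shows.

```python
import subprocess

# Hypothetical stand-in for the value read from the [irs] section of
# /etc/vdsm/vdsm.conf; 480 matches the override found on these hosts.
SCSI_SETTLE_TIMEOUT = 480


def settle(timeout=SCSI_SETTLE_TIMEOUT):
    """Wait for udev to drain its event queue, giving up after `timeout` seconds.

    Mirrors the command seen in the vdsm log:
        /sbin/udevadm settle --timeout=480
    """
    cmd = ["/sbin/udevadm", "settle", "--timeout=%d" % timeout]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # On timeout udevadm lists the events still pending, which is what
        # the storageRefresh ERROR above captured.
        print("udevadm settle failed: rc=%d out=%r err=%r"
              % (proc.returncode, out, err))
    return proc.returncode


if __name__ == "__main__":
    settle()
```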
Using such a large value is not a good idea; it may cause long timeouts during vdsm flows, which may be a reason why a host did not activate. Greg, when was this value set in this cluster? Please make sure they return it to the default. If we see timeouts in udevadm settle in normal use, maybe use a bigger value, like 10-15 seconds. If we still have timeouts, we need to check why udev times out.

We've been fighting this problem of hosts not activating for several months and one suggestion was to increase the scsi_settle_timeout value. The theory at the time - which we now know is wrong - was to give udev enough time to do its thing so hosts could activate. I put a comment in one of the support cases last night to put scsi_settle_timeout back to its default on all hosts and not go any higher than 10-15 seconds with it. Oh yes - I don't have any recordings of top output, but we have had people watching top once they can get in via ssh. We see CPU maxed, split roughly evenly between udevd and hald. I'll see what I can do about getting one of these recorded. - Greg

Dumping the total lvs we see on different hosts:

    $ python lvs-stats.py < sosreport-jabphx0002.*.01779057-20170206210250/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    352 475f3971-83e8-48d4-af55-957b84c5b42e 341 0 0
    350 a09ec488-fc29-462f-9319-fd13d1335131 350 0 0
    349 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47 349 0 0
    349 eba89272-6603-440d-8662-579882e4c566 349 0 0
    340 34c280f6-ea97-4ac1-af2e-241c83d106ad 339 0 0
    337 4b1948cb-8500-4155-834c-48a3b1e39abb 336 0 0
    315 a4f0eb6c-4904-41fc-b12b-84b053c98dc1 313 0 0
    296 bb273403-5489-4306-bdae-636afdc2689b 296 0 0
    276 dacd9668-6ca6-4c60-bffa-1efecdfb3595 275 0 0
    228 3a5fb6b8-74b1-47c8-857f-aa3a799f8699 227 0 0
    218 95231fba-5724-4063-8492-0866f4c74de8 218 0 0
    217 a214b043-c558-433a-8a00-9c12b41d590f 217 0 0
    209 482fb94d-b6a4-4fae-8728-d6204f1fd273 209 0 0
    191 f0a154ad-c212-4702-808a-7f371b4f8c8c 191 0 0
    191 375bfc4e-0747-44b4-9f16-662436445797 191 0 0
    190 b38d008f-b93f-4650-ba91-2f41b3ad4b41 190 0 0
    182 fc988f63-4051-44a6-988c-09a1ca9b1157 182 0 0
    179 c6e07169-a202-4a44-b2ae-440110ce0323 179 0 0
    176 018fd3a0-1513-4f80-81f0-ede47686cb8e 176 0 0
    144 a962de39-4d11-4f45-b3e0-a3e6da75796e 144 0 0
    134 f8e8526c-02fa-4906-b593-3821f99cb392 134 0 0
    112 5696ccc8-d0cd-4d35-9e3b-30c214408213 112 0 0
    107 9e436258-191d-411e-9c8f-3d7165682e90 107 0 0
    78 203e766c-a3fe-4b80-abc8-f5217a16d787 78 0 0
    62 b1bfa173-1748-4ae4-99b9-e6174528b03d 62 0 0
    56 1ca68870-23b3-4f02-b185-9a469ad14a90 56 0 0
    8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e 6 0 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 5651 active: 5631 open: 4

We see that we have 5651 lvs, and practically all of them are active.
    $ python lvs-stats.py < sosreport-jabphx0004.*.01797770-20170301175143/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    355 475f3971-83e8-48d4-af55-957b84c5b42e 31 26 0
    353 bb273403-5489-4306-bdae-636afdc2689b 6 1 0
    349 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47 6 1 0
    349 eba89272-6603-440d-8662-579882e4c566 6 1 0
    349 4b1948cb-8500-4155-834c-48a3b1e39abb 6 1 0
    348 a09ec488-fc29-462f-9319-fd13d1335131 6 1 0
    348 34c280f6-ea97-4ac1-af2e-241c83d106ad 6 1 0
    325 a4f0eb6c-4904-41fc-b12b-84b053c98dc1 6 1 0
    292 b6c183ce-a020-4442-b7d8-596c8cf042a9 6 1 0
    283 5458edbd-1f29-4960-b67c-31644d53cac4 6 1 0
    274 dacd9668-6ca6-4c60-bffa-1efecdfb3595 6 1 0
    268 efb361c3-7cab-4601-a35f-f859c017ca11 6 1 0
    254 9b5c609b-0165-412a-9bcb-05e9bd8f802d 6 1 0
    242 3a5fb6b8-74b1-47c8-857f-aa3a799f8699 27 22 0
    216 95231fba-5724-4063-8492-0866f4c74de8 31 26 0
    200 a3ba1656-9567-4a6a-a939-e5ce0d2282f6 6 1 0
    179 2711b96f-b7d3-4d1c-98a3-c12046496c97 6 1 0
    163 e7ca2cca-94c2-45ef-ae12-a60d138c0faf 9 1 0
    112 5696ccc8-d0cd-4d35-9e3b-30c214408213 18 13 0
    104 a8a8d55f-d042-4f81-b8c2-cd676dacb214 6 1 0
    78 203e766c-a3fe-4b80-abc8-f5217a16d787 6 1 0
    62 b1bfa173-1748-4ae4-99b9-e6174528b03d 6 1 0
    56 1ca68870-23b3-4f02-b185-9a469ad14a90 6 1 0
    9 ef0b8570-f345-4187-9645-49ac484ffaba 6 1 0
    9 9d7ed78c-ceb2-4cb4-a52c-28c21ba320cf 6 1 0
    9 608a69c8-cf0b-49e3-a8b1-cd568e0b4e0c 6 1 0
    8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e 6 0 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 5599 active: 252 open: 113

On this host only 252 lvs are active, probably after vdsm deactivated all unused lvs.

    $ python lvs-stats.py < sosreport-jabphx0008.*-20170212122136/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    340 475f3971-83e8-48d4-af55-957b84c5b42e 338 0 0
    331 34c280f6-ea97-4ac1-af2e-241c83d106ad 331 0 0
    327 4b1948cb-8500-4155-834c-48a3b1e39abb 327 0 0
    326 a09ec488-fc29-462f-9319-fd13d1335131 6 0 0
    324 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47 324 0 0
    322 eba89272-6603-440d-8662-579882e4c566 322 0 0
    297 a4f0eb6c-4904-41fc-b12b-84b053c98dc1 297 0 0
    295 bb273403-5489-4306-bdae-636afdc2689b 295 0 0
    254 dacd9668-6ca6-4c60-bffa-1efecdfb3595 253 0 0
    218 3a5fb6b8-74b1-47c8-857f-aa3a799f8699 217 0 0
    205 95231fba-5724-4063-8492-0866f4c74de8 205 0 0
    112 5696ccc8-d0cd-4d35-9e3b-30c214408213 112 0 0
    78 203e766c-a3fe-4b80-abc8-f5217a16d787 78 0 0
    62 b1bfa173-1748-4ae4-99b9-e6174528b03d 62 0 0
    56 1ca68870-23b3-4f02-b185-9a469ad14a90 56 0 0
    8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e 8 0 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 3560 active: 3235 open: 4

I think this shows the state after the old storage domains were removed, but we still see that most lvs are active; maybe vdsm started to deactivate some lvs when this sosreport was created.
    $ python lvs-stats.py < sosreport-jabphx0010.*.01779057-20170211203520/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    348 475f3971-83e8-48d4-af55-957b84c5b42e 348 0 0
    341 eba89272-6603-440d-8662-579882e4c566 341 0 0
    340 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47 340 0 0
    337 4b1948cb-8500-4155-834c-48a3b1e39abb 337 0 0
    336 34c280f6-ea97-4ac1-af2e-241c83d106ad 336 0 0
    335 a09ec488-fc29-462f-9319-fd13d1335131 6 0 0
    303 a4f0eb6c-4904-41fc-b12b-84b053c98dc1 303 0 0
    295 bb273403-5489-4306-bdae-636afdc2689b 295 0 0
    271 dacd9668-6ca6-4c60-bffa-1efecdfb3595 271 0 0
    231 3a5fb6b8-74b1-47c8-857f-aa3a799f8699 231 0 0
    219 95231fba-5724-4063-8492-0866f4c74de8 219 0 0
    112 5696ccc8-d0cd-4d35-9e3b-30c214408213 112 0 0
    78 203e766c-a3fe-4b80-abc8-f5217a16d787 78 0 0
    62 b1bfa173-1748-4ae4-99b9-e6174528b03d 62 0 0
    56 1ca68870-23b3-4f02-b185-9a469ad14a90 56 0 0
    8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e 8 0 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 3677 active: 3347 open: 4

Similar to jabphx0008.

    $ python lvs-stats.py < sosreport-jabphx0014.*.01779057-20170211203907/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    348 475f3971-83e8-48d4-af55-957b84c5b42e 6 1 0
    341 eba89272-6603-440d-8662-579882e4c566 6 1 0
    340 ffc79d5c-ade4-4aa7-b29f-d18d0eb9ed47 6 1 0
    337 4b1948cb-8500-4155-834c-48a3b1e39abb 6 1 0
    336 34c280f6-ea97-4ac1-af2e-241c83d106ad 6 1 0
    335 a09ec488-fc29-462f-9319-fd13d1335131 6 1 0
    303 a4f0eb6c-4904-41fc-b12b-84b053c98dc1 6 1 0
    295 bb273403-5489-4306-bdae-636afdc2689b 6 1 0
    271 dacd9668-6ca6-4c60-bffa-1efecdfb3595 6 1 0
    231 3a5fb6b8-74b1-47c8-857f-aa3a799f8699 6 1 0
    219 95231fba-5724-4063-8492-0866f4c74de8 6 1 0
    112 5696ccc8-d0cd-4d35-9e3b-30c214408213 6 1 0
    78 203e766c-a3fe-4b80-abc8-f5217a16d787 6 1 0
    62 b1bfa173-1748-4ae4-99b9-e6174528b03d 6 1 0
    56 1ca68870-23b3-4f02-b185-9a469ad14a90 6 1 0
    8 0416631d-a0c5-4a9e-b5d7-bedce8c0e85e 6 0 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 3677 active: 100 open: 19

This looks like normal operation mode, vdsm deactivated most lvs.
    $ python lvs-stats.py < sosreport-gdcphx0027.*.01797770-20170224184848/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
    #lv vg name #active #open #removed
    534 08fcdfc2-e31b-4fb0-83ae-2c6e55dcd772 21 14 0
    502 132791ee-3fef-4945-9279-cf2bbddd85f2 9 4 0
    488 2c66ca06-c83f-4dd0-ae8b-8c72cdd29035 21 12 0
    351 76afadf1-7b90-4f41-8601-f766fd4a7ada 17 12 0
    327 3377e9e7-63bc-4313-bdda-47cb38487860 19 14 0
    320 7b7daebc-982a-4ba3-ba7d-ddf418468ebc 35 28 0
    315 a475c700-a749-4b6e-9e3b-53c08af5e155 9 4 0
    265 e93af66a-cdcb-46a4-b52c-cf888902d354 16 6 0
    246 c6a49433-ae49-4d29-93ff-0b6534b35758 6 1 0
    193 b1c45380-e5e4-42e9-90f4-ff6732b57a50 6 1 0
    158 436af53a-1ac5-4a27-b78b-9387aa0977e6 6 1 0
    92 40d64e44-18d4-4cef-9e91-e03ca206f64c 6 1 0
    72 dc25e385-332e-4feb-8ea2-78a7036ac8d1 6 2 0
    56 41010432-a8ff-47c9-a538-46b37f0188ac 12 7 0
    48 d6d7afa9-5356-4c12-9659-5817f6767c5e 6 1 0
    8 f6ae30f1-9d4f-48f8-9e66-6f77cd906aa9 6 1 0
    8 b630389e-2d3b-4905-9e16-eaedd4b59581 6 1 0
    8 a7edf3cc-4812-497d-bf91-0f53a9450d52 6 1 0
    8 8a03fffa-b81e-4528-9a7e-c812850ceed0 6 1 0
    8 62cd036f-66c9-4c91-a80c-08fa29bf9de9 6 1 0
    8 4cb204e9-e799-400b-b449-97e3fd3c59dd 6 1 0
    8 3133b55f-6fb6-4c39-90f2-e18e1828eea6 6 1 0
    8 21d2912d-b267-4726-9077-f7cd0f5dc028 6 1 0
    8 1728c225-e465-4832-85e0-ded99280b4c0 6 1 0
    8 0d22631c-d0eb-4581-acda-9b56a3933594 6 1 0
    8 009bb777-0cac-433c-b292-b6869fe32914 6 1 0
    4 HostVG 4 4 0
    1 VG 0 0 0
    totals lv: 4060 active: 265 open: 123

Here we seem to have new storage domains with no vms created yet.

Important note - we don't see any partly removed lvs in the cluster; this means there is no issue of aborted or failed wipe after delete operations. Before we delete an lv, we add a "remove_me" tag to the lv.

Created attachment 1259616 [details]
Script to get lvm stats from sosreport
This script gives a useful overview from sosreport/sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0
This will be part of vdsm later, posting it here so people can reuse it.
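The attachment itself is not reproduced in this bug. Below is a minimal sketch of the same idea, assuming the sosreport dump contains the default lvs columns (LV, VG, Attr, ...) plus lv_tags and devices, and that partly removed lvs carry the "remove_me" tag described above; column positions and output layout are assumptions, not a copy of the attached lvs-stats.py.

```python
#!/usr/bin/env python
"""Rough sketch of an lvs-stats style summary (not the attached script).

Reads a sosreport lvs dump on stdin and prints, per VG, the number of lvs,
how many are active, how many are open, and how many carry the "remove_me"
tag that vdsm adds before deleting an lv.
"""
import sys
from collections import defaultdict


def main():
    stats = defaultdict(lambda: dict(lvs=0, active=0, open=0, removed=0))
    for line in sys.stdin:
        fields = line.split()
        # Assumed layout: LV VG Attr LSize ... [lv_tags] [devices]
        if len(fields) < 3 or fields[0] == "LV":
            continue
        vg, attr = fields[1], fields[2]
        if len(attr) < 6:
            continue  # not an lv attribute string
        s = stats[vg]
        s["lvs"] += 1
        if attr[4] == "a":       # 5th attr character: active
            s["active"] += 1
        if attr[5] == "o":       # 6th attr character: open
            s["open"] += 1
        if "remove_me" in line:  # tag added before wipe-after-delete
            s["removed"] += 1

    print("#lv vg name #active #open #removed")
    totals = dict(lvs=0, active=0, open=0)
    for vg, s in sorted(stats.items(), key=lambda kv: kv[1]["lvs"], reverse=True):
        print("%d %s %d %d %d" % (s["lvs"], vg, s["active"], s["open"], s["removed"]))
        for key in totals:
            totals[key] += s[key]
    print("totals lv: %(lvs)d active: %(active)d open: %(open)d" % totals)


if __name__ == "__main__":
    main()
```

Usage would mirror the commands above, e.g. `python lvs-stats-sketch.py < sos_commands/lvm2/lvs_-a_-o_lv_tags_devices_--config_global_locking_type_0`.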
Checking multipath configuration - on all hosts I see this warning:

    MainThread::WARNING::2017-02-12 05:49:17,433::multipath::162::Storage.Multipath::(isEnabled) This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives

This means the multipath.conf file is based on an old version and should be updated to the new version. This is the version used on the hosts:

    # RHEV REVISION 1.0
    # RHEV PRIVATE

    defaults {
        polling_interval        5
        getuid_callout          "/sbin/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        no_path_retry           fail
        user_friendly_names     no
        flush_on_last_del       yes
        fast_io_fail_tmo        5
        dev_loss_tmo            30
        max_fds                 4096
    }

    devices {
        device {
            vendor                  XtremIO
            product                 XtremApp
            path_selector           "queue-length 0"
            rr_min_io_rq            1
            path_grouping_policy    multibus
            path_checker            tur
            failback                immediate
            fast_io_fail_tmo        15
        }
    }

The version shipped in 3.5.8 is 1.1. It seems that the defaults used on the customer machines are updated from version 1.1, so it would be a good idea to update the version of the file. This will avoid the warning about the "downrevved template".

One thing that I'm not sure about is not having "no_path_retry" in the XtremIO device configuration. This may use the default "no_path_retry fail", or another value. It is a good idea to define *all* device values in each device section. In upstream we want to change no_path_retry from "fail" to "4", see https://gerrit.ovirt.org/61281 - this may also be a good setting for this setup, but best consult with the storage vendor about this. I'm not sure about the value of fast_io_fail_tmo; the default is 5 seconds, they are using 15 seconds. Greg, do you know why they are using this value?

We talked about multipath.conf a couple weeks ago with EMC while troubleshooting the multipath issue during the SAN firmware upgrade. After looking it over and talking about it, nobody had any objections to any settings. The settings here have been in place for a few years. Thinking about it now though, I wonder if a longer fast_io_fail_tmo contributed to the multipath issue? I'm not sure exactly what that parameter regulates. Is it a timeout to declare a path down, or the time to declare a previously down path back up? I wonder if an issue came up years ago when paths flapped up and down and they set it this way to make it less sensitive? - Greg

This is from BZ #1428637: (In reply to Nir Soffer from comment #13)

> This seems to be fixed in rhel 6.8:
>
>     # lvs -vvvv --config 'devices { filter = ["a|/dev/vda2|", "r|.*|"] }' vg0 2>&1 | grep Open
>     #device/dev-io.c:559  Opened /dev/vda2 RO O_DIRECT
>     #device/dev-io.c:559  Opened /dev/vda2 RO O_DIRECT
>     #device/dev-io.c:559  Opened /dev/vda2 RO O_DIRECT
>     #device/dev-io.c:559  Opened /dev/vda2 RO O_DIRECT
>     #device/dev-io.c:559  Opened /etc/lvm/lvm.conf RO
>
>     # rpm -qa | grep lvm2
>     lvm2-2.02.143-7.el6.x86_64
>     lvm2-libs-2.02.143-7.el6.x86_64
>
>     # cat /etc/redhat-release
>     Red Hat Enterprise Linux Server release 6.8 (Santiago)
>
> Customers affected by this should upgrade to 6.8 or 7.3.

Please recommend upgrade to resolve this issue. 6.8 was released to the 3.5 stream as well.

Yaniv, the fact that the lvm filter issue (bug 1428637) was fixed in 6.8 is not enough to close this bug. We assume that the lvm filter issue is related, but we do not yet understand the root cause of this bug. We need help from the lvm team to understand this. Reopening this bug to complete the investigation.
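The manual check quoted from comment #13 can be scripted. The sketch below is only an illustration: it assumes the lvs -vvvv debug lines keep the "Opened <device>" format shown above, which may vary between lvm2 versions.

```python
#!/usr/bin/env python
"""Sketch: count which devices lvm actually opens under a given filter.

Wraps the manual check from comment #13 (lvs -vvvv ... 2>&1 | grep Open).
The debug-line format is an assumption and may vary between lvm2 versions.
"""
import subprocess
import sys
from collections import Counter


def opened_devices(filter_expr, vg=None):
    cmd = ["lvs", "-vvvv", "--config",
           "devices { filter = [ %s ] }" % filter_expr]
    if vg:
        cmd.append(vg)
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out = proc.communicate()[0].decode("utf-8", "replace")
    opened = Counter()
    for line in out.splitlines():
        # e.g. #device/dev-io.c:559  Opened /dev/vda2 RO O_DIRECT
        if " Opened " in line:
            opened[line.split(" Opened ")[1].split()[0]] += 1
    return opened


if __name__ == "__main__":
    # With a working filter, only the allowed devices (and lvm.conf) show up.
    flt = '"a|/dev/vda2|", "r|.*|"'
    vg = sys.argv[1] if len(sys.argv) > 1 else None
    for dev, count in sorted(opened_devices(flt, vg).items()):
        print("%s opened %d times" % (dev, count))
```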
I think the root cause of this bug is bug 1429203 - stuck lvm command during boot.

Also, the assumption in this bz is that RHEVH-6.8 does the filtering automatically, even with boot from SAN scenarios. But it's apparently not that simple. From what I understand, RHEVH-6.8 will allow LVM filtering, but we need to figure out the individual boot block devices in each boot-from-SAN host and then put in the filters by hand. - Greg

(In reply to Greg Scott from comment #18)

> Also, the assumption in this bz is that RHEVH-6.8 does the filtering automatically, even with boot from SAN scenarios.

There is no such assumption; rhel 6.8 or 7 do not solve this issue, they only provide the lvm filter.

> But it's apparently not that simple. From what I understand, RHEVH-6.8 will allow LVM filtering, but we need to figure out the individual boot block devices in each boot-from-SAN host and then put in the filters by hand.

Correct, this is the recommended way to configure a host for booting from SAN, and these hosts are not configured in this way. The configuration will be the same on 6.8 or 7 (a sketch automating steps 1 and 2 is shown after this exchange):

1. Find the devices needed by the host. You can find them using:

       vgs -o pv_name HostVG

   This assumes rhev-h, where there is one vg named HostVG. If other vgs are needed by the host, you need to include them in the command.

2. Create a filter from the found devices in /etc/lvm.conf:

       filter = [ "a|^/dev/mapper/xxxyyy$|", "r|.*|" ]

3. Persist lvm.conf - on rhev-h it is not enough to modify the file; the changes will be lost unless the modified file is persisted.

4. Finally, rebuild the initramfs so lvm.conf is updated.

Adding Douglass to add more info on how to persist lvm.conf and rebuild the initramfs on rhev-h.

The rebuilding of initramfs is the heavy lifting. What happens when a rhevh update arrives? What about disabling haldaemon? That should be easy enough to do. Would it do any good? How do you persist chkconfig --level 345 haldaemon off? I ran across another idea too: add the following to /etc/rc.local to re-scan after booting:

    echo 512 > /sys/module/scsi_mod/parameters/max_report_luns

- Greg

We have an abstraction for this under ovirt.node.utils.system.Initramfs. This is called from ovirtnode.Install.ovirt_boot_setup() (which is always part of ovirt-node-upgrade.py). In this sense, we'll always catch it on updates. Douglas, can you please find an invocation which works here? It will probably be something like:

    python -c "from ovirt.node.utils.system import Initramfs; Initramfs().rebuild()"

It's possible that 'dracut -f' will work directly, but I haven't tested...
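The sketch referenced in steps 1 and 2 above: an illustrative helper (not the supported configuration tooling) that derives a candidate whitelist filter from the PVs backing the VGs the host itself needs, HostVG by default on rhev-h. Review the output by hand before editing lvm.conf, persisting it, and rebuilding the initramfs as described.

```python
#!/usr/bin/env python
"""Illustrative only: print a candidate lvm.conf filter line built from
the PVs backing the VGs the host itself needs (HostVG on rhev-h).
Device names are used verbatim; escape any regex metacharacters by hand.
"""
import subprocess
import sys


def host_pvs(vgs):
    # Step 1 from the comment above: vgs -o pv_name HostVG
    out = subprocess.check_output(
        ["vgs", "--noheadings", "-o", "pv_name"] + list(vgs))
    return sorted(set(out.decode().split()))


def filter_line(pvs):
    # Step 2: whitelist the host's PVs, reject everything else.
    accepts = ", ".join('"a|^%s$|"' % pv for pv in pvs)
    return 'filter = [ %s, "r|.*|" ]' % accepts


if __name__ == "__main__":
    vgs = sys.argv[1:] or ["HostVG"]
    print(filter_line(host_pvs(vgs)))
```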
(In reply to Ryan Barry from comment #22)

Agree with Ryan, but I would first give a try to /usr/sbin/ovirt-node-rebuild-initramfs:

    # cat /etc/redhat-release
    Red Hat Enterprise Virtualization Hypervisor release 7.3 (20170118.0.el7ev)
    # uname -a
    Linux localhost 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
    # /usr/sbin/ovirt-node-rebuild-initramfs
    INFO - Preparing to regenerate the initramfs
    INFO - The regenreation will overwrite the existing
    INFO - Rebuilding for kver: 3.10.0-514.6.1.el7.x86_64
    INFO - Mounting '/liveos' to /boot
    INFO - Generating new initramfs '/var/tmp/initrd0.img.new' for kver 3.10.0-514.6.1.el7.x86_64 (this can take a while)
    INFO - Installing the new initramfs '/var/tmp/initrd0.img.new' to '/boot/initrd0.img'
    INFO - Successfully unmounted /liveos and /boot
    INFO - Initramfs regenration completed successfully
    [root@localhost ~]#

(In reply to Greg Scott from comment #21)

> What about disabling haldaemon?

According to:

- https://access.redhat.com/solutions/27571
- https://access.redhat.com/solutions/16609
- https://bugzilla.redhat.com/737755
- https://bugzilla.redhat.com/515734

it may not be needed on a server, and is not designed for handling 6000 devices.

> echo 512 > /sys/module/scsi_mod/parameters/max_report_luns

You have 20-30 luns, how is this going to help?

Grabbing at straws I guess. I need to offer something before their maintenance window this weekend. You're right, dealing with LUNs does no good. On disabling haldaemon - how do I persist chkconfig --level 345 haldaemon off, so it stays that way after a reboot on RHEVH-6? I looked over all those articles and BZs - all of them are about the real RHEL6. And re: Doug - that nifty rebuild-initramfs utility isn't there in this version. Is it there with 6.8?

    [root@rhevhtest admin]# /usr/sbin/ovirt-node-rebuild-initramfs
    bash: /usr/sbin/ovirt-node-rebuild-initramfs: No such file or directory
    [root@rhevhtest admin]# more /etc/redhat-release
    Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160104.0.el6ev)

- Greg

Richard, Greg reports that hald consumes a huge amount of cpu during boot, which takes more than 30 minutes on these systems, having about 6000 logical volumes. Can you confirm that hal is not required on a rhel 6 server?

(In reply to Greg Scott from comment #25)

> And re: Doug - that nifty rebuild-initramfs utility isn't there in this version. Is it there with 6.8?

I suspect that Douglass is talking about the shiny rhev-h 4.1 that is very much regular rhel, but we need instructions for good old rhev-h 6.8.

Well, the shiny new 4.1/4.0 doesn't need a utility to generate the initramfs, since dracut works (mostly -- a lot of system utilities don't actually seem to put their output in the same path as BOOT_IMAGE, but we have a workaround for this...). ovirt-node-rebuild-initramfs landed in RHEV 3.6, along with the abstraction in system.Initramfs(), unfortunately. That abstraction isn't terribly complex here, but I suspect that the customer will still need to F2 out to a shell to rebuild the initramfs. It's been a long time since I looked at 3.5, unfortunately.

chkconfig makes symlinks in /etc/rc${X}.d, or removes them. If we're lucky, all of /etc is persisted, and we can nuke them in /config/etc/rc3.d/XXhaldaemon. If not, the customer will probably have to create an empty file or noop script, then persist that.
Persisting/unpersisting init scripts is always race-y, though, since we're depending on hald (for example) to start *after* readonly-root and the other scripts which bind mount /config/etc over /etc. They might, but might not...

The question for Richard is, what breaks if I disable haldaemon on RHEVH-6.7? There must have been a good reason to turn on haldaemon in the first place, right? I have a challenge coming up this weekend when they'll need to reboot 34 hypervisors. They have maybe 48 hours and each reboot cycle now takes about 2 hours. The numbers add up badly. What can we do with what's in place right now to make this better?

I put this in the support case yesterday. I should have also posted it here. Persisting/unpersisting symlinks in RHEVH-6 doesn't seem to work the same as real files. If manipulating chkconfig doesn't work, I'll just rename or edit /etc/rc.d/init.d/haldaemon.

From support case comment #181 (Created By: Greg Scott, 3/13/2017 5:07 PM):

I was trying to figure out how to persist chkconfig --level 345 haldaemon off in RHEVH-6 today. We can always edit /etc/rc.d/init.d/haldaemon and comment out the start portion of the init script, but I was hoping we could come up with a chkconfig method that works.

    [root@rhevhtest admin]# chkconfig --list | grep hald
    haldaemon  0:off  1:off  2:off  3:on  4:on  5:on  6:off
    [root@rhevhtest admin]# chkconfig --level 345 haldaemon off
    [root@rhevhtest admin]# chkconfig --list | grep hald
    haldaemon  0:off  1:off  2:off  3:off  4:off  5:off  6:off
    [root@rhevhtest admin]# reboot

And it's right back the way it was.

    [root@rhevhtest rc3.d]# chkconfig --list | grep hald
    haldaemon  0:off  1:off  2:off  3:on  4:on  5:on  6:off

Unpersisting and persisting the run level 3 symlink doesn't work.

    [root@rhevhtest admin]# ls /etc/rc.d/rc3.d -al | grep hald
    lrwxrwxrwx. 1 root root 19 2016-01-04 21:07 S26haldaemon -> ../init.d/haldaemon
    [root@rhevhtest admin]# unpersist /etc/rc.d/rc3.d/S26haldaemon
    File not explicitly persisted: /etc/rc.d/rc3.d/S26haldaemon
    [root@rhevhtest admin]# persist /etc/rc.d/rc3.d/S26haldaemon
    Failed to persist "/etc/rc.d/rc3.d/S26haldaemon"
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 425, in persist
      File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 536, in _persist_symlink
      File "/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py", line 442, in copy_attributes
    RuntimeError: Cannot proceed, check if paths exist!
    Cannot persist: /etc/rc.d/rc3.d/S26haldaemon

This looks promising. I did this on my test rhevh-6.7 system - itself a VM. You can see how I disabled hald and it still finds its boot LVs. But if I have, say, 23000 block devices, will udev still go nuts finding them all?

    [root@rhevhtest admin]# ls -al /etc/rc.d/init.d | grep hald
    -rwxr-xr-x. 1 root root 1801 2014-06-19 13:01 haldaemon
    [root@rhevhtest admin]# mv /etc/rc.d/init.d/haldaemon /etc/rc.d/init.d/haldaemon-greg
    [root@rhevhtest admin]# touch /etc/rc.d/init.d/haldaemon
    [root@rhevhtest admin]# persist /etc/rc.d/init.d/haldaemon
    Successfully persisted: /etc/rc.d/init.d/haldaemon
    [root@rhevhtest admin]# persist /etc/rc.d/init.d/haldaemon-greg
    Successfully persisted: /etc/rc.d/init.d/haldaemon-greg
    [root@rhevhtest admin]# lvs
      LV      VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      Config  HostVG -wi-ao----  8.00m
      Data    HostVG -wi-ao---- 43.38g
      Logging HostVG -wi-ao----  2.00g
      Swap    HostVG -wi-ao----  3.87g
    [root@rhevhtest admin]# vgs
      VG     #PV #LV #SN Attr   VSize  VFree
      HostVG   1   4   0 wz--n- 49.28g 28.00m
    [root@rhevhtest admin]# pvs
      PV         VG     Fmt  Attr PSize  PFree
      /dev/sda4  HostVG lvm2 a--  49.28g 28.00m
    [root@rhevhtest admin]# reboot

    Broadcast message from admin.local (/dev/pts/0) at 2:15 ...
    The system is going down for reboot NOW!

Logged back in after a reboot:

    login as: admin
    admin.10.114's password:
    Last login: Wed Mar 15 02:13:56 2017 from tinatinylenovo.infrasupport.local
    [root@rhevhtest admin]# ps ax | grep hald
    14111 pts/1 S+ 0:00 grep hald
    [root@rhevhtest admin]# lvs
      LV      VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      Config  HostVG -wi-ao----  8.00m
      Data    HostVG -wi-ao---- 43.38g
      Logging HostVG -wi-ao----  2.00g
      Swap    HostVG -wi-ao----  3.87g
    [root@rhevhtest admin]# vgs
      VG     #PV #LV #SN Attr   VSize  VFree
      HostVG   1   4   0 wz--n- 49.28g 28.00m
    [root@rhevhtest admin]# pvs
      PV         VG     Fmt  Attr PSize  PFree
      /dev/sda4  HostVG lvm2 a--  49.28g 28.00m
    [root@rhevhtest admin]# more /etc/rc.d/init.d/haldaemon
    [root@rhevhtest admin]#

Greg, here is another idea that may improve or even eliminate the issues with autoactivation in 6.7. The LVM filter is the best option, since it saves the unneeded scanning of all the devices (including active lvs), but the lvm filter in 6.7 is broken, and setting a filter is harder since you need a different filter for each host.

The next thing is auto_activation_volume_list - this allows selection of the vgs that will be auto activated. Since all the hosts are rhev-h, we know that they should activate only the HostVG vg. So you can do:

1. Edit lvm.conf:

       auto_activation_volume_list = [ "HostVG" ]

2. Persist lvm.conf.

I think that you don't have to regenerate the initramfs for this, since all activation is done after you switch to the root fs, via /etc/rc.d/rc.sysinit. Or, another solution: the lvm guys tell me that setting the kernel command line parameter rd.lvm.vg will have the same effect, so you can set:

    rd.lvm.vg=HostVG

This may be easier to deploy on rhev-h, and I think Ryan already implemented this for another bug recently.

This looks promising. I'll pitch this. If this works it should be in all rhvh systems.

(In reply to Greg Scott from comment #32)
> This looks promising. I'll pitch this. If this works it should be in all rhvh systems.

Similar bugs:
- bug 1342786 - rhev-h 6.6
- bug 1400446 - rhev-h 7

For rhev-h 3.5, this may work for any host. rhev-h 4 is not as closed as rhev-h 3.5, and I'm not sure we can assume the names of all the vgs in some system.

Go figure. The summary of the weekend work - disabling haldaemon made a difference; doing the LVM filter did not help. With haldaemon disabled, boots now take around 15 minutes and activation from maintenance mode happens in less than a minute.
This compares to boots taking around 1 - 2 hours with haldaemon turned on, and activations not working until restarting vdsmd by hand.

Booting a host with the LVM filter in place and haldaemon enabled - that host did not behave any differently than before. Haldaemon is our bottleneck.

- Greg

(In reply to Greg Scott from comment #34)
> Go figure. The summary of the weekend work - disabling haldaemon made a difference; doing the LVM filter did not help.

Please open a bug on RHEL on the use case and issue.

No. I'm not opening yet another bug on RHEL. This is a RHEV thing. The use case is RHEV-H in a RHEV cluster with 2000+ active VMs. The workaround we found from this weekend is disabling haldaemon at RHEV-H startup. There is already a RHEL LVM bug about not handling LVM filters properly. Since the hypervisors here are RHEVH-6.7, maybe they're subject to that bug.

(In reply to Greg Scott from comment #36)
> No. I'm not opening yet another bug on RHEL. This is a RHEV thing. The use case is RHEV-H in a RHEV cluster with 2000+ active VMs. The workaround we found from this weekend is disabling haldaemon at RHEV-H startup.

Greg, we depend on RHEL; hald is not a RHEV thing. The maintainer of this package should check why it gets stuck for 30-60 minutes delaying boot, and probably causing lvm commands to get stuck. In RHEV we will not touch this daemon without the blessing of the maintainer.

> There is already a RHEL LVM bug about not handling LVM filters properly. Since the hypervisors here are RHEVH-6.7, maybe they're subject to that bug.

This bug may depend on the hald issue; we know that lvm is waiting for udev events, and if hald is causing timeouts in udev, lvm commands may be stuck.

(In reply to Greg Scott from comment #34)
> Go figure. The summary of the weekend work - disabling haldaemon made a difference; doing the LVM filter did not help.

What do you mean by lvm filter? I recommended:

- adding auto_activation_volume_list in lvm.conf
- or using rd.lvm.vg=HostVG

Which solution did you try? If you tried the lvm.conf change, did you check that the changes were persisted correctly after reboot (a quick check for this is sketched below)? If not, this is a RHEV-H bug; it should support modifying lvm.conf in some way.

> With haldaemon disabled, boots now take around 15 minutes and activation from maintenance mode happens in less than a minute.

This sounds very good; hopefully disabling hald does not break anything.

> Booting a host with the LVM filter in place and haldaemon enabled - that host did not behave any differently than before. Haldaemon is our bottleneck.

I suspect that the mysterious "lvm filter" was not applied correctly.

From our phone call - I was mixed up on how many of those LVM objects were LVs. My mistake. I think Yaniv has a copy of lsblk output now that details it all out. re: Nir and a copy of lvm.conf - the lvm.conf on all the hosts right now is all virgin. They may have a staging copy still available.
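The persistence question above ("did you check that the changes were persisted correctly after reboot?") can be answered mechanically. A hedged sketch, assuming the stock paths for lvm.conf and the kernel command line and the exact values suggested in this bug:

```python
#!/usr/bin/env python
"""Sketch: check after reboot whether one of the suggested workarounds
survived on the host. Paths and patterns are assumptions based on the
suggestions in this bug (auto_activation_volume_list in /etc/lvm/lvm.conf,
or rd.lvm.vg=HostVG on the kernel command line)."""
import re


def lvm_conf_has_auto_activation(path="/etc/lvm/lvm.conf"):
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0]  # ignore comments
            if re.search(r'auto_activation_volume_list\s*=\s*\[\s*"HostVG"', line):
                return True
    return False


def cmdline_has_rd_lvm_vg(path="/proc/cmdline"):
    with open(path) as f:
        return "rd.lvm.vg=HostVG" in f.read().split()


if __name__ == "__main__":
    print("auto_activation_volume_list set: %s" % lvm_conf_has_auto_activation())
    print("rd.lvm.vg=HostVG on cmdline:     %s" % cmdline_has_rd_lvm_vg())
```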
They tried the modified lvm.conf on host 0001 and top showed haldaemon pegged, as before, and the host would not activate. So they disabled hald, per the instructions above, and rebooted, leaving the modified lvm.conf in place. I watched the shutdown via Webex screen sharing - the shutdown deactivated a bunch of LVM objects with UUID names before finally finishing its shutdown. There were no VMs because the host was still in maintenance mode. So the lvm filter could not have filtered. After disabling haldaemon, it booted in about 15 minutes and activated from maintenance mode in less than 60 seconds.

On host 0002, they left lvm.conf virgin and only disabled hald. This also booted within 15 minutes and activated in less than 60 seconds from maintenance mode.

Perhaps they ran into the bug where lvm doesn't honor filtering? Hard to say, but they had a short maintenance window and 34 production hosts to deal with. So, based on the results we found, they just continued with disabling hald.

(In reply to Greg Scott from comment #48)
> Perhaps they ran into the bug where lvm doesn't honor filtering?

The lvm filter issue should not be related; if auto_activation_volume_list works correctly, there should be no active lvs except the host lvs, and hald should not cause any trouble. Maybe the issue is not regenerating the initramfs - the lvm guys thought this is not needed for this configuration. It is also possible that rhev-h has a twisted boot procedure that does not use the lvm.conf you modified, so lvs are activated regardless of this setting. We will have to test this in our lab with rhev-h 6.7 images.

I did your RHEL bugzilla. See https://bugzilla.redhat.com/show_bug.cgi?id=1435198 - Greg

I just reopened this so we have a mechanism to track that scale testing with RHVH-7.

Guy, can you repeat the test with lvm filter?

(In reply to guy chen from comment #94)
> I have tested the 12K LVS on top of rhvh 4.1 release, with and without filters.

Thanks Guy, great work! Can you attach vdsm logs from these runs? I want to understand where time is spent in vdsm without a filter.

Greg, can you confirm that this bug was tested properly? It seems that RHV 4.1 is performing much better, especially when using an lvm filter.

Guy, please see my request in comment 99; it seems that the needinfo was removed by mistake.

Guy, can you attach the output of:

    journalctl -b
    systemd-analyze time
    systemd-analyze blame
    systemd-analyze plot

If you perform more reboots with or without filter, please attach the output for each boot.
> systemd-analyze plot
systemd-analyze plot > system_analyze_plot.svg
will be better viewed :)
Summary of output uploaded by Guy.

Boot without filter, according to comment 113:

    $ head systemd-analyze_blame_log
    10min 24.711s kdump.service
     1min 44.053s vdsm-network.service
          36.314s lvm2-activation.service
          35.084s lvm2-activation-net.service
          32.021s lvm2-monitor.service
          27.805s lvm2-activation-early.service
           9.092s glusterd.service
           8.888s dev-mapper-rhvh_buri01\x2drhvh\x2d\x2d4.1\x2d\x2d0.20170506.0\x2b1.device
           7.869s network.service
           7.693s systemd-udev-settle.service

    $ cat systemd-analyze_time_log
    Startup finished in 5.988s (kernel) + 17.706s (initrd) + 13min 8.717s (userspace) = 13min 32.412s

This run did not include a vdsm log, so I looked at vdsm logs from a previous boot (see comment 107):

1. Vdsm started:

       2017-05-14 19:17:11,737+0300 INFO (MainThread) [vds] (PID: 6318) I am the actual vdsm 4.19.12-1.el7ev buri01.eng.lab.tlv.redhat.com (3.10.0-514.16.1.el7.x86_64) (vdsm:145)

2. Vdsm starting deactivation of active lvs:

       2017-05-14 19:17:31,484+0300 INFO (hsm/init) [storage.LVM] Deactivating lvs: ...

3. Vdsm ready (this succeeds only when deactivation finishes):

       2017-05-14 19:25:06,730+0300 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities succeeded in 2.59 seconds (__init__:533)

So vdsm needed 7:30 minutes to deactivate all 12,000 lvs. We can probably shorten this by running the deactivation in parallel, but the real solution is an lvm filter, which eliminates those 7:30 minutes from vdsm startup time. So the system is slowed down by having 12k lvs, but this is a huge improvement compared with the reported issue on rhev-h 6.7.

(In reply to guy chen from comment #94)
> I have tested the 12K LVS on top of rhvh 4.1 release, with and without filters.
> ...
> Test without Filter:
> From boot till vdsm was up took 17 minutes and 25 seconds
> ...
> Test with Filter:
> From boot till vdsm was up took 9 minutes and 7 seconds
> ...

According to the vdsm log, 7:30 minutes were spent deactivating the lvs when not using a filter. Looking at systemd-analyze blame (comment 120), lvm spent about 1:30 minutes in activation/monitoring. Using a filter improves host boot time (time until vdsm is ready) by 8:30 minutes, about 90% improvement.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489