Description of problem:

Using --select in Vdsm's lvm commands on lvm2-2.02.186-7.el7_8.1.x86_64 induces the transient VG metadata errors seen in bz1837199:

2020-05-20 12:48:07,553+1000 DEBUG (tasks/4) [storage.Misc.excCmd] FAILED: <err> = ' Metadata on /dev/mapper/3600a09803831357370244d6d7662784f at 30543872 has wrong VG name "triped"\nstripe_count = 1\n\nstripes = [\n"pv0", 60545\n]\n}\nsegment142 {\nstart_extent = 1736\nextent_count = 16\n\ntype = "striped"\nstri" expected 23b7f41e-9e5f-4213-940f-7beba85d550a.\n Metadata on /dev/mapper/3600a09803831357370244d6d7662784f at 30543872 has wrong VG name "triped"\nstripe_count = 1\n\nstripes = [\n"pv0", 60545\n]\n}\nsegment142 {\nstart_extent = 1736\nextent_count = 16\n\ntype = "striped"\nstri" expected 23b7f41e-9e5f-4213-940f-7beba85d550a.\n Not repairing metadata for VG 23b7f41e-9e5f-4213-940f-7beba85d550a.\n Recovery of volume group "23b7f41e-9e5f-4213-940f-7beba85d550a" failed.\n Cannot process volume group 23b7f41e-9e5f-4213-940f-7beba85d550a\n'; <rc> = 5 (commands:219)

Reproducer results on el7.8 [taken from https://bugzilla.redhat.com/show_bug.cgi?id=1837199#c53]:

# time ../reload.py run 2>run.log
2020-05-31 21:51:17,783 INFO (reload/pv) Stats: reloads=5263 errors=200 error_rate=3.80% avg_time=0.851 med_time=0.824 min_time=0.140 max_time=2.422
2020-05-31 21:51:17,902 INFO (reload/vg) Stats: reloads=4196 errors=172 error_rate=4.10% avg_time=1.068 med_time=1.037 min_time=0.135 max_time=3.533
2020-05-31 21:51:17,780 INFO (reload/lv) Stats: reloads=4789 errors=198 error_rate=4.13% avg_time=0.935 med_time=0.942 min_time=0.136 max_time=2.633

real    74m41.284s
user    72m12.264s
sys     27m3.581s

# time ../reload.py run --no-udev
...
2020-05-31 20:09:15,257 INFO (reload/pv) Stats: reloads=5192 errors=192 error_rate=3.70% avg_time=0.862 med_time=0.829 min_time=0.132 max_time=2.407
2020-05-31 20:09:15,461 INFO (reload/vg) Stats: reloads=4178 errors=157 error_rate=3.76% avg_time=1.071 med_time=1.020 min_time=0.132 max_time=5.709
2020-05-31 20:09:15,262 INFO (reload/lv) Stats: reloads=4736 errors=177 error_rate=3.74% avg_time=0.945 med_time=0.953 min_time=0.133 max_time=2.393

real    74m37.137s
user    73m5.567s
sys     25m14.252s

# time ../reload.py run --no-udev --no-select
2020-05-31 23:08:25,646 INFO (reload/pv) Stats: reloads=6311 errors=38 error_rate=0.60% avg_time=0.699 med_time=0.648 min_time=0.130 max_time=2.107
2020-05-31 23:08:25,490 INFO (reload/vg) Stats: reloads=7163 errors=26 error_rate=0.36% avg_time=0.616 med_time=0.619 min_time=0.130 max_time=2.366
2020-05-31 23:08:25,522 INFO (reload/lv) Stats: reloads=7580 errors=19 error_rate=0.25% avg_time=0.582 med_time=0.596 min_time=0.129 max_time=1.184

real    73m33.318s
user    52m53.458s
sys     21m37.100s

We can see that disabling obtain_device_list_from_udev gives only a very small improvement: error rate 3.7% vs 4.1%.

Replacing the use of --select with explicit pv or vg names is a huge improvement. Reloads are much faster and more consistent, we do almost 2x more reloads (21,054 vs 14,106), and the error rate is 10 times lower (0.39% vs 3.7%).

The root cause is handled in LVM bug bz1842600.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
Run the reproducer from https://gerrit.ovirt.org/c/109342/ on an el7.8 machine with 8+ CPUs.

Actual results:

Expected results:

Additional info:
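For illustration, a minimal sketch (not Vdsm's actual code) of the two vgs invocation styles being compared. The VG name, the report fields, and the subprocess wrapper are examples only; --select, --noheadings and -o are standard lvm2 options.

# Minimal sketch, not Vdsm's actual code: a "vgs" reload filtered with
# --select (old behavior) vs. an explicit VG name (new behavior).
import subprocess

VG_NAME = "23b7f41e-9e5f-4213-940f-7beba85d550a"  # example VG name from the log above
FIELDS = "vg_name,vg_size,vg_free"                # example report fields

# Old style: filter with --select. Per this bug, this makes lvm process the
# metadata of every VG on the host, which is what makes the transient
# metadata errors far more likely and slows the reloads down.
select_cmd = [
    "vgs", "--noheadings", "-o", FIELDS,
    "--select", "vg_name = %s" % VG_NAME,
]

# New style: pass the VG name explicitly so lvm only processes that VG.
explicit_cmd = ["vgs", "--noheadings", "-o", FIELDS, VG_NAME]

for cmd in (select_cmd, explicit_cmd):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(" ".join(cmd))
    print("rc=%d out=%r err=%r" % (result.returncode, result.stdout, result.stderr))

Passing the name positionally limits how much metadata lvm has to read per reload, which is consistent with the roughly 2x higher reload throughput and 10x lower error rate measured above.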
Tal, this is the ticket Nir wanted specifically for the 4.3 backports of the lvm transient errors on reloads (whose root cause is pending a fix from lvm in el7.9). Please advise what we need for 4.3. In 4.4 the corresponding patches are already merged and we do not wait for the lvm fix, so we don't need a special ticket there like we do here for 4.3.
is it required with lvm
bah, this already exists. *** This bug has been marked as a duplicate of bug 1849595 ***
Michal, this is not a duplicate of bug 1849595. We have two issues:

- LVM bug (bug 1842600) - the fix is to require an lvm version with the fix.
- Using --select in lvm commands, making the LVM bug 10 times more likely and causing performance regressions. This bug is about that fix.

This bug does not depend on the lvm fix; it is needed regardless of the fix.
Thanks for the clarification, Nir. Since all attached patches here are merged, is this MODIFIED?
Moved to MODIFIED. Regarding verification: if this ticket is tested without the lvm2 fix (pending response in https://bugzilla.redhat.com/1842600#c23), I suggest a no-regression test. Otherwise verification could wait for the resolution of https://bugzilla.redhat.com/1849595
(In reply to Michal Skrivanek from comment #6)
> Thanks for the clarification, Nir. Since all attached patches here are
> merged, is this MODIFIED?

Right, not sure why it did not move automatically. Maybe the CI hooks are broken in 4.3?
Tested flow:
1. Create a VM with 13 disks (each disk on a different SD)
2. Create a VM snapshot 10 times
3. Delete the VM

The flow was run 20 times with 2 concurrent users.

The test was run twice:
1. RHV 4.3.9-7 - reproduce the problem (vdsm-4.30.42 & lvm2-2.02.186-7)
2. RHV 4.3.11-7 - verify the fix (vdsm-4.30.50 & lvm2-2.02.187-6)

During the 1st test we got many errors:

2020-08-02 00:47:06,400+0000 ERROR (monitor/d7e3b45) [storage.LVM] Reloading VGs failed vgs=['d7e3b455-ae7b-4bf9-a997-f25adc5f19b8'] rc=5 out=[' cqe0ji-XYxI-bejZ-AG2d-Vv1J-izNU-7Lvyfb|d7e3b455-ae7b-4bf9-a997-f25adc5f19b8|wz--n-|536602476544|516067164160|134217728|3998|3845|MDT_ALIGNMENT=1048576,MDT_BLOCK_SIZE=512,MDT_CLASS=Data,MDT_DESCRIPTION=MAXLUNS_3600a098038304437415d4b6a59685973,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_POOL_UUID=3fd7f5a9-738f-44c3-8e3b-760a43954106,MDT_PV0=pv:3600a098038304437415d4b6a59685973&44&uuid:3Omcg8-31Sd-SNUd-rtmP-EIwf-aCxB-Ne6r34&44&pestart:0&44&pecount:3998&44&mapoffset:0,MDT_ROLE=Regular,MDT_SDUUID=d7e3b455-ae7b-4bf9-a997-f25adc5f19b8,MDT_TYPE=FCP,MDT_VERSION=5,MDT_VGUUID=cqe0ji-XYxI-bejZ-AG2d-Vv1J-izNU-7Lvyfb,MDT__SHA_CKSUM=9d455c6d1a9668abed2896fe76c126358f3461ad,RHAT_storage_domain|134217728|67097088|23|1|/dev/mapper/3600a098038304437415d4b6a59685973'] err=[' Metadata on /dev/mapper/3600a098038304437415d4b6a59685976 at 536848107520 has wrong VG name "" expected 44cfdf47-9464-447d-aab8-6f58f8b218d0.', ' Metadata on /dev/mapper/3600a098038304437415d4b6a59685976 at 536848107520 has wrong VG name "" expected 44cfdf47-9464-447d-aab8-6f58f8b218d0.', ' Not repairing metadata for VG 44cfdf47-9464-447d-aab8-6f58f8b218d0.', ' Recovery of volume group "44cfdf47-9464-447d-aab8-6f58f8b218d0" failed.', ' Cannot process volume group 44cfdf47-9464-447d-aab8-6f58f8b218d0'] (lvm:576)

During the 2nd test - no errors in the vdsm.log files.

VDSM log files (both versions): https://drive.google.com/drive/folders/1klyclDkmUloD21pfle-giDpG7n9GUFZq
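In case anyone wants to repeat the comparison, a rough helper (not part of the attached test suite; the file names below are hypothetical) for counting these reload errors in the collected vdsm.log files:

# count_reload_errors.py - rough helper, not part of the test suite.
# Counts "Reloading VGs failed" errors in the given vdsm.log files; adjust
# the pattern if PV/LV reload failures should be counted as well.
import re
import sys

PATTERN = re.compile(r"Reloading VGs failed")

def count_reload_errors(path):
    with open(path, errors="replace") as log:
        return sum(1 for line in log if PATTERN.search(line))

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print("%s: %d reload errors" % (path, count_reload_errors(path)))

Running it against the logs of both runs, e.g. python3 count_reload_errors.py vdsm.log.4.3.9 vdsm.log.4.3.11 (example file names), should show the errors only in the 4.3.9 logs.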
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Virtualization RHEL Host (ovirt-host) 4.3.11), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4113