Bug 1400446 - RHV-H starts very slowly when too many LUNs are connected to the host (lvm filter?)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.0.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: guy chen
URL:
Whiteboard:
Depends On: 1374545
Blocks: 1450114
 
Reported: 2016-12-01 09:30 UTC by Roman Hodain
Modified: 2021-06-10 11:45 UTC
CC: 23 users

Fixed In Version: vdsm-4.20.9
Doc Type: Bug Fix
Doc Text:
Currently, LVM scans and activates raw volumes during boot. Then it scans and activates guest logical volumes created inside a guest on top of the raw volumes. It also scans and activates guest logical volumes inside LUNs which are not part of a Red Hat Virtualization storage domain. As a result, there may be thousands of active logical volumes on a host, which should not be active. This leads to very slow boot time and may lead to data corruption later if a logical volume active on the host was extended on another host. To avoid this, you can configure an LVM filter using the "vdsm-tool config-lvm-filter" command. The LVM filter prevents scanning and activation of logical volumes not required by the host, which improves boot time.
Clone Of:
Environment:
Last Closed: 2018-05-15 17:49:33 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
Test results on RHEL (1.44 MB, application/x-xz)
2017-05-04 18:17 UTC, Nir Soffer
Test results on RHVH (906.79 KB, application/x-xz)
2017-05-23 21:24 UTC, Nir Soffer
Test results on RHV-H with 600 luns and 4800 lvs (1.55 MB, application/x-xz)
2017-05-29 05:51 UTC, Nir Soffer
Script for creating lvs on all multipath devices (1.57 KB, text/plain)
2017-06-01 17:09 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1342786 0 medium CLOSED [TestOnly] All VGs on passthrough disks are activated during the RHV-H boot. 2021-08-30 13:32:58 UTC
Red Hat Bugzilla 1424853 0 urgent CLOSED RHEV-H host activation fails after reboot on setup with large number of lvs 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 2662261 0 None None None 2017-06-01 19:32:21 UTC
Red Hat Knowledge Base (Solution) 2777831 0 None None None 2017-06-01 19:53:17 UTC
Red Hat Product Errata RHEA-2018:1489 0 None None None 2018-05-15 17:51:04 UTC
oVirt gerrit 85126 0 'None' MERGED lvmfilter: Filter out the master LV on the SPM 2021-01-13 23:48:37 UTC

Internal Links: 1342786 1424853

Description Roman Hodain 2016-12-01 09:30:26 UTC
Description of problem:
The host takes a very long time to boot; it can take about 40 minutes. Sometimes the host ends up in the rescue shell.

Version-Release number of selected component (if applicable):
RHVH-4.0-20161018.0-RHVH-x86_64-dvd1.iso

How reproducible:
100%

Steps to Reproduce:
This issue is caused by a large number of LUNs connected to the environment for direct passthrough to the VMs.

It should be reproducible on the following configuration:
290 physical volumes (LUNs)
170 VGs on top of them
Each LUN has four paths to the storage

Once a setup like this is established, restart the host.

Actual results:
The boot takes ages, and systemd-analyze blame reports the following services consuming the time:

46min 33.479s lvm2-pvscan@132:1728.service
46min 33.361s lvm2-pvscan@132:1696.service
46min 33.300s lvm2-pvscan@134:1632.service
46min 33.172s lvm2-pvscan@134:1648.service
46min 33.082s lvm2-pvscan@134:1600.service
46min 33.008s lvm2-pvscan@133:1552.service


Expected results:
The boot is fast. I suppose the best option is to make the system ignore these devices.

Additional info:

We tried to do the following:

implemented https://gerrit.ovirt.org/#/c/66893/1/static/etc/lvm/lvmlocal.conf from Bug 1374545

added an auto-activation list to lvm.conf:
    auto_activation_volume_list = [ "rhvh" ]

regenerated initramfs
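
For reference, the change looks like this (a sketch; it assumes the host's own VG is named "rhvh", as above):

    # /etc/lvm/lvm.conf
    activation {
        auto_activation_volume_list = [ "rhvh" ]
    }

    # regenerate the initramfs so the setting also applies in early boot
    dracut -f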


This may or may not overlap with Bug 1374545. This bug is to make sure that this configuration is tested from the installation stage through to host activation in the manager.

Comment 2 Tal Nisan 2016-12-01 12:11:57 UTC
Sounds related to bug 1374545 indeed, Nir can you confirm?

Comment 3 Fabian Deutsch 2016-12-01 13:55:53 UTC
I wonder if we can do something for these kinds of bugs (like this one and bug 1374545) by leveraging the lvm.* kernel arguments?

Comment 4 Nir Soffer 2016-12-01 14:22:13 UTC
(In reply to Roman Hodain from comment #0)
> 46min 33.479s lvm2-pvscan@132:1728.service
> 46min 33.361s lvm2-pvscan@132:1696.service
> 46min 33.300s lvm2-pvscan@134:1632.service
> 46min 33.172s lvm2-pvscan@134:1648.service
> 46min 33.082s lvm2-pvscan@134:1600.service
> 46min 33.008s lvm2-pvscan@133:1552.service

I think these are related to lvmetad.

Did you try to disable lvmetad?

It should be possible in the same lvmlocal.conf file.

See bug 1374545 for instructions.
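
A sketch of that configuration (the exact snippet is in bug 1374545; this assumes the stock RHEL 7 lvm2 units):

    # /etc/lvm/lvmlocal.conf
    global {
        use_lvmetad = 0
    }

    # stop the daemon so it is no longer used by running commands
    systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket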
    
> Expected results:
> The boot is fast. I suppose the best option is to make the system ignore
> these devices.
> 
> Additional info:
> 
> We tried to do the following:

Is the boot still slow with these changes?

> implemented
> https://gerrit.ovirt.org/#/c/66893/1/static/etc/lvm/lvmlocal.conf from Bug
> 1374545

This helps by not scanning ovirt lvs, but the system is still scanning all the
luns. I think the only solution is to have an lvm filter (not global_filter)
whitelisting only the devices used by rhvh. Any other device should not be
scanned by lvm on the host.
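
For example, a whitelist of this shape (the device path is illustrative; it must match the devices backing the host's own volume groups):

    devices {
        filter = [ "a|^/dev/sda2$|", "r|.*|" ]
    }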

Vdsm runs lvm commands with a special command-line filter, so lvm commands run
by vdsm will still scan the required luns.

> 
> added an auto-activation list to lvm.conf:
>     auto_activation_volume_list = [ "rhvh" ]
> 
> regenerated initramfs
> 
> 
> This may or may not overlap with Bug 1374545. 

Bug 1374545 is about guest lvs inside ovirt lvs; the issue here is guest lvs
on luns.

We don't manage luns; we don't create them on the storage server or map them
for a specific host, we only discover them when they are added, so we cannot
set up filters for luns with the current system design.

Comment 5 Nir Soffer 2016-12-01 14:25:12 UTC
Zdenek, it seems that the solution you suggested in bug 1374545 is also what
this bug needs: a filter whitelisting the devices needed by the host.

Can you confirm or suggest a better configuration?

Comment 6 Peter Rajnoha 2016-12-01 14:41:31 UTC
(In reply to Roman Hodain from comment #0)
> Actual results:
> The boot takes ages, and systemd-analyze blame reports the following
> services consuming the time:
> 
> 46min 33.479s lvm2-pvscan@132:1728.service
> 46min 33.361s lvm2-pvscan@132:1696.service
> 46min 33.300s lvm2-pvscan@134:1632.service
> 46min 33.172s lvm2-pvscan@134:1648.service
> 46min 33.082s lvm2-pvscan@134:1600.service
> 46min 33.008s lvm2-pvscan@133:1552.service
> 

Can you add -vvvv to /lib/systemd/system/lvm2-pvscan@.service like this:

  ExecStart=/usr/sbin/lvm pvscan -vvvv --cache --activate ay %i

Then reboot, collect the debug logs with lvmdump -u -l -s once systemd is finally up and running (after that long delay), and attach the logs here (the tgz file).
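
One way to apply that without editing the packaged unit file is a systemd drop-in (a sketch):

    # systemctl edit lvm2-pvscan@.service
    [Service]
    ExecStart=
    ExecStart=/usr/sbin/lvm pvscan -vvvv --cache --activate ay %i

The empty ExecStart= clears the original command before setting the new one.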

Comment 7 Zdenek Kabelac 2016-12-01 14:50:04 UTC
I'd probably recommend disabling 'lvmetad' and avoiding autoactivation.

It looks like the system is full of duplicate devices (multipath paths), but it also looks like these devices are scanned 'before' multipath is activated - so it's quite tricky to recognize this state. There would be a way to switch into udevDB mode for recognizing 'multipath' component devices - but I'd not recommend using this setting ATM.

If it's feasible, setting 'white-list' filtering to accept only a very limited set of devices as possible PV devices is the best approach to minimizing delays here.

Comment 13 Yaniv Lavi 2017-02-14 11:58:44 UTC
Is this resolved with BZ #1342786?

Comment 14 cshao 2017-02-15 03:03:07 UTC
(In reply to Yaniv Dary from comment #13)
> Is this resolved with BZ #1342786?

The two bugs are on different branches. The Target Milestone of bug 1342786 is set to 3.6.11, and no 3.6.11 build is available for QE testing, so I think the original bug still exists.

Comment 15 Yaniv Lavi 2017-02-23 11:23:51 UTC
Moving out all non-blockers/exceptions.

Comment 16 Nir Soffer 2017-02-27 15:25:06 UTC
This should be solved by an LVM filter accepting only the devices needed by the
hypervisor and rejecting any other device (e.g. direct luns, luns used for ovirt
storage domains).

Comment 19 Nir Soffer 2017-04-28 11:58:38 UTC
The first step is reproducing this.

- We need a host with an FC connection to FC storage
- Best to have multiple ports on the host, and 4 paths to storage
- Create and map 300 (small) LUNs to this host.
- On each LUN, create a vg and 4 lvs. These can be created using a script (the real
  lvs are created within a guest)

This should be tested on both RHEL and RHEV-H.

When we have a system reproducing this issue we can continue.

Gil, can we schedule this?

Comment 20 Gil Klein 2017-04-30 11:17:54 UTC
(In reply to Nir Soffer from comment #19)
> The first step is reproducing this.
> 
> - We need a host with an FC connection to FC storage
> - Best to have multiple ports on the host, and 4 paths to storage
> - Create and map 300 (small) LUNs to this host.
> - On each LUN, create a vg and 4 lvs. These can be created using a script
> (the real
>   lvs are created within a guest)
> 
> This should be tested on both RHEL and RHEV-H.
> 
> When we have a system reproducing this issue we can continue.
> 
> Gil, can we schedule this?
Raz, will you be able to handle this request?

Comment 21 Raz Tamir 2017-04-30 11:36:11 UTC
Sure,
Elad will take care of Nir's request

Comment 22 Elad 2017-04-30 14:45:47 UTC
Provided offline

Comment 23 Nir Soffer 2017-05-04 18:14:09 UTC
I tested this setup but could not reproduce the issue.

The setup tested:

# grep 'model name' /proc/cpuinfo 
model name	: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
model name	: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
model name	: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
model name	: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)

# rpm -q vdsm
vdsm-4.19.11-1.el7ev.x86_64

Number of LUNs: 300
Number of VGs: 300
Number of LVs: 1200

Each LUN has 4 paths to storage:

3514f0c5a516007fb dm-87 XtremIO ,XtremApp        
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:1:177 sdaif 129:816  active ready running
  |- 5:0:1:177 sdake 132:864  active ready running
  |- 5:0:0:177 sdmo  70:256   active ready running
  `- 0:0:0:177 sdlr  68:400   active ready running


The test included:

1. Wait until host is up in engine
2. reboot
3. wait until host is up again

I measured the time from starting the boot until vdsm was started,
and then the time until vdsm initialization was completed
(getCapabilities succeeded; vdsm should be up shortly after that).
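
A poll along these lines can capture that timestamp (a sketch, assuming the vdsClient CLI shipped with this vdsm version; not necessarily the exact method used here):

    # wait until vdsm answers getVdsCaps, then print the time
    until vdsClient -s 0 getVdsCaps > /dev/null 2>&1; do sleep 1; done; date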

Tested 2 configurations:
1. default lvm.conf
2. lvm.conf with devices/filter:

   devices {
       filter = [ "a|^/dev/sdate2$|", "r|.*|"] 
   }

   This filter prevents lvm from scanning the lvs on the luns.

# Default configuration

Events:
13:57:15    boot started
14:02:15    vdsm started
14:10:03    vdsm init completed

Active lvs:
0 (deactivated by vdsm)

$ systemd-analyze time 
Startup finished in 1.228s (kernel) + 29.110s (initrd) + 4min 50.648s (userspace) = 5min 20.986s

$ head -20 boot-05/systemd-analyze-blame.out 
    1min 32.944s vdsm-network.service
     1min 8.678s network.service
         18.967s kdump.service
         15.337s lvm2-activation-early.service
         13.840s systemd-udev-settle.service
         11.763s dev-mapper-VolGroup01\x2droot.device
          7.281s dmraid-activation.service
          6.360s lvm2-activation-net.service
          5.778s lvm2-activation.service
          5.171s lvm2-monitor.service
          4.731s libvirtd.service
          3.402s vdsmd.service
          1.886s postfix.service
          1.720s tuned.service
          1.261s systemd-udevd.service
          1.037s vdsm-network-init.service
           778ms ovirt-imageio-daemon.service
           677ms systemd-tmpfiles-setup-dev.service
           655ms proc-fs-nfsd.mount
           648ms rhel-readonly.service

$ grep lvm2-pvscan boot-05/systemd-analyze-blame.out | head -10
           543ms lvm2-pvscan@253:105.service
           540ms lvm2-pvscan@253:109.service
           539ms lvm2-pvscan@253:108.service
           539ms lvm2-pvscan@253:104.service
           538ms lvm2-pvscan@253:106.service
           537ms lvm2-pvscan@253:98.service
           535ms lvm2-pvscan@253:103.service
           524ms lvm2-pvscan@253:102.service
           523ms lvm2-pvscan@253:100.service
           521ms lvm2-pvscan@253:101.service

$ grep multipathd  boot-05/journal.log | wc -l
8307

$ grep 'add path (uevent)' boot-05/journal.log | wc -l
2301

$ grep "devmap not registered, can't remove" boot-05/journal.log | wc -l
1200

$ grep 'remove map (uevent)' boot-05/journal.log | wc -l
2400


# With lvm filter

Events:
14:30:53    boot started
14:36:17    vdsm started
14:36:38    vdsm init completed

Active lvs:
0 (not activated by lvm because of the filter)

$ systemd-analyze time 
Startup finished in 1.227s (kernel) + 29.645s (initrd) + 4min 57.321s (userspace) = 5min 28.194s

$ head -20 boot-06-filter/systemd-analyze-blame.out 
    1min 25.276s vdsm-network.service
     1min 7.960s network.service
         13.340s systemd-udev-settle.service
         11.186s dev-mapper-VolGroup01\x2droot.device
         11.178s kdump.service
          7.402s dmraid-activation.service
          3.043s vdsmd.service
          2.640s libvirtd.service
          1.990s tuned.service
          1.907s postfix.service
          1.551s systemd-udevd.service
           982ms vdsm-network-init.service
           744ms rhel-readonly.service
           700ms ovirt-imageio-daemon.service
           662ms proc-fs-nfsd.mount
           600ms systemd-tmpfiles-setup-dev.service
           546ms systemd-modules-load.service
           486ms lvm2-activation-net.service
           448ms rc-local.service
           430ms lvm2-activation-early.service

$ grep lvm2-pvscan boot-06-filter/systemd-analyze-blame.out | head -10
           297ms lvm2-pvscan@253:69.service
           297ms lvm2-pvscan@253:67.service
           294ms lvm2-pvscan@253:68.service
           294ms lvm2-pvscan@253:70.service
           292ms lvm2-pvscan@253:72.service
           291ms lvm2-pvscan@253:75.service
           291ms lvm2-pvscan@253:73.service
           291ms lvm2-pvscan@253:74.service
           289ms lvm2-pvscan@253:77.service
           289ms lvm2-pvscan@253:76.service


$ grep multipathd  boot-06-filter/journal.log | wc -l
4707

$ grep 'add path (uevent)' boot-06-filter/journal.log | wc -l
2301

$ grep "devmap not registered, can't remove" boot-06-filter/journal.log | wc -l
0

$ grep 'remove map (uevent)' boot-06-filter/journal.log | wc -l
0

Summary:

1. Cannot reproduce, boot took about 13 minutes, nothing in the
   range of 40 minutes.

2. lvm filter does not change system boot time significantly, but
   has a huge effect on vdsm initialization time:
   without filter: 9m48s
   with filter:      16s (36X faster)

3. multipath has some issues without lvm filter: a lot of
   "devmap not registered, can't remove" and 'remove map (uevent)'
   events that we don't see when using the lvm filter.


Maybe the problem is related to RHEV-H. I'm waiting for a RHEV-H setup
with the same LUNs for testing.

I also want to test a setup with more LUNs (600?) to see how the
system scales up.

Roman, do you have an idea what is missing in my setup to reproduce
the issue?

Comment 24 Nir Soffer 2017-05-04 18:17:29 UTC
Created attachment 1276414 [details]
Test results on RHEL

Comment 25 Nir Soffer 2017-05-23 21:24:31 UTC
Created attachment 1281763 [details]
Test results on RHVH

Comment 26 Nir Soffer 2017-05-23 21:33:12 UTC
Test results on RHVH - same setup as RHEL tests (see comment 23)

One difference compared to the RHEL tests: the LUNs are 1G, while in the RHEL
test we used 5G LUNs. Same number of lvs per LUN (just smaller).


# Without filter (boot 3):

events:
17:24:50    boot started
17:27:59    vdsm started
17:34:13    vdsm ready

$ cat systemd-analyze-time.out 
Startup finished in 3.733s (kernel) + 17.437s (initrd) + 16min 13.033s (userspace) = 16min 34.204s

$ head systemd-analyze-blame.out 
   13min 59.616s kdump.service
         51.696s iscsi.service
         32.182s vdsm-network.service
         30.503s NetworkManager-wait-online.service
         28.862s lvm2-activation-net.service
         10.017s network.service
          9.343s glusterd.service
          3.909s vdsmd.service
          3.664s postfix.service
          3.587s systemd-udev-settle.service

$ grep lvm2-pvscan systemd-analyze-blame.out | head
           167ms lvm2-pvscan@253:36.service
           164ms lvm2-pvscan@253:22.service
           143ms lvm2-pvscan@253:171.service
           141ms lvm2-pvscan@253:90.service
           141ms lvm2-pvscan@253:176.service
           135ms lvm2-pvscan@253:43.service
           133ms lvm2-pvscan@253:159.service
           131ms lvm2-pvscan@253:54.service
           130ms lvm2-pvscan@253:161.service
           129ms lvm2-pvscan@253:141.service


# With filter

20:31:46    boot started
20:34:20    vdsm started
20:34:40    vdsm ready

$ cat systemd-analyze-time.out 
Startup finished in 3.735s (kernel) + 17.336s (initrd) + 7min 56.826s (userspace) = 8min 17.898s

$ head systemd-analyze-blame.out 
    6min 13.408s kdump.service
         51.821s iscsi.service
         30.389s NetworkManager-wait-online.service
         26.926s vdsm-network.service
          9.517s network.service
          8.955s glusterd.service
          3.927s vdsmd.service
          3.600s systemd-udev-settle.service
          3.479s postfix.service
          2.707s dev-mapper-rhvh_purple\x2d\x2dvds2\x2drhvh\x2d\x2d4.1\x2d\x2d0.20170506.0\x2b1.device

$ grep lvm2-pvscan systemd-analyze-blame.out | head
           166ms lvm2-pvscan@253:113.service
           160ms lvm2-pvscan@253:111.service
           151ms lvm2-pvscan@253:162.service
           146ms lvm2-pvscan@253:182.service
           146ms lvm2-pvscan@253:112.service
           145ms lvm2-pvscan@253:67.service
           143ms lvm2-pvscan@253:72.service
           141ms lvm2-pvscan@253:23.service
           138ms lvm2-pvscan@253:22.service
           132ms lvm2-pvscan@253:31.service


Similar results to RHEL, slightly better. So we cannot reproduce the
issue reported in this bug.

Adding an lvm filter shortens vdsm ready time from 6:14 to 0:20 (19X faster),
and overall boot time from 9:37 to 2:56 (3X faster).

Comment 27 Nir Soffer 2017-05-29 05:51:36 UTC
Created attachment 1283163 [details]
Test results on RHV-H with 600 luns and 4800 lvs

Comment 28 Nir Soffer 2017-05-29 06:12:27 UTC
To test how the system scales, I performed another test on a RHV-H setup. This time
I had 600 LUNs, and to make the no-filter case more interesting, 8 LVs per LUN,
totalling 4800 active LVs when booting without a filter.

# Without filter

$ cat 01-no-filter/events
19:53:38	boot started
19:59:27	vdsm started
20:31:50	vdsm ready

$ cat 01-no-filter/systemd-analyze-time.out 
Startup finished in 3.729s (kernel) + 17.430s (initrd) + 48min 7.574s (userspace) = 48min 28.734s

$ head 01-no-filter/systemd-analyze-blame.out 
   44min 17.550s kdump.service
    1min 43.876s iscsi.service
    1min 30.193s vdsm-network.service
    1min 15.326s lvm2-activation-net.service
         30.396s NetworkManager-wait-online.service
          8.976s network.service
          8.428s glusterd.service
          4.585s vdsmd.service
          3.433s postfix.service
          3.054s dev-mapper-rhvh_purple\x2d\x2dvds2\x2drhvh\x2d\x2d4.1\x2d\x2d0.20170506.0\x2b1.device

$ grep lvm2-pvscan 01-no-filter/systemd-analyze-blame.out | head
           232ms lvm2-pvscan@253:12.service
           167ms lvm2-pvscan@253:147.service
           155ms lvm2-pvscan@253:236.service
           154ms lvm2-pvscan@253:310.service
           153ms lvm2-pvscan@253:309.service
           149ms lvm2-pvscan@253:249.service
           145ms lvm2-pvscan@253:318.service
           141ms lvm2-pvscan@253:134.service
           140ms lvm2-pvscan@253:265.service
           139ms lvm2-pvscan@253:17.service


# With filter

$ cat 03-with-filter/events
21:22:43	boot started
21:26:32	vdsm started
21:27:19	vdsm ready

$ cat 03-with-filter/systemd-analyze-time.out 
Startup finished in 3.734s (kernel) + 17.553s (initrd) + 14min 28.278s (userspace) = 14min 49.566s

$ head 03-with-filter/systemd-analyze-blame.out 
   11min 53.197s kdump.service
    1min 43.900s iscsi.service
         49.409s vdsm-network.service
         30.412s NetworkManager-wait-online.service
          8.990s network.service
          8.310s glusterd.service
          4.068s vdsmd.service
          3.587s systemd-udev-settle.service
          3.567s dev-mapper-rhvh_purple\x2d\x2dvds2\x2drhvh\x2d\x2d4.1\x2d\x2d0.20170506.0\x2b1.device
          3.476s postfix.service

$ grep lvm2-pvscan 03-with-filter/systemd-analyze-blame.out | head
           154ms lvm2-pvscan@253:271.service
           151ms lvm2-pvscan@253:103.service
           149ms lvm2-pvscan@253:19.service
           148ms lvm2-pvscan@253:197.service
           148ms lvm2-pvscan@253:229.service
           145ms lvm2-pvscan@253:76.service
           144ms lvm2-pvscan@253:192.service
           141ms lvm2-pvscan@253:63.service
           141ms lvm2-pvscan@253:129.service
           140ms lvm2-pvscan@253:207.service


We see the same behaviour as in previous tests, but this time system boot time shows
a big improvement with the lvm filter (3:52 with filter, 5:49 without filter).

Vdsm startup without the lvm filter is 30 minutes, with the lvm filter 47 seconds
(39X faster).

Note that vdsm startup with 600 LUNs/4800 LVs is about 2X slower compared
with the 300 LUNs/1200 LVs setup (see comment 26). I did not check why, but I assume
this is the time needed to read the lvm metadata into the vdsm lvm cache.

Comment 29 Yaniv Kaul 2017-05-29 06:57:25 UTC
Nir, do we need a bug on kdump for the slow boot? (Note that in a real-life case we might even fence the host because it starts so slowly!)

Why is iscsi so slow?

Comment 30 Nir Soffer 2017-05-29 18:29:43 UTC
Ryan, can you explain why kdump.service is taking so much time during boot?
(see comment 26 and comment 28). Maybe we have a misconfiguration issue?
I think we need to investigate this with the platform folks.

Comment 31 Nir Soffer 2017-05-29 18:40:13 UTC
Ryan, iscsi.service is at the top of systemd-analyze blame on rhev-h, but not in the
top 20 on rhel 7.3 (see comment 23). Do you have any idea why?

Comment 32 Nir Soffer 2017-05-29 18:43:06 UTC
Roman, do you have an idea what is missing in my setup to reproduce
the issue? see comment 23, comment 26 and comment 28.

Comment 33 Ryan Barry 2017-05-29 22:51:24 UTC
No idea on either, Nir. We use the same configuration files as RHEL, and there should be no difference here. Can you provide a test environment?

Comment 34 Nir Soffer 2017-05-29 22:59:25 UTC
Raz, can you provide a test environment to Ryan to investigate the kdump and iscsi
issues?

Comment 35 Raz Tamir 2017-06-01 08:57:05 UTC
What are the requirements?

Comment 36 Ryan Barry 2017-06-01 17:04:27 UTC
Ideally, the same setup as Nir used -- 600 LUNs/4800 LVs.

Unfortunately, I don't have an environment capable of creating that kind of load right now.

Comment 37 Nir Soffer 2017-06-01 17:07:38 UTC
(In reply to Ryan Barry from comment #36)
> Ideally, the same setup as Nir used -- 600 LUNs/4800 LVs.
> 
> Unfortunately, I don't have an environment capable of creating that kind of
> load right now.

The 300 FC LUNs setup should be good enough; we see exactly the same behavior
as with RHEL.

Comment 38 Nir Soffer 2017-06-01 17:09:08 UTC
Created attachment 1284218 [details]
Script for creating lvs on all multipath devices

This is the script used for creating the LVs on all the 300/600 LUNs.
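
The attached file is the script actually used; in outline it does something like this (a sketch with illustrative names and sizes):

    #!/bin/bash
    # Create a vg and a few lvs on every multipath device, simulating
    # guest lvs created inside direct LUNs.
    for dev in /dev/mapper/3*; do
        name=$(basename "$dev")
        pvcreate -y "$dev"
        vgcreate "vg-$name" "$dev"
        for i in 1 2 3 4; do
            lvcreate --name "lv$i" --size 128m "vg-$name"
        done
    done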

Comment 39 Nir Soffer 2017-06-01 17:19:50 UTC
Adding back needinfo removed in comment 33. Roman, see comment 32.

Comment 40 Marina Kalinin 2017-06-01 20:03:44 UTC
Nir / Allon,
Please review the two solutions we have:
1) https://access.redhat.com/solutions/2777831  
The issue it defines is a better match for this bug.
2) https://access.redhat.com/solutions/2662261
The issue does not exactly match this bug's title; however, the solution is closer to the one we discuss here.

Let me know which approach to take and I can modify the solutions.

Comment 42 Roman Hodain 2017-06-02 06:33:39 UTC
(In reply to Nir Soffer from comment #34)
> Raz, can you provide a test environment to Ryan to investigate the kdump and
> iscsi
> issues?

Hi,

the test seems to be valid. The only difference is that the environment experiencing this issue is in production, so maybe the storage is under some load from other hypervisors.

I will check whether this only happens while other hypervisors are running, or all the time.

Comment 44 Raz Tamir 2017-06-04 12:52:01 UTC
(In reply to Roman Hodain from comment #42)
> (In reply to Nir Soffer from comment #34)
> > Raz, can you provide a test environment to Ryan to investigate the kdump and
> > iscsi
> > issues?
> 
> Hi,
> 
> the test seems to be valid. The only difference is that the environment
> experiencing this issue is in production, so maybe the storage is under some
> load from other hypervisors.
> 
> I will check whether this only happens while other hypervisors are running,
> or all the time.

Do you still need a testing hypervisor?

Comment 55 Nir Soffer 2017-12-06 18:08:12 UTC
As we can see in comment 28, there is a huge improvement from using a proper lvm
filter. Most of the time is spent in vdsm, trying to deactivate logical volumes
activated during boot. With a proper lvm filter, vdsm does not have to deactivate
anything.

We introduced a new vdsm-tool command, "config-lvm-filter", automating the lvm
filter configuration. If you use block storage, you should configure the lvm
filter properly on all hosts. An example run is shown below.

See https://ovirt.org/blog/2017/12/lvm-configuration-the-easy-way/
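
For example, a run looks roughly like this (abridged, illustrative output; the devices and the suggested filter depend on the host):

    # vdsm-tool config-lvm-filter
    Analyzing host...
    Found these mounted logical volumes on this host:

      logical volume:  /dev/mapper/vg0-lv_root
      mountpoint:      /
      devices:         /dev/sda2

    This is the recommended LVM filter for this host:

      filter = [ "a|^/dev/sda2$|", "r|.*|" ]

    Configure LVM filter? [yes,NO]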

Comment 56 Sandro Bonazzola 2017-12-12 15:53:44 UTC
Nir, this is targeted to 4.2.1 but modified in 4.2.0.
Can you please check / fix target milestone?

Comment 57 Nir Soffer 2017-12-12 15:57:47 UTC
(In reply to Sandro Bonazzola from comment #56)
> Nir, this is targeted to 4.2.1 but modified in 4.2.0.
> Can you please check / fix target milestone?

Same as https://bugzilla.redhat.com/show_bug.cgi?id=1130527#c26

Comment 58 RHV bug bot 2018-01-05 16:58:27 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 59 RHV bug bot 2018-01-12 14:39:50 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 60 guy chen 2018-03-06 16:03:55 UTC
Verified: config-lvm-filter puts the correct filter in place, and with the correct filter the durations are improved as specified in comment 28.

Comment 65 errata-xmlrpc 2018-05-15 17:49:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 66 Franta Kust 2019-05-16 13:05:33 UTC
BZ<2>Jira Resync

