Bug 515232 - LVM commands take a VERY long time (>40 min) to finish
Summary: LVM commands take a VERY long time (>40 min) to finish
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Assignee: Milan Broz
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 523441
 
Reported: 2009-08-03 13:03 UTC by Ayal Baron
Modified: 2018-10-27 11:37 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 09:01:19 UTC
Target Upstream Version:
Embargoed:


Attachments
lvm dump (3.82 MB, application/octet-stream) - 2009-08-03 14:06 UTC, Ayal Baron
lvchange --refresh output (705.27 KB, application/octet-stream) - 2009-08-03 14:38 UTC, Ayal Baron
pvs hangs vdsm log (765.99 KB, application/octet-stream) - 2009-08-10 15:18 UTC, Ayal Baron
lvm dump - cyan setup (199.29 KB, application/octet-stream) - 2009-08-10 15:20 UTC, Ayal Baron


Links:
Red Hat Product Errata RHBA-2010:0298 (SHIPPED_LIVE): lvm2 bug fix and enhancement update - last updated 2010-03-29 15:16:34 UTC

Description Ayal Baron 2009-08-03 13:03:56 UTC
Description of problem:
On the Westford setup (>40 hosts with shared storage, >1100 LVs)
we encountered a problem where lvm commands slowed down to several minutes each; one lvchange took 46 minutes, and subsequent lvchange commands did not return at all (we killed them after over an hour).
In the interim, other lvm commands did return.
I am not sure there is any connection, but the lvchange command returned at the same time as the following line appeared in /var/log/messages:
Aug  2 12:20:44 north-05 multipathd: dm-174: add map (uevent)

Hostname: north-05.lab.bos

Comment 2 Milan Broz 2009-08-03 13:32:00 UTC
Please could you provide:

- lvm2 rpm versions
- the exact lvm commands you run
- the output of that command run with -vvvv
- if possible, an lvmdump from that system

Thanks.

Comment 4 Ayal Baron 2009-08-03 14:06:52 UTC
Created attachment 356030 [details]
lvm dump

Comment 5 Ayal Baron 2009-08-03 14:33:52 UTC
rpm -qa lvm2: lvm2-2.02.46-8.el5
we ran many lvm commands.  The specific lvchange command that took 46min:
(11:34:03,112) sudo /usr/sbin/lvchange --refresh c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c (Finished successfully at: 12:20:44,241)
(11:34:11,052) sudo /usr/sbin/lvs -o attr --noheadings c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700 (Finished successfully at: 11:35:17,107)

The following lvchange commands hung (we killed them at 12:44:19, so perhaps they would have finished eventually):

Thread-5593::DEBUG::2009-08-02 11:34:41,639::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'n', 'c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700'] (cwd None)

Thread-5583::DEBUG::2009-08-02 11:35:17,108::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'y', 'c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700'] (cwd None)

492ac3aa-a1ed-4bff-a8cb-c448b3762e34::DEBUG::2009-08-02 11:51:56,622::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'y', 'c270a758-40cb-48a6-934d-c30316740a06/master'] (cwd None)

0a8e403a-d449-43eb-af75-fea2daa19dcf::DEBUG::2009-08-02 11:54:15,361::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)

cf196124-a31e-4b6e-aeb6-26b0db78466e::DEBUG::2009-08-02 11:54:16,348::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)

6fa5bd3b-7c40-41a4-a377-dbab732969ca::DEBUG::2009-08-02 12:37:32,165::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)

Concurrently we ran many more lvm commands (lvs, vgs, pvs, etc.); if a specific list is needed, tell me.

I attached the lvmdump.
Currently the system is functioning (more or less), but I will attach the lvchange output with full verbosity shortly anyway.

Comment 6 Ayal Baron 2009-08-03 14:38:08 UTC
Created attachment 356037 [details]
lvchange --refresh output

Comment 7 Milan Broz 2009-08-03 14:59:54 UTC
Thanks.

I'll check which commands have performance problems here; from your log I see that lvs is OK, while lvchange takes a long time.

FYI: running a "fake" copy of the PVs with the provided metadata and activating all LVs shows these times (this should probably be the worst case); manipulating one volume should be far quicker.

# time vgchange -a y
  1109 logical volume(s) in volume group "c270a758-40cb-48a6-934d-c30316740a06" now active
  361 logical volume(s) in volume group "8bd8e09d-10fb-4289-bd67-bb40ca3efc93" now active

real    6m12.058s
user    1m16.064s
sys     4m51.422s

Comment 8 Milan Broz 2009-08-03 15:18:39 UTC
What's the purpose of the lvchange --refresh command in your environment?

I ask because the code internally runs suspend->resume for that LV (to reload possible changes).

If the reason is a volume size change, is it always an increase in size?

(Suspend can take a long time if there is IO in flight - it needs to flush all IO running on that LV.)
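
For illustration, while a refresh appears stuck you can check whether the LV's device-mapper device is sitting in the suspended state (i.e. still flushing IO); something like:

# show which dm devices are currently suspended (the Name line precedes State)
dmsetup info | grep -B 1 SUSPENDED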

Comment 9 Ayal Baron 2009-08-03 17:00:05 UTC
(In reply to comment #8)
> What's the purpose of lvchange --refresh command in your environment?
> 
We are running a multi-node setup where only one node is allowed to run lvextend commands.  The "lvchange --refresh" command is run on a different node which needs to use this LV after the extend.

> Because code internaly run suspend->resume for that LV (to reload possible
> changes).
> 
> If the reason is volume size change, it is always increasing the size?
> 
> (Suspend can run long time, if there are running IOs - it need to flush all IOs
> tunning on that LV.)

Comment 10 Milan Broz 2009-08-03 18:36:05 UTC
> We are running a multi node setup where only one node is allowed to run
> lvextend commands.  the "lvchange --refresh" command is run on a different node
> which needs to use this lv after the extend.

OK. So when the lvchange --refresh <LV> command was stuck, the LV was active.
Was there any IO activity on that LV? (IOW, is it possible that this operation was waiting for some ongoing IO?)

If you can reproduce it (with -vvvv), can you point me to the line/operation/whatever in the log which appears last before it starts waiting?

If there is an lvm command log in syslog (from the lvmdump it seems that lvm commands are logged), can you provide the log from the moment when it stops responding?

(Or just tell me how I can grab that log. Also, you mentioned multipathd events; maybe that is related - it should be in the log too.)
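
(For example, a verbose run could be captured like this - the VG/LV name is a placeholder:)

lvchange --refresh -vvvv VG/LV 2> /tmp/lvchange-refresh-vvvv.log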

Comment 11 Ayal Baron 2009-08-04 11:06:00 UTC
(In reply to comment #10)
> > We are running a multi node setup where only one node is allowed to run
> > lvextend commands.  the "lvchange --refresh" command is run on a different node
> > which needs to use this lv after the extend.
> 
> ok. So when the lvchange --refresh <LV> command was stuck, LV was active.
> Was there any IO activity on that LV? (IOW is possible that this operation
> waits for some ongoing IO?)
Yes, it was active and could very well have been waiting for IO.  There is a running VM on the LV, and the whole point of the mechanism is to not freeze the machine during lvextend.  This process only happens after lvextend (we never reduce the LV size); is there any way we can hint this to lvchange?

> 
> If you can reproduce it (with -vvvv), can you point me to
> line/operation/whatever in log which appears as last when before it waits?
> 
> If there is lvm command log in syslog (in lvmdump it seems that lvm commands
> are logged), can you provide log from the moment when it stops responding?
Reproducing this will be extremely difficult.
> 
> (Or just tell me how I can grab that log. Also you mentioned multipathd events,
> maybe it is related - it should be in log too.)  
We activated the log after this happened, so no such luck.

We do have our log though which shows all the commands vdsm ran on the machine.
I can attach it but it is in debug mode and contains tons of irrelevant data...

Comment 12 Milan Broz 2009-08-04 11:40:41 UTC
> We do have our log though which shows all the commands vdsm ran on the machine.
> I can attach it but it is in debug mode and contains tons of irrelevant data...

Hm. We need to prove that the problem is with that suspend case and not with anything else (like multipath, udev, etc.).

Without a log with timestamps it is just guessing. Is there at least some kernel log on this machine? We will probably need to go through your command debug log too if there is nothing better...

Comment 13 Ayal Baron 2009-08-04 12:30:09 UTC
(In reply to comment #12)
> > We do have our log though which shows all the commands vdsm ran on the machine.
> > I can attach it but it is in debug mode and contains tons of irrelevant data...
> 
> Hm. We need to prove that the problem is with that suspend case and not with
> anything else (like multipath, udev etc.)
> 
> Without log with timestamps it is just guessing. Is there at least some kernel
> log on this machine? We will probably need go through your command debug log
> too if there is nothing better...  

You can go directly to the server and take whatever you want:
north-05.lab.bos

The vdsm log is under /var/log/vdsm/vdsm.log (logrotate runs there, so the relevant log was already rotated and has an extension of N.gz (N=1..99)).
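
(For example, a rotated log can be searched without unpacking it - the grep pattern is just an example:)

zcat /var/log/vdsm/vdsm.log.1.gz | grep lvchange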

Comment 20 Milan Broz 2009-08-06 10:12:53 UTC
Reading the log, it seems that multipath is doing something strange here;
you have this mapping:

[root@north-05 log]# multipath -ll
3600a0b80005ad142000005a84a76e951 dm-920 IBM,1726-4xx  FAStT
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
 \_ 4:0:2:0 sdd 8:48  [active][ready]
 \_ 4:0:3:0 sde 8:64  [active][ready]
3600d0230000000000216d43a8f4c3000 dm-2 WINSYS,SF2372
[size=2.7T][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 4:0:0:0 sdb 8:16  [active][ready]
3600d023000000000025b3e0c04e7f100 dm-3 WINSYS,SF2372
[size=2.7T][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 4:0:1:0 sdc 8:32  [active][ready]

And there is no blacklist in /etc/multipath.conf.

Why does multipath scan LVs (dm-X devices)?

And the log contains many messages like these:

Aug  2 07:48:34 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug  2 07:48:34 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug  2 07:48:34 north-05 multipathd: dm-1113: remove map (uevent)

...
Aug  2 08:34:01 north-05 multipathd: dm-4: remove map (uevent)
... (>1000 events!)
Aug  2 08:36:39 north-05 multipathd: dm-1084: remove map (uevent)

...
Aug  2 09:25:13 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug  2 09:25:13 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug  2 09:25:13 north-05 multipathd: dm-1113: remove map (uevent)
Aug  2 09:25:13 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug  2 09:25:13 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug  2 09:25:13 north-05 multipathd: dm-1113: remove map (uevent)
...

[-> here is the time of your reported problem:]
Aug  2 12:20:44 north-05 multipathd: dm-174: add map (uevent)
Aug  2 13:30:30 north-05 multipathd: dm-18: remove map (uevent)
...
Aug  2 14:21:38 north-05 multipathd: dm-105: remove map (uevent)
Aug  2 15:36:47 north-05 multipathd: dm-293: remove map (uevent)

Comment 21 Milan Broz 2009-08-06 10:46:33 UTC
For reference, this is the non-standard multipath.conf from that machine:

# RHEV REVISION 0.3

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     no
}

devices {

    # That is what worked for Andy, so we keep it for the beta
    device {
        vendor                  "COMPAQ"
        product                 "MSA*"
        path_grouping_policy    group_by_prio
        path_checker            "hp_sw"
        }
# This one is what we believe the correct configuration, but we need to
# verify it. So leave it commented out for now.
#   device {
#           vendor                  "COMPAQ"
#           product                 "MSA VOLUME"
#           getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
#           features                "0"
#           hardware_handler        "0"
#           path_selector           "round-robin 0"
#           path_grouping_policy    group_by_prio
#           failback                immediate
#           rr_weight               uniform
#           no_path_retry           12
#           rr_min_io               1000
#           path_checker            tur
#           prio                    alua
#    }

}

Comment 22 Dave Wysochanski 2009-08-06 14:23:33 UTC
Ok, I looked at this machine as well.  There may be multiple problems here.
As mbroz has pointed out, multipath should blacklist devices that are not paths:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/DM_Multipath/device_name_blacklist.html

Also, it appears the filter line in /etc/lvm/lvm.conf is wrong in that it will accept LVM LV devices for scanning (the "a|/dev/mapper|" was probably placed there to pick up multipath devices, but note that /dev/mapper contains LV devices as well).

You should verify both multipath and LVM are filtering properly.  Remember, we want multipath to only look at the paths - these are typically /dev/sd*.  Everything else should be blacklisted.  And for LVM, we want it to only look at multipath devices and any non-multipath, direct storage devices (e.g. /dev/sda2).

You can check LVM's filtering by using the "-vvvv" flag on a command, and checking the /etc/lvm/cache/.cache file to make sure "valid_devices" only contains what you'd expect (multipath device names, and any other non-multipath devices that LVM should scan, e.g. /dev/sda2).  If you remove /etc/lvm/cache/.cache, it will be re-created next time you run an lvm command.
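
For example, roughly (the exact -vvvv message wording varies between versions, so the grep pattern is only illustrative):

pvs -vvvv 2>&1 | grep -i filter        # see which devices the filter accepts/rejects
cat /etc/lvm/cache/.cache              # "valid_devices" should list only the devices you expect
rm /etc/lvm/cache/.cache               # safe to remove; rebuilt by the next lvm command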

You can also check multipath is blacklisting properly with the verbosity flag on the multipath command (I think it is "-v2").

As mbroz points out, a third and final problem may be the suspend/resume cycles of the "lvchange" commands.  IMO, the first step on that machine would be to update the blacklist and filter lines so multipath and lvm are only scanning/probing the devices they should, and nothing else.

Comment 23 Dave Wysochanski 2009-08-06 14:36:16 UTC
Here is another relevant link from our docs - note the behavior of active/passive arrays as well (I see rdac is one of the multipath maps, and HP is listed in multipath.conf, though I did not see any maps):
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/DM_Multipath/multipath_logical_volumes.html

Comment 24 Dave Wysochanski 2009-08-06 14:46:18 UTC
For the lvm.conf filter line, instead of:
filter = [ "a|/dev/sda2|",  "a|/dev/mpath/|", "a|/dev/mapper/|", "r|.*|" ]
try
filter = [ "a|/dev/sda2|",  "a|/dev/mpath/|", "a|/dev/mapper/3600*|", "r|.*|" ]

Or alternatively, set "user_friendly_names yes" in multipath.conf and use
filter = [ "a|/dev/sda2|",  "a|/dev/mpath/|", "a|/dev/mapper/mpath*|", "r|.*|" ]
(NOTE: I am not 100% sure all the issues with user_friendly_names have been addressed in RHEL5.4 but I know a lot of them have been fixed.)

Comment 25 Ben Marzinski 2009-08-06 14:55:28 UTC
If you don't have a blacklist in /etc/multipath.conf, you get the default, which is

devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][[0-9]*]"

which will keep you from pulling in the device types you don't want.  The majority of users just use the default blacklist.
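
For reference, spelling that default out explicitly in /etc/multipath.conf would look roughly like this (a sketch; the stock defaults may contain additional devnode entries):

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
}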

To double check that multipath isn't wasting time trying to use devices that it can't, run

# multipath -ll -v3 | less

The first thing that happens is path discovery, you should see lines like

dm-0: blacklisted
dm-1: blacklisted

Also sections like:

sda: not found in pathvec
sda: mask = 0x5
sda: bus = 1
sda: dev_t = 8:0
sda: size = 156250000
sda: vendor = ATA
sda: product = WDC WD800JD-75MS
sda: rev = 10.0
sda: h:b:t:l = 0:0:0:0
sda: path checker = readsector0 (config file default)
sda: state = 2

where multipath is getting information for the non-blacklisted devices.  You should only see sections like this for the devices you want multipath to run on.   Note, multipath won't create a map on top of a device that is already open.  Some people don't bother to blacklist devices like their root filesystem device,
since, in initrd, multipath will only create the multipath maps that you explicitly set up, and after that, the root device will already be mounted. If you have only a couple of devices like this, it shouldn't add a noticeable amount of time if you don't bother to blacklist them.  Also, multipath won't run whenever you run an lvm command, so the only slowdown you would see from this would be on startup, or whenever you explicitly ran multipath.

Comment 26 Ben Marzinski 2009-08-06 15:11:05 UTC
The only big problem with user friendly names is that devices are not guaranteed to have the same names on all the nodes that use these devices.  In fact you should take it as guaranteed that they won't.  However, if you're putting LVM on top of these devices anyway, then that's a non-issue.  Also, there's no real reason to use /dev/mpath, since the devices will already be in /dev/mapper and they won't appear in /dev/mpath until udev gets around to it.

But in general, using user_friendly_names is a good way to make your filtering easier.

filter = [ "a|/dev/sda2|", "a|/dev/mapper/mpath*|", "r|.*|"]

should work fine, unless you need the multipath devices to have the same names on all the nodes.

Comment 27 Shahar Frank 2009-08-09 13:35:18 UTC
(In reply to comment #26)
> The only big problem with user friendly names is that devices are not
> guaranteed to have the same names on all the nodes that use these devices.  In
> fact you should take is as guaranteed that they won't.  However, If you're
> putting LVM on top of these devices anyway, then that's a non-issue.  Also,
> there's no real reason to use /dev/mpath, since the devices will already be in
> /dev/mapper and they won't appear in /dev/mpath until udev gets around to it.
> 
> But in general, using user_friendly_names is a good way to make your filtering
> easier.
> 
> filter = [ "a|/dev/sda2|", "a|/dev/mapper/mpath*|", "r|.*|"]
> 
> should work fine with, unless you need the multipath device to have the same
> names on all the nodes.  

True for RHEV. 

Anyhow, why should the lvm filtering issue be related to the problem described above?

Comment 28 Milan Broz 2009-08-10 10:19:47 UTC
(In reply to comment #27)
> Anyhow, why should the lvm filtering issue be related to the problem described
> above?  

Properly configured storage and filtering is a required prerequisite for a configuration with many LVs (scanning all LVs is pointless).

There is no proper log describing what exactly happened, only the pieces I mentioned in comment #20 pointing to a filtering problem, which should be fixed anyway.

I was not able to reproduce the problem with the lvm metadata copied from that machine, and I found no problem which should lead to such timeouts in lvm command processing with a properly configured underlying storage layer.
(The possible waiting for IO after lvchange mentioned in comment #8 is on the scale of seconds - surely not tens of minutes.)

If you can reproduce it (with fixed filtering) and check the log to see what it is waiting for, we can move forward.
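
(If it helps, per-command debug logging can also be captured persistently via the log section of /etc/lvm/lvm.conf - a rough sketch, exact settings depend on your build:)

log {
    verbose = 0
    syslog = 1                      # keep sending command summaries to syslog
    file = "/var/log/lvm2.log"      # persistent debug log for every lvm command
    level = 7                       # most verbose debug level
}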

Comment 29 Ayal Baron 2009-08-10 15:18:51 UTC
Created attachment 356904 [details]
pvs hangs vdsm log

Hi,
We have a similar problem on another setup, which has around 10 LVs.  I am not sure it is the same issue, but the pattern is pretty much the same.
This time, the offending command is pvs on a SPECIFIC device.  The pvs is stuck for 11 minutes and then returns successfully.  A few seconds earlier the EXACT same command finished successfully.  During the time the command is waiting, several additional pvs commands (exactly the same) are spawned and queued (they too return after the 11-minute period).
I am attaching the vdsm log.  I couldn't find anything of interest in dmesg / /var/log/messages.

I found no indication that there was any device error or problem with the device.

Successful pvs:
f8192fca-aced-4459-adb6-cd6dcecac971::DEBUG::2009-08-10 05:58:21,441::misc::66::irs::['sudo', '/usr/sbin/pvs', '--noheadings', '-o', 'pv_uuid,dev_size,vg_uuid,pv_pe_count,pv_pe_alloc_count', '--units', 'b', '--nosuffix', '/dev/mpath/3600144f04a7fd05a00003048344a6b00'] (cwd None)

f8192fca-aced-4459-adb6-cd6dcecac971::DEBUG::2009-08-10 05:58:21,926::misc::86::irs::SUCCESS: <err> = ; <rc> = 0

Hung pvs (8 seconds later):
Thread-35058::DEBUG::2009-08-10 05:58:29,128::misc::66::irs::['sudo', '/usr/sbin/pvs', '--noheadings', '-o', 'pv_uuid,dev_size,vg_uuid,pv_pe_count,pv_pe_alloc_count', '--units', 'b', '--nosuffix', '/dev/mpath/3600144f04a7fd05a00003048344a6b00'] (cwd None)

Thread-35058::DEBUG::2009-08-10 06:09:19,319::misc::86::irs::SUCCESS: <err> = ; <rc> = 0

rpm -qa lvm2: lvm2-2.02.46-8.el5

I will attach the lvmdump shortly.

BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you over 6 minutes.  Seeing as any normal installation of our system could easily exceed 1000 LVs, I do not see how such run times could be considered acceptable.

Comment 30 Ayal Baron 2009-08-10 15:20:31 UTC
Created attachment 356905 [details]
lvm dump - cyan setup

Comment 31 Milan Broz 2009-08-10 16:09:03 UTC
(In reply to comment #29)
> Created an attachment (id=356904) [details]
> pvs hangs vdsm log

Thanks, but it contains no lvm debugging info, so we only know the command.
But this time it is a read-only pvs command, not lvresize.

Please can you also post the output of
# pvs -o +pv_mda_count
(it seems this one thing is missing from the lvmdump)

> BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you
> over 6min.  Seeing as any normal installation of our system could easily exceed
> 1000 lvs, I do not see how such run times could be considered acceptable.  

No, it is not acceptable; we have already optimized run-time operations, but it seems another step will be needed here.
(BTW, I tested it on a debugging system, so the real runtime on RHEL5 is far quicker, but a time measured in minutes is still too much.)

But this is a separate problem; AFAIK your configuration never runs this type of lvm command (activating all volumes at once on one host).

Comment 32 Shahar Frank 2009-08-10 16:34:07 UTC
(In reply to comment #31)
> (In reply to comment #29)
> > Created an attachment (id=356904) [details] [details]
> > pvs hangs vdsm log
> 
> thanks, but it contains no lvm debugging info, so we just know the command.
> But this time it is read-only pvs command, not lvresize.
> 
> Please can you post also output of
> # pvs -o +pv_mda_count
> (seems that this one thing is missing in lvmdump)
> 
> > BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you
> > over 6min.  Seeing as any normal installation of our system could easily exceed
> > 1000 lvs, I do not see how such run times could be considered acceptable.  
> 
> No, it is not acceptable, we already optimized run time operations, seems that
> here it will needed another step.
> (btw I tested it on debugging systems, so the real runtime on RHEL5 is far
> quicker, but still time in minutes is too much).
> 
> But this is separate problem, AFAIK your configuration never run this type of
> lvm command (active all volumes at once on one host).  

This is true, we don't do such things, among other reasons because we found it affects LVM performance very badly, and because we prefer *not* to activate LVs that are not used by the local node. So luckily we are perfectly OK with it.

BTW, do you wait for udev in any of the LVM flows?

Comment 33 Milan Broz 2009-08-10 16:48:13 UTC
> BTW, do you wait for udev in any of the LVM flows?  

In RHEL5, all /dev nodes for lvm2 are created by libdevmapper internally, so there is no need to wait for udev.

Interaction with udev rules should not cause any problems, except for badly written udev rules ("Device busy" errors during removal because they react to the wrong event and issue some scan which locks the device, etc.).

[In the future (RHEL6), lvm2 will probably switch to udev completely, so it will need to wait until the nodes are created.]

Comment 35 Ayal Baron 2009-08-11 08:05:48 UTC
This happened again this morning.  We have a pretty good guess regarding the cause.  It appears as though the locking mechanism isn't implemented in a fair way (i.e. read locks are granted even after a write lock has been requested).

Following is the output of two consecutive lsof commands on the VG lock file.
lvchange is waiting for a write lock while other read locks in the system keep arriving and are served first:

[root@brown-vdsa ~]# lsof /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

COMMAND    PID USER   FD   TYPE DEVICE SIZE     NODE NAME

lvchange 28870 root    3u   REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32154 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32156 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32158 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32164 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32169 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32171 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32173 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32176 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32177 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32181 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32182 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32183 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32185 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32191 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32196 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32198 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32199 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32201 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32203 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32204 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs      32205 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs      32211 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

[root@brown-vdsa ~]# man lsof

-bash: man: command not found

[root@brown-vdsa ~]# lsof /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

COMMAND    PID USER   FD   TYPE DEVICE SIZE     NODE NAME

pvs        719 root    4uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

pvs        725 root    4uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

pvs        726 root    4uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

pvs        727 root    4uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        733 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        735 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        736 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        737 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        738 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        744 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        745 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        746 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        749 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        752 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        754 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        760 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        761 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        766 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        772 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        775 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        777 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        787 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        788 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        789 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvs        790 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

vgs        792 root    3uR  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

lvchange 28870 root    3u   REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28

Comment 36 Ayal Baron 2009-08-11 08:22:03 UTC
(In reply to comment #33)
> > BTW, do you wait for udev in any of the LVM flows?  
> 
> In RHEL5 are all nodes in /dev for lvm2 created by libdevmapper internally, so
> there is no need to wait for udev.
Does this mean that after running: "lvchange --available y vg1/lv1" I can safely assume that the device: /dev/vg1/lv1 will be available?

Comment 37 Milan Broz 2009-08-11 09:57:17 UTC
(In reply to comment #35)
> This happened again this morning.  We have a pretty good guess regarding the
> cause.  It appears as though the locking mechanism isn't implemented in a fair
> way (i.e. read locks are granted after a write lock was requested).

It is not that simple. You are saying that there is starvation: here lvchange (the writer) waits for vgs/pvs (the readers) to finish. Prioritising the writer just moves the starvation to the readers (in a different usage pattern). Also there are some special locks, like vg_global, which behave differently.

Sure, locking is an important part of the problem here, but there are other problems: all the vgs/pvs processes are in "D" state, waiting for some direct IO operation, probably blocking others from releasing the lock.

I do not want to speculate now; I need to read through the gathered data and find out what happened. Anyway, access to the machine helped a lot here.

Please, if this happens again, can you provide the log output of "echo t > /proc/sysrq-trigger" and, if possible, try to check where the processes are waiting
(gdb -p <pid> and show the bt backtrace) - specifically for lvs/vgs this part is missing here. (Or just ping me and I'll try to check it myself.)
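
(For example - the paths and the PID are placeholders:)

echo t > /proc/sysrq-trigger               # dump all task stacks to the kernel log
dmesg > /tmp/sysrq-t.txt                   # or take it from /var/log/messages
gdb -p <pid-of-stuck-lvs> -batch -ex 'thread apply all bt' > /tmp/lvs-backtrace.txt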

Comment 38 Milan Broz 2009-08-11 09:59:29 UTC
> Does this mean that after running: "lvchange --available y vg1/lv1" I can
> safely assume that the device: /dev/vg1/lv1 will be available?  

Yes, after lvchange -a y LV returns with a successful exit code, /dev/mapper/LV (and the symlink /dev/VG/LV) is available.
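
For example, a caller can simply rely on the exit code (vg1/lv1 as in your question):

if lvchange --available y vg1/lv1; then
    test -b /dev/vg1/lv1 && echo "LV node is present"
fi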

Comment 39 Ayal Baron 2009-08-11 13:24:14 UTC
(In reply to comment #37)
> (In reply to comment #35)
> > This happened again this morning.  We have a pretty good guess regarding the
> > cause.  It appears as though the locking mechanism isn't implemented in a fair
> > way (i.e. read locks are granted after a write lock was requested).
> 
> It is not such simple. You are saying that there is starvartion, here waits
> lvchange (writer) for finishing vgs,pvs (readers). Prioritise writer move
> starvation just to readers (in other usage pattern). Also there are some
> special locks like vg_global, which behaves differently.
> 
> Sure that locking is important part of problem here, but there are other
> problems like all vgs, pvs are in "D" state, waiting for some direct IO
> operation. probably blocking other to release lock.
> 
> I do not want to speculate now, need to read through gathered data and found
> what happened, anyway the access to machine helped a lot here.
> 
> Please, if this happens again, can you provide log output of "echo
> t>/proc/sysrq-trigger" and if possible try to check where are the processes
> waiting
> (gdb -p <pid> and show bt backtrace) - spec. for lvs/vgs this part is missing
> here. (or just ping me and I'll try to check myself)  

We ran the following test, which clearly shows that there is a problem with relying on flock:
touch /tmp/xxx
flock -s /tmp/xxx sleep 1000 &
flock -x /tmp/xxx sleep 1001 &
flock -s /tmp/xxx ls

You can clearly see that the second shared lock is allowed to proceed (meaning flock is not "fair").

In our scenario we are running many read operations, and we clearly see that the writers are starved.  I believe that, considering the above, you can easily recreate the problem.

Comment 40 Milan Broz 2009-08-11 13:50:45 UTC
Yes, flock works this way; the shared lock is simply shared among the reporting commands, while lvchange takes an exclusive one.

I will post a full analysis here later of which locks are taken by all the commands used, but the flock behaviour is not the root cause here.

(Yes, lvchange can wait some time for readers, but I am sure that is not the problem here - I just audited the code.)

The problem is that all the commands holding the shared flock are waiting for some IO:

The trace looks like this - it seems the cause is _outside_ of the lvm code; I'll need to analyse how it happens - if you can provide me gdb traces, it will help.

Aug 10 13:11:24 blue-vdsb kernel: vgs           D 0000000000000000     0 19451  21231                     (NOTLB)
Aug 10 13:11:24 blue-vdsb kernel:  ffff81014372dcb8 0000000000000086 0000000000000400 ffffffff8001c046
Aug 10 13:11:24 blue-vdsb kernel:  ffff81003ae56588 0000000000000007 ffff8101855a67e0 ffff81024dc62080
Aug 10 13:11:25 blue-vdsb kernel:  00004fafc493c408 000000000000a3e6 ffff8101855a69c8 0000000000000000
Aug 10 13:11:25 blue-vdsb kernel: Call Trace:
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8001c046>] generic_make_request+0x211/0x228
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8006dd6e>] do_gettimeofday+0x40/0x8f
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8005a905>] getnstimeofday+0x10/0x28
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800637ba>] io_schedule+0x3f/0x67
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800f27dd>] __blockdev_direct_IO+0x8bc/0xa35
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800e3df5>] blkdev_direct_IO+0x32/0x37
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800e3d2d>] blkdev_get_blocks+0x0/0x96
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8000c4cf>] __generic_file_aio_read+0xb8/0x198
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff80063ad5>] mutex_lock+0xd/0x1d
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800c510c>] generic_file_read+0xac/0xc5
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8009f698>] autoremove_wake_function+0x0/0x2e
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800e2e85>] block_ioctl+0x1b/0x1f
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800420a7>] do_ioctl+0x21/0x6b
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800302a0>] vfs_ioctl+0x457/0x4b9
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff800b6193>] audit_syscall_entry+0x180/0x1b3
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8000b695>] vfs_read+0xcb/0x171
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff80011b72>] sys_read+0x45/0x6e
Aug 10 13:11:25 blue-vdsb kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 41 Milan Broz 2009-08-11 14:56:21 UTC
OK, another machine now - several problems at once:

- many reporting programs are forked at once; is this normal operation?

1) You did not fix the filter line in lvm.conf:
 filter = [ "a|/dev/sda2|",  "a|/dev/mpath/|", "a|/dev/mapper/|", "r|.*|" ]

2) Your storage is now broken:
Aug 11 14:06:31 brown-vdsa kernel: scsi 5:0:0:0: rejecting I/O to dead device
Aug 11 14:06:31 brown-vdsa multipathd: sdc: readsector0 checker reports path is down
Aug 11 14:07:42 brown-vdsa kernel: scsi 5:0:0:0: rejecting I/O to dead device

Every reporting command scans all devices in the system because:

3) There is a missing PV on this system now!
 # pvs
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  PV                                           VG                                   Fmt  Attr PSize   PFree
  /dev/mpath/3600144f04a7fccfc00003048344a6b00 b7825068-48e4-41ce-815c-b1566b9fde28 lvm2 a-   399.88G      0
  /dev/sda2                                    HostVG                               lvm2 a-    74.45G      0
  unknown device                               b7825068-48e4-41ce-815c-b1566b9fde28 lvm2 a-   199.88G 174.38G


This is a very expensive situation: if lvm cannot find a PV, it starts a rescan and all internal caching is flushed; if 20 processes run a rescan with direct IO, your storage is under high stress, and more and more reporting commands are spawned...

4) Because of these very slow reporting commands, lvchange waits for the exclusive flock() - yes, there is a kind of starvation here now.

Comment 42 Milan Broz 2009-08-11 15:11:42 UTC
Why are you issuing several of the _same_ reporting commands at once?

Comment 44 Milan Broz 2009-08-12 09:00:35 UTC
On the examined affected systems, these commands run in parallel and need these locks:

Reporting (all locks are shared/read):
  pvs PV
    - locks: global, <VG>
  vgs VG
    - locks: <VG>
  lvs VG/LV
    - locks VG
  lvs @tag
    - locks: all VG in system (only one at once)

Manipulation with LV:
  lvchange -a y/n VG/LV
    - VG (exclusive/write)

  lvchange --refresh VG/LV
    - VG (exclusive/write)


Internal lvm2 locking was designed so that the READ lock in reporting commands is held only for a very short time.

On the affected system, there are many parallel reporting commands running.

The usual trigger of the problem is some storage issue which causes a reporting command to run too long while the system issues more monitoring commands in parallel. So the shared lock is never released and the commands requiring the write lock starve.

Parts where the lvm code can help here:
 - use a shared lock instead of an exclusive one for activation commands
 - change the flock locking to handle the starvation problem better

Parts which require a configuration change on your systems:

 - every command issues a direct IO read of the metadata (at least); isn't that a problem for your storage performance? If so, try to minimize the number of lvm commands you issue

 - use proper filtering - scan _only_ PV devices, never the LVs themselves - this is very important in situations where your underlying storage has problems (like the path failure above) and lvm tries to find a referenced PV in the system; if the filter includes /dev/mapper/*, lvm scans _all_ your LVs in the system with direct IO, which must be a very expensive operation

 - note that reporting commands using a tag specification need to scan all VGs in the system (to check the tag inside the metadata).

Comment 45 Milan Broz 2009-08-13 14:40:59 UTC
We have identified possible fixes/workarounds in the lvm2 package.

The usual process is to commit them upstream and backport to RHEL; I think I will have an experimental (read: not QE-tested at all) rpm for RHEL soon.
(Changes include possibly using prioritization of the exclusive flock operation, which basically avoids starvation, changes to vgextend to block, and using a read-only lock for the lvchange --available and --refresh commands.)

But I think that using some rwlock in the caller, so that it does not issue so many commands in parallel, is still a very good way to avoid problems - that way both sides play it safe.
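
A minimal sketch of that idea on the caller side (the lock file path and the command arguments are purely illustrative) is to funnel all lvm invocations through one flock, so readers cannot pile up in front of a writer:

LOCK=/var/lock/vdsm-lvm.lock
flock "$LOCK" /usr/sbin/lvs -o attr --noheadings VG/LV
flock "$LOCK" /usr/sbin/lvchange --refresh VG/LV

Note this serializes everything (stricter than a real read/write lock), but it guarantees only one lvm command is in flight at a time.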

If the RHEV team can help test such an experimental lvm2 rpm package, please let me know.

Comment 53 Dave Wilson 2009-08-20 13:56:26 UTC
(In reply to comment #45)
> We have identified possible fixes/workarounds in lvm2 package.
> 
> Usual process is commit them upstream and backport to RHEL, I think that I will
> have experimental (read: no QE test at all) rpm for RHEL soon.
> (Changes include possible using prioritization of exclusive flock operation,
> which basically avoids starvation, changes to vgextend to block and using read
> only lock for lvchange --available and --refresh commands).
> 
> But I think that using some rwlock in caller to not issue so many commands in
> parallel is still very good way to avoid problems - so both sides plays safely.
> 
> If RHEV team can help test such experimental lvm2 rpm package, please let me
> know.  

If you have the "experimental" lvm2 rpm package ready, I can help test it under a database load with my RHEV configuration.

I've baselined my database build with lvm2-2.02.46-8.el5 and would be interested to note any performance impact of the new package.

Thanks.

Comment 54 Milan Broz 2009-08-20 14:16:52 UTC
> If you have the "experimental" lmv2 rpm package ready I can help test under a
> database load with my RHEV configuration. 

See comment #48 for the link to the package; it has been available there since last Friday (the comment is private because it points to an internal server).

Comment 55 Shahar Frank 2009-08-20 14:28:20 UTC
Oded, can you give us some feedback on this new lvm package?

Comment 57 Yaniv Kaul 2009-09-14 15:00:50 UTC
We have not seen the past issues with this LVM on the Westford cluster for ~2 weeks now, so I guess we can say this is fixed.

Comment 58 Milan Broz 2009-09-15 12:37:43 UTC
Patch in lvm2-2_02_46-9_el5.

Comment 62 Corey Marthaler 2010-02-15 22:46:17 UTC
Marking this bug fixed based on comment #57. I don't currently have access to hardware to test this fix.

Comment 64 errata-xmlrpc 2010-03-30 09:01:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0298.html

