Description of problem:

On the Westford setup (>40 hosts with shared storage, >1100 LVs) we encountered a problem where lvm commands slowed down to several minutes each, one lvchange took 46 minutes, and subsequent lvchange commands did not return at all (we killed them after over an hour). In the interim, other lvm commands did return.

I am not sure there is any connection, but the lvchange command returned at the same time as the following line appeared in /var/log/messages:

Aug 2 12:20:44 north-05 multipathd: dm-174: add map (uevent)

Hostname: north-05.lab.bos
Please could you add:
- lvm2 rpm versions
- which exact lvm commands you run
- the output of that command run with -vvvv
- if possible, an lvmdump from that system

Thanks.
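For example, something like the following would collect all of that (the LV name and output path below are just placeholders):

rpm -q lvm2
lvchange --refresh -vvvv VG/LV 2> /tmp/lvchange-vvvv.log    # -vvvv debug output goes to stderr
lvmdump                                                     # typically creates an lvmdump-<host>-<date>.tgz archive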
Created attachment 356030 [details] lvm dump
rpm -qa lvm2: lvm2-2.02.46-8.el5

We ran many lvm commands. The specific lvchange command that took 46 min:

(11:34:03,112) sudo /usr/sbin/lvchange --refresh c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c (finished successfully at 12:20:44,241)
(11:34:11,052) sudo /usr/sbin/lvs -o attr --noheadings c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700 (finished successfully at 11:35:17,107)

The following lvchange commands hung (we killed them at 12:44:19, so perhaps they would have finished eventually):

Thread-5593::DEBUG::2009-08-02 11:34:41,639::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'n', 'c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700'] (cwd None)
Thread-5583::DEBUG::2009-08-02 11:35:17,108::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'y', 'c270a758-40cb-48a6-934d-c30316740a06/5faa297d-6ba8-4dbf-9a81-24789232e700'] (cwd None)
492ac3aa-a1ed-4bff-a8cb-c448b3762e34::DEBUG::2009-08-02 11:51:56,622::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--available', 'y', 'c270a758-40cb-48a6-934d-c30316740a06/master'] (cwd None)
0a8e403a-d449-43eb-af75-fea2daa19dcf::DEBUG::2009-08-02 11:54:15,361::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)
cf196124-a31e-4b6e-aeb6-26b0db78466e::DEBUG::2009-08-02 11:54:16,348::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)
6fa5bd3b-7c40-41a4-a377-dbab732969ca::DEBUG::2009-08-02 12:37:32,165::misc::66::irs::['sudo', '/usr/sbin/lvchange', '--refresh', 'c270a758-40cb-48a6-934d-c30316740a06/b4ffa268-d1d8-4d3b-ad56-fc15d5bea80c'] (cwd None)

Concurrently we ran many more lvm commands (lvs, vgs, pvs etc.); if a specific list is needed, tell me. I attached the lvmdump. The system is currently functioning (more or less), but I will attach the lvchange output with full verbosity shortly anyway.
Created attachment 356037 [details] lvchange --refresh output
Thanks. I'll check which commands have performance problems here; from your log I see that lvs is OK, but lvchange takes a long time.

FYI: running a "fake" copy of the PVs with the provided metadata and activating all LVs shows these times (this should be roughly the worst case; manipulating a single volume should be far quicker):

# time vgchange -a y
  1109 logical volume(s) in volume group "c270a758-40cb-48a6-934d-c30316740a06" now active
  361 logical volume(s) in volume group "8bd8e09d-10fb-4289-bd67-bb40ca3efc93" now active

real    6m12.058s
user    1m16.064s
sys     4m51.422s
What's the purpose of the lvchange --refresh command in your environment? The code internally runs suspend->resume for that LV (to reload possible changes).

If the reason is a volume size change, is it always an increase in size?

(Suspend can run for a long time if there is I/O in flight - it needs to flush all I/O running on that LV.)
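Just to illustrate where the time can go (this is not literally what lvchange executes - it also reloads the device-mapper table in between; the device name is a placeholder):

# dmsetup suspend vgname-lvname
# dmsetup resume vgname-lvname

The suspend step blocks until all outstanding I/O on the LV is flushed, so a device under heavy load can hold it up for some time.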
(In reply to comment #8)
> What's the purpose of lvchange --refresh command in your environment?

We are running a multi-node setup where only one node is allowed to run lvextend commands. The "lvchange --refresh" command is run on a different node which needs to use this lv after the extend.

> Because code internaly run suspend->resume for that LV (to reload possible
> changes).
>
> If the reason is volume size change, it is always increasing the size?
>
> (Suspend can run long time, if there are running IOs - it need to flush all IOs
> tunning on that LV.)
> We are running a multi node setup where only one node is allowed to run
> lvextend commands. the "lvchange --refresh" command is run on a different node
> which needs to use this lv after the extend.

OK. So when the lvchange --refresh <LV> command was stuck, the LV was active. Was there any IO activity on that LV? (IOW, is it possible that this operation waits for some ongoing IO?)

If you can reproduce it (with -vvvv), can you point me to the line/operation/whatever in the log which appears last before it waits?

If there is an lvm command log in syslog (in the lvmdump it seems that lvm commands are logged), can you provide the log from the moment when it stops responding?

(Or just tell me how I can grab that log. Also, you mentioned multipathd events - maybe that is related; it should be in the log too.)
(In reply to comment #10)
> > We are running a multi node setup where only one node is allowed to run
> > lvextend commands. the "lvchange --refresh" command is run on a different node
> > which needs to use this lv after the extend.
>
> ok. So when the lvchange --refresh <LV> command was stuck, LV was active.
> Was there any IO activity on that LV? (IOW is possible that this operation
> waits for some ongoing IO?)

Yes, it was active and could very well be waiting for IO. There is a running VM on the lv, and the whole point of the mechanism is to not freeze the machine during lvextend. This process only happens after lvextend (we never reduce the lv size); is there any way we can hint this to lvchange?

> If you can reproduce it (with -vvvv), can you point me to
> line/operation/whatever in log which appears as last when before it waits?
>
> If there is lvm command log in syslog (in lvmdump it seems that lvm commands
> are logged), can you provide log from the moment when it stops responding?

Reproducing this will be extremely difficult.

> (Or just tell me how I can grab that log. Also you mentioned multipathd events,
> maybe it is related - it should be in log too.)

We activated the log after this happened, so no such luck. We do have our own log, though, which shows all the commands vdsm ran on the machine. I can attach it, but it is in debug mode and contains tons of irrelevant data...
> We do have our log though which shows all the commands vdsm ran on the machine.
> I can attach it but it is in debug mode and contains tons of irrelevant data...

Hm. We need to prove that the problem is with that suspend case and not with anything else (like multipath, udev, etc.).

Without a log with timestamps it is just guessing. Is there at least some kernel log on this machine? We will probably need to go through your command debug log too if there is nothing better...
(In reply to comment #12)
> > We do have our log though which shows all the commands vdsm ran on the machine.
> > I can attach it but it is in debug mode and contains tons of irrelevant data...
>
> Hm. We need to prove that the problem is with that suspend case and not with
> anything else (like multipath, udev etc.)
>
> Without log with timestamps it is just guessing. Is there at least some kernel
> log on this machine? We will probably need go through your command debug log
> too if there is nothing better...

You can go directly to the server and take whatever you want: north-05.lab.bos

The vdsm log is under /var/log/vdsm/vdsm.log (there is logrotate there, so the relevant log was already rotated and has an extension of N.gz, N=1..99).
Reading the log, it seems that multipath is doing something strange here. You have this mapping:

[root@north-05 log]# multipath -ll
3600a0b80005ad142000005a84a76e951 dm-920 IBM,1726-4xx FAStT
[size=1.5T][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
 \_ 4:0:2:0 sdd 8:48  [active][ready]
 \_ 4:0:3:0 sde 8:64  [active][ready]
3600d0230000000000216d43a8f4c3000 dm-2 WINSYS,SF2372
[size=2.7T][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 4:0:0:0 sdb 8:16  [active][ready]
3600d023000000000025b3e0c04e7f100 dm-3 WINSYS,SF2372
[size=2.7T][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 4:0:1:0 sdc 8:32  [active][ready]

And there is no blacklist in /etc/multipath.conf. Why does multipath scan LVs (dm-X devices)?

The log also contains many messages like these:

Aug 2 07:48:34 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug 2 07:48:34 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug 2 07:48:34 north-05 multipathd: dm-1113: remove map (uevent)
...
Aug 2 08:34:01 north-05 multipathd: dm-4: remove map (uevent)
... (>1000 events!)
Aug 2 08:36:39 north-05 multipathd: dm-1084: remove map (uevent)
...
Aug 2 09:25:13 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug 2 09:25:13 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug 2 09:25:13 north-05 multipathd: dm-1113: remove map (uevent)
Aug 2 09:25:13 north-05 kernel: device-mapper: table: 253:1113: multipath: error getting device
Aug 2 09:25:13 north-05 kernel: device-mapper: ioctl: error adding target to table
Aug 2 09:25:13 north-05 multipathd: dm-1113: remove map (uevent)
...
[-> here is the time of your reported problem:]
Aug 2 12:20:44 north-05 multipathd: dm-174: add map (uevent)
Aug 2 13:30:30 north-05 multipathd: dm-18: remove map (uevent)
...
Aug 2 14:21:38 north-05 multipathd: dm-105: remove map (uevent)
Aug 2 15:36:47 north-05 multipathd: dm-293: remove map (uevent)
For reference, this is the non-standard multipath.conf from that machine:

# RHEV REVISION 0.3

defaults {
    udev_dir                /dev
    polling_interval        10
    selector                "round-robin 0"
    path_grouping_policy    multibus
    getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
    prio_callout            /bin/true
    path_checker            readsector0
    rr_min_io               100
    max_fds                 8192
    rr_weight               priorities
    failback                immediate
    no_path_retry           fail
    user_friendly_names     no
}

devices {
    # That is what worked for Andy, so we keep it for the beta
    device {
        vendor                  "COMPAQ"
        product                 "MSA*"
        path_grouping_policy    group_by_prio
        path_checker            "hp_sw"
    }

    # This one is what we believe the correct configuration, but we need to
    # verify it. So leave it commented out for now.
    # device {
    #     vendor                  "COMPAQ"
    #     product                 "MSA VOLUME"
    #     getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
    #     features                "0"
    #     hardware_handler        "0"
    #     path_selector           "round-robin 0"
    #     path_grouping_policy    group_by_prio
    #     failback                immediate
    #     rr_weight               uniform
    #     no_path_retry           12
    #     rr_min_io               1000
    #     path_checker            tur
    #     prio                    alua
    # }
}
Ok, I looked at this machine as well. There may be multiple problems here.

As mbroz has pointed out, multipath should blacklist devices that are not paths:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/DM_Multipath/device_name_blacklist.html

Also, it appears the filter line in /etc/lvm/lvm.conf is wrong in that it will accept LVM LV devices for scanning (the "a|/dev/mapper|" was probably placed there to pick up multipath devices, but note that /dev/mapper contains LV devices as well).

You should verify that both multipath and LVM are filtering properly. Remember, we want multipath to only look at the paths - these are typically /dev/sd*. Everything else should be blacklisted. And for LVM, we want to only look at multipath devices and any non-multipath, direct storage devices (e.g. /dev/sda2).

You can check LVM's filtering by using the "-vvvv" flag on a command and checking the /etc/lvm/cache/.cache file to make sure "valid_devices" only contains what you'd expect (multipath device names, and any other non-multipath devices that LVM should scan, e.g. /dev/sda2). If you remove /etc/lvm/cache/.cache, it will be re-created the next time you run an lvm command.

You can also check that multipath is blacklisting properly with the verbosity flag on the multipath command (I think it is "-v2").

As mbroz points out, a third and final problem may be the suspend/resume cycles of the "lvchange" commands.

IMO, the first step on that machine would be to update the blacklist and filter lines so multipath and lvm are only scanning/probing the devices they should, and nothing else.
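For example, one way to run that LVM check (the temporary log path is just an example):

# rm -f /etc/lvm/cache/.cache
# pvs -vvvv > /tmp/pvs-debug.log 2>&1
# cat /etc/lvm/cache/.cache

After the pvs run, the re-created .cache should list only the expected devices under "valid_devices"; anything else means the filter still lets unwanted devices through.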
Here is another relevant link from our docs - note the behavior of active/passive arrays as well (I see rdac is one of the multipath maps, and HP is listed in multipath.conf, though I did not see any maps): http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/DM_Multipath/multipath_logical_volumes.html
For the lvm.conf filter line, instead of:

filter = [ "a|/dev/sda2|", "a|/dev/mpath/|", "a|/dev/mapper/|", "r|.*|" ]

try

filter = [ "a|/dev/sda2|", "a|/dev/mpath/|", "a|/dev/mapper/3600*|", "r|.*|" ]

Or alternatively, set "user_friendly_names yes" in multipath.conf and use

filter = [ "a|/dev/sda2|", "a|/dev/mpath/|", "a|/dev/mapper/mpath*|", "r|.*|" ]

(NOTE: I am not 100% sure all the issues with user_friendly_names have been addressed in RHEL5.4, but I know a lot of them have been fixed.)
If you don't have a blacklist in /etc/multipath.conf, you get the default, which is

devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][[0-9]*]"

which will keep you from pulling in the device types you don't want. The majority of users just use the default blacklist.

To double check that multipath isn't wasting time trying to use devices that it can't, run

# multipath -ll -v3 | less

The first thing that happens is path discovery; you should see lines like

dm-0: blacklisted
dm-1: blacklisted

and also sections like:

sda: not found in pathvec
sda: mask = 0x5
sda: bus = 1
sda: dev_t = 8:0
sda: size = 156250000
sda: vendor = ATA
sda: product = WDC WD800JD-75MS
sda: rev = 10.0
sda: h:b:t:l = 0:0:0:0
sda: path checker = readsector0 (config file default)
sda: state = 2

where multipath is getting information for the non-blacklisted devices. You should only see sections like this for the devices you want multipath to run on.

Note, multipath won't create a map on top of a device that is already open. Some people don't bother to blacklist devices like their root filesystem device, since, in the initrd, multipath will only create the multipath maps that you explicitly set up, and after that, the root device will already be mounted. If you have only a couple of devices like this, it shouldn't add a noticeable amount of time if you don't bother to blacklist them. Also, multipath won't run whenever you run an lvm command, so the only slowdown you would see from this would be on startup, or whenever you explicitly ran multipath.
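If you prefer to spell the blacklist out explicitly in /etc/multipath.conf rather than rely on the built-in default, a minimal sketch would be (adjust the devnode patterns to your device layout):

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
}

The important part for this bug is that dm-* devices (the LVs themselves) are excluded, so multipathd stops generating events for them.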
The only big problem with user friendly names is that devices are not guaranteed to have the same names on all the nodes that use these devices. In fact, you should take it as guaranteed that they won't. However, if you're putting LVM on top of these devices anyway, then that's a non-issue. Also, there's no real reason to use /dev/mpath, since the devices will already be in /dev/mapper, and they won't appear in /dev/mpath until udev gets around to it.

But in general, using user_friendly_names is a good way to make your filtering easier.

filter = [ "a|/dev/sda2|", "a|/dev/mapper/mpath*|", "r|.*|" ]

should work fine, unless you need the multipath devices to have the same names on all the nodes.
(In reply to comment #26)
> The only big problem with user friendly names is that devices are not
> guaranteed to have the same names on all the nodes that use these devices. In
> fact you should take is as guaranteed that they won't. However, If you're
> putting LVM on top of these devices anyway, then that's a non-issue. Also,
> there's no real reason to use /dev/mpath, since the devices will already be in
> /dev/mapper and they won't appear in /dev/mpath until udev gets around to it.
>
> But in general, using user_friendly_names is a good way to make your filtering
> easier.
>
> filter = [ "a|/dev/sda2|", "a|/dev/mapper/mpath*|", "r|.*|"]
>
> should work fine with, unless you need the multipath device to have the same
> names on all the nodes.

True for RHEV.

Anyhow, why should the lvm filtering issue be related to the problem described above?
(In reply to comment #27)
> Anyhow, why should the lvm filtering issue be related to the problem described
> above?

Properly configured storage and filtering is a required prerequisite for a configuration with many LVs (scanning all LVs is pointless).

There is no proper log describing what exactly happened, only the pieces I mentioned in comment #20, which point to a filtering problem - and that should be fixed anyway.

I was not able to reproduce the problem with the lvm metadata copied from that machine, and I found no issue that should lead to such timeouts in lvm command processing with a properly configured underlying storage layer. (The possible waiting for IO after lvchange mentioned in comment #8 is on a scale of seconds - surely not tens of minutes.)

If you can reproduce it (with fixed filtering) and check the log to see what it is waiting for, we can move forward.
Created attachment 356904 [details]
pvs hangs vdsm log

Hi,

We have a similar problem on another setup which has around 10 lvs. I am not sure it is the same issue, but the pattern is pretty much the same. This time, the offending command is pvs on a SPECIFIC device. The pvs is stuck for 11 minutes and then returns successfully. A few seconds earlier the EXACT same command finished successfully. During the time the command is waiting, several additional pvs commands (exactly the same) are spawned and queued; they too return after the 11 minute period.

I am attaching the vdsm log. I couldn't find anything of interest in dmesg or /var/log/messages, and I found no indication that there was any device error or problem with the device.

Successful pvs:

f8192fca-aced-4459-adb6-cd6dcecac971::DEBUG::2009-08-10 05:58:21,441::misc::66::irs::['sudo', '/usr/sbin/pvs', '--noheadings', '-o', 'pv_uuid,dev_size,vg_uuid,pv_pe_count,pv_pe_alloc_count', '--units', 'b', '--nosuffix', '/dev/mpath/3600144f04a7fd05a00003048344a6b00'] (cwd None)
f8192fca-aced-4459-adb6-cd6dcecac971::DEBUG::2009-08-10 05:58:21,926::misc::86::irs::SUCCESS: <err> = ; <rc> = 0

Hung pvs (8 seconds later):

Thread-35058::DEBUG::2009-08-10 05:58:29,128::misc::66::irs::['sudo', '/usr/sbin/pvs', '--noheadings', '-o', 'pv_uuid,dev_size,vg_uuid,pv_pe_count,pv_pe_alloc_count', '--units', 'b', '--nosuffix', '/dev/mpath/3600144f04a7fd05a00003048344a6b00'] (cwd None)
Thread-35058::DEBUG::2009-08-10 06:09:19,319::misc::86::irs::SUCCESS: <err> = ; <rc> = 0

rpm -qa lvm2: lvm2-2.02.46-8.el5

I will attach the lvmdump shortly.

BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you over 6 min. Seeing as any normal installation of our system could easily exceed 1000 lvs, I do not see how such run times could be considered acceptable.
Created attachment 356905 [details] lvm dump - cyan setup
(In reply to comment #29)
> Created an attachment (id=356904) [details]
> pvs hangs vdsm log

Thanks, but it contains no lvm debugging info, so we just know the command. But this time it is a read-only pvs command, not lvresize.

Please can you also post the output of

# pvs -o +pv_mda_count

(it seems that this one thing is missing in lvmdump).

> BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you
> over 6min. Seeing as any normal installation of our system could easily exceed
> 1000 lvs, I do not see how such run times could be considered acceptable.

No, it is not acceptable; we have already optimized run-time operations, and it seems another optimization step will be needed here. (Btw, I tested it on a debugging system, so the real runtime on RHEL5 is far quicker, but a time in minutes is still too much.)

But this is a separate problem - AFAIK your configuration never runs this type of lvm command (activating all volumes at once on one host).
(In reply to comment #31)
> (In reply to comment #29)
> > Created an attachment (id=356904) [details]
> > pvs hangs vdsm log
>
> thanks, but it contains no lvm debugging info, so we just know the command.
> But this time it is read-only pvs command, not lvresize.
>
> Please can you post also output of
> # pvs -o +pv_mda_count
> (seems that this one thing is missing in lvmdump)
>
> > BTW Milan, you stated earlier that with our metadata, "vgchange -a y" took you
> > over 6min. Seeing as any normal installation of our system could easily exceed
> > 1000 lvs, I do not see how such run times could be considered acceptable.
>
> No, it is not acceptable, we already optimized run time operations, seems that
> here it will needed another step.
> (btw I tested it on debugging systems, so the real runtime on RHEL5 is far
> quicker, but still time in minutes is too much).
>
> But this is separate problem, AFAIK your configuration never run this type of
> lvm command (active all volumes at once on one host).

This is true, we don't do such things, among other reasons because we found it affects LVM performance very badly, and because we even prefer *not* to activate LVs that are not used by the local node. So luckily we are perfectly OK with it.

BTW, do you wait for udev in any of the LVM flows?
> BTW, do you wait for udev in any of the LVM flows?

In RHEL5, all nodes in /dev for lvm2 are created by libdevmapper internally, so there is no need to wait for udev.

Interaction with udev rules should not cause any problems except for badly written udev rules ("Device busy" errors during removal because a rule reacts to the wrong event and issues some scan which locks the device, etc.).

[In the future (RHEL6) lvm2 will probably switch to udev completely, so it will need to wait until the nodes are created.]
This happened again this morning. We have a pretty good guess regarding the cause. It appears as though the locking mechanism isn't implemented in a fair way (i.e. read locks are granted after a write lock was requested). Following is the output of 2 consecutive lsof commands on the vg lock file. lvchange is waiting for a write lock while there are other read locks in the system which keep on arriving and are served first:

[root@brown-vdsa ~]# lsof /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
COMMAND    PID USER  FD  TYPE DEVICE SIZE     NODE NAME
lvchange 28870 root   3u  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32154 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32156 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32158 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32164 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32169 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32171 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32173 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32176 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32177 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32181 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32182 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32183 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32185 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32191 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32196 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32198 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32199 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32201 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32203 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32204 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs      32205 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs      32211 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
[root@brown-vdsa ~]# man lsof
-bash: man: command not found
[root@brown-vdsa ~]# lsof /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
COMMAND    PID USER  FD  TYPE DEVICE SIZE     NODE NAME
pvs        719 root   4uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
pvs        725 root   4uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
pvs        726 root   4uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
pvs        727 root   4uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        733 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        735 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        736 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        737 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        738 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        744 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        745 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        746 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        749 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        752 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        754 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        760 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        761 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        766 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        772 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        775 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        777 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        787 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        788 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        789 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvs        790 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
vgs        792 root   3uR REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
lvchange 28870 root   3u  REG   0,19    0 21831529 /var/lock/lvm/V_b7825068-48e4-41ce-815c-b1566b9fde28
(In reply to comment #33) > > BTW, do you wait for udev in any of the LVM flows? > > In RHEL5 are all nodes in /dev for lvm2 created by libdevmapper internally, so > there is no need to wait for udev. Does this mean that after running: "lvchange --available y vg1/lv1" I can safely assume that the device: /dev/vg1/lv1 will be available?
(In reply to comment #35)
> This happened again this morning. We have a pretty good guess regarding the
> cause. It appears as though the locking mechanism isn't implemented in a fair
> way (i.e. read locks are granted after a write lock was requested).

It is not that simple. You are saying there is starvation: here lvchange (the writer) waits for vgs/pvs (the readers) to finish. Prioritising the writer just moves the starvation to the readers (in another usage pattern). Also, there are some special locks, like vg_global, which behave differently.

Sure, locking is an important part of the problem here, but there are other problems, like all the vgs/pvs processes being in "D" state, waiting for some direct IO operation - probably blocking others from releasing the lock.

I do not want to speculate now; I need to read through the gathered data and find out what happened. Anyway, access to the machine helped a lot here.

Please, if this happens again, can you provide the log output of "echo t > /proc/sysrq-trigger" and, if possible, try to check where the processes are waiting (gdb -p <pid> and show the bt backtrace)? Specifically for lvs/vgs this part is missing here. (Or just ping me and I'll try to check myself.)
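For example, something along these lines (run as root; output paths are just examples) would collect both the kernel and userspace traces the next time it hangs:

echo t > /proc/sysrq-trigger          # dump kernel stacks of all tasks to the kernel log
dmesg > /tmp/sysrq-tasks.txt          # or pull them from /var/log/messages
for pid in $(pgrep -x lvs; pgrep -x vgs; pgrep -x pvs; pgrep -x lvchange); do
    gdb -batch -ex bt -p "$pid" > "/tmp/bt-$pid.txt" 2>&1   # userspace backtrace of each stuck command
done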
> Does this mean that after running: "lvchange --available y vg1/lv1" I can
> safely assume that the device: /dev/vg1/lv1 will be available?

Yes, after lvchange -a y LV returns with a successful exit code, /dev/mapper/LV (and the symlink /dev/VG/LV) is available.
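So a caller can, for example, verify this directly after activation (vg1/lv1 is a placeholder name):

lvchange --available y vg1/lv1 && test -b /dev/vg1/lv1 && echo "LV node is present"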
(In reply to comment #37)
> (In reply to comment #35)
> > This happened again this morning. We have a pretty good guess regarding the
> > cause. It appears as though the locking mechanism isn't implemented in a fair
> > way (i.e. read locks are granted after a write lock was requested).
>
> It is not such simple. You are saying that there is starvartion, here waits
> lvchange (writer) for finishing vgs,pvs (readers). Prioritise writer move
> starvation just to readers (in other usage pattern). Also there are some
> special locks like vg_global, which behaves differently.
>
> Sure that locking is important part of problem here, but there are other
> problems like all vgs, pvs are in "D" state, waiting for some direct IO
> operation. probably blocking other to release lock.
>
> I do not want to speculate now, need to read through gathered data and found
> what happened, anyway the access to machine helped a lot here.
>
> Please, if this happens again, can you provide log output of "echo
> t>/proc/sysrq-trigger" and if possible try to check where are the processes
> waiting
> (gdb -p <pid> and show bt backtrace) - spec. for lvs/vgs this part is missing
> here. (or just ping me and I'll try to check myself)

We ran the following test, which clearly shows that there is a problem with relying on flock:

touch /tmp/xxx
flock -s /tmp/xxx sleep 1000 &
flock -x /tmp/xxx sleep 1001 &
flock -s /tmp/xxx ls

You can clearly see that the second shared lock is allowed to proceed (meaning flock is not "fair").

In our scenario we are running many read operations, and we clearly see that the writers are starved. I believe that, considering the above, you can easily recreate the problem.
Yes, flock works this way - the shared lock is just shared among the reporting commands, and lvchange takes an exclusive one. I will post a full analysis of which locks are taken by all the commands used here later, but the flock behaviour is not the root cause here. (Yes, lvchange can wait some time for readers, but I am sure that is not the problem here - I just audited the code.)

The problem is that all the commands holding the shared flock are waiting for some IO. The trace looks like the one below - it seems the cause is _outside_ of the lvm code; I'll need to analyse how this happens. If you can provide me gdb traces, it will help.

Aug 10 13:11:24 blue-vdsb kernel: vgs D 0000000000000000 0 19451 21231 (NOTLB)
Aug 10 13:11:24 blue-vdsb kernel: ffff81014372dcb8 0000000000000086 0000000000000400 ffffffff8001c046
Aug 10 13:11:24 blue-vdsb kernel: ffff81003ae56588 0000000000000007 ffff8101855a67e0 ffff81024dc62080
Aug 10 13:11:25 blue-vdsb kernel: 00004fafc493c408 000000000000a3e6 ffff8101855a69c8 0000000000000000
Aug 10 13:11:25 blue-vdsb kernel: Call Trace:
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8001c046>] generic_make_request+0x211/0x228
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8006dd6e>] do_gettimeofday+0x40/0x8f
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8005a905>] getnstimeofday+0x10/0x28
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800637ba>] io_schedule+0x3f/0x67
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800f27dd>] __blockdev_direct_IO+0x8bc/0xa35
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800e3df5>] blkdev_direct_IO+0x32/0x37
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800e3d2d>] blkdev_get_blocks+0x0/0x96
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8000c4cf>] __generic_file_aio_read+0xb8/0x198
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff80063ad5>] mutex_lock+0xd/0x1d
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800c510c>] generic_file_read+0xac/0xc5
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8009f698>] autoremove_wake_function+0x0/0x2e
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800e2e85>] block_ioctl+0x1b/0x1f
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800420a7>] do_ioctl+0x21/0x6b
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800302a0>] vfs_ioctl+0x457/0x4b9
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff800b6193>] audit_syscall_entry+0x180/0x1b3
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8000b695>] vfs_read+0xcb/0x171
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff80011b72>] sys_read+0x45/0x6e
Aug 10 13:11:25 blue-vdsb kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Ok, another machine now - several problems at once. (Also: many reporting programs forked at once - is that normal operation?)

1) You did not fix the filter line in lvm.conf:

filter = [ "a|/dev/sda2|", "a|/dev/mpath/|", "a|/dev/mapper/|", "r|.*|" ]

2) Your storage is now broken:

Aug 11 14:06:31 brown-vdsa kernel: scsi 5:0:0:0: rejecting I/O to dead device
Aug 11 14:06:31 brown-vdsa multipathd: sdc: readsector0 checker reports path is down
Aug 11 14:07:42 brown-vdsa kernel: scsi 5:0:0:0: rejecting I/O to dead device

3) Every reporting command scans all devices in the system, because there is a missing PV on this system now:

# pvs
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  Couldn't find device with uuid 'QkzcVy-J0xu-Meq1-DEkp-F8lK-uSGK-anYXWj'.
  PV                                           VG                                   Fmt  Attr PSize   PFree
  /dev/mpath/3600144f04a7fccfc00003048344a6b00 b7825068-48e4-41ce-815c-b1566b9fde28 lvm2 a-   399.88G      0
  /dev/sda2                                    HostVG                               lvm2 a-    74.45G      0
  unknown device                               b7825068-48e4-41ce-815c-b1566b9fde28 lvm2 a-   199.88G 174.38G

This is a very expensive situation: if lvm cannot find a PV, it starts a rescan and all internal caching is flushed. If 20 processes run a rescan with direct IO, your storage is under high stress, and more and more reporting commands keep being spawned...

4) Because of these very slow reporting commands, lvchange waits for the exclusive flock() - yes, now it is a kind of starvation.
Why are you issuing several of the _same_ reporting commands at once?
On the examined affected systems, these commands run in parallel and need these locks:

Reporting (all locks are shared/read):
  pvs PV      - locks: global, <VG>
  vgs VG      - locks: <VG>
  lvs VG/LV   - locks: <VG>
  lvs @tag    - locks: all VGs in the system (only one at once)

Manipulation with an LV:
  lvchange -a y/n VG/LV     - locks: <VG> (exclusive/write)
  lvchange --refresh VG/LV  - locks: <VG> (exclusive/write)

Internal lvm2 locking was designed so that the READ lock in reporting commands is held only for a very short time. On the affected system, many reporting commands run in parallel. The first trigger of the problem is usually some storage issue which causes a reporting command to run too long, and the system then issues another monitoring command in parallel. So the shared lock is never released, and commands requiring the write lock starve.

Parts where the lvm code can help here:
- use a shared lock instead of an exclusive one for activation commands
- change the flock locking to provide better handling of the starvation problem

Parts which require configuration changes on your systems:
- every command issues a direct-IO read of the metadata (at least); isn't that a problem for your storage performance? If so, try to minimize issuing so many lvm commands
- use proper filtering - scan _only_ PV devices, never the LVs themselves. This is very important in situations where your underlying storage has problems (like the path failure above) and lvm tries to find a referenced PV in the system: if the filter contains /dev/mapper/*, it scans _all_ your LVs in the system with direct IO, which must be a very expensive operation
- note that reporting commands using a tag specification need to scan all VGs in the system (to check the tag inside the metadata)
We have identified possible fixes/workarounds in the lvm2 package.

The usual process is to commit them upstream and backport to RHEL; I think I will have an experimental (read: not QE-tested at all) rpm for RHEL soon. (The changes include optional prioritization of the exclusive flock operation, which basically avoids the starvation, changes to make vgextend block, and using a read-only lock for the lvchange --available and --refresh commands.)

But I think that using some rwlock in the caller, so it does not issue so many commands in parallel, is still a very good way to avoid problems - that way both sides play it safe.

If the RHEV team can help test such an experimental lvm2 rpm package, please let me know.
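For illustration, a minimal caller-side sketch of that idea, assuming all lvm invocations can be funnelled through one wrapper (the wrapper name and lock file path are made up):

#!/bin/sh
# lvm-serialized.sh - hypothetical caller-side wrapper: run every lvm command
# under one exclusive lock so a burst of reporting commands cannot pile up and
# starve a writer such as lvchange.
# Usage: lvm-serialized.sh /usr/sbin/lvchange --refresh VG/LV
LOCK=/var/lock/vdsm-lvm.lock    # lock file path is an assumption
exec flock -x "$LOCK" "$@"

Fully serializing trades concurrency for predictability; a finer-grained reader/writer scheme in the caller would need its own fairness handling.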
(In reply to comment #45)
> We have identified possible fixes/workarounds in lvm2 package.
>
> Usual process is commit them upstream and backport to RHEL, I think that I will
> have experimental (read: no QE test at all) rpm for RHEL soon.
> (Changes include possible using prioritization of exclusive flock operation,
> which basically avoids starvation, changes to vgextend to block and using read
> only lock for lvchange --available and --refresh commands).
>
> But I think that using some rwlock in caller to not issue so many commands in
> parallel is still very good way to avoid problems - so both sides plays safely.
>
> If RHEV team can help test such experimental lvm2 rpm package, please let me
> know.

If you have the "experimental" lvm2 rpm package ready, I can help test it under a database load with my RHEV configuration. I've baselined my database build with lvm2-2.02.46-8.el5 and would be interested to note any performance impact of the new package. Thanks.
> If you have the "experimental" lmv2 rpm package ready I can help test under a
> database load with my RHEV configuration.

See comment #48 for a link to the package; it has been available there since last Friday. (The comment is private because it points to an internal server.)
Oded, can you give us some feedback on this new lvm package?
We have not seen the past issues with this LVM on the Westford cluster for ~2 weeks now, so I guess we can say this is fixed.
Patch in lvm2-2_02_46-9_el5.
Marking this bug fixed based on comment #57. I don't currently have access to hardware to test this fix.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0298.html