Description of problem:

I was sequentially creating LVs in an attempt to see how well LVM scales, and after 27000 volumes, udev was killed.

[root@taft-01 ~]# lvs | wc -l
27135

Apr 22 05:53:58 taft-01 kernel: udevd invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=-17
Apr 22 05:54:00 taft-01 kernel: udevd cpuset=/ mems_allowed=0
Apr 22 05:54:00 taft-01 kernel: Pid: 10869, comm: udevd Not tainted 2.6.32-131.0.1.el6.x86_64 #1
Apr 22 05:54:00 taft-01 kernel: Call Trace:
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810c0121>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811102db>] ? oom_kill_process+0xcb/0x2e0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811108a0>] ? select_bad_process+0xd0/0x110
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81110938>] ? __out_of_memory+0x58/0xc0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81110b39>] ? out_of_memory+0x199/0x210
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811202fd>] ? __alloc_pages_nodemask+0x80d/0x8b0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81159952>] ? kmem_getpages+0x62/0x170
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115a56a>] ? fallback_alloc+0x1ba/0x270
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81159fbf>] ? cache_grow+0x2cf/0x320
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115a2e9>] ? ____cache_alloc_node+0x99/0x160
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115ac4b>] ? kmem_cache_alloc+0x11b/0x190
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81064f19>] ? copy_process+0xc9/0x1300
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115ab15>] ? kmem_cache_alloc_notrace+0x115/0x130
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115acb2>] ? kmem_cache_alloc+0x182/0x190
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810661e4>] ? do_fork+0x94/0x480
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8118f0b2>] ? alloc_fd+0x92/0x160
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8116f617>] ? fd_install+0x47/0x90
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8117cfaf>] ? do_pipe_flags+0xcf/0x130
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810d1b82>] ? audit_syscall_entry+0x272/0x2a0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81009588>] ? sys_clone+0x28/0x30
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8100b493>] ? stub_clone+0x13/0x20
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
Apr 22 05:54:00 taft-01 kernel: Mem-Info:
Apr 22 05:54:00 taft-01 kernel: Node 0 DMA per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: Node 0 DMA32 per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: Node 0 Normal per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:  186, btch:  31 usd:  27
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:  186, btch:  31 usd:  50
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: active_anon:4501 inactive_anon:840 isolated_anon:647
Apr 22 05:54:00 taft-01 kernel: active_file:37 inactive_file:91 isolated_file:32
Apr 22 05:54:00 taft-01 kernel: unevictable:6766 dirty:1 writeback:676 unstable:0
Apr 22 05:54:00 taft-01 kernel: free:40862 slab_reclaimable:84697 slab_unreclaimable:1797010
Apr 22 05:54:00 taft-01 kernel: mapped:826 shmem:135 pagetables:3564 bounce:0

Version-Release number of selected component (if applicable):
2.6.32-131.0.1.el6.x86_64

lvm2-2.02.83-3.el6                        BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6                   BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6                BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6                         BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6               BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6          BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6         BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6                     BUILT: Fri Mar 18 09:31:10 CDT 2011
How much memory did the system in question have?

For 27000 LVs I can certainly see that a smaller system might run out of memory, as udev has to spawn subprocesses for the devices IIRC, and there isn't much we can do about that. There used to be a way to serialize udev's work queue, but that obviously leads to very long startup times, which is why the work was parallelized in the last few years.

Thanks & regards,
Phil
[root@taft-01 ~]# cat /proc/meminfo
MemTotal:        8181340 kB
Hm, 8GB should be more than sufficient, even for such a large number of LVs.

Do you have an easy way to reproduce this via a script? Harald is currently on PTO till the end of the week, but if he had a reproducer by then he could investigate whether this is really a udev problem or whether something further down the chain, such as the LVM tools, is running OOM.

Thanks & regards,
Phil
Can you experiment with the following kernel command line parameter?

udev.children-max=

It limits the number of udev events executed in parallel.
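For reference, a sketch of how that parameter might be applied on a RHEL 6 host, by appending it to the kernel line in /boot/grub/grub.conf; the value of 8 below is purely illustrative and the root= argument is abbreviated:

  # /boot/grub/grub.conf - append the option to the existing kernel line
  kernel /vmlinuz-2.6.32-131.0.1.el6.x86_64 ro root=... udev.children-max=8

A reboot is then needed for the new value to take effect.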
An easy way to reproduce this is just running the creates. Keep in mind that the PV MDA size needs to be bumped way up to support this many volumes, and that this takes quite a few hours.

# pvcreate --metadatasize 1G /dev/sd[bcdefgh]1
# vgcreate TAFT /dev/sd[bcdefgh]1
# for i in $(seq 1 28000); do lvcreate -n lv$i -L 12M TAFT; done
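While the loop runs, it may help to log memory and slab usage over time so the point of exhaustion is visible afterwards. A minimal sketch using nothing beyond /proc/meminfo (the log file name and interval are arbitrary):

  while true; do
      date
      grep -E '^(MemFree|Slab|SUnreclaim):' /proc/meminfo
      sleep 60
  done >> /tmp/meminfo-trace.log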
You don't run any 'desktop' software on that box, right? No graphical login with auto-mounter functionality or similar? That is known not to survive such massive numbers of devices.

Last time I checked, I created ~40,000 block devices on a 4 GB machine with scsi_debug just fine.

I wouldn't be surprised if it's an issue with device-mapper taking that much memory.
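For comparison, a rough sketch of how a scsi_debug test on that scale could be set up; the module parameters and counts below are illustrative assumptions, not the exact invocation used:

  # ~40,000 SCSI LUNs: 4 hosts x 10 targets x 1000 LUNs, tiny backing store
  modprobe scsi_debug add_host=4 num_tgts=10 max_luns=1000 dev_size_mb=8
  ls /sys/class/scsi_device | wc -l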
There are no "desktop" apps running on these test boxes.
most likely lvm consuming the memory
(In reply to comment #6)
> You don't run any 'desktop' software on that box, right? No graphical login
> with auto-mounter functionality and similar? That is known to not survive such
> massive numbers of devices.
>
> Last time I checked, I created ~40.000 block devices on a 4 GB machine with
> scsi_debug just fine.

OK, was that test performed against RHEL6?

> I wouldn't be surprised if it's an issue with device-mapper taking that much
> memory.

(In reply to comment #8)
> most likely lvm consuming the memory

Anything is possible but one really should _verify_ it is DM and/or lvm2 before reassigning it. Wishful thinking I guess.
I have a script to approximate the amount of kernel memory DM consumes:
http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/bio_vs_rq_slab_usage.py

Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like corey created using comment#5's procedure):

./bio_vs_rq_slab_usage.py 27000 256
bios-based:
bio-0           84.375 MB
biovec-256      1687.5 MB
bip-256         3375.0 MB
dm_io           45.60546875 MB
dm_target_io    11.71875 MB

total: 5204.19921875 MB

(and an obscene 85G is needed if one tried 27000 rq-based mpath devices, 54G is purely for the DIF/DIX but it is allocated conditionally so really "only" 30G is needed if DIF/DIX capable storage isn't detected)

request-based:
bio-0                   1350.0 MB
biovec-256              27000.0 MB
bip-256                 54000.0 MB
dm_mpath_io             133.66015625 MB
dm_rq_clone_bio_info    133.66015625 MB
dm_rq_target_io         2700.0 MB

total: 85317.3203125 MB

Will be interesting to see Corey's results to see how closely my script reflects reality.
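For readers without the script handy, such an estimate presumably boils down to multiplying the per-device mempool reserve by the per-object size and the device count for each slab cache. A sketch of that arithmetic; the 128-byte object size is a placeholder, not a value taken from the script, and real sizes come from the objsize column of /proc/slabinfo:

  # one slab cache: devices * reserved objects per device * object size (bytes)
  awk -v devs=27000 -v reserve=256 -v objsize=128 \
      'BEGIN { printf "%.2f MB\n", devs * reserve * objsize / 1024 / 1024 }'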
(In reply to comment #10)
> Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like
> corey created using comment#5's procedure):
>
> ./bio_vs_rq_slab_usage.py 27000 256
> bios-based:
> bio-0           84.375 MB
> biovec-256      1687.5 MB
> bip-256         3375.0 MB
> dm_io           45.60546875 MB
> dm_target_io    11.71875 MB
>
> total: 5204.19921875 MB

Ah, bip-256 is purely for DIF/DIX also, so the estimated memory usage for most existing storage is ~1.8GB.

> Will be interesting to see Corey's results to see how closely my script
> reflects reality.

Though Corey reported actual DM-only memory usage of ~1.6GB. Seems my script is overestimating by 200MB (~12%); I'll look to reproduce to sort out why.

But 1.6GB leaves 6.4GB of memory (from the original 8GB) for udev and lvm2 to work with. So we need to look closer at the memory usage when lvm/udev is introduced.
I ran the following loop up to over 11K LVs. I'll attach the system info at the time I finally stopped it.

for i in $(seq 1 27000); do lvcreate -n lv$i -L 12M -an -Z n TAFT; done

[...]
  Logical volume "lv11049" created
  WARNING: "lv11050" not zeroed
  Logical volume "lv11050" created
  WARNING: "lv11051" not zeroed
  Logical volume "lv11051" created

2.6.32-198.el6.x86_64

lvm2-2.02.87-2.1.el6                        BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-libs-2.02.87-2.1.el6                   BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-cluster-2.02.87-2.1.el6                BUILT: Wed Sep 14 09:44:16 CDT 2011
udev-147-2.38.el6                           BUILT: Fri Sep 9 16:25:50 CDT 2011
device-mapper-1.02.66-2.1.el6               BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-libs-1.02.66-2.1.el6          BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-1.02.66-2.1.el6         BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-libs-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
cmirror-2.02.87-2.1.el6                     BUILT: Wed Sep 14 09:44:16 CDT 2011
Created attachment 524472 [details] output from slabinfo
Created attachment 524473 [details] output from meminfo
Created attachment 524474 [details] output from vmstat
Created attachment 524475 [details] output from free
Created attachment 524476 [details] output from slabtop -s c -o
Created attachment 524477 [details] output from ps aux
It should be noted that the procedure used in comment#12 doesn't do _any_ device activation... so neither DM (kernel) nor udev events will be firing... not sure how valid a test comment#12 is. But it is bizarre that so much memory is consumed by doing this test!
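If anyone wants to confirm that no events fire, a simple check is to leave udev's monitor running in a second terminal while the -an loop from comment#12 executes (a sketch using standard udev tooling):

  # with activation skipped there should be no "add" uevents for the new LVs
  udevadm monitor --kernel --udev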
The following attachments were taken after an 'echo 3 > /proc/sys/vm/drop_caches'.
Created attachment 524481 [details] output from slabinfo
Created attachment 524482 [details] output from meminfo
Created attachment 524483 [details] output from vmstat
Created attachment 524484 [details] output from free
Created attachment 524485 [details] output from slabtop -s c -o
Created attachment 524486 [details] output from ps aux
The following attachments were taken after activating all 11K LVs and doing an echo 3 > /proc/sys/vm/drop_caches.
Created attachment 524488 [details] output from slabinfo
Created attachment 524489 [details] output from meminfo
Created attachment 524490 [details] output from vmstat
Created attachment 524491 [details] output from free
Created attachment 524492 [details] output from slabtop -s c -o
Created attachment 524493 [details] output from ps aux
Created attachment 524494 [details] lvm.conf file from taft-01
(In reply to comment #32)
> Created attachment 524492 [details]
> output from slabtop -s c -o

Assuming 'biovec-256' is addressed by this patch:

https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

Unsure why 'sysfs_dir_cache' is so large?
(In reply to comment #19)
> It should be noted that the procedure used in comment#12 doesn't do _any_
> device activation... so neither DM (kernel) nor udev events will be firing...
> not sure how valid a test comment#12 is.
>
> But it is bizarre that so much memory is consumed by doing this test!

Since metadata archiving is enabled in this case, and the metadata for 12000 LVs is approaching 4MB and gets written out 12000 times, it's not so unexpected, I guess.
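If the archiving overhead needs to be ruled out, the relevant knob is the 'archive' setting in the backup section of /etc/lvm/lvm.conf. A sketch of disabling it for a test run (the reporter's actual lvm.conf is attached above):

  backup {
      # skip writing an archive copy of the metadata on every LV create
      archive = 0
  }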
(In reply to comment #35) > (In reply to comment #32) > > Created attachment 524492 [details] > > output from slabtop -s c -o > > Assuming 'biovec-256' is addressed by this patch: > > https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html Wasn't really a question but: Yes it is already applied to RHEL6.2 > Unsure why 'sysfs_dir_cache' is so large ? I am not sure why either but I can confirm I saw the same using dmsetup create (no lvm) to create 24000 linear devices. Total sysfs_dir_cache objects is ~64 times the number of devices.
(In reply to comment #37)
> (In reply to comment #35)
> > Unsure why 'sysfs_dir_cache' is so large?
>
> I am not sure why either but I can confirm I saw the same using dmsetup create
> (no lvm) to create 24000 linear devices. Total sysfs_dir_cache objects is ~64
> times the number of devices.

Sure enough, there are 64 calls to sysfs_new_dirent() for each DM device that is created.

# ./trace-cmd record -p function -l sysfs_new_dirent dmsetup create test27001 --table "0 16384 linear 8:64 0"
# ./trace-cmd report | grep sysfs_new_dirent | wc -l
64
Creating 27000 linear DM devices using dmsetup results in the following memory use (after dropping caches) -- just over 4.2GB in use:

MemTotal:        7158020 kB
MemFree:         2809736 kB
Buffers:            2956 kB
Cached:           138960 kB
SwapCached:            0 kB
Active:            36288 kB
Inactive:         126812 kB
Active(anon):      25064 kB
Inactive(anon):   108016 kB
Active(file):      11224 kB
Inactive(file):    18796 kB
Unevictable:        6192 kB
Mlocked:            6192 kB
SwapTotal:        524280 kB
SwapFree:         524280 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         27316 kB
Mapped:            14480 kB
Shmem:            108220 kB
Slab:            3720916 kB
SReclaimable:     295884 kB
SUnreclaim:      3425032 kB
KernelStack:      217248 kB
PageTables:         4044 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4103288 kB
Committed_AS:     235164 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      267652 kB
VmallocChunk:   34359283952 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8180 kB
DirectMap2M:     7331840 kB

Could save ~450MB if the bio-based DM reserves were uniformly reduced to 16 with the following patch:
https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

reserve of 16:
dm_io           18.33984375 MB
dm_target_io    11.71875 MB

vs. reserve of 256:
dm_io           293.4765625 MB
dm_target_io    187.5 MB
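As a sanity check of the ~450MB figure, subtracting the reserve-of-16 pools from the reserve-of-256 pools listed above:

  echo "(293.4765625 - 18.33984375) + (187.5 - 11.71875)" | bc
  450.91796875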
(In reply to comment #37)
> (In reply to comment #35)
> > Assuming 'biovec-256' is addressed by this patch:
> >
> > https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html
>
> Wasn't really a question but: Yes it is already applied to RHEL6.2

I thought you were talking about bip-256 and just assumed the url you provided was about the integrity reduction we get from not allocating an integrity profile for devices that don't support DIF/DIX. As I mentioned at the end of comment#39, the patch you referenced will only reduce the 'dm_io' and 'dm_target_io' slabs.

So, coming full circle on this: the original report (comment#0) was against the RHEL6.1 kernel. That kernel does _not_ have the DIF/DIX (bip-256) slab reduction changes that went into RHEL6.2 (via bug#697992 and rhel6.git commit d587012e830 specifically). Without that DIF/DIX patch, DM devices allocate considerably more slab (primarily in bip-256), and that explains why RHEL6.1 was exhausting 8GB. There was never a udev leak.

Changing subject and closing NOTABUG.