| Summary: | OOM when creating 27K DM devices due to excess slab memory use in RHEL 6.1 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | LVM and device-mapper development team <lvm-team> |
| Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.1 | CC: | agk, dwysocha, heinzm, jbrassow, kay, mbroz, msnitzer, pknirsch, prajnoha, prockai, thornber, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-09-23 20:00:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description
Corey Marthaler
2011-04-25 18:18:28 UTC
How much memory did the system in question have? For 27000 LVs I can certainly see that a smaller system might run out of memory, as udev has to spawn sub-processes for the devices IIRC, and there isn't much we can do about that. There used to be a way to serialize udev's work queue, but that obviously leads to very long startup times, which is the reason the work was parallelized in the last few years.

Thanks & regards, Phil

[root@taft-01 ~]# cat /proc/meminfo
MemTotal: 8181340 kB

Hm, 8GB should be more than sufficient, even for such a large number of LVs. Do you have an easy way to reproduce this via a script? Harald is currently on PTO till the end of the week, but if he has a reproducer by then he can directly investigate whether this is a udev problem or whether some LVM tool further down the chain is running OOM.

Thanks & regards, Phil

Can you experiment with the following kernel command line parameter?
udev.children-max=
It limits the number of udev events executed in parallel.
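(For illustration only, not from the original report: one way to try the parameter on a RHEL 6 host is to append it to the kernel line of the GRUB legacy config and reboot. The kernel version, root device, and the value of 4 below are placeholders, not taken from this bug.)

# tail -n4 /boot/grub/grub.conf        (example entry; paths and version are placeholders)
title Red Hat Enterprise Linux (2.6.32-131.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-131.el6.x86_64 ro root=/dev/mapper/vg_taft-lv_root udev.children-max=4
        initrd /initramfs-2.6.32-131.el6.x86_64.img

After a reboot, confirm the parameter was picked up:

# grep -o 'udev.children-max=[0-9]*' /proc/cmdline
udev.children-max=4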
An easy way to reproduce this is just running the creates. Keep in mind that the PV MDA size needs to be bumped way up to support this many volumes, and that this takes quite a few hours.

# pvcreate --metadatasize 1G /dev/sd[bcdefgh]1
# vgcreate TAFT /dev/sd[bcdefgh]1
# for i in $(seq 1 28000); do lvcreate -n lv$i -L 12M TAFT; done

You don't run any 'desktop' software on that box, right? No graphical login with auto-mounter functionality and similar? That is known not to survive such massive numbers of devices.

Last time I checked, I created ~40,000 block devices on a 4 GB machine with scsi_debug just fine. I wouldn't be surprised if it's an issue with device-mapper taking that much memory.

There are no "desktop" apps running on these test boxes.

Most likely lvm is consuming the memory.

(In reply to comment #6)
> You don't run any 'desktop' software on that box, right? No graphical login
> with auto-mounter functionality and similar? That is known to not survive such
> massive numbers of devices.
>
> Last time I checked, I created ~40,000 block devices on a 4 GB machine with
> scsi_debug just fine.

OK, was that test performed against RHEL6?

> I wouldn't be surprised if it's an issue with device-mapper taking that much
> memory.

(In reply to comment #8)
> most likely lvm consuming the memory

Anything is possible, but one really should _verify_ it is DM and/or lvm2 before reassigning it. Wishful thinking, I guess.

I have a script that approximates the amount of kernel memory DM consumes:
http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/bio_vs_rq_slab_usage.py

Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like Corey created using the procedure in comment #5):

./bio_vs_rq_slab_usage.py 27000 256
bios-based:
  bio-0           84.375 MB
  biovec-256      1687.5 MB
  bip-256         3375.0 MB
  dm_io           45.60546875 MB
  dm_target_io    11.71875 MB
total: 5204.19921875 MB

(And an obscene 85G would be needed if one tried 27000 rq-based mpath devices; 54G of that is purely for the DIF/DIX, but it is allocated conditionally, so really "only" 30G is needed if DIF/DIX-capable storage isn't detected.)

request-based:
  bio-0                 1350.0 MB
  biovec-256            27000.0 MB
  bip-256               54000.0 MB
  dm_mpath_io           133.66015625 MB
  dm_rq_clone_bio_info  133.66015625 MB
  dm_rq_target_io       2700.0 MB
total: 85317.3203125 MB

It will be interesting to see Corey's results, to see how closely my script reflects reality.

(In reply to comment #10)
> Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like
> Corey created using the procedure in comment #5):
>
> ./bio_vs_rq_slab_usage.py 27000 256
> bios-based:
>   bio-0           84.375 MB
>   biovec-256      1687.5 MB
>   bip-256         3375.0 MB
>   dm_io           45.60546875 MB
>   dm_target_io    11.71875 MB
> total: 5204.19921875 MB

Ah, bip-256 is purely for DIF/DIX as well, so the estimated memory usage for most existing storage is ~1.8GB.

> It will be interesting to see Corey's results, to see how closely my script
> reflects reality.

However, Corey reported actual DM-only memory usage of ~1.6GB. It seems my script is overestimating by 200MB (~12%); I'll look to reproduce to sort out why. But 1.6GB leaves 6.4GB of memory (from the original 8GB) for udev and lvm2 to work with. So we need to look closer at the memory usage when lvm/udev is introduced.

I ran the following loop up to just over 11K LVs. I'll attach the system info from the time I finally stopped it.

for i in $(seq 1 27000); do lvcreate -n lv$i -L 12M -an -Z n TAFT; done
[...]
Logical volume "lv11049" created
WARNING: "lv11050" not zeroed
Logical volume "lv11050" created
WARNING: "lv11051" not zeroed
Logical volume "lv11051" created

2.6.32-198.el6.x86_64
lvm2-2.02.87-2.1.el6                      BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-libs-2.02.87-2.1.el6                 BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-cluster-2.02.87-2.1.el6              BUILT: Wed Sep 14 09:44:16 CDT 2011
udev-147-2.38.el6                         BUILT: Fri Sep  9 16:25:50 CDT 2011
device-mapper-1.02.66-2.1.el6             BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-libs-1.02.66-2.1.el6        BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-1.02.66-2.1.el6       BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-libs-1.02.66-2.1.el6  BUILT: Wed Sep 14 09:44:16 CDT 2011
cmirror-2.02.87-2.1.el6                   BUILT: Wed Sep 14 09:44:16 CDT 2011

Created attachment 524472 [details]
output from slabinfo
Created attachment 524473 [details]
output from meminfo
Created attachment 524474 [details]
output from vmstat
Created attachment 524475 [details]
output from free
Created attachment 524476 [details]
output from slabtop -s c -o
Created attachment 524477 [details]
output from ps aux
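(Not part of the original report: a rough way to total the DM-related slab caches captured in the slabinfo/slabtop attachments above is to sum allocated objects times object size from /proc/slabinfo. The cache-name pattern is an assumption and may need adjusting if caches are merged or named differently on a given kernel; columns 3 and 4 are num_objs and objsize in the slabinfo 2.x format.)

# echo 3 > /proc/sys/vm/drop_caches
# awk '$1 ~ /^(bio|biovec|bip)-|^dm_/ {kb += $3*$4/1024} END {printf "DM-related slab: %.1f MB\n", kb/1024}' /proc/slabinfo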
It should be noted that the procedure used in comment #12 doesn't do _any_ device activation, so neither DM (kernel) nor udev events will be firing; I'm not sure how valid a test comment #12 is. But it is bizarre that so much memory is consumed by doing this test!

The following attachments were taken after an echo 3 > /proc/sys/vm/drop_caches.

Created attachment 524481 [details]
output from slabinfo
Created attachment 524482 [details]
output from meminfo
Created attachment 524483 [details]
output from vmstat
Created attachment 524484 [details]
output from free
Created attachment 524485 [details]
output from slabtop -s c -o
Created attachment 524486 [details]
output from ps aux
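(Not part of the original report: the next set of attachments was captured after activating all of the LVs, but the exact activation command isn't shown in the thread. On this setup it would presumably have been something like the following, where TAFT is the volume group from the reproduction steps.)

# vgchange -ay TAFT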
The following attachments were taken after activating all 11K LVs and doing an echo 3 > /proc/sys/vm/drop_caches.

Created attachment 524488 [details]
output from slabinfo
Created attachment 524489 [details]
output from meminfo
Created attachment 524490 [details]
output from vmstat
Created attachment 524491 [details]
output from free
Created attachment 524492 [details]
output from slabtop -s c -o
Created attachment 524493 [details]
output from ps aux
Created attachment 524494 [details]
lvm.conf file from taft-01
(In reply to comment #32)
> Created attachment 524492 [details]
> output from slabtop -s c -o

Assuming 'biovec-256' is addressed by this patch:
https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

Unsure why 'sysfs_dir_cache' is so large?

(In reply to comment #19)
> It should be noted that the procedure used in comment #12 doesn't do _any_
> device activation, so neither DM (kernel) nor udev events will be firing;
> I'm not sure how valid a test comment #12 is.
>
> But it is bizarre that so much memory is consumed by doing this test!

Since metadata archiving is enabled in this case - and the MD (metadata) for 12000 LVs is approaching 4MB and you write it 12000 times - it's not so unexpected, I guess.

(In reply to comment #35)
> Assuming 'biovec-256' is addressed by this patch:
> https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

It wasn't really a question, but: yes, it is already applied to RHEL 6.2.

> Unsure why 'sysfs_dir_cache' is so large?

I am not sure why either, but I can confirm I saw the same using dmsetup create (no lvm) to create 24000 linear devices. The total number of sysfs_dir_cache objects is ~64 times the number of devices.

(In reply to comment #37)
> I am not sure why either, but I can confirm I saw the same using dmsetup create
> (no lvm) to create 24000 linear devices. The total number of sysfs_dir_cache
> objects is ~64 times the number of devices.

Sure enough, there are 64 calls to sysfs_new_dirent() for each DM device that is created.

# ./trace-cmd record -p function -l sysfs_new_dirent dmsetup create test27001 --table "0 16384 linear 8:64 0"
# ./trace-cmd report | grep sysfs_new_dirent | wc -l
64

Creating 27000 linear DM devices using dmsetup results in the following memory use (after dropping caches) -- just over 4.2GB in use:

MemTotal:        7158020 kB
MemFree:         2809736 kB
Buffers:            2956 kB
Cached:           138960 kB
SwapCached:            0 kB
Active:            36288 kB
Inactive:         126812 kB
Active(anon):      25064 kB
Inactive(anon):   108016 kB
Active(file):      11224 kB
Inactive(file):    18796 kB
Unevictable:        6192 kB
Mlocked:            6192 kB
SwapTotal:        524280 kB
SwapFree:         524280 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         27316 kB
Mapped:            14480 kB
Shmem:            108220 kB
Slab:            3720916 kB
SReclaimable:     295884 kB
SUnreclaim:      3425032 kB
KernelStack:      217248 kB
PageTables:         4044 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4103288 kB
Committed_AS:     235164 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      267652 kB
VmallocChunk:   34359283952 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8180 kB
DirectMap2M:     7331840 kB

We could save ~450MB if the bio-based DM reserves were uniformly reduced to 16 with the following patch:
https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

reserve of 16:
  dm_io          18.33984375 MB
  dm_target_io   11.71875 MB

vs. reserve of 256:
  dm_io          293.4765625 MB
  dm_target_io   187.5 MB

(In reply to comment #37)
> (In reply to comment #35)
> > Assuming 'biovec-256' is addressed by this patch:
> > https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html
>
> It wasn't really a question, but: yes, it is already applied to RHEL 6.2.

I thought you were talking about bip-256, and just assumed the URL you provided was about the integrity reduction we get from not allocating an integrity profile for devices that don't support DIF/DIX.
As I mentioned at the end of comment #39, the patch you referenced will only reduce the 'dm_io' and 'dm_target_io' slabs.

So, coming full circle on this: the original report (comment #0) was against the RHEL 6.1 kernel. That kernel does _not_ have the DIF/DIX (bip-256) slab reduction changes that went into RHEL 6.2 (via bug#697992 and rhel6.git commit d587012e830 specifically). Without that DIF/DIX patch, DM devices allocate considerably more slab (primarily in bip-256) -- and that explains why RHEL 6.1 was exhausting 8GB. There was never a udev leak.

Changing the bug summary and closing NOTABUG.
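(A back-of-the-envelope check, not part of the original report: dividing the bio-based estimates quoted earlier in the thread by the 27000 devices shows the per-device biovec reserve and the additional integrity (bip-256) reserve, and multiplying the ~64 sysfs dirents per device gives the sysfs_dir_cache object count discussed above. The calculation below only re-derives numbers already quoted in the thread.)

# awk 'BEGIN { n=27000; printf "biovec-256: %g KB/dev, bip-256: %g KB/dev, sysfs dirents: %d\n", 1687.5*1024/n, 3375*1024/n, 64*n }'
biovec-256: 64 KB/dev, bip-256: 128 KB/dev, sysfs dirents: 1728000

The bip-256 reserve being twice the biovec-256 reserve per device is consistent with the conclusion above: on a RHEL 6.1 kernel that always allocates the DIF/DIX integrity pools, the estimate grows from ~1.8GB to ~5.2GB for 27000 bio-based devices, which is what pushed the 8GB box into OOM.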