Description of problem:

I was sequentially creating LVs in an attempt to see how well LVM scales, and after 27000 volumes, udev was killed.

[root@taft-01 ~]# lvs | wc -l
27135

Apr 22 05:53:58 taft-01 kernel: udevd invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=-17
Apr 22 05:54:00 taft-01 kernel: udevd cpuset=/ mems_allowed=0
Apr 22 05:54:00 taft-01 kernel: Pid: 10869, comm: udevd Not tainted 2.6.32-131.0.1.el6.x86_64 #1
Apr 22 05:54:00 taft-01 kernel: Call Trace:
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810c0121>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811102db>] ? oom_kill_process+0xcb/0x2e0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811108a0>] ? select_bad_process+0xd0/0x110
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81110938>] ? __out_of_memory+0x58/0xc0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81110b39>] ? out_of_memory+0x199/0x210
Apr 22 05:54:00 taft-01 kernel: [<ffffffff811202fd>] ? __alloc_pages_nodemask+0x80d/0x8b0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81159952>] ? kmem_getpages+0x62/0x170
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115a56a>] ? fallback_alloc+0x1ba/0x270
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81159fbf>] ? cache_grow+0x2cf/0x320
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115a2e9>] ? ____cache_alloc_node+0x99/0x160
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115ac4b>] ? kmem_cache_alloc+0x11b/0x190
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81064f19>] ? copy_process+0xc9/0x1300
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115ab15>] ? kmem_cache_alloc_notrace+0x115/0x130
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8115acb2>] ? kmem_cache_alloc+0x182/0x190
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810661e4>] ? do_fork+0x94/0x480
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8118f0b2>] ? alloc_fd+0x92/0x160
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8116f617>] ? fd_install+0x47/0x90
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8117cfaf>] ? do_pipe_flags+0xcf/0x130
Apr 22 05:54:00 taft-01 kernel: [<ffffffff810d1b82>] ? audit_syscall_entry+0x272/0x2a0
Apr 22 05:54:00 taft-01 kernel: [<ffffffff81009588>] ? sys_clone+0x28/0x30
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8100b493>] ? stub_clone+0x13/0x20
Apr 22 05:54:00 taft-01 kernel: [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
Apr 22 05:54:00 taft-01 kernel: Mem-Info:
Apr 22 05:54:00 taft-01 kernel: Node 0 DMA per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:    0, btch:   1 usd:   0
Apr 22 05:54:00 taft-01 kernel: Node 0 DMA32 per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: Node 0 Normal per-cpu:
Apr 22 05:54:00 taft-01 kernel: CPU    0: hi:  186, btch:  31 usd:  27
Apr 22 05:54:00 taft-01 kernel: CPU    1: hi:  186, btch:  31 usd:  50
Apr 22 05:54:00 taft-01 kernel: CPU    2: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: CPU    3: hi:  186, btch:  31 usd:   0
Apr 22 05:54:00 taft-01 kernel: active_anon:4501 inactive_anon:840 isolated_anon:647
Apr 22 05:54:00 taft-01 kernel: active_file:37 inactive_file:91 isolated_file:32
Apr 22 05:54:00 taft-01 kernel: unevictable:6766 dirty:1 writeback:676 unstable:0
Apr 22 05:54:00 taft-01 kernel: free:40862 slab_reclaimable:84697 slab_unreclaimable:1797010
Apr 22 05:54:00 taft-01 kernel: mapped:826 shmem:135 pagetables:3564 bounce:0

Version-Release number of selected component (if applicable):
2.6.32-131.0.1.el6.x86_64

lvm2-2.02.83-3.el6                        BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6                   BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6                BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6                         BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6               BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6          BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6         BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6                     BUILT: Fri Mar 18 09:31:10 CDT 2011
How much memory did the system in question have?

For 27000 LVs I can certainly see that a smaller system might run out of memory, as udev has to spawn subprocesses for the devices IIRC, and there isn't much we can do about that. There used to be a way to serialize udev's work queue, but that obviously leads to very long startup times, which is why the work was parallelized in the last few years.

Thanks & regards,
Phil
[root@taft-01 ~]# cat /proc/meminfo
MemTotal:        8181340 kB
Hm, 8GB should be more than sufficient, even for such a large number of LVs.

Do you have an easy way to reproduce this via a script? Harald is currently on PTO till the end of the week, but if he had a reproducer by then he could investigate whether this is really a udev problem or whether something further down the chain, such as the LVM tools, is running OOM.

Thanks & regards,
Phil
Can you experiment with the following kernel command line parameter?

udev.children-max=

It limits the number of udev events executed in parallel.
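For reference, a sketch of how that parameter might be applied on a RHEL 6 host, by appending it to the kernel line in /boot/grub/grub.conf; the value of 8 below is purely illustrative and the root= argument is abbreviated:

  # /boot/grub/grub.conf - append the option to the existing kernel line
  kernel /vmlinuz-2.6.32-131.0.1.el6.x86_64 ro root=... udev.children-max=8

A reboot is then needed for the new value to take effect.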
An easy way to reproduce this is just running the creates. Keep in mind that the PV MDA size needs to be bumped way up to support this many volumes, and that this takes quite a few hours.

# pvcreate --metadatasize 1G /dev/sd[bcdefgh]1
# vgcreate TAFT /dev/sd[bcdefgh]1
# for i in $(seq 1 28000); do lvcreate -n lv$i -L 12M TAFT; done
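While the loop runs, it may help to log memory and slab usage over time so the point of exhaustion is visible afterwards. A minimal sketch using nothing beyond /proc/meminfo (the log file name and interval are arbitrary):

  while true; do
      date
      grep -E '^(MemFree|Slab|SUnreclaim):' /proc/meminfo
      sleep 60
  done >> /tmp/meminfo-trace.log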
You don't run any 'desktop' software on that box, right? No graphical login with auto-mounter functionality or similar? That is known not to survive such massive numbers of devices.

Last time I checked, I created ~40,000 block devices on a 4 GB machine with scsi_debug just fine.

I wouldn't be surprised if it's an issue with device-mapper taking that much memory.
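For comparison, a rough sketch of how a scsi_debug test on that scale could be set up; the module parameters and counts below are illustrative assumptions, not the exact invocation used:

  # ~40,000 SCSI LUNs: 4 hosts x 10 targets x 1000 LUNs, tiny backing store
  modprobe scsi_debug add_host=4 num_tgts=10 max_luns=1000 dev_size_mb=8
  ls /sys/class/scsi_device | wc -l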
There are no "desktop" apps running on these test boxes.
most likely lvm consuming the memory
(In reply to comment #6)
> You don't run any 'desktop' software on that box, right? No graphical login
> with auto-mounter functionality and similar? That is known to not survive such
> massive numbers of devices.
>
> Last time I checked, I created ~40.000 block devices on a 4 GB machine with
> scsi_debug just fine.

OK, was that test performed against RHEL6?

> I wouldn't be surprised if it's an issue with device-mapper taking that much
> memory.

(In reply to comment #8)
> most likely lvm consuming the memory

Anything is possible but one really should _verify_ it is DM and/or lvm2 before reassigning it. Wishful thinking I guess.
I have a script to approximate the amount of kernel memory DM consumes:
http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/bio_vs_rq_slab_usage.py

Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like corey created using comment#5's procedure):

./bio_vs_rq_slab_usage.py 27000 256
bios-based:
bio-0           84.375 MB
biovec-256      1687.5 MB
bip-256         3375.0 MB
dm_io           45.60546875 MB
dm_target_io    11.71875 MB

total: 5204.19921875 MB

(and an obscene 85G is needed if one tried 27000 rq-based mpath devices, 54G is purely for the DIF/DIX but it is allocated conditionally so really "only" 30G is needed if DIF/DIX capable storage isn't detected)

request-based:
bio-0                   1350.0 MB
biovec-256              27000.0 MB
bip-256                 54000.0 MB
dm_mpath_io             133.66015625 MB
dm_rq_clone_bio_info    133.66015625 MB
dm_rq_target_io         2700.0 MB

total: 85317.3203125 MB

Will be interesting to see Corey's results to see how closely my script reflects reality.
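For readers without the script handy, such an estimate presumably boils down to multiplying the per-device mempool reserve by the per-object size and the device count for each slab cache. A sketch of that arithmetic; the 128-byte object size is a placeholder, not a value taken from the script, and real sizes come from the objsize column of /proc/slabinfo:

  # one slab cache: devices * reserved objects per device * object size (bytes)
  awk -v devs=27000 -v reserve=256 -v objsize=128 \
      'BEGIN { printf "%.2f MB\n", devs * reserve * objsize / 1024 / 1024 }'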
(In reply to comment #10)
> Approximately 5.2GB of memory is needed for 27000 bio-based DM devices (like
> corey created using comment#5's procedure):
>
> ./bio_vs_rq_slab_usage.py 27000 256
> bios-based:
> bio-0           84.375 MB
> biovec-256      1687.5 MB
> bip-256         3375.0 MB
> dm_io           45.60546875 MB
> dm_target_io    11.71875 MB
>
> total: 5204.19921875 MB

Ah, bip-256 is purely for DIF/DIX also, so the estimated memory usage for most existing storage is ~1.8GB.

> Will be interesting to see Corey's results to see how closely my script
> reflects reality.

Though Corey reported actual DM-only memory usage of ~1.6GB. Seems my script is overestimating by 200MB (~12%); I'll look to reproduce to sort out why.

But 1.6GB leaves 6.4GB of memory (from the original 8GB) for udev and lvm2 to work with. So we need to look closer at the memory usage when lvm/udev is introduced.
I ran the following loop up to over 11K LVs. I'll attach the system info at the time I finally stopped it.

for i in $(seq 1 27000); do lvcreate -n lv$i -L 12M -an -Z n TAFT; done

[...]
  Logical volume "lv11049" created
  WARNING: "lv11050" not zeroed
  Logical volume "lv11050" created
  WARNING: "lv11051" not zeroed
  Logical volume "lv11051" created

2.6.32-198.el6.x86_64

lvm2-2.02.87-2.1.el6                        BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-libs-2.02.87-2.1.el6                   BUILT: Wed Sep 14 09:44:16 CDT 2011
lvm2-cluster-2.02.87-2.1.el6                BUILT: Wed Sep 14 09:44:16 CDT 2011
udev-147-2.38.el6                           BUILT: Fri Sep 9 16:25:50 CDT 2011
device-mapper-1.02.66-2.1.el6               BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-libs-1.02.66-2.1.el6          BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-1.02.66-2.1.el6         BUILT: Wed Sep 14 09:44:16 CDT 2011
device-mapper-event-libs-1.02.66-2.1.el6    BUILT: Wed Sep 14 09:44:16 CDT 2011
cmirror-2.02.87-2.1.el6                     BUILT: Wed Sep 14 09:44:16 CDT 2011
Created attachment 524472 [details] output from slabinfo
Created attachment 524473 [details] output from meminfo
Created attachment 524474 [details] output from vmstat
Created attachment 524475 [details] output from free
Created attachment 524476 [details] output from slabtop -s c -o
Created attachment 524477 [details] output from ps aux
It should be noted that the procedure used in comment#12 doesn't do _any_ device activation... so neither DM (kernel) nor udev events will be firing... not sure how valid a test comment#12 is. But it is bizarre that so much memory is consumed by doing this test!
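If anyone wants to confirm that no events fire, a simple check is to leave udev's monitor running in a second terminal while the -an loop from comment#12 executes (a sketch using standard udev tooling):

  # with activation skipped there should be no "add" uevents for the new LVs
  udevadm monitor --kernel --udev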
The following attachments were taken after an 'echo 3 > /proc/sys/vm/drop_caches'.
Created attachment 524481 [details] output from slabinfo
Created attachment 524482 [details] output from meminfo
Created attachment 524483 [details] output from vmstat
Created attachment 524484 [details] output from free
Created attachment 524485 [details] output from slabtop -s c -o
Created attachment 524486 [details] output from ps aux
The following attachments were taken after activating all 11K LVs and doing an echo 3 > /proc/sys/vm/drop_caches.
Created attachment 524488 [details] output from slabinfo
Created attachment 524489 [details] output from meminfo
Created attachment 524490 [details] output from vmstat
Created attachment 524491 [details] output from free
Created attachment 524492 [details] output from slabtop -s c -o
Created attachment 524493 [details] output from ps aux
Created attachment 524494 [details] lvm.conf file from taft-01
(In reply to comment #32)
> Created attachment 524492 [details]
> output from slabtop -s c -o

Assuming 'biovec-256' is addressed by this patch:

https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

Unsure why 'sysfs_dir_cache' is so large?
(In reply to comment #19)
> It should be noted that the procedure used in comment#12 doesn't do _any_
> device activation... so neither DM (kernel) nor udev events will be firing...
> not sure how valid a test comment#12 is.
>
> But it is bizarre that so much memory is consumed by doing this test!

Since metadata archiving is enabled in this case, and the metadata for 12000 LVs is approaching 4MB and gets written out 12000 times, it's not so unexpected, I guess.
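If the archiving overhead needs to be ruled out, the relevant knob is the 'archive' setting in the backup section of /etc/lvm/lvm.conf. A sketch of disabling it for a test run (the reporter's actual lvm.conf is attached above):

  backup {
      # skip writing an archive copy of the metadata on every LV create
      archive = 0
  }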
(In reply to comment #35) > (In reply to comment #32) > > Created attachment 524492 [details] > > output from slabtop -s c -o > > Assuming 'biovec-256' is addressed by this patch: > > https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html Wasn't really a question but: Yes it is already applied to RHEL6.2 > Unsure why 'sysfs_dir_cache' is so large ? I am not sure why either but I can confirm I saw the same using dmsetup create (no lvm) to create 24000 linear devices. Total sysfs_dir_cache objects is ~64 times the number of devices.
(In reply to comment #37)
> (In reply to comment #35)
> > Unsure why 'sysfs_dir_cache' is so large?
>
> I am not sure why either but I can confirm I saw the same using dmsetup create
> (no lvm) to create 24000 linear devices. Total sysfs_dir_cache objects is ~64
> times the number of devices.

Sure enough, there are 64 calls to sysfs_new_dirent() for each DM device that is created.

# ./trace-cmd record -p function -l sysfs_new_dirent dmsetup create test27001 --table "0 16384 linear 8:64 0"
# ./trace-cmd report | grep sysfs_new_dirent | wc -l
64
Creating 27000 linear DM devices using dmsetup results in the following memory use (after dropping caches) -- just over 4.2GB in use:

MemTotal:        7158020 kB
MemFree:         2809736 kB
Buffers:            2956 kB
Cached:           138960 kB
SwapCached:            0 kB
Active:            36288 kB
Inactive:         126812 kB
Active(anon):      25064 kB
Inactive(anon):   108016 kB
Active(file):      11224 kB
Inactive(file):    18796 kB
Unevictable:        6192 kB
Mlocked:            6192 kB
SwapTotal:        524280 kB
SwapFree:         524280 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         27316 kB
Mapped:            14480 kB
Shmem:            108220 kB
Slab:            3720916 kB
SReclaimable:     295884 kB
SUnreclaim:      3425032 kB
KernelStack:      217248 kB
PageTables:         4044 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4103288 kB
Committed_AS:     235164 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      267652 kB
VmallocChunk:   34359283952 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8180 kB
DirectMap2M:     7331840 kB

Could save ~450MB if the bio-based DM reserves were uniformly reduced to 16 with the following patch:
https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html

reserve of 16:
dm_io           18.33984375 MB
dm_target_io    11.71875 MB

vs. reserve of 256:
dm_io           293.4765625 MB
dm_target_io    187.5 MB
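As a sanity check of the ~450MB figure, subtracting the reserve-of-16 pools from the reserve-of-256 pools listed above:

  echo "(293.4765625 - 18.33984375) + (187.5 - 11.71875)" | bc
  450.91796875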
(In reply to comment #37)
> (In reply to comment #35)
> > Assuming 'biovec-256' is addressed by this patch:
> >
> > https://www.redhat.com/archives/dm-devel/2011-August/msg00076.html
>
> Wasn't really a question but: Yes it is already applied to RHEL6.2

I thought you were talking about bip-256 and just assumed the url you provided was about the integrity reduction we get from not allocating an integrity profile for devices that don't support DIF/DIX. As I mentioned at the end of comment#39, the patch you referenced will only reduce the 'dm_io' and 'dm_target_io' slabs.

So, coming full circle on this: the original report (comment#0) was against the RHEL6.1 kernel. That kernel does _not_ have the DIF/DIX (bip-256) slab reduction changes that went into RHEL6.2 (via bug#697992 and rhel6.git commit d587012e830 specifically). Without that DIF/DIX patch, DM devices allocate considerably more slab (primarily in bip-256), and that explains why RHEL6.1 was exhausting 8GB. There was never a udev leak.

Changing subject and closing NOTABUG.