Bug 839137 - mdadm: page allocation failure. order:4, mode:0xd0
Summary: mdadm: page allocation failure. order:4, mode:0xd0
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jes Sorensen
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-07-11 05:25 UTC by Mark Nipper
Modified: 2012-07-24 13:12 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-24 13:12:45 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg from the machine (37.56 KB, application/octet-stream)
2012-07-12 20:06 UTC, Mark Nipper
no flags Details
new dmesg with md subsystem hanging (40.51 KB, application/octet-stream)
2012-07-12 20:32 UTC, Mark Nipper
no flags Details

Description Mark Nipper 2012-07-11 05:25:26 UTC
Description of problem:
The problem is that I can't create all of the md devices I want on this system because apparently it has too many devices in it.

Version-Release number of selected component (if applicable):
kernel-2.6.32-279.el6.x86_64.  It's also possible that it's something to do with the mdadm package itself; if so, I've got mdadm-3.2.3-9.el6.x86_64 installed.  Also, we just added all of these additional disks to this fairly up-to-date RHEL 6.3 box, so I'm not sure whether it was happening before 6.3.

How reproducible:
I've only tried it once so far.

Steps to Reproduce:
1. Have a lot of multipath available SAS devices.
2. Start assembling and creating md devices (see the sketch after this list).
3. It breaks at some point.
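
A representative command for step 2, for one of the 9-device RAID-6 arrays, might look roughly like the following (the device names are illustrative assumptions, not the exact command line used on this system; the chunk size matches the 512k chunk shown in /proc/mdstat below):
---
# Assumed example only: create a 9-member RAID-6 across multipath partitions
# (on this system the members were device-mapper nodes such as dm-192..dm-200).
mdadm --create /dev/md8 --level=6 --raid-devices=9 --chunk=512 \
      /dev/mapper/mpathap1 /dev/mapper/mpathbp1 /dev/mapper/mpathcp1 \
      /dev/mapper/mpathdp1 /dev/mapper/mpathep1 /dev/mapper/mpathfp1 \
      /dev/mapper/mpathgp1 /dev/mapper/mpathhp1 /dev/mapper/mpathip1

# Watch assembly/resync progress:
cat /proc/mdstat
---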

Actual results:
Bad things happen in the kernel.

Expected results:
Everything should work without complaint.

Additional info:
Here is my dmesg when the problem occurred.  We just added a rather large 45 disk expansion enclosure to the existing 36 disk enclosure on this system.  Trying to create the very last RAID-6 software RAID device using mdadm caused this:
---
[335824.979469] md: bind<dm-192>
[335824.990089] md: bind<dm-193>
[335824.998865] md: bind<dm-194>
[335825.025949] md: bind<dm-195>
[335825.034385] md: bind<dm-196>
[335825.050315] md: bind<dm-197>
[335825.084339] md: bind<dm-198>
[335825.095229] md: bind<dm-199>
[335825.141116] md: bind<dm-200>
[335825.142345] md/raid:md8: not clean -- starting background reconstruction
[335825.142368] md/raid:md8: device dm-200 operational as raid disk 8
[335825.142370] md/raid:md8: device dm-199 operational as raid disk 7
[335825.142373] md/raid:md8: device dm-198 operational as raid disk 6
[335825.142375] md/raid:md8: device dm-197 operational as raid disk 5
[335825.142377] md/raid:md8: device dm-196 operational as raid disk 4
[335825.142380] md/raid:md8: device dm-195 operational as raid disk 3
[335825.142382] md/raid:md8: device dm-194 operational as raid disk 2
[335825.142384] md/raid:md8: device dm-193 operational as raid disk 1
[335825.142386] md/raid:md8: device dm-192 operational as raid disk 0
[335825.143022] mdadm: page allocation failure. order:4, mode:0xd0
[335825.143025] Pid: 7023, comm: mdadm Not tainted 2.6.32-279.el6.x86_64 #1
[335825.143028] Call Trace:
[335825.143038]  [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
[335825.143044]  [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
[335825.143047]  [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
[335825.143050]  [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
[335825.143054]  [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
[335825.143057]  [<ffffffff811634db>] ? kmem_cache_alloc+0x11b/0x190
[335825.143061]  [<ffffffff811659f4>] ? kmem_cache_create+0x1e4/0x5a0
[335825.143076]  [<ffffffffa00da1c5>] ? setup_conf+0x615/0x870 [raid456]
[335825.143081]  [<ffffffff814fd223>] ? printk+0x41/0x46
[335825.143087]  [<ffffffffa00dadaa>] ? run+0x66a/0x8e8 [raid456]
[335825.143092]  [<ffffffff813fc2e2>] ? md_run+0x3b2/0x8f0
[335825.143095]  [<ffffffff813fd670>] ? add_new_disk+0x2f0/0x560
[335825.143099]  [<ffffffff81193627>] ? __d_lookup+0xa7/0x150
[335825.143102]  [<ffffffff813fc83e>] ? do_md_run+0x1e/0xd0
[335825.143105]  [<ffffffff813ff38b>] ? md_ioctl+0xffb/0x1270
[335825.143109]  [<ffffffff8113ce71>] ? unmap_vmas+0x8e1/0xc30
[335825.143115]  [<ffffffff8125e317>] ? __blkdev_driver_ioctl+0x27/0x80
[335825.143120]  [<ffffffff81148498>] ? anon_vma_chain_free+0x18/0x20
[335825.143123]  [<ffffffff8125e7dd>] ? blkdev_ioctl+0x1ed/0x6e0
[335825.143128]  [<ffffffff811b381c>] ? block_ioctl+0x3c/0x40
[335825.143132]  [<ffffffff8118dec2>] ? vfs_ioctl+0x22/0xa0
[335825.143135]  [<ffffffff8114495d>] ? unmap_region+0xcd/0x130
[335825.143138]  [<ffffffff8118e064>] ? do_vfs_ioctl+0x84/0x580
[335825.143141]  [<ffffffff81142b4e>] ? remove_vma+0x6e/0x90
[335825.143145]  [<ffffffff81145038>] ? do_munmap+0x308/0x3a0
[335825.143148]  [<ffffffff8118e5e1>] ? sys_ioctl+0x81/0xa0
[335825.143153]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[335825.143164] Mem-Info:
[335825.143166] Node 0 DMA per-cpu:
[335825.143170] CPU    0: hi:    0, btch:   1 usd:   0
[335825.143172] CPU    1: hi:    0, btch:   1 usd:   0
[335825.143175] CPU    2: hi:    0, btch:   1 usd:   0
[335825.143177] CPU    3: hi:    0, btch:   1 usd:   0
[335825.143179] CPU    4: hi:    0, btch:   1 usd:   0
[335825.143181] CPU    5: hi:    0, btch:   1 usd:   0
[335825.143184] CPU    6: hi:    0, btch:   1 usd:   0
[335825.143186] CPU    7: hi:    0, btch:   1 usd:   0
[335825.143189] CPU    8: hi:    0, btch:   1 usd:   0
[335825.143191] CPU    9: hi:    0, btch:   1 usd:   0
[335825.143194] CPU   10: hi:    0, btch:   1 usd:   0
[335825.143196] CPU   11: hi:    0, btch:   1 usd:   0
[335825.143198] Node 0 DMA32 per-cpu:
[335825.143200] CPU    0: hi:  186, btch:  31 usd:   0
[335825.143202] CPU    1: hi:  186, btch:  31 usd:   0
[335825.143204] CPU    2: hi:  186, btch:  31 usd:   0
[335825.143207] CPU    3: hi:  186, btch:  31 usd:   0
[335825.143209] CPU    4: hi:  186, btch:  31 usd:   0
[335825.143211] CPU    5: hi:  186, btch:  31 usd:   1
[335825.143214] CPU    6: hi:  186, btch:  31 usd:   0
[335825.143216] CPU    7: hi:  186, btch:  31 usd:  66
[335825.143219] CPU    8: hi:  186, btch:  31 usd:   0
[335825.143221] CPU    9: hi:  186, btch:  31 usd:   0
[335825.143224] CPU   10: hi:  186, btch:  31 usd:   0
[335825.143226] CPU   11: hi:  186, btch:  31 usd:   0
[335825.143227] Node 0 Normal per-cpu:
[335825.143230] CPU    0: hi:  186, btch:  31 usd:   0
[335825.143233] CPU    1: hi:  186, btch:  31 usd:   0
[335825.143235] CPU    2: hi:  186, btch:  31 usd:   0
[335825.143237] CPU    3: hi:  186, btch:  31 usd:   0
[335825.143240] CPU    4: hi:  186, btch:  31 usd:   0
[335825.143242] CPU    5: hi:  186, btch:  31 usd:  56
[335825.143245] CPU    6: hi:  186, btch:  31 usd:   0
[335825.143247] CPU    7: hi:  186, btch:  31 usd: 186
[335825.143250] CPU    8: hi:  186, btch:  31 usd:  62
[335825.143252] CPU    9: hi:  186, btch:  31 usd:   0
[335825.143254] CPU   10: hi:  186, btch:  31 usd:   0
[335825.143256] CPU   11: hi:  186, btch:  31 usd:   0
[335825.143262] active_anon:308189 inactive_anon:56406 isolated_anon:0
[335825.143263]  active_file:1213532 inactive_file:1220852 isolated_file:0
[335825.143264]  unevictable:6424 dirty:21147 writeback:3200 unstable:0
[335825.143265]  free:37202 slab_reclaimable:74384 slab_unreclaimable:52502
[335825.143266]  mapped:3257 shmem:575 pagetables:1898 bounce:0
[335825.143269] Node 0 DMA free:15676kB min:80kB low:100kB high:120kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15264kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[335825.143280] lowmem_reserve[]: 0 2991 12081 12081
[335825.143288] Node 0 DMA32 free:65304kB min:16712kB low:20888kB high:25068kB active_anon:332456kB inactive_anon:91404kB active_file:1044000kB inactive_file:1045460kB unevictable:1108kB isolated(anon):0kB isolated(file):0kB present:3063264kB mlocked:0kB dirty:13352kB writeback:1232kB mapped:196kB shmem:64kB slab_reclaimable:112420kB slab_unreclaimable:24264kB kernel_stack:1912kB pagetables:624kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:66 all_unreclaimable? no
[335825.143300] lowmem_reserve[]: 0 0 9090 9090
[335825.143306] Node 0 Normal free:67828kB min:50784kB low:63480kB high:76176kB active_anon:900300kB inactive_anon:134220kB active_file:3810128kB inactive_file:3837948kB unevictable:24588kB isolated(anon):0kB isolated(file):0kB present:9308160kB mlocked:24588kB dirty:71236kB writeback:11568kB mapped:12832kB shmem:2236kB slab_reclaimable:185116kB slab_unreclaimable:185744kB kernel_stack:7136kB pagetables:6968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[335825.143318] lowmem_reserve[]: 0 0 0 0
[335825.143324] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15676kB
[335825.143340] Node 0 DMA32: 5745*4kB 5119*8kB 113*16kB 6*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 65996kB
[335825.143355] Node 0 Normal: 16619*4kB 294*8kB 29*16kB 2*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69356kB
[335825.143370] 2436215 total pagecache pages
[335825.143372] 17 pages in swap cache
[335825.143374] Swap cache stats: add 98, delete 81, find 0/0
[335825.143376] Free swap  = 12582512kB
[335825.143378] Total swap = 12582904kB
[335825.193672] 3145712 pages RAM
[335825.193675] 98887 pages reserved
[335825.193676] 839126 pages shared
[335825.193677] 2145736 pages non-shared
[335825.193688] md/raid:md8: couldn't allocate 9540kB for buffers
[335825.193719] md: pers->run() failed ...
[335825.193772] md: md8 stopped.
[335825.193781] md: unbind<dm-200>
[335825.203041] md: export_rdev(dm-200)
[335825.203066] md: unbind<dm-199>
[335825.220972] md: export_rdev(dm-199)
[335825.220997] md: unbind<dm-198>
[335825.231946] md: export_rdev(dm-198)
[335825.231962] md: unbind<dm-197>
[335825.249896] md: export_rdev(dm-197)
[335825.249910] md: unbind<dm-196>
[335825.258882] md: export_rdev(dm-196)
[335825.258897] md: unbind<dm-195>
[335825.269849] md: export_rdev(dm-195)
[335825.269863] md: unbind<dm-194>
[335825.280822] md: export_rdev(dm-194)
[335825.280836] md: unbind<dm-193>
[335825.289796] md: export_rdev(dm-193)
[335825.289810] md: unbind<dm-192>
[335825.300770] md: export_rdev(dm-192)
---

This is the output from cat /proc/mdstat currently:
---
Personalities : [raid6] [raid5] [raid4] [raid1]
md7 : active raid6 dm-191[8] dm-190[7] dm-189[6] dm-188[5] dm-187[4] dm-186[3] dm-185[2] dm-184[1] dm-183[0]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=============>.......]  resync = 66.1% (1292848640/1953510912) finish=315.7min speed=34871K/sec

md6 : active raid6 dm-182[8] dm-181[7] dm-180[6] dm-179[5] dm-178[4] dm-177[3] dm-176[2] dm-175[1] dm-174[0]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=============>.......]  resync = 66.1% (1292712448/1953510912) finish=318.7min speed=34547K/sec

md5 : active raid6 dm-173[8] dm-172[7] dm-171[6] dm-170[5] dm-169[4] dm-168[3] dm-167[2] dm-166[1] dm-165[0]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=============>.......]  resync = 69.1% (1351455104/1953510912) finish=277.7min speed=36121K/sec

md4 : active raid6 dm-164[8] dm-163[7] dm-162[6] dm-161[5] dm-160[4] dm-159[3] dm-158[2] dm-157[1] dm-156[0]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=============>.......]  resync = 69.6% (1360178732/1953510912) finish=273.1min speed=36201K/sec

md127 : active raid1 dm-66[9] dm-47[13] dm-43[21] dm-51[31] dm-38[20] dm-62[24] dm-45[15] dm-95[8] dm-52[22] dm-64[23] dm-42[12] dm-83[28] dm-28[11] dm-54[25] dm-68[27] dm-67[30] dm-34[10] dm-88[4] dm-75[2] dm-84[7] dm-70[32] dm-59[33] dm-104[1] dm-106[35] dm-30[16] dm-79[34] dm-73[29] dm-91[3] dm-40[14] dm-81[0] dm-36[18] dm-53[26] dm-76[17] dm-49[19] dm-100[5] dm-102[6]
      999988 blocks super 1.0 [36/36] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]

md3 : active raid6 dm-101[2] dm-82[0] dm-78[7] dm-107[10] dm-103[4] dm-96[6] dm-105[5] dm-92[1] dm-89[9] dm-69[8] dm-86[3] dm-85[11]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]

md2 : active raid6 dm-77[4] dm-72[7] dm-71[5] dm-87[6] dm-65[0] dm-58[3] dm-74[8] dm-60[1] dm-63[2] dm-57[9] dm-61[10] dm-33[11]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [=================>...]  check = 86.1% (1681537280/1952512512) finish=4195.3min speed=1076K/sec

md1 : active raid6 dm-80[8] dm-41[3] dm-44[4] dm-48[0] dm-39[11] dm-56[10] dm-50[5] dm-55[12] dm-46[6] dm-35[2] dm-37[9] dm-29[1]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [===============>.....]  check = 76.5% (1495513984/1952512512) finish=7378.7min speed=1031K/sec

unused devices: <none>
---

md[4-7] are the newly added disks.  md[1-3] and md127 are the original 36 drives.  The system just happened to be running consistency checks on some of the original drives when I was creating the new md devices.

The drives are all 2TB.  They all use msdos labels and only primary partitions.

The original 36 drives have two partitions: a 1GB partition at the front of each drive for /boot (md127 above), and the rest of the drive for MD (12 drives make up each of md[1-3] above, which are then PVs inside one large VG with one large LV with XFS on top; we're not happy with XFS though because of some issues we've seen, so we'll most likely be switching to multiple ext4 filesystems under the current RHEL 16TB limit, like we're trying to do with the other 45 drives).
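
For reference, that original stack amounts to roughly the following (a sketch only; the VG and LV names are made up, not taken from this system):
---
# Illustrative reconstruction of the original layout: three 12-disk RAID-6
# arrays pooled into one volume group with a single XFS logical volume on top.
pvcreate /dev/md1 /dev/md2 /dev/md3
vgcreate vg_data /dev/md1 /dev/md2 /dev/md3    # "vg_data" is an assumed name
lvcreate -l 100%FREE -n lv_data vg_data        # "lv_data" is an assumed name
mkfs.xfs /dev/vg_data/lv_data
---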

The other 45 drives all have only a single 2TB partition.  These were going to be md[4-8], 9 drives in each RAID-6 set, with ext4 directly on top of each md device.

However, it now looks like we won't be able to do any of this until there's potentially a kernel patch to address the problem we're seeing here.  Since this box is merely running as a CrashPlan server, I've got a little bit of room to take the machine down, reboot, whatever, as necessary.  I'd rather get this working now, since we're looking at a similar storage architecture moving forward, potentially with more enclosures attached.

Comment 2 Jes Sorensen 2012-07-11 09:08:52 UTC
Mark,

Could you please attach the output of /proc/partitions to this BZ as well?

Looking at the mdstat output, am I right that you are building
md-raid instances on top of a stack of device-mapper devices?

Thanks,
Jes

Comment 3 Mark Nipper 2012-07-12 20:06:06 UTC
Created attachment 597892 [details]
dmesg from the machine

Comment 4 Mark Nipper 2012-07-12 20:12:30 UTC
Oops.  Ignore that dmesg.  It was meant for bug 839781.  However, I installed the latest kernel (2.6.32-279.1.1.el6.x86_64) and rebooted, and all 5 of the new arrays do now appear in /proc/mdstat:
---
Personalities : [raid6] [raid5] [raid4] [raid1] 
md122 : active raid6 dm-46[7] dm-41[3] dm-163[0] dm-34[4] dm-51[2] dm-55[5] dm-53[6] dm-110[8] dm-72[1]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [===>.................]  resync = 18.8% (368961920/1953510912) finish=245.5min speed=107566K/sec
      
md123 : active raid6 dm-78[4] dm-79[3] dm-75[2] dm-161[6] dm-61[0] dm-133[7] dm-104[5] dm-135[8] dm-172[1]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md124 : active raid6 dm-130[5] dm-136[4] dm-173[7] dm-152[0] dm-138[3] dm-178[8] dm-146[6] dm-106[2] dm-182[1]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [>....................]  resync =  0.0% (1240704/1953510912) finish=31379.8min speed=1036K/sec
      
md125 : active (auto-read-only) raid6 dm-45[0] dm-50[7] dm-38[3] dm-63[5] dm-60[8] dm-57[4] dm-36[2] dm-164[1] dm-54[6]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md126 : active (auto-read-only) raid6 dm-167[4] dm-140[8] dm-168[3] dm-189[6] dm-157[1] dm-113[5] dm-156[2] dm-180[7] dm-154[0]
      13674576384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md127 : active raid1 dm-101[21] dm-177[0] dm-158[3] dm-117[31] dm-196[4] dm-114[17] dm-95[27] dm-64[11] dm-92[30] dm-145[1] dm-82[20] dm-77[16] dm-80[29] dm-119[26] dm-118[22] dm-65[13] dm-128[32] dm-187[35] dm-70[19] dm-62[15] dm-169[8] dm-188[5] dm-111[34] dm-84[18] dm-115[33] dm-108[12] dm-76[14] dm-98[23] dm-192[9] dm-184[6] dm-194[2] dm-126[25] dm-144[7] dm-107[28] dm-112[24] dm-102[10]
      999988 blocks super 1.0 [36/36] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
      
md3 : active raid6 dm-195[7] dm-197[9] dm-193[8] dm-190[10] dm-191[2] dm-185[4] dm-179[11] dm-170[6] dm-160[1] dm-148[3] dm-147[5] dm-121[0]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      
md1 : active raid6 dm-103[2] dm-129[10] dm-120[8] dm-87[11] dm-105[6] dm-122[4] dm-73[12] dm-68[5] dm-89[9] dm-66[0] dm-67[1] dm-83[3]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      
md2 : active raid6 dm-124[10] dm-131[8] dm-134[1] dm-132[3] dm-97[7] dm-116[2] dm-125[9] dm-123[6] dm-94[5] dm-99[0] dm-85[11] dm-86[4]
      19525125120 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      
unused devices: <none>
---

I'm trying to format them as ext4 at the moment (directly on top of the md device, no LVM in the middle).

We'll see how that goes.  I'm not sure if it was simply the number of checks/syncs happening previously that caused the initial problem I saw in this bug.  I'm a little worried about what happens when the system decides to verify all the arrays since I assume it will try to check them all simultaneously.
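
The formatting itself is just mkfs.ext4 straight onto each md device; for one of these 9-disk RAID-6 arrays the RAID-aware parameters work out as sketched below (an illustration only; mke2fs normally derives these values from the md geometry by itself):
---
# 512 KiB chunk / 4 KiB blocks = stride of 128; 9 disks - 2 parity = 7 data
# disks, so stripe-width = 7 * 128 = 896 (matches the mke2fs output below).
mkfs.ext4 -E stride=128,stripe-width=896 /dev/md122    # md122 as an example
---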

Comment 5 Mark Nipper 2012-07-12 20:32:55 UTC
Created attachment 597900 [details]
new dmesg with md subsystem hanging

Well, I was being a little too optimistic it seems.  One of my mke2fs runs is hanging at:
---
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=896 blocks
854663168 inodes, 3418644096 blocks
170932204 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
104329 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
	2560000000

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: 
---

and lots of badness in dmesg (attached).

Comment 6 Jes Sorensen 2012-07-17 07:43:18 UTC
Mark,

Looking at this one - it looks like the array that failed required the kernel
to allocate a fair chunk of memory for it. However I cannot see which and how
many drives were actually meant to be part of md8 (the one that failed to
start).

I presume this failure is consistent, not just a one-off error?

Thanks,
Jes

Comment 7 Jes Sorensen 2012-07-17 07:56:43 UTC
Hi Mark,

While I am at it, can you get me the /proc/partitions output as well, please.

Thanks,
Jes

Comment 8 Jes Sorensen 2012-07-17 13:12:43 UTC
Mark,

How much memory do you have in that box?

Thanks,
Jes

Comment 9 Mark Nipper 2012-07-17 18:54:35 UTC
Well, I thought I had attached the output from /proc/partitions someplace, but it appears I didn't.  I've already wiped the RHEL installation on this box and put Debian/testing on it to see if that fixed things.  It seems to have fixed the initial mdadm creation problem, but we've run into another problem with jbd2 hanging during writes to an ext4 partition on one of these software RAID-6 volumes.  To make matters worse, we've also recabled the machine to avoid doing multipath and cascading across several SAS expanders, as we thought that might be part of the problem in all of this.

To answer what I can, the machine has 12GB of RAM and the same amount of swap.  md[122-126] are each made up of 9 2TB SAS2 drives in a software RAID-6; these were the 45 drives in the external Super Micro 847 expansion chassis we have attached to this box.  The primary system is also an 847 chassis, with 36 2TB SAS2 drives.  Those we had done differently: md[1-3] are comprised of 12 drives each, also RAID-6, with an LVM physical volume on each and a volume group spanning all three physical volumes.  md127 is a 1GB sliver on each of these 36 drives where /boot lives (massive overkill, I know).

Anyway, I didn't try the creation more than the once.  The machine running RHEL actually assembled the drives just fine the next time I booted, but continued to display weirdness including the mpt2sas log messages I linked to in my other bug report above.

Whatever, we're actually buying hardware RAID cards at this point due to all the problems we've been having with software RAID under Linux on these LSI based expanders.  Hopefully their hardware RAID cards will work better talking directly to their own expanders than the Linux kernel has thus far.

Sorry that I can't provide more information than this.  We need a working system here.

Comment 10 Mark Nipper 2012-07-17 18:57:06 UTC
Oh, I will say this about /proc/partitions.  The way we had everything cabled originally meant that all 81 drives showed up multiple times over in /proc/partitions.  So each drive had a real SCSI device (sda/sdb), a multipath device (mpatha), and all the related partitions (sda1/sdb1/mpathap1).  And then there were the dm-[[:digit:]] devices as well I believe (can't remember if they only appeared in /dev/mapper or /proc/partitions also).  It was a fairly insane number of devices regardless.
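
If it helps reconstruct that topology later, the usual way to see how the sdX, mpathX, and dm-N names relate to each other is something like the following (a sketch; dm-192 is just one of the nodes from the log above):
---
# Show each multipath map and the SCSI paths (sdX) behind it
multipath -ll

# Show the device-mapper tree (mpath devices, their partitions, dm-N minors)
dmsetup ls --tree

# Resolve one dm-N node back to its device-mapper name
dmsetup info -c /dev/dm-192
---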

Comment 11 Jes Sorensen 2012-07-18 08:09:50 UTC
Mark,

Thanks for the information.

The reason I asked about the memory was that the failure you saw in the logs
was caused by an out-of-memory condition, and the logs indicate that
you only had 128MB or so of free memory available at the time.
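
For background on the numbers: the allocation that failed appears to be the raid456 stripe cache that md sets up in setup_conf() when the array starts. As a rough rule from the kernel's md documentation, it needs about one page per member device per cached stripe, so the default 256 stripes on a 9-device array come to roughly 9MB plus per-stripe bookkeeping, which is in the ballpark of the 9540kB figure in the log. Once an array is running, the cache can be inspected and tuned via sysfs (a sketch; md8 stands in for whichever array you look at):
---
# Approximate stripe cache memory: stripe_cache_size * page_size * nr_disks
#   256 * 4 KiB * 9 = 9216 KiB, plus per-stripe bookkeeping
cat /sys/block/md8/md/stripe_cache_size          # default is 256
echo 512 > /sys/block/md8/md/stripe_cache_size   # raising it uses more memory
---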

I looked through the code, and it doesn't look like it does anything wrong
in the raid part at least when encountering such a situation. It could be
made to retry, but I don't think that would make the problem go away. If
the box was heavily loaded during creation of the raid, it is entirely
possible that it simply failed due to a temporary lack of memory.

I saw your mpt bug report - that is not my area, however that part should get
addressed there.

I'll have another look at the code to see if there is anything we can do to
make it handle memory pressure later. However without being able to test it
directly I may not be able to do much.

Cheers,
Jes

Comment 12 Mark Nipper 2012-07-18 15:05:16 UTC
The machine was definitely under severe memory pressure.  It's our CrashPlan server and apparently CrashPlan is awful about memory consumption.  So if this was simply a memory allocation issue, then all is well I guess.  It was strange that it failed to create the array (apparently) but the next time I booted the machine, it actually did assemble the array as if everything was okay.

Thanks for the information.  You can probably close this then.

Comment 13 Jes Sorensen 2012-07-24 13:12:45 UTC
Mark,

Reading the logs, I am pretty confident it was an out-of-memory situation.
Given that the system was under heavy memory pressure at the time, I think
the system pretty much acted as it was supposed to.

It may be we need to improve the error handling in mdadm so it reports this
better in the future, but I don't think there is a bug at hand here.

I'll close this for now, but if you experience this again, or any other
md-related bugs, please go ahead and open new Bugzilla entries for them. I
definitely want to know about it.

Thanks,
Jes

