Bug 490026

Summary: EXT4 panic, list corruption in ext4_mb_new_inode_pa
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Issue Tracker <tao>
Assignee: Eric Sandeen <esandeen>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: bmr, esandeen, rwheeler, tao, woodard
Doc Type: Bug Fix
Last Closed: 2009-03-26 15:56:50 UTC

Description Issue Tracker 2009-03-12 21:13:19 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-03-12 21:13:20 UTC
Eric Sandeen provided me with a kernel that he wanted me to run my tests on. While doing this testing, I ran into this problem. It should probably go straight to bz and be assigned to him, because it isn't with a stock kernel.

2009-03-09 17:50:51 kjournald2 starting: pid 1298, dev md0:8, commit interval 5 seconds
2009-03-09 17:50:51 EXT4 FS on md0, internal journal on md0:8
2009-03-09 17:50:51 EXT4-fs: delayed allocation enabled
2009-03-09 17:50:51 EXT4-fs: file extents enabled
2009-03-09 17:50:53 EXT4-fs: mballoc enabled
2009-03-09 17:50:53 EXT4-fs: mounted filesystem md0 with ordered data mode
<ConMan> Console [sun-ost6] log at 2009-03-09 18:00:00 PDT.
2009-03-09 18:23:56 
2009-03-09 18:23:56 list_add corruption. next->prev should be ffff81042f2fe158, but was 0000000000200200
2009-03-09 18:23:56 ----------- [cut here ] --------- [please bite here ] ---------
2009-03-09 18:23:56 Kernel BUG at lib/list_debug.c:26
2009-03-09 18:23:56 invalid opcode: 0000 [1] SMP 
2009-03-09 18:23:56 last sysfs file: /devices/pci0000:80/0000:80:0f.0/0000:83:00.1/irq
2009-03-09 18:23:56 CPU 3 
2009-03-09 18:23:56 Modules linked in: hfsplus ext4 jbd2 crc16 xfs ext3 jbd raid0 ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad mlx4_ib ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_multipath scsi_dh dm_mod video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sd_mod sg joydev mptsas mptscsih shpchp mptbase i2c_nforce2 scsi_transport_sas pcspkr ohci_hcd i2c_core scsi_mod mlx4_core niu nfs nfs_acl lockd fscache sunrpc forcedeth
2009-03-09 18:23:56 Pid: 8711, comm: pios Not tainted 2.6.18-132.el5.ers1 #1
2009-03-09 18:23:56 RIP: 0010:[<ffffffff8014d49b>]  [<ffffffff8014d49b>] __list_add+0x24/0x68
2009-03-09 18:23:56 RSP: 0018:ffff8107c4d73538  EFLAGS: 00010286
2009-03-09 18:23:56 RAX: 0000000000000058 RBX: ffff81032d354550 RCX: 0000000000000000
2009-03-09 18:23:56 RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffffffff802f9aa0
2009-03-09 18:23:56 RBP: ffff81042f2fe158 R08: ffffffff802f9aa8 R09: 0000000000000046
2009-03-09 18:23:56 R10: ffff8107c4d731e8 R11: 0000000000000080 R12: ffff81032d354890
2009-03-09 18:23:56 R13: ffff81028b0c5c00 R14: ffff810471f62c10 R15: ffff8107c4d736e4
2009-03-09 18:23:56 FS:  0000000040a04940(0063) GS:ffff81010eb203c0(0000) knlGS:0000000000000000
2009-03-09 18:23:56 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2009-03-09 18:23:56 CR2: 00002aaac42ff000 CR3: 00000005b2bee000 CR4: 00000000000006e0
2009-03-09 18:23:56 Process pios (pid: 8711, threadinfo ffff8107c4d72000, task ffff8108023d1100)
2009-03-09 18:23:56 Stack:  ffff81032d354880 ffff81005780c880 ffff810471f62b60 ffffffff8856cb89
2009-03-09 18:23:56  ffff8107c4d73678 ffff81005780c880 ffff8107c4d73678 0000000000000000
2009-03-09 18:23:56  ffff810471f62bf8 ffffffff8856e795 ffff810471f62c10 ffff8100446cfbe8
2009-03-09 18:23:56 Call Trace:
2009-03-09 18:23:56  [<ffffffff8856cb89>] :ext4:ext4_mb_new_inode_pa+0x261/0x2ea
2009-03-09 18:23:56  [<ffffffff8856e795>] :ext4:ext4_mb_new_blocks+0x1f6/0x41c
2009-03-09 18:23:56  [<ffffffff885675c9>] :ext4:ext4_ext_get_blocks+0xaf5/0xcf4
2009-03-09 18:23:56  [<ffffffff8005bc30>] cache_alloc_refill+0x106/0x186
2009-03-09 18:23:56  [<ffffffff885559f4>] :ext4:ext4_get_blocks_wrap+0xf0/0x20f
2009-03-09 18:23:56  [<ffffffff88559279>] :ext4:ext4_da_get_block_write+0x5f/0x14e
2009-03-09 18:23:56  [<ffffffff88554981>] :ext4:mpage_da_map_blocks+0x73/0x4fc
2009-03-09 18:23:56  [<ffffffff80047923>] __pagevec_release+0x19/0x22
2009-03-09 18:23:56  [<ffffffff800eea42>] write_cache_pages+0x20c/0x2c3
2009-03-09 18:23:56  [<ffffffff88556ed8>] :ext4:__mpage_da_writepage+0x0/0x181
2009-03-09 18:23:56  [<ffffffff8855807a>] :ext4:ext4_da_writepages+0x33e/0x4e3
2009-03-09 18:23:56  [<ffffffff88537916>] :jbd2:do_get_write_access+0x3f9/0x43f
2009-03-09 18:23:56  [<ffffffff80019421>] __getblk+0x25/0x236
2009-03-09 18:23:56  [<ffffffff8855921a>] :ext4:ext4_da_get_block_write+0x0/0x14e
2009-03-09 18:23:56  [<ffffffff8005a752>] do_writepages+0x20/0x2f
2009-03-09 18:23:56  [<ffffffff8002f72e>] __writeback_single_inode+0x1ae/0x328
2009-03-09 18:23:56  [<ffffffff885542d6>] :ext4:ext4_mark_inode_dirty+0x157/0x18a
2009-03-09 18:23:56  [<ffffffff800206b9>] sync_sb_inodes+0x1a9/0x267
2009-03-09 18:23:56  [<ffffffff800508f1>] writeback_inodes+0x82/0xd8
2009-03-09 18:23:56  [<ffffffff800c61c7>] balance_dirty_pages_ratelimited_nr+0x114/0x1fa
2009-03-09 18:23:56  [<ffffffff8000fa4f>] generic_file_buffered_write+0x1fb/0x665
2009-03-09 18:23:56  [<ffffffff88537511>] :jbd2:jbd2_journal_stop+0x1b4/0x1c0
2009-03-09 18:23:56  [<ffffffff80015f5a>] __generic_file_aio_write_nolock+0x369/0x3b6
2009-03-09 18:23:56  [<ffffffff80063d28>] __mutex_lock_slowpath+0x68/0x9b
2009-03-09 18:23:56  [<ffffffff80020f0a>] generic_file_aio_write+0x65/0xc1
2009-03-09 18:23:56  [<ffffffff88551775>] :ext4:ext4_file_write+0x62/0xe1
2009-03-09 18:23:56  [<ffffffff80017af7>] do_sync_write+0xc7/0x104
2009-03-09 18:23:56  [<ffffffff8009dc32>] autoremove_wake_function+0x0/0x2e
2009-03-09 18:23:56  [<ffffffff80063097>] thread_return+0x62/0xfe
2009-03-09 18:23:56  [<ffffffff80016363>] vfs_write+0xce/0x174
2009-03-09 18:23:56  [<ffffffff800436cc>] sys_pwrite64+0x50/0x70
2009-03-09 18:23:56  [<ffffffff8005dde9>] error_exit+0x0/0x84
2009-03-09 18:23:56  [<ffffffff8005d116>] system_call+0x7e/0x83
2009-03-09 18:23:56 
2009-03-09 18:23:56 
2009-03-09 18:23:56 Code: 0f 0b 68 fa 72 2a 80 c2 1a 00 48 8b 55 00 48 39 da 74 1b 48 
2009-03-09 18:23:56 RIP  [<ffffffff8014d49b>] __list_add+0x24/0x68
2009-03-09 18:23:56  RSP <ffff8107c4d73538>
2009-03-09 18:23:56 Linux version 2.6.18-132.el5.ers1 (root@bear-05) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Fri Feb 27 17:25:32 CST 2009
2009-03-09 18:23:56 Command line: initrd=initrd-2.6.18-132.el5.ers1 console=ttyS0,115200n8 elevator=deadline swiotlb=65536 selinux=0 iommu=memaper=3 BOOT_IMAGE=vmlinuz-2.6.18-132.el5.ers1 BOOTIF=01-00-14-4f-f2-8b-90 irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=5184K@16384K memmap=125232K@22208K elfcorehdr=147440K memmap=56K#3407616K memmap=200K#3407672K
2009-03-09 18:23:56 BIOS-provided physical RAM map:
...
Unfortunately the kdump kernel doesn't work, so I won't be able to get a crash dump.
We have a ticket open on that issue already.

This is a Sun Thor box with 6 LSI controllers and 48 drives.
I was tweaking various MD and EXT4 settings as part of a parameter study, and had worked my way down to MD with 4096K chunks and EXT4 with a stride of 1.

Not sure how reproducible it is yet; it only happened once, in one test. At the time, the test was running 128 threads doing a lot of IO. Here is a snippet from a vmstat that I had running at the time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
47 27      0  92952  13432 31236180    0    0     4 1239418 3804 1415  0 100  0  0  0
46 24      0  95460  16088 31244608    0    0     0 1141906 3915 1047  0 100  0  0  0
55 34      0  96344  17084 31243456    0    0     1 1233798 3799 1408  0 100  0  0  0
49 29      0  93628  17204 31234400    0    0     0 1226954 3685  972  0 100  0  0  0
46 21      0  91488  17424 31241932    0    0     4 1161592 3924  973  0 100  0  0  0
49 30      0  94736  17452 31238640    0    0     1 1200241 3956 1112  0 100  0  0  0
42 22      0  96004  17552 31242872    0    0     1 1175689 3775 1135  0 100  0  0  0
63 35      0  94764  17556 31217184    0    0     1 1324150 3875 1147  0 100  0  0  0
45 29      0  96964  17736 31232812    0    0     3 1145525 3974 1027  0 100  0  0  0
42 20      0  92340  17700 31241188    0    0     0 1157958 3773 1072  0 100  0  0  0
42 18      0  97068  17572 31238964    0    0     2 1192782 3761 1926  0 100  0  0  0
51 24      0  93844  17652 31234896    0    0     3 1220686 3980 1978  0 100  0  0  0
48 29      0  93680  17720 31237116    0    0     2 1175973 3790 1336  0 100  0  0  0
47 26      0  91036  17752 31242764    0    0     2 1156842 3729 1273  0 100  0  0  0
44 18      0  95332  17428 31208696    0    0     5 1414503 4048 1309  0 100  0  0  0
43 22      0  91204  13568 31224612    0    0     6 1195661 3925 1789  0 100  0  0  0
45 26      0  95280  13016 31242276    0    0     5 1128330 4356 1672  0 100  0  0  0
58 28      0  97112  12912 31240892    0    0     5 1273613 4015 1465  0 100  0  0  0
55 29      0  98608  12860 31226456    0    0     2 1135049 3515 1107  0 100  0  0  0
40 20      0  97268  13328 31230272    0    0     8 1158282 3850 1619  0 100  0  0  0
68 40      0  97932  13292 31229544    0    0     5 1271407 3973 1679  0 100  0  0  0
52 33      0  97808  13272 31233772    0    0     3 1288932 3972 1387  0 100  0  0  0
58 42      0  95832  13436 31216932    0    0     3 1303794 3744 1035  0 100  0  0  0
59 43      0  92248  13508 31229928    0    0     5 1233107 3979 1124  0 100  0  0  0
50 30      0  98844  13160 31217048    0    0     3 1332846 4035 1149  0 100  0  0  0
57 32      0  91208  13044 31237308    0    0     3 1295418 4060 1322  0 100  0  0  0
59 39      0  96008  13012 31208388    0    0     2 1413327 3864 2036  0 100  0  0  0
44 22      0  99348  13132 31236724    0    0     3 1196283 4129 1058  0 100  0  0  0
58 24      0 100884  13472 31231628    0    0     6 1216626 3821 2142  0 97  2  0  0
45 20      0  99996  13504 31239980    0    0     3 1199324 3888 1623  0 100  0  0  0
48 29      0 100896  13784 31214340    0    0     5 1370620 4007 1775  0 100  0  0  0
 0  2      0  91576  14040 31242384    0    0     4 1108883 3907 2018  0 98  2  1  0
49 25      0  93900  13960 31217932    0    0     4 1339805 3794 1833  0 93  5  2  0
50 29      0  92528  13508 31234344    0    0     2 1285618 4105 2374  0 100  0  0  0
53 27      0 101728  13388 31207740    0    0     3 1305614 3716 1893  0 100  0  0  0
52 29      0  98480  13212 31231288    0    0     5 1233886 4099 1820  0 100  0  0  0
55 37      0  93312  13436 31227944    0    0     5 1337174 4091 2139  0 100  0  0  0
59 36      0  92164  13584 31236736    0    0     4 1336565 4032 2009  0 100  0  0  0
 2  1      0 118952  13952 31205788    0    0    32 1229969 3820  955  0 47 46  6  0
 1  0      0 133508  14416 31206812    0    0     3 950961 3231  885  0 23 67  9  0
 1  0      0 134252  14428 31206836    0    0     0 337530 1776  388  0 16 84  0  0
 1  0      0 134688  14440 31206844    0    0     0 168712 1448  401  0 14 86  0  0
 1  0      0 135376  14448 31206836    0    0     0 84506 1282  397  0 13 87  0  0
30  8      0  99296  14940 31236728    0    0    59  7982 1170 1008  0 30 70  0  0
26 13      0  91980  14812 31225060    0    0   153 522390 2433 1872  0 100  0  0  0
29 18      0  95780  14856 31211808    0    0    76 728230 2996 1362  0 100  0  0  0
35 17      0  97908  14308 31191144    0    0     7 1267794 3851 1341  0 100  0  0  0
60 39      0  99868  10368 31132376    0    0     5 1036442 3579 3229  0 99  0  0  0
115 76      0  97820   7600 31047156    0    0    18 1278046 4035 1836  0 99  0  1  0
62 48      0  92048   9552 31058264    0    0    14 1221365 4042 1402  0 100  0  0  0
110 87      0  98004  10396 31048180    0    0     8 1280020 3968 1402  0 100  0  0  0
122 83      0  95596  10500 31053140    0    0     2 1321517 3988 1498  0 100  0  0  0
100 58      0  98632  10648 31008540    0    0     6 1428689 3993 2310  0 100  0  0  0
83 31      0 100296  11068 31041924    0    0    19 1093455 4105 2448  0 100  0  0  0
62 38      0  98676  11336 31038232    0    0     9 1301401 3971 2618  0 100  0  0  0
64 31      0  93112  11816 31049828    0    0     6 1226555 3929 2610  0 100  0  0  0
63 31      0  98576  12756 31008036    0    0    10 1341186 3854 2517  0 96  3  1  0
11  7      0  93020  13560 31041844    0    0     8 1081041 3969 2358  0 100  0  0  0
24 11      0  91560  14052 31048544    0    0     5 1213262 3841 2077  0 100  0  0  0
52 23      0  91084  14580 31049040    0    0     5 1262726 4032 2124  0 100  0  0  0
93 64      0  94800  14892 31013676    0    0     4 1401092 3938 1446  0 99  1  0  0
81 55      0  99424  14512 31039916    0    0     4 1137844 4017 1658  0 100  0  0  0

Here is the output from the benchmark:

# hyperion-sun-ost6 /mnt > mdadm /dev/md0 -C -c 4096 -l raid0 --raid-devices=48 -e1 /dev/sd{a,i,q,y,ag,ao,b,j,r,z,ah,ap,c,k,s,aa,ai,aq,d,l,t,ab,aj,ar,e,m,u,ac,ak,as,f,n,v,ad,al,at,g,o,w,ae,am,au,h,p,x,af,an,av}
# hyperion-sun-ost6 /mnt > time ~ben/pios/pios --load=fpp --path=/mnt --threadcount=1,2,4,8,16,32,64,128,256 --regioncount=1024 --regionsize=256M --chunksize=1M --offset=0M
Current limit of open files is 1024 and pios wants: 1034.
Setting soft limit of no of open files to: 11024 and Hard Limit to 11024, continuing ...


Following stats are of write performance.

Run     Timestamp       T       N       S       C       Aggregate throughput    lowest time             Highest time            Run time
--------------------------------------------------------------------------------------
1       1236646308      1       1024    256 MB  1 MB            764 MB / sec.   0.308 second(s).        1.111 second(s).        342.908 second(s).
2       1236646653      2       1024    256 MB  1 MB            876 MB / sec.   0.526 second(s).        1.751 second(s).        298.933 second(s).
3       1236646956      4       1024    256 MB  1 MB            1105 MB / sec.  0.710 second(s).        2.349 second(s).        237.051 second(s).
4       1236647200      8       1024    256 MB  1 MB            1159 MB / sec.  1.173 second(s).        5.623 second(s).        226.035 second(s).
5       1236647435      16      1024    256 MB  1 MB            1188 MB / sec.  1.776 second(s).        16.369 second(s).       220.565 second(s).
6       1236647667      32      1024    256 MB  1 MB            1246 MB / sec.  1.820 second(s).        32.257 second(s).       210.275 second(s).
7       1236647896      64      1024    256 MB  1 MB            1222 MB / sec.  3.259 second(s).        100.716 second(s).      214.512 second(s).
<that is where she ends>


This event sent from IssueTracker by jwest  [SEG - Storage]
 issue 274336

Comment 2 Issue Tracker 2009-03-12 21:13:25 UTC
<neb> kentb: 274336 can you just send it up and get a bz created and given
to sandeen
<neb> since this is one with one of his test kernels. I think that is
probably the best avenue.
<neb> Note that it was with 132.ers1 kernel.

kbaxley assigned to issue for LLNL (HPC).
Category set to: Kernel
Status set to: Waiting on Tech
Version set to: '5.3'

This event sent from IssueTracker by jwest  [SEG - Storage]
 issue 274336

Comment 3 Issue Tracker 2009-03-12 21:13:28 UTC
Date problem reported: 3-10-09

Problem: EXT4 panic with an Eric Sandeen-provided kernel.   

What is expected from SEG:  Ben Woodard out at LLNL has been working
directly with Eric on some EXT4 issues, and this panic was generated
on one of the kernels Eric spun for Ben.  Can we get this one into BZ?

Thanks.


Issue escalated to Support Engineering Group by: kbaxley.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jwest  [SEG - Storage]
 issue 274336

Comment 5 Ric Wheeler 2009-03-13 14:40:34 UTC
Adding Eric - seems we cannot reproduce this easily. 

Does this look like any issues that we might have fixed in the refresh for 5.4?

Comment 6 Eric Sandeen 2009-03-13 14:48:37 UTC
Ric, unfortunately this was *with* the refresh.  :)  Looking into it.

Comment 7 Eric Sandeen 2009-03-13 17:43:28 UTC
This is the list_add() in ext4_mb_new_inode_pa():

        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
        list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

So this means the bb_prealloc_list got corrupted.

I'll see if I can find any problems...
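
For reference, the "Kernel BUG at lib/list_debug.c:26" in the oops is the
CONFIG_DEBUG_LIST sanity check in __list_add().  A paraphrased sketch of
that era's check (from memory, not copied verbatim):

        void __list_add(struct list_head *new,
                        struct list_head *prev,
                        struct list_head *next)
        {
                if (unlikely(next->prev != prev)) {
                        printk(KERN_ERR "list_add corruption. next->prev should "
                               "be %p, but was %p\n", prev, next->prev);
                        BUG();  /* <-- fires here; next->prev was 0x200200 */
                }
                if (unlikely(prev->next != next)) {
                        printk(KERN_ERR "list_add corruption. prev->next should "
                               "be %p, but was %p\n", next, prev->next);
                        BUG();
                }
                next->prev = new;
                new->next = next;
                new->prev = prev;
                prev->next = new;
        }

So the node we were inserting next to had its prev pointer set to
0x0000000000200200 instead of pointing back at the list head.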

Comment 8 Eric Sandeen 2009-03-13 19:01:11 UTC
At first glance it's all manipulated under the group lock as it should be, but the bad pointer is poisoned:

/********** include/linux/list.h **********/
/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x00100100)
#define LIST_POISON2  ((void *) 0x00200200)

so this looks like a previously deleted list entry.
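
That poison is planted by list_del(); a sketch from the same header:

        static inline void list_del(struct list_head *entry)
        {
                __list_del(entry->prev, entry->next);
                entry->next = LIST_POISON1;     /* 0x00100100 */
                entry->prev = LIST_POISON2;     /* 0x00200200 <- the value in the oops */
        }

In other words, the entry that bb_prealloc_list's neighbor pointer led to
had already been list_del()ed by someone else.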

Comment 9 Eric Sandeen 2009-03-14 04:51:52 UTC
Ok, this one is icky(tm).

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow is like this:

    ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
    ext4_lock_group(sb, grp);
    list_del(&pa->pa_group_list);
    ext4_unlock_group(sb, grp);

so it's critical that we get the right group number back for
this pa->pa_pstart block, to lock the right group (the one associated with this pa) and prevent concurrent list manipulation.

However, ext4_mb_put_pa() passes in (pa->pa_pstart - 1), with the
comment: /* -1 is to protect from crossing allocation group */

This makes sense for the group_pa, where pa_pstart is advanced as it's used (in ext4_mb_release_context()).  If a group is X blocks long, then if we start pa_pstart at the first block in the group and advance it by X, we land at the first block in the *next* group.  So in ext4_mb_put_pa(), when the group is completely used up and pa_pstart is fully advanced, backing off one puts us back in the right group, and the proper group gets locked.

However, for an inode_pa, pa_pstart is never advanced; it's just set once to the first block in the group and not moved after that.  So in this case, if we subtract one in ext4_mb_put_pa(), we are actually locking the *previous* group, opening a race with the other threads, which do not subtract off the extra block.
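
To make the off-by-one concrete (hypothetical numbers, invented for the
example; the division is the one the helper above does):

        unsigned long blocks_per_group = 32768;
        /* inode_pa: pa_pstart sits at the first block of group 5 and never moves */
        unsigned long pa_pstart = 5 * blocks_per_group;

        unsigned long grp_add_del = pa_pstart / blocks_per_group;        /* 5: locked by the list_add/list_del paths */
        unsigned long grp_put_pa  = (pa_pstart - 1) / blocks_per_group;  /* 4: locked by ext4_mb_put_pa() */

Two different group locks end up "protecting" the same bb_prealloc_list, so
a racing list_del() and list_add() can interleave and corrupt it.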

Something like this should fix it:

static void ext4_mb_put_pa(struct ext4_allocation_context *ac, ...)
{
...
        /* -1 is to protect from crossing allocation group */
        if (pa->pa_linear)
                pa->pa_pstart--;
        ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
...

This is all somewhat convoluted, though; perhaps it can get cleaned up...

Ben, have you seen this more than once?

-Eric

Comment 10 Ben Woodard 2009-03-16 17:19:07 UTC
I've only seen it once. However, race conditions are like that; it sounds like this one is hard to hit. Glad we saw it before it went into production. Good work, Eric! Let me know when you have an ERS2 kernel with the fix in it for me to test, and I'll run all my tests against that.

At the very least we can make sure that we didn't introduce any new problems by doing this, and the longer we test without seeing the problem, the more comfortable we can be that this is in fact the bug I triggered, and not some other, even more subtle one that you haven't managed to spot yet.

Comment 14 Eric Sandeen 2009-03-26 15:56:50 UTC
Marking as a dup of bug 485315 since the rebase to 2.6.29 will pick up this fix.

Thanks for the report!

*** This bug has been marked as a duplicate of bug 485315 ***