Bug 699708

Summary:

kernel BUG at kernel/timer.c:844!

Product:

[Fedora] Fedora

Reporter:

Albert Strasheim <fullung>

Component:

kernel

Assignee:

Lukáš Czerner <lczerner>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

high

Version:

CC:

fullung, gansalmon, itamar, jonathan, kernel-maint, lczerner, madhu.chinakonda

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-2.6.38.8-32.fc15

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-06-23 08:48:05 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Patch to use schedule_timeout_interruptible()	none
ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread	none

Description Albert Strasheim 2011-04-26 12:32:48 UTC

Description of problem:

[365058.233947] ------------[ cut here ]------------
[365058.238891] kernel BUG at kernel/timer.c:844!
[365058.243580] invalid opcode: 0000 [#1] SMP
[365058.248051] last sysfs file: /sys/devices/virtual/block/loop0/uevent
[365058.254731] CPU 19
[365058.256750] Modules linked in: crc32c_intel hfsplus hfs vfat fat ext2 w83627ehf hwmon_vid adm1021 ipmi_devintf ipmi_si ipmi_msghandler ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core mlx4_en microcode ghes hed ses enclosure i7core_edac ioatdma mlx4_core igb edac_core dca i2c_i801 i2c_core joydev iTCO_wdt iTCO_vendor_support uinput ipv6 dm_round_robin mpt2sas scsi_transport_sas raid_class dm_multipath [last unloaded: scsi_wait_scan]
[365058.302786]
[365058.304597] Pid: 14634, comm: ext4lazyinit Not tainted 2.6.38.2-9.fc15.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
[365058.315514] RIP: 0010:[<ffffffff810615ed>]  [<ffffffff810615ed>] add_timer+0xf/0x1c
[365058.323747] RSP: 0018:ffff88041c489e30  EFLAGS: 00010286
[365058.329380] RAX: 3ffffffffffffffe RBX: ffff88067419e9c0 RCX: 0000000000000000
[365058.337057] RDX: ffff88084d184560 RSI: 0000000000000000 RDI: ffff88067419e9f8
[365058.344741] RBP: ffff88041c489e30 R08: 0000000000000000 R09: 0000061d8dff31d9
[365058.352428] R10: ffff88183fcf3840 R11: ffff88183fcf3840 R12: ffff88067419e9c8
[365058.360115] R13: 0000000115bf00a9 R14: 0000000000000007 R15: ffff88067419ea50
[365058.367801] FS:  0000000000000000(0000) GS:ffff88183fce0000(0000) knlGS:0000000000000000
[365058.376439] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[365058.382511] CR2: 00000000005fd574 CR3: 0000000001a03000 CR4: 00000000000006e0
[365058.390199] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[365058.397883] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[365058.405570] Process ext4lazyinit (pid: 14634, threadinfo ffff88041c488000, task ffff88084d184560)
[365058.414988] Stack:
[365058.417327]  ffff88041c489ee0 ffffffff811a310e 0000000000000000 ffff880b00000008
[365058.425375]  ffff8809cd91fc00 ffff88084d184560 0000000115bf00a6 ffff88067419e9e0
[365058.433407]  ffff88067419ea50 ffff88067419ea60 0000000000013840 0000000000000000
[365058.441441] Call Trace:
[365058.444220]  [<ffffffff811a310e>] ext4_lazyinit_thread+0x253/0x343
[365058.450735]  [<ffffffff8106f132>] ? autoremove_wake_function+0x0/0x3d
[365058.457507]  [<ffffffff811a2ebb>] ? ext4_lazyinit_thread+0x0/0x343
[365058.464015]  [<ffffffff8106ea73>] kthread+0x84/0x8c
[365058.469225]  [<ffffffff8100a9e4>] kernel_thread_helper+0x4/0x10
[365058.475470]  [<ffffffff8106e9ef>] ? kthread+0x0/0x8c
[365058.480763]  [<ffffffff8100a9e0>] ? kernel_thread_helper+0x0/0x10
[365058.487182] Code: 89 75 e8 48 f7 d6 4c 21 e6 31 c9 31 d2 48 89 df e8 9d fd ff ff 5e 5f 5b 41 5c 5d c3 55 48 89 e5 0f 1f 44 00 00 48 83 3f 00 74 02 <0f> 0b 48 8b 77 10 e8 53 ff ff ff 5d c3 55 48 89 e5 0f 1f 44 00
[365058.508352] RIP  [<ffffffff810615ed>] add_timer+0xf/0x1c
[365058.514008]  RSP <ffff88041c489e30>
[365058.518190] ---[ end trace 7a2495e26bd6f594 ]---

Version-Release number of selected component (if applicable):

kernel-2.6.38.2-9.fc15.x86_64

How reproducible:

Sometimes when doing lots of ext4 operations.

Comment 1 Chuck Ebbert 2011-04-27 01:15:25 UTC

void add_timer(struct timer_list *timer)
{
        BUG_ON(timer_pending(timer));
        mod_timer(timer, timer->expires);
}

Happens on CPU 19, meaning there are lots of processors in that machine...

Comment 2 Eric Sandeen 2011-04-27 02:33:45 UTC

Lukas, looks like one for you :)

-Eric

Comment 3 Lukáš Czerner 2011-04-27 08:09:29 UTC

(In reply to comment #0)
> How reproducible:
> 
> Sometimes when doing lots of ext4 operations.

Any particular type of "lots of ext4 operations" ? I have done a lot of testing/benchmarks with lazyinit running and never saw this, so maybe it is workload related ?

Could you provide more information about a box you're using ? Have you been able to reproduce it reliably, or did you saw this on different machines ?

Thanks!
-Lukas

Comment 4 Lukáš Czerner 2011-04-27 14:06:28 UTC

Created attachment 495255 [details]
Patch to use schedule_timeout_interruptible()

I *think* I found the problem. The problem is that add_timer(&eli->li_timer); in ext4_lazyinit_thread() might be called again before the previous timer expired, hence causing the BUG_ON(). It might be easily solved by usind mod_timer() instead, however I have a better solution which uses schedule_timeout_interruptible() and simplify things a lot.

Will you be willing to test this quick & dirty patch since I am not able to reproduce the problem ??

Thanks in advance!
-Lukas

Comment 5 Chuck Ebbert 2011-04-29 02:25:10 UTC

(In reply to comment #4)
> Created attachment 495255 [details]
> Patch to use schedule_timeout_interruptible()
> 
Have you posted this upstream to get some feedback?

Comment 6 Lukáš Czerner 2011-04-29 08:54:47 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Created attachment 495255 [details]
> > Patch to use schedule_timeout_interruptible()
> > 
> Have you posted this upstream to get some feedback?

Of course not. This is not upstream material, the purpose was just to verify that this really caused the problem, since I am not able to reproduce it. Now I am testing the final version of the patch and I will send it upstream soon as it should go in regardless on whether it fixes this problem or not.

Comment 7 Lukáš Czerner 2011-05-02 11:34:36 UTC

Created attachment 496229 [details]
ext4: Use schedule_timeout_interruptible() for waiting  in lazyinit thread

The patch has been posted upstream. I would really appreciate if you can test whether this fixes the problem, or we have problem elsewhere. I do not have enough information to actually reproduce it.

http://marc.info/?l=linux-ext4&m=130433537707670&w=2
http://marc.info/?l=linux-ext4&m=130433537807674&w=2

Thanks!
-Lukas

Comment 8 Albert Strasheim 2011-05-02 21:52:49 UTC

Will test this tomorrow.

Comment 9 Albert Strasheim 2011-05-04 05:47:17 UTC

I haven't been able to reliably reproduce this crash with the original kernel yet.

My test program that caused this once basically created a bunch of multi-gigabytes files in /dev/shm, ran mkfs.ext4 with lazy_itable_init=1 on them, then loopback mounted then, created some directories containing big files (~100 MB), did various operations (read, mmap, etc.), and then unmounted the volumes.

Comment 10 Lukáš Czerner 2011-05-04 07:51:02 UTC

Hi Albert,

what is the licence of you test program ? Can you share it so I can better see what is the load and I can try to reproduce it ?

Thanks!
-Lukas

Comment 11 Albert Strasheim 2011-05-07 19:50:14 UTC

Hello

I haven't been able to reproduce this BUG, even without the patch. It seems that triggering the necessary sequence of events to see the bug is quite tricky.

I'll try to do some more testing to come up with a reproducable test, but if you think you've fixed it, that's good enough for me.

Regards

Albert

Comment 12 Lukáš Czerner 2011-05-07 20:11:12 UTC

Hi Albert,

I believe that this particular bug has been fixed with that patch, but it would have been better to be able to confirm that. But nevermind, thanks a lot for the report and for the testing!

Thanks!
-Lukas

Comment 13 Eric Sandeen 2011-05-24 15:50:35 UTC

Lukas has this one well in hand, assigning over to him...

Thanks,
-Eric

Comment 14 Lukáš Czerner 2011-05-31 16:59:26 UTC

This issue has been fixed with upstream commit 

4ed5c033c11b33149d993734a6a8de1016e8f03f ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread

but there are more useful commits in this area:

e1290b3e62c496ade19939ce036f35bb69306820 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
51ce65115642b77040f5582b8d2fc8815ac450f9 ext4: fix the mount option "init_itable=n" to work as expected for n=0
1bb933fb1fa8e4cb337a0d5dfd2ff4c0dc2073e8 ext4: fix possible use-after-free in ext4_remove_li_request()

though, I am not sure what is proper course of action for Fedora... Chuck ?

Thanks!
-Lukas

Comment 15 Chuck Ebbert 2011-06-23 08:48:05 UTC

(In reply to comment #14)
> This issue has been fixed with upstream commit 
> 
> 4ed5c033c11b33149d993734a6a8de1016e8f03f ext4: Use
> schedule_timeout_interruptible() for waiting in lazyinit thread
>

This is in 2.6.38.8
 
> but there are more useful commits in this area:
> 
> e1290b3e62c496ade19939ce036f35bb69306820 ext4: Remove unnecessary wait_event
> ext4_run_lazyinit_thread()
> 51ce65115642b77040f5582b8d2fc8815ac450f9 ext4: fix the mount option
> "init_itable=n" to work as expected for n=0
> 1bb933fb1fa8e4cb337a0d5dfd2ff4c0dc2073e8 ext4: fix possible use-after-free in
> ext4_remove_li_request()
> 

That last one is also in 2.6.38.8

I'll just close this now - those other two patches don't look as critical.