Bug 493517 - get_partstats() returns NULL and causes panic
Summary: get_partstats() returns NULL and causes panic
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Target Milestone: rc
Target Release: 5.4
Assignee: Jerome Marchand
QA Contact: Red Hat Kernel QE team
Duplicates: 530066 (view as bug list)
Depends On:
Blocks: 533192 517523
Reported: 2009-04-02 03:56 UTC by Lachlan McIlroy
Modified: 2019-04-16 13:59 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 517523 (view as bug list)
Last Closed: 2010-03-30 07:18:33 UTC

Attachments

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description Lachlan McIlroy 2009-04-02 03:56:27 UTC
Description of problem:
There are two kinds of panic caused by get_partstats() returning NULL.  This code is new in 2.6.18-123.el5v.

Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
[<ffffffff801553bc>] part_round_stats+0x1f/0xa5
PGD 4ad46067 PUD 48e51067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq

Entering kdb (task 0xffff810049cb87a0 pid 25481) on cpu 3 Oops: <NULL>
due to oops @ 0xffffffff801553bc
    r15 = 0x0000000000000817      r14 = 0x0000000000000710
    r13 = 0xffff81006bbf2980      r12 = 0x0000000000000012
    rbp = 0xffff810050cf1d78      rbx = 0x00000001106710f3
    r11 = 0x0000000000000000      r10 = 0x0000000000000020
     r9 = 0x0000000000000004       r8 = 0x0000000000000000
    rax = 0x0000000000000000      rcx = 0x4c00000000000000
    rdx = 0x0000000000000000      rsi = 0x0000000000000000
    rdi = 0xffff81006bbf2980 orig_rax = 0xffffffffffffffff
    rip = 0xffffffff801553bc       cs = 0x0000000000000010
 eflags = 0x0000000000010246      rsp = 0xffff810050cf1d30
     ss = 0x0000000000000018    &regs = 0xffff810050cf1c98
[3]kdb> bt
ffff810049cb87a0    25481    25479   3*  R  ffff810049cb8a80 *sadc
RSP              RIP              Function (args)
ffff810050cf1d30 ffffffff801553bc part_round_stats+0x1f (ffff81006bbf2980 20 50cf1e38 0 817)
ffff810050cf1d88 ffffffff80159c3c diskstats_show+0x3a6 (ffff810049220840 ffff81007eb68e00 ffff810049220870 36 35)
ffff810050cf1e78 ffffffff80042922 seq_read+0x1b7 (ffff81007b6816c0 2afb80ada000 1000 ffff810050cf1f48 ffff81007b6816c0)
ffff810050cf1ed8 ffffffff8000b60c vfs_read+0xc8
ffff810050cf1f28 ffffffff80012195 sys_read+0x44

void part_round_stats(struct hd_struct *part)
{
       unsigned long now = jiffies;
       struct partstats *ps = get_partstats(part);         <-- get_partstats() returned NULL, caused panic at next line

       if (now == ps->stamp)
               return;

       if (ps->in_flight) {
               part_stat_add(part, time_in_queue,
                       ps->in_flight * (now - ps->stamp));
               part_stat_add(part, io_ticks, (now - ps->stamp));
       }
       ps->stamp = now;
}

struct partstats *get_partstats(struct hd_struct *part)
{
       struct hlist_head *head;
       struct hlist_node *node;
       struct partstats *p;

       head = &partstats_hash[partstats_hash_fn(part)];
       hlist_for_each_entry_rcu(p, node, head, hlist) {
               if (p->addr == part)
                       return p;
       }
       return NULL;
}

Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
[<ffffffff801161d4>] delete_partition+0x8f/0x114

void delete_partition(struct gendisk *disk, int part)
{
       struct hd_struct *p = disk->part[part-1];
       p->sectors[0] = p->sectors[1] = 0;
       part_stat_reset(p);                                 <<<<< This is something added in 2.6.18-123.
       sysfs_remove_link(&p->kobj, "subsystem");
}

static inline void part_stat_reset(struct hd_struct *part)
{
       struct partstats *ps;
       int i;

       ps = get_partstats(part);   <<<<< get_partstats returned NULL
       for_each_possible_cpu(i)
               memset(per_cpu_ptr(ps->dkstats, i), 0,   <<<<< ps is NULL, panic here.
                      sizeof(struct disk_stats));
}

How reproducible:
Not sure how to reproduce it.  From the code, the return value of get_partstats() should be checked before it is dereferenced.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Anton Arapov 2009-04-02 11:44:22 UTC
Probably caused by the recently adopted statistics patchset,
linux-2.6-block-enhanced-partition-statistics-*.patch, in tree.

It was committed to 2.6.18-104.el5.

Jerome, please, take a look.

Comment 2 Jerome Marchand 2009-04-03 09:11:44 UTC
I have not really investigated yet, but it doesn't seem right that get_partstats() can return NULL at these call sites. I suspect a race condition.

Comment 3 Jerome Marchand 2009-04-06 16:14:21 UTC
The problem is that we use get_partstats() without holding the RCU read lock, so we may end up accessing an already-freed structure. This happens in several places.

Comment 4 Jerome Marchand 2009-04-06 16:58:48 UTC
Wait... That does not explain the bug in part_stat_reset(), where the RCU read lock is held.

Comment 5 Jerome Marchand 2009-04-16 12:36:30 UTC
Hi Lachlan,

I have two questions for you:
What kinds of block devices are on this machine?
Do you have any idea what kind of workload was running on it when the crash occurred?


Comment 6 Lachlan McIlroy 2009-04-16 23:37:14 UTC
Hi Jerome,

I don't know the answer to those questions - the issue is waiting on support, so hopefully we'll get a response.  Maybe a disk was being partitioned at the time iostat was run - is it possible we raced and found a new partition that wasn't completely set up?


Comment 8 Lachlan McIlroy 2009-04-21 07:43:01 UTC
This script was passed to me by mgoodwin.  I tried it in a KVM guest and it didn't crash, but it might have more success on a bare-metal system.

#! /bin/sh

[ $# -ne 1 ] && echo Usage: $0 scratchdevice && exit 1
scratch=$1

while true; do
	parted $scratch mkpart p ext2 1 10
	parted $scratch rm 1
done >/dev/null 2>&1 &

while true; do grep crap /proc/partitions /proc/diskstats; done

Comment 9 Jerome Marchand 2009-08-14 13:17:35 UTC
I've never been able to reproduce the delete_partition() bug and I don't understand how it could happen.

Anyway, I will post a patch that corrects the faulty RCU-unprotected accesses. It will at least fix the other kind of panic.

Comment 12 Debbie Johnson 2009-09-23 13:53:49 UTC
This problem has been seen by another customer...

 sdd:<1>Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: 
 [<ffffffff80141291>] part_round_stats+0x13/0x8a
PGD c5091c067 PUD c65aef067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:02:00.1/irq
CPU 10 
Modules linked in: mptctl mptbase vfat fat usb_storage ipmi_devintf ipmi_si ipmi_msghandler dell_rbu autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc bonding cpufreq_ondemand acpi_cpufreq freq_table emcpdm(PU) emcpgpx(PU) emcpmpx(PU) emcp(PU) emcplib(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom sg bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp megaraid_sas ext3 jbd uhci_hcd ohci_hcd ehci_hcd qla2xxx(U) sd_mod scsi_mod qla2xxx_conf(U) intermodule(U)
Pid: 11401, comm: vol_id Tainted: P      2.6.18-164.el5 #1
RIP: 0010:[<ffffffff80141291>]  [<ffffffff80141291>] part_round_stats+0x13/0x8a
RSP: 0018:ffff810c680d99d8  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810c7fc31680 RCX: b400000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff810c7fc31680
RBP: 00000001003c6b5a R08: 000000000000000f R09: ffff810116c78000
R10: ffff8106947689c8 R11: ffffffff801465ba R12: 0000000000000000
R13: 0000000000000002 R14: 0000000012bfef3f R15: 0000000000800011
FS:  000000001889a880(0063) GS:ffff8106951a7940(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000c63323000 CR4: 00000000000006e0
Process vol_id (pid: 11401, threadinfo ffff810c680d8000, task ffff810c6717f040)
Stack:  ffff810c7fc31680 ffff810c6b283c58 0000000000000000 ffffffff80141459
 ffff810c6b283c58 ffff810c7e0900c8 0000000012bfef3f ffffffff8000c047
 0000000000000001 0000000000000400 ffff810c647054c0 0000000000000282
Call Trace:
 [<ffffffff80141459>] drive_stat_acct+0x151/0x175
 [<ffffffff8000c047>] __make_request+0x385/0x401
 [<ffffffff8001c049>] generic_make_request+0x211/0x228
 [<ffffffff80033472>] submit_bio+0xe4/0xeb
 [<ffffffff8001a793>] submit_bh+0xf1/0x111
 [<ffffffff800e0a81>] block_read_full_page+0x259/0x276
 [<ffffffff800e3d27>] blkdev_get_block+0x0/0x46
 [<ffffffff8000c6dd>] add_to_page_cache+0xaa/0xc1
 [<ffffffff80012ead>] __do_page_cache_readahead+0x140/0x179
 [<ffffffff8003245d>] blockable_page_cache_readahead+0x53/0xb2
 [<ffffffff80013f68>] page_cache_readahead+0x13d/0x1af
 [<ffffffff8000c189>] do_generic_mapping_read+0xc6/0x354
 [<ffffffff8000d0b6>] file_read_actor+0x0/0x159
 [<ffffffff8000c563>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff800c5145>] generic_file_read+0xac/0xc5
 [<ffffffff8009f6c1>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800b61b0>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8000b695>] vfs_read+0xcb/0x171
 [<ffffffff80011b72>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 48 3b 68 28 49 89 c4 74 69 83 78 38 00 74 5e 48 89 df e8 22 
RIP  [<ffffffff80141291>] part_round_stats+0x13/0x8a
 RSP <ffff810c680d99d8>

crash> dis part_round_stats
0xffffffff8014127e <part_round_stats>:  push   %r12
0xffffffff80141280 <part_round_stats+2>:        push   %rbp
0xffffffff80141281 <part_round_stats+3>:        mov    2809848(%rip),%rbp        # 0xffffffff803ef280
0xffffffff80141288 <part_round_stats+10>:       push   %rbx
0xffffffff80141289 <part_round_stats+11>:       mov    %rdi,%rbx
0xffffffff8014128c <part_round_stats+14>:       callq  0xffffffff801445ca <get_partstats>
0xffffffff80141291 <part_round_stats+19>:       cmp    0x28(%rax),%rbp <== here why panic %rax = 0
0xffffffff80141295 <part_round_stats+23>:       mov    %rax,%r12
0xffffffff80141298 <part_round_stats+26>:       je     0xffffffff80141303 <part_round_stats+133>
0xffffffff8014129a <part_round_stats+28>:       cmpl   $0x0,0x38(%rax)
0xffffffff8014129e <part_round_stats+32>:       je     0xffffffff801412fe <part_round_stats+128>

crash> dis 0xffffffff801445ca
0xffffffff801445ca <get_partstats>:     mov    $0x3ffffffffc0001,%rax

This shows get_partstats() was called right before the faulting instruction, so the code analysis in the description applies to this crash as well.

This customer is also using EMC storage.  Will attach IT.

Comment 13 Debbie Johnson 2009-09-23 13:54:51 UTC
NOTE:  Latest customer is at RHEL 5.4

Comment 14 Debbie Johnson 2009-09-23 13:59:23 UTC
Elevating priority and severity to match ITs.  This is now being seen on RHEL 5.4 and by more than one customer.  Both customers are using EMC storage.

Comment 15 RHEL Product and Program Management 2009-09-25 17:40:25 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update release.

Comment 18 Chris Ward 2009-10-14 10:56:56 UTC

We need to confirm that there is commitment to test 
for the resolution of this request during the RHEL 5.5 test
phase, if it is accepted into the release. 

Please post a confirmation before Oct 16th, 2009, 
including the contact information for testing engineers.

Comment 23 Jan Tluka 2009-10-21 12:55:05 UTC
As this bug covers two different issues, the delete_partition() panic is now tracked in bug 530066. This bug will track the part_round_stats() panic.

Comment 26 Lachlan McIlroy 2009-10-22 02:21:51 UTC
I don't see how the posted patch will fix this issue.

What I think is happening here is a race between the various users of part_round_stats() and rescan_partitions().

For example in diskstats_show() we have this sequence:

                struct hd_struct *hd = gp->part[n];

                if (!hd || !hd->nr_sects)
                        continue;
                ...
                part_round_stats(hd);

Between assigning 'hd' and calling part_round_stats(), another thread may have called rescan_partitions() and reloaded the partition table, so our 'hd' is now stale, doesn't match the new one, and get_partstats() returns NULL.  This bug has been fixed in BZ 495866.

So to fix the same bug in drive_stat_acct(), blk_account_io_merge(), blk_account_io_done() and part_stats_read(), we need to call rcu_read_lock() before retrieving the partition, which is mostly only a slight change to the posted patch.

Comment 27 Jerome Marchand 2009-10-22 13:43:46 UTC
I guess you're right. I'll nack my patch and fix it.

Comment 28 Jerome Marchand 2009-11-16 09:43:06 UTC
*** Bug 530066 has been marked as a duplicate of this bug. ***

Comment 31 Chris Ward 2010-02-11 10:06:32 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 32 Zhang Kexin 2010-03-07 10:47:47 UTC
(In reply to comment #8)
> This script was passed to me from mgoodwin.  I tried it in a KVM and it didn't
> crash but might have more success on a bare-metal system.
> #! /bin/sh
> [ $# -ne 1 ] && echo Usage: $0 scratchdevice && exit 1
> scratch=$1
> while true; do
>  parted $scratch mkpart p ext2 1 10
>  parted $scratch rm 1
> done >/dev/null 2>&1 &
> while true; do grep crap /proc/partitions /proc/diskstats; done    

Hi Jerome, on what machine did you reproduce it? And how long did the program run before the panic?

I tried it on the beaker machine ibm-himalaya.rhts.eng.bos.redhat.com, which has 3 disks, and used /dev/sdb as the parameter. The program has run for more than 20 minutes and still does not trigger the bug.

thanks in advance.

Comment 37 errata-xmlrpc 2010-03-30 07:18:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

