Bug 493517

| Summary: | get_partstats() returns NULL and causes panic | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> |
| Component: | kernel | Assignee: | Jerome Marchand <jmarchan> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Priority: | high |
| Version: | 5.3 | Target Milestone: | rc |
| Target Release: | 5.4 | Hardware: | All |
| OS: | Linux | Doc Type: | Bug Fix |
| CC: | anton, cward, dejohnso, dzickus, jmarchan, jtluka, kzhang, tao, vgaikwad | Cloned to: | 517523 |
| Bug Blocks: | 517523, 533192 | Last Closed: | 2010-03-30 07:18:33 UTC |
Description
Lachlan McIlroy, 2009-04-02 03:56:27 UTC
Probably caused by the recently adopted statistics patchset (linux-2.6-block-enhanced-partition-statistics-*.patch, in the tree), which was committed in 2.6.18-104.el5. Jerome, please take a look.

---

I have not really investigated yet, but it doesn't seem right that get_partstats() may return NULL at these places. I suspect some race condition.

---

The problem is that we use get_partstats() without the RCU read lock, so we may end up trying to access an already freed structure. It happens in several places. Wait... that does not explain the bug in part_stat_reset(), where the RCU read lock is held.
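To make that pattern concrete, here is a minimal kernel-style sketch of the kind of unprotected access being described (the function, the field name, and the exact call site are illustrative assumptions, not the actual RHEL 5 source):

```c
#include <linux/genhd.h>
#include <linux/rcupdate.h>

/*
 * Sketch of the suspected buggy pattern.  get_partstats() looks up
 * per-partition statistics that a concurrent partition rescan can free
 * via RCU.  Without rcu_read_lock() held around both the lookup and the
 * use, the lookup can return NULL for a just-deleted partition, or the
 * structure can be freed underneath us.
 */
static void stats_update_unprotected(struct hd_struct *part)
{
        struct disk_stats *stats = get_partstats(part); /* no RCU read lock held */

        stats->ios++;   /* oopses if get_partstats() returned NULL */
}
```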
---

Hi Lachlan, I have two questions for you:

* What kind of block devices are there on this machine?
* Do you have any idea what kind of workload ran on it when the crash occurred?

Thanks,
Jérôme

---

Hi Jerome,

I don't know the answer to those questions; the issue is waiting on support, so hopefully we get a response. Maybe a disk was being partitioned at the time iostat was run. Is it possible we raced and found a new partition that wasn't completely set up?

Lachlan

---

This script was passed to me by mgoodwin. I tried it in a KVM guest and it didn't crash, but it might have more success on a bare-metal system.

```sh
#!/bin/sh
# Repeatedly create and delete a partition in the background while the
# foreground loop forces reads of /proc/partitions and /proc/diskstats.

[ $# -ne 1 ] && echo "Usage: $0 scratchdevice" && exit 1
scratch=$1

while true; do
        parted $scratch mkpart p ext2 1 10
        parted $scratch rm 1
done >/dev/null 2>&1 &

# 'crap' matches nothing; the grep only exists to read both proc files.
while true; do grep crap /proc/partitions /proc/diskstats; done
```

---

I've never been able to reproduce the delete_partition() bug and I don't understand how it could happen. Anyway, I will post a patch that corrects the faulty RCU-unprotected accesses. It will at least fix the other kind of panic.

---

This problem has been seen by another customer:

```
sdd:<1>Unable to handle kernel NULL pointer dereference at 0000000000000028
RIP: [<ffffffff80141291>] part_round_stats+0x13/0x8a
PGD c5091c067 PUD c65aef067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:02:00.1/irq
CPU 10
Modules linked in: mptctl mptbase vfat fat usb_storage ipmi_devintf ipmi_si ipmi_msghandler dell_rbu autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc bonding cpufreq_ondemand acpi_cpufreq freq_table emcpdm(PU) emcpgpx(PU) emcpmpx(PU) emcp(PU) emcplib(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport joydev sr_mod cdrom sg bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp megaraid_sas ext3 jbd uhci_hcd ohci_hcd ehci_hcd qla2xxx(U) sd_mod scsi_mod qla2xxx_conf(U) intermodule(U)
Pid: 11401, comm: vol_id Tainted: P 2.6.18-164.el5 #1
RIP: 0010:[<ffffffff80141291>]  [<ffffffff80141291>] part_round_stats+0x13/0x8a
RSP: 0018:ffff810c680d99d8  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810c7fc31680 RCX: b400000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff810c7fc31680
RBP: 00000001003c6b5a R08: 000000000000000f R09: ffff810116c78000
R10: ffff8106947689c8 R11: ffffffff801465ba R12: 0000000000000000
R13: 0000000000000002 R14: 0000000012bfef3f R15: 0000000000800011
FS:  000000001889a880(0063) GS:ffff8106951a7940(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000c63323000 CR4: 00000000000006e0
Process vol_id (pid: 11401, threadinfo ffff810c680d8000, task ffff810c6717f040)
Stack:  ffff810c7fc31680 ffff810c6b283c58 0000000000000000 ffffffff80141459
 ffff810c6b283c58 ffff810c7e0900c8 0000000012bfef3f ffffffff8000c047
 0000000000000001 0000000000000400 ffff810c647054c0 0000000000000282
Call Trace:
 [<ffffffff80141459>] drive_stat_acct+0x151/0x175
 [<ffffffff8000c047>] __make_request+0x385/0x401
 [<ffffffff8001c049>] generic_make_request+0x211/0x228
 [<ffffffff80033472>] submit_bio+0xe4/0xeb
 [<ffffffff8001a793>] submit_bh+0xf1/0x111
 [<ffffffff800e0a81>] block_read_full_page+0x259/0x276
 [<ffffffff800e3d27>] blkdev_get_block+0x0/0x46
 [<ffffffff8000c6dd>] add_to_page_cache+0xaa/0xc1
 [<ffffffff80012ead>] __do_page_cache_readahead+0x140/0x179
 [<ffffffff8003245d>] blockable_page_cache_readahead+0x53/0xb2
 [<ffffffff80013f68>] page_cache_readahead+0x13d/0x1af
 [<ffffffff8000c189>] do_generic_mapping_read+0xc6/0x354
 [<ffffffff8000d0b6>] file_read_actor+0x0/0x159
 [<ffffffff8000c563>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff800c5145>] generic_file_read+0xac/0xc5
 [<ffffffff8009f6c1>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800b61b0>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8000b695>] vfs_read+0xcb/0x171
 [<ffffffff80011b72>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 48 3b 68 28 49 89 c4 74 69 83 78 38 00 74 5e 48 89 df e8 22
RIP  [<ffffffff80141291>] part_round_stats+0x13/0x8a
 RSP <ffff810c680d99d8>
```

```
crash> dis part_round_stats
0xffffffff8014127e <part_round_stats>:      push   %r12
0xffffffff80141280 <part_round_stats+2>:    push   %rbp
0xffffffff80141281 <part_round_stats+3>:    mov    2809848(%rip),%rbp   # 0xffffffff803ef280
0xffffffff80141288 <part_round_stats+10>:   push   %rbx
0xffffffff80141289 <part_round_stats+11>:   mov    %rdi,%rbx
0xffffffff8014128c <part_round_stats+14>:   callq  0xffffffff801445ca <get_partstats>
0xffffffff80141291 <part_round_stats+19>:   cmp    0x28(%rax),%rbp    <== faulting instruction; panics because %rax = 0
0xffffffff80141295 <part_round_stats+23>:   mov    %rax,%r12
0xffffffff80141298 <part_round_stats+26>:   je     0xffffffff80141303 <part_round_stats+133>
0xffffffff8014129a <part_round_stats+28>:   cmpl   $0x0,0x38(%rax)
0xffffffff8014129e <part_round_stats+32>:   je     0xffffffff801412fe <part_round_stats+128>

crash> dis 0xffffffff801445ca
0xffffffff801445ca <get_partstats>: mov    $0x3ffffffffc0001,%rax
```

This shows get_partstats() is called right before the faulting instruction: it returned NULL (%rax = 0), and the cmp dereferences offset 0x28 from %rax, matching the faulting address 0x0000000000000028 in CR2. So the code-snippet analysis in the description is the same issue here. This customer is also using EMC storage. Will attach IT.

NOTE: Latest customer is at RHEL 5.4.

---

Elevating priority and severity to match the ITs. This is now being seen on RHEL 5.4 and by more than one customer. Both customers are using EMC storage.

---

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

---

@GSS: We need to confirm that there is commitment to test for the resolution of this request during the RHEL 5.5 test phase, if it is accepted into the release. Please post a confirmation before Oct 16th, 2009, including the contact information for testing engineers.

---

As this bug targets two different issues, the delete_partition() panic is now tracked in bug 530066. This bug will track the part_round_stats() panic.

---

I don't see how the posted patch will fix this issue. What I think is happening here is a race between the various users of part_round_stats() and rescan_partitions(). For example, in diskstats_show() we have this sequence:

```c
struct hd_struct *hd = gp->part[n];

if (!hd || !hd->nr_sects)
        continue;

preempt_disable();
part_round_stats(hd);
preempt_enable();
```

Between assigning 'hd' and calling part_round_stats(), another thread may have called rescan_partitions() and reloaded the partition table, so our 'hd' is now stale, doesn't match the new one, and get_partstats() returns NULL. This instance has been fixed in BZ 495866. So what needs to be done to fix the same bug in drive_stat_acct(), blk_account_io_merge(), blk_account_io_done() and part_stats_read() is to call rcu_read_lock() before retrieving the partition, which is mostly only a slight change to the posted patch.
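Concretely, the change described above amounts to something like the following sketch (assumed names and a representative counter update only; this is not the posted patch itself):

```c
#include <linux/genhd.h>
#include <linux/rcupdate.h>

/*
 * Sketch of the proposed fix (illustrative, not the committed patch):
 * take the RCU read lock *before* fetching the hd_struct pointer, hold
 * it across the get_partstats() lookup, and tolerate a NULL result,
 * which means rescan_partitions() replaced the partition table and our
 * hd_struct is stale.
 */
static void stats_update_protected(struct gendisk *gp, int n)
{
        struct hd_struct *hd;
        struct disk_stats *stats;

        rcu_read_lock();
        hd = gp->part[n];
        if (hd && hd->nr_sects) {
                stats = get_partstats(hd);
                if (stats)
                        stats->ios++;   /* illustrative counter update */
        }
        rcu_read_unlock();
}
```

The NULL check matters as much as the lock: the RCU read lock only guarantees the structure is not freed while we look at it, not that the partition we fetched is still the current one.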
---

I guess you're right. I'll nack my patch and fix it.

---

*** Bug 530066 has been marked as a duplicate of this bug. ***

---

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set this bug to NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug for each request and escalate through your support representative.

---

(In reply to comment #8)
> This script was passed to me from mgoodwin. [...]

Hi Jerome, on what machine did you reproduce it, and how long did the program run before the panic? I tried it on the Beaker machine ibm-himalaya.rhts.eng.bos.redhat.com, which has 3 disks, using /dev/sdb as the parameter; the program has run for more than 20 minutes and still does not trigger the bug. Thanks in advance.

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html