Bug 816888

Summary: kernel panic in qfq_dequeue
Product: Red Hat Enterprise Linux 6
Reporter: Jan Tluka <jtluka>
Component: kernel
Assignee: Cong Wang <amwang>
Status: CLOSED ERRATA
QA Contact: Jan Tluka <jtluka>
Severity: medium
Priority: medium
Version: 6.3
CC: jbrouer, rkhan
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-355.el6
Doc Type: Bug Fix
Doc Text:
Running the QFQ queuing discipline in a virtual guest eventually results in a kernel panic.
Last Closed: 2013-02-21 06:09:53 UTC
Type: Bug
Attachments:
Proposed patch (flags: none)

Description Jan Tluka 2012-04-27 09:04:35 UTC
Description of problem:
While reproducing bug 787637 I hit a kernel panic in qfq_dequeue (trace below).

I used a virtual guest to test the qfq qdisc; the qdisc setup is in the attached shell script. I then used netcat to run a simple TCP stream from this guest to another. The panic is triggered easily.

The virtual guests' NICs were in NAT mode. Virtual guest #1 had IP 192.168.122.10 and virtual guest #2 had IP 192.168.122.20.

Version-Release number of selected component (if applicable):
kernel-2.6.32-262.el6.x86_64

How reproducible:
100% in a virtual environment.
On bare metal it is quite rare; it has been reproduced at least on e1000.

Steps to Reproduce:
1. On virt-guest1, set up a qdisc with qfq (see the attached script for an example).
2. On virt-guest2, start listening on ports 1234 and 1235:
# nc -l 1234 > /dev/null 2>&1
# nc -l 1235 > /dev/null 2>&1
3. On virt-guest1, send traffic to virt-guest2:
# yes | nc $virt_guest2_ip_addr 1234
# yes | nc $virt_guest2_ip_addr 1235

  
Actual results:
kernel panic

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
PGD 1fbed067 PUD 1b103067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:08.0/virtio4/net/eth2/address
CPU 0 
Modules linked in: cls_u32 sch_qfq sch_cbq ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device
snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core
ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi
ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]

Pid: 0, comm: swapper Not tainted 2.6.32-259.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa02c3dca>]  [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
RSP: 0018:ffff880002203da0  EFLAGS: 00010287
RAX: ffffffffffffffb0 RBX: ffff88001f45e0c0 RCX: 0000000000000029
RDX: fffffe0000000000 RSI: 0000000000000001 RDI: ffff88001f45f718
RBP: ffff880002203de0 R08: 0000000000000007 R09: 0000000225c602e3
R10: 00000000ffffffff R11: dead000000200200 R12: 0000000000000013
R13: ffff88001f124ea8 R14: ffff88001f45f6b8 R15: 0028940000000000
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000001b277000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a8d020)
Stack:
 ffff88001f45e000 0028900000000000 ffff880002203de0 ffff88001f4fcc00
<d> ffff88001f4fcc00 0000000000000000 0000000000000001 ffff88001ad640c0
<d> ffff880002203e60 ffffffffa02b9c85 ffff88001f4fcc00 ffff88001f4fcc00
Call Trace:
 <IRQ> 
 [<ffffffffa02b9c85>] cbq_dequeue+0x365/0x730 [sch_cbq]
 [<ffffffff81456c3f>] __qdisc_run+0x3f/0xe0
 [<ffffffff81436c00>] net_tx_action+0x130/0x1c0
 [<ffffffff8102b46d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81073d81>] __do_softirq+0xc1/0x1e0
 [<ffffffff81096b10>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c24c>] call_softirq+0x1c/0x30
 [<ffffffff8100de85>] do_softirq+0x65/0xa0
 [<ffffffff81073b65>] irq_exit+0x85/0x90
 [<ffffffff81502bc0>] smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
 <EOI> 
 [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
 [<ffffffff810149cd>] default_idle+0x4d/0xb0
 [<ffffffff81009e06>] cpu_idle+0xb6/0x110
 [<ffffffff814e137a>] rest_init+0x7a/0x80
 [<ffffffff81c21f7b>] start_kernel+0x424/0x430
 [<ffffffff81c2133a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff81c21438>] x86_64_start_kernel+0xfa/0x109
Code: 7c 03 50 4d 8b 7e 58 e8 b5 f6 ff ff 48 85 c0 0f 84 3c 01 00 00 41 8b 4e
60 be 01 00 00 00 49 8d 7e 60 48 89 f2 48 d3 e2 48 f7 da <48> 23 50 60 49 39 56
50 0f 84 d6 00 00 00 b8 02 00 00 00 49 89 
RIP  [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
 RSP <ffff880002203da0>
CR2: 0000000000000010
---[ end trace 5a9f1207f04b8f6d ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    ---------------   2.6.32-259.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff814fa100>] ? panic+0xa0/0x168
 [<ffffffff814fe2a2>] ? oops_end+0xf2/0x100
 [<ffffffff81043bbb>] ? no_context+0xfb/0x260
 [<ffffffff81043e45>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81043f13>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810445cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8146f25d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8146f4e8>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8150024e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814fd605>] ? page_fault+0x25/0x30
 [<ffffffffa02c3dca>] ? qfq_dequeue+0x30a/0x490 [sch_qfq]
 [<ffffffffa02b9c85>] ? cbq_dequeue+0x365/0x730 [sch_cbq]
 [<ffffffff81456c3f>] ? __qdisc_run+0x3f/0xe0
 [<ffffffff81436c00>] ? net_tx_action+0x130/0x1c0
 [<ffffffff8102b46d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81073d81>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff81096b10>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
 [<ffffffff81073b65>] ? irq_exit+0x85/0x90
 [<ffffffff81502bc0>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
 [<ffffffff810149cd>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
 [<ffffffff814e137a>] ? rest_init+0x7a/0x80
 [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109

Another trace I got was the following:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa02c3688>] qfq_deactivate_class+0x158/0x200 [sch_qfq]
PGD 1f138067 PUD 1ac08067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:08.0/virtio4/net/eth2/address
CPU 0 
Modules linked in: cls_u32 sch_qfq sch_cbq ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device
snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core
ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi
ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]

Pid: 1565, comm: ip Not tainted 2.6.32-259.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa02c3688>]  [<ffffffffa02c3688>] qfq_deactivate_class+0x158/0x200 [sch_qfq]
RSP: 0018:ffff88001b4b15f8  EFLAGS: 00010287
RAX: ffffffffffffffb0 RBX: ffff88001fb800c0 RCX: 0000000000000029
RDX: fffffe0000000000 RSI: 0000000000000001 RDI: ffff88001fb81708
RBP: ffff88001b4b1608 R08: ffff88001fb817b0 R09: ffff88001f682000
R10: 0000000000000667 R11: 0000000000000000 R12: ffff88001fb81708
R13: 0000000000000010 R14: ffff88001fb817b0 R15: 0000000000000013
FS:  00007fed5b7d8700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000001b079000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ip (pid: 1565, threadinfo ffff88001b4b0000, task ffff88001f0f6040)
Stack:
 0000000000000000 ffff88001fb800c0 ffff88001b4b1658 ffffffffa02c37bf
<d> ffff88001fb80000 0000000100000001 ffff88001b4b1638 ffff88001fb80000
<d> ffff88001aeb3000 ffff88001fba9000 ffff88001fba9008 0000000000000000
Call Trace:
 [<ffffffffa02c37bf>] qfq_reset_qdisc+0x5f/0xe0 [sch_qfq]
 [<ffffffff814566d0>] qdisc_reset+0x20/0x50
 [<ffffffffa02b94a0>] cbq_reset+0xd0/0x140 [sch_cbq]
 [<ffffffff814566d0>] qdisc_reset+0x20/0x50
 [<ffffffff814567d3>] dev_deactivate_queue+0x53/0x80
 [<ffffffff81456f51>] dev_deactivate+0x51/0x1e0
 [<ffffffff81439f72>] dev_close+0x62/0xc0
 [<ffffffff814399a1>] dev_change_flags+0xa1/0x1d0
 [<ffffffff81446cc5>] do_setlink+0x1f5/0x860
 [<ffffffff8112b320>] ? __lru_cache_add+0x40/0x90
 [<ffffffff81287b64>] ? nla_parse+0x34/0x110
 [<ffffffff8144775a>] rtnl_newlink+0x42a/0x550
 [<ffffffff814468c0>] rtnetlink_rcv_msg+0x1e0/0x220
 [<ffffffff814466e0>] ? rtnetlink_rcv_msg+0x0/0x220
 [<ffffffff81461d79>] netlink_rcv_skb+0xa9/0xd0
 [<ffffffff814466c5>] rtnetlink_rcv+0x25/0x40
 [<ffffffff814619d6>] netlink_unicast+0x2e6/0x300
 [<ffffffff81462360>] netlink_sendmsg+0x200/0x2e0
 [<ffffffff81426093>] sock_sendmsg+0x123/0x150
 [<ffffffff81091f90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81425cb4>] ? move_addr_to_kernel+0x64/0x70
 [<ffffffff81427be6>] __sys_sendmsg+0x406/0x420
 [<ffffffff81044494>] ? __do_page_fault+0x1e4/0x480
 [<ffffffff81285358>] ? __percpu_counter_add+0x68/0x90
 [<ffffffff8114304b>] ? vma_link+0x9b/0xf0
 [<ffffffff8114517c>] ? do_brk+0x26c/0x350
 [<ffffffff81427e09>] sys_sendmsg+0x49/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 00 00 00 41 8b 44 24 18 49 83 7c c4 28 00 75 db 4c 89 e7 e8 eb fd ff ff
41 8b 4c 24 10 be 01 00 00 00 48 89 f2 48 d3 e2 48 f7 da <48> 23 50 60 49 39 14
24 74 b6 41 8b 44 24 14 48 8d 7b 30 0f b3 
RIP  [<ffffffffa02c3688>] qfq_deactivate_class+0x158/0x200 [sch_qfq]
 RSP <ffff88001b4b15f8>
CR2: 0000000000000010
---[ end trace b4dd33810a077486 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 1565, comm: ip Tainted: G      D    ---------------   2.6.32-259.el6.x86_64 #1
Call Trace:
 [<ffffffff814fa100>] ? panic+0xa0/0x168
 [<ffffffff814fe2a2>] ? oops_end+0xf2/0x100
 [<ffffffff81043bbb>] ? no_context+0xfb/0x260
 [<ffffffff81043e45>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81043f13>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810445cd>] ? __do_page_fault+0x31d/0x480
 [<ffffffff81311abf>] ? extract_buf+0x9f/0x130
 [<ffffffff8131167b>] ? mix_pool_bytes_extract+0x16b/0x180
 [<ffffffff8150024e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814fd605>] ? page_fault+0x25/0x30
 [<ffffffffa02c3688>] ? qfq_deactivate_class+0x158/0x200 [sch_qfq]
 [<ffffffffa02c3675>] ? qfq_deactivate_class+0x145/0x200 [sch_qfq]
 [<ffffffffa02c37bf>] ? qfq_reset_qdisc+0x5f/0xe0 [sch_qfq]
 [<ffffffff814566d0>] ? qdisc_reset+0x20/0x50
 [<ffffffffa02b94a0>] ? cbq_reset+0xd0/0x140 [sch_cbq]
 [<ffffffff814566d0>] ? qdisc_reset+0x20/0x50
 [<ffffffff814567d3>] ? dev_deactivate_queue+0x53/0x80
 [<ffffffff81456f51>] ? dev_deactivate+0x51/0x1e0
 [<ffffffff81439f72>] ? dev_close+0x62/0xc0
 [<ffffffff814399a1>] ? dev_change_flags+0xa1/0x1d0
 [<ffffffff81446cc5>] ? do_setlink+0x1f5/0x860
 [<ffffffff8112b320>] ? __lru_cache_add+0x40/0x90
 [<ffffffff81287b64>] ? nla_parse+0x34/0x110
 [<ffffffff8144775a>] ? rtnl_newlink+0x42a/0x550
 [<ffffffff814468c0>] ? rtnetlink_rcv_msg+0x1e0/0x220
 [<ffffffff814466e0>] ? rtnetlink_rcv_msg+0x0/0x220
 [<ffffffff81461d79>] ? netlink_rcv_skb+0xa9/0xd0
 [<ffffffff814466c5>] ? rtnetlink_rcv+0x25/0x40
 [<ffffffff814619d6>] ? netlink_unicast+0x2e6/0x300
 [<ffffffff81462360>] ? netlink_sendmsg+0x200/0x2e0
 [<ffffffff81426093>] ? sock_sendmsg+0x123/0x150
 [<ffffffff81091f90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81425cb4>] ? move_addr_to_kernel+0x64/0x70
 [<ffffffff81427be6>] ? __sys_sendmsg+0x406/0x420
 [<ffffffff81044494>] ? __do_page_fault+0x1e4/0x480
 [<ffffffff81285358>] ? __percpu_counter_add+0x68/0x90
 [<ffffffff8114304b>] ? vma_link+0x9b/0xf0
 [<ffffffff8114517c>] ? do_brk+0x26c/0x350
 [<ffffffff81427e09>] ? sys_sendmsg+0x49/0x90
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
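
Both oopses report the faulting location in the form symbol+offset/size. As a reading aid, the sketch below decodes such a line. The helper name decode_oops_location and the regular expression are illustrative assumptions, not part of any kernel tooling. It also shows that the hexadecimal offset 0x30a from the first oops equals the decimal offset 778 that appears in the crash output quoted under "Additional info".

```python
import re

# Matches oops/backtrace entries such as:
#   [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
# The optional "? " marks an unreliable frame; the module name is optional.
LOCATION_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(?:\?\s+)?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
    r"(?:\s+\[(\w+)\])?"
)


def decode_oops_location(line):
    """Return address, symbol, offset, size and module for one trace line.

    Hypothetical helper for illustration only. Returns None when the line
    does not contain a symbol+offset/size entry.
    """
    match = LOCATION_RE.search(line)
    if match is None:
        return None
    addr, symbol, offset, size, module = match.groups()
    address = int(addr, 16)
    offset = int(offset, 16)
    return {
        "address": address,
        "symbol": symbol,
        "offset": offset,                    # bytes into the function
        "size": int(size, 16),               # total function size in bytes
        "module": module,                    # None for core kernel symbols
        "function_start": address - offset,  # load address of the function
    }


first = decode_oops_location(
    "IP: [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]"
)
second = decode_oops_location(
    "IP: [<ffffffffa02c3688>] qfq_deactivate_class+0x158/0x200 [sch_qfq]"
)

print(first["symbol"], first["offset"])    # qfq_dequeue 778
print(hex(first["function_start"]))        # 0xffffffffa02c3ac0
print(second["symbol"], second["offset"])  # qfq_deactivate_class 344
```

The absolute addresses depend on where sch_qfq was loaded, so only the symbol and offset are comparable between crashes.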

Expected results:
no panic

Additional info:

Thomas Graf started debugging this issue; the following is his comment from bug 787637:

Looking at the latest backtrace:

    [exception RIP: qfq_dequeue+778]
    RIP: ffffffffa02c2dca  RSP: ffff880002203da0  RFLAGS: 00010287
    RAX: ffffffffffffffb0  RBX: ffff88001b0460c0  RCX: 0000000000000029
    RDX: fffffe0000000000  RSI: 0000000000000001  RDI: ffff88001b047718
    RBP: ffff880002203de0   R8: 0000000000000008   R9: 00000001d5f5e068
    R10: 00000075435bc680  R11: dead000000200200  R12: 0000000000000013
    R13: ffff88001f19a6a8  R14: ffff88001b0476b8  R15: 0178560000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880002203de8] cbq_dequeue at ffffffffa02b8c85 [sch_cbq]
#10 [ffff880002203e68] __qdisc_run at ffffffff81456ccf
#11 [ffff880002203e98] net_tx_action at ffffffff81436bf0
#12 [ffff880002203ed8] __do_softirq at ffffffff81073d81
#13 [ffff880002203f48] call_softirq at ffffffff8100c24c
#14 [ffff880002203f60] do_softirq at ffffffff8100de85
#15 [ffff880002203f80] irq_exit at ffffffff81073b65
#16 [ffff880002203f90] smp_apic_timer_interrupt at ffffffff81502c60
#17 [ffff880002203fb0] apic_timer_interrupt at ffffffff8100bc13


crash> dis qfq_dequeue+742 15
0xffffffffa02c2da6 <qfq_dequeue+742>:   callq  0xffffffffa02c2460 <qfq_slot_scan>
0xffffffffa02c2dab <qfq_dequeue+747>:   test   %rax,%rax
0xffffffffa02c2dae <qfq_dequeue+750>:   je     0xffffffffa02c2ef0
0xffffffffa02c2db4 <qfq_dequeue+756>:   mov    0x60(%r14),%ecx
0xffffffffa02c2db8 <qfq_dequeue+760>:   mov    $0x1,%esi
0xffffffffa02c2dbd <qfq_dequeue+765>:   lea    0x60(%r14),%rdi
0xffffffffa02c2dc1 <qfq_dequeue+769>:   mov    %rsi,%rdx
0xffffffffa02c2dc4 <qfq_dequeue+772>:   shl    %cl,%rdx
0xffffffffa02c2dc7 <qfq_dequeue+775>:   neg    %rdx
0xffffffffa02c2dca <qfq_dequeue+778>:   and    0x60(%rax),%rdx

                                                   ^^^^^
RAX: ffffffffffffffb0

So qfq_slot_scan() is evidently returning a bogus pointer for some reason.

0xffffffffa02c2dce <qfq_dequeue+782>:   cmp    %rdx,0x50(%r14)
0xffffffffa02c2dd2 <qfq_dequeue+786>:   je     0xffffffffa02c2eae
0xffffffffa02c2dd8 <qfq_dequeue+792>:   mov    $0x2,%eax
0xffffffffa02c2ddd <qfq_dequeue+797>:   mov    %rdx,0x50(%r14)
0xffffffffa02c2de1 <qfq_dequeue+801>:   shl    %cl,%rax

I think upstream is equally affected.
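
The register values in the dump can be checked against the disassembly with a little 64-bit arithmetic. The sketch below uses only numbers from the trace above; the variable names are illustrative. It confirms that the faulting instruction, and 0x60(%rax),%rdx, reads from RAX + 0x60, which wraps to the 0x10 reported in CR2, and that RDX is what the preceding mov/shl/neg sequence produces for a shift count of 0x29. That RAX equals -0x50 would be consistent with a structure pointer derived from a NULL member pointer at offset 0x50, but that interpretation is an assumption and is not confirmed anywhere in this report.

```python
MASK64 = (1 << 64) - 1  # x86_64 general-purpose registers are 64 bits wide

# Values copied from the register dump in the trace above.
rax = 0xFFFFFFFFFFFFFFB0       # return value of qfq_slot_scan()
rcx = 0x29                     # shift count loaded from 0x60(%r14)
rdx_dump = 0xFFFFFE0000000000  # RDX at the time of the fault
cr2_dump = 0x10                # faulting address reported in CR2

# and 0x60(%rax),%rdx reads memory at rax + 0x60, wrapping at 64 bits.
fault_addr = (rax + 0x60) & MASK64

# mov $0x1,%esi ; mov %rsi,%rdx ; shl %cl,%rdx ; neg %rdx
rdx = (-(1 << rcx)) & MASK64

# RAX reinterpreted as a signed 64-bit integer.
rax_signed = rax - (1 << 64)

print(hex(fault_addr))  # 0x10
print(hex(rdx))         # 0xfffffe0000000000
print(rax_signed)       # -80, i.e. -0x50
```

Both computed values match the dump, so the oops is a read through the pointer returned by qfq_slot_scan() rather than a corrupted register elsewhere in the sequence.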

Comment 1 Martin Prpič 2012-05-09 12:50:28 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Running the QFQ queuing discipline in a virtual guest eventually results in a kernel panic.

Comment 2 RHEL Program Management 2012-07-10 06:42:04 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 3 RHEL Program Management 2012-07-10 23:38:27 UTC
This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 4 Cong Wang 2012-10-31 03:40:29 UTC
Created attachment 635925
Proposed patch

Patches backported from upstream.

Comment 6 RHEL Program Management 2012-10-31 03:50:39 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 7 Jarod Wilson 2013-01-16 17:52:01 UTC
Patch(es)

Comment 10 Jan Tluka 2013-01-18 15:36:53 UTC
Reproduced on 2.6.32-279.5.1.el6.x86_64. The panic was triggered easily.

Verified on 2.6.32-355.el6.x86_64. The bug is gone.

Comment 12 errata-xmlrpc 2013-02-21 06:09:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html