Bug 219895

Summary: The SFQ qdisc crashes in the kernel if a limit of 2 packets is used.
Product: [Fedora] Fedora
Reporter: Steven Elliott <selliott4>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 6
CC: davem, rvokal, wtogami
Hardware: All
OS: Linux
Fixed In Version: 2.6.22.9-61.fc6
Doc Type: Bug Fix
Last Closed: 2007-10-08 21:44:28 UTC

Description Steven Elliott 2006-12-16 00:57:51 UTC
Description of problem:

The SFQ qdisc crashes in the kernel if a limit of 2 packets is used.

Version-Release number of selected component (if applicable):

The following original Fedora Core 6 DVD rpms:
iproute-2.6.16-6.fc6
kernel-2.6.18-1.2798.fc6

How reproducible:

Usually immediately.

Steps to Reproduce:
1. If need be, remove any existing qdisc so that it reverts to the standard
  pfifo_fast qdisc:
    tc qdisc del dev eth0 root
2. Run "sync" to prepare for the crash.
3. Add a SFQ qdisc with a limit of 2 packets.
    tc qdisc add dev eth0 root handle 1: sfq limit 2
4. Cause packets to be emitted from the interface to which SFQ was added in
   step 3:
    ping www.gnu.org
  
Actual results:
Everything hangs such that a power cycle is required.


Expected results:
The tc command should either:
    1) Successfully add the 2 packet SFQ.
    2) Complain that more than 2 packets are required.

Additional info:

If the defect is reproduced at a console (as in Ctrl-Alt-F<1-6>), a kernel stack
trace can be seen the moment "ping" is invoked.  Since the stack trace is not
written to /var/log/messages, here's part of it (manually copied):
  syscall_call()
    sys_socketcall()
      sys_sendmsg()
        sock_sendmsg()
          inet_sendmsg()
            raw_sendmsg()      
              ip_push_pending_frames()
                ip_output()
                  neigh_resolve_output()
                    dev_queue_xmit()
                      __qdisc_run()
The location given in __qdisc_run() is 0x30/0x19b.  The value given for EIP is
sfq_dequeue+0xf6/0x179 in the sch_sfq module.

From disassembling sch_sfq.ko it seems that it is on line 360 of sch_sfq.c:
    sch->qstats.backlog -= skb->len;
where "skb" is an invalid pointer:
    net/sched/sch_sfq.c:360
 194:   ff 4d 28                decl   0x28(%ebp)
 197:   8b 14 24                mov    (%esp),%edx
 19a:   8b 42 60                mov    0x60(%edx),%eax ** crash **
 19d:   29 45 58                sub    %eax,0x58(%ebp)
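
To make the mapping between the C statement and the disassembly concrete, here
is a minimal user-space sketch.  It assumes, based on the mapping above, that
%edx holds "skb", that 0x60 is the offset of "len", and that 0x58(%ebp) is
"sch->qstats.backlog".  The struct names and padding are hypothetical stand-ins
chosen only to reproduce those two offsets; they are not the real kernel
layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: the padding is chosen only so
 * that "len" lands at offset 0x60, the displacement in
 * "mov 0x60(%edx),%eax".  This is NOT the real kernel layout. */
struct mock_skb {
    unsigned char pad[0x60];
    uint32_t len;
};

/* Hypothetical stand-in for struct Qdisc: "backlog" is placed at offset
 * 0x58, the displacement in "sub %eax,0x58(%ebp)". */
struct mock_qdisc {
    unsigned char pad[0x58];
    uint32_t backlog;
};

/* Mirrors the statement on line 360: sch->qstats.backlog -= skb->len;
 * The load of skb->len is the step that faults if skb is invalid. */
void account_dequeue(struct mock_qdisc *sch, const struct mock_skb *skb)
{
    assert(sch != NULL);
    assert(skb != NULL);
    sch->backlog -= skb->len;   /* load from skb + 0x60, subtract at sch + 0x58 */
}
```

Under these assumptions, offsetof(struct mock_skb, len) is 0x60, so a bad
value in the register holding "skb" would fault on exactly the load that is
marked as the crash.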

If it's helpful I can add some trace messages to sch_sfq.ko, but the problem is
quite easy to reproduce for me.
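
As a rough idea of what such a trace message could look like, here is a
user-space sketch of a guarded version of the backlog update.  The struct and
function names are hypothetical, and in the module itself the message would go
through printk() rather than fprintf().  Note that a NULL check only catches a
NULL "skb"; it would not detect other invalid pointer values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mock_skb {
    uint32_t len;
};

struct mock_qdisc {
    uint32_t backlog;
};

/* Guarded form of "sch->qstats.backlog -= skb->len;".
 * Returns 0 if the backlog was updated, or -1 (after logging a trace
 * message) if skb is NULL and the update was skipped. */
int traced_account_dequeue(struct mock_qdisc *sch, const struct mock_skb *skb)
{
    assert(sch != NULL);
    if (skb == NULL) {
        fprintf(stderr, "sfq trace: NULL skb at dequeue (backlog=%u)\n",
                (unsigned int)sch->backlog);
        return -1;
    }
    sch->backlog -= skb->len;
    return 0;
}
```

A trace like this would at least show whether the bad pointer is NULL or some
other value, which is the kind of information that is hard to get from the
console after the hang.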

Comment 1 Radek Vokál 2006-12-19 10:07:12 UTC
That's a really nasty bug; it also crashes my box with the latest rawhide
iproute version. Can somebody from kernel please look at it? I can patch iproute
so that it only allows more than two packets in sfq, but I think this is more of
a kernel bug.

Comment 3 Dave Jones 2007-02-15 00:26:16 UTC
Is this still a problem in the 2.6.19 update?


Comment 4 Steven Elliott 2007-09-17 00:46:33 UTC
I wish I had gotten back to you sooner (assuming you were waiting for me), but
it's still broken.  I've tried it with Fedora 7 (with the stock kernel) and with
Fedora 8 Test 2.  Also, I tried it on SLAX Kill Bill 5.1.7b with kernel 2.6.16.
So, as a function of the kernel version:
  2.6.16               - crash
  2.6.21-1.3194.fc7    - crash
  2.6.23-0.164.rc5.fc8 - crash
As far as I know it's broken across the board (not just specific to Fedora/Red Hat).


Comment 5 Chuck Ebbert 2007-09-17 21:18:29 UTC
Does it work with 3?

Comment 6 Steven Elliott 2007-09-18 01:02:01 UTC
Yes, a limit of 3 packets works.  Or at least it does not crash and the network
is useable (I didn't verify that it's actually behaving in an SFQ manner).

A limit of 1 packet is not a problem since it's rejected with an error message:
  Illegal "limit", must be > 1
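
For illustration, a check that produces this message might look like the
following user-space sketch.  The function name and signature are hypothetical
and this is not the actual iproute source; only the message text is taken from
the output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limit check; NOT the actual iproute source.
 * Returns 0 if the limit is accepted, or -1 if it is rejected, in which
 * case err is filled with the message quoted above. */
int check_sfq_limit(long limit, char *err, size_t errlen)
{
    assert(err != NULL && errlen > 0);
    err[0] = '\0';
    if (limit <= 1) {
        snprintf(err, errlen, "Illegal \"limit\", must be > 1");
        return -1;
    }
    return 0;
}
```

With a check of this shape a limit of 1 is rejected but a limit of 2 is
accepted, which matches what is seen here: the 2 packet case gets past the
command line and only fails later in the kernel.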

Although the crash only happens when the limit is specifically 2 packets (of the
limits I've tried), and although a limit of 2 packets is an odd thing to want, I
hope there is still some interest in tracking down the root cause (that is, not
just forbidding it in the CLI or documenting it as a limitation).  As with any
memory corruption bug, it's hard to know what else may be going wrong until the
problem is understood.

Comment 7 Chuck Ebbert 2007-09-18 17:43:06 UTC
(In reply to comment #0)

> where "skb" is an invalid pointer:
>     net/sched/sch_sfq.c:360
>  194:   ff 4d 28                decl   0x28(%ebp)
>  197:   8b 14 24                mov    (%esp),%edx
>  19a:   8b 42 60                mov    0x60(%edx),%eax ** crash **
>  19d:   29 45 58                sub    %eax,0x58(%ebp)
> 

What is in edx at the time of the crash?


Comment 8 Steven Elliott 2007-09-19 05:31:39 UTC
I don't know of an easy way of figuring out what edx is, but I can see why you'd
want to know that.  It's not part of the crash information written to the
screen.  Or, if it is, it's scrolled off and there is no way to scroll back (since
the system is hung due to the crash).

Maybe the code in the kernel that dumps out the crash information could be
modified to write out the registers.  I suppose some sort of crash dump could be
turned on, but the "Kernel panic - not syncing" does not make me hopeful that it
would actually be written out.

Maybe some sort of emulator, something like User Mode Linux, could be used to
set a breakpoint on the exception handler in order to examine the registers.

Comment 9 Chuck Ebbert 2007-09-19 15:07:36 UTC
(In reply to comment #8)
> I don't know of an easy way of figuring out what edx is, but I can see why you'd
> want to know that. 

The upstream networking developers found the problem and already have a fix in
testing.

Comment 10 Chuck Ebbert 2007-09-28 17:32:25 UTC
Fix is in kernel 2.6.22.9-61.fc6. It will be in the
updates-testing repository soon.



Comment 11 Steven Elliott 2007-10-27 16:50:49 UTC
I've tested this in Fedora 8 Test 3.  It's working.  So the following kernel
version is fixed (does not crash with this problem):
    2.6.23-0.214.rc8.git2.fc8