806704 – kernel oops in do_ip_vs_get_ctl

Summary: kernel oops in do_ip_vs_get_ctl

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Jesper Brouer
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-03-26 03:22 UTC by Ryan O'Hara
Modified:	2012-07-05 19:11 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-07-05 13:11:16 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Trace of kernel oops from /var/log/messages (5.36 KB, text/plain) 2012-03-26 03:23 UTC, Ryan O'Hara
Quick script for attempting to reproduce BZ 806704 (965 bytes, application/x-shellscript) 2012-04-23 12:26 UTC, Jesper Brouer

Description Ryan O'Hara 2012-03-26 03:22:15 UTC

When starting keepalived in F16, the kernel will encounter an oops. This occurs about 60% of the time and does not appear to be related to keepalived.

[root@rocket-01 ~]# uname -r
3.3.0-4.fc16.x86_64

[root@rocket-01 ~]# rpm -q keepalived
keepalived-1.2.2-3.fc16.x86_64

To reproduce, run 'service keepalived start' or 'systemctl start keepalived.service' and check /var/log/messages for an oops trace.
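
A minimal sketch of that log check follows. The helper name find_oops is hypothetical, and the patterns are assumptions about what the trace contains: "BUG:" is the marker grepped for later in this report, and do_ip_vs_get_ctl is the function named in the bug title.

```shell
# find_oops: hypothetical helper, not part of any package.
# Prints matching lines with their line numbers; the log path defaults to
# /var/log/messages and can be overridden with the first argument.
find_oops() {
    grep -n -E 'BUG:|Oops:|do_ip_vs_get_ctl' "${1:-/var/log/messages}"
}
```

Called as `find_oops` right after starting the service, it prints nothing and returns non-zero when no line matches.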

Comment 1 Ryan O'Hara 2012-03-26 03:23:06 UTC

Created attachment 572621
Trace of kernel oops from /var/log/messages

Comment 3 Jesper Brouer 2012-04-23 12:26:03 UTC

Created attachment 579521
Quick script for attempting to reproduce BZ 806704

The script works around the fact that systemctl does not like services being restarted too quickly.

Comment 4 Jesper Brouer 2012-04-23 12:30:00 UTC

Hi Ryan,

I have not been able to reproduce this with
 kernel 3.3.1-5.fc16.x86_64 and keepalived-1.2.2-4.fc16.x86_64.

My basic reproduction loop is:
  systemctl stop keepalived.service
  rmmod ip_vs_rr ip_vs
  systemctl start keepalived.service
  dmesg | grep BUG:

I have also written a small script that handles the fact that systemctl does not allow services to be restarted too quickly; see comment #3.

I have done more than 400 "runs" with the script without reproducing the bug.
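
The loop above could be wrapped roughly as follows. This is only a sketch and not the script attached in comment #3, whose contents are not shown here. RUNS, PAUSE and DRY_RUN are hypothetical knobs introduced for illustration; the pause is an assumed way to keep systemd from seeing restarts that are too quick, and the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch of the stop / rmmod / start loop; NOT the attachment from comment #3.
RUNS=${RUNS:-400}      # number of passes (hypothetical knob)
PAUSE=${PAUSE:-2}      # seconds between passes (assumed value)
DRY_RUN=${DRY_RUN:-1}  # 1 = only print the commands, 0 = really run them

# Run a command, or just print it when in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One pass of the basic loop from this comment.
one_pass() {
    run systemctl stop keepalived.service
    run rmmod ip_vs_rr ip_vs
    run systemctl start keepalived.service
}

# Repeat the pass and stop at the first BUG: line seen in dmesg.
reproduce() {
    i=1
    while [ "$i" -le "$RUNS" ]; do
        one_pass
        run sleep "$PAUSE"
        if [ "$DRY_RUN" != 1 ] && dmesg | grep -q 'BUG:'; then
            echo "oops detected on run $i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no oops after $RUNS runs"
}
```

After sourcing the file as root, `DRY_RUN=0 reproduce` would run the real commands. A BUG: line left in dmesg from an earlier boot-time problem would be reported on the first run, so the kernel log should be clean beforehand.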

Comment 5 Jesper Brouer 2012-04-23 13:18:21 UTC

Cannot reproduce on 3.3.0-4.fc16.x86_64 either.
(also more than 400 runs)

As this is most likely an init race condition, perhaps I need a better/faster config file for keepalived.  I am just running with the default RPM config, which isn't correct for my system (e.g. it references eth0, which I don't have).

Ryan, could I please get your /etc/keepalived/keepalived.conf, or hints on how I should set it up?

Comment 6 Ryan O'Hara 2012-04-23 14:36:13 UTC

I'll post a keepalived.conf file for you to study, but I do not believe this is going to have any effect on reproducing this bug.

There is a discussion about this bug on the lvs-devel mailing lists.

Comment 8 Jesper Brouer 2012-04-24 13:32:31 UTC

(In reply to comment #6)
> I'll post a keepalived.conf file for you to study, but I do not believe this is
> going to have any effect on reproducing this bug.

That's right... I still cannot reproduce.
(This time on a Fedora 16 KVM guest,
 kernel 3.1.0-7.fc16.x86_64,
 keepalived-1.2.2-4.fc16.x86_64)
 
> There is a discussion about this bug on the lvs-devel mailing lists.

Good, guess I should subscribe to that mailing list :-)

It would just be nice if I could somehow reproduce the bug, so that I can verify any upstream fix...

Comment 9 Jesper Brouer 2012-04-25 08:31:36 UTC

Hans Schillstrom <hans.schillstrom> just posted an upstream bugfix :-)

I'll test the patch and thank him for his work :-).

Comment 10 Jesper Brouer 2012-04-25 11:22:05 UTC

Patch:
 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42308

(The patch contains some whitespace nitpicks, and Julian wanted to mark a function as __init.  Hans Schillstrom promised to fix it up and repost.)

I have applied the patch on top of DaveM's net tree (at 2a5809499e) and tested it on a Fedora KVM machine.  It works, but I really cannot verify that it fixes the problem, as I could not reproduce it before...

Hans Schillstrom proposed another way to reproduce, but I don't know if it's worth implementing:

 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42315

Comment 11 Hans Schillstrom 2012-04-26 08:34:55 UTC

There was a possible race in the code that made it possible to send commands from ioctl or netlink before all init was done.

However, it is "impossible" to trigger it on purpose; the time frame in which it is exposed is just too short.

A new patch has been posted with the suggested changes.

If this is a "real life" problem there is an easy workaround:
just modprobe ip_vs in advance, so that all structs get initialized before the user-mode tools are used.
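
The workaround can be sketched as a small shell helper. This is a minimal sketch under assumptions: ip_vs_loaded and start_with_ip_vs are hypothetical names, and the check relies on /proc/modules listing one loaded module per line with the module name first.

```shell
# ip_vs_loaded: hypothetical helper. Succeeds if ip_vs appears as a loaded
# module. The file to inspect defaults to /proc/modules.
ip_vs_loaded() {
    # The trailing space keeps ip_vs_rr and friends from matching.
    grep -q '^ip_vs ' "${1:-/proc/modules}"
}

# start_with_ip_vs: hypothetical helper. Loads ip_vs first, so that its
# init has finished before keepalived starts sending it commands.
start_with_ip_vs() {
    ip_vs_loaded || modprobe ip_vs || return 1
    systemctl start keepalived.service
}
```

Running `modprobe ip_vs` unconditionally has the same effect; the check only makes the intent explicit. To keep the module preloaded across reboots, a file naming it under /etc/modules-load.d/ is the usual systemd mechanism, assuming the installed systemd supports it.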

Comment 12 Hans Schillstrom 2012-04-27 08:11:16 UTC

This patch also solved an old bug:
loading ip_vs.ko inside a network namespace (container).

I tested with a 3.0.13 kernel with and without the patch,
so in that context it's verified.

Comment 13 Jesper Brouer 2012-04-27 08:23:23 UTC

(In reply to comment #12)
> This patch also solved an old BUG:
>  loading of ip_vs.ko inside a networknamespace (container)

Is that an old bugzilla case?

> I tested with a 3.0.13 kernel with and without,
> so in that context it's verified.

Good, and thanks :-)

Comment 14 Ryan O'Hara 2012-05-01 20:42:07 UTC

It sounds like this fix is still making its way upstream. Please continue to work to get this into the Fedora kernel.

Comment 15 Josh Boyer 2012-05-02 16:44:48 UTC

The patch is in Davem's tree here:

http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commitdiff;h=8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0

There are two others that seem related:

http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commitdiff;h=582b8e3eadaec77788c1aa188081a8d5059c42a6
http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commitdiff;h=4b984cd50bc1b6d492175cd77bfabb78e76ffa67

Comment 16 Jesper Brouer 2012-07-05 10:31:59 UTC

(In reply to comment #14)
> It sounds like this fix is still making its way upstream. Please continue to
> work to get this into the Fedora kernel.

Looks like it's in the Fedora 16 kernels now.
Commit 8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0 made it into v3.4-rc6, and F16 has been rebased to 3.4.2 (3.4.2-1.fc16.x86_64).
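
A rough way to check whether a machine is already on a kernel series that contains the commit is to compare the upstream part of `uname -r` against 3.4. The following is a minimal sketch, assuming GNU sort with -V; kernel_at_least is a hypothetical helper name. It does not distinguish release candidates, and an older version number does not prove the fix is missing, since distribution kernels may carry backports.

```shell
# kernel_at_least: hypothetical helper. Succeeds if the upstream version of
# the kernel string (default: the running kernel) is >= the wanted version.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    have=${have%%-*}   # 3.4.2-1.fc16.x86_64 -> 3.4.2
    # With version sort, the smaller of the two versions comes first.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

For example, `kernel_at_least 3.4 && echo "3.4 or later"` checks the running kernel, and a second argument such as 3.3.0-4.fc16.x86_64 checks an arbitrary version string instead.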

Comment 17 Josh Boyer 2012-07-05 13:11:16 UTC

Yes, indeed.  Thanks for reminding us about this bug.
