Bug 806704

Summary: kernel oops in do_ip_vs_get_ctl
Product: [Fedora] Fedora
Reporter: Ryan O'Hara <rohara>
Component: kernel
Assignee: Jesper Brouer <jbrouer>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 16
CC: gansalmon, hans.schillstrom, itamar, jbrouer, jonathan, kernel-maint, madhu.chinakonda, rkhan, rohara, tgraf
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-07-05 13:11:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Attachments:
  Trace of kernel oops from /var/log/messages (flags: none)
  Quick script for attempting to reproduce BZ 806704 (flags: none)

Description Ryan O'Hara 2012-03-26 03:22:15 UTC
When starting keepalived in F16, the kernel will encounter an oops. This occurs about 60% of the time and does not appear to be related to keepalived.

[root@rocket-01 ~]# uname -r
3.3.0-4.fc16.x86_64

[root@rocket-01 ~]# rpm -q keepalived
keepalived-1.2.2-3.fc16.x86_64

Run 'service keepalived start' or 'systemctl start keepalived.service' and check /var/log/messages for a possible oops trace.

Comment 1 Ryan O'Hara 2012-03-26 03:23:06 UTC
Created attachment 572621 [details]
Trace of kernel oops from /var/log/messages

Comment 3 Jesper Brouer 2012-04-23 12:26:03 UTC
Created attachment 579521 [details]
Quick script for attempting to reproduce BZ 806704

Quick script for attempting to reproduce BZ 806704.

The script tries to work around the fact that systemctl doesn't like services being restarted too quickly.

Comment 4 Jesper Brouer 2012-04-23 12:30:00 UTC
Hi Ryan,

I have not been able to reproduce this with
 kernel 3.3.1-5.fc16.x86_64 and keepalived-1.2.2-4.fc16.x86_64

My basic reproduce loop is:
  systemctl stop keepalived.service
  rmmod ip_vs_rr ip_vs
  systemctl start keepalived.service
  dmesg | grep BUG:

I have also written a small script that copes with systemctl not allowing services to be restarted too quickly; see comment #3.

I have done more than 400 "runs" with the script without reproducing the bug.
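
For illustration, the basic reproduce loop above could be wrapped in a shell function along the following lines. This is only a sketch and not the script attached in comment #3: the function names, the RUN dry-run hook and the 5 second PAUSE between runs are assumptions made for this example.

```shell
#!/bin/sh
# Sketch of a reproduce loop for the suspected ip_vs init race.
# RUN is a hypothetical hook: set RUN=echo to print the commands
# instead of executing them (needs root when executed for real).
RUN="${RUN:-}"
# Pause between runs so systemctl does not reject quick restarts.
# 5 seconds is an assumed value, not taken from the attached script.
PAUSE="${PAUSE:-5}"

reproduce_once() {
    $RUN systemctl stop keepalived.service
    # Unload the modules so the next start has to initialise ip_vs again.
    $RUN rmmod ip_vs_rr ip_vs
    $RUN systemctl start keepalived.service
}

reproduce_loop() {
    runs="$1"
    i=1
    while [ "$i" -le "$runs" ]; do
        reproduce_once
        # Stop at the first "BUG:" line in the kernel log. Note that an
        # old entry still in the log buffer will also match.
        if [ -z "$RUN" ] && dmesg | grep -q 'BUG:'; then
            echo "oops detected on run $i"
            return 0
        fi
        $RUN sleep "$PAUSE"
        i=$((i + 1))
    done
    echo "no oops after $runs runs"
    return 1
}
```

Calling `reproduce_loop 400` as root would correspond to the 400 runs mentioned here; with `RUN=echo` set, the function only prints the commands it would run.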

Comment 5 Jesper Brouer 2012-04-23 13:18:21 UTC
Cannot reproduce on 3.3.0-4.fc16.x86_64 either.
(also more than 400 runs)

As this is most likely an init race condition, perhaps I need a better/faster config file for keepalived. I am just running with the default RPM config, which isn't correct for my system (e.g. it references eth0, which I don't have).

Ryan, could I please get your /etc/keepalived/keepalived.conf, or hints on how I should set it up?

Comment 6 Ryan O'Hara 2012-04-23 14:36:13 UTC
I'll post a keepalived.conf file for you to study, but I do not believe this is going to have any effect on reproducing this bug.

There is a discussion about this bug on the lvs-devel mailing lists.

Comment 8 Jesper Brouer 2012-04-24 13:32:31 UTC
(In reply to comment #6)
> I'll post a keepalived.conf file for you to study, but I do not believe this is
> going to have any effect on reproducing this bug.

That's right... I still cannot reproduce.
(This time on a virtual KVM Fedora 16,
 kernel 3.1.0-7.fc16.x86_64,
 keepalived-1.2.2-4.fc16.x86_64)
 
> There is a discussion about this bug on the lvs-devel mailing lists.

Good, guess I should subscribe to that mailing list :-)

It would just be nice if I could somehow reproduce the bug, so I can verify any upstream fix...

Comment 9 Jesper Brouer 2012-04-25 08:31:36 UTC
Hans Schillstrom <hans.schillstrom> just posted an upstream bugfix :-)

I'll test the patch and thank him for his work :-).

Comment 10 Jesper Brouer 2012-04-25 11:22:05 UTC
Patch:
 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42308

(The patch contains some whitespace nitpicks, and Julian wanted to mark a function as __init.  Hans Schillstrom promised to fix it up and repost.)

I have applied the patch on top of DaveM's net tree (at 2a5809499e) and tested it on a Fedora KVM machine.  It works, but I really cannot verify that it fixed the problem, as I could not reproduce it before...

Hans Schillstrom proposed another way to reproduce, but I don't know if it's worth implementing:

 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42315

Comment 11 Hans Schillstrom 2012-04-26 08:34:55 UTC
There was a possible race in the code that made it possible to send commands via ioctl or netlink before all init was done.

However, it is "impossible" to trigger it on purpose; the time frame in which it is exposed is just too short.

A new patch is posted with suggested changes.

If this is a "real life" problem there is an easy workaround,
just modprobe ip_vs in advance, so all structs get initialized before using the user-mode tools.
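
A minimal sketch of that workaround, assuming keepalived is started through systemctl as in the original report. The function name and the RUN dry-run hook are hypothetical and only serve this example.

```shell
#!/bin/sh
# Workaround sketch: load ip_vs before any user-mode tool talks to IPVS.
# RUN is a hypothetical hook: set RUN=echo to print the commands
# instead of executing them (needs root when executed for real).
RUN="${RUN:-}"

start_keepalived_with_preload() {
    # modprobe returns only after module init has completed, so the
    # ip_vs structures are set up before keepalived sends its first
    # ioctl or netlink request.
    $RUN modprobe ip_vs || return 1
    $RUN systemctl start keepalived.service
}
```

Run `start_keepalived_with_preload` as root in place of a plain `systemctl start keepalived.service`.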

Comment 12 Hans Schillstrom 2012-04-27 08:11:16 UTC
This patch also solved an old BUG:
loading of ip_vs.ko inside a network namespace (container)

I tested with a 3.0.13 kernel with and without the patch,
so in that context it's verified.

Comment 13 Jesper Brouer 2012-04-27 08:23:23 UTC
(In reply to comment #12)
> This patch also solved an old BUG:
>  loading of ip_vs.ko inside a networknamespace (container)

Is that an old bugzilla case?

> I tested with a 3.0.13 kernel with and without,
> so in that context it's verified.

Good, and thanks :-)

Comment 14 Ryan O'Hara 2012-05-01 20:42:07 UTC
It sounds like this fix is still making its way upstream. Please continue to work to get this into the Fedora kernel.

Comment 16 Jesper Brouer 2012-07-05 10:31:59 UTC
(In reply to comment #14)
> It sounds like this fix is still making its way upstream. Please continue to
> work to get this into the Fedora kernel.

Looks like it's in the Fedora 16 kernels now.
Commit 8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0 made it into v3.4-rc6, and F16 has been rebased to 3.4.2 (3.4.2-1.fc16.x86_64).

Comment 17 Josh Boyer 2012-07-05 13:11:16 UTC
Yes, indeed.  Thanks for reminding us about this bug.