Bug 806704 - kernel oops in do_ip_vs_get_ctl
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Assigned To: Jesper Brouer
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-03-25 23:22 EDT by Ryan O'Hara
Modified: 2012-07-05 15:11 EDT
Doc Type: Bug Fix
Last Closed: 2012-07-05 09:11:16 EDT
Attachments
Trace of kernel oops from /var/log/messages (5.36 KB, text/plain)
2012-03-25 23:23 EDT, Ryan O'Hara
Quick script for attempting to reproduce BZ 806704 (965 bytes, application/x-shellscript)
2012-04-23 08:26 EDT, Jesper Brouer

Description Ryan O'Hara 2012-03-25 23:22:15 EDT
When starting keepalived in F16, the kernel will encounter an oops. This occurs about 60% of the time and does not appear to be related to keepalived.

[root@rocket-01 ~]# uname -r
3.3.0-4.fc16.x86_64

[root@rocket-01 ~]# rpm -q keepalived
keepalived-1.2.2-3.fc16.x86_64

Run 'service keepalived start' or 'systemctl start keepalived.service' and check /var/log/messages for a possible oops trace.
Comment 1 Ryan O'Hara 2012-03-25 23:23:06 EDT
Created attachment 572621
Trace of kernel oops from /var/log/messages
Comment 3 Jesper Brouer 2012-04-23 08:26:03 EDT
Created attachment 579521
Quick script for attempting to reproduce BZ 806704

The script works around systemctl refusing to restart a service too quickly.
Comment 4 Jesper Brouer 2012-04-23 08:30:00 EDT
Hi Ryan,

I have not been able to reproduce, with
 kernel 3.3.1-5.fc16.x86_64 and keepalived-1.2.2-4.fc16.x86_64

My basic reproduce loop is:
  systemctl stop keepalived.service
  rmmod ip_vs_rr ip_vs
  systemctl start keepalived.service
  dmesg | grep BUG:

I have also written a small script that copes with systemctl not allowing services to be restarted too quickly; see comment #3.

I have done more than 400 "runs" with the script without reproducing the bug.
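
For readers without access to the attachment, the loop above could be sketched roughly as follows. This is not the attached script: the RUN, RUNS and DELAY variables are hypothetical names introduced here, and the fixed sleep is only an assumption about how to avoid systemctl's restart throttling. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduce loop from this comment (not the attached script).
# Dry-run by default: commands are printed, not executed.
# Run as root with RUN= (empty) to actually execute them.
RUN=${RUN-echo}
RUNS=${RUNS:-3}     # hypothetical: number of iterations
DELAY=${DELAY:-2}   # hypothetical: seconds to wait between restarts

reproduce_once() {
    $RUN systemctl stop keepalived.service
    $RUN rmmod ip_vs_rr ip_vs
    $RUN systemctl start keepalived.service
}

i=1
while [ "$i" -le "$RUNS" ]; do
    reproduce_once
    if [ -z "$RUN" ]; then
        # Stop at the first oops found in the kernel log.
        if dmesg | grep -q 'BUG:'; then
            echo "oops found after $i runs"
            break
        fi
        # Give systemd time before the next restart.
        sleep "$DELAY"
    fi
    i=$((i + 1))
done
```

Invoking it as `RUN= RUNS=400 sh reproduce.sh` (file name hypothetical) would correspond to the 400 runs mentioned above.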
Comment 5 Jesper Brouer 2012-04-23 09:18:21 EDT
Cannot reproduce on 3.3.0-4.fc16.x86_64 either.
(also more than 400 runs)

As this is most likely an init race condition, perhaps I need a better/faster config file for keepalived. I am just running with the default RPM config, which isn't correct for my system (e.g. it references eth0, which I don't have).

Ryan, could I please get your /etc/keepalived/keepalived.conf, or hints on how I should set it up?
Comment 6 Ryan O'Hara 2012-04-23 10:36:13 EDT
I'll post a keepalived.conf file for you to study, but I do not believe this is going to have any effect on reproducing this bug.

There is a discussion about this bug on the lvs-devel mailing lists.
Comment 8 Jesper Brouer 2012-04-24 09:32:31 EDT
(In reply to comment #6)
> I'll post a keepalived.conf file for you to study, but I do not believe this is
> going to have any effect on reproducing this bug.

That's right... I still cannot reproduce.
(This time on a virtual KVM Fedora 16,
 kernel 3.1.0-7.fc16.x86_64,
 keepalived-1.2.2-4.fc16.x86_64)
 
> There is a discussion about this bug on the lvs-devel mailing lists.

Good, guess I should subscribe to that mailing list :-)

It would just be nice if I could somehow reproduce the bug, so I can verify any upstream fix...
Comment 9 Jesper Brouer 2012-04-25 04:31:36 EDT
Hans Schillstrom <hans.schillstrom@ericsson.com> just posted an upstream bugfix :-)

I'll test the patch and thank him for his work :-).
Comment 10 Jesper Brouer 2012-04-25 07:22:05 EDT
Patch:
 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42308

(Patch contains some whitespace nitpicks, and Julian wanted to mark a function as __init.  Hans Schillstrom promised to fix it up and repost)

I have applied the patch on top of DaveM's net tree (at 2a5809499e) and tested it on a Fedora KVM machine.  It works, but I really cannot verify that it fixed the problem, as I could not reproduce it before...

Hans Schillstrom proposed another way to reproduce, but I don't know if it's worth implementing:

 http://permalink.gmane.org/gmane.comp.security.firewalls.netfilter.devel/42315
Comment 11 Hans Schillstrom 2012-04-26 04:34:55 EDT
There was a possible race in the code that made it possible to send commands from ioctl or netlink before all init was done.

However, it is "impossible" to trigger it on purpose; the time frame in which it is exposed is just too short.

A new patch is posted with suggested changes.

If this is a "real life" problem there is an easy workaround,
just modprobe ip_vs in advance, so all structs get initialized before using the user-mode tools.
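
A minimal sketch of that workaround, assuming keepalived is the user-mode tool being started. The RUN variable is a hypothetical dry-run switch introduced here; by default the commands are only printed.

```shell
#!/bin/sh
# Workaround sketch: load ip_vs before any user-mode tool talks to it,
# so the module is fully initialized by the time keepalived starts.
# Dry-run by default; run as root with RUN= (empty) to execute.
RUN=${RUN-echo}

preload_ipvs() {
    $RUN modprobe ip_vs
    $RUN systemctl start keepalived.service
}

preload_ipvs
```

On systems where systemd reads /etc/modules-load.d/, a one-line file there containing ip_vs should load the module at boot; that is an assumption about the local setup, not something confirmed in this bug.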
Comment 12 Hans Schillstrom 2012-04-27 04:11:16 EDT
This patch also solved an old BUG:
loading of ip_vs.ko inside a network namespace (container)

I tested with a 3.0.13 kernel with and without,
so in that context it's verified.
Comment 13 Jesper Brouer 2012-04-27 04:23:23 EDT
(In reply to comment #12)
> This patch also solved an old BUG:
>  loading of ip_vs.ko inside a network namespace (container)

Is that an old bugzilla case?

> I tested with a 3.0.13 kernel with and without,
> so in that context it's verified.

Good, and thanks :-)
Comment 14 Ryan O'Hara 2012-05-01 16:42:07 EDT
It sounds like this fix is still making its way upstream. Please continue to work to get this into the Fedora kernel.
Comment 16 Jesper Brouer 2012-07-05 06:31:59 EDT
(In reply to comment #14)
> It sounds like this fix is still making its way upstream. Please continue to
> work to get this into the Fedora kernel.

Looks like it's in the Fedora 16 kernels now.
Commit 8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0 made it into v3.4-rc6, and F16 has been rebased to 3.4.2 (3.4.2-1.fc16.x86_64).
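
A rough way to check whether a machine is already on the rebased kernel, assuming GNU sort with -V is available. The kernel_at_least helper is a hypothetical name introduced here, and a version comparison is only a heuristic: it cannot show whether a given kernel build actually carries the commit.

```shell
#!/bin/sh
# Succeeds if kernel release $1 is at least version $2.
kernel_at_least() {
    have=${1%%-*}   # drop the "-1.fc16.x86_64" style suffix
    want=$2
    # With version sort, the smaller of the two versions comes first.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

if kernel_at_least "$(uname -r)" 3.4.2; then
    echo "running kernel is 3.4.2 or newer"
else
    echo "running kernel predates 3.4.2"
fi
```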
Comment 17 Josh Boyer 2012-07-05 09:11:16 EDT
Yes, indeed.  Thanks for reminding us about this bug.
