Red Hat Bugzilla – Bug 806704
kernel oops in do_ip_vs_get_ctl
Last modified: 2012-07-05 15:11:54 EDT
Starting keepalived on F16 triggers a kernel oops about 60% of the time. The oops does not appear to be caused by keepalived itself.
[root@rocket-01 ~]# uname -r
[root@rocket-01 ~]# rpm -q keepalived
Run 'service keepalived start' or 'systemctl start keepalived.service' and check /var/log/messages for an oops trace.
Created attachment 572621 [details]
Trace of kernel oops from /var/log/messages
Created attachment 579521 [details]
Quick script for attempting to reproduce BZ 806704.
The script works around systemctl rejecting restarts that come too quickly.
I have not been able to reproduce this with
kernel 3.3.1-5.fc16.x86_64 and keepalived-1.2.2-4.fc16.x86_64.
My basic reproduction loop is:
systemctl stop keepalived.service
rmmod ip_vs_rr ip_vs
systemctl start keepalived.service
dmesg | grep BUG:
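For anyone who wants to automate the loop above, here is a minimal sketch. It is not the script attached to this bug: the function names, the PAUSE variable and the default of 400 runs are assumptions made for illustration. It must be run as root on a machine with keepalived installed.

```shell
#!/bin/sh
# Sketch only: function names, PAUSE and the run count are assumptions,
# not taken from the attached script.

# Count lines containing "BUG:" on stdin; always prints a number.
count_bug_lines() {
    grep -c 'BUG:' || true
}

# One stop / unload / start cycle, mirroring the manual loop.
reproduce_once() {
    systemctl stop keepalived.service
    rmmod ip_vs_rr ip_vs 2>/dev/null || true   # modules may already be unloaded
    sleep "${PAUSE:-2}"                        # assumed pause between stop and start
    systemctl start keepalived.service
}

# Repeat the cycle up to $1 times and stop at the first "BUG:" in dmesg.
reproduce_loop() {
    runs=${1:-400}
    i=1
    while [ "$i" -le "$runs" ]; do
        reproduce_once
        if [ "$(dmesg | count_bug_lines)" -gt 0 ]; then
            echo "BUG: seen on run $i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no BUG: in $runs runs"
    return 1
}
```

Example: PAUSE=5 reproduce_loop 400. Note that dmesg prints the whole ring buffer, so a BUG: line left over from an earlier oops would end the loop on the first run; the sketch assumes the log is clean when it starts.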
I have also written a small script that works around systemctl not allowing services to be restarted too quickly; see comment #3.
I have done more than 400 "runs" with the script without reproducing the bug.
Cannot reproduce on 3.3.0-4.fc16.x86_64 either.
(also more than 400 runs)
As this is most likely an init race condition, perhaps I need a better/faster config file for keepalived. I am just running with the default RPM config, which isn't correct for my system (e.g. it references eth0, which I don't have).
Ryan, could I please get your /etc/keepalived/keepalived.conf, or hints on how I should set it up?
I'll post a keepalived.conf file for you to study, but I do not believe this is going to have any effect on reproducing this bug.
There is a discussion about this bug on the lvs-devel mailing lists.
(In reply to comment #6)
> I'll post a keepalived.conf file for you to study, but I do not believe this is
> going to have any effect on reproducing this bug.
That's right... I still cannot reproduce.
(This time on a virtual KVM Fedora 16,
> There is a discussion about this bug on the lvs-devel mailing lists.
Good, guess I should subscribe to that mailing list :-)
It would just be nice if I could somehow reproduce the bug, so I can verify any upstream fix...
Hans Schillstrom <email@example.com> just posted an upstream bugfix :-)
I'll test the patch and thank him for his work :-).
(The patch contains some whitespace nitpicks, and Julian wanted a function marked as __init. Hans Schillstrom promised me he would fix it up and repost.)
I have applied the patch on top of DaveM's net tree (at 2a5809499e) and tested it on a Fedora KVM machine. It works, but I really cannot verify that it fixed the problem, as I could not reproduce it before...
Hans Schillstrom proposed another way to reproduce, but I don't know if it's worth implementing:
There was a possible race in the code that made it possible to send commands via ioctl or netlink before all init was done.
However, it is "impossible" to trigger it on purpose; the time frame when it's exposed is just too short.
A new patch is posted with suggested changes.
If this is a "real life" problem, there is an easy workaround:
just modprobe ip_vs in advance, so all structs get initialized before the user-mode tools are used.
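As a sketch of that workaround, the following loads ip_vs before keepalived is started. The function names are made up for illustration, and the check assumes the usual /proc/modules layout in which each line begins with the module name followed by a space. Run as root.

```shell
#!/bin/sh
# Sketch only: module_loaded and start_keepalived_safely are hypothetical names.

# Return 0 if module $1 is listed in the module table $2 (default /proc/modules).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Load ip_vs first so its init has finished before keepalived talks to it.
start_keepalived_safely() {
    module_loaded ip_vs || modprobe ip_vs
    systemctl start keepalived.service
}
```

The trailing space in the pattern keeps a loaded ip_vs_rr from being mistaken for ip_vs.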
This patch also solved an old BUG:
loading of ip_vs.ko inside a network namespace (container).
I tested with a 3.0.13 kernel with and without the patch,
so in that context it's verified.
(In reply to comment #12)
> This patch also solved an old BUG:
> loading of ip_vs.ko inside a networknamespace (container)
Is that an old bugzilla case?
> I tested with a 3.0.13 kernel with and without,
> so in that context it's verified.
Good, and thanks :-)
It sounds like this fix is still making its way upstream. Please continue to work to get this into the Fedora kernel.
The patch is in Davem's tree here:
There are two others that seem related:
(In reply to comment #14)
> It sounds like this fix is still making its way upstream. Please continue to
> work to get this into the Fedora kernel.
Looks like it's in the Fedora 16 kernels now.
Commit 8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0 made it into v3.4-rc6, and F16 has been rebased to 3.4.2 (3.4.2-1.fc16.x86_64).
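A rough way to check a machine against that version boundary is sketched below. The function name is hypothetical, and the test only compares the release string against 3.4; it says nothing about older kernels that may carry the fix as a backport.

```shell
#!/bin/sh
# Sketch only: kernel_at_least_3_4 is a hypothetical helper name.

# Return 0 if kernel release $1 (default: uname -r) is 3.4 or newer.
kernel_at_least_3_4() {
    release=${1:-$(uname -r)}
    major=${release%%.*}         # "3.4.2-1.fc16.x86_64" -> "3"
    rest=${release#*.}           # -> "4.2-1.fc16.x86_64"
    minor=${rest%%[!0-9]*}       # -> "4"
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 4 ]; }
}
```

Called without an argument it checks the running kernel, for example: kernel_at_least_3_4 && echo "kernel is 3.4 or newer".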
Yes, indeed. Thanks for reminding us about this bug.