User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.7) Gecko/2009030423 Ubuntu/8.10 (intrepid) Firefox/3.0.7

I'm running LVS with a masqueraded target on a different subnet (accessed via a different interface) from the host:

From ipvsadm:

  TCP  143.XXX.XXX.168:25 wrr
    -> 10.YYY.YYY.139:25  Masq  1  0  0

From route -n:

  143.XXX.XXX.128  0.0.0.0  255.255.255.128  U  0  0  0  eth1
  10.YYY.YYY.0     0.0.0.0  255.255.254.0    U  0  0  0  eth0

During the beginning of a TCP handshake, a [SYN] addressed to 143.XXX.XXX.168:25 from the outside is correctly forwarded to 10.YYY.YYY.139:25 (via eth0), which responds with a [SYN,ACK]. This [SYN,ACK] is received and demasqueraded by the director (the host running the ip-vs module), which then correctly emits a [SYN,ACK] packet directed to the connecting client with a source address of the VIP, 143.XXX.XXX.168.

HOWEVER, this packet is emitted on the eth0 interface rather than eth1, and thus is never received by the client.

This was first observed on CentOS 4.7; I have since observed it to also occur with RHEL 5.2.

When reported to the LVS mailing list, at http://permalink.gmane.org/gmane.comp.linux.lvs.user/9691, it was noted that (1) this is not expected behavior, and (2) they were unable to support a vendor kernel.

Reproducible: Always

Steps to Reproduce:
1. Configure (with ipvsadm) a load balancer with IPs on each of two physically separated networks, using "masq" mode to forward traffic addressed to a VIP on one network to one or more realservers on another network.
2. Per the LVS installation process, modify the realserver being directed the traffic to use the load balancer's IP on its subnet as default gateway.
3. Attempt to connect to the VIP from a remote system.

Actual Results:
The TCP connection never gets past the SYN_RECV step, as demasqueraded reply packets from the realserver are emitted from the director on the wrong interface.

Expected Results:
A TCP connection should be established.
This issue is not reproducible when booting otherwise-unmodified RHEL4 or RHEL5 against a current 2.6.28.7 upstream kernel. The default route for outgoing traffic on this host is via its interface on the 10.* network.
Clarification -- for demasqueraded packets to be routed correctly, 2.6.28.7 needs to be used, *and* source routing rules are necessary. Note that this system's primary IP is on the internal (10.YYY.YYY.128/25) network, and its default route is set accordingly:

# ip rule show
0:      from all lookup 255
32765:  from 143.XXX.XXX.128/25 lookup 1
32766:  from all lookup main
32767:  from all lookup default

# ip route show
143.XXX.XXX.128/25 dev eth1  proto kernel  scope link  src 143.XXX.XXX.169
10.YYY.YYY.0/23 dev eth0  proto kernel  scope link  src 10.YYY.YYY.131
169.254.0.0/16 dev eth1  scope link
default via 10.YYY.YYY.1 dev eth0

# ip route show table 1
default via 143.XXX.XXX.129 dev eth1

On the RHEL4 and RHEL5 kernels, the source routing rules are ignored for demasqueraded packets.
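For anyone trying to picture why the source rule matters here, the following is a rough Python model of the policy-routing lookup, not kernel code. The addresses are stand-ins (192.0.2.x for the obfuscated 143.XXX.XXX.x network, 10.0.0.x for 10.YYY.YYY.x, and a made-up client address), the local and default tables are left out, and the idea that the affected kernels effectively pick the route using the realserver's address as source is an assumption on my part; the report only establishes that the source rules are ignored for demasqueraded packets.

```python
import ipaddress

# Stand-in addresses: the report obfuscates the real ones.
VIP = ipaddress.ip_address("192.0.2.168")        # stands in for 143.XXX.XXX.168
REALSERVER = ipaddress.ip_address("10.0.0.139")  # stands in for 10.YYY.YYY.139
CLIENT = ipaddress.ip_address("203.0.113.7")     # hypothetical outside client

# Routing tables as (destination prefix, gateway, device), mirroring
# the shape of "ip route show" and "ip route show table 1".
TABLES = {
    "main": [
        ("192.0.2.128/25", None, "eth1"),
        ("10.0.0.0/23", None, "eth0"),
        ("0.0.0.0/0", "10.0.0.1", "eth0"),
    ],
    "1": [
        ("0.0.0.0/0", "192.0.2.129", "eth1"),
    ],
}

# Policy rules as (priority, source selector, table), mirroring "ip rule show".
RULES = [
    (32765, "192.0.2.128/25", "1"),
    (32766, "0.0.0.0/0", "main"),
]


def lookup(src, dst):
    """Return the egress device for a packet from src to dst, or None."""
    for _prio, selector, table in sorted(RULES):
        if src not in ipaddress.ip_network(selector):
            continue  # rule does not apply to this source address
        matches = [
            (ipaddress.ip_network(prefix), dev)
            for prefix, _gw, dev in TABLES[table]
            if dst in ipaddress.ip_network(prefix)
        ]
        if matches:
            # Longest-prefix match within the selected table.
            return max(matches, key=lambda m: m[0].prefixlen)[1]
    return None


# Route chosen with the realserver's address as source (assumed broken case).
print(lookup(REALSERVER, CLIENT))  # eth0
# Route chosen with the VIP as source, i.e. after demasquerading.
print(lookup(VIP, CLIENT))         # eth1
```

In this model the reply only leaves via eth1 when the lookup is done with the VIP as the source address, which matches the symptom described above.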
Created attachment 336857
Proposed fix

Applying this patch (by Ken Brownfield and Farid Sarwari) works around the issue on RHEL5.2. That said, it includes unnecessary code duplication -- the solution included in current upstream calls ip_route_me_harder() rather than introducing a new variant on this function.
The upstream fix for this is commit 901eaf6c8f997f18ebc8fcbb85411c79161ab3b2, merged as of October 2, 2006 and included in 2.6.19. Note that this change exposes a memory corruption bug in ipt_REJECT, fixed in af443b6d90de17f7630621269cf0610d9d772670 (also included in 2.6.19).

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=901eaf6c8f997f18ebc8fcbb85411c79161ab3b2
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=af443b6d90de17f7630621269cf0610d9d772670
I've backported this to 5.4. Can you please test my test kernel and let me know if this solves your issue? Test kernel rpms can be downloaded from here: http://people.redhat.com/jpirko/test/bz491010/ Please let me know if you need rpm for a different arch.
Jiri, We installed your test kernel on our (x86_64) QA load balancer yesterday; the issue is not reproducible with it installed (masqueraded targets are load-balanced correctly), and the system appears generally stable -- so it looks good here.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Event posted on 10-12-2009 03:59pm EDT by ctatman

Hi Dell,

Can we get some verification from you that this is definitely working now in RHEL5.4? A package was tested back in May, and results were positive then. We just want to make sure that everything is working well in RHEL5.4 now. If it is, then we can close out this Bug.

Thanks!
--Chris

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by ctatman
issue 295632
If it is not working, we would like to confirm that there is commitment to test for the resolution of this request during the RHEL 5.5 test phases, if it is accepted into the release. Please post a confirmation before Oct 16th, 2009, including the contact information for testing engineers.
The issue is validated as resolved in kernel-2.6.18-164.el5, as shipped in RHEL5.4.
(In reply to comment #13)
> The issue is validated as resolved in kernel-2.6.18-164.el5, as shipped in
> RHEL5.4.

Hello Charles.

Double-checking this, the patch for this issue hasn't been included in rhel5 kernel yet. You only validated the patch applied on my test kernel (Comments #5, #6).

Thanks,

Jirka
Jiri, Thank you kindly for the sanity check. I was unable to book time on the primary QA load balancer (on which we've previously replicated this issue) in time for the 10/16 deadline, and built a new one for this test. I'll attempt to isolate the difference (and go back to a known-broken kernel to validate the test), and will see about putting together an oracle for the bug (this will require libvirt+qemu if 'yall want to run it on your end; I doubt that should be an issue).
@Dell, just to be clear, we only need commitment to test once we ship RHEL 5.5 Beta, not that you would need to complete testing of this issue by 10/16. Beta should be shipping in a few months, so there should be time to schedule resources. If you could come up with a reproducer that we could execute on our end, that would be ideal.
Chris, I have a working reproducer, validated against both known-good and known-bad kernel packages. It's presently a bit unwieldy (requires a complete rebuild of the virtual appliance whenever the kernel is swapped out; needs a RPM repository with socat added to build), but I should be able to have a cleaned-up version by the middle of next week. Having this also makes it easy to commit to performing any future tests 'yall may need.
Created attachment 365045
Reproducer/Oracle for this bug

Accepts a kernel RPM, spawns a trio of virtual machines using libvirt+kvm (sharing two completely isolated virtual networks), and emits a PASS or FAIL response determining whether the kernel in question is impacted by #491010. See the README for build and usage instructions.
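As a rough illustration of the PASS/FAIL idea only (this is not the attached reproducer's code, and it does none of the libvirt or VM setup), a client-side check could be as small as the Python sketch below. The function name `probe`, the timeout, and the loopback demo are all hypothetical; in the real test the target would be the VIP on port 25, reached from the client VM.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return "PASS" if a TCP handshake with (host, port) completes, else "FAIL".

    With the bug present, the director's SYN,ACK never reaches the client,
    so the connect attempt would be expected to time out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "PASS"
    except OSError:  # covers timeouts, refusals, and unreachable hosts
        return "FAIL"


if __name__ == "__main__":
    # Loopback demo so the sketch can be run anywhere: a local listener
    # stands in for the VIP.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(probe("127.0.0.1", port, timeout=2.0))
    listener.close()
```

A check like this only covers the final verdict; validating it against both a known-good and a known-bad kernel, as was done for the attachment, is what makes the result trustworthy.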
Charles, thank you for providing the reproducer. One question: I saw in the README file that this test passed on the 2.6.18-148.el5 kernel (however, it looks like it is a patched kernel; correct me if I'm wrong). Do you consider this a regression?
Jan, The kernel on which the test passed was indeed a patched kernel, provided by Jiri in comment 5 on this ticket. As such, I do not consider this a regression.
in kernel-2.6.18-170.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
(In reply to comment #21)
> in kernel-2.6.18-170.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

Don,

As febootstrap doesn't support cross-architecture installs (including 32-on-64), I can't test the i686-only kernel-2.6.18-170.el5 package you provided on my immediately available hosts (which are x86_64-only).

I also don't see a src.rpm to use in generating binaries locally.

Would it be straightforward to provide an x86_64 build?

Thanks!
(In reply to comment #22)
> (In reply to comment #21)
> > in kernel-2.6.18-170.el5
> > You can download this test kernel from http://people.redhat.com/dzickus/el5
>
> Don,
>
> As febootstrap doesn't support cross-architecture installs (including
> 32-on-64), I can't test the i686-only kernel-2.6.18-170.el5 package you
> provided on my immediately available hosts (which are x86_64-only).
>
> I also don't see a src.rpm to use in generating binaries locally.

http://people.redhat.com/dzickus/el5/170.el5/src/kernel-2.6.18-170.el5.src.rpm

> Would it be straightforward to provide an x86_64 build?

http://people.redhat.com/dzickus/el5/170.el5/x86_64/kernel-2.6.18-170.el5.x86_64.rpm

> Thanks!
(In reply to comment #22)
> (In reply to comment #21)
> > in kernel-2.6.18-170.el5
> > You can download this test kernel from http://people.redhat.com/dzickus/el5
>
> Don,
>
> As febootstrap doesn't support cross-architecture installs (including
> 32-on-64), I can't test the i686-only kernel-2.6.18-170.el5 package you
> provided on my immediately available hosts (which are x86_64-only).
>
> I also don't see a src.rpm to use in generating binaries locally.
>
> Would it be straightforward to provide an x86_64 build?
>
> Thanks!

Charles,

I think you looked too early for the builds, as I was still uploading them to my people page. As Jiri pointed out above, they should be there now.
As an aside: The libvirt bundled in RHEL5.4 has an unrelated bug, fixed upstream, which causes false negatives from the reproducer provided (the bridge is created but isn't brought up on a net-define operation if no IP is assigned). I'll be attaching a version of the reproducer with an appropriate workaround shortly.

(In reply to comment #24)
> http://people.redhat.com/dzickus/el5/170.el5/src/kernel-2.6.18-170.el5.src.rpm
> http://people.redhat.com/dzickus/el5/170.el5/x86_64/kernel-2.6.18-170.el5.x86_64.rpm

Those directories existed but were empty when I first checked. Anyhow -- the updated package works for me:

Waiting for rhbz491010.server ............
Waiting for rhbz491010.lb ............
Waiting for rhbz491010.client ............
PASS (kernel-2.6.18-170.el5)
Created attachment 367380
Updated version of reproducer

Updated reproducer works around libvirt bug #532834 (but requires sudo privileges when doing so).
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html