Bug 223258 - Xen guest name resolution fails
Summary: Xen guest name resolution fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel-xen
Version: 6
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Eduardo Habkost
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 186183
Depends On: 237339
Blocks:
Reported: 2007-01-18 18:34 UTC by Kimmo Vuorinen
Modified: 2018-04-11 13:03 UTC
CC List: 7 users

Fixed In Version: 2.6.20-1.2954.fc6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-31 14:29:14 UTC
Type: ---
Embargoed:


Attachments
Modified 8139too driver (38.88 KB, application/octet-stream)
2007-03-21 03:23 UTC, Herbert Xu
New module with sig removed (38.73 KB, application/octet-stream)
2007-03-22 03:59 UTC, Herbert Xu
Fix Xen checksum with 2.6.19 and beyond (26.98 KB, patch)
2007-04-10 05:45 UTC, Herbert Xu

Description Kimmo Vuorinen 2007-01-18 18:34:31 UTC
Description of problem:

Xen guest name resolution fails when TX checksumming is enabled. This is possibly
related to the xen-3.0.3-3.fc6 or kernel-xen-2.6.19-1.2895.fc6 update, as it
worked previously. Guests have a bridged networking setup and run
kernel-xen-2.6.18-1.2869.fc6. Connections opened by IP address work as
expected.

How reproducible:

Always with TX checksumming enabled.

Comment 1 Daniel Berrangé 2007-01-18 19:04:37 UTC
I've looked at the patches in xen-3.0.3-3.fc6 and there is nothing there which
should affect the guest networking. So I suspect the problem is in the
kernel instead.

Can you downgrade your kernel, but keep userspace on xen-3.0.3-3.fc6, and confirm
whether networking then works? If so, we can reassign this ticket to the kernel.


Comment 2 Kimmo Vuorinen 2007-01-18 20:56:17 UTC
I had kernel-xen-2.6.18-1.2868.fc6 installed for dom0 already, so I tested with
it, and guest networking seems to work fine. So it seems kernel related to
me. Although I can ssh out of the guests running under the problematic kernel
using IP addresses, it is unusually slow. Name resolution probably fails because
of that.

Comment 3 Kimmo Vuorinen 2007-01-23 13:41:22 UTC
NFS mounting fails with "mount.nfs: Input/output error" even when TX
checksumming is disabled. nfslock also dies immediately after startup. Failure
is related to kernel-xen-2.6.19-1.2895.fc6 update too.

Comment 4 Kimmo Vuorinen 2007-01-23 14:01:23 UTC
Please disregard previous post, user error.

Comment 5 Herbert Xu 2007-03-06 09:39:34 UTC
I just tried reproducing this with 2911 and failed.  Can you still reproduce
this with the latest fc6 kernel? Thanks.

Comment 6 Kimmo Vuorinen 2007-03-06 10:37:32 UTC
2911 seems to work fine with the other guest systems, but the one running the
bind service for the local network still works only when tx checksumming is off.
This sounds so strange that I'll have to check that system more closely when I
have time.

Comment 7 Herbert Xu 2007-03-06 10:54:30 UTC
Hmm, I thought you were talking about DNS clients failing as Xen guests.  Was
the problem with running a DNS server inside a Xen guest all along? Can you
please generate a packet dump inside the Xen guest running bind? Thanks.

Comment 8 Kimmo Vuorinen 2007-03-06 14:12:12 UTC
This is about DNS clients failing as Xen guests.

Domain0:
Queries to remote DNS servers work and if tx checksumming is off in DomU1,
queries to DomU1 work.

DomU1 running bind:
If tx checksumming is on, local and remote DNS queries fail.

DomU2:
If tx checksumming is on and tx checksumming is off in DomU1, queries to DomU1
work, but remote queries fail (I didn't test remote queries from this DomU
earlier today). If tx checksumming is off, remote queries work too.

I'll find out how to generate those packet dumps a bit later.


Comment 9 Herbert Xu 2007-03-06 21:12:07 UTC
Can you still reproduce this with the 2911 kernel in both dom0 and domU2? What
NIC are you using in dom0?

Comment 10 Kimmo Vuorinen 2007-03-07 06:50:11 UTC
Domain0 and the domUs were all running the 2911 kernel during yesterday's tests,
so the problem is still there.

There are two NICs in dom0, both Unex ND010 with the 8139too driver:

00:08.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Unex Technology Corp. ND010
        Flags: bus master, medium devsel, latency 32, IRQ 12
        I/O ports at dc00 [size=256]
        Memory at ea021000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2

eth0 is used by the domUs with the default bridged setup.


Comment 11 Herbert Xu 2007-03-20 08:50:47 UTC
Hmm I still can't reproduce this.  Unfortunately I don't have a test machine
with an 8139too here.

Could you try a new test? Leave tx checksums on everywhere except for peth0. 
This should tell us whether we've got a problem with the 8139too's checksum
offload emulation.  Thanks.

Comment 12 Kimmo Vuorinen 2007-03-20 11:32:26 UTC
When trying to get information from peth0 about current settings:

[user@host ~]# ethtool -k peth0
Offload parameters for peth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device tx csum settings: Operation not supported
Cannot get device scatter-gather settings: Operation not supported
Cannot get device tcp segmentation offload settings: Operation not supported
no offload info available

[user@host ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

So I can't change settings for peth0.
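
The eth0 output above can be turned into a small dictionary, which makes it easier to compare interfaces. This is only a sketch: `parse_offload` and `ETH0_OUTPUT` are illustrative names, and the parser assumes the `ethtool -k` text layout shown in this comment, which may differ between ethtool versions.

```python
# Sample text copied from the eth0 output in this comment.
ETH0_OUTPUT = """\
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
"""


def parse_offload(text: str) -> dict:
    """Map each offload setting name to its reported value ("on"/"off")."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip the header and the "Cannot get ..." error lines.
        if line.startswith(("Offload parameters", "Cannot get")):
            continue
        if ":" not in line:
            continue  # e.g. "no offload info available"
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


eth0 = parse_offload(ETH0_OUTPUT)
```

Fed the peth0 output instead, the same function returns an empty dictionary, which matches the "no offload info available" result.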

Comment 13 Herbert Xu 2007-03-20 13:03:16 UTC
Ah yes I had forgotten that 8139too doesn't let you tune the option.  I'll try
getting you a version that has tx checksum disabled.

Comment 14 Herbert Xu 2007-03-21 03:23:19 UTC
Created attachment 150551
Modified 8139too driver

OK, I've modified the 8139too driver for 2.6.19-1.2911.fc6xen so that checksums
are always disabled.  Please let me know if this lets the domU's work with
checksums enabled.

Comment 15 Kimmo Vuorinen 2007-03-21 08:11:44 UTC
I have problems loading that module: module signature verification fails.
Modinfo gives the same properties for this one and the original. As far as I
know, this kernel should load unsigned modules, so this one must have a
signature which differs from the kernel's signature. I even tried stripping the
signature without really knowing what I was doing, and all I got was ELF
verification errors :)

Could you make an unsigned module or is there another way around this?

Comment 16 Herbert Xu 2007-03-22 03:59:55 UTC
Created attachment 150639
New module with sig removed

Sorry, I forgot to remove the signature.  This one should work better.

Comment 17 Kimmo Vuorinen 2007-03-22 11:33:09 UTC
I'm sorry to report that the problem is still there when using the provided
driver.

One thing that appeared when trying this driver is the following output when
querying from DomU1, which has bind running (193.229.0.40 being one of my
service provider's nameservers):

[user@host ~]# nslookup www.google.fi 193.229.0.40
;; reply from unexpected source: 127.0.0.1#53, expected 193.229.0.40#53
;; Warning: ID mismatch: expected ID 61472, got 15736
;; reply from unexpected source: 127.0.0.1#53, expected 193.229.0.40#53
;; Warning: ID mismatch: expected ID 61472, got 15736
;; reply from unexpected source: 127.0.0.1#53, expected 193.229.0.40#53
;; Warning: ID mismatch: expected ID 61472, got 15736
;; connection timed out; no servers could be reached

Previously nslookup just timed out.
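
The two warnings correspond to the checks a resolver applies to every UDP reply: it must come from the address and port the query was sent to, and it must carry the query's 16-bit transaction ID. A minimal sketch of that check, using the values from the output above. `reply_matches_query` is a hypothetical helper, not code from nslookup or bind, and the reply bytes are a stand-in header rather than captured data.

```python
import struct


def reply_matches_query(query_id, server, reply_source, reply):
    """Return True only if a DNS reply passes the source and ID checks."""
    if len(reply) < 2:
        return False
    # The transaction ID is the first 16 bits of the DNS header, big-endian.
    (reply_id,) = struct.unpack("!H", reply[:2])
    return reply_source == server and reply_id == query_id


# Values taken from the nslookup output above.
server = ("193.229.0.40", 53)
stand_in_reply = struct.pack("!H", 15736) + b"\x00" * 10  # wrong ID, empty rest
accepted = reply_matches_query(61472, server, ("127.0.0.1", 53), stand_in_reply)
```

Here `accepted` is false on both counts, which is why nslookup discards the reply and eventually times out.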

Again, when tx-checksumming in the domUs is switched off, everything works as expected.

Comment 18 Herbert Xu 2007-03-22 11:38:04 UTC
OK, with that change your driver-side is now identical to mine.  So it's
probably something above that.  Do you have any custom netfilter rules?

How about getting a packet dump in dom0 as well as domU to show what the
checksum says? Thanks.

Comment 19 Kimmo Vuorinen 2007-03-22 11:58:01 UTC
Firewalls in all hosts are quite simple:

INPUT policy changed to DROP

-A INPUT -i lo -j ACCEPT
-A INPUT -i eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT

and required service ports opened as needed by host for example with:
-A INPUT -i eth0 -m state --state NEW -p tcp -m tcp --dport ssh -j ACCEPT

There are no port redirects that could produce output like that in my previous post.

I'll start searching for info on how to do packet dumps; pointers welcome.

Comment 20 Herbert Xu 2007-03-22 12:02:12 UTC
As root in dom0 do

tcpdump -s 1600 -w dom0.dump -p -i peth0

In domU do

tcpdump -s 1600 -w domU.dump -p -i eth0


Comment 21 Kimmo Vuorinen 2007-03-22 16:25:12 UTC
Dumps have been sent directly to Mr. Xu because of possibly sensitive information.

I just discovered a program called Wireshark and have been monitoring Ethernet
traffic on another system that runs the same OS and has the same network adapter
but doesn't use Xen. There seem to be quite a lot of bad checksums in its
network traffic too, but somehow it just works.
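
Bad checksums in a capture taken on the sending host are not necessarily bad on the wire: when TX checksum offload is in effect, the capture sees the packet before the checksum has been filled in. Whether that explains the non-Xen machine above is an assumption. For reference, here is a minimal sketch of the ones'-complement Internet checksum from RFC 1071, which the IPv4 header checksum uses and on which the UDP and TCP checksums are built. The sample header is a generic textbook example, not taken from the dumps in this bug.

```python
def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum of 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# IPv4 header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)

# A receiver recomputes over the header including the stored checksum;
# a result of 0 means the header is intact.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
residual = internet_checksum(filled)
```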

Comment 22 Herbert Xu 2007-03-23 04:58:50 UTC
OK there's definitely something bad going on here.  Your packets aren't showing
up in the peth0 dump at all so dom0 isn't forwarding them on.  Please do another
dump for the same situation:

tcpdump -s 1600 -w vifX.dump -p -i vifX.0

where vifX.0 is the interface for domU2.

I'd also like to see the actual iptables rules with

iptables -v -L -n

and

iptables -t nat -v -L -n

in dom0.  Thanks.

Comment 23 Herbert Xu 2007-03-23 06:54:41 UTC
Hmm, your network setup is more complicated than I thought.  Could you please
include the output of these commands in dom0:

ifconfig
ip ru
ip r l
brctl show

Thanks.

Comment 24 Herbert Xu 2007-03-23 09:00:09 UTC
OK, what does

brctl showmacs xenbr0

say after the DNS requests to an external site? Also what does

ip r l

say in domU2?

Comment 25 Kimmo Vuorinen 2007-03-23 09:31:39 UTC
brctl showmacs xenbr0:
port no mac addr                is local?       ageing timer
  2     00:10:a7:05:a8:a8       no                 1.68
  1     00:10:a7:05:a8:ae       no                 0.68
  2     00:10:a7:05:af:cf       no                91.36
  2     00:15:99:03:60:19       no                23.92
  4     00:16:3e:13:9b:f0       no                 1.60
  3     00:16:3e:13:9b:f1       no                 0.64
  1     fe:ff:ff:ff:ff:ff       yes                0.00
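
To see which bridge ports the guests sit behind, the table above can be parsed. This is a sketch only: `parse_showmacs` is an illustrative name, and it assumes that guest MACs use the 00:16:3e prefix that the Xen tools assign by default.

```python
# Table copied from the brctl output above.
SHOWMACS = """\
port no mac addr                is local?       ageing timer
  2     00:10:a7:05:a8:a8       no                 1.68
  1     00:10:a7:05:a8:ae       no                 0.68
  2     00:10:a7:05:af:cf       no                91.36
  2     00:15:99:03:60:19       no                23.92
  4     00:16:3e:13:9b:f0       no                 1.60
  3     00:16:3e:13:9b:f1       no                 0.64
  1     fe:ff:ff:ff:ff:ff       yes                0.00
"""


def parse_showmacs(text):
    """Return one dict per forwarding-table row, skipping the header."""
    entries = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed rows
        port, mac, is_local, age = fields
        entries.append({
            "port": int(port),
            "mac": mac.lower(),
            "local": is_local == "yes",
            "age": float(age),
        })
    return entries


entries = parse_showmacs(SHOWMACS)
# Ports whose learned MAC carries the Xen prefix (assumed to be guest vifs).
guest_ports = sorted({e["port"] for e in entries if e["mac"].startswith("00:16:3e")})
```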

ip r l in domU2:
192.168.1.0/24 dev eth0  proto kernel  scope link  src 192.168.1.6 
169.254.0.0/16 dev eth0  scope link 
default via 192.168.1.1 dev eth0 
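
The routing table above implies that anything outside 192.168.1.0/24 leaves domU2 via the gateway 192.168.1.1 rather than being delivered on-link. Below is a sketch of the longest-prefix match using Python's ipaddress module. `next_hop` is a hypothetical helper and 192.168.1.5 is an arbitrary example address.

```python
import ipaddress

# (prefix, gateway) pairs from the "ip r l" output; None means on-link.
ROUTES = [
    ("192.168.1.0/24", None),
    ("169.254.0.0/16", None),
    ("0.0.0.0/0", "192.168.1.1"),  # the "default" route
]


def next_hop(destination: str) -> str:
    """Return the address the packet is handed to at the link layer."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, gateway in ROUTES:
        network = ipaddress.ip_network(prefix)
        # Keep the matching route with the longest prefix.
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, gateway)
    if best is None:
        raise LookupError(f"no route to {destination}")
    return best[1] or destination


remote_hop = next_hop("193.229.0.40")  # the ISP nameserver from comment 17
local_hop = next_hop("192.168.1.5")    # arbitrary address on the local subnet
```

So a query to the external nameserver is sent to the gateway first, while local traffic is delivered directly.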


Comment 26 Herbert Xu 2007-03-23 09:42:37 UTC
OK, this makes more sense.  You're not bridging the traffic but routing it
through dom0.  Since I was testing bridging that explains why I didn't see it.

I presume eth1 is 8139too as well? I'll test this scenario.  Thanks.

Comment 27 Kimmo Vuorinen 2007-03-23 10:13:17 UTC
My nonexistent network knowledge says that the domUs are bridged to the
192.168.1.0 network and remote connections are then routed through eth1, which
is also an RTL8139-based card.

Sorry if I gave too much misinformation.


Comment 28 Herbert Xu 2007-03-26 10:15:14 UTC
OK I've reproduced it now.  The new checksum code in 2.6.19 has not been merged
properly with Xen.  I'll try to fix it up.

Comment 29 Herbert Xu 2007-04-10 05:45:01 UTC
Created attachment 152098
Fix Xen checksum with 2.6.19 and beyond

Here's the aggregate patch that will get FC6 to operate correctly with
checksums.  I've verified that it fixes the original problem for me.

Comment 30 Herbert Xu 2007-04-24 22:51:16 UTC
*** Bug 186183 has been marked as a duplicate of this bug. ***

Comment 31 Kimmo Vuorinen 2007-04-25 08:06:50 UTC
Has this patch been applied to any fc6 kernel updates?

Comment 32 Myroslav Opyr 2007-05-23 17:59:42 UTC
(In reply to comment #31)
> Has this patch been applied to any fc6 kernel updates?

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=234008#c54 says that it is
not yet.

Comment 33 Eduardo Habkost 2007-05-26 21:30:37 UTC
I've built a test RPM including Herbert's checksum patches. They are available 
at: http://raisama.net/rh/bz223258/

md5sums:
37783626bc8c13f98d21ea83f6dfeb02  kernel-2.6.20-1.2954.fc6.src.rpm
c8f3a603d12c7c0e9381c93fa00ba3a6  kernel-xen-2.6.20-1.2954.fc6.i686.rpm
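
One way to check a downloaded package against the sums above, sketched with Python's hashlib. `md5_of_file` and `verify` are illustrative names, and the lookup assumes the files keep the names listed above.

```python
import hashlib
import os

# Sums copied from the list above.
EXPECTED = {
    "kernel-2.6.20-1.2954.fc6.src.rpm": "37783626bc8c13f98d21ea83f6dfeb02",
    "kernel-xen-2.6.20-1.2954.fc6.i686.rpm": "c8f3a603d12c7c0e9381c93fa00ba3a6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large RPMs need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """Return True if the file's MD5 matches the sum listed for its name."""
    return md5_of_file(path) == expected[os.path.basename(path)]
```

For example, `verify("kernel-xen-2.6.20-1.2954.fc6.i686.rpm")` returns True only when the downloaded file matches the listed sum.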


Could you test if this kernel solves the problems?

Comment 34 Eduardo Habkost 2007-05-30 21:39:59 UTC
Did you test the kernel-xen-2.6.20-1.2954.fc6 package?

Comment 35 Kimmo Vuorinen 2007-06-02 10:10:34 UTC
I just did the testing. Patch has solved the problem I was experiencing.

Comment 36 Eduardo Habkost 2007-06-04 12:28:55 UTC
Marking as MODIFIED. Fix on kernel-xen 2.6.20-1.2954.fc6 worked.

Comment 37 Red Hat Bugzilla 2007-07-25 01:37:18 UTC
change QA contact

