Bug 222662 - "nfs: server xxxx not responding, still trying" error from FV guest
Summary: "nfs: server xxxx not responding, still trying" error from FV guest
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.0
Hardware: i386
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Xen Maintainance List
QA Contact:
URL:
Whiteboard:
Duplicates: 220584
Depends On:
Blocks:
 
Reported: 2007-01-15 16:05 UTC by YangKun
Modified: 2007-11-30 22:07 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-01-17 23:57:04 UTC
Target Upstream Version:
Embargoed:


Attachments

Description YangKun 2007-01-15 16:05:32 UTC
Description of problem:
In the Red Hat Hardware Test Suite (HTS), we run a UDP test in both the host OS
and FV guests (using the NFS network test). Both the FV guest and the host OS
act as clients, and another machine in our LAN acts as the NFS server. The mount
options are "rw,intr,rsize=32768,wsize=32768,udp". If we mount in the host OS,
everything works well, but if we mount in the FV guest, it prints a
"nfs: server xxxx not responding, still trying" error message. If I remove the
"udp" option from the option list, everything works well in both the host OS and
the FV guest. I'm not sure whether this is a Xen bug. SELinux is enabled
(enforcing) and the firewall is turned on (default settings) on the host OS and
the FV guest; both SELinux and the firewall are turned off/disabled on the
server machine.

Version-Release number of selected component (if applicable):
RHEL5-Server-20070105.0-i386

How reproducible:
always

Steps to Reproduce:
1. create an FV guest
2. boot the FV guest
3. mount with the above mount options (an example command is sketched below)
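
For reference, the mount used in step 3 looks roughly like this (the export path
and mount point are placeholders, not the exact ones from our test script):
-----------------------------------
# mount the NFS export over UDP with 32k rsize/wsize (paths are placeholders)
mount -t nfs -o rw,intr,rsize=32768,wsize=32768,udp \
    <nfs-server>:/export/test /mnt/test
-----------------------------------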

Comment 1 YangKun 2007-01-15 16:07:57 UTC
*** Bug 220584 has been marked as a duplicate of this bug. ***

Comment 2 Stephen Tweedie 2007-01-15 16:19:41 UTC
Can you please try to narrow things down with tcpdump to see if you can identify
where packets are being lost?

Are you mounting with identical NFS options in both the dom0 and the HVM domU?

Does tcpdump in the dom0 give you any clue as to packets being malformed or dropped?
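
Something along these lines, run in the dom0, should show whether the UDP
fragments (and any ICMP errors) are actually making it between the guest and the
server; the interface name and server address below are only examples, so adjust
them to your setup:
-----------------------------------
# capture NFS-related UDP traffic plus ICMP errors on the dom0 interface/bridge
tcpdump -n -i eth0 'host <nfs-server> and (udp or icmp)'
-----------------------------------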

Thanks.


Comment 3 YangKun 2007-01-16 11:53:24 UTC
Yes, I mount with identical NFS options in both the dom0 and the HVM domU.

Actually, we wrote a script to do the test. What we do in the NFS test is:
    1) mount
    2) make some dirs under the mounted dir
    3) cp some files into those dirs
    4) umount
    5) mount again
    6) cp those files back and cmp the copied-back files against the originals,
to check whether any errors happened during the file copying.

I saw the "nfs: server xxxx not responding, still trying" error message between
steps 5) and 6).
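
A rough sketch of the sequence (the server address is ours, but the export path,
mount point and file locations below are simplified placeholders, not the real
HTS script):
-----------------------------------
#!/bin/sh
# simplified sketch of the NFS test sequence; not the actual HTS script
SERVER=10.66.0.87
OPTS=rw,intr,rsize=32768,wsize=32768,udp

mount -t nfs -o "$OPTS" "$SERVER:/export/test" /mnt/test     # 1) mount
mkdir -p /mnt/test/dir1                                      # 2) make dirs
cp /var/tmp/src/* /mnt/test/dir1/                            # 3) copy files in
umount /mnt/test                                             # 4) umount
mount -t nfs -o "$OPTS" "$SERVER:/export/test" /mnt/test     # 5) mount again
mkdir -p /var/tmp/back                                       # 6) copy back and compare
cp /mnt/test/dir1/* /var/tmp/back/
for f in /var/tmp/src/*; do
    cmp "$f" "/var/tmp/back/$(basename "$f")" || echo "MISMATCH: $f"
done
-----------------------------------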

Interestingly, I found the following:
    o I see this error message only when I mount with the "udp" option, and
only in the FV guest;
    o sometimes the test just passes without printing any error; this
direct-pass ratio is about 30%;
    o sometimes the test prints this error but passes anyway after a while
(the waiting time varies, sometimes short, sometimes long), about 69% of the
time;
    o sometimes the test never seems to pass (about 1% of the time); it keeps
printing this error interleaved with "nfs: server x.x.x.x OK" messages, like
the following:

nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 not responding, still trying
....


I checked with tcpdump; nothing looks special to me, but I'm not sure:
-----------------------------------
09:32:01.242319 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.242322 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639221 arp who-has dhcp-0-117.pek.redhat.com tell dhcp-0-087.pek.redhat.com
09:32:01.639237 arp reply dhcp-0-117.pek.redhat.com is-at 00:16:3e:00:3e:10 (oui Unknown)
09:32:01.639336 arp who-has dhcp-0-117.pek.redhat.com tell dhcp-0-087.pek.redhat.com
09:32:01.639339 arp reply dhcp-0-117.pek.redhat.com is-at 00:16:3e:00:3e:10 (oui Unknown)
09:32:01.639439 IP dhcp-0-087.pek.redhat.com.nfs > dhcp-0-117.pek.redhat.com.2827293577: reply ok 1472 read
09:32:01.639447 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639449 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639455 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.
-----------------------------------


Is it possible that FS-Cache is involved? Because I saw the "FS-Cache: Loaded"
and "FS-Cache: netfs 'nfs' registered for caching" messages when I first
mounted (in step 1).

Thanks.

Comment 4 YangKun 2007-01-16 11:57:13 UTC
"dhcp-0-087.pek.redhat.com" is the NFS server machine(10.66.0.87).
"dhcp-0-117.pek.redhat.com" is the FV guest(10.66.0.117).

Comment 5 Stephen Tweedie 2007-01-16 12:26:35 UTC
Can you please try with a smaller nfs blocksize in the guest?

The default NIC emulated in FV mode is an RTL-8139, which has only 64k of ring
buffer.  I suspect that with a 32k blocksize the emulated hardware simply cannot
keep up with the large NFS packets involved.  With a tcp mount that doesn't
matter, because TCP will work out the best window size automatically; but with
udp, lose one fragment of a 32k packet and there's no recovery other than a
complete NFS retry.  Get more than one packet at a time doing that same recovery
and you'll repeatedly overflow the hardware buffer.
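
As a quick experiment, something like one of these in the guest would be
interesting (the export path and mount point are placeholders):
-----------------------------------
# same mount, but with a smaller UDP blocksize
mount -t nfs -o rw,intr,rsize=8192,wsize=8192,udp <nfs-server>:/export/test /mnt/test

# or keep the 32k blocksize but let TCP do the flow control
mount -t nfs -o rw,intr,rsize=32768,wsize=32768,tcp <nfs-server>:/export/test /mnt/test
-----------------------------------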


Comment 6 YangKun 2007-01-17 05:35:01 UTC
I tried 3 smaller blocksizes (8k, 12k, 16k), running each blocksize 10 times;
the results follow:

Blocksize 8k ( rsize=wsize=8192 )
---------------------------
Run#   Result
---------------------------
1    direct PASS
2    direct PASS
3    direct PASS
4    direct PASS
5    direct PASS
6    direct PASS
7    direct PASS
8    direct PASS
9    direct PASS
10   direct PASS



Blocksize 12k ( rsize=wsize=12288 )
---------------------------
Run#   Result
---------------------------
1    direct PASS
2    direct PASS
3    direct PASS
4    direct PASS
5    direct PASS
6    direct PASS
7    direct PASS
8    direct PASS
9    direct PASS
10   direct PASS


Blocksize 16k ( rsize=wsize=16384 )
---------------------------
Run#   Result
---------------------------
1    prompt out Error message, can PASS after a short time wait
2    prompt out Error message, can PASS after a long time wait
3    direct PASS
4    prompt out Error message, can PASS after a short time wait
5    direct PASS
6    prompt out Error message, can PASS after a short time wait
7    prompt out Error message, can PASS after a short time wait
8    direct PASS
9    prompt out Error message, can PASS after a short time wait
10   prompt out Error message, can PASS after a short time wait


Well, I think the 64k ring buffer is what causes this issue. Is there a way to
increase this ring buffer in the FV guest?

Is there a maximum "UDP-mount-safe" block size for the RTL-8139?

Thanks very much.

Comment 7 Stephen Tweedie 2007-01-17 14:18:12 UTC
What's causing the issue is that the test is non-portable and is requesting
something that is not going to work on all hardware.  I expect it will fail on
real rtl-8139 hardware just as badly as on the emulated virtual NIC.  Modern
NICs are unlikely to have the same limits.

The 64k ring buffer is hard-coded into the guest IO model.

As for maximum safe blocksize, I'm not sure: that's really an NFS question.  I
think by default we enable 5 nfs rpc threads in parallel; if that translates to
5 blocks max outstanding at once, then that would be consistent with 12k passing
and 16k failing.  But you'd have to check with an NFS expert.
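
Back-of-the-envelope, assuming those 5 outstanding RPCs really do translate into
5 in-flight blocks hitting the NIC at once (a guess on my part, not a measured
figure):
-----------------------------------
# 5 in-flight blocks versus the 64k (65536 byte) ring buffer
echo $(( 5 * 12288 ))   # 61440 -> fits under 65536, consistent with 12k passing
echo $(( 5 * 16384 ))   # 81920 -> overflows 65536, consistent with 16k failing
-----------------------------------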

Comment 8 YangKun 2007-01-17 23:53:29 UTC
OK, thanks. We have decided to change our test to mount with a 12k blocksize,
so I'll close this bug.
Thanks very much for your help :-)

