Description of problem:
The 2.6 version of the RPC client's XID generator depends on the random driver being initialized before get_random_bytes() is invoked to generate the initial XID for an NFS mount point. If the random driver is not initialized, no error is returned, and the initial XID is left with its previous value, which is zero. As a result, the client sends NFS_ROOT requests with the same XIDs across client reboots. If the server's DRC is large or persistent, it will answer these reused XIDs with a cached reply that has nothing to do with the current request. Symptoms include short write replies, created files that don't exist, and removed files that haven't been removed. Although Red Hat does not officially support NFS_ROOT, some customers do use this feature with RHEL 4, and should be aware of this problem.

Version-Release number of selected component (if applicable):
Affects all 2.6-based versions of RHEL.

How reproducible:
Every boot that uses NFS_ROOT will reuse the same XIDs. The probability that the server will answer with a cached reply varies, so reproducing this problem is sometimes difficult.

Steps to Reproduce:
1. Set up a client system with NFS_ROOT.
2. Set up a server to export the share that is the client's root fs.
3. Reboot the client every few minutes.

Actual results:
The client boot will fail intermittently.

Expected results:
The client should always boot correctly.

Additional info:
This has been reported with a NetApp filer and SuSE SLES 9. The bug comes from the mainline kernel, so RHEL 4 is also susceptible.
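The failure mode in the description can be illustrated with a small user-space sketch. It is not kernel or server code: the cache layout, the function names (drc_handle, client_boot_unseeded, client_boot_seeded, client_next_xid), and the reply strings are all hypothetical, and the cache is keyed on the XID alone, which is a deliberate simplification of what a real server's DRC would match on. The sketch only assumes what the report states: an unseeded boot starts the XID counter at zero, and the server answers a repeated XID from its cache.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DRC_SLOTS 8   /* hypothetical cache size, for illustration only */

/* Hypothetical, simplified duplicate request cache entry (keyed on XID only). */
struct drc_entry {
    int      used;
    uint32_t xid;
    char     reply[32];
};

struct drc {
    struct drc_entry slot[DRC_SLOTS];
};

/*
 * On a hit, return the cached reply and ignore the new request.
 * On a miss, "execute" the request, cache the reply, and return it.
 */
static const char *drc_handle(struct drc *c, uint32_t xid, const char *request)
{
    struct drc_entry *e;

    assert(c != NULL && request != NULL);
    e = &c->slot[xid % DRC_SLOTS];
    if (e->used && e->xid == xid)
        return e->reply;            /* stale reply for a reused XID */

    e->used = 1;
    e->xid = xid;
    snprintf(e->reply, sizeof(e->reply), "done: %s", request);
    return e->reply;
}

/* Hypothetical client state: just the XID counter. */
struct client {
    uint32_t xid;
};

/* Models a boot where the random driver is not ready: the XID stays zero. */
static void client_boot_unseeded(struct client *cl)
{
    cl->xid = 0;
}

/* Models a boot where the initial XID comes from a seeded generator. */
static void client_boot_seeded(struct client *cl, uint32_t seed)
{
    cl->xid = seed;
}

/* Each request takes the current XID and advances the counter. */
static uint32_t client_next_xid(struct client *cl)
{
    return cl->xid++;
}
```

With this model, two consecutive unseeded boots both send their first request with XID 0, so the second boot's request receives the reply cached for the first boot's request, whatever it asked for. A boot that starts from a different initial XID misses the cache and gets a fresh reply.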
Created attachment 126761
patch to use current time as the initial XID

Compile tested only. The attached patch gives a general idea of how this might be fixed.
here's a patch against 2.6.16 that i worked up with trond's help. our customer has tried this in the SLES 9 kernel, and found it to eliminate their issue.

the customer also remarked that net_random_init(), the function used to seed the net_random() random number generator, does not exist in 2.6.5. i haven't looked at 2.6.9, but if net_random_init() does not exist there, imo it should be backported.

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 4dd5b3c..02060d0 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -41,7 +41,7 @@
 #include <linux/types.h>
 #include <linux/interrupt.h>
 #include <linux/workqueue.h>
-#include <linux/random.h>
+#include <linux/net.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/sunrpc/metrics.h>
@@ -830,7 +830,7 @@ static inline u32 xprt_alloc_xid(struct
 
 static inline void xprt_init_xid(struct rpc_xprt *xprt)
 {
-	get_random_bytes(&xprt->xid, sizeof(xprt->xid));
+	xprt->xid = net_random();
 }
 
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4b4e7df..21006b1 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -930,6 +930,13 @@ static void xs_udp_timer(struct rpc_task
 	xprt_adjust_cwnd(task, -ETIMEDOUT);
 }
 
+static unsigned short xs_get_random_port(void)
+{
+	unsigned short range = xprt_max_resvport - xprt_min_resvport;
+	unsigned short rand = (unsigned short) net_random() % range;
+	return rand + xprt_min_resvport;
+}
+
 /**
  * xs_set_port - reset the port number in the remote endpoint address
  * @xprt: generic transport
@@ -1275,7 +1282,7 @@ int xs_setup_udp(struct rpc_xprt *xprt,
 	memset(xprt->slot, 0, slot_table_size);
 
 	xprt->prot = IPPROTO_UDP;
-	xprt->port = xprt_max_resvport;
+	xprt->port = xs_get_random_port();
 	xprt->tsh_size = 0;
 	xprt->resvport = capable(CAP_NET_BIND_SERVICE) ? 1 : 0;
 	/* XXX: header size can vary due to auth type, IPv6, etc. */
@@ -1317,7 +1324,7 @@ int xs_setup_tcp(struct rpc_xprt *xprt,
 	memset(xprt->slot, 0, slot_table_size);
 
 	xprt->prot = IPPROTO_TCP;
-	xprt->port = xprt_max_resvport;
+	xprt->port = xs_get_random_port();
 	xprt->tsh_size = sizeof(rpc_fraghdr) / sizeof(u32);
 	xprt->resvport = capable(CAP_NET_BIND_SERVICE) ? 1 : 0;
 	xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
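The arithmetic in xs_get_random_port() can be checked outside the kernel with a short user-space sketch. The helper name pick_port, the variables min_resvport and max_resvport, and the values 650 and 1023 are assumptions chosen for illustration; they stand in for xprt_min_resvport, xprt_max_resvport, and the output of net_random(), none of which are available in user space. The range guard is also an addition of this sketch and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for xprt_min_resvport and xprt_max_resvport. */
static unsigned int min_resvport = 650;
static unsigned int max_resvport = 1023;

/*
 * Mirrors the arithmetic of xs_get_random_port() from the patch, with the
 * random value passed in so the result is reproducible.
 */
static unsigned short pick_port(uint32_t random_value)
{
    unsigned short range = max_resvport - min_resvport;
    unsigned short offset;

    /* Guard added in this sketch only: a zero range would divide by zero. */
    assert(range != 0);

    /* The cast binds tighter than %, so the value is truncated first. */
    offset = (unsigned short) random_value % range;
    return offset + min_resvport;
}
```

Because the range is computed as max minus min, the offset is always smaller than the range, so under these assumptions the result falls between min_resvport and max_resvport - 1 inclusive; for example, pick_port(0) yields 650 and pick_port(372) yields 1022.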
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 138810
Proposed patch

The patch described in Comment #2 seems to do the trick.
committed in stream U5 build 42.25. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html