Description of problem:

Kernel: 2.6.9-11.4hp.XCsmp

This occurred on a cluster comprised of 48 nodes.

Dec 06 13:56:32 mtrr: type mismatch for f6000000,800000 old: uncachable new: write-combining
Dec 10 09:30:01 general protection fault: 0000 [1] SMP
Dec 10 09:35:27 CPU 0
Dec 10 09:35:27 Modules linked in: md5(U) ipv6(U) llite(U) mdc(U) lov(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) kvibnal(U) ksocknal(U) portals(U) libcfs(U) ip_vs_rr(U) ip_vs(U) parport_pc(U) lp(U) parport(U) supermon_sensors(U) supermon_proc(U) autofs4(U) i2c_dev(U) i2c_core(U) nfsd(U) exportfs(U) lockd(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) ipoib_ud(U) ats(U) devugsi(U) devucm(U) ibat(U) cm(U) gsim(U) ad_tavor(U) vverbs(U) mlog(U) repository(U) hadump(U) mod_ib_mgt(U) mod_vapi(U) mod_vipkl(U) mod_thh(U) mod_hh(U) mod_vapi_common(U) mod_mpga(U) mosal(U) ipt_REJECT(U) ipt_state(U) ip_conntrack(U) iptable_filter(U) ip_tables(U) dm_mod(U) button(U) battery(U) ac(U) ohci_hcd(U) hw_random(U) e1000(U) tg3(U) floppy(U) ext3(U) jbd(U) sata_nv(U) mptscsih(U) mptbase(U) ata_piix(U) libata(U) cciss(U) sd_mod(U) scsi_mod(U)
Dec 10 09:35:27 Pid: 3049, comm: nfsd Tainted: PF 2.6.9-11.4hp.XCsmp
Dec 10 09:35:27 RIP: 0010:[<ffffffff801cfb59>] <ffffffff801cfb59>{strcmp+0}
Dec 10 09:35:27 RSP: 0000:00000100ddffdde0 EFLAGS: 00010282
Dec 10 09:35:27 RAX: 00000000ffffff93 RBX: dead4ead00000001 RCX: 0000000000000000
Dec 10 09:35:27 RDX: 0000000000000001 RSI: 00000100ddffde50 RDI: dead4ead00000029
Dec 10 09:35:27 RBP: 00000100ddffde28 R08: 0000006e66736404 R09: 572666a0d6636404
Dec 10 09:35:27 R10: 00000100ddfc4028 R11: ffffffffa03556eb R12: 0000000000000000
Dec 10 09:35:27 R13: 000001006bf7c580 R14: 0000000000000000 R15: ffffffffa0371690
Dec 10 09:35:27 FS: 0000002a9589fb00(0000) GS:ffffffff804b8200(0000) knlGS:00000000f61b8bb0
Dec 10 09:35:27 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 10 09:35:27 CR2: 00000000005b4960 CR3: 0000000000101000 CR4: 00000000000006e0
Dec 10 09:35:27 Process nfsd (pid: 3049, threadinfo 00000100ddffc000, task 000001007feaa030)
Dec 10 09:35:27 Stack: ffffffffa035518f 0000000000000007 000001007fd2c400 000001007fd2c400
Dec 10 09:35:27        0000000000000001 0000000000000003 00000000000186a3 ffffffffa03bc900
Dec 10 09:35:27        ffffffffa0355738 000001007fd2c468
Dec 10 09:35:27 Call Trace:<ffffffffa035518f>{:sunrpc:ip_map_lookup+276} <ffffffffa0355738>{:sunrpc:svcauth_unix_set_client+77}
Dec 10 09:35:27        <ffffffff801437b3>{groups_alloc+64} <ffffffffa0355acd>{:sunrpc:svcauth_unix_accept+409}
Dec 10 09:35:27        <ffffffffa0354ad1>{:sunrpc:svc_set_client+58} <ffffffffa03520c3>{:sunrpc:svc_process+775}
Dec 10 09:35:27        <ffffffff80131a32>{default_wake_function+0} <ffffffffa038d245>{:nfsd:nfsd+0}
Dec 10 09:35:27        <ffffffffa038d47d>{:nfsd:nfsd+568} <ffffffff80110d4b>{child_rip+8}
Dec 10 09:35:27        <ffffffffa038d245>{:nfsd:nfsd+0} <ffffffffa038d245>{:nfsd:nfsd+0}
Dec 10 09:35:27        <ffffffff80110d43>{child_rip+0}
Dec 10 09:35:27
Dec 10 09:35:27 Code: 0f b6 17 89 d0 2a 06 48 ff c6 84 c0 75 07 48 ff c7 84 d2 75
Dec 10 09:35:27 Reconfiguring memory bank information....
Dec 10 09:35:27 This may take a while....
Dec 10 09:35:27 unexpected IRQ trap at vector d8
Dec 10 09:35:27 CPU #0 is dumping; frozen CPUs: #1
Dec 10 09:35:27 Dumping to block device (104,6) on CPU 0 ...
Dec 10 09:35:27 ............../ .|
Dec 10 09:35:27 84170 dump pages saved of 4096 each in pass 0
Dec 10 09:35:27
Dec 10 09:35:27 842544 dump pages skipped of 4096 each in pass 1
Dec 10 09:35:27
Dec 10 09:35:27 21399 dump pages skipped of 4096 each in pass 2
Dec 10 09:35:27
Dec 10 09:35:27 0 dump pages skipped of 4096 each in pass 3
Dec 10 09:35:27

We see a call to "svcauth_unix_set_client", which then calls "ip_map_lookup", which then calls the built-in "strcmp". I had to run "svcauth_unix.c" through the C preprocessor; otherwise one could spend years trying to find "ip_map_lookup" in the code and never find it.
It is created by the "DefineSimpleCacheLookup" macro in "obj/x86_64/kernel-2.6.9/linux-2.6.9/include/linux/sunrpc/cache.h".

Anyway, in "ip_map_lookup", just before the call to "ip_map_match", which is an "inline" routine (that may be why it doesn't show up in the stack trace), isn't the "read_lock" supposed to be acquired before the "head = ..." line? Say that we execute the "head = " line and then some other kernel code trashes the address that "head" was pointing to before the "read_lock" is acquired. Since "tmp" is really a pointer to "head" and we're doing a "strcmp" on "tmp", wouldn't the possibility exist that "strcmp" would try to access a bad address pointed to by "tmp"?

----------------------------------
FILE: obj/x86_64/kernel-2.6.9/linux-2.6.9/net/sunrpc/svcauth_unix.c

svcauth_unix_set_client(struct svc_rqst *rqstp)
{
        struct ip_map key, *ipm;

        rqstp->rq_client = NULL;
        if (rqstp->rq_proc == 0)
                return SVC_OK;

        strcpy(key.m_class, rqstp->rq_server->sv_program->pg_class);
        key.m_addr = rqstp->rq_addr.sin_addr;

        ipm = ip_map_lookup(&key, 0);
        ...
}

FILE: obj/x86_64/kernel-2.6.9/linux-2.6.9/net/sunrpc/svcauth_unix.i
(Preprocessed file which has the "DefineSimpleCacheLookup" macro in "cache.h" expanded)

static struct ip_map *ip_map_lookup (struct ip_map *item, int set)
{
        struct ip_map *tmp, *new=((void *)0);
        struct cache_head **hp, **head;
        ;
        head = &(& ip_map_cache)->hash_table[ip_map_hash(item)];
retry:
        if (set||new)
                _write_lock(&(& ip_map_cache)->hash_lock);
        else
                _read_lock(&(& ip_map_cache)->hash_lock);
        for(hp=head; *hp != ((void *)0); hp = &tmp->h.next) {
                tmp = ({ const typeof( ((struct ip_map *)0)->h ) *__mptr = (*hp);
                         (struct ip_map *)( (char *)__mptr - ((size_t) &((struct ip_map *)0)->h) );});
                if (ip_map_match(item, tmp))
                ...
}

FILE: obj/x86_64/kernel-2.6.9/linux-2.6.9/net/sunrpc/svcauth_unix.i
(Preprocessed File)

static inline int ip_map_match(struct ip_map *item, struct ip_map *tmp)
{
        return __builtin_strcmp(tmp->m_class, item->m_class) == 0 &&
               tmp->m_addr.s_addr == item->m_addr.s_addr;
}
---------------------------------------------------

We had someone analyze the crash dump and this is what they reported:

After spending about 3 hours looking at the source code, the log, and the crash dump, here's what we've come up with:

1) The crash is occurring in the code that looks up entries in the "ip_map" table that NFS uses to keep track of its clients.

2) All the "cache head" entries in the table are zero, except one. (ineffective hash function)

3) Apparently all the clients map to that same non-zero cache head.

4) After finding the correct cache head, the code walks down the list of "ip map" entries, trying to match the current entry with the IP of the new request.

5) The cluster has 48 members.

6) In the 35th entry, the "next" pointer is pointing to a structure that is clearly NOT a valid "ip map" entry. This structure happens, apparently by accident, to have a "next" field that is a "non-canonical" address (it is a "magic" value component of a spinlock). So, when the loop tries to walk to this invalid "next" field, it trips an invalid address fault, and this triggers the crash.

   So, either a valid "ip map" entry has been clobbered, or (more likely) the previous "next" value has been clobbered (or is stale). In any case, it seems unlikely that we can make further progress without understanding what the NFS server on the cluster is supporting, and being able to monitor its data structures.

7) It seems reasonable to assume that each of the cluster members is an NFS client, although this needs to be checked. If so, there ought to be 47 entries in the chain. So, perhaps a complete chain has been clobbered somewhere in the middle?

Version-Release number of selected component (if applicable):

How reproducible: Have not tried to reproduce.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
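To illustrate item 6 of the analysis, here is a minimal user-space sketch, assuming x86_64 with 48-bit virtual addresses, of why the RDI value in the oops produces a general protection fault rather than an ordinary page fault. The helper names (is_canonical_48, has_spinlock_magic, check_oops_registers) and the macro name are hypothetical and are not kernel APIs; the register values are copied from the log, and 0xdead4ead is the pattern the analysis identifies as spinlock magic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pattern the crash-dump analysis identifies as the spinlock "magic" value
 * (macro name is hypothetical). */
#define SPINLOCK_MAGIC_PATTERN 0xdead4eadu

/* With 48-bit virtual addresses, bits 63..47 must be all zeros or all ones. */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 most significant bits */
    return top == 0 || top == 0x1ffff;
}

/* True when the upper 32 bits of a would-be pointer hold the magic pattern. */
static bool has_spinlock_magic(uint64_t value)
{
    return (uint32_t)(value >> 32) == SPINLOCK_MAGIC_PATTERN;
}

/* Check the two strcmp arguments reported in the oops. */
void check_oops_registers(void)
{
    const uint64_t rdi = 0xdead4ead00000029ULL;  /* first strcmp argument  */
    const uint64_t rsi = 0x00000100ddffde50ULL;  /* second strcmp argument */

    assert(!is_canonical_48(rdi));   /* non-canonical: dereference faults */
    assert(has_spinlock_magic(rdi)); /* upper half matches the magic      */
    assert(is_canonical_48(rsi));    /* an ordinary kernel address        */
}
```

If this reading is right, "strcmp" faulted on its very first byte load because "tmp" was computed from a "next" field that held spinlock contents instead of a list pointer.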
Can you please describe the 2.6.9-11.4hp.XCsmp kernel? That is not a Red Hat supported kernel.
It is a Red Hat kernel with additional patches for Quadrics, InfiniBand, etc. We rebuild the kernel with the additional patches, hence the "4hp" in the name to indicate that it's been updated 4 times.

After looking through some of the code related to the crash, we think that this fix (see below), which went into the 2.6.12 kernel, may be relevant.

Rigoberto

------------------------------------------------
http://kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=851dfd8298dbd31358ae1df9ef3c6ab1453141c7

[PATCH] nfsd: discard CACHE_HASHED flag, keeping information in refcount instead.

author     neilb <neilb>  Sat, 5 Mar 2005 17:15:09 +0000 (17:15 +0000)
committer  neilb <neilb>  Sat, 5 Mar 2005 17:15:09 +0000 (17:15 +0000)
commit     851dfd8298dbd31358ae1df9ef3c6ab1453141c7
tree       a5c009bd58b82019d5a57cb3b530f87cf77da001
parent     4881bd0daf953852e016a4dc91a4e2c9cebe1542

This patch should fix a problem that has been experienced on at-least one busy NFS server, but it has not had lots of testing yet. If -mm could provide that .....

The rpc auth cache currently differentiates between a reference due to being in a hash chain (signalled by CACHE_HASHED flag) and any other reference (counted in refcnt). This is an artificial difference due to an historical accident, and it makes cache_put unsafe.

This patch removes the distinction so now existance in a hash chain is counted just like any other reference. Thus a race window in cache_put is closed.

Signed-off-by: Neil Brown <neilb.edu.au>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>

BKrev: 4229e91dksfEwyIcWvN9kqVoJyUxQg

include/linux/sunrpc/cache.h
net/sunrpc/cache.c
net/sunrpc/svcauth.c
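The idea in that commit message can be sketched in user space. This is not the kernel patch; it is a minimal model, assuming C11 atomics, in which membership in a hash chain is just another counted reference, so only the caller that drops the last reference is told to free the entry. The names struct entry, entry_get, entry_put, chain_insert, and chain_remove are hypothetical, and the chain locking that a real cache needs is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry: refcnt counts every reference, including
 * the one held by the hash chain itself. */
struct entry {
    struct entry *next;
    atomic_int refcnt;
};

static void entry_get(struct entry *e)
{
    atomic_fetch_add(&e->refcnt, 1);
}

/* Returns true only for the caller that dropped the last reference.
 * The decrement and the zero test are one atomic step, so two
 * concurrent callers cannot both see "zero". */
static bool entry_put(struct entry *e)
{
    return atomic_fetch_sub(&e->refcnt, 1) == 1;
}

/* Link e at the head of the chain; the chain takes its own reference.
 * A real cache would hold the chain's write lock here. */
static void chain_insert(struct entry **head, struct entry *e)
{
    assert(head != NULL && e != NULL);
    entry_get(e);
    e->next = *head;
    *head = e;
}

/* Unlink e and drop the chain's reference. Returns true when that was
 * the last reference, i.e. the caller may now free e. */
static bool chain_remove(struct entry **head, struct entry *e)
{
    for (struct entry **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;
            e->next = NULL;
            return entry_put(e);
        }
    }
    return false;   /* e was not on this chain */
}
```

In this model an entry that is still linked into a chain can never reach a count of zero, which is the property the commit message describes as closing the race window in cache_put.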
Thanks for the pointer. Also, we're planning InfiniBand support for U3.
The patch in Comment #2 has already been committed, so it is not clear (at least to me) what needs to happen...
Please try the latest RHEL3 U3 kernel to see if it resolves this issue.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.