Log is at dev:/share/tickets/1234/self-heal-stale.log.bz2. From the end of the file, search upwards for the string "Stale"; it will take you to the NFS LOOKUP that was returned ESTALE by dht.
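For convenience, a minimal Python sketch (standard bz2 module, log path as above) that does the equivalent search non-interactively by reporting the last line containing "Stale":

import bz2

# Path as given above; adjust to wherever the log is copied locally.
LOG = "/share/tickets/1234/self-heal-stale.log.bz2"

last = None
with bz2.open(LOG, "rt", errors="replace") as f:
    for num, line in enumerate(f, 1):
        if "Stale" in line:
            last = (num, line.rstrip())

if last:
    print("last 'Stale' hit at line %d:\n%s" % last)
else:
    print("no 'Stale' occurrence found")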
Adding Pavan to the CC list.
Relevant log from the failed NFS LOOKUP request:

[2010-07-27 09:24:31] D [nfs3-helpers.c:2233:nfs3_log_fh_entry_call] nfs-nfsv3: XID: 753eab2c, LOOKUP: args: FH: hashcount 6, xlid 0, gen 5498575376114255672, ino 32382182, name: sh4
[2010-07-27 09:24:31] T [nfs3.c:1038:nfs3_lookup] nfs-nfsv3: FH to Volume: dr
[2010-07-27 09:24:31] T [nfs3-helpers.c:2934:nfs3_fh_resolve_entry_hard] nfs-nfsv3: FH hard resolution: ino: 32382182, gen: 5498575376114255672, entry: sh4, hashidx: 0
[2010-07-27 09:24:31] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-07-27 09:24:31] T [nfs-fops.c:279:nfs_fop_lookup] nfs: Lookup: /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:82:afr_set_lock_owner] repl1: Setting lk-owner=17403200
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1532:afr_self_heal] repl1: performing self heal on /c1/native-0/arch/sh/kernel/cpu/sh4 (metadata=1 data=0 entry=1)
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1568:afr_self_heal] repl1: proceeding to metadata check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-common.c:778:afr_sh_missing_entries_done] repl1: proceeding to metadata check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:1284:afr_nonblocking_inodelk_cbk] repl1: Last inode locking reply received
[2010-07-27 09:24:31] T [afr-lk-common.c:1289:afr_nonblocking_inodelk_cbk] repl1: All servers locked. Calling the cbk
[2010-07-27 09:24:31] D [afr-self-heal-metadata.c:707:afr_sh_post_nonblocking_inodelk_cbk] repl1: Non blocking inodelks done. Proceeding to FOP
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:652:afr_sh_metadata_lookup] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:652:afr_sh_metadata_lookup] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:593:afr_sh_metadata_lookup_cbk] repl1: path /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 is of mode 02
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:593:afr_sh_metadata_lookup_cbk] repl1: path /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 is of mode 02
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 0 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 2 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:464:afr_sh_metadata_sync_prepare] repl1: syncing metadata of /c1/native-0/arch/sh/kernel/cpu/sh4 from subvolume client2 to 1 active sinks
[2010-07-27 09:24:31] D [afr-self-heal-metadata.c:367:afr_sh_metadata_sync] repl1: self-healing metadata of /c1/native-0/arch/sh/kernel/cpu/sh4 from client2 to client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:227:afr_sh_metadata_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:227:afr_sh_metadata_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] D [afr-lk-common.c:398:transaction_lk_op] repl1: lk op is for a self heal
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=1
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=0
[2010-07-27 09:24:31] T [afr-lk-common.c:531:afr_unlock_common_cbk] repl1: All internal locks unlocked
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:95:afr_sh_metadata_done] repl1: proceeding to entry check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:1132:afr_nonblocking_entrylk_cbk] repl1: Last locking reply received
[2010-07-27 09:24:31] T [afr-lk-common.c:1137:afr_nonblocking_entrylk_cbk] repl1: All servers locked. Calling the cbk
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:2401:afr_sh_post_nonblocking_entry_cbk] repl1: Non blocking entrylks done. Proceeding to FOP
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 0 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 2 0 ]
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:2215:afr_sh_entry_sync_prepare] repl1: self-healing directory /c1/native-0/arch/sh/kernel/cpu/sh4 from subvolume client2 to 1 other
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2135:afr_sh_entry_open] repl1: opening directory /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 (source)
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2153:afr_sh_entry_open] repl1: opening directory /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 (sink)
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2088:afr_sh_entry_opendir_cbk] repl1: fd for /c1/native-0/arch/sh/kernel/cpu/sh4 opened, commencing sync
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2038:afr_sh_entry_impunge_all] repl1: impunging entries of /c1/native-0/arch/sh/kernel/cpu/sh4 on client2 to other sinks
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1974:afr_sh_entry_impunge_readdir_cbk] repl1: readdir'ed 3 entries from client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1855:afr_sh_entry_impunge_entry] repl1: skipping inspection of . under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1861:afr_sh_entry_impunge_entry] repl1: inspecting existance of clock-sh4-202.c under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1907:afr_sh_entry_impunge_entry] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1855:afr_sh_entry_impunge_entry] repl1: skipping inspection of .. under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1789:afr_sh_entry_impunge_entry_cbk] repl1: missing entry /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:1316:afr_sh_entry_impunge_mknod] repl1: creating missing file /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1153:afr_sh_entry_impunge_xattrop_cbk] repl1: setting ownership of /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1 to 0/0
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1102:afr_sh_entry_impunge_setattr_cbk] repl1: setattr done for /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1960:afr_sh_entry_impunge_readdir_cbk] repl1: readdir of /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 complete
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1027:afr_sh_entry_expunge_all] repl1: expunging entries of /c1/native-0/arch/sh/kernel/cpu/sh4 on client1 to other sinks
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:959:afr_sh_entry_expunge_readdir_cbk] repl1: readdir'ed 3 entries from client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:873:afr_sh_entry_expunge_entry] repl1: inspecting existance of clock-sh4-202.c under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:896:afr_sh_entry_expunge_entry] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:867:afr_sh_entry_expunge_entry] repl1: skipping inspection of .. under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:867:afr_sh_entry_expunge_entry] repl1: skipping inspection of . under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:822:afr_sh_entry_expunge_entry_cbk] repl1: /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c exists under client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:945:afr_sh_entry_expunge_readdir_cbk] repl1: readdir of /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 complete
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:189:afr_sh_entry_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:189:afr_sh_entry_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:110:afr_sh_entry_finish] repl1: finishing entry selfheal of /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] D [afr-lk-common.c:398:transaction_lk_op] repl1: lk op is for a self heal
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=1
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=0
[2010-07-27 09:24:31] T [afr-lk-common.c:531:afr_unlock_common_cbk] repl1: All internal locks unlocked
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:80:afr_sh_entry_done] repl1: self heal of /c1/native-0/arch/sh/kernel/cpu/sh4 completed
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1476:afr_self_heal_completion_cbk] repl1: background self-heal completed
[2010-07-27 09:24:31] D [dht-layout.c:653:dht_layout_dir_mismatch] dr: /c1/native-0/arch/sh/kernel/cpu/sh4 - disk layout missing
[2010-07-27 09:24:31] D [dht-common.c:278:dht_revalidate_cbk] dr: mismatching layouts for /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] D [nfs3-helpers.c:2406:nfs3_log_newfh_res] nfs-nfsv3: XID: 753eab2c, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale NFS file handle), FH: hashcount 0, xlid 0, gen 0, ino 0
The following NFS client-side dmesg output confirms that the NFS client receives changed inode numbers after self-heal during the compilebench load:

NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29f4, got 0x1ee2d0a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a2c, got 0x1ee2d3a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29f0, got 0x1ee2d3e
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29e8, got 0x1ee2d4a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a08, got 0x1ee2d52
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a00, got 0x1ee2e36
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a28, got 0x1ee2f1a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a30, got 0x1ee2f2e

Here, fileid refers to the inode number.
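For illustration only (this is not the kernel's code path, just the same comparison expressed in Python), a sketch that records st_ino for the healed file before the brick is brought back and re-checks it afterwards; clock-sh4-202.c is the file the self-heal recreated in the log above:

import os

# File under the NFS mount that self-heal recreates on the sink brick
# (taken from the log above); any file in the healed directory works.
PATH = "/mnt/shehjart/mount/c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c"

before = os.stat(PATH).st_ino          # fileid the client has cached
input("bring the downed brick back, wait for self-heal, then press Enter")
after = os.stat(PATH).st_ino           # fileid returned after the heal

if after != before:
    print("fileid changed: expected 0x%x, got 0x%x" % (before, after))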
Reproducible by running a compilebench load over a distributed-replicated setup with all perf translators. The ESTALE occurs a few minutes after the downed node comes back up.

[root@domU-12-31-39-0E-B1-16 compilebench-0.6]# ./compilebench -D /mnt/shehjart/mount/c1/ -r 100 -i 1
creating /mnt/shehjart/mount/c1/
using working directory /mnt/shehjart/mount/c1/, 1 intial dirs
100 runs
Traceback (most recent call last):
  File "./compilebench", line 567, in <module>
    dset = dataset(options.sources, rnd)
  File "./compilebench", line 319, in __init__
    self.unpatched = native_order(self.unpatched, "unpatched")
  File "./compilebench", line 97, in native_order
    run_directory(tmplist, dirname, "native %s" % tag)
  File "./compilebench", line 225, in run_directory
    fp = file(fname, 'a+')
IOError: [Errno 116] Stale NFS file handle: '/mnt/shehjart/mount/c1/native-0/arch/sh/kernel/cpu/sh4/ex.S'
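In case it helps narrow the reproduction window, a rough Python sketch (the polling interval is an assumption, not part of the original repro) that keeps re-opening the file compilebench tripped on and reports the first ESTALE (errno 116) the client returns:

import errno
import time

# The file the compilebench run above failed on; any file under the mount
# that existed before the node went down should behave the same way.
TARGET = "/mnt/shehjart/mount/c1/native-0/arch/sh/kernel/cpu/sh4/ex.S"

while True:
    try:
        with open(TARGET, "a+") as fp:   # same open mode compilebench uses
            fp.read(1)
    except OSError as e:
        if e.errno == errno.ESTALE:      # 116: Stale NFS file handle
            print("hit ESTALE on %s" % TARGET)
            break
        raise
    time.sleep(1)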
There are more fixes that need to go in to handle this. Since it is in the self-heal path, this is not a blocker for the release.
Not reproducible in the latest git.