| Summary: | Self-heal of compilebench load results in ESTALE for NFS | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Shehjar Tikoo <shehjart> |
| Component: | replicate | Assignee: | shishir gowda <sgowda> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | nfs-alpha | CC: | amarts, gluster-bugs, nsathyan, pavan, vijay |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTP | Mount Type: | nfs |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Shehjar Tikoo
2010-07-27 10:33:13 UTC
Adding pavan to the cc list.

Relevant log from the failed NFS LOOKUP request:

```
[2010-07-27 09:24:31] D [nfs3-helpers.c:2233:nfs3_log_fh_entry_call] nfs-nfsv3: XID: 753eab2c, LOOKUP: args: FH: hashcount 6, xlid 0, gen 5498575376114255672, ino 32382182, name: sh4
[2010-07-27 09:24:31] T [nfs3.c:1038:nfs3_lookup] nfs-nfsv3: FH to Volume: dr
[2010-07-27 09:24:31] T [nfs3-helpers.c:2934:nfs3_fh_resolve_entry_hard] nfs-nfsv3: FH hard resolution: ino: 32382182, gen: 5498575376114255672, entry: sh4, hashidx: 0
[2010-07-27 09:24:31] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-07-27 09:24:31] T [nfs-fops.c:279:nfs_fop_lookup] nfs: Lookup: /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:82:afr_set_lock_owner] repl1: Setting lk-owner=17403200
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1532:afr_self_heal] repl1: performing self heal on /c1/native-0/arch/sh/kernel/cpu/sh4 (metadata=1 data=0 entry=1)
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1568:afr_self_heal] repl1: proceeding to metadata check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-common.c:778:afr_sh_missing_entries_done] repl1: proceeding to metadata check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:1284:afr_nonblocking_inodelk_cbk] repl1: Last inode locking reply received
[2010-07-27 09:24:31] T [afr-lk-common.c:1289:afr_nonblocking_inodelk_cbk] repl1: All servers locked. Calling the cbk
[2010-07-27 09:24:31] D [afr-self-heal-metadata.c:707:afr_sh_post_nonblocking_inodelk_cbk] repl1: Non blocking inodelks done. Proceeding to FOP
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:652:afr_sh_metadata_lookup] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:652:afr_sh_metadata_lookup] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:593:afr_sh_metadata_lookup_cbk] repl1: path /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 is of mode 02
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:593:afr_sh_metadata_lookup_cbk] repl1: path /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 is of mode 02
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 0 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 2 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:464:afr_sh_metadata_sync_prepare] repl1: syncing metadata of /c1/native-0/arch/sh/kernel/cpu/sh4 from subvolume client2 to 1 active sinks
[2010-07-27 09:24:31] D [afr-self-heal-metadata.c:367:afr_sh_metadata_sync] repl1: self-healing metadata of /c1/native-0/arch/sh/kernel/cpu/sh4 from client2 to client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:227:afr_sh_metadata_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:227:afr_sh_metadata_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] D [afr-lk-common.c:398:transaction_lk_op] repl1: lk op is for a self heal
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=1
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=0
[2010-07-27 09:24:31] T [afr-lk-common.c:531:afr_unlock_common_cbk] repl1: All internal locks unlocked
[2010-07-27 09:24:31] T [afr-self-heal-metadata.c:95:afr_sh_metadata_done] repl1: proceeding to entry check on /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-lk-common.c:1132:afr_nonblocking_entrylk_cbk] repl1: Last locking reply received
[2010-07-27 09:24:31] T [afr-lk-common.c:1137:afr_nonblocking_entrylk_cbk] repl1: All servers locked. Calling the cbk
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:2401:afr_sh_post_nonblocking_entry_cbk] repl1: Non blocking entrylks done. Proceeding to FOP
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 0 0 ]
[2010-07-27 09:24:31] T [afr-self-heal-common.c:111:afr_sh_print_pending_matrix] repl1: pending_matrix: [ 2 0 ]
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:2215:afr_sh_entry_sync_prepare] repl1: self-healing directory /c1/native-0/arch/sh/kernel/cpu/sh4 from subvolume client2 to 1 other
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2135:afr_sh_entry_open] repl1: opening directory /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 (source)
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2153:afr_sh_entry_open] repl1: opening directory /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 (sink)
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2088:afr_sh_entry_opendir_cbk] repl1: fd for /c1/native-0/arch/sh/kernel/cpu/sh4 opened, commencing sync
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:2038:afr_sh_entry_impunge_all] repl1: impunging entries of /c1/native-0/arch/sh/kernel/cpu/sh4 on client2 to other sinks
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1974:afr_sh_entry_impunge_readdir_cbk] repl1: readdir'ed 3 entries from client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1855:afr_sh_entry_impunge_entry] repl1: skipping inspection of . under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1861:afr_sh_entry_impunge_entry] repl1: inspecting existance of clock-sh4-202.c under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1907:afr_sh_entry_impunge_entry] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1855:afr_sh_entry_impunge_entry] repl1: skipping inspection of .. under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1789:afr_sh_entry_impunge_entry_cbk] repl1: missing entry /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] D [afr-self-heal-entry.c:1316:afr_sh_entry_impunge_mknod] repl1: creating missing file /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1153:afr_sh_entry_impunge_xattrop_cbk] repl1: setting ownership of /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1 to 0/0
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1102:afr_sh_entry_impunge_setattr_cbk] repl1: setattr done for /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1960:afr_sh_entry_impunge_readdir_cbk] repl1: readdir of /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client2 complete
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:1027:afr_sh_entry_expunge_all] repl1: expunging entries of /c1/native-0/arch/sh/kernel/cpu/sh4 on client1 to other sinks
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:959:afr_sh_entry_expunge_readdir_cbk] repl1: readdir'ed 3 entries from client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:873:afr_sh_entry_expunge_entry] repl1: inspecting existance of clock-sh4-202.c under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:896:afr_sh_entry_expunge_entry] repl1: looking up /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c on client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:867:afr_sh_entry_expunge_entry] repl1: skipping inspection of .. under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:867:afr_sh_entry_expunge_entry] repl1: skipping inspection of . under /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:822:afr_sh_entry_expunge_entry_cbk] repl1: /c1/native-0/arch/sh/kernel/cpu/sh4/clock-sh4-202.c exists under client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:945:afr_sh_entry_expunge_readdir_cbk] repl1: readdir of /c1/native-0/arch/sh/kernel/cpu/sh4 on subvolume client1 complete
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:189:afr_sh_entry_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client1
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:189:afr_sh_entry_erase_pending] repl1: erasing pending flags from /c1/native-0/arch/sh/kernel/cpu/sh4 on client2
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:110:afr_sh_entry_finish] repl1: finishing entry selfheal of /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] D [afr-lk-common.c:398:transaction_lk_op] repl1: lk op is for a self heal
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=1
[2010-07-27 09:24:31] D [afr-lk-common.c:527:afr_unlock_common_cbk] repl1: call-count=0
[2010-07-27 09:24:31] T [afr-lk-common.c:531:afr_unlock_common_cbk] repl1: All internal locks unlocked
[2010-07-27 09:24:31] T [afr-self-heal-entry.c:80:afr_sh_entry_done] repl1: self heal of /c1/native-0/arch/sh/kernel/cpu/sh4 completed
[2010-07-27 09:24:31] T [afr-self-heal-common.c:1476:afr_self_heal_completion_cbk] repl1: background self-heal completed
[2010-07-27 09:24:31] D [dht-layout.c:653:dht_layout_dir_mismatch] dr: /c1/native-0/arch/sh/kernel/cpu/sh4 - disk layout missing
[2010-07-27 09:24:31] D [dht-common.c:278:dht_revalidate_cbk] dr: mismatching layouts for /c1/native-0/arch/sh/kernel/cpu/sh4
[2010-07-27 09:24:31] D [nfs3-helpers.c:2406:nfs3_log_newfh_res] nfs-nfsv3: XID: 753eab2c, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale NFS file handle), FH: hashcount 0, xlid 0, gen 0, ino 0
```

The following NFS client-side dmesg output confirms that the NFS client receives changed inode numbers after the self-heal during the compilebench load:

```
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29f4, got 0x1ee2d0a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a2c, got 0x1ee2d3a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29f0, got 0x1ee2d3e
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea29e8, got 0x1ee2d4a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a08, got 0x1ee2d52
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a00, got 0x1ee2e36
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a28, got 0x1ee2f1a
NFS: server domU-12-31-39-10-6D-E2 error: fileid changed fsid 0:15: expected fileid 0xea2a30, got 0x1ee2f2e
```

Here, fileid refers to the inode number. This is reproducible by running a compilebench load over a distributed-replicated setup with all perf translators loaded. The ESTALE occurs a few minutes after the downed node comes back up.
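The fileid mismatch can be observed directly from the client by snapshotting st_ino for every path under the mount before and after the self-heal and diffing the two snapshots. A minimal sketch, assuming a hypothetical mount point (the helper below is illustrative, not part of compilebench or GlusterFS):

```python
#!/usr/bin/env python
# Sketch: detect inode-number (fileid) changes on an NFS mount,
# the same mismatch the kernel reports as "fileid changed" / ESTALE.
import os
import sys

MOUNT = "/mnt/shehjart/mount/c1"  # hypothetical mount point; adjust to your setup

def snapshot(root):
    """Map every path under root to its current inode number."""
    inodes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                inodes[path] = os.lstat(path).st_ino
            except OSError:
                pass  # entry may vanish mid-walk
    return inodes

before = snapshot(MOUNT)
print("snapshot taken; bring the downed node back and wait for self-heal, then press Enter")
sys.stdin.readline()
after = snapshot(MOUNT)

for path, ino in sorted(before.items()):
    new = after.get(path)
    if new is not None and new != ino:
        # A changed st_ino is exactly what invalidates the client's cached file handle.
        print("fileid changed: %s expected %#x, got %#x" % (path, ino, new))
```

Each path flagged by the diff corresponds to a "fileid changed" line in dmesg, since the kernel's cached file handle still encodes the old inode number.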
The failing compilebench run:

```
[root@domU-12-31-39-0E-B1-16 compilebench-0.6]# ./compilebench -D /mnt/shehjart/mount/c1/ -r 100 -i 1
creating /mnt/shehjart/mount/c1/
using working directory /mnt/shehjart/mount/c1/, 1 intial dirs 100 runs
Traceback (most recent call last):
  File "./compilebench", line 567, in <module>
    dset = dataset(options.sources, rnd)
  File "./compilebench", line 319, in __init__
    self.unpatched = native_order(self.unpatched, "unpatched")
  File "./compilebench", line 97, in native_order
    run_directory(tmplist, dirname, "native %s" % tag)
  File "./compilebench", line 225, in run_directory
    fp = file(fname, 'a+')
IOError: [Errno 116] Stale NFS file handle: '/mnt/shehjart/mount/c1/native-0/arch/sh/kernel/cpu/sh4/ex.S'
```
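For applications, the usual client-side mitigation for a transient ESTALE is to drop the failed handle and retry the operation by path, which forces a fresh LOOKUP. A minimal sketch, assuming the hypothetical helper below (not part of compilebench); note that in this bug the server's LOOKUP itself returns ESTALE until the layout is healed, so a bounded retry may still fail:

```python
import errno

def open_retrying_estale(path, mode="a+", retries=3):
    """Open a file, retrying on ESTALE (errno 116) so that a fresh
    LOOKUP can replace the stale cached NFS file handle.
    Hypothetical helper, for illustration only."""
    last_err = None
    for attempt in range(retries):
        try:
            return open(path, mode)
        except IOError as e:
            if e.errno != errno.ESTALE:
                raise
            last_err = e
    raise last_err
```

compilebench does no such retry (it opens files directly with `file(fname, 'a+')`), so it aborts on the first ESTALE, as in the traceback above.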
There are more fixes that need to go in to handle this. Since it is in the self-heal path, this is not a blocker for the release. Not reproducible in the latest git.