Description of problem:

When running a benchmark test with the SPEC SFS 2014 VDA workload against a 3x2 on 3x2 tiered volume using the FUSE native client, the benchmark fails with return code 101 between the INIT and WARM phases, reporting a client error of "No such file or directory." The client logs consistently show an error in dht-linkfile.c, "Failed to set attr uid/gid", as in the log pattern below.

[2016-10-24 15:48:39.738525] I [MSGID: 109045] [dht-common.c:1667:dht_lookup_everywhere_cbk] 2-1nvme-distrep3x2-tier-dht: attempting deletion of stale linkfile /specsfs/CL21_VDA2/Dir0/bucket11/onebd0.leb on 1nvme-distrep3x2-cold-dht (hashed subvol is 1nvme-distrep3x2-hot-dht)
[2016-10-24 15:48:39.739442] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-0: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.739459] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-1: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.739738] I [MSGID: 109069] [dht-common.c:992:dht_lookup_unlink_cbk] 2-1nvme-distrep3x2-tier-dht: lookup_unlink returned with op_ret -> -1 and op-errno -> 16 for /specsfs/CL21_VDA2/Dir0/bucket11/onebd0.leb
[2016-10-24 15:48:39.742165] I [MSGID: 108019] [afr-lk-common.c:1086:afr_lock_blocking] 2-1nvme-distrep3x2-replicate-3: unable to lock on even one child
[2016-10-24 15:48:39.742195] I [MSGID: 108019] [afr-transaction.c:1218:afr_post_blocking_inodelk_cbk] 2-1nvme-distrep3x2-replicate-3: Blocking inodelks failed.
[2016-10-24 15:48:39.742209] E [MSGID: 109031] [dht-linkfile.c:306:dht_linkfile_setattr_cbk] 2-1nvme-distrep3x2-tier-dht: Failed to set attr uid/gid on /specsfs/CL21_VDA2/Dir0/bucket11/onebd0.leb :<gfid:00000000-0000-0000-0000-000000000000> [Stale file handle]
[2016-10-24 15:48:39.853056] I [MSGID: 109045] [dht-common.c:1667:dht_lookup_everywhere_cbk] 2-1nvme-distrep3x2-tier-dht: attempting deletion of stale linkfile /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom on 1nvme-distrep3x2-cold-dht (hashed subvol is 1nvme-distrep3x2-hot-dht)
[2016-10-24 15:48:39.853577] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-2: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.853602] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-3: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.853788] I [MSGID: 109069] [dht-common.c:992:dht_lookup_unlink_cbk] 2-1nvme-distrep3x2-tier-dht: lookup_unlink returned with op_ret -> -1 and op-errno -> 16 for /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom
[2016-10-24 15:48:39.856132] I [MSGID: 108019] [afr-lk-common.c:1086:afr_lock_blocking] 2-1nvme-distrep3x2-replicate-3: unable to lock on even one child
[2016-10-24 15:48:39.856150] I [MSGID: 108019] [afr-transaction.c:1218:afr_post_blocking_inodelk_cbk] 2-1nvme-distrep3x2-replicate-3: Blocking inodelks failed.
[2016-10-24 15:48:39.856161] E [MSGID: 109031] [dht-linkfile.c:306:dht_linkfile_setattr_cbk] 2-1nvme-distrep3x2-tier-dht: Failed to set attr uid/gid on /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom :<gfid:00000000-0000-0000-0000-000000000000> [Stale file handle]

Version-Release number of selected component (if applicable):
# cat /etc/redhat-storage-release
Red Hat Gluster Storage Server 3.1 Update 3
# gluster --version
glusterfs 3.7.9 built on Jun 10 2016 06:32:42

How reproducible:
Consistently

Steps to Reproduce:
Gluster 3x2 hot tier on a 3x2 cold tier.
Hot tier is NVMe with a standard LVM thin provisioning stack; cold tier is on RAID 6 spinning disks with a standard LVM thin provisioning stack.

# gluster vol info 1nvme-distrep3x2
Volume Name: 1nvme-distrep3x2
Type: Tier
Volume ID: adbef8d2-2d15-41b9-b1f1-7d7d070aaf37
Status: Started
Number of Bricks: 12
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

12 clients connected to the volume with the FUSE native client.
# glusterfs --version
glusterfs 3.7.1 built on Sep 21 2015 07:52:17

Using the SPEC SFS 2014 SP1 benchmark software.
The rc file for the SPEC benchmark:
# cat /tmp/sfs.rc
BENCHMARK=VDA
LOAD=144
INCR_LOAD=24
NUM_RUNS=4
CLIENT_MOUNTPOINTS=c0:/rhgs/client/1nvme-distrep3x2/specsfs c1:/rhgs/client/1nvme-distrep3x2/specsfs c2:/rhgs/client/1nvme-distrep3x2/specsfs c3:/rhgs/client/1nvme-distrep3x2/specsfs c4:/rhgs/client/1nvme-distrep3x2/specsfs c5:/rhgs/client/1nvme-distrep3x2/specsfs c6:/rhgs/client/1nvme-distrep3x2/specsfs c7:/rhgs/client/1nvme-distrep3x2/specsfs c8:/rhgs/client/1nvme-distrep3x2/specsfs c9:/rhgs/client/1nvme-distrep3x2/specsfs c10:/rhgs/client/1nvme-distrep3x2/specsfs c11:/rhgs/client/1nvme-distrep3x2/specsfs
EXEC_PATH=/root/bin/netmist
USER=root
WARMUP_TIME=300
IPV6_ENABLE=0
PRIME_MON_SCRIPT=
PRIME_MON_ARGS=
NETMIST_LOGS=
PASSWORD=
RUNTIME=300
WORKLOAD_FILE=
OPRATE_MULTIPLIER=
CLIENT_MEM=1g
AGGR_CAP=1g
FILE_SIZE=
DIR_COUNT=10
FILES_PER_DIR=100
UNLINK_FILES=0
LATENCY_GRAPH=1
HEARTBEAT_NOTIFICATIONS=1
DISABLE_FSYNCS=0
USE_RSHRCP=0
BYTE_OFFSET=0
MAX_FD=
PIT_SERVER=
PIT_PORT=
LOCAL_ONLY=0
FILE_ACCESS_LIST=0
SHARING_MODE=0
SOCK_DEBUG=0
TRACEDEBUG=0

The SPEC SFS command line:
python /root/bin/SfsManager -r /tmp/sfs.rc -s $(date +%F-%H-%M-%S) -b /root/bin/benchmarks.xml -d ./

Actual results:
Benchmark fails when at least one client reports back a "No such file or directory" error.

<<< Mon Oct 24 07:25:31 2016: Starting VDA run 1 of 4: STREAMS=144
>> SPEC SFS2014 Release $Revision: 1072 $
...
Test run time = 300 seconds, Warmup = 300 seconds.
Op latency reporting activated
Files per directory set to 1
Directories per proc set to 1
Using custom file size 1048576 Kbytes
Running 288 copies of the test on 12 clients
Results directory: ./
Clients have a total of 12288 MiBytes of memory
Clients have 42 MiBytes of memory size per process
Clients each have 24 processes
Adjustable aggregate data set value set to 1024 MiBytes
Each process file size = 1048576 kbytes
Client data set size = 270336 MiBytes
Total starting data set size = 3244032 MiBytes
Total initial file space = 3244032 MiBytes
Total max file space = 3538944 MiBytes
[INFO][Mon Oct 24 07:25:31 2016]Exec validation successful
...
Starting tests: Mon Oct 24 07:25:31 2016
Launching 288 processes.
...
Waiting to finish initialization. Mon Oct 24 07:26:12 2016
Mon Oct 24 07:30:55 2016 Starting INIT phase
...
Initialization finished: Mon Oct 24 08:48:28 2016
Testing begins: Mon Oct 24 08:48:28 2016
Waiting for tests to finish. Mon Oct 24 08:48:34 2016
Mon Oct 24 08:48:39 2016 Starting WARM phase
Mon Oct 24 08:48:39 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Client testing failed: Mon Oct 24 08:48:46 2016
netmist did not complete successfully. The netmist return code was: 101

Expected results:
All benchmark iterations complete successfully without client errors.
Additional info:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)

# uname -r
3.10.0-327.28.3.el7.x86_64

# df -h | grep rhgs
/dev/mapper/rhgs_vg-rhgs_lv          55T  520G  55T   1% /rhgs/coldbricks
/dev/mapper/rhgs_hot_vg-rhgs_hot_lv 742G  531G 211G  72% /rhgs/hotbricks

# mount | grep rhgs
/dev/mapper/rhgs_vg-rhgs_lv on /rhgs/coldbricks type xfs (rw,noatime,nodiratime,attr2,inode64,sunit=512,swidth=5120,noquota)
/dev/mapper/rhgs_hot_vg-rhgs_hot_lv on /rhgs/hotbricks type xfs (rw,noatime,nodiratime,attr2,inode64,logbsize=128k,sunit=256,swidth=512,noquota)

# xfs_info /dev/mapper/rhgs_vg-rhgs_lv
meta-data=/dev/mapper/rhgs_vg-rhgs_lv isize=512 agcount=55, agsize=268435392 blks
         = sectsz=4096 attr=2, projid32bit=1
         = crc=0 finobt=0
data     = bsize=4096 blocks=14645721600, imaxpct=1
         = sunit=64 swidth=640 blks
naming   =version 2 bsize=8192 ascii-ci=0 ftype=0
log      =internal bsize=4096 blocks=521728, version=2
         = sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

# xfs_info /dev/mapper/rhgs_hot_vg-rhgs_hot_lv
meta-data=/dev/mapper/rhgs_hot_vg-rhgs_hot_lv isize=512 agcount=32, agsize=6074208 blks
         = sectsz=512 attr=2, projid32bit=1
         = crc=0 finobt=0
data     = bsize=4096 blocks=194374656, imaxpct=25
         = sunit=32 swidth=64 blks
naming   =version 2 bsize=8192 ascii-ci=0 ftype=0
log      =internal bsize=4096 blocks=94912, version=2
         = sectsz=512 sunit=32 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

# tuned-adm active
Current active profile: rhgs-sequential-io

# getenforce
Disabled

# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Possibly related: BZ 1277088 (partial error message match): https://bugzilla.redhat.com/show_bug.cgi?id=1277088
Dustin,

I see you are running older clients for this test:

> 12 clients connected to the volume with the FUSE native client.
> # glusterfs --version
> glusterfs 3.7.1 built on Sep 21 2015 07:52:17

It is recommended that you use the latest 3.1.3 clients as well. We fixed a lot of tier bugs in 3.1.2, so earlier versions are likely to have issues that are already fixed.

Can you retry the tests with the latest clients and let us know if you still see the issues? If you do, please attach the logs/sos reports to the BZ.
(In reply to Nithya Balachandran from comment #4)
> Dustin,
>
> I see you are running older clients for this test:
>
> > 12 clients connected to the volume with the FUSE native client.
> > # glusterfs --version
> > glusterfs 3.7.1 built on Sep 21 2015 07:52:17
>
> It is recommended that you use the latest 3.1.3 clients as well. We fixed a
> lot of tier bugs in 3.1.2 so earlier versions are likely to have issues that
> are already fixed.
>
> Can you retry the tests with the latest clients and let us know if you still
> see the issues?
>
> If you do, please attach the logs/sos reports to the BZ.

Nithya,

It appears that my client systems are registered to the rhel-7-server repo but not to the rhel-7-server-rh-common repo. The rhel-7-server repo carries the glusterfs and glusterfs-fuse packages only up to version 3.7.1, so when I did a yum install I believed I had the most up-to-date packages. Looking into the repos, I found that the later packages were apparently moved to the rhel-7-server-rh-common repo. I am updating my clients now and will re-test.
Clients were updated to glusterfs 3.7.9 built on Aug 24 2016 02:51:26. The problem was not fixed, and similar errors appear.

SPEC SFS error (time in PDT):
Tue Oct 25 15:31:51 2016 Warm Heartbeat Client 335: 1.01 Ops/sec
Tue Oct 25 15:32:12 2016 Warm-up 50 percent complete
Tue Oct 25 15:32:43 2016 Warm-up 60 percent complete
Tue Oct 25 15:32:53 2016 Warm Heartbeat Client 335: 1.00 Ops/sec
Tue Oct 25 15:33:13 2016 Warm-up 70 percent complete
Tue Oct 25 15:33:43 2016 Warm-up 80 percent complete
Tue Oct 25 15:33:55 2016 Warm Heartbeat Client 335: 1.00 Ops/sec
Tue Oct 25 15:34:12 2016 Warm-up 90 percent complete
Client testing failed: Tue Oct 25 15:34:13 2016
netmist did not complete successfully. The netmist return code was: 101

rhclient0 glusterfs log (time in UTC):
[2016-10-25 23:29:12.070138] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-11: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070160] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-10: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070226] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-11: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070239] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-10: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:32:42.838486] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-6: remote operation failed [No such file or directory]
[2016-10-25 23:32:42.838782] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-7: remote operation failed [No such file or directory]
[2016-10-25 23:32:42.839359] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-1nvme-distrep3x2-replicate-5: Unreadable subvolume -1 found with event generation 2 for gfid 55d47778-6a22-4233-9fce-e97e4c0da376. (Possible split-brain)
[2016-10-25 23:32:42.840242] E [MSGID: 109040] [dht-helper.c:1189:dht_migration_complete_check_task] 0-1nvme-distrep3x2-hot-dht: (null): failed to lookup the file on 1nvme-distrep3x2-hot-dht [Stale file handle]
[2016-10-25 23:32:42.840763] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-1nvme-distrep3x2-replicate-5: Unreadable subvolume -1 found with event generation 2 for gfid 55d47778-6a22-4233-9fce-e97e4c0da376. (Possible split-brain)
The message "W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-7: remote operation failed [No such file or directory]" repeated 2 times between [2016-10-25 23:32:42.838782] and [2016-10-25 23:32:42.840981]
The message "W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-6: remote operation failed [No such file or directory]" repeated 3 times between [2016-10-25 23:32:42.838486] and [2016-10-25 23:32:42.841022]
Additional failed test results, with logs attached.

Starting test client: 125 Host: rhclient2 Workload: VDA2 Workdir: /rhgs/client/1nvme-ec42/specsfs
...
Starting test client: 287 Host: rhclient11 Workload: VDA2 Workdir: /rhgs/client/1nvme-ec42/specsfs
Waiting to finish initialization. Sat Oct 29 02:03:07 2016
Sat Oct 29 02:05:22 2016 Starting INIT phase
Sat Oct 29 02:05:23 2016 Init Heartbeat __/\_/\__
Sat Oct 29 02:06:12 2016 Init 20 percent complete
Sat Oct 29 02:06:13 2016 Init 50 percent complete
Sat Oct 29 02:06:13 2016 Init 70 percent complete
Sat Oct 29 02:06:13 2016 Init 100 percent complete
Initialization finished: Sat Oct 29 02:06:57 2016
Testing begins: Sat Oct 29 02:06:57 2016
Waiting for tests to finish. Sat Oct 29 02:07:03 2016
Sat Oct 29 02:07:08 2016 Starting WARM phase
Sat Oct 29 02:07:08 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Sat Oct 29 02:07:33 2016 Warm-up 10 percent complete
Master: Failure message client 125 Error = 22
Master: Client 125 returned error 22 Possibly: Invalid argument
Client testing failed: Sat Oct 29 02:07:33 2016
Exit with cleanup
Killing 288 clients.

From rhclient2 glusterfs log (attached; UTC):
[2016-10-29 09:07:33.843919] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 20740513: FSTAT() /specsfs/CL125_VDA2/Dir0/bucket2/iee0.nhn => -1 (Invalid argument)
[2016-10-29 09:07:35.598809] W [MSGID: 114031] [client-rpc-fops.c:1848:client3_3_xattrop_cbk] 0-1nvme-ec42-client-0: remote operation failed. Path: /specsfs/CL53_VDA2/Dir0/bucket11/tut0.i (a02df0b8-fbf2-4789-8465-dc260c71e8b2) [No such file or directory]
Created attachment 1215341 [details] Error message on screen when SPEC SFS VDA test fails
Created attachment 1215342 [details] SFS script used to run testing
Created attachment 1215343 [details] SPEC SFS log
Created attachment 1215344 [details]
Gluster log for rhclient2

This log has the error message for what caused the SPEC SFS VDA test to fail.
New VDA failure, with logs attached.

Testing output on screen before, during, and after the failure (PDT):
…
Starting test client: 329 Host: rhclient8 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 330 Host: rhclient9 Workload: VDA1 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 331 Host: rhclient9 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 332 Host: rhclient10 Workload: VDA1 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 333 Host: rhclient10 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 334 Host: rhclient11 Workload: VDA1 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Starting test client: 335 Host: rhclient11 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Waiting to finish initialization. Fri Nov 4 10:10:38 2016
Fri Nov 4 10:15:27 2016 Starting INIT phase
Fri Nov 4 10:15:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:16:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:17:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:18:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:18:48 2016 Init 20 percent complete
Fri Nov 4 10:19:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:20:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:21:43 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:22:11 2016 Init 50 percent complete
Fri Nov 4 10:22:44 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:23:19 2016 Init 70 percent complete
Fri Nov 4 10:23:44 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:24:44 2016 Init Heartbeat __/\_/\__
Fri Nov 4 10:25:35 2016 Init 100 percent complete
Initialization finished: Fri Nov 4 10:27:10 2016
Testing begins: Fri Nov 4 10:27:10 2016
Waiting for tests to finish.
Fri Nov 4 10:27:17 2016
Fri Nov 4 10:27:22 2016 Starting WARM phase
Fri Nov 4 10:27:22 2016 Warm Heartbeat Client 335: 0.00 Ops/sec
Fri Nov 4 10:27:45 2016 Warm-up 10 percent complete
Fri Nov 4 10:28:16 2016 Warm-up 20 percent complete
Fri Nov 4 10:28:22 2016 Warm Heartbeat Client 335: 1.02 Ops/sec
Fri Nov 4 10:28:46 2016 Warm-up 30 percent complete
Fri Nov 4 10:29:16 2016 Warm-up 40 percent complete
Fri Nov 4 10:29:25 2016 Warm Heartbeat Client 335: 1.00 Ops/sec
Fri Nov 4 10:29:45 2016 Warm-up 50 percent complete
Fri Nov 4 10:30:15 2016 Warm-up 60 percent complete
Fri Nov 4 10:30:27 2016 Warm Heartbeat Client 335: 1.01 Ops/sec
Fri Nov 4 10:30:45 2016 Warm-up 70 percent complete
Fri Nov 4 10:31:15 2016 Warm-up 80 percent complete
Fri Nov 4 10:31:27 2016 Warm Heartbeat Client 335: 1.05 Ops/sec
Fri Nov 4 10:31:45 2016 Warm-up 90 percent complete
Fri Nov 4 10:32:17 2016 Warm-up 100 percent complete
Fri Nov 4 10:32:17 2016 Starting RUN phase
Fri Nov 4 10:32:29 2016 Run Heartbeat Client 335: 1.00 Ops/sec
Fri Nov 4 10:32:46 2016 Run 10 percent complete
Fri Nov 4 10:33:30 2016 Run Heartbeat Client 335: 1.03 Ops/sec
Fri Nov 4 10:33:47 2016 Run 30 percent complete
Fri Nov 4 10:34:15 2016 Run 40 percent complete
Fri Nov 4 10:34:32 2016 Run Heartbeat Client 335: 0.99 Ops/sec
Fri Nov 4 10:34:47 2016 Run 50 percent complete
Master: Failure message client 329 Error = 22
Master: Client 329 returned error 22 Possibly: Invalid argument
Client testing failed: Fri Nov 4 10:34:59 2016
Exit with cleanup
Killing 336 clients.
…
netmist did not complete successfully.
Created attachment 1217499 [details] Gluster client log for rhclient8 part2 (UTC; Nov 4, 2016)
Comment on attachment 1217499 [details]
Gluster client log for rhclient8 part2 (UTC; Nov 4, 2016)

The SPEC SFS error may be related to this line in the attached log:
[2016-11-04 17:34:59.224968] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 4506012: FSTAT() /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo => -1 (Invalid argument)
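Matching a screen timestamp to a log line means shifting Pacific local time to UTC by hand (7 hours during PDT, 8 during PST). A minimal sketch with GNU date that does the shift automatically; the America/Los_Angeles zone is an assumption about where the test harness runs:

```shell
# Convert a SPEC SFS screen timestamp (Pacific local time) to UTC so it
# can be grepped for directly in the glusterfs logs. GNU date picks the
# correct PDT/PST offset from the zone database.
to_utc() {
    TZ=UTC date -d "TZ=\"America/Los_Angeles\" $1" '+%Y-%m-%d %H:%M:%S'
}

to_utc '2016-11-04 10:34:59'   # PDT run -> 2016-11-04 17:34:59
```

The failure time on screen (10:34:59 PDT) then lines up with the fuse_attr_cbk entry at 17:34:59 UTC above.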
Created attachment 1217501 [details] Gluster client log for rhclient8 part1 (UTC; Nov 4, 2016)
Created attachment 1217502 [details] SPEC SFS log with timestamps (PDT; Nov 4, 2016)
Created attachment 1217503 [details] SPEC SFS setup file (Nov 4, 2016)
Created attachment 1217504 [details] Error message on screen when SPEC SFS VDA test fails (PDT; Nov 4, 2016)
Found Gluster server rhosd4 showing movement between tiers for file ehndeie0.preo, using this command:
grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
Output is here: http://pastebin.test.redhat.com/427282

The same file is found in the rhclient8 gluster log:
[2016-11-04 17:34:59.224968] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 4506012: FSTAT() /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo => -1 (Invalid argument)

Both logs are attached.
Created attachment 1217508 [details] Gluster server log for rhosd4 (UTC; Nov 4, 2016)
Output from the server log for the file with the error.

[root@rhosd4 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
[2016-11-04 17:33:33.364950] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
[2016-11-04 17:33:33.364972] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-2
[2016-11-04 17:33:33.365363] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 17:33:33.365382] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 17:33:33.365412] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier promote: src_subvol 1nvme-distrep3x2-cold-dht file /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 17:33:33.368494] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-2
[2016-11-04 17:33:33.368799] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:33:33.369036] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369414] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret -1 [No such file or directory]
[2016-11-04 17:33:33.369420] D [MSGID: 0] [dht-common.c:2045:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: Entry /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo missing on subvol 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369718] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-3
[2016-11-04 17:33:33.369930] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369992] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-4
[2016-11-04 17:33:33.370032] D [MSGID: 0] [dht-common.c:1349:dht_lookup_everywhere_done] 0-1nvme-distrep3x2-hot-dht: There was no cached file and unlink on hashed is not skipped /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 17:33:33.370147] D [MSGID: 0] [dht-common.c:6496:dht_create_wind_to_avail_subvol] 0-1nvme-distrep3x2-hot-dht: creating /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.370827] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.371103] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:34:02.393288] D [MSGID: 0] [client-rpc-fops.c:1157:client3_3_getxattr_cbk] 0-1nvme-distrep3x2-client-4: remote operation failed: No data available. Path: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo (3da564c5-22ff-46c4-9d99-16a35141531a). Key: trusted.glusterfs.dht.linkto
[2016-11-04 17:34:02.393510] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-tier-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-cold-dht
[2016-11-04 17:34:02.393531] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-2
[2016-11-04 17:34:02.393811] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:34:02.393828] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-tier-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0 [Structure needs cleaning]
[2016-11-04 17:34:02.393847] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-5
[2016-11-04 17:34:02.394356] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
File create time on hot tier (PDT):

[root@rhosd0 VDA_failed_110416]# cd /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/
[root@rhosd0 bucket2]# ls -ltr
total 1048576
-rw-r--r-- 2 root root 1073741824 Nov  4 10:19 ehndeie0.preo
Also seen on rhosd0 after the test fails; it looks like a demote to the cold tier is attempted and fails:

[root@rhosd0 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
[2016-11-04 19:00:00.812749] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
[2016-11-04 19:00:00.812790] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-2
[2016-11-04 19:00:00.813132] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 19:00:00.813144] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 19:00:00.813157] D [MSGID: 0] [dht-common.c:2151:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: Calling lookup on linkto target 1nvme-distrep3x2-hot-dht for path /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 19:00:00.813170] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 19:00:00.813714] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 19:00:00.813754] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier demote: src_subvol 1nvme-distrep3x2-hot-dht file /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 19:00:00.814033] D [MSGID: 0] [tier.c:233:tier_check_same_node] 0-1nvme-distrep3x2-tier-dht: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo does not belong to this node
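The promote/demote history above was pulled with a plain grep per server; a small reusable filter for the tier daemon log might look like this (log path taken from this report; just a sketch, not part of the product):

```shell
# Filter tier promote/demote decisions for one file out of a tier
# daemon log. Reads the log on stdin, prints matching entries.
tier_moves() {
    grep -E 'Tier (promote|demote)' | grep -F "$1"
}

# Usage on a server node (log name as seen in this report):
# tier_moves ehndeie0.preo < /var/log/glusterfs/1nvme-distrep3x2-tier.log
```

Running it on each server shows which node decided to promote or demote the file, and when.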
[root@rhosd0 glusterfs]# stat /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
  File: ‘/rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo’
  Size: 1073741824  Blocks: 2097152  IO Block: 4096  regular file
Device: fd0ch/64780d  Inode: 136313724  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-04 10:18:43.024091000 -0700
Modify: 2016-11-04 10:19:29.488418000 -0700
Change: 2016-11-04 10:34:02.386627448 -0700
 Birth: -
New test run with DEBUG log level set for clients & servers (PST).

SPEC SFS client 235 on host rhclient9:
Starting test client: 235 Host: rhclient9 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
...
Initialization finished: Tue Nov 8 12:03:04 2016
Testing begins: Tue Nov 8 12:03:04 2016
Waiting for tests to finish. Tue Nov 8 12:03:10 2016
Tue Nov 8 12:03:15 2016 Starting WARM phase
Tue Nov 8 12:03:15 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Tue Nov 8 12:03:38 2016 Warm-up 10 percent complete
Tue Nov 8 12:04:08 2016 Warm-up 20 percent complete
Tue Nov 8 12:04:17 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Tue Nov 8 12:04:38 2016 Warm-up 30 percent complete
Tue Nov 8 12:05:10 2016 Warm-up 40 percent complete
Tue Nov 8 12:05:19 2016 Warm Heartbeat Client 287: 0.82 Ops/sec
Tue Nov 8 12:05:39 2016 Warm-up 50 percent complete
Tue Nov 8 12:06:10 2016 Warm-up 60 percent complete
Tue Nov 8 12:06:21 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Tue Nov 8 12:06:39 2016 Warm-up 70 percent complete
Tue Nov 8 12:07:09 2016 Warm-up 80 percent complete
Tue Nov 8 12:07:23 2016 Warm Heartbeat Client 287: 1.08 Ops/sec
Tue Nov 8 12:07:38 2016 Warm-up 90 percent complete
Tue Nov 8 12:08:08 2016 Warm-up 100 percent complete
Tue Nov 8 12:08:11 2016 Starting RUN phase
Tue Nov 8 12:08:25 2016 Run Heartbeat Client 287: 1.09 Ops/sec
Tue Nov 8 12:08:39 2016 Run 10 percent complete
Tue Nov 8 12:09:27 2016 Run Heartbeat Client 287: 1.00 Ops/sec
Tue Nov 8 12:09:39 2016 Run 30 percent complete
Master: Failure message client 235 Error = 22
Master: Client 235 returned error 22 Possibly: Invalid argument
Client testing failed: Tue Nov 8 12:09:41 2016
Exit with cleanup
Killing 288 clients.
From the rhclient9 glusterfs log (log message in UTC; stat result in PST):

rhgs-client-1nvme-distrep3x2.log:[2016-11-08 20:09:41.691498] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2929172: FSTAT() /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o => -1 (Invalid argument)

[root@rhosd0 bucket6]# stat /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
  File: ‘/rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o’
  Size: 1073741824  Blocks: 2097152  IO Block: 4096  regular file
Device: fd0ch/64780d  Inode: 742037305  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-08 11:36:50.841120000 -0800
Modify: 2016-11-08 12:06:36.017965000 -0800
Change: 2016-11-08 12:09:32.809863402 -0800
 Birth: -
Results from 'grep -i "ite0.o" /var/log/glusterfs/bricks/*.log' for servers rhosd4 and rhosd5:

rhosd4
/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:05:43.819746] D [MSGID: 101107] [gfdb_sqlite3_helper.c:396:gf_sql_insert_link] 0-sqlite3: Failed executing the prepared stmt 3892dd37-6632-482a-b546-06e1228ce77b e7135fe3-1ce8-4126-b58e-2fbd64a8778f ite0.o INSERT INTO GF_FLINK_TB (GF_ID, GF_PID, FNAME, W_DEL_FLAG, LINK_UPDATE) VALUES (?, ?, ?, 0, 0); : SQL logic error or missing database
/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:09:32.796670] D [MSGID: 115059] [server-rpc-fops.c:819:server_getxattr_cbk] 0-1nvme-distrep3x2-server: 46452: GETXATTR /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (3892dd37-6632-482a-b546-06e1228ce77b) (trusted.glusterfs.dht.linkto) ==> (No data available) [No data available]
/var/log/glusterfs/bricks/rhgs-hotbricks-1nvme-distrep3x2-hot.log:[2016-11-08 20:08:03.102099] D [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-1nvme-distrep3x2-server: 65485: LOOKUP /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (e7135fe3-1ce8-4126-b58e-2fbd64a8778f/ite0.o) ==> (No such file or directory) [No such file or directory]

rhosd5
/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:05:43.842730] D [MSGID: 101107] [gfdb_sqlite3_helper.c:396:gf_sql_insert_link] 0-sqlite3: Failed executing the prepared stmt 3892dd37-6632-482a-b546-06e1228ce77b e7135fe3-1ce8-4126-b58e-2fbd64a8778f ite0.o INSERT INTO GF_FLINK_TB (GF_ID, GF_PID, FNAME, W_DEL_FLAG, LINK_UPDATE) VALUES (?, ?, ?, 0, 0); : SQL logic error or missing database
/var/log/glusterfs/bricks/rhgs-hotbricks-1nvme-distrep3x2-hot.log:[2016-11-08 20:08:02.184011] D [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-1nvme-distrep3x2-server: 86885: LOOKUP /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (e7135fe3-1ce8-4126-b58e-2fbd64a8778f/ite0.o) ==> (No such file or directory) [No such file or directory]
Results from 'grep -i "ite0.o" /var/log/glusterfs/*.log' for server rhosd4 (no hits for rhosd5):

[2016-11-08 20:08:01.626597] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-cold-dht
[2016-11-08 20:08:01.626641] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-2
[2016-11-08 20:08:01.627280] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret 0
[2016-11-08 20:08:01.627324] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret 0
[2016-11-08 20:08:01.627397] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier promote: src_subvol 1nvme-distrep3x2-cold-dht file /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
[2016-11-08 20:08:01.629635] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-2
[2016-11-08 20:08:01.630021] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:08:01.630284] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:02.160871] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret -1 [No such file or directory]
[2016-11-08 20:08:02.160878] D [MSGID: 0] [dht-common.c:2045:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: Entry /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o missing on subvol 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:02.161506] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-4
[2016-11-08 20:08:02.445420] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.102272] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-3
[2016-11-08 20:08:03.102290] D [MSGID: 0] [dht-common.c:1349:dht_lookup_everywhere_done] 0-1nvme-distrep3x2-hot-dht: There was no cached file and unlink on hashed is not skipped /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
[2016-11-08 20:08:03.102447] D [MSGID: 0] [dht-common.c:6496:dht_create_wind_to_avail_subvol] 0-1nvme-distrep3x2-hot-dht: creating /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.183478] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.185047] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:09:32.796708] D [MSGID: 0] [client-rpc-fops.c:1157:client3_3_getxattr_cbk] 0-1nvme-distrep3x2-client-4: remote operation failed: No data available. Path: /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (3892dd37-6632-482a-b546-06e1228ce77b). Key: trusted.glusterfs.dht.linkto
[2016-11-08 20:09:32.796906] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-tier-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-cold-dht
[2016-11-08 20:09:32.796928] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-2
[2016-11-08 20:09:32.797300] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:09:32.797326] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-tier-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0 [Structure needs cleaning]
[2016-11-08 20:09:32.797351] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-5
[2016-11-08 20:09:32.797715] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
All logs can be found at the Google Drive location here: https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KN1lMV0h6dWRJbk0

37735901  Nov 8 14:04 rhgs-hotbricks-1nvme-distrep3x2-hot.rhosd4.log.gz
109771660 Nov 8 14:04 rhgs-coldbricks-1nvme-distrep3x2.rhosd4.log.gz
3730596   Nov 8 14:05 1nvme-distrep3x2-tier.rhosd4.log.gz
3125448   Nov 8 14:05 rhgs-client-1nvme-distrep3x2.rhclient9.log.gz

Use 'gzip -d <filename.gz>' to unzip.
(In reply to Annette Clewett from comment #23)
> Also see this on rhosd0 after test fails, looks like trying to demote to
> cold tier and it fails?
>
> [root@rhosd0 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
> [2016-11-04 19:00:00.812749] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
> [2016-11-04 19:00:00.812790] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-2
> [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier demote: src_subvol 1nvme-distrep3x2-hot-dht file /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
> [2016-11-04 19:00:00.814033] D [MSGID: 0] [tier.c:233:tier_check_same_node] 0-1nvme-distrep3x2-tier-dht: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo does not belong to this node

This can be ignored - it just means some other node picked up the file for migration.
(In reply to Annette Clewett from comment #29)
> All logs can be found at Google drive location here:
>
> https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KN1lMV0h6dWRJbk0
>
> 37735901 Nov 8 14:04 rhgs-hotbricks-1nvme-distrep3x2-hot.rhosd4.log.gz
> 109771660 Nov 8 14:04 rhgs-coldbricks-1nvme-distrep3x2.rhosd4.log.gz
> 3730596 Nov 8 14:05 1nvme-distrep3x2-tier.rhosd4.log.gz
> 3125448 Nov 8 14:05 rhgs-client-1nvme-distrep3x2.rhclient9.log.gz
>
> gzip -d <filename.gz> to unzip

We will need the client logs as well.

Thanks,
Nithya
Annette, do you see any "Invalid argument" errors on any of the server nodes? I'm having trouble connecting to the VPN, so I cannot check myself.
Annette,

Can you please upload all the server logs and the client logs for the failed client to Google Drive?

Thanks,
Nithya
A file's inode in the hot tier subvolume has a NULL layout. The cached subvolume for the same inode in the parent translator is the hot subvolume. When an FSTAT is performed, the I/O is directed to the hot tier, which in turn fails on the NULL layout and fails the I/O.

Each time this has been recreated in the lab, the 1G file in question had just completed promotion. A second observation is that the hot tier may have been full at the time the file was created.

I've been unable to recreate this on my own or obtain further information from the logs. To solve this problem we need additional info. I have built RPMs containing new log messages; these have to be used to recreate the problem. Contact me when ready to load them.
(In reply to Dan Lambright from comment #34)
> A file's inode in the hot tier subvolume has a NULL layout. The cached
> subvolume for the same inode in the parent translator is the hot subvolume.
> When an FSTAT is performed the I/O is directed to the hot tier, which in
> turn fails on the NULL layout and fails the I/O.
>
> Each time this has been recreated in the lab the 1G file in question has
> just completed promotion. A second observation is the hot tier may have
> been full at the time the file was created.
>
> I've been unable to recreate this on my own or obtain further information
> from the logs. To solve this problem we need additional info. I have built
> RPMS containing new log messages. These have to be used to recreate the
> problem. Contact me when ready to load them.

Annette, after deploying the new RPMs and hitting the issue, please get the xattrs set on the parent dir of the file on each brick of the volume (both hot and cold). For example, if the error is returned for dir1/dir2/dir3/file1, please run

getfattr -e hex -d -m . <brickpath>/dir1/dir2/dir3

on each brick. This will help us validate the error.
Starting test client: 21 Host: rhclient10 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Waiting to finish initialization. Sat Nov 19 13:05:20 2016
Sat Nov 19 13:10:39 2016 Starting INIT phase
Sat Nov 19 13:10:41 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:11:41 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:12:41 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:13:26 2016 Init 20 percent complete
Sat Nov 19 13:13:41 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:14:42 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:15:42 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:16:42 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:17:04 2016 Init 50 percent complete
Sat Nov 19 13:17:42 2016 Init Heartbeat __/\_/\__
Sat Nov 19 13:17:48 2016 Init 70 percent complete
Sat Nov 19 13:18:33 2016 Init 100 percent complete
Initialization finished: Sat Nov 19 13:20:27 2016
Testing begins: Sat Nov 19 13:20:27 2016
Waiting for tests to finish. Sat Nov 19 13:20:36 2016
Sat Nov 19 13:20:41 2016 Starting WARM phase
Sat Nov 19 13:20:41 2016 Warm Heartbeat Client 431: 0.00 Ops/sec
Sat Nov 19 13:21:06 2016 Warm-up 10 percent complete
Sat Nov 19 13:21:35 2016 Warm-up 20 percent complete
Sat Nov 19 13:21:41 2016 Warm Heartbeat Client 431: 0.99 Ops/sec
Sat Nov 19 13:22:05 2016 Warm-up 30 percent complete
Sat Nov 19 13:22:35 2016 Warm-up 40 percent complete
Sat Nov 19 13:22:41 2016 Warm Heartbeat Client 431: 0.99 Ops/sec
Sat Nov 19 13:23:05 2016 Warm-up 50 percent complete
Sat Nov 19 13:23:35 2016 Warm-up 60 percent complete
Sat Nov 19 13:23:41 2016 Warm Heartbeat Client 431: 1.00 Ops/sec
Sat Nov 19 13:24:05 2016 Warm-up 70 percent complete
Sat Nov 19 13:24:35 2016 Warm-up 80 percent complete
Sat Nov 19 13:24:41 2016 Warm Heartbeat Client 431: 1.01 Ops/sec
Sat Nov 19 13:25:05 2016 Warm-up 90 percent complete
Sat Nov 19 13:25:35 2016 Warm-up 100 percent complete
Sat Nov 19 13:25:38 2016 Starting RUN phase
Sat Nov 19 13:25:44 2016 Run Heartbeat Client 431: 0.67 Ops/sec
Master: Failure message client 21 Error = 22
Master: Client 21 returned error 22 Possibly: Invalid argument
Client testing failed: Sat Nov 19 13:25:55 2016
Exit with cleanup
Killing 432 clients.

rhclient10 gluster client log:
[2016-11-19 21:25:55.365726] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 7269657: FSTAT() /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol => -1 (Invalid argument)

NOTE: I cannot find file wrc0.lpol at location /specsfs/CL21_VDA2/Dir0/bucket6/ on /rhgs/coldbricks or /rhgs/hotbricks. When I navigate to this directory on the hot tier (the only place the directory exists), there is nothing in the directory. This is a change from prior failures.

The gluster client log (level DEBUG) for rhclient10 (the client with the failure) and the gluster tier log (level DEBUG; the only tier log with a reference to filename wrc0.lpol) for rhosd2 are uploaded, as well as all of the glusterfs/bricks logs for the six servers. Google Drive here: https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KZGNUQjBFQklFRmM.
Info Nithya requested:

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

NOTE: file wrc0.lpol is not in either of these directories
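For anyone reading these dumps: the trusted.glusterfs.dht value is easier to compare across bricks once it is split into words. Below is a small helper I use (my own sketch, not Gluster tooling); the interpretation of the four big-endian 32-bit words as count, hash type, range start, range stop is my reading of DHT's on-disk layout format.

```python
import struct

def parse_dht_layout(xattr_hex):
    """Split a trusted.glusterfs.dht value into its four big-endian
    32-bit words. As I read DHT's on-disk layout format these are:
    count, hash type, range start, range stop."""
    return struct.unpack(">IIII", bytes.fromhex(xattr_hex[2:]))

# Values from the bucket6 directory on rhosd0 above:
cnt, htype, start, stop = parse_dht_layout("0x0000000100000000aaaaaaaaffffffff")
print(hex(start), hex(stop))  # 0xaaaaaaaa 0xffffffff
```

So the hot-tier bucket6 directory on rhosd0 owns the hash range 0xaaaaaaaa-0xffffffff, and the cold-tier copy owns 0x00000000-0x55555554.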
All info for the six servers (rhosd0-rhosd5) for the commands below can be found at: http://pastebin.test.redhat.com/432270

getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
****rhosd0
[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

****rhosd1
[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

****rhosd2
[root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005830991a0007460b
trusted.gfid=0x162424ad9b074a7998752a5811e90223

[root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005830977b000e5869
trusted.gfid=0x162424ad9b074a7998752a5811e90223
trusted.tier.tier-dht.linkto=0x316e766d652d646973747265703378322d686f742d64687400

****rhosd3
[root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005830991a00078ba5
trusted.gfid=0x162424ad9b074a7998752a5811e90223

[root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005830977b000e8748
trusted.gfid=0x162424ad9b074a7998752a5811e90223
trusted.tier.tier-dht.linkto=0x316e766d652d646973747265703378322d686f742d64687400

****rhosd4
[root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

[root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

****rhosd5
[root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

[root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff

[root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
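Two things can be decoded mechanically from the dump above. This Python sketch (a quick helper of my own, not Gluster tooling; the "last two words are the hash range" interpretation is my reading of DHT's on-disk layout format) decodes the trusted.tier.tier-dht.linkto value seen on the rhosd2/rhosd3 cold bricks, and checks that the three per-replica-pair trusted.glusterfs.dht ranges on the hot-tier bucket6 directory together cover the full 32-bit hash space, i.e. the on-disk directory layout itself has no hole:

```python
def layout_range(xattr_hex):
    # The last two big-endian 32-bit words of a trusted.glusterfs.dht
    # value are the hash range [start, stop].
    raw = bytes.fromhex(xattr_hex[2:])
    return int.from_bytes(raw[8:12], "big"), int.from_bytes(raw[12:16], "big")

# The linkto xattr on the cold-brick file is a NUL-terminated subvol name:
linkto = "0x316e766d652d646973747265703378322d686f742d64687400"
print(bytes.fromhex(linkto[2:]).rstrip(b"\x00").decode())  # 1nvme-distrep3x2-hot-dht

# trusted.glusterfs.dht on the hot-tier bucket6 dir, one value per replica pair:
ranges = sorted(layout_range(x) for x in (
    "0x0000000100000000aaaaaaaaffffffff",  # rhosd0/rhosd1
    "0x000000010000000055555555aaaaaaa9",  # rhosd2/rhosd3
    "0x00000001000000000000000055555554",  # rhosd4/rhosd5
))
complete = (ranges[0][0] == 0 and ranges[-1][1] == 0xffffffff and
            all(a[1] + 1 == b[0] for a, b in zip(ranges, ranges[1:])))
print("hot-tier layout complete:", complete)
```

The coverage check comes back complete here, which is consistent with Dan's observation that the NULL layout is on the in-memory inode rather than in the on-disk directory layout.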
Uploaded new logs from servers rhosd2 & rhosd3, glustershd.log. These two servers are in a replica pair where the file was first written to their cold tier. The message below from glustershd.log on rhosd2 seems to imply a problem - keyword = mismatch. Is it possible this file was not created correctly on the cold tier of one of the two servers in the replica pair? This happens well before the failure, which occurs after the file's promotion to the hot tier (on the same replica pair, it looks like) is done and the attempt to read the file crashes the SPEC SFS test.

Log messages from the rhclient10 log when the file is created on the cold tier on rhosd2 & rhosd3:
[2016-11-19 19:34:30.963098] D [MSGID: 0] [dht-common.c:1537:dht_lookup_everywhere_done] 2-1nvme-distrep3x2-tier-dht: There was no cached file and unlink on hashed is not skipped /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
[2016-11-19 19:34:30.963313] D [MSGID: 0] [tier-common.c:437:tier_create] 2-1nvme-distrep3x2-tier-dht: creating /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol on 1nvme-distrep3x2-cold-dht
[2016-11-19 19:34:30.963350] D [MSGID: 0] [dht-common.c:6684:dht_create_wind_to_avail_subvol] 2-1nvme-distrep3x2-cold-dht: creating /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol on 1nvme-distrep3x2-replicate-1

Log messages from rhosd2 glustershd.log (search for file gfid 162424ad-9b07-4a79-9875-2a5811e90223):
[2016-11-19 19:36:31.152637] D [MSGID: 0] [afr-self-heald.c:409:afr_shd_index_heal] 0-1nvme-distrep3x2-replicate-1: got entry: 162424ad-9b07-4a79-9875-2a5811e90223
[2016-11-19 19:36:31.160009] D [MSGID: 0] [afr-self-heal-common.c:1396:afr_selfheal_unlocked_inspect] 0-1nvme-distrep3x2-replicate-1: SIZE mismatch 820379648 vs 821559296 on 1nvme-distrep3x2-client-3 for gfid:162424ad-9b07-4a79-9875-2a5811e90223

Log messages from rhosd3 glustershd.log (search for file gfid 162424ad-9b07-4a79-9875-2a5811e90223):
[2016-11-19 19:36:16.027296] D [MSGID: 0] [afr-self-heald.c:409:afr_shd_index_heal] 0-1nvme-distrep3x2-replicate-1: got entry: 162424ad-9b07-4a79-9875-2a5811e90223
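For anyone correlating brick xattrs with shd log entries: the trusted.gfid value dumped on the bricks is the same UUID the self-heal daemon logs, just hex-encoded. A small helper of my own to render it:

```python
import uuid

def gfid_to_uuid(xattr_hex):
    """Render a trusted.gfid xattr value in the canonical UUID form
    that appears in glustershd.log."""
    return str(uuid.UUID(hex=xattr_hex[2:] if xattr_hex.startswith("0x") else xattr_hex))

# trusted.gfid of wrc0.lpol on the rhosd2/rhosd3 bricks:
print(gfid_to_uuid("0x162424ad9b074a7998752a5811e90223"))
# 162424ad-9b07-4a79-9875-2a5811e90223 - the gfid in the SIZE-mismatch entry
```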
Failure for new RPMs created today, Nov 22nd:

Starting test client: 281 Host: rhclient8 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Initialization finished: Tue Nov 22 19:39:21 2016
Testing begins: Tue Nov 22 19:39:21 2016
Waiting for tests to finish. Tue Nov 22 19:39:27 2016
Tue Nov 22 19:39:32 2016 Starting WARM phase
Tue Nov 22 19:39:32 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Tue Nov 22 19:39:56 2016 Warm-up 10 percent complete
Tue Nov 22 19:40:26 2016 Warm-up 20 percent complete
Tue Nov 22 19:40:33 2016 Warm Heartbeat Client 287: 0.94 Ops/sec
Tue Nov 22 19:40:56 2016 Warm-up 30 percent complete
Tue Nov 22 19:41:25 2016 Warm-up 40 percent complete
Tue Nov 22 19:41:34 2016 Warm Heartbeat Client 287: 1.02 Ops/sec
Tue Nov 22 19:41:56 2016 Warm-up 50 percent complete
Tue Nov 22 19:42:44 2016 Warm Heartbeat Client 287: 0.22 Ops/sec
Tue Nov 22 19:42:56 2016 Warm-up 70 percent complete
Tue Nov 22 19:43:25 2016 Warm-up 80 percent complete
Tue Nov 22 19:43:46 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Tue Nov 22 19:43:57 2016 Warm-up 90 percent complete
Tue Nov 22 19:44:25 2016 Warm-up 100 percent complete
Tue Nov 22 19:44:27 2016 Starting RUN phase
Tue Nov 22 19:44:46 2016 Run Heartbeat Client 287: 0.91 Ops/sec
Tue Nov 22 19:44:55 2016 Run 10 percent complete
Tue Nov 22 19:45:25 2016 Run 20 percent complete
Tue Nov 22 19:45:46 2016 Run Heartbeat Client 287: 0.77 Ops/sec
Tue Nov 22 19:45:56 2016 Run 30 percent complete
Tue Nov 22 19:46:26 2016 Run 40 percent complete
Tue Nov 22 19:46:46 2016 Run Heartbeat Client 287: 1.08 Ops/sec
Tue Nov 22 19:46:55 2016 Run 50 percent complete
Master: Failure message client 281 Error = 9
Master: Client 281 returned error 9 Possibly: Bad file descriptor
Client testing failed: Tue Nov 22 19:47:06 2016
Exit with cleanup
Killing 288 clients.
gluster client log for rhclient8:
[2016-11-23 03:47:06.969953] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 2969402: FLUSH() ERR => -1 (Bad file descriptor)

Client log for rhclient8 uploaded to https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KYnRXVG5rVnNnTHc
Starting test client: 67 Host: rhclient9 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Initialization finished: Wed Nov 23 12:42:52 2016
Testing begins: Wed Nov 23 12:42:52 2016
Waiting for tests to finish. Wed Nov 23 12:42:58 2016
Wed Nov 23 12:43:03 2016 Starting WARM phase
Wed Nov 23 12:43:03 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Wed Nov 23 12:43:26 2016 Warm-up 10 percent complete
Wed Nov 23 12:43:56 2016 Warm-up 20 percent complete
Wed Nov 23 12:44:04 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Wed Nov 23 12:44:26 2016 Warm-up 30 percent complete
Wed Nov 23 12:44:56 2016 Warm-up 40 percent complete
Wed Nov 23 12:45:05 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Wed Nov 23 12:45:26 2016 Warm-up 50 percent complete
Wed Nov 23 12:45:56 2016 Warm-up 60 percent complete
Wed Nov 23 12:46:11 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Wed Nov 23 12:46:28 2016 Warm-up 70 percent complete
Wed Nov 23 12:46:57 2016 Warm-up 80 percent complete
Wed Nov 23 12:47:18 2016 Warm Heartbeat Client 287: 1.01 Ops/sec
Wed Nov 23 12:47:28 2016 Warm-up 90 percent complete
Wed Nov 23 12:47:56 2016 Warm-up 100 percent complete
Wed Nov 23 12:47:59 2016 Starting RUN phase
Wed Nov 23 12:48:20 2016 Run Heartbeat Client 287: 0.78 Ops/sec
Wed Nov 23 12:48:29 2016 Run 10 percent complete
Wed Nov 23 12:48:59 2016 Run 20 percent complete
Wed Nov 23 12:49:20 2016 Run Heartbeat Client 287: 1.04 Ops/sec
Wed Nov 23 12:49:28 2016 Run 30 percent complete
Master: Failure message client 67 Error = 9
Master: Client 67 returned error 9 Possibly: Bad file descriptor
Client testing failed: Wed Nov 23 12:49:29 2016
Exit with cleanup
Killing 288 clients.
[2016-11-23 20:49:27.738179] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2925467: FSTAT() /specsfs/CL67_VDA2/Dir0/bucket6/astr0.d => -1 (Bad file descriptor)
[2016-11-23 20:49:29.501843] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 2926826: FLUSH() ERR => -1 (Bad file descriptor)

Client log for rhclient9 uploaded to https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KVDBoaE15OUxZMnc
We ran a new test today on glusterfs-3.7.9-12.el7rhgs.x86_64 bits using machines in the Supermicro lab, where we limited the hot tier to 1x2 in order to eliminate DHT completely from any cold>hot promotion activity (note that the cold tier remained 3x2 on RAID 6 spinners). Info and failure condition, with the client error, below.

[root@rhosd0 ~]# gluster vol info

Volume Name: 1nvme-distrep3x2
Type: Tier
Volume ID: 20f9b6f0-30a0-42fb-9ff4-aae0e04abc16
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick2: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick3: n0:/rhgs/coldbricks/1nvme-distrep3x2
Brick4: n1:/rhgs/coldbricks/1nvme-distrep3x2
Brick5: n2:/rhgs/coldbricks/1nvme-distrep3x2
Brick6: n3:/rhgs/coldbricks/1nvme-distrep3x2
Brick7: n4:/rhgs/coldbricks/1nvme-distrep3x2
Brick8: n5:/rhgs/coldbricks/1nvme-distrep3x2
Options Reconfigured:
diagnostics.client-log-level: DEBUG
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

***

Starting test client: 45 Host: rhclient10 Workload: VDA2 Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
Initialization finished: Fri Dec 2 08:05:52 2016
Testing begins: Fri Dec 2 08:05:52 2016
Waiting for tests to finish. Fri Dec 2 08:05:58 2016
Fri Dec 2 08:06:03 2016 Starting WARM phase
Fri Dec 2 08:06:04 2016 Warm Heartbeat Client 287: 0.00 Ops/sec
Fri Dec 2 08:06:27 2016 Warm-up 10 percent complete
Fri Dec 2 08:06:57 2016 Warm-up 20 percent complete
Fri Dec 2 08:07:27 2016 Warm-up 30 percent complete
Fri Dec 2 08:07:57 2016 Warm-up 40 percent complete
Fri Dec 2 08:08:03 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Fri Dec 2 08:08:26 2016 Warm-up 50 percent complete
Fri Dec 2 08:08:56 2016 Warm-up 60 percent complete
Fri Dec 2 08:09:05 2016 Warm Heartbeat Client 287: 1.00 Ops/sec
Fri Dec 2 08:09:26 2016 Warm-up 70 percent complete
Fri Dec 2 08:09:58 2016 Warm-up 80 percent complete
Fri Dec 2 08:10:07 2016 Warm Heartbeat Client 287: 0.98 Ops/sec
Fri Dec 2 08:10:58 2016 Warm-up 100 percent complete
Fri Dec 2 08:11:02 2016 Starting RUN phase
Fri Dec 2 08:11:08 2016 Run Heartbeat Client 287: 0.56 Ops/sec
Fri Dec 2 08:11:32 2016 Run 10 percent complete
Fri Dec 2 08:12:09 2016 Run Heartbeat Client 287: 1.00 Ops/sec
Fri Dec 2 08:12:30 2016 Run 30 percent complete
Master: Failure message client 45 Error = 22
Master: Client 45 returned error 22 Possibly: Invalid argument
Client testing failed: Fri Dec 2 08:12:57 2016
Exit with cleanup
Killing 288 clients.

Message from the rhclient10 glusterfs client log for SPEC SFS client 45 that corresponds with the failure:
[2016-12-02 16:12:57.047260] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 8521925: FSTAT() /specsfs/CL45_VDA2/Dir0/bucket6/irg0.oer => -1 (Invalid argument)
I am running the SPEC SFS workload in the BAGL lab using four servers and four clients against RHGS 3.7.9. The hot tier is 2x2. The slow tier is 2 x (8 + 4). I have not seen the problem reproduce after 12 hours of runtime. I will attempt other things, e.g. shrink the hot tier in order to provoke more demotion activity.

BENCHMARK=VDA
LOAD=10
INCR_LOAD=32
NUM_RUNS=10
CLIENT_MOUNTPOINTS="gprfc065.sbu.lab.eng.bos.redhat.com:/mnt gprfc066.sbu.lab.eng.bos.redhat.com:/mnt gprfc067.sbu.lab.eng.bos.redhat.com:/mnt gprfc068.sbu.lab.eng.bos.redhat.com:/mnt"
EXEC_PATH=/root/specsfs2014/binaries/linux/x64/netmist
USER=root
Testing results for new RPMs with Nithya's fix; dec9rpms-fix installed only on clients. Timezone = PST.

Starting test client: 85 Host: rhclient6 Workload: VDA2 Workdir: /rhgs/client/12xec42/specsfs
Testing begins: Fri Dec 9 12:08:02 2016
Waiting for tests to finish. Fri Dec 9 12:08:07 2016
Fri Dec 9 12:08:12 2016 Starting WARM phase
Fri Dec 9 12:08:12 2016 Warm Heartbeat Client 239: 0.00 Ops/sec
Fri Dec 9 12:08:35 2016 Warm-up 10 percent complete
Fri Dec 9 12:09:06 2016 Warm-up 20 percent complete
Fri Dec 9 12:09:12 2016 Warm Heartbeat Client 239: 0.99 Ops/sec
Fri Dec 9 12:09:35 2016 Warm-up 30 percent complete
Fri Dec 9 12:10:05 2016 Warm-up 40 percent complete
Fri Dec 9 12:10:15 2016 Warm Heartbeat Client 239: 1.02 Ops/sec
Fri Dec 9 12:10:37 2016 Warm-up 50 percent complete
Fri Dec 9 12:11:07 2016 Warm-up 60 percent complete
Fri Dec 9 12:11:16 2016 Warm Heartbeat Client 239: 2.07 Ops/sec
Fri Dec 9 12:11:36 2016 Warm-up 70 percent complete
Fri Dec 9 12:12:07 2016 Warm-up 80 percent complete
Fri Dec 9 12:12:16 2016 Warm Heartbeat Client 239: 1.00 Ops/sec
Fri Dec 9 12:12:36 2016 Warm-up 90 percent complete
Fri Dec 9 12:13:06 2016 Warm-up 100 percent complete
Fri Dec 9 12:13:09 2016 Starting RUN phase
Fri Dec 9 12:13:18 2016 Run Heartbeat Client 239: 1.01 Ops/sec
Fri Dec 9 12:13:39 2016 Run 10 percent complete
Fri Dec 9 12:14:08 2016 Run 20 percent complete
Fri Dec 9 12:14:20 2016 Run Heartbeat Client 239: 1.20 Ops/sec
Fri Dec 9 12:14:37 2016 Run 30 percent complete
Fri Dec 9 12:15:07 2016 Run 40 percent complete
Fri Dec 9 12:15:22 2016 Run Heartbeat Client 239: 1.27 Ops/sec
Fri Dec 9 12:15:37 2016 Run 50 percent complete
Fri Dec 9 12:16:07 2016 Run 60 percent complete
Master: Failure message client 85 Error = 9
Master: Client 85 returned error 9 Possibly: Bad file descriptor
Client testing failed: Fri Dec 9 12:16:10 2016
Exit with cleanup
Killing 240 clients.
rhclient6 glusterfs log message for failure (UTC):

[2016-12-09 20:16:08.832599] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 796921: FSTAT() /specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an => -1 (Bad file descriptor)

The rhclient6 glusterfs log can be found here: https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KbE5iS1lQTHFqcTg
Can you get us the gfid of this file: /specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
It is available by running getfattr -d -m . -e hex on the brick on which the file exists.
Thanks, Nithya

I could not find the file on the cold tier. Below is the info for the file on the two bricks in the hot tier, on the two servers in the same replica pair.

*** rhosd0 (1st server in replica pair)

[root@rhosd0 bucket6]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
trusted.gfid=0x2ce4c538a8234d9d9fa1d037c9f66e16
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000fde38c57ffffffff

[root@rhosd0 bucket6]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000584aff09000c008c
trusted.gfid=0x02eefce100ea4917be0634dc70d3618a

*** rhosd1 (2nd server in replica pair)

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
trusted.gfid=0x2ce4c538a8234d9d9fa1d037c9f66e16
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000fde38c57ffffffff

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000584aff0c0002b9b4
trusted.gfid=0x02eefce100ea4917be0634dc70d3618a
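For cross-referencing with the client log messages (which print gfids in canonical UUID form), the trusted.gfid xattr is the file's 16-byte GFID printed as raw hex. A minimal sketch, assuming only the standard library, to convert the hex xattr value reported above into the dashed UUID form:

```python
import uuid

def gfid_to_uuid(xattr_hex: str) -> str:
    """Convert a trusted.gfid value as printed by `getfattr -e hex`
    (e.g. '0x02ee...') into the canonical dashed-UUID string."""
    hexval = xattr_hex[2:] if xattr_hex.startswith("0x") else xattr_hex
    return str(uuid.UUID(bytes=bytes.fromhex(hexval)))

# The gfid reported for oitb0.an on both hot-tier bricks:
print(gfid_to_uuid("0x02eefce100ea4917be0634dc70d3618a"))
# 02eefce1-00ea-4917-be06-34dc70d3618a
```

Both bricks report the same gfid, as expected for a healthy replica pair; this UUID form is what would appear in a non-null `<gfid:...>` tag in the client log.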
Test failed with dec12rpms after running 12 or 13 benchmark values (120-552, increment by 36). Failed on load=552.

Starting test client: 835 Host: rhclient9 Workload: VDA2 Workdir: /rhgs/client/12xec42/specsfs
Testing begins: Tue Dec 13 17:32:36 2016
Waiting for tests to finish.
Tue Dec 13 17:32:59 2016
Tue Dec 13 17:33:03 2016 Starting WARM phase
Tue Dec 13 17:33:04 2016 Warm Heartbeat Client 1103: 0.00 Ops/sec
Tue Dec 13 17:33:29 2016 Warm-up 10 percent complete
Tue Dec 13 17:33:59 2016 Warm-up 20 percent complete
Tue Dec 13 17:34:05 2016 Warm Heartbeat Client 1103: 1.05 Ops/sec
Tue Dec 13 17:34:29 2016 Warm-up 30 percent complete
Tue Dec 13 17:34:58 2016 Warm-up 40 percent complete
Tue Dec 13 17:35:07 2016 Warm Heartbeat Client 1103: 1.03 Ops/sec
Tue Dec 13 17:35:28 2016 Warm-up 50 percent complete
Tue Dec 13 17:35:57 2016 Warm-up 60 percent complete
Tue Dec 13 17:36:07 2016 Warm Heartbeat Client 1103: 0.94 Ops/sec
Tue Dec 13 17:36:28 2016 Warm-up 70 percent complete
Tue Dec 13 17:36:57 2016 Warm-up 80 percent complete
Tue Dec 13 17:37:09 2016 Warm Heartbeat Client 1103: 1.03 Ops/sec
Tue Dec 13 17:37:26 2016 Warm-up 90 percent complete
Tue Dec 13 17:37:56 2016 Warm-up 100 percent complete
Tue Dec 13 17:38:00 2016 Starting RUN phase
Tue Dec 13 17:38:11 2016 Run Heartbeat Client 1103: 1.08 Ops/sec
Tue Dec 13 17:38:58 2016 Run 20 percent complete
Tue Dec 13 17:39:13 2016 Run Heartbeat Client 1103: 1.06 Ops/sec
Tue Dec 13 17:39:28 2016 Run 30 percent complete
Tue Dec 13 17:39:59 2016 Run 40 percent complete
Tue Dec 13 17:40:14 2016 Run Heartbeat Client 1103: 0.98 Ops/sec
Tue Dec 13 17:40:28 2016 Run 50 percent complete
Master: Failure message client 835 Error = 107
Master: Client 835 returned error 107
Possibly: Transport endpoint is not connected
Client testing failed: Tue Dec 13 17:40:57 2016
Exit with cleanup
Killing 1104 clients.

The gluster log for rhclient9 can be found at the Google Drive location: https://drive.google.com/drive/folders/0B3DXjvLEit_KTHNwbXdoOTZ0WU0
Error messages in the glusterfs rhclient9 log (UTC):

[2016-12-14 01:40:57.107984] D [MSGID: 0] [client-rpc-fops.c:2982:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f25405c85f0, 12xec42-client-43 returned -1 error: Stale file handle [Stale file handle]
[2016-12-14 01:40:57.108006] D [MSGID: 0] [afr-lk-common.c:448:transaction_lk_op] 0-12xec42-replicate-2: lk op is for a transaction
[2016-12-14 01:40:57.108017] D [MSGID: 0] [defaults.c:844:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f254068c5b8, 12xec42-disperse-7 returned -1 error: Stale file handle [Stale file handle]
On glusterfs version 3.12.2-16.el7rhgs, ran the SPEC SFS 2014 VDA workload and the test steps below, and this issue is not reproduced.

Steps:
1) A file exists on subvol-1 and has an fd opened on it.
2) Perform a write or similar op on the file from one terminal.
3) The file is migrated to another subvol.
4) From the same mount point, but in a different terminal, perform a file listing on the parent directory; this updates the cached subvolume for the file without opening the fd on the new subvol.
5) Write to the file again from the first terminal with the same fd.

Result: writes to the file are successful. Hence, moving this BZ to Verified.
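The migration in step 3 needs a tiered gluster volume, but the fd behavior the steps exercise (writes through an already-open fd must keep working after the file's location changes underneath it) can be sketched locally. This is only an analogy under assumed local-filesystem semantics, using os.rename as a stand-in for DHT migrating the file; all paths and names here are illustrative:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "datafile")

    # Steps 1-2: the file exists, an fd is held open on it, and a
    # write is performed through that fd.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.write(fd, b"before migration\n")

    # Step 3 (stand-in): the file "moves" underneath the open fd.
    os.rename(path, os.path.join(workdir, "datafile.migrated"))

    # Step 4: list the parent directory while the fd is still open.
    print(os.listdir(workdir))

    # Step 5: a write through the original fd must still succeed.
    os.write(fd, b"after migration\n")
    os.close(fd)
```

In the verified build the gluster client behaves the same way: the open fd survives the file's relocation and the final write succeeds, which is the "Result" recorded above.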
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607