Bug 1388218 - Client stale file handle error in dht-linkfile.c under SPEC SFS 2014 VDA workload
Summary: Client stale file handle error in dht-linkfile.c under SPEC SFS 2014 VDA work...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: ---
: RHGS 3.4.0
Assignee: Nithya Balachandran
QA Contact: Prasad Desala
URL:
Whiteboard: rebase
Depends On: 1402406
Blocks: 1503134
TreeView+ depends on / blocked
 
Reported: 2016-10-24 18:38 UTC by Dustin Black
Modified: 2018-09-17 09:48 UTC (History)
9 users (show)

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1402406 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:29:44 UTC


Attachments (Terms of Use)
Error message on screen when SPEC SFS VDA test fails (1.30 KB, text/plain)
2016-10-29 16:06 UTC, Annette Clewett
no flags Details
SFS script used to run testing (6.58 KB, application/x-shellscript)
2016-10-29 16:07 UTC, Annette Clewett
no flags Details
SPEC SFS log (753.89 KB, text/plain)
2016-10-29 16:07 UTC, Annette Clewett
no flags Details
Gluser log for rhclient2 (11.03 MB, text/plain)
2016-10-29 16:09 UTC, Annette Clewett
no flags Details
Gluster client log for rhclient8 part2 (UTC; Nov 4, 2016) (9.67 MB, application/x-gzip)
2016-11-04 20:14 UTC, Annette Clewett
no flags Details
Gluster client log for rhclient8 part1 (UTC; Nov 4, 2016) (10.96 MB, application/x-gzip)
2016-11-04 20:17 UTC, Annette Clewett
no flags Details
SPEC SFS log with timestamps (PDT; Nov 4, 2016) (83.16 KB, text/plain)
2016-11-04 20:18 UTC, Annette Clewett
no flags Details
SPEC SFS setup file (Nov 4, 2016) (1004 bytes, text/plain)
2016-11-04 20:20 UTC, Annette Clewett
no flags Details
Error message on screen when SPEC SFS VDA test fails (PDT; Nov 4, 2016) (3.71 KB, text/plain)
2016-11-04 20:24 UTC, Annette Clewett
no flags Details
Gluster server log for rhosd4 (UTC; Nov 4, 2016) (12.82 MB, application/x-gzip)
2016-11-04 20:55 UTC, Annette Clewett
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:31:17 UTC

Description Dustin Black 2016-10-24 18:38:25 UTC
Description of problem:
When running a benchmark test with the SPEC SFS 2014 VDA workload against a 3x2 on 3x2 tiered volume using the FUSE native client, the benchmark will fail with return code 101 between the INIT and WARM phases, reporting a client error of "No such file or directory." The client logs will consistently show an error in dht-linkfile.c, "Failed to set attr uid/gid" as in the log pattern below.

[2016-10-24 15:48:39.738525] I [MSGID: 109045] [dht-common.c:1667:dht_lookup_everywhere_cbk] 2-1nvme-distrep3x2-tier-dht: attempting deletion of stale linkfile /specsfs/CL21
_VDA2/Dir0/bucket11/onebd0.leb on 1nvme-distrep3x2-cold-dht (hashed subvol is 1nvme-distrep3x2-hot-dht)
[2016-10-24 15:48:39.739442] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-0: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.739459] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-1: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.739738] I [MSGID: 109069] [dht-common.c:992:dht_lookup_unlink_cbk] 2-1nvme-distrep3x2-tier-dht: lookup_unlink returned with op_ret -> -1 and op-errno ->
 16 for /specsfs/CL21_VDA2/Dir0/bucket11/onebd0.leb
[2016-10-24 15:48:39.742165] I [MSGID: 108019] [afr-lk-common.c:1086:afr_lock_blocking] 2-1nvme-distrep3x2-replicate-3: unable to lock on even one child
[2016-10-24 15:48:39.742195] I [MSGID: 108019] [afr-transaction.c:1218:afr_post_blocking_inodelk_cbk] 2-1nvme-distrep3x2-replicate-3: Blocking inodelks failed.
[2016-10-24 15:48:39.742209] E [MSGID: 109031] [dht-linkfile.c:306:dht_linkfile_setattr_cbk] 2-1nvme-distrep3x2-tier-dht: Failed to set attr uid/gid on /specsfs/CL21_VDA2/Dir0/bucket11/onebd0.leb :<gfid:00000000-0000-0000-0000-000000000000>  [Stale file handle]
[2016-10-24 15:48:39.853056] I [MSGID: 109045] [dht-common.c:1667:dht_lookup_everywhere_cbk] 2-1nvme-distrep3x2-tier-dht: attempting deletion of stale linkfile /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom on 1nvme-distrep3x2-cold-dht (hashed subvol is 1nvme-distrep3x2-hot-dht)
[2016-10-24 15:48:39.853577] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-2: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.853602] W [MSGID: 114031] [client-rpc-fops.c:664:client3_3_unlink_cbk] 2-1nvme-distrep3x2-client-3: remote operation failed [Device or resource busy]
[2016-10-24 15:48:39.853788] I [MSGID: 109069] [dht-common.c:992:dht_lookup_unlink_cbk] 2-1nvme-distrep3x2-tier-dht: lookup_unlink returned with op_ret -> -1 and op-errno -> 16 for /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom
[2016-10-24 15:48:39.856132] I [MSGID: 108019] [afr-lk-common.c:1086:afr_lock_blocking] 2-1nvme-distrep3x2-replicate-3: unable to lock on even one child
[2016-10-24 15:48:39.856150] I [MSGID: 108019] [afr-transaction.c:1218:afr_post_blocking_inodelk_cbk] 2-1nvme-distrep3x2-replicate-3: Blocking inodelks failed.
[2016-10-24 15:48:39.856161] E [MSGID: 109031] [dht-linkfile.c:306:dht_linkfile_setattr_cbk] 2-1nvme-distrep3x2-tier-dht: Failed to set attr uid/gid on /specsfs/CL213_VDA2/Dir0/bucket8/myrpd0.osxom :<gfid:00000000-0000-0000-0000-000000000000>  [Stale file handle]


Version-Release number of selected component (if applicable):
# cat /etc/redhat-storage-release 
Red Hat Gluster Storage Server 3.1 Update 3

# gluster --version
glusterfs 3.7.9 built on Jun 10 2016 06:32:42


How reproducible: Consistently


Steps to Reproduce:
Gluster 3x2 hot tier on a 3x2 cold tier. Hot tier is NVMe with standard LVM thin provisioning stack, cold tier is on RAID 6 spinning disks with standard LVM thin provisioning stack.

# gluster vol info 1nvme-distrep3x2
 
Volume Name: 1nvme-distrep3x2
Type: Tier
Volume ID: adbef8d2-2d15-41b9-b1f1-7d7d070aaf37
Status: Started
Number of Bricks: 12
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

12 clients connected to the volume with the FUSE native client.

# glusterfs --version
glusterfs 3.7.1 built on Sep 21 2015 07:52:17


Using the SPEC SFS 2014 SP1 benchmark software.

The rc file for the SPEC benchmark:

# cat /tmp/sfs.rc 
BENCHMARK=VDA
LOAD=144
INCR_LOAD=24
NUM_RUNS=4
CLIENT_MOUNTPOINTS=c0:/rhgs/client/1nvme-distrep3x2/specsfs c1:/rhgs/client/1nvme-distrep3x2/specsfs c2:/rhgs/client/1nvme-distrep3x2/specsfs c3:/rhgs/client/1nvme-distrep3x2/specsfs c4:/rhgs/client/1nvme-distrep3x2/specsfs c5:/rhgs/client/1nvme-distrep3x2/specsfs c6:/rhgs/client/1nvme-distrep3x2/specsfs c7:/rhgs/client/1nvme-distrep3x2/specsfs c8:/rhgs/client/1nvme-distrep3x2/specsfs c9:/rhgs/client/1nvme-distrep3x2/specsfs c10:/rhgs/client/1nvme-distrep3x2/specsfs c11:/rhgs/client/1nvme-distrep3x2/specsfs
EXEC_PATH=/root/bin/netmist
USER=root
WARMUP_TIME=300
IPV6_ENABLE=0
PRIME_MON_SCRIPT=
PRIME_MON_ARGS=
NETMIST_LOGS=
PASSWORD=
RUNTIME=300
WORKLOAD_FILE=
OPRATE_MULTIPLIER=
CLIENT_MEM=1g
AGGR_CAP=1g
FILE_SIZE=
DIR_COUNT=10
FILES_PER_DIR=100
UNLINK_FILES=0
LATENCY_GRAPH=1
HEARTBEAT_NOTIFICATIONS=1
DISABLE_FSYNCS=0
USE_RSHRCP=0
BYTE_OFFSET=0
MAX_FD=
PIT_SERVER=
PIT_PORT=
LOCAL_ONLY=0
FILE_ACCESS_LIST=0
SHARING_MODE=0
SOCK_DEBUG=0
TRACEDEBUG=0


The SPEC SFS command line:

python /root/bin/SfsManager -r /tmp/sfs.rc -s $(date +%F-%H-%M-%S) -b /root/bin/benchmarks.xml -d ./



Actual results:
Benchmark fails when at least one client reports back a "No such file or directory" error.

<<< Mon Oct 24 07:25:31 2016: Starting VDA run 1 of 4: STREAMS=144 >>

     SPEC SFS2014 Release         $Revision: 1072 $
...
     Test run time = 300 seconds, Warmup = 300 seconds.
     Op latency reporting activated
     Files per directory set to 1
     Directories per proc set to 1
     Using custom file size 1048576 Kbytes
     Running 288 copies of the test on 12 clients
     Results directory: ./
     Clients have a total of 12288 MiBytes of memory
     Clients have 42 MiBytes of memory size per process
     Clients each have 24 processes
     Adjustable aggregate data set value set to 1024 MiBytes
     Each process file size = 1048576 kbytes
     Client data set size       = 270336 MiBytes
     Total starting data set size = 3244032 MiBytes
     Total initial file space   = 3244032 MiBytes
     Total max file space       = 3538944 MiBytes

[INFO][Mon Oct 24 07:25:31 2016]Exec validation successful
...
     Starting tests: Mon Oct 24 07:25:31 2016 

        Launching 288 processes.
...
        Waiting to finish initialization. Mon Oct 24 07:26:12 2016
        Mon Oct 24 07:30:55 2016 Starting INIT phase
...
        Initialization finished: Mon Oct 24 08:48:28 2016
        Testing begins: Mon Oct 24 08:48:28 2016
        Waiting for tests to finish. Mon Oct 24 08:48:34 2016
        Mon Oct 24 08:48:39 2016 Starting WARM phase
        Mon Oct 24 08:48:39 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
Client testing failed: Mon Oct 24 08:48:46 2016
netmist did not complete successfully.
The netmist return code was: 101



Expected results:
All benchmark iterations complete successfully without client errors.



Additional info:
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)

# uname -r
3.10.0-327.28.3.el7.x86_64

# df -h | grep rhgs
/dev/mapper/rhgs_vg-rhgs_lv           55T  520G   55T   1% /rhgs/coldbricks
/dev/mapper/rhgs_hot_vg-rhgs_hot_lv  742G  531G  211G  72% /rhgs/hotbricks

# mount | grep rhgs
/dev/mapper/rhgs_vg-rhgs_lv on /rhgs/coldbricks type xfs (rw,noatime,nodiratime,attr2,inode64,sunit=512,swidth=5120,noquota)
/dev/mapper/rhgs_hot_vg-rhgs_hot_lv on /rhgs/hotbricks type xfs (rw,noatime,nodiratime,attr2,inode64,logbsize=128k,sunit=256,swidth=512,noquota)

# xfs_info /dev/mapper/rhgs_vg-rhgs_lv 
meta-data=/dev/mapper/rhgs_vg-rhgs_lv isize=512    agcount=55, agsize=268435392 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=14645721600, imaxpct=1
         =                       sunit=64     swidth=640 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# xfs_info /dev/mapper/rhgs_hot_vg-rhgs_hot_lv 
meta-data=/dev/mapper/rhgs_hot_vg-rhgs_hot_lv isize=512    agcount=32, agsize=6074208 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=194374656, imaxpct=25
         =                       sunit=32     swidth=64 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=94912, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# tuned-adm active
Current active profile: rhgs-sequential-io

# getenforce
Disabled

# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Comment 2 Dustin Black 2016-10-24 18:40:49 UTC
Possible related BZ 1277088 (partial error message match): https://bugzilla.redhat.com/show_bug.cgi?id=1277088

Comment 4 Nithya Balachandran 2016-10-25 05:17:18 UTC
Dustin,

I see you are running older clients for this test:

> 12 clients connected to the volume with the FUSE native client.

> # glusterfs --version
> glusterfs 3.7.1 built on Sep 21 2015 07:52:17


It is recommended that you use the latest 3.1.3 clients as well. We fixed a lot of tier bugs in 3.1.2 so earlier versions are likely to have issues that are already fixed.

Can you retry the tests with the latest clients and let us know if you still see the issues?

If you do, please attach the logs/sos reports to the BZ.

Comment 5 Dustin Black 2016-10-25 17:24:43 UTC
(In reply to Nithya Balachandran from comment #4)
> Dustin,
> 
> I see you are running older clients for this test:
> 
> > 12 clients connected to the volume with the FUSE native client.
> 
> > # glusterfs --version
> > glusterfs 3.7.1 built on Sep 21 2015 07:52:17
> 
> 
> It is recommended that you use the latest 3.1.3 clients as well. We fixed a
> lot of tier bugs in 3.1.2 so earlier versions are likely to have issues that
> are already fixed.
> 
> Can you retry the tests with the latest clients and let us know if you still
> see the issues?
> 
> If you do, please attach the logs/sos reports to the BZ.

Nithya,

It appears that my client systems are registered to the rhel-7-server repo but not to the rhel-7-server-rh-common repo. The rhel-7-server repo has the glusterfs and glusterfs-fuse packages up to version 3.7.1, so when I did a yum install I believed I had the most updated packages. Looking into the repos, I found that the later packages were apparently moved to the rhel-7-server-rh-common repo.

I am updating my clients now and will re-test.

Comment 6 Annette Clewett 2016-10-29 01:17:05 UTC
Clients were updated to glusterfs 3.7.9 built on Aug 24 2016 02:51:26

Problem was not fixed and looks like similar errors

SPEC SFS error (time in PDT)

        Tue Oct 25 15:31:51 2016 Warm Heartbeat Client 335:   1.01 Ops/sec
        Tue Oct 25 15:32:12 2016 Warm-up 50 percent complete
        Tue Oct 25 15:32:43 2016 Warm-up 60 percent complete
        Tue Oct 25 15:32:53 2016 Warm Heartbeat Client 335:   1.00 Ops/sec
        Tue Oct 25 15:33:13 2016 Warm-up 70 percent complete
        Tue Oct 25 15:33:43 2016 Warm-up 80 percent complete
        Tue Oct 25 15:33:55 2016 Warm Heartbeat Client 335:   1.00 Ops/sec
        Tue Oct 25 15:34:12 2016 Warm-up 90 percent complete
Client testing failed: Tue Oct 25 15:34:13 2016
netmist did not complete successfully.
The netmist return code was: 101



rhclient0 gluserfs log (time in UTC)

[2016-10-25 23:29:12.070138] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-11: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070160] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-10: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070226] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-11: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:29:12.070239] I [MSGID: 114024] [client-helpers.c:101:this_fd_set_ctx] 0-1nvme-distrep3x2-client-10: /specsfs/CL312_VDA1/Dir0/bucket0/ttr0.tises (eefe9fd7-b94e-4a12-9b03-f0c00cc1973e): trying duplicate remote fd set.
[2016-10-25 23:32:42.838486] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-6: remote operation failed [No such file or directory]
[2016-10-25 23:32:42.838782] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-7: remote operation failed [No such file or directory]
[2016-10-25 23:32:42.839359] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-1nvme-distrep3x2-replicate-5: Unreadable subvolume -1 found with event generation 2 for gfid 55d47778-6a22-4233-9fce-e97e4c0da376. (Possible split-brain)
[2016-10-25 23:32:42.840242] E [MSGID: 109040] [dht-helper.c:1189:dht_migration_complete_check_task] 0-1nvme-distrep3x2-hot-dht: (null): failed to lookup the file on 1nvme-distrep3x2-hot-dht [Stale file handle]
[2016-10-25 23:32:42.840763] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-1nvme-distrep3x2-replicate-5: Unreadable subvolume -1 found with event generation 2 for gfid 55d47778-6a22-4233-9fce-e97e4c0da376. (Possible split-brain)
The message "W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-7: remote operation failed [No such file or directory]" repeated 2 times between [2016-10-25 23:32:42.838782] and [2016-10-25 23:32:42.840981]
The message "W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-1nvme-distrep3x2-client-6: remote operation failed [No such file or directory]" repeated 3 times between [2016-10-25 23:32:42.838486] and [2016-10-25 23:32:42.841022]

Comment 7 Annette Clewett 2016-10-29 16:03:09 UTC
Additional failed test results with logs attached

starting test client: 125 Host:  rhclient2    Workload:     VDA2  Workdir: /rhgs/client/1nvme-ec42/specsfs

...

Starting test client: 287 Host: rhclient11    Workload:     VDA2  Workdir: /rhgs/client/1nvme-ec42/specsfs
        Waiting to finish initialization. Sat Oct 29 02:03:07 2016
        Sat Oct 29 02:05:22 2016 Starting INIT phase
        Sat Oct 29 02:05:23 2016 Init Heartbeat __/\_/\__
        Sat Oct 29 02:06:12 2016 Init 20 percent complete
        Sat Oct 29 02:06:13 2016 Init 50 percent complete
        Sat Oct 29 02:06:13 2016 Init 70 percent complete
        Sat Oct 29 02:06:13 2016 Init 100 percent complete
        Initialization finished: Sat Oct 29 02:06:57 2016
        Testing begins: Sat Oct 29 02:06:57 2016
        Waiting for tests to finish. Sat Oct 29 02:07:03 2016
        Sat Oct 29 02:07:08 2016 Starting WARM phase
        Sat Oct 29 02:07:08 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
        Sat Oct 29 02:07:33 2016 Warm-up 10 percent complete
Master: Failure message client 125 Error = 22
Master: Client 125 returned error 22 Possibly: Invalid argument
Client testing failed: Sat Oct 29 02:07:33 2016
Exit with cleanup
Killing 288 clients.

From rhclient2 glusterfs log (attached; UTC)

[2016-10-29 09:07:33.843919] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 20740513: FSTAT() /specsfs/CL125_VDA2/Dir0/bucket2/iee0.nhn => -1 (Invalid argument)
[2016-10-29 09:07:35.598809] W [MSGID: 114031] [client-rpc-fops.c:1848:client3_3_xattrop_cbk] 0-1nvme-ec42-client-0: remote operation failed. Path: /specsfs/CL53_VDA2/Dir0/bucket11/tut0.i (a02df0b8-fbf2-4789-8465-dc260c71e8b2) [No such file or directory]

Comment 8 Annette Clewett 2016-10-29 16:06:28 UTC
Created attachment 1215341 [details]
Error message on screen when SPEC SFS VDA test fails

Comment 9 Annette Clewett 2016-10-29 16:07:06 UTC
Created attachment 1215342 [details]
SFS script used to run testing

Comment 10 Annette Clewett 2016-10-29 16:07:56 UTC
Created attachment 1215343 [details]
SPEC SFS log

Comment 11 Annette Clewett 2016-10-29 16:09:24 UTC
Created attachment 1215344 [details]
Gluser log for rhclient2

This log has the error message for what caused the SPEC SFS VDA test to fail

Comment 12 Annette Clewett 2016-11-04 19:51:23 UTC
New VDA failure with Logs attached

Testing output on screen before, during and after failure (PDT)
 …
        Starting test client: 329 Host:  rhclient8    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 330 Host:  rhclient9    Workload:     VDA1  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 331 Host:  rhclient9    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 332 Host: rhclient10    Workload:     VDA1  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 333 Host: rhclient10    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 334 Host: rhclient11    Workload:     VDA1  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Starting test client: 335 Host: rhclient11    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs
        Waiting to finish initialization. Fri Nov  4 10:10:38 2016
        Fri Nov  4 10:15:27 2016 Starting INIT phase
        Fri Nov  4 10:15:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:16:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:17:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:18:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:18:48 2016 Init 20 percent complete
        Fri Nov  4 10:19:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:20:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:21:43 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:22:11 2016 Init 50 percent complete
        Fri Nov  4 10:22:44 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:23:19 2016 Init 70 percent complete
        Fri Nov  4 10:23:44 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:24:44 2016 Init Heartbeat __/\_/\__
        Fri Nov  4 10:25:35 2016 Init 100 percent complete
        Initialization finished: Fri Nov  4 10:27:10 2016
        Testing begins: Fri Nov  4 10:27:10 2016
        Waiting for tests to finish. Fri Nov  4 10:27:17 2016
        Fri Nov  4 10:27:22 2016 Starting WARM phase
        Fri Nov  4 10:27:22 2016 Warm Heartbeat Client 335:   0.00 Ops/sec
        Fri Nov  4 10:27:45 2016 Warm-up 10 percent complete
        Fri Nov  4 10:28:16 2016 Warm-up 20 percent complete
        Fri Nov  4 10:28:22 2016 Warm Heartbeat Client 335:   1.02 Ops/sec
        Fri Nov  4 10:28:46 2016 Warm-up 30 percent complete
        Fri Nov  4 10:29:16 2016 Warm-up 40 percent complete
        Fri Nov  4 10:29:25 2016 Warm Heartbeat Client 335:   1.00 Ops/sec
        Fri Nov  4 10:29:45 2016 Warm-up 50 percent complete
        Fri Nov  4 10:30:15 2016 Warm-up 60 percent complete
        Fri Nov  4 10:30:27 2016 Warm Heartbeat Client 335:   1.01 Ops/sec
        Fri Nov  4 10:30:45 2016 Warm-up 70 percent complete
        Fri Nov  4 10:31:15 2016 Warm-up 80 percent complete
        Fri Nov  4 10:31:27 2016 Warm Heartbeat Client 335:   1.05 Ops/sec
        Fri Nov  4 10:31:45 2016 Warm-up 90 percent complete
        Fri Nov  4 10:32:17 2016 Warm-up 100 percent complete
        Fri Nov  4 10:32:17 2016 Starting RUN phase
        Fri Nov  4 10:32:29 2016 Run  Heartbeat Client 335:   1.00 Ops/sec
        Fri Nov  4 10:32:46 2016 Run 10 percent complete
        Fri Nov  4 10:33:30 2016 Run  Heartbeat Client 335:   1.03 Ops/sec
        Fri Nov  4 10:33:47 2016 Run 30 percent complete
        Fri Nov  4 10:34:15 2016 Run 40 percent complete
        Fri Nov  4 10:34:32 2016 Run  Heartbeat Client 335:   0.99 Ops/sec
        Fri Nov  4 10:34:47 2016 Run 50 percent complete
Master: Failure message client 329 Error = 22
Master: Client 329 returned error 22 Possibly: Invalid argument
Client testing failed: Fri Nov  4 10:34:59 2016
Exit with cleanup
Killing 336 clients.

…

netmist did not complete successfully.

Comment 13 Annette Clewett 2016-11-04 20:14:24 UTC
Created attachment 1217499 [details]
Gluster client log for rhclient8 part2 (UTC; Nov 4, 2016)

Comment 14 Annette Clewett 2016-11-04 20:16:01 UTC
Comment on attachment 1217499 [details]
Gluster client log for rhclient8 part2 (UTC; Nov 4, 2016)

SPEC SFS error may be related to this line in the attached log

[2016-11-04 17:34:59.224968] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 4506012: FSTAT() /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo => -1 (Invalid argument)

Comment 15 Annette Clewett 2016-11-04 20:17:02 UTC
Created attachment 1217501 [details]
Gluster client log for rhclient8 part1 (UTC; Nov 4, 2016)

Comment 16 Annette Clewett 2016-11-04 20:18:44 UTC
Created attachment 1217502 [details]
SPEC SFS log with timestamps (PDT; Nov 4, 2016)

Comment 17 Annette Clewett 2016-11-04 20:20:22 UTC
Created attachment 1217503 [details]
SPEC SFS setup file (Nov 4, 2016)

Comment 18 Annette Clewett 2016-11-04 20:24:11 UTC
Created attachment 1217504 [details]
Error message on screen when SPEC SFS VDA test fails (PDT; Nov 4, 2016)

Comment 19 Annette Clewett 2016-11-04 20:54:46 UTC
Found Gluster server rhosd4 showing movement between tiers for file ehndeie0.preo. Used this command grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log and found output here: http://pastebin.test.redhat.com/427282. 

Sane file found in rhclient8 gluster log 

[2016-11-04 17:34:59.224968] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 4506012: FSTAT() /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo => -1 (Invalid argument)

Both logs are attached.

Comment 20 Annette Clewett 2016-11-04 20:55:54 UTC
Created attachment 1217508 [details]
Gluster server log for rhosd4 (UTC; Nov 4, 2016)

Comment 21 Annette Clewett 2016-11-04 21:01:22 UTC
Output from server log for file with error.

[root@rhosd4 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
[2016-11-04 17:33:33.364950] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
[2016-11-04 17:33:33.364972] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-2
[2016-11-04 17:33:33.365363] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 17:33:33.365382] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 17:33:33.365412] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier promote: src_subvol 1nvme-distrep3x2-cold-dht file /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 17:33:33.368494] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-2
[2016-11-04 17:33:33.368799] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:33:33.369036] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369414] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret -1 [No such file or directory]
[2016-11-04 17:33:33.369420] D [MSGID: 0] [dht-common.c:2045:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: Entry /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo missing on subvol 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369718] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-3
[2016-11-04 17:33:33.369930] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.369992] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo) from subvol 1nvme-distrep3x2-replicate-4
[2016-11-04 17:33:33.370032] D [MSGID: 0] [dht-common.c:1349:dht_lookup_everywhere_done] 0-1nvme-distrep3x2-hot-dht: There was no cached file and  unlink on hashed is not skipped /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 17:33:33.370147] D [MSGID: 0] [dht-common.c:6496:dht_create_wind_to_avail_subvol] 0-1nvme-distrep3x2-hot-dht: creating /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.370827] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-5
[2016-11-04 17:33:33.371103] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:34:02.393288] D [MSGID: 0] [client-rpc-fops.c:1157:client3_3_getxattr_cbk] 0-1nvme-distrep3x2-client-4: remote operation failed: No data available. Path: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo (3da564c5-22ff-46c4-9d99-16a35141531a). Key: trusted.glusterfs.dht.linkto
[2016-11-04 17:34:02.393510] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-tier-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-cold-dht
[2016-11-04 17:34:02.393531] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-2
[2016-11-04 17:34:02.393811] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0
[2016-11-04 17:34:02.393828] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-tier-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0 [Structure needs cleaning]
[2016-11-04 17:34:02.393847] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo at 1nvme-distrep3x2-replicate-5
[2016-11-04 17:34:02.394356] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo returned with op_ret 0

Comment 22 Annette Clewett 2016-11-07 16:05:24 UTC
File create time on hot tier (PDT)

[root@rhosd0 VDA_failed_110416]# cd /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/
[root@rhosd0 bucket2]# ls -ltr
total 1048576
-rw-r--r-- 2 root root 1073741824 Nov  4 10:19 ehndeie0.preo

Comment 23 Annette Clewett 2016-11-07 16:12:38 UTC
Also see this on rhosd0 after test fails, looks like trying to demote to cold tier and it fails?

[root@rhosd0 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
[2016-11-04 19:00:00.812749] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
[2016-11-04 19:00:00.812790] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-2
[2016-11-04 19:00:00.813132] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 19:00:00.813144] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo with op_ret 0
[2016-11-04 19:00:00.813157] D [MSGID: 0] [dht-common.c:2151:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: Calling lookup on linkto target 1nvme-distrep3x2-hot-dht for path /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 19:00:00.813170] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-replicate-5
[2016-11-04 19:00:00.813714] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.pre  with op_ret 0
[2016-11-04 19:00:00.813754] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier demote: src_subvol 1nvme-distrep3x2-hot-dht file /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
[2016-11-04 19:00:00.814033] D [MSGID: 0] [tier.c:233:tier_check_same_node] 0-1nvme-distrep3x2-tier-dht: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo does not belong to this node

Comment 24 Annette Clewett 2016-11-07 16:14:18 UTC
[root@rhosd0 glusterfs]# stat /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
  File: ‘/rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo’
  Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
Device: fd0ch/64780d    Inode: 136313724   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-04 10:18:43.024091000 -0700
Modify: 2016-11-04 10:19:29.488418000 -0700
Change: 2016-11-04 10:34:02.386627448 -0700
 Birth: -

Comment 25 Annette Clewett 2016-11-08 20:20:01 UTC
New test run with DEBUG level set for clients & servers (PST)

SPEC SFS client 235 on Host rhclient9

Starting test client: 235 Host:  rhclient9    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs

...

        Initialization finished: Tue Nov  8 12:03:04 2016
        Testing begins: Tue Nov  8 12:03:04 2016
        Waiting for tests to finish. Tue Nov  8 12:03:10 2016
        Tue Nov  8 12:03:15 2016 Starting WARM phase
        Tue Nov  8 12:03:15 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
        Tue Nov  8 12:03:38 2016 Warm-up 10 percent complete
        Tue Nov  8 12:04:08 2016 Warm-up 20 percent complete
        Tue Nov  8 12:04:17 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Tue Nov  8 12:04:38 2016 Warm-up 30 percent complete
        Tue Nov  8 12:05:10 2016 Warm-up 40 percent complete
        Tue Nov  8 12:05:19 2016 Warm Heartbeat Client 287:   0.82 Ops/sec
        Tue Nov  8 12:05:39 2016 Warm-up 50 percent complete
        Tue Nov  8 12:06:10 2016 Warm-up 60 percent complete
        Tue Nov  8 12:06:21 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Tue Nov  8 12:06:39 2016 Warm-up 70 percent complete
        Tue Nov  8 12:07:09 2016 Warm-up 80 percent complete
        Tue Nov  8 12:07:23 2016 Warm Heartbeat Client 287:   1.08 Ops/sec
        Tue Nov  8 12:07:38 2016 Warm-up 90 percent complete
        Tue Nov  8 12:08:08 2016 Warm-up 100 percent complete
        Tue Nov  8 12:08:11 2016 Starting RUN phase
        Tue Nov  8 12:08:25 2016 Run  Heartbeat Client 287:   1.09 Ops/sec
        Tue Nov  8 12:08:39 2016 Run 10 percent complete
        Tue Nov  8 12:09:27 2016 Run  Heartbeat Client 287:   1.00 Ops/sec
        Tue Nov  8 12:09:39 2016 Run 30 percent complete
Master: Failure message client 235 Error = 22
Master: Client 235 returned error 22 Possibly: Invalid argument
Client testing failed: Tue Nov  8 12:09:41 2016
Exit with cleanup
Killing 288 clients.

Comment 26 Annette Clewett 2016-11-08 20:39:47 UTC
From rhclient9 glusterfs log (log message in UTC; stat result in PST)

rhgs-client-1nvme-distrep3x2.log:[2016-11-08 20:09:41.691498] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2929172: FSTAT() /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o => -1 (Invalid argument)


[root@rhosd0 bucket6]# stat /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
  File: ‘/rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o’
  Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
Device: fd0ch/64780d    Inode: 742037305   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-08 11:36:50.841120000 -0800
Modify: 2016-11-08 12:06:36.017965000 -0800
Change: 2016-11-08 12:09:32.809863402 -0800
 Birth: -

Comment 27 Annette Clewett 2016-11-08 21:53:41 UTC
Results from 'grep -i "ite0.o" /var/log/glusterfs/bricks/*.log' for servers rhosd4 and rhosd5

rhosd4
/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:05:43.819746] D [MSGID: 101107] [gfdb_sqlite3_helper.c:396:gf_sql_insert_link] 0-sqlite3: Failed executing the prepared stmt 3892dd37-6632-482a-b546-06e1228ce77b e7135fe3-1ce8-4126-b58e-2fbd64a8778f ite0.o INSERT INTO GF_FLINK_TB (GF_ID, GF_PID, FNAME, W_DEL_FLAG, LINK_UPDATE)  VALUES (?, ?, ?, 0, 0); : SQL logic error or missing database

/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:09:32.796670] D [MSGID: 115059] [server-rpc-fops.c:819:server_getxattr_cbk] 0-1nvme-distrep3x2-server: 46452: GETXATTR /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (3892dd37-6632-482a-b546-06e1228ce77b) (trusted.glusterfs.dht.linkto) ==> (No data available) [No data available]

/var/log/glusterfs/bricks/rhgs-hotbricks-1nvme-distrep3x2-hot.log:[2016-11-08 20:08:03.102099] D [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-1nvme-distrep3x2-server: 65485: LOOKUP /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (e7135fe3-1ce8-4126-b58e-2fbd64a8778f/ite0.o) ==> (No such file or directory) [No such file or directory]


rhosd5
/var/log/glusterfs/bricks/rhgs-coldbricks-1nvme-distrep3x2.log:[2016-11-08 20:05:43.842730] D [MSGID: 101107] [gfdb_sqlite3_helper.c:396:gf_sql_insert_link] 0-sqlite3: Failed executing the prepared stmt 3892dd37-6632-482a-b546-06e1228ce77b e7135fe3-1ce8-4126-b58e-2fbd64a8778f ite0.o INSERT INTO GF_FLINK_TB (GF_ID, GF_PID, FNAME, W_DEL_FLAG, LINK_UPDATE)  VALUES (?, ?, ?, 0, 0); : SQL logic error or missing database

/var/log/glusterfs/bricks/rhgs-hotbricks-1nvme-distrep3x2-hot.log:[2016-11-08 20:08:02.184011] D [MSGID: 115050] [server-rpc-fops.c:179:server_lookup_cbk] 0-1nvme-distrep3x2-server: 86885: LOOKUP /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (e7135fe3-1ce8-4126-b58e-2fbd64a8778f/ite0.o) ==> (No such file or directory) [No such file or directory]

Comment 28 Annette Clewett 2016-11-08 21:57:29 UTC
Results from 'grep -i "ite0.o" /var/log/glusterfs/*.log' for servers rhosd4 (no hits for rhosd5)

[2016-11-08 20:08:01.626597] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-cold-dht
[2016-11-08 20:08:01.626641] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-2
[2016-11-08 20:08:01.627280] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-cold-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret 0
[2016-11-08 20:08:01.627324] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-tier-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret 0
[2016-11-08 20:08:01.627397] D [MSGID: 0] [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier promote: src_subvol 1nvme-distrep3x2-cold-dht file /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
[2016-11-08 20:08:01.629635] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-2
[2016-11-08 20:08:01.630021] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:08:01.630284] D [MSGID: 0] [dht-common.c:2445:dht_lookup] 0-1nvme-distrep3x2-hot-dht: Calling fresh lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:02.160871] D [MSGID: 0] [dht-common.c:2032:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: fresh_lookup returned for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o with op_ret -1 [No such file or directory]
[2016-11-08 20:08:02.160878] D [MSGID: 0] [dht-common.c:2045:dht_lookup_cbk] 0-1nvme-distrep3x2-hot-dht: Entry /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o missing on subvol 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:02.161506] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-4
[2016-11-08 20:08:02.445420] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.102272] D [MSGID: 0] [dht-common.c:1623:dht_lookup_everywhere_cbk] 0-1nvme-distrep3x2-hot-dht: returned with op_ret -1 and op_errno 2 (/specsfs/CL235_VDA2/Dir0/bucket6/ite0.o) from subvol 1nvme-distrep3x2-replicate-3
[2016-11-08 20:08:03.102290] D [MSGID: 0] [dht-common.c:1349:dht_lookup_everywhere_done] 0-1nvme-distrep3x2-hot-dht: There was no cached file and  unlink on hashed is not skipped /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o
[2016-11-08 20:08:03.102447] D [MSGID: 0] [dht-common.c:6496:dht_create_wind_to_avail_subvol] 0-1nvme-distrep3x2-hot-dht: creating /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o on 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.183478] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-5
[2016-11-08 20:08:03.185047] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:09:32.796708] D [MSGID: 0] [client-rpc-fops.c:1157:client3_3_getxattr_cbk] 0-1nvme-distrep3x2-client-4: remote operation failed: No data available. Path: /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o (3892dd37-6632-482a-b546-06e1228ce77b). Key: trusted.glusterfs.dht.linkto
[2016-11-08 20:09:32.796906] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-tier-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-cold-dht
[2016-11-08 20:09:32.796928] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-cold-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-2
[2016-11-08 20:09:32.797300] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-cold-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0
[2016-11-08 20:09:32.797326] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-tier-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0 [Structure needs cleaning]
[2016-11-08 20:09:32.797351] D [MSGID: 0] [dht-common.c:2371:dht_lookup] 0-1nvme-distrep3x2-hot-dht: calling revalidate lookup for /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o at 1nvme-distrep3x2-replicate-5
[2016-11-08 20:09:32.797715] D [MSGID: 0] [dht-common.c:791:dht_revalidate_cbk] 0-1nvme-distrep3x2-hot-dht: revalidate lookup of /specsfs/CL235_VDA2/Dir0/bucket6/ite0.o returned with op_ret 0

Comment 29 Annette Clewett 2016-11-09 01:41:35 UTC
All logs can be found at Google drive location here:

https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KN1lMV0h6dWRJbk0

37735901 Nov  8 14:04 rhgs-hotbricks-1nvme-distrep3x2-hot.rhosd4.log.gz
109771660 Nov  8 14:04 rhgs-coldbricks-1nvme-distrep3x2.rhosd4.log.gz
3730596 Nov  8 14:05 1nvme-distrep3x2-tier.rhosd4.log.gz
3125448 Nov  8 14:05 rhgs-client-1nvme-distrep3x2.rhclient9.log.gz

gzip -d <filename.gz> to unzip

Comment 30 Nithya Balachandran 2016-11-10 06:40:07 UTC
(In reply to Annette Clewett from comment #23)
> Also see this on rhosd0 after test fails, looks like trying to demote to
> cold tier and it fails?
> 
> [root@rhosd0 glusterfs]# grep -i "ehndeie0.preo" 1nvme-distrep3x2-tier.log
> [2016-11-04 19:00:00.812749] D [MSGID: 0] [dht-common.c:2445:dht_lookup]
> 0-1nvme-distrep3x2-tier-dht: Calling fresh lookup for
> /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on 1nvme-distrep3x2-cold-dht
> [2016-11-04 19:00:00.812790] D [MSGID: 0] [dht-common.c:2445:dht_lookup]
> 0-1nvme-distrep3x2-cold-dht: Calling fresh lookup for
> /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo on
> 1nvme-distrep3x2-replicate-2

> [tier.c:733:tier_migrate_using_query_file] 0-1nvme-distrep3x2-tier-dht: Tier
> demote: src_subvol 1nvme-distrep3x2-hot-dht file
> /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
> [2016-11-04 19:00:00.814033] D [MSGID: 0] [tier.c:233:tier_check_same_node]
> 0-1nvme-distrep3x2-tier-dht: /specsfs/CL329_VDA2/Dir0/bucket2/ehndeie0.preo
> does not belong to this node


This can be ignored - it just means some other node picked up the file for migration.

Comment 31 Nithya Balachandran 2016-11-10 06:51:47 UTC
(In reply to Annette Clewett from comment #29)
> All logs can be found at Google drive location here:
> 
> https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KN1lMV0h6dWRJbk0
> 
> 37735901 Nov  8 14:04 rhgs-hotbricks-1nvme-distrep3x2-hot.rhosd4.log.gz
> 109771660 Nov  8 14:04 rhgs-coldbricks-1nvme-distrep3x2.rhosd4.log.gz
> 3730596 Nov  8 14:05 1nvme-distrep3x2-tier.rhosd4.log.gz
> 3125448 Nov  8 14:05 rhgs-client-1nvme-distrep3x2.rhclient9.log.gz
> 
> gzip -d <filename.gz> to unzip


We will need the client logs as well.

Thanks,
Nithya

Comment 32 Nithya Balachandran 2016-11-10 11:50:08 UTC
Annette, do you see any Invalid argument errors on any of the server nodes? I'm having trouble connecting to the VPN so I cannot check myself.

Comment 33 Nithya Balachandran 2016-11-10 16:29:41 UTC
Annette,

Can you please upload all the server logs and the client logs for the failed client to google drive.

Thanks,
Nithya

Comment 34 Dan Lambright 2016-11-15 19:45:47 UTC
A file's inode in the hot tier subvolume has a NULL layout. The cached subvolume for the same inode in the parent translator is the hot subvolume. When an FSTAT is performed the I/O is directed to the hot tier, which in turn fails on the NULL layout and fails the I/O.

Each time this has been recreated in the lab the 1G file in question has just completed promotion . A second observation is the hot tier may have been full at the time the file was created.

I've been unable to recreate this on my own or obtain further information from the logs. To solve this problem we need additional info. I have built RPMS containing new log messages. These have to be used to recreate the problem. Contact me when ready to load them.

Comment 35 Nithya Balachandran 2016-11-16 15:18:35 UTC
(In reply to Dan Lambright from comment #34)
> A file's inode in the hot tier subvolume has a NULL layout. The cached
> subvolume for the same inode in the parent translator is the hot subvolume.
> When an FSTAT is performed the I/O is directed to the hot tier, which in
> turn fails on the NULL layout and fails the I/O.
> 
> Each time this has been recreated in the lab the 1G file in question has
> just completed promotion . A second observation is the hot tier may have
> been full at the time the file was created.
> 
> I've been unable to recreate this on my own or obtain further information
> from the logs. To solve this problem we need additional info. I have built
> RPMS containing new log messages. These have to be used to recreate the
> problem. Contact me when ready to load them.

Annette, after deploying the new RPMS and hitting the issue, please get the xattrs set on the parent dir of the file on each brick of the volume (both hot and cold).

For eg; If the error is returned for dir1/dir2/dir3/file1 , please run

getfattr -e hex -d -m . <brickpath>/dir1/dir2/dir3 

on each brick. This will help us validate the error.

Comment 36 Annette Clewett 2016-11-20 02:02:29 UTC
Starting test client: 21 Host: rhclient10    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs

        Waiting to finish initialization. Sat Nov 19 13:05:20 2016
        Sat Nov 19 13:10:39 2016 Starting INIT phase
        Sat Nov 19 13:10:41 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:11:41 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:12:41 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:13:26 2016 Init 20 percent complete
        Sat Nov 19 13:13:41 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:14:42 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:15:42 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:16:42 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:17:04 2016 Init 50 percent complete
        Sat Nov 19 13:17:42 2016 Init Heartbeat __/\_/\__
        Sat Nov 19 13:17:48 2016 Init 70 percent complete
        Sat Nov 19 13:18:33 2016 Init 100 percent complete
        Initialization finished: Sat Nov 19 13:20:27 2016
        Testing begins: Sat Nov 19 13:20:27 2016
        Waiting for tests to finish. Sat Nov 19 13:20:36 2016
        Sat Nov 19 13:20:41 2016 Starting WARM phase
        Sat Nov 19 13:20:41 2016 Warm Heartbeat Client 431:   0.00 Ops/sec
        Sat Nov 19 13:21:06 2016 Warm-up 10 percent complete
        Sat Nov 19 13:21:35 2016 Warm-up 20 percent complete
        Sat Nov 19 13:21:41 2016 Warm Heartbeat Client 431:   0.99 Ops/sec
        Sat Nov 19 13:22:05 2016 Warm-up 30 percent complete
        Sat Nov 19 13:22:35 2016 Warm-up 40 percent complete
        Sat Nov 19 13:22:41 2016 Warm Heartbeat Client 431:   0.99 Ops/sec
        Sat Nov 19 13:23:05 2016 Warm-up 50 percent complete
        Sat Nov 19 13:23:35 2016 Warm-up 60 percent complete
        Sat Nov 19 13:23:41 2016 Warm Heartbeat Client 431:   1.00 Ops/sec
        Sat Nov 19 13:24:05 2016 Warm-up 70 percent complete
        Sat Nov 19 13:24:35 2016 Warm-up 80 percent complete
        Sat Nov 19 13:24:41 2016 Warm Heartbeat Client 431:   1.01 Ops/sec
        Sat Nov 19 13:25:05 2016 Warm-up 90 percent complete
        Sat Nov 19 13:25:35 2016 Warm-up 100 percent complete
        Sat Nov 19 13:25:38 2016 Starting RUN phase
        Sat Nov 19 13:25:44 2016 Run  Heartbeat Client 431:   0.67 Ops/sec
Master: Failure message client 21 Error = 22
Master: Client 21 returned error 22 Possibly: Invalid argument
Client testing failed: Sat Nov 19 13:25:55 2016
Exit with cleanup
Killing 432 clients.

rhclient10 gluster client log - 
[2016-11-19 21:25:55.365726] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 7269657: FSTAT() /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol => -1 (Invalid argument)

NOTE: I cannot find file wrc0.lpol at location /specsfs/CL21_VDA2/Dir0/bucket6/ on /rhgs/coldbricks or rhgs/hotbricks. When I navigate to this directory on the hot tier (only place directory exists) there is nothing in the directory. This is a change from prior failures.

The gluster client log (level DEBUG) for rhclient10 (client with failure) and the gluster tier log (level DEBUG; only tier with reference to filename wrc0.lpol) for rhosd2 are uploaded as well as all of the glusterfs/bricks for the six servers. 

Google Drive here https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KZGNUQjBFQklFRmM.

Comment 37 Annette Clewett 2016-11-20 02:16:16 UTC
Info Nithya requested (file 

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x000000010000000000000000035a00c6

[root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr: Removing leading '/' from absolute path names
# file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
trusted.glusterfs.dht=0x00000001000000000000000055555554
trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff


NOTE: file wrc0.lpol is not in either of these directories

Comment 38 Annette Clewett 2016-11-21 20:11:18 UTC
All info for six servers (rhosd0-rhosd5)for commands below can be found at:
http://pastebin.test.redhat.com/432270.

getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol

getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol

Comment 39 Annette Clewett 2016-11-21 20:17:00 UTC
    ****rhosd0
    [root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
    [root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x00000001000000000000000055555554
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd0 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
     
    ****rhosd1
    [root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
    [root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x00000001000000000000000055555554
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
     
    ****rhosd2
    [root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    trusted.afr.dirty=0x000000000000000000000000
    trusted.bit-rot.version=0x02000000000000005830991a0007460b
    trusted.gfid=0x162424ad9b074a7998752a5811e90223
     
    [root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd2 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    trusted.afr.dirty=0x000000000000000000000000
    trusted.bit-rot.version=0x02000000000000005830977b000e5869
    trusted.gfid=0x162424ad9b074a7998752a5811e90223
    trusted.tier.tier-dht.linkto=0x316e766d652d646973747265703378322d686f742d64687400
     
     
    ****rhosd3
    [root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    trusted.afr.dirty=0x000000000000000000000000
    trusted.bit-rot.version=0x02000000000000005830991a00078ba5
    trusted.gfid=0x162424ad9b074a7998752a5811e90223
     
    [root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd3 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    trusted.afr.dirty=0x000000000000000000000000
    trusted.bit-rot.version=0x02000000000000005830977b000e8748
    trusted.gfid=0x162424ad9b074a7998752a5811e90223
    trusted.tier.tier-dht.linkto=0x316e766d652d646973747265703378322d686f742d64687400
     
     
    ****rhosd4
    [root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x00000001000000000000000055555554
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
    [root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd4 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
     
    ****rhosd5
    [root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x00000001000000000000000055555554
    trusted.tier.tier-dht=0x000000010000000000000000035a00c6
     
    [root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/hotbricks/1nvme-distrep3x2-hot/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory
     
    [root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    getfattr: Removing leading '/' from absolute path names
    # file: rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/
    trusted.gfid=0xc077a3c447bf48b682d43d98f0f19134
    trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
    trusted.tier.tier-dht=0x0000000100000000035a00c7ffffffff
     
    [root@rhosd5 ~]# getfattr -e hex -d -m . /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
    getfattr: /rhgs/coldbricks/1nvme-distrep3x2/specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol: No such file or directory

Comment 40 Annette Clewett 2016-11-21 21:05:01 UTC
Uploaded new logs from server rhosd2 & rhosd3, glustershd.log. These two servers are in a replica pair where the file was first written to their cold tier. The message below from glustershd.log for rhosd2 seems to imply a problem - keyword = mismatch.

Is it possible this file was not created correctly on the cold tier of one of the two servers in the replica pair? This happens way before the failure after the file promotion to the hot tier (on same replica pair it looks like) is done and the attempt to read the file crashes the SPEC SFS test. 

log messages from rhclient10 log when file is created on cold tier on rhosd2 & rhosd3
[2016-11-19 19:34:30.963098] D [MSGID: 0] [dht-common.c:1537:dht_lookup_everywhere_done] 2-1nvme-distrep3x2-tier-dht: There was no cached file and  unlink on hashed is not skipped /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol
[2016-11-19 19:34:30.963313] D [MSGID: 0] [tier-common.c:437:tier_create] 2-1nvme-distrep3x2-tier-dht: creating /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol on 1nvme-distrep3x2-cold-dht
[2016-11-19 19:34:30.963350] D [MSGID: 0] [dht-common.c:6684:dht_create_wind_to_avail_subvol] 2-1nvme-distrep3x2-cold-dht: creating /specsfs/CL21_VDA2/Dir0/bucket6/wrc0.lpol on 1nvme-distrep3x2-replicate-1

log messages from rhosd2 glustershd.log (search for file gfid 162424ad-9b07-4a79-9875-2a5811e90223)
[2016-11-19 19:36:31.152637] D [MSGID: 0] [afr-self-heald.c:409:afr_shd_index_heal] 0-1nvme-distrep3x2-replicate-1: got entry: 162424ad-9b07-4a79-9875-2a5811e90223
[2016-11-19 19:36:31.160009] D [MSGID: 0] [afr-self-heal-common.c:1396:afr_selfheal_unlocked_inspect] 0-1nvme-distrep3x2-replicate-1: SIZE mismatch 820379648 vs 821559296 on 1nvme-distrep3x2-client-3 for gfid:162424ad-9b07-4a79-9875-2a5811e90223

log messages from rhosd3 glustershd.log (search for file gfid 162424ad-9b07-4a79-9875-2a5811e90223)
[2016-11-19 19:36:16.027296] D [MSGID: 0] [afr-self-heald.c:409:afr_shd_index_heal] 0-1nvme-distrep3x2-replicate-1: got entry: 162424ad-9b07-4a79-9875-2a5811e90223

Comment 41 Annette Clewett 2016-11-23 03:57:27 UTC
Failure for new RPMs created today, Nov 22nd

Starting test client: 281 Host:  rhclient8    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs

        Initialization finished: Tue Nov 22 19:39:21 2016
        Testing begins: Tue Nov 22 19:39:21 2016
        Waiting for tests to finish. Tue Nov 22 19:39:27 2016
        Tue Nov 22 19:39:32 2016 Starting WARM phase
        Tue Nov 22 19:39:32 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
        Tue Nov 22 19:39:56 2016 Warm-up 10 percent complete
        Tue Nov 22 19:40:26 2016 Warm-up 20 percent complete
        Tue Nov 22 19:40:33 2016 Warm Heartbeat Client 287:   0.94 Ops/sec
        Tue Nov 22 19:40:56 2016 Warm-up 30 percent complete
        Tue Nov 22 19:41:25 2016 Warm-up 40 percent complete
        Tue Nov 22 19:41:34 2016 Warm Heartbeat Client 287:   1.02 Ops/sec
        Tue Nov 22 19:41:56 2016 Warm-up 50 percent complete
        Tue Nov 22 19:42:44 2016 Warm Heartbeat Client 287:   0.22 Ops/sec
        Tue Nov 22 19:42:56 2016 Warm-up 70 percent complete
        Tue Nov 22 19:43:25 2016 Warm-up 80 percent complete
        Tue Nov 22 19:43:46 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Tue Nov 22 19:43:57 2016 Warm-up 90 percent complete
        Tue Nov 22 19:44:25 2016 Warm-up 100 percent complete
        Tue Nov 22 19:44:27 2016 Starting RUN phase
        Tue Nov 22 19:44:46 2016 Run  Heartbeat Client 287:   0.91 Ops/sec
        Tue Nov 22 19:44:55 2016 Run 10 percent complete
        Tue Nov 22 19:45:25 2016 Run 20 percent complete
        Tue Nov 22 19:45:46 2016 Run  Heartbeat Client 287:   0.77 Ops/sec
        Tue Nov 22 19:45:56 2016 Run 30 percent complete
        Tue Nov 22 19:46:26 2016 Run 40 percent complete
        Tue Nov 22 19:46:46 2016 Run  Heartbeat Client 287:   1.08 Ops/sec
        Tue Nov 22 19:46:55 2016 Run 50 percent complete
Master: Failure message client 281 Error = 9
Master: Client 281 returned error 9 Possibly: Bad file descriptor
Client testing failed: Tue Nov 22 19:47:06 2016
Exit with cleanup
Killing 288 clients.


gluster client log for rhclient8

[2016-11-23 03:47:06.969953] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 2969402: FLUSH() ERR => -1 (Bad file descriptor)

Client log for rhclient8 uploaded to https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KYnRXVG5rVnNnTHc

Comment 42 Annette Clewett 2016-11-24 02:15:38 UTC
Starting test client: 67 Host:  rhclient9    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs

        Initialization finished: Wed Nov 23 12:42:52 2016
        Testing begins: Wed Nov 23 12:42:52 2016
        Waiting for tests to finish. Wed Nov 23 12:42:58 2016
        Wed Nov 23 12:43:03 2016 Starting WARM phase
        Wed Nov 23 12:43:03 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
        Wed Nov 23 12:43:26 2016 Warm-up 10 percent complete
        Wed Nov 23 12:43:56 2016 Warm-up 20 percent complete
        Wed Nov 23 12:44:04 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Wed Nov 23 12:44:26 2016 Warm-up 30 percent complete
        Wed Nov 23 12:44:56 2016 Warm-up 40 percent complete
        Wed Nov 23 12:45:05 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Wed Nov 23 12:45:26 2016 Warm-up 50 percent complete
        Wed Nov 23 12:45:56 2016 Warm-up 60 percent complete
        Wed Nov 23 12:46:11 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Wed Nov 23 12:46:28 2016 Warm-up 70 percent complete
        Wed Nov 23 12:46:57 2016 Warm-up 80 percent complete
        Wed Nov 23 12:47:18 2016 Warm Heartbeat Client 287:   1.01 Ops/sec
        Wed Nov 23 12:47:28 2016 Warm-up 90 percent complete
        Wed Nov 23 12:47:56 2016 Warm-up 100 percent complete
        Wed Nov 23 12:47:59 2016 Starting RUN phase
        Wed Nov 23 12:48:20 2016 Run  Heartbeat Client 287:   0.78 Ops/sec
        Wed Nov 23 12:48:29 2016 Run 10 percent complete
        Wed Nov 23 12:48:59 2016 Run 20 percent complete
        Wed Nov 23 12:49:20 2016 Run  Heartbeat Client 287:   1.04 Ops/sec
        Wed Nov 23 12:49:28 2016 Run 30 percent complete
Master: Failure message client 67 Error = 9
Master: Client 67 returned error 9 Possibly: Bad file descriptor
Client testing failed: Wed Nov 23 12:49:29 2016
Exit with cleanup
Killing 288 clients.

[2016-11-23 20:49:27.738179] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2925467: FSTAT() /specsfs/CL67_VDA2/Dir0/bucket6/astr0.d => -1 (Bad file descriptor)

[2016-11-23 20:49:29.501843] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 2926826: FLUSH() ERR => -1 (Bad file descriptor)

Client log for client9 uploaded to https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KVDBoaE15OUxZMnc

Comment 43 Annette Clewett 2016-12-02 17:29:07 UTC
We ran a new test today on glusterfs-3.7.9-12.el7rhgs.x86_64 bits using machines in the Supermicro lab where we limited the hot tier to 1x2 in order to eliminate DHT completely from any cold>hot promotion activity (note that the cold tier remained 3x2 on RAID 6 spinners). Info and failure condition, client error below. 

[root@rhosd0 ~]# gluster vol info
 
Volume Name: 1nvme-distrep3x2
Type: Tier
Volume ID: 20f9b6f0-30a0-42fb-9ff4-aae0e04abc16
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick2: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick3: n0:/rhgs/coldbricks/1nvme-distrep3x2
Brick4: n1:/rhgs/coldbricks/1nvme-distrep3x2
Brick5: n2:/rhgs/coldbricks/1nvme-distrep3x2
Brick6: n3:/rhgs/coldbricks/1nvme-distrep3x2
Brick7: n4:/rhgs/coldbricks/1nvme-distrep3x2
Brick8: n5:/rhgs/coldbricks/1nvme-distrep3x2
Options Reconfigured:
diagnostics.client-log-level: DEBUG
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

***
Starting test client: 45 Host: rhclient10    Workload:     VDA2  Workdir: /rhgs/client/1nvme-distrep3x2/specsfs

        Initialization finished: Fri Dec  2 08:05:52 2016
        Testing begins: Fri Dec  2 08:05:52 2016
        Waiting for tests to finish. Fri Dec  2 08:05:58 2016
        Fri Dec  2 08:06:03 2016 Starting WARM phase
        Fri Dec  2 08:06:04 2016 Warm Heartbeat Client 287:   0.00 Ops/sec
        Fri Dec  2 08:06:27 2016 Warm-up 10 percent complete
        Fri Dec  2 08:06:57 2016 Warm-up 20 percent complete
        Fri Dec  2 08:07:27 2016 Warm-up 30 percent complete
        Fri Dec  2 08:07:57 2016 Warm-up 40 percent complete
        Fri Dec  2 08:08:03 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Fri Dec  2 08:08:26 2016 Warm-up 50 percent complete
        Fri Dec  2 08:08:56 2016 Warm-up 60 percent complete
        Fri Dec  2 08:09:05 2016 Warm Heartbeat Client 287:   1.00 Ops/sec
        Fri Dec  2 08:09:26 2016 Warm-up 70 percent complete
        Fri Dec  2 08:09:58 2016 Warm-up 80 percent complete
        Fri Dec  2 08:10:07 2016 Warm Heartbeat Client 287:   0.98 Ops/sec
        Fri Dec  2 08:10:58 2016 Warm-up 100 percent complete
        Fri Dec  2 08:11:02 2016 Starting RUN phase
        Fri Dec  2 08:11:08 2016 Run  Heartbeat Client 287:   0.56 Ops/sec
        Fri Dec  2 08:11:32 2016 Run 10 percent complete
        Fri Dec  2 08:12:09 2016 Run  Heartbeat Client 287:   1.00 Ops/sec
        Fri Dec  2 08:12:30 2016 Run 30 percent complete
Master: Failure message client 45 Error = 22
Master: Client 45 returned error 22 Possibly: Invalid argument
Client testing failed: Fri Dec  2 08:12:57 2016
Exit with cleanup
Killing 288 clients.

Message from rhclient10 glusterfs client log for SPEC SFS client 45 that corresponds with failure:

[2016-12-02 16:12:57.047260] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 8521925: FSTAT() /specsfs/CL45_VDA2/Dir0/bucket6/irg0.oer => -1 (Invalid argument)

Comment 44 Dan Lambright 2016-12-03 15:19:29 UTC
I am running the SPECFS workload in the BAGL lab using four servers and four clients against RHGS 3.7.9. 

The hot tier is 2x2. The slow tier is 2 x (8 + 4). 

I have not seen the problem reproduce after 12 hours runtime. I will attempt other things, e.g shrink the hot tier in order to provoke more demotion activity. 

BENCHMARK=VDA
LOAD=10 
INCR_LOAD=32
NUM_RUNS=10
CLIENT_MOUNTPOINTS="gprfc065.sbu.lab.eng.bos.redhat.com:/mnt gprfc066.sbu.lab.eng.bos.redhat.com:/mnt gprfc067.sbu.lab.eng.bos.redhat.com:/mnt gprfc068.sbu.lab.eng.bos.redhat.com:/mnt"
EXEC_PATH=/root/specsfs2014/binaries/linux/x64/netmist
USER=root

Comment 50 Annette Clewett 2016-12-09 20:34:59 UTC
Testing results for new RPMs with Nithya fix, dec9rpms-fix installed only on clients.

timezone = PST
Starting test client: 85 Host:  rhclient6    Workload:     VDA2  Workdir: /rhgs/client/12xec42/specsfs

	Testing begins: Fri Dec  9 12:08:02 2016
	Waiting for tests to finish. Fri Dec  9 12:08:07 2016
	Fri Dec  9 12:08:12 2016 Starting WARM phase
	Fri Dec  9 12:08:12 2016 Warm Heartbeat Client 239:   0.00 Ops/sec
	Fri Dec  9 12:08:35 2016 Warm-up 10 percent complete
	Fri Dec  9 12:09:06 2016 Warm-up 20 percent complete
	Fri Dec  9 12:09:12 2016 Warm Heartbeat Client 239:   0.99 Ops/sec
	Fri Dec  9 12:09:35 2016 Warm-up 30 percent complete
	Fri Dec  9 12:10:05 2016 Warm-up 40 percent complete
	Fri Dec  9 12:10:15 2016 Warm Heartbeat Client 239:   1.02 Ops/sec
	Fri Dec  9 12:10:37 2016 Warm-up 50 percent complete
	Fri Dec  9 12:11:07 2016 Warm-up 60 percent complete
	Fri Dec  9 12:11:16 2016 Warm Heartbeat Client 239:   2.07 Ops/sec
	Fri Dec  9 12:11:36 2016 Warm-up 70 percent complete
	Fri Dec  9 12:12:07 2016 Warm-up 80 percent complete
	Fri Dec  9 12:12:16 2016 Warm Heartbeat Client 239:   1.00 Ops/sec
	Fri Dec  9 12:12:36 2016 Warm-up 90 percent complete
	Fri Dec  9 12:13:06 2016 Warm-up 100 percent complete
	Fri Dec  9 12:13:09 2016 Starting RUN phase
	Fri Dec  9 12:13:18 2016 Run  Heartbeat Client 239:   1.01 Ops/sec
	Fri Dec  9 12:13:39 2016 Run 10 percent complete
	Fri Dec  9 12:14:08 2016 Run 20 percent complete
	Fri Dec  9 12:14:20 2016 Run  Heartbeat Client 239:   1.20 Ops/sec
	Fri Dec  9 12:14:37 2016 Run 30 percent complete
	Fri Dec  9 12:15:07 2016 Run 40 percent complete
	Fri Dec  9 12:15:22 2016 Run  Heartbeat Client 239:   1.27 Ops/sec
	Fri Dec  9 12:15:37 2016 Run 50 percent complete
	Fri Dec  9 12:16:07 2016 Run 60 percent complete
Master: Failure message client 85 Error = 9
Master: Client 85 returned error 9 Possibly: Bad file descriptor
Client testing failed: Fri Dec  9 12:16:10 2016
Exit with cleanup
Killing 240 clients.

rhclient6 glusterfs log message for failure (UTC):
[2016-12-09 20:16:08.832599] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 796921: FSTAT() /specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an => -1 (Bad file descriptor)

rhclient6 glusterfs log found here - 
https://drive.google.com/drive/u/1/folders/0B3DXjvLEit_KbE5iS1lQTHFqcTg

Comment 51 Annette Clewett 2016-12-12 16:13:22 UTC
Can you get us the gfid of this file:
/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an

This is  available by running getfattr -m . -d - e hex on the brick on which the file exists.

Thanks,
Nithya

I could not find the file on the cold tier. Below is info for file on two brick in the hot tier on two servers in the same replica pair. 

*** rhosd0 (1st server in replica pair)
[[root@rhosd0 bucket6]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
trusted.gfid=0x2ce4c538a8234d9d9fa1d037c9f66e16
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000fde38c57ffffffff

[root@rhosd0 bucket6]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an 
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000584aff09000c008c
trusted.gfid=0x02eefce100ea4917be0634dc70d3618a

*** rhosd1 (2nd server in replica pair)
[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6
trusted.gfid=0x2ce4c538a8234d9d9fa1d037c9f66e16
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
trusted.tier.tier-dht=0x0000000100000000fde38c57ffffffff

[root@rhosd1 ~]# getfattr -e hex -d -m . /rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an 
getfattr: Removing leading '/' from absolute path names
# file: rhgs/hotbricks/1nvme-ec42-hot/specsfs/CL85_VDA2/Dir0/bucket6/oitb0.an
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000584aff0c0002b9b4
trusted.gfid=0x02eefce100ea4917be0634dc70d3618a

Comment 52 Annette Clewett 2016-12-14 02:19:19 UTC
Test failed with dec12rpms after running 12 or 13 benchmark values (120-552, increment by 36). Failed on load=552. 

Starting test client: 835 Host:  rhclient9    Workload:     VDA2  Workdir: /rhgs/client/12xec42/specsfs

        Testing begins: Tue Dec 13 17:32:36 2016
        Waiting for tests to finish. Tue Dec 13 17:32:59 2016
        Tue Dec 13 17:33:03 2016 Starting WARM phase
        Tue Dec 13 17:33:04 2016 Warm Heartbeat Client 1103:   0.00 Ops/sec
        Tue Dec 13 17:33:29 2016 Warm-up 10 percent complete
        Tue Dec 13 17:33:59 2016 Warm-up 20 percent complete
        Tue Dec 13 17:34:05 2016 Warm Heartbeat Client 1103:   1.05 Ops/sec
        Tue Dec 13 17:34:29 2016 Warm-up 30 percent complete
        Tue Dec 13 17:34:58 2016 Warm-up 40 percent complete
        Tue Dec 13 17:35:07 2016 Warm Heartbeat Client 1103:   1.03 Ops/sec
        Tue Dec 13 17:35:28 2016 Warm-up 50 percent complete
        Tue Dec 13 17:35:57 2016 Warm-up 60 percent complete
        Tue Dec 13 17:36:07 2016 Warm Heartbeat Client 1103:   0.94 Ops/sec
        Tue Dec 13 17:36:28 2016 Warm-up 70 percent complete
        Tue Dec 13 17:36:57 2016 Warm-up 80 percent complete
        Tue Dec 13 17:37:09 2016 Warm Heartbeat Client 1103:   1.03 Ops/sec
        Tue Dec 13 17:37:26 2016 Warm-up 90 percent complete
        Tue Dec 13 17:37:56 2016 Warm-up 100 percent complete
        Tue Dec 13 17:38:00 2016 Starting RUN phase
        Tue Dec 13 17:38:11 2016 Run  Heartbeat Client 1103:   1.08 Ops/sec
        Tue Dec 13 17:38:58 2016 Run 20 percent complete
        Tue Dec 13 17:39:13 2016 Run  Heartbeat Client 1103:   1.06 Ops/sec
        Tue Dec 13 17:39:28 2016 Run 30 percent complete
        Tue Dec 13 17:39:59 2016 Run 40 percent complete
        Tue Dec 13 17:40:14 2016 Run  Heartbeat Client 1103:   0.98 Ops/sec
        Tue Dec 13 17:40:28 2016 Run 50 percent complete
Master: Failure message client 835 Error = 107
Master: Client 835 returned error 107 Possibly: Transport endpoint is not connected
Client testing failed: Tue Dec 13 17:40:57 2016
Exit with cleanup
Killing 1104 clients.

gluster log for rhclient9 can be found at Google Drive location:
https://drive.google.com/drive/folders/0B3DXjvLEit_KTHNwbXdoOTZ0WU0

Comment 53 Annette Clewett 2016-12-14 02:24:14 UTC
Error message in glusterfs rhclient9 log (UTC):

[2016-12-14 01:40:57.107984] D [MSGID: 0] [client-rpc-fops.c:2982:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f25405c85f0, 12xec42-client-43 returned -1 error: Stale file handle [Stale file handle]
[2016-12-14 01:40:57.108006] D [MSGID: 0] [afr-lk-common.c:448:transaction_lk_op] 0-12xec42-replicate-2: lk op is for a transaction
[2016-12-14 01:40:57.108017] D [MSGID: 0] [defaults.c:844:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f254068c5b8, 12xec42-disperse-7 returned -1 error: Stale file handle [Stale file handle]

Comment 64 Prasad Desala 2018-08-24 04:55:05 UTC
On glusterfs version 3.12.2-16.el7rhgs, ran SPEC SFS 2014 VDA workload and below test steps and this issue is not reproduced.

Steps:
1) A file exists on subol-1 and has an fd opened on it 
2) Perform a write or some such op on the file from one terminal
3) The file is migrated to another subvol
4) From the same mount point but in a different terminal perform a file listing on the parent directory which updates the cached subvolume for the file without opening the fd on the new subvol.
5) Write to the file again from the first terminal with the same fd. 
Result: Writes on the file are successful.

Hence, moving this BZ to Verified.

Comment 66 errata-xmlrpc 2018-09-04 06:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607


Note You need to log in before you can comment on or make changes to this bug.