Description of problem: ----------------------- 4 node cluster,1 8*(4+2) volume mounted via gNFS on 4 clients.. I ran smallfile creates in a distributed multithreaded way from 4 clients and 1 client was doing tarball untar as well.. There are close to 20 lakh files in the mount,and that was the intent. Almost 8 hours into my workload,smallfiles got errored : <snip> host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 1438.430105,files = 30000,records = 30000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 1439.040532,files = 29800,records = 29800,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 1437.987876,files = 29900,records = 29900,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 1438.879048,files = 29800,records = 29800,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 1437.929699,files = 29900,records = 29900,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 1438.774666,files = 29900,records = 29900,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 1438.141809,files = 29800,records = 29800,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 1438.514793,files = 29800,records = 29800,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 1452.684095,files = 26700,records = 26700,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 1454.429681,files = 26600,records = 26600,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 1452.043060,files = 26600,records = 26600,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 1453.607356,files = 26700,records = 26700,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 1451.225952,files = 26600,records = 26600,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 1455.957148,files = 26700,records = 26700,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 1450.597837,files = 26600,records = 26600,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 1452.452431,files = 26800,records = 26800,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 1459.007853,files = 27000,records = 27000,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 1457.485515,files = 26900,records = 26900,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 1459.088099,files = 27000,records = 27000,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 1458.640241,files = 26900,records = 26900,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 1459.493499,files = 27100,records = 27100,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 1458.090418,files = 27000,records = 27000,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 1458.218666,files = 27000,records = 27000,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 1457.411976,files = 26900,records = 26900,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 1457.061055,files = 27500,records = 27500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 1458.605405,files = 27500,records = 27500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = -1485449432.368142,files = 0,records = 0,status = ERR: Input/output error host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 1458.712247,files = 27500,records = 27500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 1454.773907,files = 27400,records = 27400,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 1457.539779,files = 27400,records = 27400,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 1458.635387,files = 27500,records = 27500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = -1485449431.107796,files = 0,records = 0,status = ERR: Input/output error total threads = 32 total files = 832800 total data = 50.830 GB 86.75% of requested files processed, minimum is 70.00 at least one thread encountered error, test may be incomplete [root@gqac007 ~]# </snip> We can see ethat IO errored on one of the clients : gqac019..It had the volume mounted via gqas008 : mount[root@gqac019 ~]# mount|grep butcher gqas008.sbu.lab.eng.bos.redhat.com:/butcher on /gluster-mount type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.97.143,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=192.168.97.143) [root@gqac019 ~]# I touched a file after the EIO,and it worked fine on the same client,as well on a diff client mounted from the same server. Version-Release number of selected component (if applicable): ------------------------------------------------------------- glusterfs-server-3.8.4-13.el7rhgs.x86_64 How reproducible: ----------------- 1/1 Actual results: --------------- EIO on 1 of the clients. Expected results: ----------------- Clean smallfile run. Additional info: --------------- [root@gqas013 glusterfs]# gluster v status Status of volume: butcher Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks1/brick 49152 0 Y 24542 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks1/brick 49152 0 Y 24752 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks1/brick 49152 0 Y 24308 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks1/brick 49152 0 Y 24237 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks2/brick 49153 0 Y 24562 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks2/brick 49153 0 Y 24772 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks3/brick 49154 0 Y 24582 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks3/brick 49154 0 Y 24792 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks3/brick 49153 0 Y 24328 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks3/brick 49153 0 Y 24257 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks2/brick 49154 0 Y 24348 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks2/brick 49154 0 Y 24277 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks5/brick 49155 0 Y 24602 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks5/brick 49155 0 Y 24812 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks5/brick 49155 0 Y 24368 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks5/brick 49155 0 Y 24297 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks4/brick 49156 0 Y 24622 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks4/brick 49156 0 Y 24832 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks6/brick 49157 0 Y 24642 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks6/brick 49157 0 Y 24852 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks6/brick 49156 0 Y 24388 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks6/brick 49156 0 Y 24317 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks4/brick 49157 0 Y 24408 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks4/brick 49157 0 Y 24337 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks7/brick 49158 0 Y 24662 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks7/brick 49158 0 Y 24872 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks7/brick 49158 0 Y 24428 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks7/brick 49158 0 Y 24357 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks8/brick 49159 0 Y 24682 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks8/brick 49159 0 Y 24892 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks9/brick 49160 0 Y 24702 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks9/brick 49160 0 Y 24912 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks9/brick 49159 0 Y 24448 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks9/brick 49159 0 Y 24377 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks8/brick 49160 0 Y 24468 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks8/brick 49160 0 Y 24397 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks10/brick 49161 0 Y 24722 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks10/brick 49161 0 Y 24932 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks10/brick 49161 0 Y 24488 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks10/brick 49161 0 Y 24417 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks11/brick 49162 0 Y 24742 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks11/brick 49162 0 Y 24952 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks12/brick 49163 0 Y 24762 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks12/brick 49163 0 Y 24972 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks12/brick 49162 0 Y 24508 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks12/brick 49162 0 Y 24437 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks11/brick 49163 0 Y 24528 Brick gqas008.sbu.lab.eng.bos.redhat.com:/b ricks11/brick 49163 0 Y 24459 NFS Server on localhost 2049 0 Y 24785 Self-heal Daemon on localhost N/A N/A Y 24791 NFS Server on gqas005.sbu.lab.eng.bos.redha t.com 2049 0 Y 24995 Self-heal Daemon on gqas005.sbu.lab.eng.bos .redhat.com N/A N/A Y 25000 NFS Server on gqas008.sbu.lab.eng.bos.redha t.com 2049 0 Y 24482 Self-heal Daemon on gqas008.sbu.lab.eng.bos .redhat.com N/A N/A Y 24487 NFS Server on gqas006.sbu.lab.eng.bos.redha t.com 2049 0 Y 24551 Self-heal Daemon on gqas006.sbu.lab.eng.bos .redhat.com N/A N/A Y 24556 Task Status of Volume butcher ------------------------------------------------------------------------------ Task : Rebalance ID : 391f7e6a-099d-414e-97ec-4287d2e2dcb3 Status : completed [root@gqas013 glusterfs]#
I see these in nfs.log ,may be related?: nfs.log:[2017-01-26 17:12:35.805431] W [MSGID: 112199] [nfs3-helpers.c:3470:nfs3_log_read_res] 0-nfs-nfsv3: /s/linux-4.9.5.tar.xz => (XID: 47792d38, READ: NFS: 5(I/O error), POSIX: 107(Transport endpoint is not connected)), count: 4294967295, is_eof: 0 [Permission denied] nfs.log:[2017-01-26 17:12:35.806231] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: <gfid:92431415-e688-4e51-a885-e172360b63d0>/d_005/_07_24524_ => (XID: 57792d38, CREATE: NFS: 5(I/O error), POSIX: 107(Transport endpoint is not connected)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 nfs.log:[2017-01-26 17:12:35.806848] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: <gfid:3ba55ffa-b4fa-48e9-8546-412b0e669632>/d_005/_02_24513_ => (XID: 59792d38, CREATE: NFS: 5(I/O error), POSIX: 107(Transport endpoint is not connected)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Operation now in progress]
Setup is in the same state for NFS ,rpc and posix guys to take a look,if needed..(just because EIO is pretty common on EC over gNFS).
On the client-7 brick logs - volume butcher-client-7 type protocol/client option send-gids true option transport.address-family inet option transport-type tcp option remote-subvolume /bricks3/brick option remote-host gqas005.sbu.lab.eng.bos.redhat.com option ping-timeout 42 end-volume [2017-01-26 17:12:35.781414] W [socket.c:590:__socket_rwv] 0-tcp.butcher-server: writev on 192.168.97.143:954 failed (Connection reset by peer) [2017-01-26 17:12:35.781497] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-butcher-server: disconnecting connection from gqas008.sbu.lab.eng.bo-2017/01/26-16:43:42:872883-butcher-client-7-0-0 Looks like the connection is reset by client itself (i.e, nfs server on gqas008) [2017-01-26 17:12:35.785225] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-butcher-server: Shutting down connection gqas008.sbu.lab.eng.bos.redhat.com-24479-2017/01/26-16:43:42:872883-butcher-client-7-0-0 [2017-01-26 17:12:46.739533] I [login.c:76:gf_auth] 0-auth/login: allowed user names: a968253b-7104-49f7-8de3-9da9fab46670 [2017-01-26 17:12:46.739584] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-butcher-server: accepted client from gqas008.sbu.lab.eng.bos.redhat.com-24479-2017/01/26-16:43:42:872883-butcher-client-7-0-1 (version: 3.8.4) And the connection re-established after ~11sec Also I am targetting particular file/subvols for easy debugging. Similar errors are observed for other files/subvols as well.
Soumya, If we talk about ec subvolume with 6 bricks (4+2), getting ENOTCONN from 2 bricks should not be an issue if all the other 4 bricks are good. Did we kill any brick? If we only have 4 bricks up then this is a corner case and any additional bad bricks could cause EIO.
Thanks Ashish. AFAIK, the bricks were all up and running. Ambarish could confirm. Because of XDR decode failure, connection got reset for 2 bricks.
That's right Ashish/Soumya. Bricks were never killed on purpose. v status showed everything online.
(In reply to Soumya Koduri from comment #10) > Thanks Ashish. AFAIK, the bricks were all up and running. Ambarish could > confirm. Because of XDR decode failure, connection got reset for 2 bricks. Failure of XDR encoding can point to out of memory. From the nfs.log on gqas008 the XDR decoding failed on a cache-invalidation request. In case you can easily reproduce this, you could disable cache-invalidation on the volume and see if it still happens. All of the small files will cause *many* upcalls for cache-invalidation, and in case there is a memory leak there, things will go wrong quickly. From the brick log, it is obvious that upcalls are enabled: 78: volume butcher-upcall 79: type features/upcall 80: option cache-invalidation on 81: option cache-invalidation-timeout 600 82: subvolumes butcher-leases 83: end-volume
Niels/Soumya, I could reproduce the IO error post setting cache-invalidation to "off " : <snip> starting all threads by creating starting gate file /gluster-mount/network_shared/starting_gate.tmp host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 2424.873896,files = 49800,records = 49800,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 2424.687336,files = 50000,records = 50000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 2427.134541,files = 50000,records = 50000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 2425.296637,files = 50000,records = 50000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 2426.700475,files = 50000,records = 50000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 2425.863660,files = 50000,records = 50000,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 2429.188541,files = 49800,records = 49800,status = ok host = gqac007.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 2425.738964,files = 49900,records = 49900,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 2437.387030,files = 44000,records = 44000,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 2435.214585,files = 43900,records = 43900,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 2428.436052,files = 44100,records = 44100,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 2427.499950,files = 43900,records = 43900,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 2435.004178,files = 43900,records = 43900,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 2427.760556,files = 43800,records = 43800,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 2427.780786,files = 43900,records = 43900,status = ok host = gqac011.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 2443.709742,files = 43900,records = 43900,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 2440.027392,files = 48400,records = 48400,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 2444.744207,files = 48400,records = 48400,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 2448.815804,files = 48200,records = 48200,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = -1485577902.147237,files = 0,records = 0,status = ERR: Input/output error host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 2438.041739,files = 48500,records = 48500,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 2440.221574,files = 48300,records = 48300,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 2441.069143,files = 48400,records = 48400,status = ok host = gqac015.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 2444.801625,files = 48300,records = 48300,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 00,elapsed = 2428.089606,files = 43600,records = 43600,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 01,elapsed = 2432.036917,files = 43600,records = 43600,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 02,elapsed = 2432.659354,files = 43700,records = 43700,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 03,elapsed = 2436.987550,files = 43900,records = 43900,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 04,elapsed = 2431.546693,files = 43800,records = 43800,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 05,elapsed = 2431.122970,files = 43500,records = 43500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 06,elapsed = 2431.054241,files = 43500,records = 43500,status = ok host = gqac019.sbu.lab.eng.bos.redhat.com,thr = 07,elapsed = 2432.844086,files = 43600,records = 43600,status = ok total threads = 32 </snip> This time I do not see decoding failure messages,though,after setting cache-invalidation to off. But I saw a steady stream of Operation not permitted messages in brick logs : bricks/bricks12-brick.log:[2017-01-27 13:59:26.961051] E [MSGID: 115081] [server-rpc-fops.c:1203:server_fstat_cbk] 0-butcher-server: 1664223: FSTAT -2 (a59b49dc-8f0e-4da6-8b95-2204e7957840) ==> (Operation not permitted) [Operation not permitted] ENOTCONNs are still there : nfs.log:[2017-01-28 05:11:21.433228] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-butcher-client-43: remote operation failed. Path: <gfid:26d32e9c-5c9b-4fbc-99c7-c7738db92a4d>/d_000/_05_47093_ (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] nfs.log:[2017-01-28 05:11:21.435028] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-butcher-client-43: remote operation failed. Path: <gfid:595a825d-5ced-43bd-841f-68cd07ce174a>/d_000/_02_47021_ (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] nfs.log:[2017-01-28 05:11:21.437546] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-butcher-client-43: remote operation failed. Path: <gfid:c5533125-d748-48b9-bbc1-c080047d727b>/d_000/_07_47017_ (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] nfs.log:[2017-01-28 05:11:21.438614] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-butcher-client-43: remote operation failed. Path: <gfid:b3552abc-2447-49e3-9bfb-185093b45a23>/d_001/_06_47140_ (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] I think we are talking about two issues here : A) XDR decoding failures with upcall. B) EIO (with or without mdcache) - which may or may not be related to the ENOTCONNs. Leaving setup in the same state for debug.
@Ambarish, You seem to have turned off performance.cache-invalidation but not features.cache-invalidation ;) .. performance.cache-invalidation: off performance.cache-samba-metadata: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on That would mean upcall xlator is still enabled but callbacks are just not processed by the md-cache layer. I still see RPC decode failure messages - [2017-01-28 05:11:21.403849] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg [2017-01-28 05:11:21.403909] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-butcher-client-43: RPC call decoding failed Thats why I had asked not to set any of the md-cache settings, go with default options and re-run the tests. But before doing so please collect sosreport of the current systems as well for any future reference. Thanks!
Ugggggggggggggggghhh! My bad.. Let me fetch that for ya now :)
Ok,so an EIO is reproducible even without md-cache. Smallfiles passed,but the mkdir I did before that(mkdir r on the mount point) failed with an IO Error,post disabling the mdcache options : nfs.log:[2017-01-29 03:18:44.563748] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /r => (XID: 61f361b8, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument] Even kernel untar failed on one of the clients : nfs.log:[2017-01-29 03:22:30.132709] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/Documentation => (XID: b27b9474, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument] nfs.log:[2017-01-29 03:29:16.736286] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/arch/arm => (XID: 7b9774, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000
mkdir failure occured on gqac015,which mounted "butcher" via gqas006.sbu.lab.eng.bos.redhat.com:/butcher. tabball untar failed on gqac011 that mounted via gqas005.sbu.lab.eng.bos.redhat.com:/butcher
gqas006.sbu.lab.eng.bos.redhat.com: [2017-01-29 03:18:44.563509] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-0: Mismatching GFID's in loc [2017-01-29 03:18:44.563748] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /r => (XID: 61f361b8, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument] [2017-01-29 03:18:45.384807] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in /r (gfid = dcb2dbee-e7f4-4b91-b4ca-871b0ed69d5c). Holes=1 overlaps=0 gqas005.sbu.lab.eng.bos.redhat.com: [2017-01-29 03:22:30.132328] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-4: Mismatching GFID's in loc [2017-01-29 03:22:30.132709] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/Documentation => (XID: b27b9474, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument] [2017-01-29 03:22:30.171110] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in /linux-4.9.5/Documentation (gfid = 75cc924f-7999-42fd-89bd-dde22acfbc71). Holes=1 overlaps=0 [2017-01-29 03:29:16.735847] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-0: Mismatching GFID's in loc [2017-01-29 03:29:16.736286] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/arch/arm => (XID: 7b9774, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [2017-01-29 03:29:16.747181] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in /linux-4.9.5/arch/arm (gfid = a40c0362-5af2-4c4d-93bc-dda785b43de7). Holes=1 overlaps=0 [2017-01-29 03:34:11.471629] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-7: Mismatching GFID's in loc [2017-01-29 03:34:11.471853] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/arch/arm64 => (XID: b10c9a74, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [2017-01-29 03:34:11.491232] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in /linux-4.9.5/arch/arm64 (gfid = 722607f8-c482-4225-8222-da7934de0925). Holes=1 overlaps=0 [2017-01-29 03:36:01.424196] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-4: Mismatching GFID's in loc [2017-01-29 03:36:01.424445] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/arch/cris => (XID: eff39a74, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [2017-01-29 03:36:01.439946] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-butcher-dht: Found anomalies in /linux-4.9.5/arch/cris (gfid = e9574c4c-a92d-4692-befa-8fea303dad06). Holes=1 overlaps=0 [2017-01-29 03:37:09.087177] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-2: Mismatching GFID's in loc [2017-01-29 03:37:09.087464] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/arch/ia64 => (XID: 1899b74, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument] I see Mistmatching gfid errors in both the servers. AFAIR, Pranith once mentioned that in such cases, EC returns EINVAL error. We need which top layer xlator is sending mismatching GFIDs.
I could not repro it on two attempts to recreate it on 3.8.4-25. Since it's a stress test,I'll be revisiting it again during 3.3.Will reopen if I hit it again. Moving this to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774