Description of problem:
posix compliance rename tests failed on an EC volume with an nfs-ganesha mount.

Tests that fail consistently:
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79

Intermittent failure:
/opt/qa/tools/posix-testsuite/tests/rename/20.t (Wstat: 0 Tests: 16 Failed: 1)
  Failed test: 16

However, the tests pass on a distribute-replicate volume.

Version-Release number of selected component (if applicable):
nfs-ganesha-gluster-2.4.1-3.el7rhgs.x86_64
nfs-ganesha-2.4.1-3.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-9.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create an nfs-ganesha cluster, create a 2*(4+2) EC volume, and enable ganesha on it.
2. Set mdcache options (a hedged example of these settings is sketched at the end of this comment).
3. Mount the volume on the client.
4. Run the posix_compliance test suite:
   /opt/qa/tools/system_light/run.sh -w /mnt/test_nfs -t posix_compliance -l /var/tmp/posix.log

Actual results:
posix compliance rename tests fail.

Expected results:
All tests in posix compliance should pass.

Additional info:

ec volume: v4 mount
---------------------------
Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t  (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/link/00.t   (Wstat: 0 Tests: 82 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/open/07.t   (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests: 5, 7, 9
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79
/opt/qa/tools/posix-testsuite/tests/rename/20.t (Wstat: 0 Tests: 16 Failed: 1)
  Failed test: 16
Files=185, Tests=1962, 146 wallclock secs ( 1.66 usr 0.60 sys + 16.38 cusr 34.16 csys = 52.80 CPU)
Result: FAIL
end: 12:51:33
removed posix compliance directories
1
Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/ec_test//run5988/
rmdir: failed to remove ‘/mnt/ec_test//run5988/’: Directory not empty
rmdir failed:Directory not empty
[root@dhcp47-176 ec_test]#

ec volume: v3 mount
---------------------------
Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t  (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79
Files=185, Tests=1962, 142 wallclock secs ( 1.65 usr 0.63 sys + 16.39 cusr 33.38 csys = 52.05 CPU)
Result: FAIL
end: 12:56:10
removed posix compliance directories
1
Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/ec_test//run20803/
rmdir: failed to remove ‘/mnt/ec_test//run20803/’: Directory not empty
rmdir failed:Directory not empty

dist_rep volume: v4 mount
-----------------------------------
Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t  (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/link/00.t   (Wstat: 0 Tests: 82 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/open/07.t   (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests: 5, 7, 9
Files=185, Tests=1962, 125 wallclock secs ( 1.57 usr 0.63 sys + 16.44 cusr 34.06 csys = 52.70 CPU)
Result: FAIL
end: 12:34:35
removed posix compliance directories
1
Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/test_nfs/run3658/

dist_rep volume: v3 mount
----------------------------------
Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t  (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/link/00.t   (Wstat: 0 Tests: 82 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 2)
  Failed tests: 72, 76
Files=185, Tests=1962, 122 wallclock secs ( 1.60 usr 0.56 sys + 16.49 cusr 33.94 csys = 52.59 CPU)
Result: FAIL
end: 12:42:56
removed posix compliance directories
1
Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/test_nfs/run18459/

A different bug has been raised for the failures on the dist-rep volume:
https://bugzilla.redhat.com/show_bug.cgi?id=1404367

sosreports, ganesha logs and tcpdump are at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/ec_posix/
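For reference, a hedged sketch of the md-cache options referred to in step 2, taken from the "Options Reconfigured" list in the volume info further below (volume name assumed to be vol_ec; exact values used in a given run may differ):

# assumed md-cache / upcall settings for step 2
gluster volume set vol_ec features.cache-invalidation on
gluster volume set vol_ec features.cache-invalidation-timeout 600
gluster volume set vol_ec performance.stat-prefetch on
gluster volume set vol_ec performance.cache-invalidation on
gluster volume set vol_ec performance.md-cache-timeout 600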
Even these posix_compliance tests seem to pass when the md-cache settings are not enabled on the gluster volume. Could you please verify the same?
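For that verification, the md-cache related options can be reverted to their defaults before re-running the suite. A minimal sketch (assuming the volume is vol_ec; "gluster volume reset" restores an option to its default value):

gluster volume reset vol_ec performance.md-cache-timeout
gluster volume reset vol_ec performance.cache-invalidation
gluster volume reset vol_ec performance.stat-prefetch
gluster volume reset vol_ec features.cache-invalidation-timeout
gluster volume reset vol_ec features.cache-invalidation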
Soumya, I am still seeing the issue without modifying the mdcache settings.

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t  (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/link/00.t   (Wstat: 0 Tests: 82 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/open/07.t   (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests: 5, 7, 9
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79
Files=185, Tests=1962, 144 wallclock secs ( 1.63 usr 0.61 sys + 16.25 cusr 33.25 csys = 51.74 CPU)
Result: FAIL
end: 12:30:35
removed posix compliance directories
1
Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/no_mdcache//run16446/
rmdir: failed to remove ‘/mnt/no_mdcache//run16446/’: Directory not empty
rmdir failed:Directory not empty
[root@dhcp47-176 no_mdcache]#
Proposing this as a blocker. It may affect the functionality of EC volumes, since the same test passes on a distribute-replicate volume.
That's right. I couldn't reproduce this issue either on a 2 x (4+2) volume -

[root@dhcp35-197 rename]# ./00.t
1..79
ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 ok 18 ok 19 ok 20
ok 21 ok 22 ok 23 ok 24 ok 25 ok 26 ok 27 ok 28 ok 29 ok 30 ok 31 ok 32 ok 33 ok 34 ok 35 ok 36 ok 37 ok 38 ok 39 ok 40
ok 41 ok 42 ok 43 ok 44 ok 45 ok 46 ok 47 ok 48 ok 49 ok 50 ok 51 ok 52 ok 53 ok 54 ok 55 ok 56 ok 57 ok 58 ok 59 ok 60
ok 61 ok 62 ok 63 ok 64
i am here
ok 65 ok 66 ok 67 ok 68 ok 69 ok 70 ok 71 ok 72 ok 73 ok 74 ok 75 ok 76 ok 77 ok 78 ok 79
[root@dhcp35-197 rename]#

[root@dhcp35-197 tools]# gluster v info vol_disperse

Volume Name: vol_disperse
Type: Distributed-Disperse
Volume ID: d66d97a1-6bdb-476c-8c24-2c842f2bcb7a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.122.201:/tmp/disperse_brick1
Brick2: 192.168.122.201:/tmp/disperse_brick2
Brick3: 192.168.122.201:/tmp/disperse_brick3
Brick4: 192.168.122.201:/tmp/disperse_brick4
Brick5: 192.168.122.201:/tmp/disperse_brick5
Brick6: 192.168.122.201:/tmp/disperse_brick6
Brick7: 192.168.122.201:/tmp/disperse_brick7
Brick8: 192.168.122.201:/tmp/disperse_brick8
Brick9: 192.168.122.201:/tmp/disperse_brick9
Brick10: 192.168.122.201:/tmp/disperse_brick10
Brick11: 192.168.122.201:/tmp/disperse_brick11
Brick12: 192.168.122.201:/tmp/disperse_brick12
Options Reconfigured:
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-197 tools]#
Soumya,

Setup details:
Server machines - 10.70.46.111 --> VIP: 10.70.44.92, 10.70.46.115 --> VIP: 10.70.44.93
Client machine  - 10.70.47.49

Mount details:
10.70.44.93:/vol_ec on /mnt/ec_test type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.44.93)

[root@dhcp47-49 ec_test]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 ok 18 ok 19 ok 20
ok 21 ok 22 ok 23 ok 24 ok 25 ok 26 ok 27 ok 28 ok 29 ok 30 ok 31 ok 32 ok 33 ok 34 ok 35 ok 36 ok 37 ok 38 ok 39 ok 40
ok 41 ok 42 ok 43 ok 44 ok 45 ok 46 ok 47 ok 48 ok 49 ok 50 ok 51 ok 52 ok 53 ok 54 ok 55 ok 56 ok 57 ok 58 ok 59 ok 60
ok 61 ok 62 ok 63 ok 64
i am here
ok 65 ok 66 ok 67 ok 68 ok 69
not ok 70
not ok 71
ok 72 ok 73
not ok 74
not ok 75
ok 76 ok 77
not ok 78
not ok 79
Failed 6/79 subtests

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79
Files=1, Tests=79, 11 wallclock secs ( 0.08 usr 0.01 sys + 0.44 cusr 1.12 csys = 1.65 CPU)
Result: FAIL
[root@dhcp47-49 ec_test]#
Volume details:
----------------
Volume Name: vol_ec
Type: Distributed-Disperse
Volume ID: 19111707-7356-4bf9-b6a9-8762548cb531
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick2: dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick3: dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick4: dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick5: dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick6: dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick7: dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick8: dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick9: dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick10: dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick11: dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick12: dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick7/br7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
diagnostics.client-log-level: INFO
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@dhcp46-111 ~]# gluster vol status vol_ec
Status of volume: vol_ec
Gluster process                                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick6/br6    49153     0          Y       7404
Brick dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick6/br6    49153     0          Y       11692
Brick dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick6/br6    49153     0          Y       22111
Brick dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick6/br6    49152     0          Y       10126
Brick dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick6/br6    49152     0          Y       21400
Brick dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick6/br6    49152     0          Y       17172
Brick dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick7/br7    49154     0          Y       7423
Brick dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick7/br7    49154     0          Y       11711
Brick dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick7/br7    49154     0          Y       22137
Brick dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick7/br7    49153     0          Y       10149
Brick dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick7/br7    49153     0          Y       21419
Brick dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick7/br7    49153     0          Y       17191
Self-heal Daemon on localhost                                 N/A       N/A        Y       6482
Self-heal Daemon on dhcp46-115.lab.eng.blr.redhat.com         N/A       N/A        Y       13948
Self-heal Daemon on dhcp46-152.lab.eng.blr.redhat.com         N/A       N/A        Y       17688
Self-heal Daemon on dhcp46-139.lab.eng.blr.redhat.com         N/A       N/A        Y       20868
Self-heal Daemon on dhcp46-131.lab.eng.blr.redhat.com         N/A       N/A        Y       21931
Self-heal Daemon on dhcp46-124.lab.eng.blr.redhat.com         N/A       N/A        Y       8885

Task Status of Volume vol_ec
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-111 ~]#
Thanks Arthy. These are initial observations -

We have two volumes created on the setup Arthy shared -
no_mdcache (with default md-cache settings)
vol_ec (with md-cache settings configured)

10.70.44.93:/vol_ec on /mnt/ec_test type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.44.93)
dhcp46-111.lab.eng.blr.redhat.com:/no_mdcache on /mnt/nfs type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.46.111)

Running individual test on

>>> "no_mdcache" volume -

[root@dhcp47-49 nfs]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1
ok 2
...
...
ok 78
ok 79
ok
All tests successful.
Files=1, Tests=79, 12 wallclock secs ( 0.08 usr 0.02 sys + 0.45 cusr 1.20 csys = 1.75 CPU)
Result: PASS
[root@dhcp47-49 nfs]#

>>> "vol_ec" volume -

[root@dhcp47-49 nfs]# cd ../ec_test/
[root@dhcp47-49 ec_test]#
[root@dhcp47-49 ec_test]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1
...
...
not ok 70
not ok 71
ok 72
ok 73
not ok 74
not ok 75
ok 76
ok 77
not ok 78
not ok 79
Failed 6/79 subtests

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests: 70-71, 74-75, 78-79
Files=1, Tests=79, 10 wallclock secs ( 0.08 usr 0.01 sys + 0.42 cusr 1.18 csys = 1.69 CPU)
Result: FAIL
[root@dhcp47-49 ec_test]#
But even on the "no_mdcache" volume, when the entire test-suite is run we see failures similar to the ones Arthy reported earlier. Attaching the results (posix_no_mdcache_v4.log).

From gfapi.log I see the errors below (maybe a few are expected as per the posix test-suite):

[2016-12-28 14:41:38.604207] W [MSGID: 122033] [ec-common.c:1466:ec_locked] 4-no_mdcache-disperse-0: Failed to complete preop lock [Stale file handle]
[2016-12-28 14:41:38.614092] W [MSGID: 122033] [ec-common.c:1466:ec_locked] 4-no_mdcache-disperse-0: Failed to complete preop lock [Stale file handle]
^^^ somehow the file/directory got deleted before the EC xlator tried to acquire the lock.

[2016-12-28 14:41:38.622200] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-0: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622680] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-2: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622710] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-3: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622826] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-1: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622893] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-5: remote operation failed [Not a directory]
[2016-12-28 14:41:38.623108] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-4: remote operation failed [Not a directory]
[2016-12-28 14:41:38.631756] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 4-no_mdcache-disperse-1: Mismatching GFID's in loc
[2016-12-28 14:40:02.650432] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-0: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]
[2016-12-28 14:40:02.651218] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-3: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]
[2016-12-28 14:40:02.651314] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-1: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]

I then disabled client-io-threads and re-executed the entire test-suite:

# gluster v set no_mdcache performance.client-io-threads off

All the rename tests seem to pass now (attached log file - posix_no_mdcache_v4_client_io_off.log).

I repeatedly toggled this option and re-ran the tests, and observed the same behaviour every time.

CC'ing Pranith. Pranith, request you to take a look and provide your comments.
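For completeness, the toggle-and-rerun sequence above is roughly the following (a sketch based on the commands quoted in this comment; volume name and test path as per this setup):

# turn client-io-threads off on the volume, then re-run the rename tests
gluster volume set no_mdcache performance.client-io-threads off
prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t    # passes with the option off

# turn it back on to confirm the failures return
gluster volume set no_mdcache performance.client-io-threads on
prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t    # fails tests 70-71, 74-75, 78-79 again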
Created attachment 1235665 [details] posix_no_mdcache_v4_default.log
Created attachment 1235666 [details] posix_no_mdcache_v4_client_io_off.log
This is happening because DHT is not doing the link file cleanup as root; the cleanup was failing with EACCES, so the subsequent tests fail. Will send out a patch.
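As a note for inspecting the leftover state on the bricks: DHT link files are normally zero-byte entries with mode ---------T and a trusted.glusterfs.dht.linkto xattr, so stale ones left behind when the cleanup fails can be spotted with something like the sketch below (the brick path is just one example from this setup):

# run on a brick node; lists sticky-bit, zero-size files and prints their dht linkto xattr
find /bricks/brick6/br6 -type f -perm -1000 -size 0 \
    -exec getfattr --absolute-names -n trusted.glusterfs.dht.linkto -e text {} \;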
Upstream patch: http://review.gluster.org/16317
downstream patch : https://code.engineering.redhat.com/gerrit/94321
posix compliance rename tests passed. However, there are other known failures in the posix compliance test suite.

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t (Wstat: 0 Tests: 171 Failed: 1)
  Failed test: 77
/opt/qa/tools/posix-testsuite/tests/open/07.t  (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests: 5, 7, 9
Files=185, Tests=1962, 145 wallclock secs ( 1.69 usr 0.59 sys + 16.29 cusr 34.84 csys = 53.41 CPU)

Verified the fix in build:
nfs-ganesha-gluster-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html