Bug 1408413
Summary: [ganesha + EC] posix compliance rename tests failed on EC volume with nfs-ganesha mount.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Arthy Loganathan <aloganat>
Component: distribute
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Arthy Loganathan <aloganat>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: aloganat, amukherj, aspandey, dang, ffilz, jthottan, mbenjamin, nbalacha, pgurusid, pkarampu, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-11
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1409727 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:59:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1409727, 1412913, 1413061
Bug Blocks: 1351528
Attachments:
Description
Arthy Loganathan
2016-12-23 10:21:25 UTC
Even these posix_compliance tests seem to pass without md-cache settings enabled on the gluster volume. Could you please verify the same?

Soumya, I am still seeing the issue without modifying the md-cache settings:

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t (Wstat: 0 Tests: 171 Failed: 1)
  Failed test:  77
/opt/qa/tools/posix-testsuite/tests/link/00.t (Wstat: 0 Tests: 82 Failed: 1)
  Failed test:  77
/opt/qa/tools/posix-testsuite/tests/open/07.t (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests:  5, 7, 9
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests:  70-71, 74-75, 78-79
Files=185, Tests=1962, 144 wallclock secs ( 1.63 usr 0.61 sys + 16.25 cusr 33.25 csys = 51.74 CPU)
Result: FAIL
end: 12:30:35
removed posix compliance directories
1 Total 1 tests were successful
Switching over to the previous working directory
Removing /mnt/no_mdcache//run16446/
rmdir: failed to remove '/mnt/no_mdcache//run16446/': Directory not empty
rmdir failed: Directory not empty
[root@dhcp47-176 no_mdcache]#

Proposing this as a blocker. It may affect functionality of EC volumes, as this test works fine with a distributed-replicate volume.

That's right.
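For reference, the failing subtest numbers in summaries like the one above can be pulled mechanically out of a saved prove/TAP log, since TAP failure lines always begin with "not ok <number>". A minimal sketch (the log path and its contents are fabricated here to mirror the output in this report):

```shell
# Create a small sample log shaped like the prove output in this bug.
cat > /tmp/tap_sample.log <<'EOF'
ok 69
not ok 70
not ok 71
ok 72
ok 73
not ok 74
EOF

# TAP failure lines start with "not ok <number>"; print the numbers.
awk '/^not ok/ { print $3 }' /tmp/tap_sample.log
# prints: 70 71 74 (one per line)
```

The same one-liner works against the full `prove -vf ... | tee log` output from a real run.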
I couldn't reproduce this issue either, on a 2 x (4 + 2) volume -

[root@dhcp35-197 rename]# ./00.t
1..79
ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 ok 18 ok 19 ok 20 ok 21 ok 22 ok 23 ok 24 ok 25 ok 26 ok 27 ok 28 ok 29 ok 30 ok 31 ok 32 ok 33 ok 34 ok 35 ok 36 ok 37 ok 38 ok 39 ok 40 ok 41 ok 42 ok 43 ok 44 ok 45 ok 46 ok 47 ok 48 ok 49 ok 50 ok 51 ok 52 ok 53 ok 54 ok 55 ok 56 ok 57 ok 58 ok 59 ok 60 ok 61 ok 62 ok 63 ok 64
i am here
ok 65 ok 66 ok 67 ok 68 ok 69 ok 70 ok 71 ok 72 ok 73 ok 74 ok 75 ok 76 ok 77 ok 78 ok 79
[root@dhcp35-197 rename]#

[root@dhcp35-197 tools]# gluster v info vol_disperse

Volume Name: vol_disperse
Type: Distributed-Disperse
Volume ID: d66d97a1-6bdb-476c-8c24-2c842f2bcb7a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 192.168.122.201:/tmp/disperse_brick1
Brick2: 192.168.122.201:/tmp/disperse_brick2
Brick3: 192.168.122.201:/tmp/disperse_brick3
Brick4: 192.168.122.201:/tmp/disperse_brick4
Brick5: 192.168.122.201:/tmp/disperse_brick5
Brick6: 192.168.122.201:/tmp/disperse_brick6
Brick7: 192.168.122.201:/tmp/disperse_brick7
Brick8: 192.168.122.201:/tmp/disperse_brick8
Brick9: 192.168.122.201:/tmp/disperse_brick9
Brick10: 192.168.122.201:/tmp/disperse_brick10
Brick11: 192.168.122.201:/tmp/disperse_brick11
Brick12: 192.168.122.201:/tmp/disperse_brick12
Options Reconfigured:
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-197 tools]#

Soumya,

Setup details:
Server machines - 10.70.46.111 --> VIP: 10.70.44.92, 10.70.46.115 --> VIP: 10.70.44.93
Client machine - 10.70.47.49

Mount details:
10.70.44.93:/vol_ec on /mnt/ec_test type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.44.93)

[root@dhcp47-49 ec_test]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 ok 18 ok 19 ok 20 ok 21 ok 22 ok 23 ok 24 ok 25 ok 26 ok 27 ok 28 ok 29 ok 30 ok 31 ok 32 ok 33 ok 34 ok 35 ok 36 ok 37 ok 38 ok 39 ok 40 ok 41 ok 42 ok 43 ok 44 ok 45 ok 46 ok 47 ok 48 ok 49 ok 50 ok 51 ok 52 ok 53 ok 54 ok 55 ok 56 ok 57 ok 58 ok 59 ok 60 ok 61 ok 62 ok 63 ok 64
i am here
ok 65 ok 66 ok 67 ok 68 ok 69
not ok 70
not ok 71
ok 72 ok 73
not ok 74
not ok 75
ok 76 ok 77
not ok 78
not ok 79
Failed 6/79 subtests

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests:  70-71, 74-75, 78-79
Files=1, Tests=79, 11 wallclock secs ( 0.08 usr 0.01 sys + 0.44 cusr 1.12 csys = 1.65 CPU)
Result: FAIL
[root@dhcp47-49 ec_test]#

Volume details:
----------------
Volume Name: vol_ec
Type: Distributed-Disperse
Volume ID: 19111707-7356-4bf9-b6a9-8762548cb531
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick2: dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick3: dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick4: dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick5: dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick6: dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick6/br6
Brick7: dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick8: dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick9: dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick10: dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick11: dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick7/br7
Brick12: dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick7/br7
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
diagnostics.client-log-level: INFO
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@dhcp46-111 ~]#

[root@dhcp46-111 ~]# gluster vol status vol_ec
Status of volume: vol_ec
Gluster process                                              TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick6/br6   49153     0          Y       7404
Brick dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick6/br6   49153     0          Y       11692
Brick dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick6/br6   49153     0          Y       22111
Brick dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick6/br6   49152     0          Y       10126
Brick dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick6/br6   49152     0          Y       21400
Brick dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick6/br6   49152     0          Y       17172
Brick dhcp46-111.lab.eng.blr.redhat.com:/bricks/brick7/br7   49154     0          Y       7423
Brick dhcp46-115.lab.eng.blr.redhat.com:/bricks/brick7/br7   49154     0          Y       11711
Brick dhcp46-139.lab.eng.blr.redhat.com:/bricks/brick7/br7   49154     0          Y       22137
Brick dhcp46-124.lab.eng.blr.redhat.com:/bricks/brick7/br7   49153     0          Y       10149
Brick dhcp46-131.lab.eng.blr.redhat.com:/bricks/brick7/br7   49153     0          Y       21419
Brick dhcp46-152.lab.eng.blr.redhat.com:/bricks/brick7/br7   49153     0          Y       17191
Self-heal Daemon on localhost                                N/A       N/A        Y       6482
Self-heal Daemon on dhcp46-115.lab.eng.blr.redhat.com        N/A       N/A        Y       13948
Self-heal Daemon on dhcp46-152.lab.eng.blr.redhat.com        N/A       N/A        Y       17688
Self-heal Daemon on dhcp46-139.lab.eng.blr.redhat.com        N/A       N/A        Y       20868
Self-heal Daemon on dhcp46-131.lab.eng.blr.redhat.com        N/A       N/A        Y       21931
Self-heal Daemon on dhcp46-124.lab.eng.blr.redhat.com        N/A       N/A        Y       8885

Task Status of Volume vol_ec
------------------------------------------------------------------------------
There are no active volume tasks
[root@dhcp46-111 ~]#

Thanks Arthy. These are the initial observations -

We have two volumes created on the setup Arthy shared:
no_mdcache (with default md-cache settings)
vol_ec (with md-cache settings configured)

10.70.44.93:/vol_ec on /mnt/ec_test type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.44.93)
dhcp46-111.lab.eng.blr.redhat.com:/no_mdcache on /mnt/nfs type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.70.47.49,local_lock=none,addr=10.70.46.111)

Running the individual test on

>>> "no_mdcache" volume -

[root@dhcp47-49 nfs]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1
ok 2
...
...
ok 78
ok 79
ok
All tests successful.
Files=1, Tests=79, 12 wallclock secs ( 0.08 usr 0.02 sys + 0.45 cusr 1.20 csys = 1.75 CPU)
Result: PASS
[root@dhcp47-49 nfs]#

>>> "vol_ec" volume -

[root@dhcp47-49 nfs]# cd ../ec_test/
[root@dhcp47-49 ec_test]# prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
/opt/qa/tools/posix-testsuite/tests/rename/00.t ..
1..79
ok 1
...
...
not ok 70
not ok 71
ok 72
ok 73
not ok 74
not ok 75
ok 76
ok 77
not ok 78
not ok 79
Failed 6/79 subtests

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/rename/00.t (Wstat: 0 Tests: 79 Failed: 6)
  Failed tests:  70-71, 74-75, 78-79
Files=1, Tests=79, 10 wallclock secs ( 0.08 usr 0.01 sys + 0.42 cusr 1.18 csys = 1.69 CPU)
Result: FAIL
[root@dhcp47-49 ec_test]#

But even on the "no_mdcache" volume, when the entire test suite is run, we see failures similar to what Arthy reported earlier. Attaching the results (posix_no_mdcache_v4.log).

From gfapi.log I see the errors below (a few may be expected as per the posix test suite):

[2016-12-28 14:41:38.604207] W [MSGID: 122033] [ec-common.c:1466:ec_locked] 4-no_mdcache-disperse-0: Failed to complete preop lock [Stale file handle]
[2016-12-28 14:41:38.614092] W [MSGID: 122033] [ec-common.c:1466:ec_locked] 4-no_mdcache-disperse-0: Failed to complete preop lock [Stale file handle]

^^^ somehow the file/directory got deleted before the EC xlator tried to acquire the lock.
[2016-12-28 14:41:38.622200] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-0: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622680] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-2: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622710] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-3: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622826] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-1: remote operation failed [Not a directory]
[2016-12-28 14:41:38.622893] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-5: remote operation failed [Not a directory]
[2016-12-28 14:41:38.623108] W [MSGID: 114031] [client-rpc-fops.c:688:client3_3_rmdir_cbk] 4-no_mdcache-client-4: remote operation failed [Not a directory]
[2016-12-28 14:41:38.631756] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 4-no_mdcache-disperse-1: Mismatching GFID's in loc
[2016-12-28 14:40:02.650432] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-0: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]
[2016-12-28 14:40:02.651218] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-3: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]
[2016-12-28 14:40:02.651314] W [MSGID: 114031] [client-rpc-fops.c:2775:client3_3_link_cbk] 4-no_mdcache-client-1: remote operation failed: (/run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_2aa0fe37002b6445d89b88a923b7862a -> /run23670/fstest_5042fc9cc1c13843124306bd5f815a29/fstest_b41c4fab9ddb2c5647a27905ccb09188) [Permission denied]

I then disabled client-io-threads and re-executed the entire test suite:

# gluster v set no_mdcache performance.client-io-threads off

All the rename tests seem to pass now (attached log file - posix_no_mdcache_v4_client_io_off.log).

I repeatedly toggled this option and re-ran the tests, and observed the same behaviour every time.

CCing Pranith. Pranith, request you to take a look and provide your comments.

Created attachment 1235665 [details]
posix_no_mdcache_v4_default.log
Created attachment 1235666 [details]
posix_no_mdcache_v4_client_io_off.log
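The toggle-and-retest procedure described above amounts to the following command sequence. This is a sketch against a live cluster: the volume name and test path are the ones from this report, the `gluster volume set` commands run on a server node, and the `prove` runs happen on the NFS client.

```shell
# Disable client-io-threads on the volume (on a gluster server node).
gluster volume set no_mdcache performance.client-io-threads off

# Re-run the failing rename tests from the NFS client mount.
prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t

# Flip the option back on and re-run to confirm the failures return.
gluster volume set no_mdcache performance.client-io-threads on
prove -vf /opt/qa/tools/posix-testsuite/tests/rename/00.t
```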
This is happening because DHT is not doing the link file cleanup as root; the cleanup was failing with EACCES, so the subsequent tests fail. Will send out a patch.

Upstream patch: http://review.gluster.org/16317
Downstream patch: https://code.engineering.redhat.com/gerrit/94321

The posix compliance rename tests passed. However, there are other known failures in the posix compliance test suite:

Test Summary Report
-------------------
/opt/qa/tools/posix-testsuite/tests/chown/00.t (Wstat: 0 Tests: 171 Failed: 1)
  Failed test:  77
/opt/qa/tools/posix-testsuite/tests/open/07.t (Wstat: 0 Tests: 23 Failed: 3)
  Failed tests:  5, 7, 9
Files=185, Tests=1962, 145 wallclock secs ( 1.69 usr 0.59 sys + 16.29 cusr 34.84 csys = 53.41 CPU)

Verified the fix in build:
nfs-ganesha-gluster-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
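The EACCES in the root cause above follows ordinary POSIX semantics: unlink() needs write permission on the parent directory, not on the file itself, so an internal linkto-file cleanup issued with an unprivileged caller's credentials can fail where the same cleanup as root would succeed. A hypothetical stand-alone illustration (the paths are invented; the EACCES only shows up when the `rm` runs as a non-root user):

```shell
# unlink() requires write permission on the parent directory, not the file.
mkdir -p /tmp/dht_demo
touch /tmp/dht_demo/linkto_file
chmod 555 /tmp/dht_demo              # parent directory no longer writable
rm -f /tmp/dht_demo/linkto_file      # EACCES for non-root callers; root succeeds

# Restore permissions and clean up the demo directory.
chmod 755 /tmp/dht_demo
rm -rf /tmp/dht_demo
```

This is the behaviour the fix addresses: performing the linkto-file cleanup with root credentials rather than with the credentials of the user issuing the rename.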