Bug 1422431
Summary: | multiple glusterfsd process crashed making the complete subvolume unavailable
---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage
Reporter: | Rahul Hinduja <rhinduja>
Component: | marker
Assignee: | Poornima G <pgurusid>
Status: | CLOSED ERRATA
QA Contact: | Rahul Hinduja <rhinduja>
Severity: | urgent
Docs Contact: |
Priority: | unspecified
Version: | rhgs-3.2
CC: | amukherj, asrivast, pgurusid, rcyriac, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: | ---
Target Release: | RHGS 3.2.0
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | glusterfs-3.8.4-15
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
: | 1422776 (view as bug list)
Environment: |
Last Closed: | 2017-03-23 06:05:09 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: |
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1351528, 1422776, 1424937
Description
Rahul Hinduja
2017-02-15 10:44:02 UTC
Upcall invalidations for XATTR operations were added as part of the md-cache optimizations. The crash looks similar to bug 1387204. Requesting Poornima to take a look. Thanks!

The simple reproducer for this issue: create a plain distribute volume, then enable cache-invalidation and the marker feature on the server side:

gluster vol set <VOLNAME> features.cache-invalidation on
gluster vol set <VOLNAME> indexing on
gluster vol quota <VOLNAME> enable

Then, from the FUSE mount point, create a file and rename it. After this, the bricks will crash.

The reason for the crash: on receiving a rename fop, marker_rename() stores the oldloc and newloc in its 'local' struct. Once the rename is done, the xtime marker (last updated time) is set on the file by sending a setxattr fop. When upcall receives the setxattr fop, the loc->inode is NULL and it crashes. The loc->inode can be NULL in only one valid case, i.e. the rename case, where the inode of the new loc will be NULL. Marker should therefore have obtained the inode of the new_loc and filled it in before issuing the setxattr. Hence, moving the component to marker.

This is similar to BZ 1387204, which is already fixed, but when quota is enabled a different code path is taken. Will send the patch in marker-quota to fix the same.

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/98136

Verified with build: glusterfs-geo-replication-3.8.4-15.el7rhgs.x86_64

Ran the same test suite, which does create, chmod, chown, chgrp, symlink, hardlink, truncate, rename and remove in different crawl methods with quota and md-cache enabled.
All cases passed and no crashes were seen:

[geo_rahul@skywalker ~]$ grep -ri "quota" /home/geo_rahul/regression/3.2-Regression/3.8.4-15-RHEL7.3-rsync-fuse-with-mdcache.log
2017-02-20 23:55:08,502 INFO run Executing gluster volume quota master enable on 10.70.42.7
2017-02-20 23:55:42,794 INFO run "gluster volume quota master enable" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:42,794 INFO run "gluster volume quota master enable" on 10.70.42.7: STDOUT is volume quota : success
2017-02-20 23:55:42,795 INFO run Executing gluster volume quota master limit-usage / 100GB on 10.70.42.7
2017-02-20 23:55:47,671 INFO run "gluster volume quota master limit-usage / 100GB" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:47,671 INFO run "gluster volume quota master limit-usage / 100GB" on 10.70.42.7: STDOUT is volume quota : success
2017-02-20 23:56:20,384 INFO run Executing gluster volume quota slave enable on 10.70.43.249
2017-02-20 23:56:54,749 INFO run "gluster volume quota slave enable" on 10.70.43.249: RETCODE is 0
2017-02-20 23:56:54,749 INFO run "gluster volume quota slave enable" on 10.70.43.249: STDOUT is volume quota : success
2017-02-20 23:56:54,749 INFO run Executing gluster volume quota slave limit-usage / 100GB on 10.70.43.249
2017-02-20 23:56:59,809 INFO run "gluster volume quota slave limit-usage / 100GB" on 10.70.43.249: RETCODE is 0
2017-02-20 23:56:59,809 INFO run "gluster volume quota slave limit-usage / 100GB" on 10.70.43.249: STDOUT is volume quota : success
[geo_rahul@skywalker ~]$

[geo_rahul@skywalker ~]$ grep -ri "volume set" /home/geo_rahul/regression/3.2-Regression/3.8.4-15-RHEL7.3-rsync-fuse-with-mdcache.log
2017-02-20 23:55:47,671 INFO run Executing gluster volume set master performance.cache-invalidation on on 10.70.42.7
2017-02-20 23:55:48,710 INFO run "gluster volume set master performance.cache-invalidation on" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:48,711 INFO run "gluster volume set master performance.cache-invalidation on" on 10.70.42.7: STDOUT is volume set: success
2017-02-20 23:55:48,711 INFO run Executing gluster volume set master features.cache-invalidation on on 10.70.42.7
2017-02-20 23:55:49,666 INFO run "gluster volume set master features.cache-invalidation on" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:49,666 INFO run "gluster volume set master features.cache-invalidation on" on 10.70.42.7: STDOUT is volume set: success
2017-02-20 23:55:49,667 INFO run Executing gluster volume set master performance.md-cache-timeout 600 on 10.70.42.7
2017-02-20 23:55:50,652 INFO run "gluster volume set master performance.md-cache-timeout 600" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:50,652 INFO run "gluster volume set master performance.md-cache-timeout 600" on 10.70.42.7: STDOUT is volume set: success
2017-02-20 23:55:50,653 INFO run Executing gluster volume set master performance.stat-prefetch on on 10.70.42.7
2017-02-20 23:55:51,615 INFO run "gluster volume set master performance.stat-prefetch on" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:51,615 INFO run "gluster volume set master performance.stat-prefetch on" on 10.70.42.7: STDOUT is volume set: success
2017-02-20 23:55:51,616 INFO run Executing gluster volume set master features.cache-invalidation-timeout 600 on 10.70.42.7
2017-02-20 23:55:52,584 INFO run "gluster volume set master features.cache-invalidation-timeout 600" on 10.70.42.7: RETCODE is 0
2017-02-20 23:55:52,584 INFO run "gluster volume set master features.cache-invalidation-timeout 600" on 10.70.42.7: STDOUT is volume set: success
2017-02-20 23:56:59,810 INFO run Executing gluster volume set slave performance.cache-invalidation on on 10.70.43.249
2017-02-20 23:57:01,095 INFO run "gluster volume set slave performance.cache-invalidation on" on 10.70.43.249: RETCODE is 0
2017-02-20 23:57:01,095 INFO run "gluster volume set slave performance.cache-invalidation on" on 10.70.43.249: STDOUT is volume set: success
2017-02-20 23:57:01,095 INFO run Executing gluster volume set slave features.cache-invalidation on on 10.70.43.249
2017-02-20 23:57:02,355 INFO run "gluster volume set slave features.cache-invalidation on" on 10.70.43.249: RETCODE is 0
2017-02-20 23:57:02,355 INFO run "gluster volume set slave features.cache-invalidation on" on 10.70.43.249: STDOUT is volume set: success
2017-02-20 23:57:02,355 INFO run Executing gluster volume set slave performance.md-cache-timeout 600 on 10.70.43.249
2017-02-20 23:57:03,531 INFO run "gluster volume set slave performance.md-cache-timeout 600" on 10.70.43.249: RETCODE is 0
2017-02-20 23:57:03,531 INFO run "gluster volume set slave performance.md-cache-timeout 600" on 10.70.43.249: STDOUT is volume set: success
2017-02-20 23:57:03,531 INFO run Executing gluster volume set slave performance.stat-prefetch on on 10.70.43.249
2017-02-20 23:57:04,780 INFO run "gluster volume set slave performance.stat-prefetch on" on 10.70.43.249: RETCODE is 0
2017-02-20 23:57:04,780 INFO run "gluster volume set slave performance.stat-prefetch on" on 10.70.43.249: STDOUT is volume set: success
2017-02-20 23:57:04,780 INFO run Executing gluster volume set slave features.cache-invalidation-timeout 600 on 10.70.43.249
2017-02-20 23:57:06,014 INFO run "gluster volume set slave features.cache-invalidation-timeout 600" on 10.70.43.249: RETCODE is 0
2017-02-20 23:57:06,014 INFO run "gluster volume set slave features.cache-invalidation-timeout 600" on 10.70.43.249: STDOUT is volume set: success
[geo_rahul@skywalker ~]$

[root@dhcp42-7 ~]# gluster volume info master
Volume Name: master
Type: Distributed-Replicate
Volume ID: 69c38f0f-c27b-47fe-b02a-8927cfa68eec
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.42.7:/bricks/brick0/master_brick0
Brick2: 10.70.41.211:/bricks/brick0/master_brick1
Brick3: 10.70.43.141:/bricks/brick0/master_brick2
Brick4: 10.70.43.156:/bricks/brick0/master_brick3
Brick5: 10.70.42.7:/bricks/brick1/master_brick4
Brick6: 10.70.41.211:/bricks/brick1/master_brick5
Brick7: 10.70.43.141:/bricks/brick1/master_brick6
Brick8: 10.70.43.156:/bricks/brick1/master_brick7
Brick9: 10.70.42.7:/bricks/brick2/master_brick8
Brick10: 10.70.41.211:/bricks/brick2/master_brick9
Brick11: 10.70.43.141:/bricks/brick2/master_brick10
Brick12: 10.70.43.156:/bricks/brick2/master_brick11
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 600
features.cache-invalidation: on
performance.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp42-7 ~]#

[root@dhcp43-249 ~]# gluster volume info slave
Volume Name: slave
Type: Distributed-Replicate
Volume ID: b410de0b-9c20-4eae-a5b5-e847c5a32c98
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.43.249:/bricks/brick0/slave_brick0
Brick2: 10.70.43.196:/bricks/brick0/slave_brick1
Brick3: 10.70.41.187:/bricks/brick0/slave_brick2
Brick4: 10.70.43.208:/bricks/brick0/slave_brick3
Brick5: 10.70.43.249:/bricks/brick1/slave_brick4
Brick6: 10.70.43.196:/bricks/brick1/slave_brick5
Brick7: 10.70.41.187:/bricks/brick1/slave_brick6
Brick8: 10.70.43.208:/bricks/brick1/slave_brick7
Brick9: 10.70.43.249:/bricks/brick2/slave_brick8
Brick10: 10.70.43.196:/bricks/brick2/slave_brick9
Brick11: 10.70.41.187:/bricks/brick2/slave_brick10
Brick12: 10.70.43.208:/bricks/brick2/slave_brick11
Options Reconfigured:
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 600
features.cache-invalidation: on
performance.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp43-249 ~]#

[geo_rahul@skywalker distaf]$ time python main.py -d "geo_rep" -t "$testcases"
test_1_changelog-test-create (__main__.gluster_tests) ... ok
test_2_changelog-test-chmod (__main__.gluster_tests) ... ok
test_3_changelog-test-chown (__main__.gluster_tests) ... ok
test_4_changelog-test-chgrp (__main__.gluster_tests) ... ok
test_5_changelog-test-symlink (__main__.gluster_tests) ... ok
test_6_changelog-test-hardlink (__main__.gluster_tests) ... ok
test_7_changelog-test-truncate (__main__.gluster_tests) ... ok
test_8_changelog-test-rename (__main__.gluster_tests) ... ok
test_9_changelog-test-remove (__main__.gluster_tests) ... ok
test_10_xsync-test-create (__main__.gluster_tests) ... ok
test_11_xsync-test-chmod (__main__.gluster_tests) ... ok
test_12_xsync-test-chown (__main__.gluster_tests) ... ok
test_13_xsync-test-chgrp (__main__.gluster_tests) ... ok
test_14_xsync-test-symlink (__main__.gluster_tests) ... ok
test_15_xsync-test-hardlink (__main__.gluster_tests) ... ok
test_16_xsync-test-truncate (__main__.gluster_tests) ... ok
test_17_history-test-create (__main__.gluster_tests) ... ok
test_18_history-test-chmod (__main__.gluster_tests) ... ok
test_19_history-test-chown (__main__.gluster_tests) ... ok
test_20_history-test-chgrp (__main__.gluster_tests) ... ok
test_21_history-test-symlink (__main__.gluster_tests) ... ok
test_22_history-test-hardlink (__main__.gluster_tests) ... ok
test_23_history-test-truncate (__main__.gluster_tests) ... ok
test_24_history-test-rename (__main__.gluster_tests) ... ok
test_25_history-test-remove (__main__.gluster_tests) ... ok
test_26_history-dynamic-create (__main__.gluster_tests) ... ok
test_27_history-dynamic-chmod (__main__.gluster_tests) ... ok
test_28_history-dynamic-chown (__main__.gluster_tests) ... ok
test_29_history-dynamic-chgrp (__main__.gluster_tests) ... ok
test_30_history-dynamic-symlink (__main__.gluster_tests) ... ok
test_31_history-dynamic-hardlink (__main__.gluster_tests) ... ok
test_32_history-dynamic-truncate (__main__.gluster_tests) ... ok
test_33_history-dynamic-rename (__main__.gluster_tests) ... ok
test_34_history-dynamic-remove (__main__.gluster_tests) ... ok
----------------------------------------------------------------------
Ran 34 tests in 41116.410s

OK

real 686m23.315s
user 0m14.528s
sys 0m4.805s
[geo_rahul@skywalker distaf]$

Moving this bug to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html