Bug 1715447
| Field | Value |
| --- | --- |
| Summary | Files in entry split-brain with "type mismatch" |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | replicate |
| Version | rhgs-3.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Anees Patel <anepatel> |
| Assignee | Karthik U S <ksubrahm> |
| QA Contact | Mugdha Soni <musoni> |
| CC | amukherj, ksubrahm, nchilaka, rhs-bugs, rkavunga, sheggodu, storage-qa-internal, vdas |
| Flags | rkavunga: needinfo-, rkavunga: needinfo- |
| Target Release | RHGS 3.5.0 |
| Hardware | Unspecified |
| OS | Linux |
| Fixed In Version | glusterfs-6.0-8 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2019-10-30 12:21:50 UTC |
| Bug Depends On | 1722507 |
| Bug Blocks | 1696809 |
Description: Anees Patel, 2019-05-30 12:01:04 UTC
Hi Anees,

Please provide the sos-reports and the volume status output.

Regards,
Karthik

The steps to reproduce mentioned are:

1. Create two 1X3 replicate volumes:
   # gluster v list
   emptyvol
   testvol_replicated
2. Write continuous IO (all types of FOPs).
3. Execute a script that does the following (see the sketch after this comment):
   1. Gets a list of all bricks for the volume testvol_replicated (3 bricks initially).
   2. Kills 2 bricks (b0, b1) one after the other (with a millisecond difference), sleeps for 3 seconds, then brings the bricks back up.
   3. Kills 2 more bricks (b1, b2) one after the other (with a millisecond difference), sleeps for 3 seconds, then brings the bricks back up using a glusterd restart.
   4. Repeats sub-steps 1, 2 and 3 multiple times.
4. Execute add-brick to convert the volume testvol_replicated to 2X3.
5. The script keeps running, now gets the list of all 6 bricks, and kills 2 bricks at a time in a loop.
6. Rebalance was executed and heal was triggered.

In step 6, where rebalance is performed, this step runs with continuous IO (crefi) while the script from step 3 is still running (killing 2 bricks from a replica pair), so rebalance fails on multiple nodes. This is expected behaviour, as quorum is not met.

Seeing the following errors in the rebalance logs:

[2019-08-28 07:10:56.327552] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir1/dir1/dir2/dir3/dir4/dir5
[2019-08-28 07:10:56.328646] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir1/dir1/dir2/dir3/dir4
[2019-08-28 07:10:56.366236] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir1/dir1/dir2/dir3
[2019-08-28 07:10:56.367188] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir1/dir1/dir2
[2019-08-28 07:10:56.368057] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-vol1-dht: Fix layout failed for /dir1/dir1
[2019-08-28 07:10:56.368153] W [MSGID: 114061] [client-common.c:3325:client_pre_readdirp_v2] 0-vol1-client-3: (721fbdd2-abca-4aab-bc58-ab979d19ea0a) remote_fd is -1. EBADFD [File descriptor in bad state]
[2019-08-28 07:10:56.383551] I [MSGID: 109081] [dht-common.c:5849:dht_setxattr] 0-vol1-dht: fixing the layout of /dir1
[2019-08-28 07:10:56.388174] E [MSGID: 109119] [dht-lock.c:1084:dht_blocking_inodelk_cbk] 0-vol1-dht: inodelk failed on subvol vol1-replicate-0, gfid:721fbdd2-abca-4aab-bc58-ab979d19ea0a [Transport endpoint is not connected]
[2019-08-28 07:10:56.388286] E [MSGID: 109016] [dht-rebalance.c:3944:gf_defrag_fix_layout] 0-vol1-dht: Setxattr failed for /dir1 [Transport endpoint is not connected]
[2019-08-28 07:10:56.388342] I [dht-rebalance.c:3297:gf_defrag_process_dir] 0-vol1-dht: migrate data called on /dir1
[2019-08-28 07:10:56.409947] W [dht-rebalance.c:3452:gf_defrag_process_dir] 0-vol1-dht: Found error from gf_defrag_get_entry
[2019-08-28 07:10:56.410907] E [MSGID: 109111] [dht-rebalance.c:3971:gf_defrag_fix_layout] 0-vol1-dht: gf_defrag_process_dir failed for directory: /dir1
[2019-08-28 07:10:56.413810] E [MSGID: 101172] [events.c:89:_gf_event] 0-vol1-dht: inet_pton failed with return code 0 [Invalid argument]
[2019-08-28 07:10:56.413952] I [MSGID: 109028] [dht-rebalance.c:5059:gf_defrag_status_get] 0-vol1-dht: Rebalance is failed. Time taken is 58.00 secs

So, is the script from step 3 supposed to be stopped before rebalance is triggered, or did the reporter omit mentioning the expected behaviour, i.e. the rebalance failures?
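A minimal bash sketch of the brick-kill loop described in step 3 above. This is an illustration under stated assumptions, not the actual QA script (which is not attached to this bug): the volume name comes from the steps, the PID parsing of `gluster volume status` output may differ across glusterfs versions, and `gluster volume start ... force` is used here simply as one way to bring killed bricks back online.

```bash
#!/bin/bash
# Hypothetical sketch of the brick-kill loop from step 3; the real QA script
# was not attached, so the parsing and timing here are assumptions.
VOL=testvol_replicated

while true; do
    # PIDs of this volume's brick processes: last column of the "Brick ..."
    # lines in the status output (offline bricks show "N/A" and are skipped).
    mapfile -t PIDS < <(gluster volume status "$VOL" | awk '/^Brick/ && $NF ~ /^[0-9]+$/ {print $NF}')

    # Kill two bricks back to back, wait a few seconds, then bring them back up.
    kill -9 "${PIDS[0]}" "${PIDS[1]}" 2>/dev/null
    sleep 3
    gluster volume start "$VOL" force      # restarts the killed brick processes

    # Kill the next pair of bricks, wait, then bring them back via a glusterd restart.
    mapfile -t PIDS < <(gluster volume status "$VOL" | awk '/^Brick/ && $NF ~ /^[0-9]+$/ {print $NF}')
    kill -9 "${PIDS[1]}" "${PIDS[2]}" 2>/dev/null
    sleep 3
    systemctl restart glusterd
done
```

Steps 4 and 6 would correspond to commands along the lines of `gluster volume add-brick testvol_replicated replica 3 <host>:<new-brick> ...` and `gluster volume rebalance testvol_replicated start`; the actual brick paths are not recorded in this bug, so those are placeholders.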
The expected result in the description says "Heal should complete with no files in split-brain". For data and metadata heal to happen we need all 3 bricks to be up, and rebalance will also not succeed while the script keeps disconnecting the bricks. So the script in step 3 should be stopped.

Created attachment 1611021 [details]
Outputs required to move the bug to verified
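As a hedged sketch of what such verification output typically covers (the actual outputs for this bug are in attachment 1611021), the commands below check brick status, heal progress, split-brain entries and rebalance status for the volume named in the test steps that follow:

```bash
# Hypothetical verification commands; the recorded outputs for this bug live in
# attachment 1611021, and the volume name is taken from the test steps below.
VOL=testvol_replicated

gluster volume status "$VOL"                   # all bricks should be online
gluster volume heal "$VOL" info                # pending heal entries should drop to 0
gluster volume heal "$VOL" info split-brain    # should report no files in split-brain
gluster volume rebalance "$VOL" status         # rebalance should show completed
```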
Steps followed to test the scenario:

1. Create two 1X3 replicate volumes:
   # gluster v list
   emptyvol
   testvol_replicated
2. Write continuous IO (all types of FOPs).
3. Execute a script that does the following:
   1. Gets a list of all bricks for the volume testvol_replicated (3 bricks initially).
   2. Kills 2 bricks (b0, b1) one after the other (with a millisecond difference), sleeps for 3 seconds, then brings the bricks back up.
   3. Kills 2 more bricks (b1, b2) one after the other (with a millisecond difference), sleeps for 3 seconds, then brings the bricks back up using a glusterd restart.
   4. Repeats sub-steps 1, 2 and 3 multiple times.
4. Execute add-brick to convert the volume testvol_replicated to 2X3.
5. The script keeps running, now gets the list of all 6 bricks, and kills 2 bricks at a time in a loop.
6. Rebalance was executed and heal was triggered.

Heals completed with no files pending and no split-brain issues seen. The outputs have been attached to the bug, on the basis of which the bug has been moved to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249