| Summary: | data self-heal happened from sink brick to source brick | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | spandura | ||||||
| Component: | replicate | Assignee: | Ravishankar N <ravishankar> | ||||||
| Status: | CLOSED WONTFIX | QA Contact: | storage-qa-internal <storage-qa-internal> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | rhgs-3.1 | CC: | mzywusko, rcyriac, rhs-bugs, sankarshan, smohan, spandura | ||||||
| Target Milestone: | --- | Keywords: | ZStream | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | Type: | Bug | |||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Attachments: |
Description
spandura
2015-12-29 06:48:14 UTC
Created attachment 1110131 [details]
shell script to create/modify/truncate files
Command Example: file_dir_ops.sh data_ops <mountpoint> truncate1 10 0M
Created attachment 1110720 [details]
Distaf log.
After going through the distaf logs with Pranith, we found that the xfs godown method used to take down the brick mounts takes a long time to bring down the bricks of the replica (about a one-minute interval between successive godowns). Since the modification/truncate tests run asynchronously, they could have completed before all the bricks were down. Also, since the godown method shuts down the XFS filesystem itself, the truncates might not have been synced to disk, which is why the file size is what it was before the truncate when the brick comes back up. In AFR, if there are no pending xattrs, the bigger file is selected as the source, and the heal happens when the file is accessed from the mount. This explains the arequal checksum mismatch, the lack of pending xattrs on the files, and the lack of heals by the self-heal daemon.
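The source selection described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this comment, not AFR's actual code: the `Replica` class, the `blames_peer` flag (standing in for a pending xattr held against the other brick) and the `pick_heal_source` function are all hypothetical names made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    size: int          # file size on this brick, in bytes
    blames_peer: bool  # hypothetical stand-in for a pending xattr against the other brick


def pick_heal_source(a: Replica, b: Replica) -> Optional[Replica]:
    """Toy two-way replica rule: pending xattrs decide first, file size second."""
    if a.blames_peer and b.blames_peer:
        # Both copies accuse each other; this model does not try to resolve that.
        raise ValueError("both replicas blame each other")
    if a.blames_peer:
        return a
    if b.blames_peer:
        return b
    # No pending xattrs anywhere: fall back to the bigger file, as described above.
    if a.size == b.size:
        return None  # nothing to heal in this model
    return a if a.size > b.size else b


# brick0 kept the truncate (0 bytes); brick1 lost it and came back at the old size.
good = Replica("brick0", 0, blames_peer=False)
stale = Replica("brick1", 10 * 1024 * 1024, blames_peer=False)
print(pick_heal_source(good, stale).name)
```

With no pending xattrs on either side, the model picks the stale, larger copy as the source, which matches the wrong-direction heal seen in this test.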
After modifying the test to use BRICK_TAKEDOWN_METHOD="service_kill" instead of xfs godown, the test passed in all 3 runs.
Since this is expected behaviour, it is not a blocker bug per se. But we could provide a fix for 3.1.3 by doing an fsync after truncate in the AFR transaction. If the fsync fails, the post-op will leave pending xattrs for the killed brick and the heal will happen in the right direction.
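The idea of the proposed fix can be sketched outside of gluster with plain files standing in for bricks. This is an assumption-laden illustration, not the AFR patch: `truncate_on_bricks` and the returned "pending" map are hypothetical, and the local paths only mimic per-brick copies of a file. The point it shows is that a truncate is reported as successful only if the following fsync also succeeds; otherwise that copy is marked as needing heal.

```python
import os
import tempfile


def truncate_on_bricks(paths, size, fsync=os.fsync):
    """Truncate each replica copy, fsync it, and return {path: needs_heal}."""
    pending = {}
    for path in paths:
        try:
            fd = os.open(path, os.O_WRONLY)
            try:
                os.ftruncate(fd, size)
                fsync(fd)  # a failure here means the truncate may not be on disk
            finally:
                os.close(fd)
            pending[path] = False
        except OSError:
            # Hypothetical equivalent of the post-op leaving a pending xattr
            # against this brick, so a later heal treats it as the sink.
            pending[path] = True
    return pending


# Usage: two local files play the role of the two bricks of a replica pair.
workdir = tempfile.mkdtemp()
bricks = [os.path.join(workdir, name) for name in ("brick0", "brick1")]
for brick in bricks:
    with open(brick, "wb") as f:
        f.write(b"x" * 10)
print(truncate_on_bricks(bricks, 0))
```

The `fsync` parameter exists only so that a failing fsync can be injected when experimenting with the sketch.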
Moving this to 3.1.4 after triaging.