| Summary: | Files not able to heal even though a clear source is available | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Karan Sandha <ksandha> |
| Component: | arbiter | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED NOTABUG | QA Contact: | Karan Sandha <ksandha> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.2 | CC: | ksandha, rhs-bugs, storage-qa-internal |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-10-04 06:23:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Karan Sandha
2016-10-03 10:09:41 UTC
From initial observations on Karan's setup, afr_selfheal_data_do() is failing with ENOTCONN (even though all bricks are up and shd is connected to them). As a result, subsequent steps such as afr_selfheal_undo_pending() never run and the heal never completes. I was not able to see at what point in afr_selfheal_data_do() the code fails, because the values seem to be optimized out when debugging with gdb and breakpoints don't work as expected. There are 2 places that return ENOTCONN, but those conditions are not true in the function, and the gdb issue is not helping in finding where the problem is. I have provided my dev VMs (with the latest downstream source) to Karan to see if the issue can be re-created on them.

So I managed to debug further on Karan's setup, and it was found that the sink brick (Brick2: dhcp46-50.lab.eng.blr.redhat.com:/bricks/brick0/testvol) was 100% full:
---------------------------------------------------------------
[root@dhcp46-50 brick0]# df -hT|grep brick0
/dev/mapper/RHS_vg0-RHS_lv0 xfs 6.5G 6.5G 20K 100% /bricks/brick0
[root@dhcp46-50 ~]# cd /bricks/brick0/
[root@dhcp46-50 brick0]# ls
testvol
[root@dhcp46-50 brick0]#
[root@dhcp46-50 brick0]# touch deleteme
touch: cannot touch ‘deleteme’: No space left on device
---------------------------------------------------------------
Because the brick was full, there were short writes by posix_writev on this brick.
    afr_selfheal_data_do {
            for (off = 0; off < replies[source].poststat.ia_size; off += block) {
                    if (AFR_COUNT (healed_sinks, priv->child_count) == 0) {
                            ret = -ENOTCONN;                          /* (1) */
                            goto out;
                    }
                    ret = afr_selfheal_data_block (healed_sinks);     /* (2) */
            }
    }
In the snippet above, (2) resets healed_sinks[2] to 0 due to short writes, because of which (1) is hit in the next iteration and afr returns ENOTCONN. So undo-pending never happens and the heal attempts go on forever.
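For anyone hitting the same symptom, the interaction between (1) and (2) can be reproduced outside gluster with a small stand-alone model. This is only a rough sketch, not the actual AFR code: `count_set()`, `heal_block()`, `heal_data()`, `CHILD_COUNT` and the `free_bytes` array are all hypothetical names made up for illustration, and a "short write" is modelled simply as a brick having less free space than one block.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define CHILD_COUNT 3   /* assumption: three bricks in the replica set */

/* Count non-zero flags; plays the role of AFR_COUNT in the snippet above. */
static int count_set(const unsigned char *flags, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (flags[i])
            count++;
    }
    return count;
}

/* Hypothetical stand-in for afr_selfheal_data_block(): "writes" one block
 * to every remaining sink. A sink whose brick cannot hold the whole block
 * (a short write) is dropped from healed_sinks. */
static int heal_block(unsigned char *healed_sinks, size_t *free_bytes,
                      size_t block)
{
    for (int i = 0; i < CHILD_COUNT; i++) {
        if (!healed_sinks[i])
            continue;
        if (free_bytes[i] < block) {
            free_bytes[i] = 0;      /* brick is now completely full */
            healed_sinks[i] = 0;    /* short write: sink no longer healable */
        } else {
            free_bytes[i] -= block;
        }
    }
    return 0;
}

/* Hypothetical stand-in for the loop in afr_selfheal_data_do(). */
static int heal_data(unsigned char *healed_sinks, size_t *free_bytes,
                     size_t file_size, size_t block)
{
    int ret = 0;

    assert(block > 0);  /* a zero block size would never advance 'off' */

    for (size_t off = 0; off < file_size; off += block) {
        if (count_set(healed_sinks, CHILD_COUNT) == 0) {
            ret = -ENOTCONN;        /* (1): no sinks left to heal */
            goto out;
        }
        ret = heal_block(healed_sinks, free_bytes, block);  /* (2) */
        if (ret < 0)
            goto out;
    }
out:
    return ret;
}
```

In this model, calling `heal_data()` with a single sink that has 20K free (as in the df output above), a 1 MiB file and a 128 KiB block returns -ENOTCONN on the second iteration, even though nothing is actually disconnected. With enough free space on the sink, the same call returns 0.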
Karan, shall I close the BZ? Sorry for not catching this sooner.
Ravi, yup, go ahead.