Description of problem:
========================
When we delete the brick directories of a deleted base volume while another volume still exists, the deletion results in posix errors as below:

[root@dhcp35-45 ~]# rm -rf /rhs/brick1/xyz-1

Broadcast message from systemd-journald.eng.blr.redhat.com (Tue 2017-05-16 18:56:12 IST):

rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Version-Release number of selected component (if applicable):
============
3.8.4-25

How reproducible:
=====
always

Steps to Reproduce:
========================
1. Enable brick multiplexing on a cluster (a 3-node setup in this case) and have multiple LVs available for creating bricks.
2. Create volume v1 (1x3), creating the bricks in the recommended way: use a directory under each LV mount rather than the LV mount path itself.
3. Create volume v2, also 1x3, using different LVs.
4. Verify that v2 is served by the same glusterfsd PID as v1, due to brick multiplexing.
5. Fuse-mount v2 and keep performing I/O.
6. Stop v1 and delete v1.
7. Delete the brick directory of the deleted base volume v1.
8. You will see posix health-check errors, just as for an ungraceful brick shutdown (e.g. a forceful umount of the parent LV while the volume still exists); a reproduction sketch is given after the log excerpt below.

rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down
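A minimal shell sketch of the reproduction steps above. The node names n1/n2/n3, the LV mount points /rhs/brick1 and /rhs/brick2, and the mount directory /mnt/v2 are hypothetical placeholders for the actual setup:

    # 1. enable brick multiplexing cluster-wide
    gluster volume set all cluster.brick-multiplex on

    # 2. create and start v1; brick paths are directories under the LV mounts
    gluster volume create v1 replica 3 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
    gluster volume start v1

    # 3-4. create and start v2 on different LVs; with multiplexing it attaches
    # to the same glusterfsd process as v1
    gluster volume create v2 replica 3 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
    gluster volume start v2

    # 5. fuse-mount v2 and keep I/O running
    mount -t glusterfs n1:/v2 /mnt/v2

    # 6-7. stop and delete v1, then remove its brick directory from the backend
    gluster volume stop v1
    gluster volume delete v1
    rm -rf /rhs/brick1/v1    # the spurious health-check messages appear here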
upstream patch: https://review.gluster.org/17356
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108021/
I am still seeing this issue on 3.8.4-27. I deleted the directory of the base volume after deleting the base volume itself, and saw posix errors:

Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

Message from syslogd@localhost at Jun 7 19:12:25 ...
rhs-brick30-test3_30[23121]:[2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

I was testing bug 1451598 (Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' I/O; currently seeing a transport endpoint error), hence moving to FailedQA.
RCA: Going by the log message this looks like a FailedQA, but from a functionality standpoint it is not. Why? When this bugzilla was originally raised, the main issue was that the health-check thread was not cleaned up even after the brick was brought down gracefully, which caused a large memory leak. That issue was resolved by the patch https://review.gluster.org/17458. The remaining issue is only that the message is still shown after the brick is removed. It is not high priority, but since we are in the development phase I will post a patch for it.

Why was it missed in our testing? Below is the code that monitors the health-check file. It sleeps for 30 seconds before starting each check, and only then disables deferred cancellation for the duration of the check. In my testing I stopped the volume immediately after starting it and removed the brick from the backend, so the test steps finished before the monitoring function ever ran its first check, and the test passed.

>>>>>>>>>>>>>>>
        while (1) {
                /* aborting sleep() is a request to exit this thread, sleep()
                 * will normally not return when cancelled */
                ret = sleep (interval);
                if (ret > 0)
                        break;
                /* prevent thread errors while doing the health-check(s) */
                pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, NULL);

                /* Do the health-check. */
                ret = posix_fs_health_check (this);
                if (ret < 0)
                        goto abort;

                pthread_setcancelstate (PTHREAD_CANCEL_ENABLE, NULL);
        }
>>>>>>>>>>>>>>>>

Regards
Mohit Agrawal
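Given the 30-second default interval, one hedged way for QE to make the window deterministic is to shrink the interval before stopping the volume. This sketch assumes the lingering health-check thread of the stopped brick keeps the storage.health-check-interval value configured while the volume was still running, which is not confirmed by this RCA:

    # shrink the health-check interval on v1 before stopping it
    gluster volume set v1 storage.health-check-interval 5

    gluster volume stop v1
    gluster volume delete v1
    rm -rf /rhs/brick1/v1      # remove the backend brick directory

    # wait past at least one interval, then check the journal; on an unfixed
    # build the spurious "health-check failed, going down" message appears
    sleep 10
    journalctl -b | grep "health-check failed"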
Upstream patch link: REVIEW: https://review.gluster.org/17492 (glusterfsd: Deletion of brick dir throw emerg msgs after stop volume) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108719/
onqa validation: not seeing posix warnings when repeating the above test case (as mentioned in the description), hence moving to Verified.

Test version: 3.8.4-32
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774