Bug 1451602

Summary: Brick Multiplexing: Even a clean delete of the brick directories of the base volume results in posix health-check errors (just as with ungraceful delete methods)
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: core
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Version: rhgs-3.3
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-3.8.4-28
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1459781 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:43:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Reporter: Nag Pavan Chilakam <nchilaka>
Assignee: Mohit Agrawal <moagrawa>
QA Contact: Nag Pavan Chilakam <nchilaka>
Docs Contact:
CC: amukherj, rhs-bugs, storage-qa-internal
Bug Depends On:
Bug Blocks: 1417151, 1457219, 1459781, 1461647

Description Nag Pavan Chilakam 2017-05-17 06:46:06 UTC
Description of problem:
========================
When the brick directories of a deleted base volume are removed while another volume still exists, the deletion results in posix health-check errors like the ones below:
[root@dhcp35-45 ~]# rm -rf /rhs/brick1/xyz-1
Broadcast message from systemd-journald.eng.blr.redhat.com (Tue 2017-05-16 18:56:12 IST):

rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down


Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
 rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down


Version-Release number of selected component (if applicable):
============
3.8.4-25

How reproducible:
=====
always

Steps to Reproduce:
========================
1. Enable brick multiplexing on a cluster (a 3-node setup in this case) and prepare multiple LVs for creating bricks.
2. Create volume v1 (1x3), creating the bricks in the recommended way by specifying a directory under the LV mount rather than using the LV mount path directly.
3. Create volume v2, also 1x3, on different LVs.
4. Due to brick multiplexing, v2's bricks must be using the same glusterfsd PID as v1's bricks.
5. Fuse-mount v2 and keep running I/O on it.
6. Stop and then delete v1.
7. Delete the brick directories of the deleted base volume v1.
8. Posix health-check errors appear, just as they do after an ungraceful brick shutdown (for example, a forceful umount of the parent LV while the volume still exists). See the CLI sketch after the log excerpt below.


rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down


Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
 rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down
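
For reference, a minimal CLI sketch of the steps above. The host names (n1, n2, n3) and the mount/LV paths (/rhs/brick1, /rhs/brick2, /mnt/v2) are illustrative assumptions, not taken from the original setup:

[root@n1 ~]# gluster volume set all cluster.brick-multiplex on
[root@n1 ~]# gluster volume create v1 replica 3 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
[root@n1 ~]# gluster volume start v1
[root@n1 ~]# gluster volume create v2 replica 3 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
[root@n1 ~]# gluster volume start v2
[root@n1 ~]# mount -t glusterfs n1:/v2 /mnt/v2      # keep I/O running on this mount
[root@n1 ~]# gluster volume stop v1
[root@n1 ~]# gluster volume delete v1
[root@n1 ~]# rm -rf /rhs/brick1/v1                  # on each node; this triggers the health-check messages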

Comment 4 Atin Mukherjee 2017-05-22 12:06:25 UTC
upstream patch : https://review.gluster.org/17356

Comment 5 Atin Mukherjee 2017-06-05 04:50:20 UTC
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108021/

Comment 7 Nag Pavan Chilakam 2017-06-07 13:46:46 UTC
I am still seeing this issue on 3.8.4-27.
I deleted the brick directory of the base volume after deleting the base volume,
and still saw posix errors:


Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


Message from syslogd@localhost at Jun  7 19:12:25 ...
 rhs-brick30-test3_30[23121]:[2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


I was testing 1451598 - Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes IO(currently seeing transport end point error) 


Hence, moving to FailedQA.

Comment 8 Mohit Agrawal 2017-06-08 04:41:31 UTC
RCA: 

Going by the log message you could call this FailedQA, but from a functionality standpoint it has not failed.

Why?

 Earlier, when you raised the bugzilla, the main issue was that the health-check thread was not cleaned up properly even after the brick was brought down gracefully, which caused a large memory leak.
 That issue was resolved by the patch https://review.gluster.org/17458; the remaining issue is only that this message is shown after the brick directory is removed.
 It is not high priority, but since we are in the development phase I will post a patch for it.

Why was it missed in our testing?

 As you can see below, this is the code that monitors the health-check file: it sleeps for 30 seconds before starting the check, and thread cancellation is deferred/disabled while the check runs.
 In my testing I stopped the volume right after starting it and then removed the brick from the backend, so the test steps finished before this monitoring loop ran its check, which is why it passed.

>>>>>>>>>>>>>>>

        while (1) {
                /* aborting sleep() is a request to exit this thread, sleep()
                 * will normally not return when cancelled */
                ret = sleep (interval);
                if (ret > 0)
                        break;

                /* prevent thread errors while doing the health-check(s) */
                pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, NULL);

                /* Do the health-check.*/
                ret = posix_fs_health_check (this);
                if (ret < 0)
                        goto abort;

                pthread_setcancelstate (PTHREAD_CANCEL_ENABLE, NULL);
        }

>>>>>>>>>>>>>>>>
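
The 30-second sleep above is the posix health-check interval. As a side note (not from the original comment): assuming the standard storage.health-check-interval volume option, the window can be shortened to make the message easier to hit while reproducing, e.g.:

# gluster volume set <volname> storage.health-check-interval 5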


Regards
Mohit Agrawal

Comment 9 Mohit Agrawal 2017-06-08 07:45:28 UTC
Upstream patch link:

REVIEW: https://review.gluster.org/17492 (glusterfsd: Deletion of brick dir throw emerg msgs after stop volume) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 10 Atin Mukherjee 2017-06-12 03:18:27 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/108719/

Comment 12 Nag Pavan Chilakam 2017-07-03 12:59:17 UTC
OnQA validation:
Not seeing posix warnings when repeating the above test case (as mentioned in the description).

Hence, moving to Verified.
Test version: 3.8.4-32

Comment 14 errata-xmlrpc 2017-09-21 04:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774