Description of problem:
========================
When we delete the brick directories of a deleted base volume while another volume still exists, the deletion results in posix errors as below:

[root@dhcp35-45 ~]# rm -rf /rhs/brick1/xyz-1

Broadcast message from systemd-journald.eng.blr.redhat.com (Tue 2017-05-16 18:56:12 IST):

rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Version-Release number of selected component (if applicable):
============
3.8.4-25

How reproducible:
=====
always

Steps to Reproduce:
========================
1. Enable brick multiplexing on a cluster (a 3-node setup in this case) and have multiple LVs available for creating bricks.
2. Create volume v1 (1x3), creating the bricks in the recommended way: use a directory under each LV mount rather than the LV mount path itself.
3. Create volume v2, also 1x3, using different LVs.
4. Verify that v2 is served by the same glusterfsd PID as v1, due to brick multiplexing.
5. Fuse-mount v2 and keep performing I/O.
6. Stop v1 and delete v1.
7. Delete the brick directory of the deleted base volume v1.
8. You will see posix health-check errors, just as for an ungraceful brick shutdown (e.g. a forceful umount of the parent LV while the volume still exists); a reproduction sketch is given after the log excerpt below.

rhs-brick1-xyz-1[5383]: [2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down

Message from syslogd@dhcp35-45 at May 16 18:56:12 ...
rhs-brick1-xyz-1[5383]:[2017-05-16 13:26:12.202625] M [MSGID: 113075] [posix-helpers.c:1893:posix_health_check_thread_proc] 0-xyz-1-posix: health-check failed, going down
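A minimal shell sketch of the reproduction steps above. The node names n1/n2/n3, the LV mount points /rhs/brick1 and /rhs/brick2, and the mount directory /mnt/v2 are hypothetical placeholders for the actual setup:

    # 1. enable brick multiplexing cluster-wide
    gluster volume set all cluster.brick-multiplex on

    # 2. create and start v1; brick paths are directories under the LV mounts
    gluster volume create v1 replica 3 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
    gluster volume start v1

    # 3-4. create and start v2 on different LVs; with multiplexing it attaches
    # to the same glusterfsd process as v1
    gluster volume create v2 replica 3 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
    gluster volume start v2

    # 5. fuse-mount v2 and keep I/O running
    mount -t glusterfs n1:/v2 /mnt/v2

    # 6-7. stop and delete v1, then remove its brick directory from the backend
    gluster volume stop v1
    gluster volume delete v1
    rm -rf /rhs/brick1/v1    # the spurious health-check messages appear here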
upstream patch: https://review.gluster.org/17356
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108021/
I am still seeing this issue on 3.8.4-27. I deleted the directory of the base volume after deleting the base volume itself, and saw posix errors:

Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

Message from syslogd@localhost at Jun 7 19:12:25 ...
rhs-brick30-test3_30[23121]:[2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

I was testing bug 1451598 (Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' I/O; currently seeing a transport endpoint error), hence moving to FailedQA.
RCA: Going by the log message this looks like a FailedQA, but from a functionality standpoint it is not. Why? When this bugzilla was originally raised, the main issue was that the health-check thread was not cleaned up even after the brick was brought down gracefully, which caused a large memory leak. That issue was resolved by the patch https://review.gluster.org/17458. The remaining issue is only that the message is still shown after the brick is removed. It is not high priority, but since we are in the development phase I will post a patch for it.

Why was it missed in our testing? Below is the code that monitors the health-check file. It sleeps for 30 seconds before starting each check, and only then disables deferred cancellation for the duration of the check. In my testing I stopped the volume immediately after starting it and removed the brick from the backend, so the test steps finished before the monitoring function ever ran its first check, and the test passed.

>>>>>>>>>>>>>>>
        while (1) {
                /* aborting sleep() is a request to exit this thread, sleep()
                 * will normally not return when cancelled */
                ret = sleep (interval);
                if (ret > 0)
                        break;
                /* prevent thread errors while doing the health-check(s) */
                pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, NULL);

                /* Do the health-check. */
                ret = posix_fs_health_check (this);
                if (ret < 0)
                        goto abort;

                pthread_setcancelstate (PTHREAD_CANCEL_ENABLE, NULL);
        }
>>>>>>>>>>>>>>>>

Regards
Mohit Agrawal
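Given the 30-second default interval, one hedged way for QE to make the window deterministic is to shrink the interval before stopping the volume. This sketch assumes the lingering health-check thread of the stopped brick keeps the storage.health-check-interval value configured while the volume was still running, which is not confirmed by this RCA:

    # shrink the health-check interval on v1 before stopping it
    gluster volume set v1 storage.health-check-interval 5

    gluster volume stop v1
    gluster volume delete v1
    rm -rf /rhs/brick1/v1      # remove the backend brick directory

    # wait past at least one interval, then check the journal; on an unfixed
    # build the spurious "health-check failed, going down" message appears
    sleep 10
    journalctl -b | grep "health-check failed"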
Upstream patch link: REVIEW: https://review.gluster.org/17492 (glusterfsd: Deletion of brick dir throw emerg msgs after stop volume) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108719/
onqa validation: not seeing posix warnings when repeating the above test case (as mentioned in the description), hence moving to Verified.

Test version: 3.8.4-32
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774