Bug 1355801

Summary: Brick process on container node not coming up after node reboot
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Anoop <annair>
Component: rhgs-server-container
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED CURRENTRELEASE
QA Contact: Prasanth <pprakash>
Severity: urgent
Priority: urgent
Version: rhgs-3.1
CC: amukherj, annair, hchiramm, lpabon, mliyazud, pkarampu, pprakash, rcyriac, rhs-bugs, sankarshan, sashinde
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS Container Converged 1.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rhgs-server-docker-3.1.3-12
Doc Type: If docs needed, set a value
Last Closed: 2016-10-14 13:43:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Bug Blocks: 1332128
Attachments: Node reboot output

Description Anoop 2016-07-12 14:47:49 UTC
Have a 3 node OpenShift cluster with each of these nodes hosting one RHGS container:

[root@dhcp42-41 ~]# oc get nodes
NAME                               STATUS                     AGE
dhcp42-21.lab.eng.blr.redhat.com   Ready                      15d
dhcp42-41.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   15d
dhcp42-79.lab.eng.blr.redhat.com   Ready                      15d
dhcp43-69.lab.eng.blr.redhat.com   Ready                      15d
[root@dhcp42-41 ~]# oc get -o wide pods
NAME                                                    READY     STATUS      RESTARTS   AGE       NODE
glusterfs-dc-dhcp42-21.lab.eng.blr.redhat.com-1-c489s   1/1       Running     0          13d       dhcp42-21.lab.eng.blr.redhat.com
glusterfs-dc-dhcp42-79.lab.eng.blr.redhat.com-1-imzxn   1/1       Running     0          13d       dhcp42-79.lab.eng.blr.redhat.com
glusterfs-dc-dhcp43-69.lab.eng.blr.redhat.com-1-fg7ja   1/1       Running     0          13d       dhcp43-69.lab.eng.blr.redhat.com
heketi-1-k8l0q                                          1/1       Running     0          12d       dhcp43-69.lab.eng.blr.redhat.com
heketi-storage-copy-job-rwyr3                           0/1       Completed   0          12d       dhcp43-69.lab.eng.blr.redhat.com
router-1-s2zmk                                          1/1       Running     0          13d       dhcp42-79.lab.eng.blr.redhat.com


I rebooted one of the OpenShift nodes, which also took one of the RHGS container nodes down. When the node came back, I see that the container started; however, the brick processes for this node did not.

sh-4.2# gluster vol status|more
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.79:/var/lib/heketi/mounts/vg
_dac6ff29ed295561f3d2634dcb985826/brick_49c
dab5958c2087baef11d57afd84833/brick         49152     0          Y       98497
Brick 10.70.43.69:/var/lib/heketi/mounts/vg
_2f2a03809f3f15d1ba6b6303e4f23689/brick_e36
277c16ae0515f9175f24454df41a7/brick         49152     0          Y       50985
Brick 10.70.42.21:/var/lib/heketi/mounts/vg
_2af243604dd82a9918105ef492224163/brick_09e
042f6f80d3ee2291952b4b1b5f197/brick         49152     0          Y       572  
Brick 10.70.43.69:/var/lib/heketi/mounts/vg
_4b614ff63c2260a7a2f1a9b49cac8eac/brick_7e9
94973c4ede8d7616064078e8cb9a0/brick         49153     0          Y       51004
Brick 10.70.42.21:/var/lib/heketi/mounts/vg
_6b6ef4b6fe84ee5014a324da05a164ad/brick_b72
901778a18a8db4d632fef20da675b/brick         N/A       N/A        N       N/A  
Brick 10.70.42.79:/var/lib/heketi/mounts/vg
_dac6ff29ed295561f3d2634dcb985826/brick_93f
224e4159c6fdaf670880cba1db807/brick         49153     0          Y       98516
NFS Server on localhost                     2049      0          Y       973  
Self-heal Daemon on localhost               N/A       N/A        Y       985 
NFS Server on 10.70.43.69                   2049      0          Y       46839
Self-heal Daemon on 10.70.43.69             N/A       N/A        Y       46847
NFS Server on 10.70.42.79                   2049      0          Y       107631
Self-heal Daemon on 10.70.42.79             N/A       N/A        Y       107639
 
Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_245115537574a798a1db158b9c130b00
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.69:/var/lib/heketi/mounts/vg
_2b00b4b33dda970551ea29d4b8d05e0c/brick_22e
956ca8149a3a261d8a63adf46b6a8/brick         49351     0          Y       40883
Brick 10.70.42.79:/var/lib/heketi/mounts/vg
_7a086f15d90515cfe0398f605e1806bb/brick_84c
2bbd51ab0387ae0f3394457abafec/brick         49350     0          Y       102007
Brick 10.70.42.21:/var/lib/heketi/mounts/vg
_c12c12bb62a005c1442949d5c6069905/brick_a54
3d6b2afb64a8cb316270e99700d67/brick         N/A       N/A        N       N/A  
Brick 10.70.43.69:/var/lib/heketi/mounts/vg
_47cb9c7f6da5c38fa1f2b016831ad7d1/brick_d14
fb1fa53c7a0053e41c9d5464a16b0/brick         49352     0          Y       40902
Brick 10.70.42.79:/var/lib/heketi/mounts/vg
_dec87dc5a2f3e4161fe6c1d573894fa1/brick_da9
05f2aee642d4bd15899d74ae7dee6/brick         49351     0          Y       102026
Brick 10.70.42.21:/var/lib/heketi/mounts/vg
_181bbba243c96ee49f6b5cb6a6e5021e/brick_527
caf5fd4ad4811cd20012834a77919/brick         N/A       N/A        N       N/A  
NFS Server on localhost                     2049      0          Y       973  
Self-heal Daemon on localhost               N/A       N/A        Y       985  
NFS Server on 10.70.42.79                   2049      0          Y       107631
Self-heal Daemon on 10.70.42.79             N/A       N/A        Y       107639
NFS Server on 10.70.43.69                   2049      0          Y       46839
Self-heal Daemon on 10.70.43.69             N/A       N/A        Y       46847
 
Task Status of Volume vol_245115537574a798a1db158b9c130b00


Logs available on root.42.41:/root/sosreport-dhcp42-21.lab.eng.blr.redhat.com-20160712154443.tar.xz

Comment 2 Atin Mukherjee 2016-07-12 15:00:53 UTC
Based on the discussion I had with Anoop, it seems like glusterd sent a trigger to start the brick, but the brick process didn't come up because it failed to get trusted.glusterfs.volume-id from the brick path, which is odd.

Comment 3 Atin Mukherjee 2016-07-13 05:11:55 UTC
This looks similar to BZ 1340049

Anoop,

I believe you haven't tampered with the brick from the backend (removing the xattrs accidentally?). If you still have the setup, could you check the xattrs of the brick which failed to come up, from both the host and the container? It seems like the xattrs are not inherited for some reason while bind mounting the brick path from host to container; if this hypothesis is true, you should be able to see a difference between the xattr lists on the host and in the container.

Once that's confirmed, I think this BZ moves to the heketi layer to see how exactly the bind mount takes place during a reboot. Honestly, this doesn't look like an issue at the Gluster layer.

Thanks,
Atin

Comment 4 Anoop 2016-07-13 07:11:28 UTC

<5ef492224163/brick_bf85c53f40d125c097486f6193fabf70/brick                   
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/brick_bf85c53f40d125c097486f6193fabf70/brick
security.selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f743a733000

<ts/vg_2af243604dd82a9918105ef492224163/brick_bf85c53f40d125c097486f6193fabf70>
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/brick_bf85c53f40d125c097486f6193fabf70/brick
security.selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f743a733000

<49d5c6069905/brick_ee06d72f0d312ec1f8aa5d7926c90821/brick                   
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_c12c12bb62a005c1442949d5c6069905/brick_ee06d72f0d312ec1f8aa5d7926c90821/brick
security.selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f743a733000

Comment 5 Atin Mukherjee 2016-07-13 08:54:37 UTC
(In reply to Anoop from comment #4)
> 
> <5ef492224163/brick_bf85c53f40d125c097486f6193fabf70/brick                   
> getfattr: Removing leading '/' from absolute path names
> # file:
> var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/
> brick_bf85c53f40d125c097486f6193fabf70/brick
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> 43a733000
> 
> <ts/vg_2af243604dd82a9918105ef492224163/
> brick_bf85c53f40d125c097486f6193fabf70>
> getfattr: Removing leading '/' from absolute path names
> # file:
> var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/
> brick_bf85c53f40d125c097486f6193fabf70/brick
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> 43a733000
> 
> <49d5c6069905/brick_ee06d72f0d312ec1f8aa5d7926c90821/brick                   
> getfattr: Removing leading '/' from absolute path names
> # file:
> var/lib/heketi/mounts/vg_c12c12bb62a005c1442949d5c6069905/
> brick_ee06d72f0d312ec1f8aa5d7926c90821/brick
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> 43a733000

It's clear from this output that none of these bricks has any gluster xattrs set. Are these bricks even mounted? Could there be a case that we are hitting a race between bind mounting the bricks and bringing up the glusterd service?
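One quick way to test the mount-race hypothesis above is to check whether the brick path is itself a mount point before trusting the getfattr output: an xattr-less directory is exactly what you would see when looking at the empty mountpoint directory underneath an LV that never got mounted. A sketch (the brick path is illustrative; `mountpoint` is part of util-linux):

```shell
# Is the brick directory itself a mount point, or just a plain directory
# on the root filesystem?
check_mounted() {
    mountpoint -q "$1"
}

# Illustrative usage against a brick path from this bug (substitute your own):
#   check_mounted /var/lib/heketi/mounts/vg_.../brick_.../brick \
#       || echo "brick LV not mounted"

# Self-contained demo: "/" is always a mount point, a scratch directory is not.
check_mounted / && root_is_mount=yes || root_is_mount=no
scratch=$(mktemp -d)
check_mounted "$scratch" && scratch_is_mount=yes || scratch_is_mount=no
rmdir "$scratch"
```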

Comment 6 Atin Mukherjee 2016-07-13 10:22:51 UTC
This doesn't look like a bug at Gluster layer.

Comment 7 Luis Pabón 2016-07-13 12:51:31 UTC
(In reply to Atin Mukherjee from comment #5)
> (In reply to Anoop from comment #4)
> > 
> > <5ef492224163/brick_bf85c53f40d125c097486f6193fabf70/brick                   
> > getfattr: Removing leading '/' from absolute path names
> > # file:
> > var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/
> > brick_bf85c53f40d125c097486f6193fabf70/brick
> > security.
> > selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> > 43a733000
> > 
> > <ts/vg_2af243604dd82a9918105ef492224163/
> > brick_bf85c53f40d125c097486f6193fabf70>
> > getfattr: Removing leading '/' from absolute path names
> > # file:
> > var/lib/heketi/mounts/vg_2af243604dd82a9918105ef492224163/
> > brick_bf85c53f40d125c097486f6193fabf70/brick
> > security.
> > selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> > 43a733000
> > 
> > <49d5c6069905/brick_ee06d72f0d312ec1f8aa5d7926c90821/brick                   
> > getfattr: Removing leading '/' from absolute path names
> > # file:
> > var/lib/heketi/mounts/vg_c12c12bb62a005c1442949d5c6069905/
> > brick_ee06d72f0d312ec1f8aa5d7926c90821/brick
> > security.
> > selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> > 43a733000
> 
> Its clear that all of these bricks do not have any gluster xattrs set from
> this output. Are these bricks even mounted? Could there be a case that we
> are hitting a race between bind mounting the bricks and bringing up glusterd
> service?

Bricks are not bind mounted into the container because the container is running in Privileged mode.  They are mounted as normal.

Humble, please take a look at the container and brick information.

Comment 8 Humble Chirammal 2016-07-14 09:56:53 UTC
(In reply to Anoop from comment #4)

> 
> <49d5c6069905/brick_ee06d72f0d312ec1f8aa5d7926c90821/brick                   
> getfattr: Removing leading '/' from absolute path names
> # file:
> var/lib/heketi/mounts/vg_c12c12bb62a005c1442949d5c6069905/
> brick_ee06d72f0d312ec1f8aa5d7926c90821/brick
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a646f636b65725f7661725f6c69625f7
> 43a733000

Anoop, I have a couple of questions here:

*) IIUC, the above output was taken from the container? Can you get the same output from the host?

*) Have we tested this scenario (node reboot) with the previous RHGS container release (3.1.2)? If yes, did it pass?

Comment 9 Anoop 2016-07-14 15:53:50 UTC
Humble,

1. I do not see the brick path on the host (which is another issue).

2. This scenario was tested with the 3.1.2 container and it worked. However, in the older container we bind mounted the bricks from the host. This is the first time this test is being carried out with LVM running inside the container.

Comment 10 Humble Chirammal 2016-07-14 18:30:58 UTC
(In reply to Anoop from comment #9)

 
> 1. I do not see the brick path on the host (which is anoher issue).
> 

I don't think so. You have to mount it on the host and check the output.

> 2. This scenarios was tested with the 3.1.2 container and this worked.

IIUC, node reboot was tested with the previous container release (RHGS 3.1.2) and it worked. Correct?

> However, in the older container the we  bind mounted the bricks from host.
> This is the first time this test is being carried out with LVM running
> inside the container.

Yes, I have no doubt about that :).

Have some more questions here

*) Have we tested this behaviour with any of the previous container builds of APLO, or is this the first time this test has been run? What I am trying to isolate is whether a change in a package update caused this issue. It would be helpful if you can provide that info.

*) Is this issue consistently reproducible on all the QE setups?

*) Also, after the node reboot, did this issue hit all the bricks on the host or only some of them? The snip in c#1 is clobbered (with --More-- strings in it) and does not give the full output.

Comment 11 Anoop 2016-07-15 01:54:32 UTC
* Not tested reboots with previous builds, and I think it will be difficult (considering that we have a GA on 26th July) to go back and test this on an older build.

* This is reproducible every time on my setup.

* I see it only on the bricks of the node that was rebooted.

Comment 12 Humble Chirammal 2016-07-15 03:12:28 UTC
> * See it only on the bricks of the node that was rebooted.
Yes, but were all the bricks on this node affected? Also, can you provide the output requested in comment #8?

If it's consistently reproducible on your setup, can you run the command below before the node reboot?

#sync && echo 3 > /proc/sys/vm/drop_caches 

Once the above command succeeds, can you rerun the test?

Also, how are you testing the node reboot scenario? What specific command/action is performed?

Are these tests performed on VMs ?

Comment 13 Humble Chirammal 2016-07-15 04:46:04 UTC
[Self note]

It looks to me that the extended attributes in the 'trusted' namespace were not set/synced before the node reboot, which causes this issue; or some kind of race in this area causes this behaviour.

At present, I can think of the following isolation steps / RCA candidates.
 
1) The mount option 'user_xattr' is available for ext* filesystems.
 
      user_xattr
              Enable Extended User Attributes.  

  Equivalent XFS option: however, it looks like it is "on" by default:

       attr2 (default)
              Enable the use of version two of the extended attribute inline allocation policy. 


   Isolation:

        Does ext4 hit this issue?
        Any difference if we mount with the above option, if it's not *on*?
   
2)    Trusted extended attributes
       Trusted  extended  attributes  are  visible and accessible only to processes that have the CAP_SYS_ADMIN capability.  Attributes in this class are used to implement mechanisms in user space (i.e., outside the kernel) which keep information in extended attributes to which ordinary processes should not have access.

      NOTE: As we are running in privileged mode, the container has the "CAP_SYS_ADMIN" capability.


3)  Do any updates in 'attr' or 'libattr' cause this issue?

     NOTE: Have to reach out to the package maintainers for further input.

     Isolation: Try the previous version of the builds.

4)  Maybe not directly related; however, check the possibilities with the Gluster/AFR folks.

     https://bugzilla.redhat.com/show_bug.cgi?id=762680
     https://bugzilla.redhat.com/show_bug.cgi?id=811244
     https://www.gluster.org/pipermail/gluster-users/2013-November/015054.html
    
5)  Reproduce this issue without Heketi, that said instead of 'heketi' setting up the disk/device layout and creating volumes do it by manual steps from the container, it can be an isolation.
  

6) Verify the extended attribute's visibility from the host and container before reboot. Comment #8  output can help here.

7) Check the possibility of 'lack of syncing the FS entries' before the node reboot cause this issue?  sync and drop cache in comment#12 is given for the same reason.

Comment 14 Anoop 2016-07-15 11:19:38 UTC
I guess you have all the data you need now.

Comment 15 Luis Pabón 2016-07-16 00:36:47 UTC
Since the builds are not available yet, I ran the tests using upstream binaries:

1. GlusterFS container which has:
glusterfs-3.8.1-1.el7.x86_64
glusterfs-api-3.8.1-1.el7.x86_64
glusterfs-cli-3.8.1-1.el7.x86_64
glusterfs-geo-replication-3.8.1-1.el7.x86_64
glusterfs-libs-3.8.1-1.el7.x86_64
glusterfs-client-xlators-3.8.1-1.el7.x86_64
glusterfs-fuse-3.8.1-1.el7.x86_64
glusterfs-server-3.8.1-1.el7.x86_64

2. Heketi Container based on the same version as downstream: 2.0.5
3. CentOS Atomic 7.2.1141 and OpenShift Origin 1.2

All of this can be easily created using the Heketi/OpenShift Vagrant demo:
https://github.com/heketi/vagrant-heketi/tree/master/openshift

The result: I was able to reboot the systems 50 times, and every time the systems came back and the bricks were available. Heketi came up every single time.

This is just one data point, but it does show that it is possible to keep rebooting without issue.

Comment 16 Anoop 2016-07-16 01:44:26 UTC
I hit this issue almost every time on my setup. In fact, Humble is using my setup and he himself was able to recreate the issue even after ensuring that the xattrs are visible on the host system.
Humble, you may want to update the bug with your observations.

Comment 17 Humble Chirammal 2016-07-16 09:07:41 UTC
Here is the progress made so far. 

I was debugging the issue in Anoop's setup and here is the observation.

*) On one OSE node where we have 50 brick processes (25 volumes) running, everything came back without issues after the node reboot. The issue was *not* reproducible there.

*) However, on the second node in the same OSE cluster, where we also have 50 brick processes, some of the brick processes were DOWN after the node reboot. Further analysis of the problematic node found that the 'brick' directories in /var/lib/heketi/mounts/vg_*/brick_* had different permissions than before the reboot (the same issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1356050), and the xattrs were missing (I had made sure all the xattrs were present on the bricks before rebooting the node), which caused the brick processes to go down. It looks like the brick directories were newly created/accessed. The volumes were then 'force' started and the node rebooted again; this time the issue didn't come up. So the issue is not always reproducible, but I can confirm there is an issue that is triggered at times.

*) I made a couple of attempts to recreate the issue without heketi, and unfortunately it was not reproducible. In this recreation, '/dev' was exported to the container, the LVM creation was performed from inside the container, and the result was used for the gluster volume. A couple of volumes were created, and after a node reboot all the brick processes came up. With this result I am not saying this is an issue with Heketi, but somehow it was not triggered in the manual setup.

*) We have opened a bug against 'attr' package in RHEL (https://bugzilla.redhat.com/show_bug.cgi?id=1356810) as we were seeing missing attributes in some scenarios. 

*) Didn't get a chance to recreate the issue with any of the previous builds of APLO. Will try it out soon.

Further isolation is ongoing, and I will update the bugzilla with the status soon.

@Anoop, I am pretty sure both bugs (bz#1355801 and bz#1356050) are the same in this context. Do we need to track them separately, or is it fine to merge?

Comment 18 Anoop 2016-07-16 14:32:06 UTC
Let's track them as separate bugs for now; once we have the RCA we can merge both.

Comment 19 Humble Chirammal 2016-07-18 10:42:36 UTC
[Status update]

We went through different layers to isolate this issue: components such as attr, device mapper, the kernel, XFS and the gluster layers were examined to find the root cause. After tracing these components, it looks like a race between device mapper and the FS caused the issue. We have a container image built with a possible fix; the image is available in the internal docker registry as docker-registry.usersys.redhat.com/gluster/rhgs-test:2. Can you please verify this in one of your environments? After the verification, we will cut a brew build.

Comment 20 Humble Chirammal 2016-07-18 11:13:03 UTC
(In reply to Humble Chirammal from comment #17)

> *) However on the second node, in the same OSE cluster where we have 50
> brick processes, after node reboot some of the brick processes were DOWN.
> Further analysis was done on the problematic node and found that, the
> 'brick' directory in /var/lib/heketi/mounts/vg_*/brick_* have different
> permissions ( same as the issue reported here in this bz #
> https://bugzilla.redhat.com/show_bug.cgi?id=1356050 ) than the permissions
> before node reboot and the xattrs were missing (I had made sure all the
> xattrs were present on the brick processes before rebooting the node) on it
> which caused the brick processes to go down. It looks like the brick
> directories are newly created/accessed. Then the volumes were 'force'
> started and rebooted, this time this issue didnt come up. Looks like the
> issue is not always reproducible, but I can confirm there is an issue which
> is triggered at times.
>

One question remains here though: who is creating the 'brick' directory in the problematic case? I was told 'gluster' is *not* doing it. However, looking at the code path below, it turns out the 'index' xlator does it.

---snip--

int
index_dir_create (xlator_t *this, const char *subdir)
{
    .....
        priv = this->private;
        make_index_dir_path (priv->index_basepath, subdir, fullpath,
                             sizeof (fullpath));
        ret = sys_stat (fullpath, &st);
        if (!ret) {
                if (!S_ISDIR (st.st_mode))
                        ret = -2;
                goto out;
        }

        pathlen = strlen (fullpath);
        if ((pathlen > 1) && fullpath[pathlen - 1] == '/')
                fullpath[pathlen - 1] = '\0';
        dir = strchr (fullpath, '/');
        while (dir) {
                dir = strchr (dir + 1, '/');
                if (dir)
                        len = pathlen - strlen (dir);
                else
                        len = pathlen;
                strncpy (path, fullpath, len);
                path[len] = '\0';
                ret = sys_mkdir (path, 0600);   --> [1]
                if (ret && (errno != EEXIST))
                        goto out;
        }
        ret = 0;


--/snip-- 

Later, 'glusterd' fails to find the xattrs on these newly created directories and the brick processes go down.

@Atin, can you confirm ?

Comment 21 Pranith Kumar K 2016-07-18 16:50:23 UTC
(In reply to Humble Chirammal from comment #20)
> (In reply to Humble Chirammal from comment #17)
> 
> > *) However on the second node, in the same OSE cluster where we have 50
> > brick processes, after node reboot some of the brick processes were DOWN.
> > Further analysis was done on the problematic node and found that, the
> > 'brick' directory in /var/lib/heketi/mounts/vg_*/brick_* have different
> > permissions ( same as the issue reported here in this bz #
> > https://bugzilla.redhat.com/show_bug.cgi?id=1356050 ) than the permissions
> > before node reboot and the xattrs were missing (I had made sure all the
> > xattrs were present on the brick processes before rebooting the node) on it
> > which caused the brick processes to go down. It looks like the brick
> > directories are newly created/accessed. Then the volumes were 'force'
> > started and rebooted, this time this issue didnt come up. Looks like the
> > issue is not always reproducible, but I can confirm there is an issue which
> > is triggered at times.
> >
> 
> One question remained here though: Who is creating 'brick' directory in the
> problematic occurrence and I was told 'gluster' is *not* doing it. However,
> looking at below code path, it has found that 'index' xlator does it.
> 
> ---snip--
> 
> int
> index_dir_create (xlator_t *this, const char *subdir)
> {
>     .....
>         priv = this->private;
>         make_index_dir_path (priv->index_basepath, subdir, fullpath,
>                              sizeof (fullpath));
>         ret = sys_stat (fullpath, &st);
>         if (!ret) {
>                 if (!S_ISDIR (st.st_mode))
>                         ret = -2;
>                 goto out;
>         }
> 
>         pathlen = strlen (fullpath);
>         if ((pathlen > 1) && fullpath[pathlen - 1] == '/')
>                 fullpath[pathlen - 1] = '\0';
>         dir = strchr (fullpath, '/');
>         while (dir) {
>                 dir = strchr (dir + 1, '/');
>                 if (dir)
>                         len = pathlen - strlen (dir);
>                 else
>                         len = pathlen;
>                 strncpy (path, fullpath, len);
>                 path[len] = '\0';
>                 ret = sys_mkdir (path, 0600);   --> [1]
>                 if (ret && (errno != EEXIST))
>                         goto out;
>         }
>         ret = 0;
> 
> 
> --/snip-- 
> 
> Later 'glusterd' fails to find the 'xattrs' on these 'newly' created
> directories and the brick processes goes down.
> 
> @Atin, can you confirm ?

@Humble,
        index_dir_create() does mkdir -p <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did we get into a state where <brick-path> itself is missing? Do we know?

Pranith

Comment 22 Atin Mukherjee 2016-07-18 17:01:11 UTC
@Humble - honestly, I was not aware that the index xlator does a mkdir -p. But as Pranith pointed out, the complete brick path gets created only when it is not present. So my apologies if this wasted some of your effort digging into the problem.

Comment 23 Humble Chirammal 2016-07-18 18:10:41 UTC
(In reply to Pranith Kumar K from comment #21)
> (In reply to Humble Chirammal from comment #20)
> > (In reply to Humble Chirammal from comment #17)
> > 
......
> 
> @Humble,
>         index_dir_create() does mkdir -p
> <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did
> we get into a state where <brick-path> itself is missing? Do we know?
> 

@Pranith, thanks for confirming. I have a question here: do we gain anything by creating this path if it's not present? Because the 'posix' xlator later comes in, complains that there are no xattrs present on this newly created path, and fails the brick process anyway. Any thoughts, or am I missing something here?

Comment 24 Humble Chirammal 2016-07-18 18:20:18 UTC
(In reply to Atin Mukherjee from comment #22)
> @Humble - Honestly I was not aware of that index xlator does a mkdir -p .
> But as Pranith pointed out complete brickpath gets created only when the
> brick path is not present. So my apologies to you if this has wasted some of
> your effort to dig into the problem.

@Atin, as there are too many layers involved, it was very difficult to isolate this issue, and this behaviour really confused the isolation process. Anyway, no worries, we all learned some hidden facts :).

Comment 25 Pranith Kumar K 2016-07-19 03:05:46 UTC
(In reply to Humble Chirammal from comment #23)
> (In reply to Pranith Kumar K from comment #21)
> > (In reply to Humble Chirammal from comment #20)
> > > (In reply to Humble Chirammal from comment #17)
> > > 
> ......
> > 
> > @Humble,
> >         index_dir_create() does mkdir -p
> > <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did
> > we get into a state where <brick-path> itself is missing? Do we know?
> > 
> 
> @Pranith, Thanks for confirming. I have a question here, Are we getting any
> advantage by creating this path if its not present? because, later the
> 'posix' xlator comes in and complaint there is no 'xattrs' present on this
> newly created path and fail the brick process anyway. Any thoughts or Am I
> missing something here?

Index xlator is designed to keep the indices on any hard disk, not necessarily under .glusterfs; that is just the default location. So to make sure it works, we need the path to be created.

@Atin, @Humble, the design when all these decisions were taken is that by the time you create a volume/add-bricks/replace-bricks, the brick path with the relevant extended attributes is already in place, i.e. glusterd makes sure to create the paths and set the extended attributes. Did this decision change in the recent past? Why is the index xlator even getting into a state where it has to create a brick path? Why doesn't it already exist by the time we do volume start?
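The failure sequence discussed in the last few comments can be sketched end to end: if the LV never mounts, the index xlator's mkdir -p silently recreates the brick path as plain directories, and the volume-id check then fails on the bare directory. A self-contained sketch under that assumption (paths are illustrative; getfattr here merely stands in for the posix xlator's startup check):

```shell
# Simulate a brick whose LV mount never came up after reboot:
root=$(mktemp -d)
brick="$root/bricks/vol0/brick"            # stands in for the unmounted mountpoint

# What index_dir_create() effectively does when the path is missing:
mkdir -p "$brick/.glusterfs/indices/xattrop"
[ -d "$brick/.glusterfs/indices/xattrop" ] && recreated=yes || recreated=no

# The recreated path exists but carries none of the gluster xattrs, so the
# volume-id check fails and the brick process refuses to start:
getfattr -n trusted.glusterfs.volume-id "$brick" 2>/dev/null \
    && brick_ok=yes || brick_ok=no
rm -rf "$root"
```

This is why the bricks looked "tampered with" from the container: the directories were genuinely fresh, not stripped of their xattrs.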

Comment 26 Atin Mukherjee 2016-07-19 03:52:39 UTC
(In reply to Pranith Kumar K from comment #25)
> (In reply to Humble Chirammal from comment #23)
> > (In reply to Pranith Kumar K from comment #21)
> > > (In reply to Humble Chirammal from comment #20)
> > > > (In reply to Humble Chirammal from comment #17)
> > > > 
> > ......
> > > 
> > > @Humble,
> > >         index_dir_create() does mkdir -p
> > > <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did
> > > we get into a state where <brick-path> itself is missing? Do we know?
> > > 
> > 
> > @Pranith, Thanks for confirming. I have a question here, Are we getting any
> > advantage by creating this path if its not present? because, later the
> > 'posix' xlator comes in and complaint there is no 'xattrs' present on this
> > newly created path and fail the brick process anyway. Any thoughts or Am I
> > missing something here?
> 
> Index xlator is designed to keep the indices on any hard-disk. Not
> necessarily in the .glusterfs. This is the default place where it is kept.
> So to make sure it works, we need the path to be created.
> 
> @Atin, @Humble, The design when all these decisions were taken is that by
> the time you create a volume/add-bricks/replace-bricks you already have
> the brick path with the relevant extended attributes in place, i.e.
> glusterd makes sure to create the paths and set the extended attributes.
> Did this decision change in the recent past? Why is the index xlator even
> getting into a state where it has to create a brick path? Why doesn't it
> already exist by the time we do volume start?

Refer to Comment 19: because of the race between device mapper and the FS layer on node reboot, the bricks are not mounted, as I understand it.
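A quick way to spot that state after a reboot is to compare heketi's custom fstab against /proc/mounts. A hedged sketch (the fstab location below is an assumption for illustration):

```shell
# Sketch: after a reboot, list mount points that appear in heketi's custom
# fstab but are missing from /proc/mounts. The FSTAB default below is an
# assumed location for illustration.
FSTAB="${FSTAB:-/var/lib/heketi/fstab}"

UNMOUNTED=""
if [ -r "$FSTAB" ]; then
    while read -r dev mnt rest; do
        # Skip blank lines and comments.
        case "$dev" in ''|'#'*) continue ;; esac
        # An fstab mount point absent from /proc/mounts is a brick that
        # lost the device-mapper/FS race at boot.
        grep -q " $mnt " /proc/mounts || UNMOUNTED="$UNMOUNTED $mnt"
    done < "$FSTAB"
fi
echo "unmounted bricks:${UNMOUNTED:- none}"
```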

Comment 27 Pranith Kumar K 2016-07-19 03:57:43 UTC
(In reply to Atin Mukherjee from comment #26)
> (In reply to Pranith Kumar K from comment #25)
> > (In reply to Humble Chirammal from comment #23)
> > > (In reply to Pranith Kumar K from comment #21)
> > > > (In reply to Humble Chirammal from comment #20)
> > > > > (In reply to Humble Chirammal from comment #17)
> > > > > 
> > > ......
> > > > 
> > > > @Humble,
> > > >         index_dir_create() does mkdir -p
> > > > <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did
> > > > we get into a state where <brick-path> itself is missing? Do we know?
> > > > 
> > > 
> > > @Pranith, thanks for confirming. I have a question here: are we
> > > getting any advantage by creating this path if it is not present?
> > > Because later the 'posix' xlator comes in, complains that no xattrs
> > > are present on this newly created path, and fails the brick process
> > > anyway. Any thoughts, or am I missing something here?
> > 
> > The index xlator is designed to keep the indices on any disk, not
> > necessarily under .glusterfs; that is just the default location. So to
> > make sure it works, we need the path to be created.
> > 
> > @Atin, @Humble, The design when all these decisions were taken is that
> > by the time you create a volume/add-bricks/replace-bricks you already
> > have the brick path with the relevant extended attributes in place,
> > i.e. glusterd makes sure to create the paths and set the extended
> > attributes. Did this decision change in the recent past? Why is the
> > index xlator even getting into a state where it has to create a brick
> > path? Why doesn't it already exist by the time we do volume start?
> 
> Refer to Comment 19: because of the race between device mapper and the
> FS layer on node reboot, the bricks are not mounted, as I understand it.

Thanks for this Atin, I guess there is nothing more to be done by gluster at this point?

Comment 28 Atin Mukherjee 2016-07-19 04:17:58 UTC
(In reply to Pranith Kumar K from comment #27)
> (In reply to Atin Mukherjee from comment #26)
> > (In reply to Pranith Kumar K from comment #25)
> > > (In reply to Humble Chirammal from comment #23)
> > > > (In reply to Pranith Kumar K from comment #21)
> > > > > (In reply to Humble Chirammal from comment #20)
> > > > > > (In reply to Humble Chirammal from comment #17)
> > > > > > 
> > > > ......
> > > > > 
> > > > > @Humble,
> > > > >         index_dir_create() does mkdir -p
> > > > > <brick-path>/.glusterfs/indices/xattrop if the path doesn't exist. How did
> > > > > we get into a state where <brick-path> itself is missing? Do we know?
> > > > > 
> > > > 
> > > > @Pranith, thanks for confirming. I have a question here: are we
> > > > getting any advantage by creating this path if it is not present?
> > > > Because later the 'posix' xlator comes in, complains that no xattrs
> > > > are present on this newly created path, and fails the brick process
> > > > anyway. Any thoughts, or am I missing something here?
> > > 
> > > The index xlator is designed to keep the indices on any disk, not
> > > necessarily under .glusterfs; that is just the default location. So
> > > to make sure it works, we need the path to be created.
> > > 
> > > @Atin, @Humble, The design when all these decisions were taken is
> > > that by the time you create a volume/add-bricks/replace-bricks you
> > > already have the brick path with the relevant extended attributes in
> > > place, i.e. glusterd makes sure to create the paths and set the
> > > extended attributes. Did this decision change in the recent past?
> > > Why is the index xlator even getting into a state where it has to
> > > create a brick path? Why doesn't it already exist by the time we do
> > > volume start?
> > 
> > Refer to Comment 19: because of the race between device mapper and the
> > FS layer on node reboot, the bricks are not mounted, as I understand it.
> 
> Thanks for this Atin, I guess there is nothing more to be done by gluster at
> this point?

That's right Pranith.

Comment 30 Humble Chirammal 2016-07-20 13:43:34 UTC
> (In reply to Atin Mukherjee from comment #26)
> > (In reply to Pranith Kumar K from comment #25)
> > > Refer to Comment 19: because of the race between device mapper and
> > > the FS layer on node reboot, the bricks are not mounted, as I
> > > understand it.
> > 
> > Thanks for this Atin, I guess there is nothing more to be done by gluster at
> > this point?
> 
> That's right Pranith.

As discussed, we need an enhancement in the gluster index xlator to check whether the indices are on the brick path or on some other path; if it is the brick path, it is better *not* to create the directories. I will open a new bug for this.

Comment 31 Humble Chirammal 2016-07-20 13:47:45 UTC
At present heketi writes the "/dev/vg/lv" path in the custom fstab, and it looks like this is causing the weird behaviour on node reboot. We have to use the "/dev/mapper/vg-lv" path in fstab to at least make sure device mapper is in charge. More details are available here: http://post-office.corp.redhat.com/archives/rhs-containers/2016-July/msg00111.html
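For illustration, the two device paths refer to the same LV; a sketch of the conversion, using VG/LV names from this setup (caveat: real device-mapper names double any hyphen that is part of a VG or LV name; these names only use underscores, so plain joining works):

```shell
# Sketch: convert the /dev/<vg>/<lv> form heketi wrote into the
# /dev/mapper/<vg>-<lv> form that keeps device mapper in charge at boot.
OLD="/dev/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a/brick_61dee9e630ef0646ae4280684ac1e608"

VG=$(echo "$OLD" | cut -d/ -f3)
LV=$(echo "$OLD" | cut -d/ -f4)
NEW="/dev/mapper/${VG}-${LV}"
echo "$NEW"
```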

Comment 32 Humble Chirammal 2016-07-20 19:08:35 UTC
Created attachment 1182234 [details]
Node reboot output

Comment 33 Humble Chirammal 2016-07-20 19:14:56 UTC
(In reply to Humble Chirammal from comment #32)
> Created attachment 1182234 [details]
> Node reboot output

In a couple of environments, we changed the mount entries to "/dev/mapper" and, with the fixed image, tried to reproduce this issue. The result from one of the setups can be found at c#32: 78 bricks were created inside the container, and after a reboot all of them are up and running. In both of our dev environments the result looks positive. However, it would be appreciated if QE could also provide a problematic setup to test this out. Anoop, do you have any setup where I can check the result quickly?

Comment 34 Humble Chirammal 2016-07-21 09:52:47 UTC
(In reply to Anoop from comment #11)
> * Not tested reboots with previous builds, and I think it will be
> difficult (considering that we have a GA on 26th July) to go back and
> test this on an older build.
> 
> * This is reproducible every time on my setup.
> 
> * I see it only on the bricks of the node that was rebooted.

Anoop, in our last release (RHGS 3.1.2) we created a volume snapshot from the container, which internally created an LV on the system, and that was tested. Have we tested a node reboot with this scenario?

Comment 38 Humble Chirammal 2016-07-21 15:55:41 UTC
@Neha, did you get a chance to verify that the brick is mounted, the xattrs are present, and there is no data loss?

Comment 39 Neha 2016-07-21 16:09:48 UTC
(In reply to Humble Chirammal from comment #38)
> @Neha, did you get a chance to verify that the brick is mounted, the
> xattrs are present, and there is no data loss?

The brick is mounted:


df -Th | grep brick_61dee9e630ef0646ae4280684ac1e608
/dev/mapper/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a-brick_61dee9e630ef0646ae4280684ac1e608 xfs       5.0G  592K  5.0G   1% /var/lib/heketi/mounts/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a/brick_61dee9e630ef0646ae4280684ac1e608

xattrs:

getfattr -d -m . -e hex /var/lib/heketi/mounts/vg_7c8a1389ab3b08f27e1d>
getfattr: Removing leading '/' from absolute path names
# file: var/lib/heketi/mounts/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a/brick_61dee9e630ef0646ae4280684ac1e608/brick/
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
trusted.glusterfs.volume-id=0x3d3bd68c9e974c96bc4cf788b3490206

There was no data in this volume.

Comment 40 Humble Chirammal 2016-07-22 06:13:07 UTC
@Neha, thanks for this information. I feel that even if there had been data on the brick, it would have persisted, because the xattrs are perfectly intact.

I went through the log of this brick to check why it failed. 

[Analysis]
-----------------snip----------

[2016-07-21 12:55:42.798650] W [MSGID: 101105] [gfdb_sqlite3.h:239:gfdb_set_sql_params] 0-vol_708e587144fcf31fe317d350e5fa1cc2-changetimerecorder: Failed to retrieve sql-db-autovacuum from params.Assigning default value: none
[2016-07-21 12:55:42.799015] I [trash.c:2369:init] 0-vol_708e587144fcf31fe317d350e5fa1cc2-trash: no option specified for 'eliminate', using NULL

[2016-07-21 12:55:42.799158] C [MSGID: 113081] [posix.c:6755:init] 0-vol_708e587144fcf31fe317d350e5fa1cc2-posix: Extended attribute not supported, exiting.     ------------------------------------> [1]

[2016-07-21 12:55:42.799197] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-vol_708e587144fcf31fe317d350e5fa1cc2-posix: Initialization of volume 'vol_708e587144fcf31fe317d350e5fa1cc2-posix' failed, review your volfile again
[2016-07-21 12:55:42.799206] E [graph.c:322:glusterfs_graph_init] 0-vol_708e587144fcf31fe317d350e5fa1cc2-posix: initializing translator failed
[2016-07-21 12:55:42.799212] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed

------------/snip-------

So, the 'posix' xlator somehow failed to detect extended attribute support on this path.

Even though there are different possibilities [3], we were in the code path below:

The posix xlator tries to set the extended attribute 'trusted.glusterfs.test' on this path and failed.

     
        op_ret = sys_lsetxattr (dir_data->data,
                                "trusted.glusterfs.test", "working", 8, 0);
        if (op_ret != -1) {
                sys_lremovexattr (dir_data->data, "trusted.glusterfs.test");
        } else {
                tmp_data = dict_get (this->options,
                                     "mandate-attribute");
                if (tmp_data) {
                        if (gf_string2boolean (tmp_data->data,
                                               &tmp_bool) == -1) {
                                gf_msg (this->name, GF_LOG_ERROR, 0,
                                        P_MSG_INVALID_OPTION,
                                        "wrong option provided for key "
                                        "\"mandate-attribute\"");
                                ret = -1;
                                goto out;
                        }
                        if (!tmp_bool) {
                                gf_msg (this->name, GF_LOG_WARNING, 0,
                                        P_MSG_XATTR_NOTSUP,
                                        "Extended attribute not supported, "
                                        "starting as per option");
                        } else {
                                gf_msg (this->name, GF_LOG_CRITICAL, 0,
                                        P_MSG_XATTR_NOTSUP,
                                        "Extended attribute not supported, "
                                        "exiting."); /* =============> [2] */
                                ret = -1;
                                goto out;
                        }
                } else {
                        gf_msg (this->name, GF_LOG_CRITICAL, 0,
                                P_MSG_XATTR_NOTSUP,
                                "Extended attribute not supported, exiting.");
                        ret = -1;
                        goto out;
                }
        }

[2] Unfortunately, the errno from sys_lsetxattr() is not logged in this code path, so from the log alone we cannot tell which failure mode we hit. It could be a race in any layer.
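The same probe can be reproduced from the shell with the error text visible. A sketch (it uses the user.* namespace on a temp directory as a stand-in, since trusted.* xattrs require root; on a real brick the equivalent would be `setfattr -n trusted.glusterfs.test -v working <brick-path>`):

```shell
# Sketch: the same probe the posix xlator performs, but with the error
# text visible. Uses user.* on a temp directory as a stand-in, since
# trusted.* xattrs require root.
D=$(mktemp -d)
if ERR=$(setfattr -n user.glusterfs.test -v working "$D" 2>&1); then
    # Supported: clean up the test attribute, as the xlator does.
    setfattr -x user.glusterfs.test "$D"
    PROBE="supported"
else
    # $ERR carries the reason (e.g. "Read-only file system" or
    # "Operation not supported") -- the detail the brick log lacks.
    PROBE="not-supported: $ERR"
fi
echo "xattr probe: $PROBE"
rmdir "$D"
```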

[3] Possible errors from sys_lsetxattr():

     *) EDQUOT  - Disk quota limits meant that there was insufficient space
                  remaining to store the extended attribute.

     *) EEXIST  - XATTR_CREATE was specified, and the attribute already
                  exists.

     *) ENOATTR - XATTR_REPLACE was specified, and the attribute does not
                  exist. (ENOATTR is defined as a synonym for ENODATA in
                  <attr/xattr.h>.)

     *) ENOSPC  - There is insufficient space remaining to store the
                  extended attribute.

     *) ENOTSUP - The namespace prefix of the name is not valid, or
                  extended attributes are not supported by the filesystem
                  or are disabled.

IIUC, in this setup the EDQUOT, ENOSPC, ENOTSUP, and EEXIST (with XATTR_CREATE) possibilities do not apply.

[Additional Details]

The brick path was mounted before gluster was started, and we can see that the xattrs were present before the gluster start.

[Work Around]

# Running 'gluster volume start <volname> force' on this volume, or a reboot, should bring this back.

Maybe someone from the Gluster posix team can check this scenario.

Comment 41 Atin Mukherjee 2016-07-22 06:23:48 UTC
Pranith,

Can you check this?

Comment 42 Neha 2016-07-22 06:47:58 UTC
Just to clarify, there was no data present on the volume before the reboot either. I tried a force start; the brick process is still down.

Comment 45 Humble Chirammal 2016-07-22 10:39:02 UTC
Some more analysis was performed and here is the summary:

It looks like even though the data is intact on this brick, for some reason at boot time this brick was mounted READONLY while the remaining 49 bricks were mounted READWRITE.


--snip--
/dev/mapper/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a-brick_61dee9e630ef0646ae4280684ac1e608 on /var/lib/heketi/mounts/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a/brick_61dee9e630ef0646ae4280684ac1e608 type xfs (ro,noatime,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
--/snip--

I feel even a remount with the 'rw' flag (mount -o remount,rw /dev/mapper/vg_7c8a1389ab3b08f27e1d6fd5c9e3027a-brick_61dee9e630ef0646ae4280684ac1e608) can bring this back.
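To generalize that across all bricks, one could sweep /proc/mounts for read-only heketi brick mounts and remount each. A sketch (the mount-point prefix matches this setup; the remount itself needs root):

```shell
# Sketch: find heketi brick mounts that came up read-only and remount
# each one read-write.
RO_BRICKS=$(awk '$2 ~ /^\/var\/lib\/heketi\/mounts\// && $4 ~ /(^|,)ro(,|$)/ {print $2}' /proc/mounts)

for mnt in $RO_BRICKS; do
    echo "remounting $mnt read-write"
    mount -o remount,rw "$mnt"
done
echo "read-only bricks found: $(echo "$RO_BRICKS" | wc -w)"
```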

@Neha, can you perform the remount as shown above and share the result?

Comment 53 Red Hat Bugzilla 2023-09-14 03:27:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days