Bug 1355689

Summary: heketi service failed to start if two nodes are down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Neha <nerawat>
Component: heketi
Assignee: Michael Adam <madam>
Status: CLOSED NOTABUG
QA Contact: Bala Konda Reddy M <bmekala>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: abhishku, annair, bkunal, bmekala, hchiramm, jkaur, jmulligan, kramdoss, madam, nerawat, pprakash, rcyriac, rreddy, rtalur, sanandpa, sankarshan, ssaha, vinug
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1359775 (view as bug list)
Environment:
Last Closed: 2018-09-19 17:18:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1432048
Bug Blocks: 1573420, 1622458

Description Neha 2016-07-12 09:12:19 UTC
After bringing down one node, the heketi pod started on a new node and worked fine. But in the case of a two-node failure, the heketi pod is not able to start on the third node.

heketi-1-kjif9                                           0/1       CrashLoopBackOff   1          3m        <node>


docker logs daf6805ee867
Heketi 2.0.2
[heketi] INFO 2016/06/23 08:33:47 Loaded kubernetes executor
[heketi] ERROR 2016/06/23 08:33:47 /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write /var/lib/heketi/heketi.db: read-only file system
ERROR: Unable to start application

Comment 2 Luis Pabón 2016-07-13 13:09:25 UTC
Humble, this could be a dup of BZ 1355801.

Comment 3 Humble Chirammal 2016-07-14 11:21:38 UTC
(In reply to Luis Pabón from comment #2)
> Humble this could be a dup of BZ 1355801

Here the setup is replica 3 and 2 nodes are down, so the volume moves to READONLY. That is expected, isn't it?

Comment 4 Neha 2016-07-14 12:28:01 UTC
AFAIK this is expected behaviour; this bug is filed for the heketi service failure.

[heketi] ERROR 2016/06/23 08:33:47 /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write /var/lib/heketi/heketi.db: read-only file system
ERROR: Unable to start application

heketi-1-kjif9                                           0/1       CrashLoopBackOff   1          3m        <node>

Comment 5 Humble Chirammal 2016-07-14 18:34:36 UTC
(In reply to Neha from comment #4)
> AFAIK this is expected behaviour, here bug is filed for heketi service
> failure. 
>

Thanks for confirming!
 
> [heketi] ERROR 2016/06/23 08:33:47
> /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write
> /var/lib/heketi/heketi.db: read-only file system
> ERROR: Unable to start application
> 
> heketi-1-kjif9                                           0/1      
> CrashLoopBackOff   1          3m        <node>

When the DB file is not writable, the service won't start. That too is expected, isn't it?

Comment 6 Neha 2016-07-15 07:39:01 UTC
(In reply to Humble Chirammal from comment #5)
> (In reply to Neha from comment #4)
> > AFAIK this is expected behaviour, here bug is filed for heketi service
> > failure. 
> >
> 
> Thanks for confirming!
>  
> > [heketi] ERROR 2016/06/23 08:33:47
> > /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write
> > /var/lib/heketi/heketi.db: read-only file system
> > ERROR: Unable to start application
> > 
> > heketi-1-kjif9                                           0/1      
> > CrashLoopBackOff   1          3m        <node>
> 
> When the DB file is not writable, the service wont start. That too expected,
> Isnt it ?

Yes, that is self-explanatory.

This bug report is filed for "BZ 1341943 - Database needs to be placed in a reliable persistent storage in case of failure".

The question here is: what is the expectation with respect to "database reliability"?

For replica 3, my understanding is that it should sustain 2 node failures. Correct me if I am wrong here.

In this case it can only sustain one node failure.

How do we restore the setup in case of a two-node failure?

How long will it try to restart the pod (CrashLoopBackOff) if the nodes are down for a longer period of time?

Comment 7 Humble Chirammal 2016-07-15 07:57:51 UTC
(In reply to Neha from comment #6)
> (In reply to Humble Chirammal from comment #5)
> > (In reply to Neha from comment #4)
> > > AFAIK this is expected behaviour, here bug is filed for heketi service
> > > failure. 
> > >
> > 
> > Thanks for confirming!
> >  
> > > [heketi] ERROR 2016/06/23 08:33:47
> > > /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write
> > > /var/lib/heketi/heketi.db: read-only file system
> > > ERROR: Unable to start application
> > > 
> > > heketi-1-kjif9                                           0/1      
> > > CrashLoopBackOff   1          3m        <node>
> > 
> > When the DB file is not writable, the service wont start. That too expected,
> > Isnt it ?
> 
> Yes that is self explanatory.
> 
> This bug report is filed for "BZ 1341943 - Database needs to be placed in a
> reliable persistent storage in case of failure"

That's a different discussion altogether. I don't think that is something we can fix with this bug report.

> 
> The question here is what is the expectation here with respect to "database
> reliability"?
>

As mentioned above, it has to be answered in a different bug/thread.
 
> For replica 3 my understanding is it should sustain 2 node failures. Correct
> me if I am wrong here.

Please note that the volume is not plain 'replica 3'; it is distributed-replicate (2 x 3). As discussed, the volume is expected to move to READONLY mode when the quorum needed to serve the file is not met.


> 
> In this case it can only sustain one node failure.
> 
> How to restore setup back in case of two node failure? 
> 
> How long it will try to restart pod (CrashLoopBackOff) in case if nodes are
> down for longer period of time ?

Neha, as mentioned above, please open a discussion or question bug for these. As per our discussion I am inclined to close this bug. Please let me know your thoughts and we will proceed accordingly.

Comment 8 Neha 2016-07-15 11:04:11 UTC
(In reply to Humble Chirammal from comment #7)
> (In reply to Neha from comment #6)
> > (In reply to Humble Chirammal from comment #5)
> > > (In reply to Neha from comment #4)
> > > > AFAIK this is expected behaviour, here bug is filed for heketi service
> > > > failure. 
> > > >
> > > 
> > > Thanks for confirming!
> > >  
> > > > [heketi] ERROR 2016/06/23 08:33:47
> > > > /src/github.com/heketi/heketi/apps/glusterfs/app.go:149: write
> > > > /var/lib/heketi/heketi.db: read-only file system
> > > > ERROR: Unable to start application
> > > > 
> > > > heketi-1-kjif9                                           0/1      
> > > > CrashLoopBackOff   1          3m        <node>
> > > 
> > > When the DB file is not writable, the service wont start. That too expected,
> > > Isnt it ?
> > 
> > Yes that is self explanatory.
> > 
> > This bug report is filed for "BZ 1341943 - Database needs to be placed in a
> > reliable persistent storage in case of failure"
> 
> Thats a different discussion altogether. I dont think that is something we
> can fix it by this bug report.
> 

> > 
> > The question here is what is the expectation here with respect to "database
> > reliability"?
> >
> 
> As mentioned above it has to be answered a different bug/thread.
>  
> > For replica 3 my understanding is it should sustain 2 node failures. Correct
> > me if I am wrong here.
> 
> Please note that the volume is not 'replica 3', its distributed replica (
> 2x3). As discussed this is expected to move a volume to READONLY mode when
> the quorum is not met to serve the file.

Yes, that is correct, it is [2 x 3]. I don't think the behaviour would change even if it were a "plain replica" volume.

> 
> 
> > 
> > In this case it can only sustain one node failure.
> > 
> > How to restore setup back in case of two node failure? 
> > 
> > How long it will try to restart pod (CrashLoopBackOff) in case if nodes are
> > down for longer period of time ?
> 
> Neha as mentioned above please open a discussion or question bug for these.
> As per our discussion I am inclined to close this bug. Please let me know
> your thought. We will proceed accordingly.

I believe we can still track this here rather than opening a new bug. This is expected behaviour from the gluster point of view, but as a solution for making the heketi db reliable it is a problem. There is already a parent bug for that: #1341943.

Comment 9 Humble Chirammal 2016-07-16 12:05:28 UTC

> 
> I believe we can still track this here rather than opening a new bug. This
> is expected behaviour from gluster point of view but as a solution to make
> heketi db reliable, its a problem. Already there is parent bug for that
> #1341943

@neha, if I am correct, we are in agreement that the FS going READONLY is an expected result from GLUSTER. If you still have doubts, please feel free to discuss.

@Luis, it looks to me that Neha is trying to find answers to the questions below.

*) What is the expectation with respect to "database reliability"?

*) How to restore the setup in case of a two-node failure?

*) How long will it try to restart the pod (CrashLoopBackOff) if the nodes are down for a longer period of time?

Comment 10 Luis Pabón 2016-07-19 02:36:43 UTC
(In reply to Humble Chirammal from comment #9)
> 
> > 
> > I believe we can still track this here rather than opening a new bug. This
> > is expected behaviour from gluster point of view but as a solution to make
> > heketi db reliable, its a problem. Already there is parent bug for that
> > #1341943
> 
> @neha, iic, we are in agreement on FS goes READONLY scenario which is an
> expected result from GLUSTER. If you still have doubt please feel free to
> discuss
> .
> @Luis, It looks to me that Neha is trying to find answers for below.
> 
> *) The question here is what is the expectation here with respect to
> "database
>  reliability"?
The expectation is that it is as reliable as any GlusterFS volume.

> 
> *) How to restore setup back in case of two node failure?
Mount and copy the files out of the volume.  Delete the volume and create a new one with the same name.  Then mount and copy back.

> 
> *) How long it will try to restart pod (CrashLoopBackOff) in case if nodes
> are down for longer period of time ?
That depends on the algorithm in Kubernetes.
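
For reference, the kubelet's restart back-off is commonly documented as starting at roughly 10 seconds, doubling after each failed restart, and capping at 5 minutes (it resets once the container has been running cleanly for a while); the exact numbers depend on the Kubernetes version. A rough sketch of that schedule:

package main

import (
	"fmt"
	"time"
)

// Rough sketch of the commonly documented CrashLoopBackOff schedule:
// start at 10s, double after every failed restart, cap at 5 minutes.
// Exact behaviour depends on the Kubernetes version in use.
func main() {
	delay := 10 * time.Second
	maxDelay := 5 * time.Minute
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("restart attempt %d: wait %v before retrying\n", attempt, delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}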

Comment 11 Luis Pabón 2016-07-19 04:00:10 UTC
Created a patch upstream to allow startup in read-only mode:
https://github.com/heketi/heketi/issues/435

All read-only commands, like listings and backup, will work.
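
As a rough illustration only (hypothetical names, not heketi's actual handlers), write paths could be refused based on the dbReadOnly flag set in the snippet quoted later in comment 30, while read-only paths keep working:

package app

import "errors"

// ErrReadOnlyMode is a hypothetical error for write requests that
// arrive while the database could only be opened read-only.
var ErrReadOnlyMode = errors.New("database is in read-only mode")

// App is a trimmed-down stand-in for the application state; only the
// dbReadOnly flag from the quoted snippet is modelled here.
type App struct {
	dbReadOnly bool
}

// CreateSomething stands in for any write operation: it is refused in
// read-only mode, whereas listing/backup style reads skip this check.
func (a *App) CreateSomething() error {
	if a.dbReadOnly {
		return ErrReadOnlyMode
	}
	// ... normal write path against the database would go here.
	return nil
}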

Comment 21 Luis Pabón 2016-10-05 21:51:09 UTC
Please retest.

Comment 27 krishnaram Karthick 2016-12-28 05:16:27 UTC
The issue is still seen with heketi-client-3.1.0-10.el7rhgs.x86_64.

1) heketi pod was configured on host 'dhcp47-110'. 

[root@dhcp46-2 ~]# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-7buxc                  1/1       Running   0          1d        10.70.47.110   dhcp47-110.lab.eng.blr.redhat.com
glusterfs-qt5fx                  1/1       Running   0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com
glusterfs-x9b1n                  1/1       Running   0          1d        10.70.46.224   dhcp46-224.lab.eng.blr.redhat.com
heketi-1-cljgb                   1/1       Running   0          1d        10.128.0.8     dhcp47-110.lab.eng.blr.redhat.com
storage-project-router-1-hw98o   1/1       Running   0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com
[root@dhcp46-2 ~]# 

2) node 'dhcp47-110' was shut down, heketi spun up on 'dhcp46-224'

[root@dhcp46-2 ~]# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-7buxc                  1/1       Running   0          1d        10.70.47.110   dhcp47-110.lab.eng.blr.redhat.com
glusterfs-qt5fx                  1/1       Running   0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com
glusterfs-x9b1n                  1/1       Running   0          1d        10.70.46.224   dhcp46-224.lab.eng.blr.redhat.com
heketi-1-vdxdt                   1/1       Running   0          53s       10.131.0.10    dhcp46-224.lab.eng.blr.redhat.com
storage-project-router-1-hw98o   1/1       Running   0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com
[root@dhcp46-2 ~]# 

3) shut down 'dhcp46-224', expecting the heketi service to come up on 'dhcp47-112'. The heketi pod failed to start on dhcp47-112.

[root@dhcp46-2 ~]# oc get pods -o wide
NAME                             READY     STATUS             RESTARTS   AGE       IP             NODE
glusterfs-7buxc                  1/1       Running            0          1d        10.70.47.110   dhcp47-110.lab.eng.blr.redhat.com
glusterfs-qt5fx                  1/1       Running            0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com
glusterfs-x9b1n                  1/1       Running            0          1d        10.70.46.224   dhcp46-224.lab.eng.blr.redhat.com
heketi-1-zu69v                   0/1       CrashLoopBackOff   5          6m        10.130.0.6     dhcp47-112.lab.eng.blr.redhat.com
storage-project-router-1-hw98o   1/1       Running            0          1d        10.70.47.112   dhcp47-112.lab.eng.blr.redhat.com

Moving the bug to 'Assigned' based on the above test.

Comment 28 Humble Chirammal 2016-12-28 06:39:40 UTC
*) volume info of the heketidbstorage volume.
*) status (e.g. RO) of the volume.
*) describe output of the heketi pod.
*) Did kube try to start a new pod?
*) What if you 'delete' heketi-1-zu69v, does it start a new heketi pod?
*) What is recorded in the heketi logs?

Without this information it is very difficult to proceed.

Comment 30 Humble Chirammal 2016-12-28 10:30:23 UTC
FYI:

c#27 talks about 'heketi-1-zu69v':
heketi-1-zu69v                   0/1       CrashLoopBackOff

however, c#29 is from a new iteration and the pod name is 'heketi-1-4bx8o'.



It would be better if you could also include the 'ls -ld' output of /var/lib/heketi and the 'ls -l' output of the 'heketi.db' file.

At a glance:

I am asking for this information for the reasons below.

This is supposed to be fixed with https://github.com/heketi/heketi/pull/436/

I would expect the error message added in https://github.com/heketi/heketi/pull/436/files#diff-f394c40886f16cc9392ab7f130752b8bR106 to show up in the heketi logs when it tries to open the db in READONLY mode.

If that open had failed, 'Unable to open database:' would have been logged; neither of these messages is present in the logs:

	app.db, err = bolt.Open(dbfilename, 0600, &bolt.Options{Timeout: 3 * time.Second})
	if err != nil {
		logger.Warning("Unable to open database.  Retrying using read only mode")

		// Try opening as read-only
		app.db, err = bolt.Open(dbfilename, 0666, &bolt.Options{
			ReadOnly: true,
		})
		if err != nil {
			logger.LogError("Unable to open database: %v", err)
			return nil
		}
		app.dbReadOnly = true
	}

Comment 31 krishnaram Karthick 2016-12-28 10:32:08 UTC
Other necessary info,

# gluster v list 
heketidbstorage
vol_476e104dacc88c57855b958765e5e20d
vol_5554335ecd62ede9d278b5b5c5fd133a
vol_6407e1b9266794d33f302d572e0fe63c
vol_d1c17bd998262085e2078893501045db
vol_dd4d7b53f504019c07cfa31439513444
vol_f8e1da8579b1d5bfcdbe1ca6fac1245e
sh-4.2# gluster v info heketidbstorage
 
Volume Name: heketidbstorage
Type: Replicate
Volume ID: 2fec6d2b-20a7-4b1c-9411-7a29c4e6bbce
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.46.224:/var/lib/heketi/mounts/vg_92a0ea834f0f64420e8b3fa3f638b075/brick_4ec8801b26aa4344cf3a40d13bce4d34/brick
Brick2: 10.70.47.110:/var/lib/heketi/mounts/vg_9021e39a7bd981c1a42501b6e9da487f/brick_bc5901a1a862854d8f73ed24915828ad/brick
Brick3: 10.70.47.112:/var/lib/heketi/mounts/vg_3587717368692d764d40936b0f5fd47f/brick_be67fe7670103762ae3e5b1545dfd55f/brick
Options Reconfigured:
performance.readdir-ahead: on


sh-4.2# ls -ld  /var/lib/heketi
drwxr-xr-x. 3 root root 33 Dec 26 12:17 /var/lib/heketi


[root@dhcp46-2 mnt_tmp]# ls -l
total 132
-rw-r--r--. 1 root root 131072 Dec 28 14:32 heketi.db
drwxr-xr-x. 2 root root   4096 Dec 26 17:48 secret

Comment 32 Mohamed Ashiq 2016-12-29 07:28:48 UTC
Hi,

I did a little RCA on this issue. The patch in heketi talks to the db when the db has 666 permissions, but it was not tested with 644 permissions; gluster read-only mode leaves the db at 644.

# stat /mnt/heketi.db 
  File: ‘/mnt/heketi.db’
  Size: 131072    	Blocks: 176        IO Block: 131072 regular file
Device: 29h/41d	Inode: 12926231922372432401  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2016-12-27 08:50:41.711681035 -0500
Modify: 2016-12-29 01:25:24.200451583 -0500
Change: 2016-12-29 01:25:24.211451693 -0500
 Birth: -


PR: 
https://github.com/heketi/heketi/pull/436/files

Comment 33 Humble Chirammal 2016-12-29 07:42:04 UTC
(In reply to Mohamed Ashiq from comment #32)
> Hi,
> 
> I did little RCA on this issue. The patch in heketi tries to talk to db when
> db has 666 permission and it is not tested with 644 permission. gluster
> readonly mode makes the db 644.
> 
> # stat /mnt/heketi.db 
>   File: ‘/mnt/heketi.db’
>   Size: 131072    	Blocks: 176        IO Block: 131072 regular file
> Device: 29h/41d	Inode: 12926231922372432401  Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:fusefs_t:s0
> Access: 2016-12-27 08:50:41.711681035 -0500
> Modify: 2016-12-29 01:25:24.200451583 -0500
> Change: 2016-12-29 01:25:24.211451693 -0500
>  Birth: -
> 
> 
> PR: 
> https://github.com/heketi/heketi/pull/436/files

Thanks Ashiq!!

Exactly, as in c#30 and c#31. If a "666" heketi.db is able to start the heketi service in 'readonly' mode, and changing the mode in the code to "644" ( https://github.com/heketi/heketi/pull/436/files#diff-f394c40886f16cc9392ab7f130752b8bR109 ) also works, then we need a new PR.

Comment 34 Mohamed Ashiq 2016-12-29 09:22:59 UTC
Reference:

https://github.com/boltdb/bolt/#read-only-mode


AFAIK this can be used to open a db in read-only mode, but it does not say anything about the underlying filesystem being read-only.
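
A minimal sketch of that read-only open, based on the boltdb README and the heketi snippet quoted in comment 30 (the db path used here is an assumption):

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open heketi's db in read-only mode. The mode argument (0666) is
	// only the create mode; it does not change an existing file's
	// permissions.
	db, err := bolt.Open("/var/lib/heketi/heketi.db", 0666, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("unable to open database read-only: %v", err)
	}
	defer db.Close()
	log.Println("database opened in read-only mode")
}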

Comment 35 Humble Chirammal 2016-12-29 10:12:38 UTC
[NOTE]

It seems that we need to look at this in detail w.r.t.:

the underlying FS state (RO) + the file permissions (666/644) + the bolt.Open() behaviour: https://github.com/boltdb/bolt/blob/2e25e3bb4285d41d223bb80b12658a2c9b9bf3e3/db.go#L150
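
A small Linux-only diagnostic along these lines can report both pieces of state from inside the pod; the db path is an assumption and ST_RDONLY is the statfs(2) read-only mount flag:

package main

import (
	"fmt"
	"os"
	"syscall"
)

const stRdonly = 0x1 // ST_RDONLY from statfs(2): filesystem is mounted read-only

func main() {
	path := "/var/lib/heketi/heketi.db" // assumed location of the db file

	// File permission bits (the 666 vs 644 question).
	if fi, err := os.Stat(path); err == nil {
		fmt.Printf("file mode: %o\n", fi.Mode().Perm())
	} else {
		fmt.Println("stat failed:", err)
	}

	// Underlying filesystem state (the RO question).
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err == nil {
		fmt.Println("filesystem mounted read-only:", st.Flags&stRdonly != 0)
	} else {
		fmt.Println("statfs failed:", err)
	}
}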

Comment 37 Humble Chirammal 2017-01-02 06:24:08 UTC
The release team has been notified about the change; I am removing Devel Ack and proposing this for the CNS 3.5 release.

Comment 38 RHEL Program Management 2017-01-02 06:32:46 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 42 Ramakrishna Reddy Yekulla 2017-02-27 14:21:55 UTC
This is due to the behavior of BoltDB: BoltDB does not detect a read-only filesystem, so heketi does not open the db in read-only mode in such cases.

As a workaround, we can detect the read-only filesystem in heketi and handle it there.
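
A sketch of what such a check might look like, with a hypothetical helper name: it probes the db path with a read-write open and treats EROFS (read-only filesystem) or EACCES (read-only file) as the signal to fall back to read-only mode:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// openableReadWrite is a hypothetical helper: it reports whether the db
// file can be opened read-write. EROFS (read-only filesystem) and
// EACCES (read-only file) both mean "fall back to read-only mode".
func openableReadWrite(path string) (bool, error) {
	f, err := os.OpenFile(path, os.O_RDWR, 0600)
	if err == nil {
		f.Close()
		return true, nil
	}
	if pe, ok := err.(*os.PathError); ok {
		switch pe.Err {
		case syscall.EROFS, syscall.EACCES:
			return false, nil
		}
	}
	return false, err
}

func main() {
	ok, err := openableReadWrite("/var/lib/heketi/heketi.db")
	fmt.Println("read-write open possible:", ok, "error:", err)
}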

Comment 43 Humble Chirammal 2017-02-28 12:51:35 UTC
As we store the db in a secret since heketi v4, this should be safer now, or not a worry at all, in CNS deployments; but we may still need to consider heketi running outside kube/openshift setups.

Also look at:
https://github.com/heketi/heketi/issues/685#issuecomment-282934600

Comment 44 Ramakrishna Reddy Yekulla 2017-03-02 12:54:32 UTC
The following pull request addresses the issue:

https://github.com/heketi/heketi/pull/701

Comment 45 Ramakrishna Reddy Yekulla 2017-03-02 12:57:08 UTC
(In reply to Humble Chirammal from comment #43)
> As we store the db in secret since heketi v4, this should be more safe now
> or not a worry at all in CNS deployments, but still may consider heketi
> running outside kube/openshift setups. 

That is right. This should not be an issue when used in CNS as the product, i.e. in tandem with OpenShift. But in the standalone scenario (heketi + RHGS) the problem still needs to be addressed.


> 
> Also look at#
> https://github.com/heketi/heketi/issues/685#issuecomment-282934600

Comment 47 Ramakrishna Reddy Yekulla 2017-03-13 09:19:21 UTC
This is an issue with the behavior of the underlying gluster filesystem; the bug lies in Gluster:

open() with O_RDWR on a RO filesystem returns -1 with errno == EROFS
open() with O_RDWR on a RO file returns -1 with errno == EACCES

Comment 49 Humble Chirammal 2017-03-13 15:57:55 UTC
As soon as the RHGS bug is opened, I will defer this from the CNS 3.5 release.

Comment 50 Ramakrishna Reddy Yekulla 2017-03-14 12:18:30 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1432048 is the RHGS bug.

Comment 51 Humble Chirammal 2017-03-14 12:32:27 UTC
We have a dependent bug on RHGS now and the fix should land in GlusterFS. I am deferring this bug from this release.

Comment 59 Sweta Anandpara 2018-01-16 05:45:57 UTC
Resetting the needinfo from nerawat to current heketi QE.

Comment 62 Bipin Kunal 2018-05-07 12:36:12 UTC
@Abhishek: Will it be possible for you to reproduce this with the latest builds? We/Engineering think this is already fixed. If you can reproduce it, we can reconsider; otherwise we would like to close this.

Comment 63 John Mulligan 2018-09-18 22:14:44 UTC
(In reply to Bipin Kunal from comment #62)
> @Abhishek: Will it be possible for you to reproduce this with latest builds?
> We/Engineering thinks that this is already fixed. If You can reproduce, we
> can reconsider else we would like to close this.

I too would like to see this closed. It is correct that heketi / the heketi pod does not start if two of the nodes hosting the bricks of heketidbstorage are down. If the bricks are down for unknown reasons, we should debug why the glusterfs volume is in an unhealthy state rather than keep this ancient heketi bz alive.

Comment 64 Michael Adam 2018-09-19 17:18:59 UTC
Closing according to the discussion in the triage meeting: this works as designed.