+++ This bug was initially created as a clone of Bug #1218573 +++

Description of problem:
The scheduler does not pick up scheduled jobs when one of the storage nodes of the shared storage volume is down.

Version-Release number of selected component (if applicable):
[root@localhost glusterfs]# rpm -qa | grep glusterfs
glusterfs-debuginfo-3.7.0alpha0-0.9.git989bea3.el7.centos.x86_64
glusterfs-libs-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-fuse-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-extra-xlators-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-geo-replication-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-cli-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-api-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-server-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-devel-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 2 x 2 distributed-replicate volume.
2. Create a replicate shared storage volume on storage nodes that are not part of the volume whose snapshots are scheduled, and mount it on each storage node at /var/run/gluster/shared_storage.
3. Initialize the scheduler on each storage node, e.g. run snap_scheduler.py init.
4. Enable the scheduler on the storage nodes, e.g. run snap_scheduler.py enable.
5. Add a job to create snapshots of the volume at a 5-minute interval, e.g. snap_scheduler.py add job1 "*/5 * * * *" testvol
6. Bring down both shared storage nodes.
7. Bring up one of the shared storage nodes.
(A command sketch of these steps follows at the end of this comment.)

Actual results:
The scheduled job is not picked up by the scheduler.

Expected results:
The scheduler should pick up the scheduled jobs.

Additional info:
[root@localhost glusterfs]# gluster v info testvol

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: f5eed851-6f24-4cde-903e-7669f5437bc9
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.143:/rhs/brick1/b1
Brick2: 10.70.47.145:/rhs/brick1/b2
Brick3: 10.70.47.150:/rhs/brick1/b3
Brick4: 10.70.47.151:/rhs/brick1/b4
Options Reconfigured:
features.quota: on
features.quota-deem-statfs: on
features.uss: enable
features.barrier: disable

====================================
Shared storage volume:

[root@localhost ~]# gluster v info meta

Volume Name: meta
Type: Replicate
Volume ID: b07daf4e-891d-4022-972a-af181250dc07
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.46.248:/rhs/brick1/b1
Brick2: 10.70.46.251:/rhs/brick1/b2

--- Additional comment from on 2015-05-08 05:45:30 EDT ---

Version: glusterfs 3.7.0beta1 built on May 7 2015
=======
Another scenario where jobs are not picked up:

1) Create a dist-rep volume and mount it.
2) Create a shared storage volume and mount it. Enable the scheduler and schedule jobs on the volumes:

snap_scheduler.py add "A1" "*/5 * * * *" "vol1"
snap_scheduler: Successfully added snapshot schedule
snap_scheduler.py add "A2" "*/10 * * * *" "vol2"
snap_scheduler: Successfully added snapshot schedule

3) Take a snapshot of the shared storage volume:

gluster snapshot create MV_Snap gluster_shared_storage
snapshot create: success: Snap MV_Snap_GMT-2015.05.08-09.20.26 created successfully

4) Add some more jobs - A3 and A4.
5) Stop the shared storage volume and observe that at the next scheduled time no job is picked up.
6) Restore the shared storage volume to the snapshot taken and start the volume.
7) After the restore, the scheduler lists jobs A1 and A2, but none of them are picked up.
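For reference, a minimal command sketch of the "Steps to Reproduce" in the original description, assuming hypothetical hostnames node1-node4 for the data volume and node5/node6 for the shared storage volume; brick paths, mount point and cron expression follow the report:

# 1. 2 x 2 distributed-replicate data volume on node1-node4
gluster volume create testvol replica 2 \
    node1:/rhs/brick1/b1 node2:/rhs/brick1/b2 \
    node3:/rhs/brick1/b3 node4:/rhs/brick1/b4
gluster volume start testvol

# 2. Replicate shared storage volume on nodes outside testvol,
#    mounted on every storage node where the scheduler runs
gluster volume create meta replica 2 node5:/rhs/brick1/b1 node6:/rhs/brick1/b2
gluster volume start meta
mount -t glusterfs node5:/meta /var/run/gluster/shared_storage   # run on each node

# 3-5. Initialize and enable the scheduler, then add a 5-minute job
snap_scheduler.py init        # run on each node
snap_scheduler.py enable
snap_scheduler.py add "job1" "*/5 * * * *" testvol

# 6-7. Power off both shared storage nodes (node5, node6), power one of
#      them back on, and check whether the scheduled job still fires.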
Upstream Url: http://review.gluster.org/#/c/11139/
RHS 3.1 Url: http://review.gluster.org/#/c/11168/
RHGS 3.1 Url: https://code.engineering.redhat.com/gerrit/#/c/50514/
Version: glusterfs-3.7.1-7.el6rhs.x86_64

Created the shared storage using "gluster v set all cluster.shared-storage enable", which creates the shared storage volume with bricks at /var/run/gluster/ss_brick, which is on a tmpfs; on a node reboot the shared storage brick is wiped clean and all the jobs created are lost.

Proposing this bug as a Blocker, since all data on the shared storage volume is lost when the nodes are rebooted.

Steps followed:
===============
1) Create a 2-node (Node1 and Node2) cluster and run "gluster v set all cluster.shared-storage enable" - this creates a 1 x 2 replicate volume with bricks at /var/run/gluster/ss_brick:

gluster v info

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: be49ed27-8cb3-4ae3-9d20-f5d8f375c0c9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/ss_brick
Brick2: 10.70.34.50:/var/run/gluster/ss_brick
Options Reconfigured:
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

2) Attach Node3 and Node4 to the cluster and mount the shared storage on /var/run/gluster/shared_storage.
3) Create a volume with bricks from Node3 and Node4.
4) Initialize the scheduler on all nodes and enable it. Check the status from all nodes - it shows enabled.
5) Add a job on the volume with a 10-minute interval.
6) Power off Node1 and Node2 (the nodes which host the bricks of the shared storage volume).
7) Power on Node2 and check snap_scheduler status on srv2 - disabled.
8) Check snap_scheduler status on srv3 and srv4 - disabled.
9) snap_scheduler list shows no jobs!
(See the command sketch after this comment.)

Moving the bug back to Assigned and proposing it as a 'Blocker'.
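A condensed command sketch of the steps above, assuming hypothetical hostnames node1-node4 and brick paths; the shared-storage option name is taken from the volume info output (cluster.enable-shared-storage):

# 1. On the 2-node cluster (Node1, Node2): create the shared storage volume
gluster volume set all cluster.enable-shared-storage enable   # creates gluster_shared_storage (1 x 2 replicate)

# 2. Attach Node3 and Node4 and mount the shared storage there
gluster peer probe node3
gluster peer probe node4
mount -t glusterfs node1:/gluster_shared_storage /var/run/gluster/shared_storage   # on node3 and node4

# 3. Data volume on Node3/Node4 (hypothetical brick paths)
gluster volume create vol0 replica 2 node3:/rhs/brick1/b1 node4:/rhs/brick1/b1
gluster volume start vol0

# 4-5. Scheduler setup and a 10-minute job
snap_scheduler.py init       # on every node
snap_scheduler.py enable
snap_scheduler.py add "J1" "*/10 * * * *" vol0

# 6-9. Power off Node1 and Node2, power Node2 back on, then check:
snap_scheduler.py status
snap_scheduler.py list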
Fixed with https://code.engineering.redhat.com/gerrit/#/c/52414/
Version: glusterfs-3.7.1-8.el6rhs.x86_64

Followed the steps mentioned in Comment 4; after rebooting the nodes, the jobs are still listed and the scheduler continues to create snapshots.

snap_scheduler.py list
JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME
--------------------------------------------------------------------
J1               */5 * * * *      Snapshot Create  vol0

Rebooted the nodes after 5 snapshots were created. After the nodes were back up, snapshot creation continued.

gluster snapshot list | wc -l
184

gluster v info

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 1002e97f-2f03-4040-a3f9-a403995a35fa
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: 10.70.34.50:/var/lib/glusterd/ss_brick
Options Reconfigured:
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

Note that the shared storage bricks are now created under /var/lib/glusterd/ss_brick instead of the tmpfs-backed /var/run/gluster/ss_brick, so they survive a reboot.

Marking the bug 'Verified'.
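The verification above boils down to the following checks (volume and job names as in this comment; the grep is only a convenience, not part of the original transcript):

snap_scheduler.py list             # J1 (*/5 * * * *) is still listed after the reboot
gluster snapshot list | wc -l      # snapshot count keeps growing across the reboot
gluster volume info gluster_shared_storage | grep Brick
# Bricks should be under /var/lib/glusterd/ss_brick (persistent storage),
# not /var/run/gluster/ss_brick, so the schedule data is not wiped on reboot.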
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html