+++ This bug was initially created as a clone of Bug #1218573 +++

Description of problem:
The scheduler does not pick up scheduled jobs when one of the storage nodes of the shared storage volume is down.

Version-Release number of selected component (if applicable):
[root@localhost glusterfs]# rpm -qa | grep glusterfs
glusterfs-debuginfo-3.7.0alpha0-0.9.git989bea3.el7.centos.x86_64
glusterfs-libs-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-fuse-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-extra-xlators-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-geo-replication-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-cli-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-api-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-server-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64
glusterfs-devel-3.7.0beta1-0.14.git09bbd5c.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 2 x 2 distributed-replicate volume.
2. Create a replicate shared storage volume on storage nodes that are not part of the volume whose snapshots are scheduled, and mount it on each storage node at /var/run/gluster/shared_storage.
3. Initialize the scheduler on each storage node, e.g. run snap_scheduler.py init.
4. Enable the scheduler on the storage nodes, e.g. run snap_scheduler.py enable.
5. Add a job to create snapshots of the volume at a 5-minute interval, e.g. snap_scheduler.py add job1 "*/5 * * * *" testvol
6. Bring down both shared storage nodes.
7. Bring up one of the shared storage nodes.
(A command sketch of these steps follows at the end of this comment.)

Actual results:
The scheduled job is not picked up by the scheduler.

Expected results:
The scheduler should pick up the scheduled jobs.

Additional info:
[root@localhost glusterfs]# gluster v info testvol

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: f5eed851-6f24-4cde-903e-7669f5437bc9
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.143:/rhs/brick1/b1
Brick2: 10.70.47.145:/rhs/brick1/b2
Brick3: 10.70.47.150:/rhs/brick1/b3
Brick4: 10.70.47.151:/rhs/brick1/b4
Options Reconfigured:
features.quota: on
features.quota-deem-statfs: on
features.uss: enable
features.barrier: disable

====================================
Shared storage volume:

[root@localhost ~]# gluster v info meta

Volume Name: meta
Type: Replicate
Volume ID: b07daf4e-891d-4022-972a-af181250dc07
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.46.248:/rhs/brick1/b1
Brick2: 10.70.46.251:/rhs/brick1/b2

--- Additional comment from on 2015-05-08 05:45:30 EDT ---

Version: glusterfs 3.7.0beta1 built on May 7 2015
=======
Another scenario where jobs are not picked up:

1) Create a dist-rep volume and mount it.
2) Create a shared storage volume and mount it. Enable the scheduler and schedule jobs on the volumes:

snap_scheduler.py add "A1" "*/5 * * * *" "vol1"
snap_scheduler: Successfully added snapshot schedule
snap_scheduler.py add "A2" "*/10 * * * *" "vol2"
snap_scheduler: Successfully added snapshot schedule

3) Take a snapshot of the shared storage volume:

gluster snapshot create MV_Snap gluster_shared_storage
snapshot create: success: Snap MV_Snap_GMT-2015.05.08-09.20.26 created successfully

4) Add some more jobs - A3 and A4.
5) Stop the shared storage volume and observe that at the next scheduled time no job is picked up.
6) Restore the shared storage volume to the snapshot taken and start the volume.
7) After the restore, the scheduler lists jobs A1 and A2, but none of them are picked up.
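For reference, a minimal command sketch of the "Steps to Reproduce" in the original description, assuming hypothetical hostnames node1-node4 for the data volume and node5/node6 for the shared storage volume; brick paths, mount point and cron expression follow the report:

# 1. 2 x 2 distributed-replicate data volume on node1-node4
gluster volume create testvol replica 2 \
    node1:/rhs/brick1/b1 node2:/rhs/brick1/b2 \
    node3:/rhs/brick1/b3 node4:/rhs/brick1/b4
gluster volume start testvol

# 2. Replicate shared storage volume on nodes outside testvol,
#    mounted on every storage node where the scheduler runs
gluster volume create meta replica 2 node5:/rhs/brick1/b1 node6:/rhs/brick1/b2
gluster volume start meta
mount -t glusterfs node5:/meta /var/run/gluster/shared_storage   # run on each node

# 3-5. Initialize and enable the scheduler, then add a 5-minute job
snap_scheduler.py init        # run on each node
snap_scheduler.py enable
snap_scheduler.py add "job1" "*/5 * * * *" testvol

# 6-7. Power off both shared storage nodes (node5, node6), power one of
#      them back on, and check whether the scheduled job still fires.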
Upstream Url: http://review.gluster.org/#/c/11139/
RHS 3.1 Url: http://review.gluster.org/#/c/11168/
RHGS 3.1 Url: https://code.engineering.redhat.com/gerrit/#/c/50514/
Version: glusterfs-3.7.1-7.el6rhs.x86_64

Created the shared storage using "gluster v set all cluster.shared-storage enable", which creates the shared storage volume with bricks at /var/run/gluster/ss_brick, which is on a tmpfs; on a node reboot the shared storage brick is wiped clean and all the jobs created are lost.

Proposing this bug as a Blocker, since all data on the shared storage volume is lost when the nodes are rebooted.

Steps followed:
===============
1) Create a 2-node (Node1 and Node2) cluster and run "gluster v set all cluster.shared-storage enable" - this creates a 1 x 2 replicate volume with bricks at /var/run/gluster/ss_brick:

gluster v info

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: be49ed27-8cb3-4ae3-9d20-f5d8f375c0c9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/ss_brick
Brick2: 10.70.34.50:/var/run/gluster/ss_brick
Options Reconfigured:
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

2) Attach Node3 and Node4 to the cluster and mount the shared storage on /var/run/gluster/shared_storage.
3) Create a volume with bricks from Node3 and Node4.
4) Initialize the scheduler on all nodes and enable it. Check the status from all nodes - it shows enabled.
5) Add a job on the volume with a 10-minute interval.
6) Power off Node1 and Node2 (the nodes which host the bricks of the shared storage volume).
7) Power on Node2 and check snap_scheduler status on srv2 - disabled.
8) Check snap_scheduler status on srv3 and srv4 - disabled.
9) snap_scheduler list shows no jobs!
(See the command sketch after this comment.)

Moving the bug back to Assigned and proposing it as a 'Blocker'.
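A condensed command sketch of the steps above, assuming hypothetical hostnames node1-node4 and brick paths; the shared-storage option name is taken from the volume info output (cluster.enable-shared-storage):

# 1. On the 2-node cluster (Node1, Node2): create the shared storage volume
gluster volume set all cluster.enable-shared-storage enable   # creates gluster_shared_storage (1 x 2 replicate)

# 2. Attach Node3 and Node4 and mount the shared storage there
gluster peer probe node3
gluster peer probe node4
mount -t glusterfs node1:/gluster_shared_storage /var/run/gluster/shared_storage   # on node3 and node4

# 3. Data volume on Node3/Node4 (hypothetical brick paths)
gluster volume create vol0 replica 2 node3:/rhs/brick1/b1 node4:/rhs/brick1/b1
gluster volume start vol0

# 4-5. Scheduler setup and a 10-minute job
snap_scheduler.py init       # on every node
snap_scheduler.py enable
snap_scheduler.py add "J1" "*/10 * * * *" vol0

# 6-9. Power off Node1 and Node2, power Node2 back on, then check:
snap_scheduler.py status
snap_scheduler.py list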
Fixed with https://code.engineering.redhat.com/gerrit/#/c/52414/
Version: glusterfs-3.7.1-8.el6rhs.x86_64

Followed the steps mentioned in Comment 4; after rebooting the nodes, the jobs are still listed and the scheduler continues to create snapshots.

snap_scheduler.py list
JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME
--------------------------------------------------------------------
J1               */5 * * * *      Snapshot Create  vol0

Rebooted the nodes after 5 snapshots were created. After the nodes were back up, snapshot creation continued.

gluster snapshot list | wc -l
184

gluster v info

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 1002e97f-2f03-4040-a3f9-a403995a35fa
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-arch-srv2.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: 10.70.34.50:/var/lib/glusterd/ss_brick
Options Reconfigured:
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

Note that the shared storage bricks are now created under /var/lib/glusterd/ss_brick instead of the tmpfs-backed /var/run/gluster/ss_brick, so they survive a reboot.

Marking the bug 'Verified'.
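The verification above boils down to the following checks (volume and job names as in this comment; the grep is only a convenience, not part of the original transcript):

snap_scheduler.py list             # J1 (*/5 * * * *) is still listed after the reboot
gluster snapshot list | wc -l      # snapshot count keeps growing across the reboot
gluster volume info gluster_shared_storage | grep Brick
# Bricks should be under /var/lib/glusterd/ss_brick (persistent storage),
# not /var/run/gluster/ss_brick, so the schedule data is not wiped on reboot.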
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html