Bug 1202388
| Summary: | [SNAPSHOT]: After a volume which has quota enabled is restored to a snap, attaching another node to the cluster is not successful | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | senaik | |
| Component: | snapshot | Assignee: | rjoseph | |
| Status: | CLOSED ERRATA | QA Contact: | Anil Shah <ashah> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | rhgs-3.0 | CC: | annair, asengupt, asriram, asrivast, nsathyan, rcyriac, rhs-bugs, rjoseph, rkavunga, storage-qa-internal, vagarwal | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.1.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | SNAPSHOT | |||
| Fixed In Version: | glusterfs-3.7.0-3.el6rhs | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1202436 1204636 | Environment: | ||
| Last Closed: | 2015-07-29 04:39:21 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1122377, 1219744 | |||
| Bug Blocks: | 1191838, 1202436, 1202842, 1204636, 1223636 | |||
|
Description
senaik
2015-03-16 14:16:05 UTC
Could we please have the RCA for this bug updated?

If quota is enabled on the volume and the volume is restored to a snapshot, then we cannot add another node to the cluster.

RCA: When we enable quota on a volume, the quota cksum file is created under /var/lib/glusterd/vols/<volname>/, and the in-memory volinfo structure is updated as well. When we take a snapshot we copy the quota cksum file along with a copy of the in-memory structure. During restore we fail to fill in the quota cksum value, so after the snapshot is restored the cksum is set to 0. When we then try to add a new node, the quota cksum mismatches and the node cannot be added.

While trying out a workaround for this problem I hit an interesting issue. The issue is as follows:
1) Let us say we have 2 nodes and a volume (vol1) is spread across these 2 nodes.
2) We take a snapshot
3) We set some options on the volume "vol1"
[a] Volume options are persisted in /var/lib/glusterd/vols/<volname>/info
[b] snap volume info is saved in /var/lib/glusterd/snaps/<snapvol>/info
* The newly added option should be saved in [a] (mentioned above), and not in [b]
4) Peer probe to a new node.
5) Now check the "info" file of the snap volume (i.e. the one mentioned in point 3[b])
* The info file of the snap volume contains the newly set option. Ideally the
  snap volume should not contain that option, as it was set after the
  snapshot was created.
* Because of this there will be a volume checksum mismatch as well.
* We are safe only as long as glusterd is not restarted. Once glusterd is
  restarted on the new node, the checksum mismatch puts the newly added
  node into a peer reject state. (A quick on-disk check for this divergence
  is sketched right after this list.)
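The divergence above can be spotted directly in the glusterd store. This is a minimal bash sketch, not an official procedure: the volume name vol1, the peer address, the option nfs.disable (an arbitrary real option used only for illustration) and the name of the checksum file ("cksum") are assumptions; the info file locations follow [a] and [b] above.

```bash
#!/bin/bash
# Sketch only: check whether an option set AFTER snapshot creation leaks into
# the snap volume's info file, and whether the store checksum already differs
# between two peers. Paths follow [a] and [b] above; the checksum file name
# ("cksum") is an assumption about the glusterd store layout.

VOL=vol1                                  # volume from the scenario above
VOL_DIR=/var/lib/glusterd/vols/$VOL       # [a] regular volume store
SNAP_DIR=/var/lib/glusterd/snaps          # [b] snap volume store
PEER=10.70.33.219                         # illustrative peer address

# Set an arbitrary (real) volume option after the snapshot has been taken.
gluster volume set "$VOL" nfs.disable on

# The option should show up in the regular volume's info file ...
grep -H 'nfs.disable' "$VOL_DIR/info"

# ... but any hit under the snaps directory is the leak described above.
grep -rH 'nfs.disable' "$SNAP_DIR" --include=info 2>/dev/null

# Compare this node's store checksum with the peer's; a difference is what
# later drives the new node into the peer reject state after a restart.
diff "$VOL_DIR/cksum" <(ssh root@"$PEER" "cat $VOL_DIR/cksum")
```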
There are 2 separate problems behind this bug:
1) After taking a snapshot, if we set any volume option and then add a new node,
   the new node should ideally end up with the same snap-volume "info" file as
   the existing nodes. The problem is that, along with the existing "info" file
   contents, the new volume option is also written into the snap-volume "info"
   file, which should not be the case. Because of this a checksum mismatch
   happens and the peer goes into the rejected state.
 * The workaround here is, after snapshot restore, to copy the
   /var/lib/glusterd/vols/<volname>/info file from an existing node to the new
   node and restart glusterd on the new node.
2) Suppose we have quota enabled on the volume and we take a snapshot. During
   snapshot restore we fail to re-calculate the quota checksum, and because of
   that the quota checksum is initialized to zero. If we then try to add a new
   node, the quota_conf checksum mismatches and the peer goes into the rejected
   state.
 * The workaround here is, after snapshot restore, to restart glusterd on the
   existing nodes and then peer probe the new node. (Both workarounds are
   sketched as shell commands below.)
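Both workarounds can be written down as a short shell sketch, shown below. This is a sketch under assumptions only: the node addresses, the volume name vol0, running the commands as root over ssh/scp, and the el6-style "service glusterd restart" invocation are illustrative, not a documented procedure.

```bash
#!/bin/bash
# Sketch of the two workarounds above. Node addresses, the volume name and
# the service invocation (el6-style) are assumptions for illustration.

VOL=vol0
NEW_NODE=10.70.47.151                         # node being added to the cluster
EXISTING_NODES="10.70.33.219 10.70.44.13 10.70.33.225"

# Workaround 1: after snapshot restore, copy the volume's info file from an
# existing node (run this there) to the new node, then restart glusterd on
# the new node.
scp "/var/lib/glusterd/vols/$VOL/info" \
    root@"$NEW_NODE":"/var/lib/glusterd/vols/$VOL/"
ssh root@"$NEW_NODE" 'service glusterd restart'

# Workaround 2: after snapshot restore, restart glusterd on all existing
# nodes first, and only then peer probe the new node.
for node in $EXISTING_NODES; do
    ssh root@"$node" 'service glusterd restart'
done
gluster peer probe "$NEW_NODE"
```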
Hi Rajesh, I see this bug being added as a known issue for the 3.0.4 release. Please fill out the doc text.

Hi Rajesh, Can you please review the edited doc text for technical accuracy and sign off?

doc-text seems fine to me.

Workaround mentioned in Comment 7 is tested and works fine.

```
[root@darkknightrises cron.d]# gluster snapshot create snap3 vol0
snapshot create: success: Snap snap3_GMT-2015.07.04-07.19.09 created successfully
[root@darkknightrises cron.d]# gluster snapshot activate snap3_GMT-2015.07.04-07.19.09
Snapshot activate: snap3_GMT-2015.07.04-07.19.09: Snap activated successfully
[root@darkknightrises cron.d]# gluster v stop vol0
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: vol0: success
[root@darkknightrises cron.d]# gluster snapshot restore snap3_GMT-2015.07.04-07.19.09
Restore operation will replace the original volume with the snapshotted volume. Do you still want to continue? (y/n) y
Snapshot restore: snap3_GMT-2015.07.04-07.19.09: Snap restored successfully
[root@darkknightrises cron.d]# gluster v start vol0
volume start: vol0: success
[root@darkknightrises cron.d]# gluster peer probe 10.70.47.151
peer probe: success.
[root@darkknightrises cron.d]# gluster peer status
Number of Peers: 4

Hostname: 10.70.33.219
Uuid: e91bd1ef-38d5-4389-8db8-2a6528ccbb17
State: Peer in Cluster (Connected)

Hostname: 10.70.44.13
Uuid: 7e6e250c-b14d-4850-96c6-2a194a47b90e
State: Peer in Cluster (Connected)

Hostname: 10.70.33.225
Uuid: 0e2626da-55f5-4523-95c0-55eefb4e53a3
State: Peer in Cluster (Connected)

Hostname: 10.70.47.151
Uuid: 6f5126cf-312a-4274-8a09-380dd6d465ad
State: Peer in Cluster (Connected)
```

Bug verified on build glusterfs-3.7.1-7.el6rhs.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
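One follow-up check that is not part of the transcript above: the original failure only surfaced once glusterd was restarted on the newly added node, so a restart-and-recheck step can be appended to the verification. A minimal sketch, assuming the el6-style service command and the peer address from the transcript:

```bash
# Restart glusterd on the newly added node and confirm it stays in the cluster.
# The address and the service invocation are assumptions taken from the
# transcript above; the failure mode would show up as "Peer Rejected" here.
ssh root@10.70.47.151 'service glusterd restart'
sleep 5
gluster peer status | grep -A 2 'Hostname: 10.70.47.151'
# Expected: State: Peer in Cluster (Connected)
```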