Bug 1288352

Summary: A few snapshot creations fail with a "pre-validation failed" message on a tiered volume.
Product: [Community] GlusterFS Reporter: Nithya Balachandran <nbalacha>
Component: snapshot Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 3.7.6 CC: asengupt, bugs, byarlaga, dlambrig, nbalacha, sankarshan, smohan, sraj, storage-qa-internal
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1287842 Environment:
Last Closed: 2016-04-19 07:24:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1278798, 1287842    
Bug Blocks: 1274334    

Comment 1 Nithya Balachandran 2015-12-04 05:00:25 UTC
Description of problem:

When snapshots are created in a loop on a tiered volume while a large file-creation workload is in progress, some of the snapshot creations fail with a "pre-validation failed" message.


Version-Release number of selected component (if applicable):

glusterfs-3.7.5-5

How reproducible:

Always

Steps to Reproduce:

1. Create a tiered volume with a 2x(4+2) EC cold tier and a 2x2 dist-rep hot tier.
2. FUSE mount the volume on a client.
3. Start creating 20000 files of 100KB each on the mount point.
4. Simultaneously start creating 100 snapshots in a loop, with a sleep of 5 between two snapshot creations.
5. Observe that a few of the snapshot creations fail while the I/O is in progress (in my case, 11 out of 100 snapshots failed with a pre-validation failed message).
6. The failure messages in the logs are shown below:

[2015-11-05 21:56:02.975294] E [MSGID: 106062] [glusterd-snapshot.c:1504:glusterd_snap_create_clone_pre_val_use_rsp_dict] 0-management: failed to get the volume count
[2015-11-05 21:56:02.975451] E [MSGID: 106062] [glusterd-snapshot.c:1813:glusterd_snap_pre_validate_use_rsp_dict] 0-management: Unable to use rsp dict
[2015-11-05 21:56:02.975459] E [MSGID: 106122] [glusterd-mgmt.c:600:glusterd_pre_validate_aggr_rsp_dict] 0-management: Failed to aggregate prevalidate response dictionaries.
[2015-11-05 21:56:02.975467] E [MSGID: 106108] [glusterd-mgmt.c:701:gd_mgmt_v3_pre_validate_cbk_fn] 0-management: Failed to aggregate response from  node/brick
[2015-11-05 21:56:02.975497] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on 10.70.35.140. Please check log file for details.
[2015-11-05 21:56:05.833521] W [socket.c:588:__socket_rwv] 0-nfs: readv on /var/run/gluster/11f5a41d4df7a19d42d4e641eb784bfa.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 28 times between [2015-11-05 21:54:39.740819] and [2015-11-05 21:56:05.833576]
[2015-11-05 21:56:08.702620] E [MSGID: 106122] [glusterd-mgmt.c:883:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed on peers
[2015-11-05 21:56:08.702694] E [MSGID: 106122] [glusterd-mgmt.c:2164:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed


[2015-11-05 22:04:59.652100] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-11-05 22:04:59.652159] E [MSGID: 106572] [glusterd-snapshot.c:2592:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-11-05 22:04:59.652201] W [MSGID: 106030] [glusterd-snapshot.c:8380:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-11-05 22:04:59.652215] W [MSGID: 106122] [glusterd-mgmt.c:166:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2015-11-05 22:04:59.652228] E [MSGID: 106122] [glusterd-mgmt.c:820:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Snapshot on local node
[2015-11-05 22:04:59.652247] E [MSGID: 106122] [glusterd-mgmt.c:2164:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed
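
The glusterd log lines above share a regular layout (timestamp, level, optional MSGID, source location, domain, text), so the failures can be tallied mechanically when triaging a long run. The following is a minimal sketch in Python, not part of GlusterFS; the regular expression and the helper names `parse_line` and `tally_errors` are assumptions based only on the excerpts quoted in this report.

```python
import re
from collections import Counter

# Layout inferred from the excerpts in this report (an assumption, not a spec):
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: text
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s*(?P<text>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def tally_errors(lines):
    """Count error-level ("E") lines, keyed by (MSGID, reporting function)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "E":
            counts[(record["msgid"], record["func"])] += 1
    return counts
```

Feeding the log file line by line into `tally_errors` separates the two failure paths seen here: the "failed to get the volume count" path reported by peers and the "Failed to pause tier" path on the local node. Lines that do not follow the layout, such as the "The message ... repeated 28 times" summary, are skipped.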

Comment 2 Vijay Bellur 2015-12-04 05:06:02 UTC
REVIEW: http://review.gluster.org/12877 (cluster/tier: fix loading tier.so into glusterd) posted (#1) for review on release-3.7 by N Balachandran (nbalacha)

Comment 3 Vijay Bellur 2015-12-04 05:18:38 UTC
REVIEW: http://review.gluster.org/12877 (cluster/tier: fix loading tier.so into glusterd) posted (#2) for review on release-3.7 by N Balachandran (nbalacha)

Comment 4 Vijay Bellur 2015-12-04 13:44:35 UTC
COMMIT: http://review.gluster.org/12877 committed in release-3.7 by Dan Lambright (dlambrig) 
------
commit 0ef60a5c371359d2a5d0d8684a8a58f1f5801525
Author: N Balachandran <nbalacha>
Date:   Fri Dec 4 10:34:37 2015 +0530

    cluster/tier: fix loading tier.so into glusterd
    
    The glusterd process loads the shared libraries of client translators.
    This failed for tiering due to a reference to dht_methods which is
    defined as a global variable which is not necessary.
    The global variable has been removed and this is now a member of
    dht_conf and is now initialised in the *_init calls.
    
    > Change-Id: Ifa0a21e3962b5cd8d9b927ef1d087d3b25312953
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: http://review.gluster.org/12863
    > Tested-by: NetBSD Build System <jenkins.org>
    > Tested-by: Gluster Build System <jenkins.com>
    > Reviewed-by: Dan Lambright <dlambrig>
    >Tested-by: Dan Lambright <dlambrig>
    (cherry picked from commit 96fc7f64da2ef09e82845a7ab97574f511a9aae5)
    
    Change-Id: If3cc908ebfcd1f165504f15db2e3079d97f3132e
    BUG: 1288352
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/12877
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Dan Lambright <dlambrig>
    Tested-by: Dan Lambright <dlambrig>
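
The data-structure change described in the commit above (a method table that moves from a global variable into the per-translator configuration and is set in each init call) can be illustrated with a small model. This is a sketch in Python rather than the actual C code: `DhtMethods`, its `describe` field, and the bodies of `dht_init` and `tier_init` are hypothetical stand-ins, and the model does not reproduce the shared-library loading failure itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DhtMethods:
    """Hypothetical method table; the real dht_methods fields are not shown here."""
    describe: Callable[[], str]


@dataclass
class DhtConf:
    """Per-translator configuration that now owns its method table."""
    methods: Optional[DhtMethods] = None


def dht_init() -> DhtConf:
    # Each init call installs its own table instead of touching a shared global.
    conf = DhtConf()
    conf.methods = DhtMethods(describe=lambda: "dht")
    return conf


def tier_init() -> DhtConf:
    conf = DhtConf()
    conf.methods = DhtMethods(describe=lambda: "tier")
    return conf
```

Because the table is reached only through the configuration object, nothing outside the translator needs to resolve a global symbol to use it.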

Comment 5 Vijay Bellur 2016-01-08 13:40:51 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#1) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 6 Vijay Bellur 2016-02-01 16:30:35 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#2) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 7 Vijay Bellur 2016-02-10 13:30:32 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#3) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 8 Vijay Bellur 2016-02-11 21:33:17 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#4) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 9 Vijay Bellur 2016-02-16 16:51:35 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#5) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 10 Vijay Bellur 2016-02-17 14:05:34 UTC
REVIEW: http://review.gluster.org/13199 (cluster/tier: allow db queries to be interruptable) posted (#6) for review on release-3.7 by Dan Lambright (dlambrig)

Comment 11 Vijay Bellur 2016-02-18 13:55:05 UTC
COMMIT: http://review.gluster.org/13199 committed in release-3.7 by Dan Lambright (dlambrig) 
------
commit 92d08cee31044af4b792ed283011bf7287b00883
Author: Dan Lambright <dlambrig>
Date:   Mon Dec 28 10:57:53 2015 -0500

    cluster/tier: allow db queries to be interruptable
    
    A query to the database may take a long time if the database
    has many entries. The tier daemon also sends IPC calls to the
    bricks which can run slowly, especially in RHEL6. While it is
    possible to track down each such instance, the snapshot
    feature should not be affected by database operations. It requires
    no migration be underway. Therefore it is okay to pause tiering
    at any time except when DHT is moving a file.  This fix implements
    this strategy by monitoring when control passes to DHT to
    migrate a file using the GF_XATTR_FILE_MIGRATE_KEY trigger. If it
    is not, the pause operation is successful.
    
    > Change-Id: I21f168b1bd424077ad5f38cf82f794060a1fabf6
    > BUG: 1287842
    > Signed-off-by: Dan Lambright <dlambrig>
    > Reviewed-on: http://review.gluster.org/13104
    > Reviewed-by: Joseph Fernandes
    > Tested-by: Gluster Build System <jenkins.com>
    Signed-off-by: Dan Lambright <dlambrig>
    
    Change-Id: I667e0af24eaa66afefa860c4d73b324e4f39b997
    BUG: 1288352
    Signed-off-by: Dan Lambright <dlambrig>
    Reviewed-on: http://review.gluster.org/13199
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
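
The pause rule described in the commit above (a pause may succeed at any time except while DHT is moving a file) can be sketched as a small gate. This is an illustrative Python model, not the tier daemon's C implementation: the class `TierPauseGate` and its methods are hypothetical names, and the `migrating()` window stands in for the period the commit tracks via the GF_XATTR_FILE_MIGRATE_KEY trigger.

```python
import threading
from contextlib import contextmanager


class TierPauseGate:
    """Hypothetical model: pausing is refused only while a file is being migrated."""

    def __init__(self):
        self._lock = threading.Lock()
        self._migrating = 0   # migrations currently handed over to DHT
        self._paused = False

    @contextmanager
    def migrating(self):
        """Mark the window in which control has passed to DHT to move a file."""
        with self._lock:
            if self._paused:
                raise RuntimeError("tier is paused; migration must not start")
            self._migrating += 1
        try:
            yield
        finally:
            with self._lock:
                self._migrating -= 1

    def try_pause(self):
        """Pause unless a migration is in flight; return whether the pause took effect."""
        with self._lock:
            if self._migrating:
                return False
            self._paused = True
            return True

    def resume(self):
        with self._lock:
            self._paused = False
```

In this model a long-running database query or a slow IPC call happens outside the `migrating()` window, so `try_pause()` succeeds immediately and snapshot pre-validation does not have to wait for it.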

Comment 12 Kaushal 2016-04-19 07:24:46 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report.

glusterfs-3.7.9 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

Comment 13 Kaushal 2016-04-19 07:50:07 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.7, please open a new bug report.

glusterfs-3.7.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-February/025292.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user