Description of problem:
Running snapshot creation in a loop on a tiered volume while large file creation is in progress causes some of the snapshot creations to fail with a "pre-validation failed" message.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-5

How reproducible:
Always

Steps to Reproduce:
1. Create a tiered volume with a 2x(4+2) EC cold tier and a 2x2 dist-rep hot tier.
2. FUSE mount the volume on a client.
3. Start creating 20000 files on the mount point, each of 100KB.
4. Simultaneously start creating 100 snapshots in a loop, with a sleep of 5 seconds between every 2 snapshot creations.
5. Observe that a few of the snapshot creations fail while the IO is in progress (in my case, 11 out of 100 snapshots failed with a pre-validation failed message).
6. Below are the failure messages from the logs:

[2015-11-05 21:56:02.975294] E [MSGID: 106062] [glusterd-snapshot.c:1504:glusterd_snap_create_clone_pre_val_use_rsp_dict] 0-management: failed to get the volume count
[2015-11-05 21:56:02.975451] E [MSGID: 106062] [glusterd-snapshot.c:1813:glusterd_snap_pre_validate_use_rsp_dict] 0-management: Unable to use rsp dict
[2015-11-05 21:56:02.975459] E [MSGID: 106122] [glusterd-mgmt.c:600:glusterd_pre_validate_aggr_rsp_dict] 0-management: Failed to aggregate prevalidate response dictionaries.
[2015-11-05 21:56:02.975467] E [MSGID: 106108] [glusterd-mgmt.c:701:gd_mgmt_v3_pre_validate_cbk_fn] 0-management: Failed to aggregate response from node/brick
[2015-11-05 21:56:02.975497] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on 10.70.35.140. Please check log file for details.
[2015-11-05 21:56:05.833521] W [socket.c:588:__socket_rwv] 0-nfs: readv on /var/run/gluster/11f5a41d4df7a19d42d4e641eb784bfa.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 28 times between [2015-11-05 21:54:39.740819] and [2015-11-05 21:56:05.833576]
[2015-11-05 21:56:08.702620] E [MSGID: 106122] [glusterd-mgmt.c:883:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed on peers
[2015-11-05 21:56:08.702694] E [MSGID: 106122] [glusterd-mgmt.c:2164:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed
[2015-11-05 22:04:59.652100] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-11-05 22:04:59.652159] E [MSGID: 106572] [glusterd-snapshot.c:2592:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-11-05 22:04:59.652201] W [MSGID: 106030] [glusterd-snapshot.c:8380:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-11-05 22:04:59.652215] W [MSGID: 106122] [glusterd-mgmt.c:166:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2015-11-05 22:04:59.652228] E [MSGID: 106122] [glusterd-mgmt.c:820:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Snapshot on local node
[2015-11-05 22:04:59.652247] E [MSGID: 106122] [glusterd-mgmt.c:2164:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed

Actual results:
Snapshot creation fails while IO is running on the tiered volume.

Expected results:
Snapshot creation should be successful without any issues.

Additional info:
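For reference, step 4's snapshot loop can be driven by a small script along these lines. This is an illustrative sketch only: the `create_snaps` helper, the `snap$i` name prefix, and the `GLUSTER` override are my assumptions, not taken from this report.

```shell
# Illustrative sketch of the step-4 snapshot loop; create_snaps, the
# "snap$i" name prefix, and the GLUSTER override are assumptions.
GLUSTER=${GLUSTER:-gluster}   # set GLUSTER=echo for a dry run

create_snaps() {
    # $1 = volume name, $2 = number of snapshots, $3 = sleep seconds
    vol=$1; count=$2; interval=$3
    i=1
    while [ "$i" -le "$count" ]; do
        $GLUSTER snapshot create "snap$i" "$vol"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Something like `create_snaps <volname> 100 5` approximates the loop above; the report sleeps 5 seconds per 2 creations, so the exact pacing of the original script may differ.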
Can you provide all the logs, including the tier logs and the complete glusterd log? Ideally you could also set the client log level to DEBUG.
Created attachment 1094861 [details] logs from one node
This issue is still reproducible with the latest build: a few of the snapshots fail when a promotion or demotion is happening on the volume.

Steps to reproduce:
1) Write around 1000 files of 100KB on the tiered volume and make sure these files land in the hot tier initially.
2) As soon as demotion of these files starts, start creating snapshots in a loop with a time interval of 10 seconds between each snap create.
3) Observe that the first snapshot fails with a pre-validation failed message, "failed to pause tier in snap prevalidate":

[2015-12-01 09:14:12.968667] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2015-12-01 09:14:24.649486] W [MSGID: 101095] [xlator.c:143:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.5/xlator/cluster/tier.so: undefined symbol: dht_methods
[2015-12-01 09:16:17.069099] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-12-01 09:16:17.069171] E [MSGID: 106572] [glusterd-snapshot.c:2592:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-12-01 09:16:17.069195] W [MSGID: 106030] [glusterd-snapshot.c:8390:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-12-01 09:16:17.069211] W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed

Also, sometimes snapshots in between fail with the same error. Attached sos report from one node.
Created attachment 1100826 [details] sos report from one node
I found that tier.so was not being loaded properly by glusterd and wrote fix 12840. This appears to be the root cause of the problem, but I have not confirmed it.
http://review.gluster.org/12840
This issue is still reproducible with the latest build, glusterfs-3.7.5-10.

[root@dhcp35-141 b3]# gluster --version
glusterfs 3.7.5 built on Dec 8 2015 01:14:33
[root@dhcp35-141 b3]# rpm -qa|grep glusterfs
glusterfs-libs-3.7.5-10.el7rhgs.x86_64
glusterfs-api-3.7.5-10.el7rhgs.x86_64
glusterfs-rdma-3.7.5-10.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-10.el7rhgs.x86_64
glusterfs-3.7.5-10.el7rhgs.x86_64
glusterfs-fuse-3.7.5-10.el7rhgs.x86_64
glusterfs-cli-3.7.5-10.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-10.el7rhgs.x86_64
glusterfs-client-xlators-3.7.5-10.el7rhgs.x86_64
glusterfs-server-3.7.5-10.el7rhgs.x86_64

Steps followed are as below:
1) Created a tiered volume, started it, enabled quota and attached a tier to it.
2) Set cluster.watermark-low to 1 and cluster.watermark-hi to 2 for the tiered volume.

Volume Name: tiervolume
Type: Tier
Volume ID: 70b879ba-067d-498b-a29e-4003c650b68b
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.35.140:/bricks/brick3/b3
Brick2: 10.70.35.142:/bricks/brick3/b3
Brick3: 10.70.35.141:/bricks/brick3/b3
Brick4: 10.70.35.228:/bricks/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.35.228:/bricks/brick0/b0
Brick6: 10.70.35.141:/bricks/brick0/b0
Brick7: 10.70.35.142:/bricks/brick0/b0
Brick8: 10.70.35.140:/bricks/brick0/b0
Brick9: 10.70.35.228:/bricks/brick1/b1
Brick10: 10.70.35.141:/bricks/brick1/b1
Brick11: 10.70.35.142:/bricks/brick1/b1
Brick12: 10.70.35.140:/bricks/brick1/b1
Brick13: 10.70.35.228:/bricks/brick2/b2
Brick14: 10.70.35.141:/bricks/brick2/b2
Brick15: 10.70.35.142:/bricks/brick2/b2
Brick16: 10.70.35.140:/bricks/brick2/b2
Options Reconfigured:
cluster.watermark-hi: 2
cluster.watermark-low: 1
features.barrier: disable
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
auto-delete: enable
snap-activate-on-create: disable
3) Created around 20000 (100KB) files from the mount point.
4) As soon as the demotion of files started, ran a script creating 50 snapshots with a time interval of 10 seconds between each creation.
5) Observed that every second snap creation failed with the "failed to pause tier in snap prevalidate" message:

[2015-12-08 10:22:26.442693] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-12-08 10:22:26.443009] E [MSGID: 106572] [glusterd-snapshot.c:2586:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-12-08 10:22:26.443037] W [MSGID: 106030] [glusterd-snapshot.c:8386:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-12-08 10:22:26.443330] W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2015-12-08 10:22:26.443365] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2015-12-08 10:23:04.746618] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2015-12-08 10:23:15.549670] I [MSGID: 106488] [glusterd-handler.c:1472:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
The message "W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed" repeated 2 times between [2015-12-08 10:22:26.443330] and [2015-12-08 10:23:55.522315]
[2015-12-08 10:23:55.522320] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
[2015-12-08 10:23:15.552542] I [MSGID: 106488] [glusterd-handler.c:1472:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
The message "E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)" repeated 2 times between [2015-12-08 10:22:26.442693] and [2015-12-08 10:23:55.522289]
The message "E [MSGID: 106572] [glusterd-snapshot.c:2586:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate." repeated 2 times between [2015-12-08 10:22:26.443009] and [2015-12-08 10:23:55.522304]
The message "W [MSGID: 106030] [glusterd-snapshot.c:8386:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed" repeated 2 times between [2015-12-08 10:22:26.443037] and [2015-12-08 10:23:55.522312]

6) Out of 50 snapshots, 25 failed with the above error messages in the logs.

Note: sosreports are at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1278798/New
Tested with the time interval increased to 60 seconds between two snap creations, with continuous promotions and demotions in progress. Around 10 out of 30 snapshot creations failed with the same "Failed to pause tier" error message.
With the latest build, glusterfs-3.7.5-11, snapshot creations scheduled via snap_scheduler every 30 minutes fail with the below error messages:

[2015-12-15 11:00:03.488940] E [MSGID: 106062] [glusterd-snapshot.c:1504:glusterd_snap_create_clone_pre_val_use_rsp_dict] 0-management: failed to get the volume count
[2015-12-15 11:00:03.488966] E [MSGID: 106062] [glusterd-snapshot.c:1813:glusterd_snap_pre_validate_use_rsp_dict] 0-management: Unable to use rsp dict
[2015-12-15 11:00:03.489100] E [MSGID: 106122] [glusterd-mgmt.c:658:glusterd_pre_validate_aggr_rsp_dict] 0-management: Failed to aggregate prevalidate response dictionaries.
[2015-12-15 11:00:03.489109] E [MSGID: 106108] [glusterd-mgmt.c:760:gd_mgmt_v3_pre_validate_cbk_fn] 0-management: Failed to aggregate response from node/brick
[2015-12-15 11:00:03.489158] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on 10.70.35.140. Please check log file for details.
[2015-12-15 11:00:03.489207] E [MSGID: 106122] [glusterd-mgmt.c:942:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed on peers
[2015-12-15 11:00:03.489226] E [MSGID: 106122] [glusterd-mgmt.c:2224:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed
[2015-12-15 11:00:03.487870] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-12-15 11:00:03.488025] E [MSGID: 106572] [glusterd-snapshot.c:2586:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-12-15 11:00:03.488066] W [MSGID: 106030] [glusterd-snapshot.c:8386:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-12-15 11:00:03.488079] W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2015-12-15 11:00:03.488104] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot
Verified this bug again with the 3.7.5-13 build, and the issue is still reproducible, with very inconsistent behavior.

[root@dhcp35-228 ~]# rpm -qa|grep glusterfs
glusterfs-libs-3.7.5-13.el7rhgs.x86_64
glusterfs-api-3.7.5-13.el7rhgs.x86_64
glusterfs-rdma-3.7.5-13.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-12.el7rhgs.x86_64
glusterfs-3.7.5-13.el7rhgs.x86_64
glusterfs-fuse-3.7.5-13.el7rhgs.x86_64
glusterfs-cli-3.7.5-13.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-13.el7rhgs.x86_64
glusterfs-client-xlators-3.7.5-13.el7rhgs.x86_64
glusterfs-server-3.7.5-13.el7rhgs.x86_64

1) With a time interval of 30s between each snap creation, running 50 snapshot creations in a loop.
Result: 3 out of 50 snapshots fail with a pre-validation failed message.
2) With a time interval of 60s between each snap creation, running 50 snapshot creations in a loop.
Result: 13 out of 50 snapshots fail with a pre-validation failed message.
3) With a time interval of 300s between each snap creation, running 10 snapshot creations in a loop.
Result: 3 out of 10 snapshots fail with a pre-validation failed message.

All the snap creations were performed while promotion and demotion of files were going on. The failures are a mix of the below two error messages:

1)
[2015-12-23 09:28:30.682950] E [MSGID: 106062] [glusterd-snapshot.c:1504:glusterd_snap_create_clone_pre_val_use_rsp_dict] 0-management: failed to get the volume count
[2015-12-23 09:28:30.683168] E [MSGID: 106062] [glusterd-snapshot.c:1813:glusterd_snap_pre_validate_use_rsp_dict] 0-management: Unable to use rsp dict
[2015-12-23 09:28:30.683185] E [MSGID: 106122] [glusterd-mgmt.c:658:glusterd_pre_validate_aggr_rsp_dict] 0-management: Failed to aggregate prevalidate response dictionaries.
[2015-12-23 09:28:30.683201] E [MSGID: 106108] [glusterd-mgmt.c:760:gd_mgmt_v3_pre_validate_cbk_fn] 0-management: Failed to aggregate response from node/brick
[2015-12-23 09:28:30.683255] E [MSGID: 106116] [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Pre Validation failed on 10.70.35.140. Please check log file for details.
[2015-12-23 09:28:32.680682] E [MSGID: 106122] [glusterd-mgmt.c:942:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed on peers
[2015-12-23 09:28:32.680733] E [MSGID: 106122] [glusterd-mgmt.c:2224:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed

2)
[2015-12-23 09:28:29.253855] E [MSGID: 106572] [glusterd-snapshot.c:1998:glusterd_snapshot_pause_tier] 0-management: Failed to pause tier. Errstr=(null)
[2015-12-23 09:28:30.681882] E [MSGID: 106572] [glusterd-snapshot.c:2586:glusterd_snapshot_create_prevalidate] 0-management: Failed to pause tier in snap prevalidate.
[2015-12-23 09:28:30.681918] W [MSGID: 106030] [glusterd-snapshot.c:8386:glusterd_snapshot_prevalidate] 0-management: Snapshot create pre-validation failed
[2015-12-23 09:28:30.681937] W [MSGID: 106122] [glusterd-mgmt.c:167:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
[2015-12-23 09:28:30.681972] E [MSGID: 106122] [glusterd-mgmt-handler.c:337:glusterd_handle_pre_validate_fn] 0-management: Pre Validation failed on operation Snapshot

The failure is very inconsistent, and we can't predict when it will occur. With the above observations, assigning the bug back. Let me know in case any more information is required from my side.
Can you recreate the sos reports with DEBUG logs?
Can you get the debug logs and sos reports for the same?
sosreports from two nodes are at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1278798/Latest/. I am facing some issues while generating the sosreports for the other two nodes, as the generation hangs. I will resolve that and update them as well; meanwhile, you can look into the provided data.
*** Bug 1286761 has been marked as a duplicate of this bug. ***
When the database gets populated with many entries, queries take longer, and it is possible the timeout will be exceeded at that point. A callback function is called for every entry found in the query; we should check in that function whether pause tier has been invoked. Another, rarer issue exists in which a file being migrated at the same time as it is being updated takes a long time; this also appears to lead to timeouts. That problem appears to be embedded in legacy DHT logic and is present on brick add/removal. As it is longstanding, it will not be covered in this bug.
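The shape of the proposed per-entry check can be illustrated with a standalone shell analogue. This is not the glusterd code; `process_entries` and the pause-flag file are hypothetical stand-ins for the query callback and the pause-tier state. The point is that the consumer stops draining the query result as soon as a pause has been requested, instead of processing every remaining entry and overrunning the pause timeout.

```shell
# Hypothetical analogue of the proposed fix: a per-entry loop that
# checks a pause flag on every iteration and bails out early.
process_entries() {
    # $1 = pause flag file; entries are read from stdin, one per line
    flag=$1
    processed=0
    while read -r entry; do
        if [ -e "$flag" ]; then
            break    # pause requested: stop instead of draining the query
        fi
        processed=$((processed + 1))   # "migrate" the entry here
    done
    echo "$processed"
}
```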
Performed the test with the 3.7.5-14 build packages which were provided, and out of 30, 12 snapshots failed with "Failed to pause tier". This test was performed on nodes created over VMs, and I am yet to try it on physical machines. The sosreport from the test is placed under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1278798/3.7.5-14/ Will update once I have the results from the physical machine runs.
Was running a script to create 100000 (100KB) files. Continuous IO was going on during snap creation, and demotions/promotions started at around ~40000 (100KB) files. Kept watermark-low at 5 and watermark-hi at 20.
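The file-creation workload described above can be sketched as follows. This is a hypothetical helper, not the actual test script; the function name and mount path are assumptions.

```shell
# Hypothetical sketch of the file-creation workload: write $2 files of
# 100KB each into directory $1 (the FUSE mount point in the real test).
create_files() {
    dir=$1; count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        dd if=/dev/urandom of="$dir/file$i" bs=1K count=100 2>/dev/null
        i=$((i + 1))
    done
}
```

Something like `create_files /mnt/tiervolume 100000` would reproduce the load, where the mount path is an assumption.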
Are you doing write updates to existing files, or 100% new file creates? I suspect you hit issue #2 in comment 20; I will see if I can confirm that in the sos logs.
Can you rerun this with DEBUG logs and post them?

gluster v set <vol> diagnostics.client-log-level DEBUG

Here is how I recreated the original problem last week:
1. Two physical nodes with a tiered volume.
2. Create 40000 files.
3. In parallel, write-update the same 100 files in a loop.
4. Create a snapshot every 30 seconds, delete it, repeat.

I ran in tiering "test" mode to force promotion/demotion to happen frequently. I'll try 100K files. Are there any other things you are doing that I am not doing? In the sos logs I do not see the same patterns as I did earlier; I do not see any errors in the server-side logs this time.
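The parallel write-update load (step 3) can be sketched like this. The `update_files` helper and the `file$n` names are assumptions, not the actual test script.

```shell
# Hypothetical sketch of the write-update load: append to the same 100
# files repeatedly so they stay hot and keep tier migration busy.
update_files() {
    # $1 = directory, $2 = number of rounds
    dir=$1; rounds=$2
    r=1
    while [ "$r" -le "$rounds" ]; do
        n=1
        while [ "$n" -le 100 ]; do
            echo "round $r" >> "$dir/file$n"
            n=$((n + 1))
        done
        r=$((r + 1))
    done
}
```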
Once the demotions start, I write to a specific set of files as well. The steps you mentioned above look pretty much the same, apart from the fact that I am running the tests in cache mode with watermark-low as 5 and watermark-hi as 20. Will share the results with debug logs as well.
Results from the physical machine runs: 11 out of 50 snapshot creations failed. The logs from the physical machine run in test mode can be found under: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1278798/Final/
Wrote fix 13104 upstream.

1. Borrowed performance machines and created thin volumes over physical drives, rather than loopback devices.
2. Created 200K files.
3. Continuously updated 100 files in a loop, slept, repeated every minute.
4. Ran overnight, creating snapshots every 45 seconds.

With fix 13104 I saw no errors in step #4. I'll bump step #2 to 500K files. I suspect the loopback devices I had earlier created in step #1 masked problems in my earlier tests, as they are slower. Please let me know if this differs from your tests. I'll start getting the fix reviewed. I may ask you to run the fix for sanity and will work through that with Vivek.
Verified this bug with the latest build, glusterfs-3.7.5-15, and the results look good: none of the 70 snap creations failed during continuous demotions/promotions, with the same set of tests which caused failures earlier.

Build details below:
[root@dhcp35-141 b3]# gluster --version
glusterfs 3.7.5 built on Jan 7 2016 23:32:29
[root@dhcp35-141 b3]# rpm -qa|grep glusterfs
glusterfs-client-xlators-3.7.5-15.el7rhgs.x86_64
glusterfs-cli-3.7.5-15.el7rhgs.x86_64
glusterfs-ganesha-3.7.5-15.el7rhgs.x86_64
glusterfs-3.7.5-15.el7rhgs.x86_64
glusterfs-api-3.7.5-15.el7rhgs.x86_64
glusterfs-fuse-3.7.5-15.el7rhgs.x86_64
glusterfs-server-3.7.5-15.el7rhgs.x86_64
glusterfs-api-devel-3.7.5-15.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-15.el7rhgs.x86_64
glusterfs-libs-3.7.5-15.el7rhgs.x86_64
glusterfs-devel-3.7.5-15.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-15.el7rhgs.x86_64

However, keeping this bug open and letting the state remain ON_QA till we finish regression. It will be marked as verified once we complete the regression cycle.
Moving this bug to the Verified state. It will be reopened if we hit this issue again during the regression cycle.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html