Description of problem:
Smoke tests seem to be failing for different patches for reasons unrelated to the patches themselves:
These are the dbench failures.
Seemingly all these are on slave34. I'm going to quickly sample all the smoke failures to see if they're all happening on slave34.
Data from the last 100 smoke jobs:
Michael and Rajesh, do you have any idea what's going on with these failures? Any pointers on where to start debugging would also be useful.
I'm inclined to think that this is a Gluster bug rather than an infra issue. What could cause "Transport endpoint is not connected" errors from Gluster?
From what I can see from rudimentary searches:
"Transport endpoint is not connected" most likely means the FUSE mount has crashed. If so, that almost certainly means an intermittent bug has slipped past us.
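As a side note, a dead FUSE mount can be detected from a script before any cleanup is attempted. This is a minimal sketch (the mount path and the lazy-unmount recovery are illustrative, not part of smoke.sh):

```shell
#!/bin/sh
# Any syscall on a crashed FUSE mount fails with ENOTCONN
# ("Transport endpoint is not connected"), so a plain stat is
# enough to probe it.
MNT=${1:-/mnt}

if stat "$MNT" >/dev/null 2>&1; then
    echo "$MNT looks healthy"
else
    echo "$MNT is dead, lazy-unmounting so cleanup can proceed"
    # Lazy unmount detaches the broken mount even while it is busy.
    umount -l "$MNT" 2>/dev/null || fusermount -uz "$MNT" 2>/dev/null
fi
```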
I'm guessing Pranith was looking for the /var/log/glusterfs files. I've added the logic to grab them for smoke; we never did collect them for smoke runs before: https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/62/files
The script: https://github.com/gluster/glusterfs-patch-acceptance-tests/blob/master/smoke.sh
Success case: https://build.gluster.org/job/smoke/30870/console
10:47:56 + wait %3
---> This is printed once the wait on %2 completed, so dbench was done by this point and the script started waiting on %3 (with set -x, a line is traced just before it is executed)
10:48:51 All tests successful.
10:48:51 Files=191, Tests=1960, 129 wallclock secs ( 1.28 usr 0.36 sys + 9.57 cusr 7.43 csys = 18.64 CPU)
10:48:51 Result: PASS
---> %3 (compliance) completed (took about 129 seconds, dbench would take about 71-72 seconds including the warmup), so the wait above was over and we proceed
---> cleanup starts
10:48:51 + rm -rf clients
10:48:53 + cd -
10:48:53 + finish
10:48:53 + RET=0
---> NOTE: RET here captures the exit status of rm -rf clients (via $?), not sure if this is intended
10:48:53 + '[' 0 -ne 0 ']'
10:48:53 + cleanup
---> cleanup is invoked by finish; its commands are traced here because the script has set -x enabled (the watchdog's cleanup in the failed case is not traced this way)
10:48:53 + killall -15 glusterfs glusterfsd glusterd
---> All well!
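For readers without the script open, the control flow the trace above walks through looks roughly like this (a simplified sketch of smoke.sh with sleeps standing in for the real test jobs; the real script is linked above):

```shell
#!/bin/bash
set -x

cleanup () {
    # In the real script this is: killall -15 glusterfs glusterfsd glusterd
    :
}

finish () {
    # NOTE: $? at this point is the exit status of whatever command ran
    # last before finish was called (in the trace above, the cleanup of
    # the clients directory), not the status of the background test jobs.
    RET=$?
    if [ "$RET" -ne 0 ]; then
        cat /build/dbench-logs
    fi
    cleanup
    exit "$RET"
}

sleep 1 &   # %1: stand-in for one test job
sleep 1 &   # %2: stand-in for dbench
sleep 2 &   # %3: stand-in for the POSIX compliance run

wait %2     # returns only once dbench has exited
wait %3     # returns immediately if compliance has already finished
rm -rf clients
finish
```

If dbench never exits, `wait %2` blocks forever, which is where the watchdog comes in below.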
Failure case: https://build.gluster.org/job/smoke/30852/console
00:03:16 All tests successful.
00:03:16 Files=191, Tests=1960, 93 wallclock secs ( 0.89 usr 0.26 sys + 5.46 cusr 3.30 csys = 9.91 CPU)
00:03:16 Result: PASS
00:11:36 Kicking in watchdog after 600 secs
---> Where are the watchdog's cleanup calls logged? It appears the watchdog runs outside the script's set -x context, hence its cleanup is not traced here
---> Assuming the watchdog's cleanup was called, it killed all gluster processes; dbench finally errored out on its read (no connection), and hence %2 completed
00:11:36 + wait %3
---> The wait for %3 starts and returns immediately, as compliance finished about 8 minutes earlier (00:03:16)
00:11:36 + rm -rf clients
00:11:36 rm: cannot remove `clients': Transport endpoint is not connected
---> We cannot remove it because the watchdog killed the FUSE client, leaving the mount dead, so this rm -rf fails (cleanup failed here; is this an issue for the next run?)
00:11:36 + finish
00:11:36 + RET=1
---> rm -rf failed and RET caught that; is this what was intended?
00:11:36 + '[' 1 -ne 0 ']'
00:11:36 + cat /build/dbench-logs
00:11:36 10 cleanup 581 sec
---> dbench had been attempting cleanup for 580-odd seconds
00:11:36  read failed on handle 10007 (Transport endpoint is not connected)
---> The dbench clients finally get an error because the watchdog shut the processes (and hence the volume) down; we get connection errors and dbench exits
00:11:36 + cleanup
---> Called by finish, and everything fails as watchdog has cleaned up already
00:11:36 + killall -15 glusterfs glusterfsd glusterd
00:11:36 glusterfs: no process killed
00:11:36 glusterfsd: no process killed
00:11:36 glusterd: no process killed
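The %3 behaviour noted above (the wait returning instantly for a job that exited ~8 minutes earlier) is normal shell semantics: waiting on an already-exited background job returns immediately with that job's stored exit status. A quick illustration (using $! rather than %n job specs, but the semantics are the same):

```shell
#!/bin/bash
sleep 0.1 &          # background job
pid=$!
sleep 1              # the job has long since exited by now
wait "$pid"          # returns immediately with the saved exit status
echo "wait returned $? (instantly)"

( exit 7 ) &         # a job that exits nonzero
pid=$!
sleep 0.5
wait "$pid"          # the stored status (7) is still reported
echo "wait returned $?"
```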
Looks like dbench got stuck at https://github.com/sahlberg/dbench/blob/master/fileio.c#L400 (or in pread) and was never able to break out of it. As a result, dbench never completed until the volume and the mount were taken down, at which point it errored out.
Why it got stuck here, would be the next question I guess.
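One way to answer that is a statedump of the hung client: glusterfs processes write a dump of their internal state (including in-flight fops, per translator) to /var/run/gluster when they receive SIGUSR1. A minimal sketch of triggering and listing one (paths assume a standard install):

```shell
#!/bin/sh
# Ask every running glusterfs client process to dump its internal state.
# The dump lands in /var/run/gluster as <name>.<pid>.dump.<timestamp>.
for pid in $(pidof glusterfs); do
    kill -USR1 "$pid"
done

# Writing the dump is asynchronous; give the processes a moment.
sleep 5

# List whatever appeared (unquoted glob on purpose).
ls /var/run/gluster/*dump* 2>/dev/null || echo "no statedumps found"
```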
Moving to core as this is quite certainly not an infra issue.
*** Bug 1381016 has been marked as a duplicate of this bug. ***
Assigning this to Pranith as he's debugging this now.
https://bugzilla.redhat.com/show_bug.cgi?id=1381433#c7 has the modified script which we need to deploy to debug this further. The modified script collects statedumps of the client and server and moves them to /var/log/glusterfs so that they will be available in the tarball. Nigel, could you deploy it whenever you have time?
Why don't you submit a pull request to https://github.com/gluster/glusterfs-patch-acceptance-tests with your changes so I can review and merge them? If we get it merged today, I can trigger a deploy across all machines.
Perhaps the 5 second wait isn't enough? The set -x trace shows mv received the literal, unexpanded glob, i.e. no statedump files existed yet when it ran:
06:22:41 + mv '/var/run/gluster/*dump*' /var/log/glusterfs/
06:22:41 mv: cannot stat `/var/run/gluster/*dump*': No such file or directory
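Rather than bumping the fixed sleep, the collection step could poll until the dump files actually appear (a sketch; the 30-second ceiling is arbitrary):

```shell
#!/bin/sh
# Wait up to $timeout seconds for statedump files before moving them.
DUMPDIR=${DUMPDIR:-/var/run/gluster}
DEST=${DEST:-/var/log/glusterfs}
timeout=${1:-30}

while [ "$timeout" -gt 0 ]; do
    # Unquoted glob: it only expands if at least one dump file exists.
    set -- "$DUMPDIR"/*dump*
    [ -e "$1" ] && break
    sleep 1
    timeout=$((timeout - 1))
done

mv "$DUMPDIR"/*dump* "$DEST"/ 2>/dev/null || echo "no statedumps to collect"
```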
[2016-11-14 14:46:53.043548] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 81138: LOOKUP() /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 => -1 (File name too long)
[2016-11-14 14:46:53.043483] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-1: remote operation failed. Path: /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 (00000000-0000-0000-0000-000000000000) [File name too long]
[2016-11-14 14:46:53.043517] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-0: remote operation failed. Path: /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 (00000000-0000-0000-0000-000000000000) [File name too long]
[2016-11-14 14:54:53.930498] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7aa1) [0x7f620a9f1aa1] -->glusterfs(glusterfs_sigwaiter+0xe4) [0x409c58] -->glusterfs(cleanup_and_exit+0x87) [0x408092] ) 0-: received signum (15), shutting down
[2016-11-14 14:54:53.930539] I [fuse-bridge.c:5797:fini] 0-fuse: Unmounting '/mnt'.
[2016-11-14 14:54:53.932418] W [socket.c:588:__socket_rwv] 0-patchy-client-1: readv on 22.214.171.124:49153 failed (No data available)
I don't know why there are 2 statedumps for the mnt process.
I've taken slave34 offline and Vijay's debugging smoke test failures on there now.
Created attachment 1221016 [details]
statedump from the hung glusterfs process
The problem seems related to READ fops not going beyond write-behind. This can be seen in the attached statedump (comment 19). Still need to determine why READs are not being percolated down from write-behind.
Possible fix: http://review.gluster.org/15901
Please refer to the commit message for a more detailed explanation:
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead
of STACK_WIND(). One such case is when inode_ctx for that file
is not present (can happen if readdirp was called, and populates
md-cache and serves all the lookups from cache).
Consider the following graph (parent to child): io-cache -> readdir-ahead -> read-ahead.
Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

    STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                     FIRST_CHILD (frame->this)->fops->readv, fd,
                     size, offset, flags, xdata);
    /* Ideally, this stack wind should wind to readdir-ahead:readv(),
     * but it winds to read-ahead:readv(). See below. */

STACK_WIND_TAIL (frame, obj, fn, ...) expands, in sequence, to:

    frame->this = obj;
    /* For the above mentioned graph, frame->this will be set to
     * FIRST_CHILD (frame->this), i.e. readdir-ahead, which is as
     * expected. */

    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was already changed
     * to readdir-ahead by the previous statement. */

    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!,
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was changed to readdir-ahead in the first statement. */
Thus, the readdir-ahead's readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hang / other problems.
But in this particular case, when the 'frame->this' and 'this' passed
to ra_readv() don't match, it causes ra_readv() to call ra_readv()
again! Thus the logic of read-ahead's readv() falls apart and leads to
the hang seen in this bug.
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.