Description of problem:
Smoke tests seem to be failing for different patches for reasons unrelated to the patches themselves:
These are the dbench failures.
Seemingly all these are on slave34. I'm going to quickly sample all the smoke failures to see if they're all happening on slave34.
Data from the last 100 smoke jobs:
Michael and Rajesh, do you have any idea what's going on with these failures? Any pointers on where to start debugging would also be useful.
I'm inclined to think that this is a Gluster bug rather than an infra issue. What could cause "Transport endpoint is not connected" errors from Gluster?
From what I can see from rudimentary searches:
"Transport endpoint is not connected" most likely means the FUSE mount has crashed. If so, that almost certainly means an intermittent bug has slipped past us.
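As a side note, a dead FUSE mount can be detected from a script before any cleanup is attempted. This is a minimal sketch (the mount path and the lazy-unmount recovery are illustrative, not part of smoke.sh):

```shell
#!/bin/sh
# Any syscall on a crashed FUSE mount fails with ENOTCONN
# ("Transport endpoint is not connected"), so a plain stat is
# enough to probe it.
MNT=${1:-/mnt}

if stat "$MNT" >/dev/null 2>&1; then
    echo "$MNT looks healthy"
else
    echo "$MNT is dead, lazy-unmounting so cleanup can proceed"
    # Lazy unmount detaches the broken mount even while it is busy.
    umount -l "$MNT" 2>/dev/null || fusermount -uz "$MNT" 2>/dev/null
fi
```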
I'm guessing Pranith was looking for the /var/log/glusterfs files. I've added the logic to grab them for smoke; we never did collect them for smoke runs before: https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/62/files
The script: https://github.com/gluster/glusterfs-patch-acceptance-tests/blob/master/smoke.sh
Success case: https://build.gluster.org/job/smoke/30870/console
10:47:56 + wait %3
---> This is printed once the wait on %2 completed, so dbench was done by this point and the script started waiting on %3 (with set -x, a line is traced just before it is executed)
10:48:51 All tests successful.
10:48:51 Files=191, Tests=1960, 129 wallclock secs ( 1.28 usr 0.36 sys + 9.57 cusr 7.43 csys = 18.64 CPU)
10:48:51 Result: PASS
---> %3 (compliance) completed (took about 129 seconds, dbench would take about 71-72 seconds including the warmup), so the wait above was over and we proceed
---> cleanup starts
10:48:51 + rm -rf clients
10:48:53 + cd -
10:48:53 + finish
10:48:53 + RET=0
---> NOTE: RET here captures the exit status of rm -rf clients (via $?), not sure if this is intended
10:48:53 + '[' 0 -ne 0 ']'
10:48:53 + cleanup
---> cleanup is invoked by finish; its commands are traced here because the script has set -x enabled (the watchdog's cleanup in the failed case is not traced this way)
10:48:53 + killall -15 glusterfs glusterfsd glusterd
---> All well!
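For readers without the script open, the control flow the trace above walks through looks roughly like this (a simplified sketch of smoke.sh with sleeps standing in for the real test jobs; the real script is linked above):

```shell
#!/bin/bash
set -x

cleanup () {
    # In the real script this is: killall -15 glusterfs glusterfsd glusterd
    :
}

finish () {
    # NOTE: $? at this point is the exit status of whatever command ran
    # last before finish was called (in the trace above, the cleanup of
    # the clients directory), not the status of the background test jobs.
    RET=$?
    if [ "$RET" -ne 0 ]; then
        cat /build/dbench-logs
    fi
    cleanup
    exit "$RET"
}

sleep 1 &   # %1: stand-in for one test job
sleep 1 &   # %2: stand-in for dbench
sleep 2 &   # %3: stand-in for the POSIX compliance run

wait %2     # returns only once dbench has exited
wait %3     # returns immediately if compliance has already finished
rm -rf clients
finish
```

If dbench never exits, `wait %2` blocks forever, which is where the watchdog comes in below.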
Failure case: https://build.gluster.org/job/smoke/30852/console
00:03:16 All tests successful.
00:03:16 Files=191, Tests=1960, 93 wallclock secs ( 0.89 usr 0.26 sys + 5.46 cusr 3.30 csys = 9.91 CPU)
00:03:16 Result: PASS
00:11:36 Kicking in watchdog after 600 secs
---> Where are the watchdog's cleanup calls logged? It appears the watchdog runs outside the script's set -x context, hence its cleanup is not traced here
---> Assuming the watchdog's cleanup was called, it killed all gluster processes; dbench finally errored out on its read (no connection), and hence %2 completed
00:11:36 + wait %3
---> The wait for %3 starts and returns immediately, as compliance finished about 8 minutes earlier (00:03:16)
00:11:36 + rm -rf clients
00:11:36 rm: cannot remove `clients': Transport endpoint is not connected
---> We cannot remove it because the watchdog killed the FUSE client, leaving the mount dead, so this rm -rf fails (cleanup failed here; is this an issue for the next run?)
00:11:36 + finish
00:11:36 + RET=1
---> rm -rf failed and RET caught that; is this what was intended?
00:11:36 + '[' 1 -ne 0 ']'
00:11:36 + cat /build/dbench-logs
00:11:36 10 cleanup 581 sec
---> dbench had been attempting cleanup for 580-odd seconds
00:11:36  read failed on handle 10007 (Transport endpoint is not connected)
---> The dbench clients finally get an error because the watchdog shut the processes (and hence the volume) down; we get connection errors and dbench exits
00:11:36 + cleanup
---> Called by finish, and everything fails as watchdog has cleaned up already
00:11:36 + killall -15 glusterfs glusterfsd glusterd
00:11:36 glusterfs: no process killed
00:11:36 glusterfsd: no process killed
00:11:36 glusterd: no process killed
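The %3 behaviour noted above (the wait returning instantly for a job that exited ~8 minutes earlier) is normal shell semantics: waiting on an already-exited background job returns immediately with that job's stored exit status. A quick illustration (using $! rather than %n job specs, but the semantics are the same):

```shell
#!/bin/bash
sleep 0.1 &          # background job
pid=$!
sleep 1              # the job has long since exited by now
wait "$pid"          # returns immediately with the saved exit status
echo "wait returned $? (instantly)"

( exit 7 ) &         # a job that exits nonzero
pid=$!
sleep 0.5
wait "$pid"          # the stored status (7) is still reported
echo "wait returned $?"
```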
Looks like dbench got stuck at https://github.com/sahlberg/dbench/blob/master/fileio.c#L400 (or in pread) and was never able to break out of it. As a result, dbench never completed until the volume and the mount were taken down, at which point it errored out.
Why it got stuck here, would be the next question I guess.
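One way to answer that is a statedump of the hung client: glusterfs processes write a dump of their internal state (including in-flight fops, per translator) to /var/run/gluster when they receive SIGUSR1. A minimal sketch of triggering and listing one (paths assume a standard install):

```shell
#!/bin/sh
# Ask every running glusterfs client process to dump its internal state.
# The dump lands in /var/run/gluster as <name>.<pid>.dump.<timestamp>.
for pid in $(pidof glusterfs); do
    kill -USR1 "$pid"
done

# Writing the dump is asynchronous; give the processes a moment.
sleep 5

# List whatever appeared (unquoted glob on purpose).
ls /var/run/gluster/*dump* 2>/dev/null || echo "no statedumps found"
```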
Moving to core as this is quite certainly not an infra issue.
*** Bug 1381016 has been marked as a duplicate of this bug. ***
Assigning this to Pranith as he's debugging this now.
https://bugzilla.redhat.com/show_bug.cgi?id=1381433#c7 has the modified script which we need to deploy to debug this further. The modified script collects statedumps of the client and server and moves them to /var/log/glusterfs so that they will be available in the tarball. Nigel, could you deploy it whenever you have time?
Why don't you submit a pull request to https://github.com/gluster/glusterfs-patch-acceptance-tests with your changes so I can review and merge them? If we get it merged today, I can trigger a deploy across all machines.
Perhaps the 5 second wait isn't enough? The set -x trace shows mv received the literal, unexpanded glob, i.e. no statedump files existed yet when it ran:
06:22:41 + mv '/var/run/gluster/*dump*' /var/log/glusterfs/
06:22:41 mv: cannot stat `/var/run/gluster/*dump*': No such file or directory
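Rather than bumping the fixed sleep, the collection step could poll until the dump files actually appear (a sketch; the 30-second ceiling is arbitrary):

```shell
#!/bin/sh
# Wait up to $timeout seconds for statedump files before moving them.
DUMPDIR=${DUMPDIR:-/var/run/gluster}
DEST=${DEST:-/var/log/glusterfs}
timeout=${1:-30}

while [ "$timeout" -gt 0 ]; do
    # Unquoted glob: it only expands if at least one dump file exists.
    set -- "$DUMPDIR"/*dump*
    [ -e "$1" ] && break
    sleep 1
    timeout=$((timeout - 1))
done

mv "$DUMPDIR"/*dump* "$DEST"/ 2>/dev/null || echo "no statedumps to collect"
```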
[2016-11-14 14:46:53.043548] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 81138: LOOKUP() /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 => -1 (File name too long)
[2016-11-14 14:46:53.043483] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-1: remote operation failed. Path: /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 (00000000-0000-0000-0000-000000000000) [File name too long]
[2016-11-14 14:46:53.043517] W [MSGID: 114031] [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-0: remote operation failed. Path: /_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_12345 (00000000-0000-0000-0000-000000000000) [File name too long]
[2016-11-14 14:54:53.930498] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7aa1) [0x7f620a9f1aa1] -->glusterfs(glusterfs_sigwaiter+0xe4) [0x409c58] -->glusterfs(cleanup_and_exit+0x87) [0x408092] ) 0-: received signum (15), shutting down
[2016-11-14 14:54:53.930539] I [fuse-bridge.c:5797:fini] 0-fuse: Unmounting '/mnt'.
[2016-11-14 14:54:53.932418] W [socket.c:588:__socket_rwv] 0-patchy-client-1: readv on 22.214.171.124:49153 failed (No data available)
I don't know why there are 2 statedumps for the mnt process.
I've taken slave34 offline and Vijay's debugging smoke test failures on there now.
Created attachment 1221016 [details]
statedump from the hung glusterfs process
The problem seems related to READ fops not going beyond write-behind. This can be seen in the attached statedump (comment 19). Still need to determine why READs are not being percolated down from write-behind.
Possible fix: http://review.gluster.org/15901
Please refer to the commit message for a more detailed explanation:
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead
of STACK_WIND(). One such case is when inode_ctx for that file
is not present (can happen if readdirp was called, and populates
md-cache and serves all the lookups from cache).
Consider the following graph (parent to child): io-cache -> readdir-ahead -> read-ahead.
Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

    STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                     FIRST_CHILD (frame->this)->fops->readv, fd,
                     size, offset, flags, xdata);
    /* Ideally, this stack wind should wind to readdir-ahead:readv(),
     * but it winds to read-ahead:readv(). See below. */

STACK_WIND_TAIL (frame, obj, fn, ...) expands, in sequence, to:

    frame->this = obj;
    /* For the above mentioned graph, frame->this will be set to
     * FIRST_CHILD (frame->this), i.e. readdir-ahead, which is as
     * expected. */

    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was already changed
     * to readdir-ahead by the previous statement. */

    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!,
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was changed to readdir-ahead in the first statement. */
Thus, the readdir-ahead's readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hang / other problems.
But in this particular case, when the 'frame->this' and 'this' passed
to ra_readv() don't match, it causes ra_readv() to call ra_readv()
again! Thus the logic of read-ahead's readv() falls apart and leads to
the hang seen in this bug.
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.