Bug 761808 (GLUSTER-76) - Errors on one server going down
Summary: Errors on one server going down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-76
Product: GlusterFS
Classification: Community
Component: replicate
Version: pre-2.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Vikas Gorur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-06-25 11:13 UTC by Basavanagowda Kanur
Modified: 2009-11-16 11:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Basavanagowda Kanur 2009-06-25 11:13:15 UTC
[Migrated from RT] - ticket 979 - [http://support.gluster.com/rt/Ticket/Display.html?id=979]

Wed Apr 22 03:13:15 2009 guru - Ticket created

version: glusterfs 2.0.0 rc8

* 100 TB cluster
* 7 server distribute over replicate
* Rebooted brick7 (which was exporting brick7-ib and brick7-tcp)
* brick7-ib and brick8-tcp were afr'd, brick6-ib and brick7-tcp were
afr'd. So bringing down one server and back up should have been handled
without errors.


Kernel compile (on brick3):

i=0; while true; do ((i++)); echo "======= $i =======" | tee -a results;
/opt/qa/tools/kernel_compile.sh linux-2.6.29.tar.bz2 || break; rm -rf
linux-2.6.29 || break; done
..
======= 2 =======
Extracting Tarball ..

bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: File descriptor in bad state
Input file = (stdin), output file = (stdout)
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now



dbench (on brick6):
i=0; while true; do echo "===== $i ====="; ((i++));
/opt/benchmarks/dbench-4.0/bin/dbench -s -S -F 48 || break; done
..
48 5943 5.32 MB/sec execute 168 sec latency 17396.433 ms
48 5943 5.29 MB/sec execute 169 sec latency 18397.434 ms
48 5944 5.26 MB/sec execute 170 sec latency 19207.415 ms
[5645] open ./clients/client17/~dmtmp/PARADOX/ANSWER.DB failed for
handle 11085 (No such file or directory)
(5646) ERROR: handle 11085 was not found
[5825] open ./clients/client13/~dmtmp/SEED/MEDIUM.FIL failed for handle
11121 (No such file or directory)
(5826) ERROR: handle 11121 was not found
Child failed with status 1
[root@brick6 dbench]# [5712] open
./clients/client6/~dmtmp/PARADOX/STUDENTS.XG0 failed for handle 11093
(No such file or directory)
[5825] open ./clients/client1/~dmtmp/SEED/MEDIUM.FIL failed for handle
11121 (No such file or directory)

-------------------------------------------------------------------------------
# Wed Apr 22 17:33:27 2009 gowda - Correspondence added

On Wed Apr 22 03:13:15 2009, guru wrote:
> * 100 TB cluster
> * 7 server distribute over replicate
> * Rebooted brick7 (which was exporting brick7-ib and brick7-tcp)
> * brick7-ib and brick8-tcp were afr'd, brick6-ib and brick7-tcp were
> afr'd. So bringing down one server and back up should have been handled
> without errors.
>
>
> Kernel compile (on brick3):
>
> i=0; while true; do ((i++)); echo "======= $i =======" | tee -a results;
> /opt/qa/tools/kernel_compile.sh linux-2.6.29.tar.bz2 || break; rm -rf
> linux-2.6.29 || break; done
> ..
> ======= 2 =======
> Extracting Tarball ..
>
> bzip2: I/O or other error, bailing out. Possible reason follows.
> bzip2: File descriptor in bad state
> Input file = (stdin), output file = (stdout)
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
>
From the log messages for brick3, I saw that afr7 (a subvolume of dht) was
down and that the kernel tarball hashes to afr7 (I assumed that the backend
is still valid). EBADFD ("File descriptor in bad state", as bzip2 reports
above) is expected when a subvolume is completely down.
>
>
> dbench (on brick6):
> i=0; while true; do echo "===== $i ====="; ((i++));
> /opt/benchmarks/dbench-4.0/bin/dbench -s -S -F 48 || break; done
> ..
> 48 5943 5.32 MB/sec execute 168 sec latency 17396.433 ms
> 48 5943 5.29 MB/sec execute 169 sec latency 18397.434 ms
> 48 5944 5.26 MB/sec execute 170 sec latency 19207.415 ms
> [5645] open ./clients/client17/~dmtmp/PARADOX/ANSWER.DB failed for
> handle 11085 (No such file or directory)
> (5646) ERROR: handle 11085 was not found
> [5825] open ./clients/client13/~dmtmp/SEED/MEDIUM.FIL failed for handle
> 11121 (No such file or directory)
> (5826) ERROR: handle 11121 was not found
> Child failed with status 1
> [root@brick6 dbench]# [5712] open
> ./clients/client6/~dmtmp/PARADOX/STUDENTS.XG0 failed for handle 11093
> (No such file or directory)
> [5825] open ./clients/client1/~dmtmp/SEED/MEDIUM.FIL failed for handle
> 11121 (No such file or directory)


-- 
gowda
--------------------------------------------------------------------------------
# Wed Apr 22 17:45:57 2009 gowda - Correspondence added

Please note that the client log from brick6 is 3.8 GB.

Please specify the circumstances under which dbench failed.
-- 
gowda

Comment 1 Vikas Gorur 2009-07-09 10:42:43 UTC
Observed a similar error in a simple AFR configuration too.

The setup is a simple 2-way client-side AFR. dbench was running when one of the servers was killed and restarted. dbench failed with this error:

[21874] unlink ./clients/client3/~dmtmp/EXCEL/SALES1.XLS failed (No such file or directory) - expected NT_STATUS_OK
ERROR: child 3 failed at line 21874

Spec files and log files are in /share/tickets/<bug id>.
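
For reference, the expectation in the original report (rebooting one server should not cause errors) can be checked with a toy model of the pairing described there. This is only a sketch and not GlusterFS code: the pair names pair_a and pair_b and both helper functions are hypothetical, and it assumes that a replicate pair keeps serving I/O as long as at least one of its children is reachable.

```bash
#!/bin/sh
# Toy model only, not GlusterFS code.
# Assumption: a replicate (afr) pair is usable while at least one child is up.

# children PAIR
# Prints the two exports of a pair, as listed in the original report.
# The names pair_a and pair_b are made up for this sketch.
children() {
  case $1 in
    pair_a) echo "brick7-ib brick8-tcp" ;;
    pair_b) echo "brick6-ib brick7-tcp" ;;
  esac
}

# pair_state PAIR [DOWN_EXPORT...]
# Prints "up" if any child of PAIR is not in the down list, otherwise "down".
pair_state() {
  p=$1
  shift
  down_list=" $* "
  for c in $(children "$p"); do
    case $down_list in
      *" $c "*) ;;                # this child is down, check the next one
      *) echo up; return 0 ;;     # a surviving child keeps the pair up
    esac
  done
  echo down
}

# Rebooting the brick7 server takes both of its exports offline.
for name in pair_a pair_b; do
  echo "$name: $(pair_state "$name" brick7-ib brick7-tcp)"
done
```

Under that assumption both pairs stay up after the reboot, because each one loses only a single child. A pair reports down only when both of its children are unreachable, for example `pair_state pair_b brick6-ib brick7-tcp`, which is the kind of state the log analysis above describes for afr7.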

