Description of problem:

When I run bonnie++ on my 1x2 NFS mounted volume I get:

Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...Can't write block.: Input/output error
Can't write block 1433870.

real    35m28.539s
user    0m0.308s
sys     0m8.005s

In /var/log/messages on the server being mounted I saw:

Jun 21 10:42:29 storage-qe03 GlusterFS[7288]: [2013-06-21 14:42:29.488010] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-0: cannot lookup the saved frame corresponding to xid (137339)
Jun 21 10:43:02 storage-qe03 logger: 2013-06-21 10:43:02 /usr/bin/rhts-test-runner.sh 86519 2640 hearbeat...
Jun 21 10:43:13 storage-qe03 GlusterFS[7288]: [2013-06-21 14:43:13.537970] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-1: cannot lookup the saved frame corresponding to xid (137500)

And on the client I saw:

Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK
Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK
Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK

I have reproduced this on two different sets of bare-metal systems and have attached sosreports from the second reproduction attempt.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.10rhs-1.el6rhs.x86_64

How reproducible:
Both times I have tried so far.

Steps to Reproduce:
1. Run bonnie++ on a 1x2 NFS mounted volume from a RHEL 6.4 client

Actual results:
I/O error on the client.

Expected results:
Normal bonnie++ operation.

Additional info:
Sosreports attached.
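For reference, a minimal sketch of this kind of setup; the hostnames, brick paths, and mount point below are placeholders, not the exact values from the sosreports:

# On the RHS servers: create and start a 1x2 replicated volume
# (placeholder hostnames and brick paths).
gluster volume create testvol replica 2 \
    storage-qe03:/bricks/testvol_brick0 \
    storage-qe04:/bricks/testvol_brick1
gluster volume start testvol

# On the RHEL 6.4 client: mount the volume over Gluster NFS (NFSv3 over TCP)
# and run bonnie++ as root against the mount.
mount -t nfs -o vers=3,tcp storage-qe03:/testvol /mnt/nfsdir
time bonnie++ -u root -d /mnt/nfsdir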
More errors from the server being mounted:

[2013-06-21 14:42:29.010032] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137608x sent = 2013-06-21 14:12:28.982549. timeout = 1800
[2013-06-21 14:42:29.486573] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137607x sent = 2013-06-21 14:12:28.975136. timeout = 1800
[2013-06-21 14:42:29.488010] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-0: cannot lookup the saved frame corresponding to xid (137339)
[2013-06-21 14:42:29.488132] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137606x sent = 2013-06-21 14:12:28.967714. timeout = 1800
[2013-06-21 14:42:29.488328] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137605x sent = 2013-06-21 14:12:28.960302. timeout = 1800
[2013-06-21 14:42:29.488404] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2013-06-21 14:12:29.057388 (xid=0x137609x)
[2013-06-21 14:42:29.493496] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137604x sent = 2013-06-21 14:12:28.952882. timeout = 1800
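The "timeout = 1800" in the call_bail messages appears to correspond to the client-stack RPC frame timeout (the Gluster NFS server acts as a client to the bricks). A rough sketch of how that could be inspected or bumped while debugging; the option name is standard, but the value below is only illustrative, and raising it would not address whatever made the bricks stop responding:

# Reconfigured options (including network.frame-timeout, if set) show up here.
gluster volume info testvol

# Illustrative only: raising the frame timeout delays the bail-out, it does
# not fix the underlying brick disconnects.
gluster volume set testvol network.frame-timeout 3600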
Created attachment 763893 [details] sos server mounted NFS
Created attachment 763894 [details] Sos server
Created attachment 763895 [details] sos client
In the NFS server log, I see the bricks are not connected. Log snippet:
============

0-testvol-client-0
------------------
[2013-06-21 14:42:29.559517] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-testvol-client-0: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:42:29.559565] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2013-06-21 14:12:36.611217 (xid=0x138333x)
[2013-06-21 14:42:29.559572] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-testvol-client-0: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:42:29.559591] W [rpc-clnt.c:1487:rpc_clnt_submit] 0-testvol-client-0: failed to submit rpc-request (XID: 0x156467x Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (testvol-client-0)

0-testvol-client-1
------------------
[2013-06-21 14:43:15.323237] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:43:15.323252] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2013-06-21 14:43:13.538009 (xid=0x155907x)

Could you check why they were down?

Thanks,
Santosh
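(For reference, a few commands that could be run on the servers to check whether the brick processes were actually down around 14:42; the volume name is from this report, and the log path is the usual default, so adjust as needed.)

# Brick, NFS server, and self-heal daemon status for the volume.
gluster volume status testvol

# Confirm the brick processes themselves are still running.
ps aux | grep glusterfsd

# Look for disconnects or crashes around the failure window in the brick logs.
grep -iE "disconnect|crash|shutting down" /var/log/glusterfs/bricks/*.log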
I tried with the latest build, i.e. glusterfs-devel-3.4.0.12rhs-1.

My bonnie++ test PASSED with the following results on a 1x2 distributed-replicate volume.

===========Test result start=============
# /opt/qa/tools/bonnie++ -u root -d /mnt/nfsdir
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
santosh-1.locald 4G  1527  98 23227   3  3847   1  3155  99 288897  31  1325  98
Latency             12597us    5126ms    6014ms    9277us    6590us    2987ms
Version  1.96       ------Sequential Create------ --------Random Create--------
santosh-1.localdoma -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   170   2   553   8   491   6   253   3  1225   9   772   6
Latency             51831us    1127ms   74608us   52319us    8247us   64872us
1.96,1.96,santosh-1.localdomain,1,1372295002,4G,,1527,98,23227,3,3847,1,3155,99,288897,31,1325,98,16,,,,,170,2,553,8,491,6,253,3,1225,9,772,6,12597us,5126ms,6014ms,9277us,6590us,2987ms,51831us,1127ms,74608us,52319us,8247us,64872us
===============Test result End====================

Could you retry the test case? I am moving it to ON_QA for the time being; please feel free to reopen if you hit the issue again.

Thanks,
Santosh
I saw this problem in the glusterfs-3.4.0.10rhs-2.el6rhs.x86_64 and glusterfs-3.4.0.11rhs-2.el6rhs.x86_64 builds, but I haven't hit it in daily runs since the glusterfs-devel-3.4.0.12rhs-1 build dropped. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html