Bug 765372 (GLUSTER-3640)

Summary: "randomwriter" job failed with "transport endpoint not connected" error with quick-slave-io ON
Product: [Community] GlusterFS Reporter: M S Vishwanath Bhat <vbhat>
Component: HDFS Assignee: Steve Watt <swatt>
Status: CLOSED EOL QA Contact:
Severity: medium Docs Contact:
Priority: low    
Version: pre-release CC: bugs, gluster-bugs, mzywusko
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-22 15:40:20 UTC Type: ---
Attachments:
Description                              Flags
Jobtracker logs                          none
glusterfs client from ubuntu4 machine    none

Description M S Vishwanath Bhat 2011-09-27 03:32:35 UTC
Created attachment 675

Comment 1 M S Vishwanath Bhat 2011-09-27 03:33:08 UTC
Created attachment 676

Comment 2 M S Vishwanath Bhat 2011-09-27 06:25:41 UTC
In a 2*3 striped-replicated gluster volume with quick-slave-io ON, the randomwriter job failed with the following backtrace.

11/09/26 18:39:18 INFO mapred.JobClient:  map 88% reduce 0%
11/09/26 18:47:52 INFO mapred.JobClient: Task Id : attempt_201109242150_0008_m_000068_1, Status : FAILED
java.io.IOException: Transport endpoint is not connected
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:297)
        at org.apache.hadoop.fs.glusterfs.GlusterFUSEOutputStream.write(GlusterFUSEOutputStream.java:67)
        at org.apache.hadoop.fs.glusterfs.GlusterFUSEOutputStream.write(GlusterFUSEOutputStream.java:52)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
        at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1011)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:680)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.hadoop.examples.RandomWriter$Map.map(RandomWriter.java:188)
        at org.apache.hadoop.examples.RandomWriter$Map.map(RandomWriter.java:152)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

attempt_201109242150_0008_m_000068_1: Initializing GlusterFS
11/09/26 18:51:45 INFO mapred.JobClient: Task Id : attempt_201109242150_0008_m_000076_0, Status : FAILED
Task attempt_201109242150_0008_m_000076_0 failed to report status for 601 seconds. Killing!
Task attempt_201109242150_0008_m_000076_0 failed to report status for 602 seconds. Killing!


The jobtracker logs pointed to an error on the ubuntu4 machine.

2011-09-26 18:43:54,211 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201109242150_0008_m_000079_0' to tip task_201109242150_0008_m_000079, for tracker 'tracker_ubuntu4.gluster.com:localhost/127.0.0.1:38797'
2011-09-26 18:47:49,237 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201109242150_0008_m_000068_1: java.io.IOException: Transport endpoint is not connected
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:297)
        at org.apache.hadoop.fs.glusterfs.GlusterFUSEOutputStream.write(GlusterFUSEOutputStream.java:67)
        at org.apache.hadoop.fs.glusterfs.GlusterFUSEOutputStream.write(GlusterFUSEOutputStream.java:52)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
        at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1011)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:680)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.hadoop.examples.RandomWriter$Map.map(RandomWriter.java:188)
        at org.apache.hadoop.examples.RandomWriter$Map.map(RandomWriter.java:152)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

2011-09-26 18:47:52,374 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201109242150_0008_m_000076 for speculation


On the ubuntu4 machine I saw the following errors in the client logs.

[2011-09-26 18:33:44.404076] I [afr-self-heal-common.c:2012:afr_self_heal_completion_cbk] 0-hosdu-replicate-0: background  data missing-entry gfid self-heal completed on /rdata/_temporary/_attempt_201109242150_0008_m_000004_0/part-00004
[2011-09-26 18:33:44.404280] W [rpc-clnt.c:1432:rpc_clnt_submit] 0-hosdu-client-1: failed to submit rpc-request (XID: 0x5021529x Program: GlusterFS 3.1, ProgVers: 310, Proc: 41) to rpc-transport (hosdu-client-1)
[2011-09-26 18:33:44.404638] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.635563
[2011-09-26 18:33:44.404727] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.643358
[2011-09-26 18:33:44.404803] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.644278
[2011-09-26 18:33:44.404866] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.644580
[2011-09-26 18:33:44.404931] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.644867
[2011-09-26 18:33:44.404986] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.645171
[2011-09-26 18:33:44.405051] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f24e229b7a8] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f24e229afad] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f24e229af0e]))) 0-hosdu-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-09-26 18:33:43.645451
[2011-09-26 18:33:44.405078] I [client.c:1885:client_rpc_notify] 0-hosdu-client-1: disconnected
[2011-09-26 18:33:44.405125] E [afr-common.c:3476:afr_notify] 0-hosdu-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2011-09-26 18:33:47.399088] I [client-handshake.c:1077:select_server_supported_programs] 0-hosdu-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2011-09-26 18:33:47.400153] I [client-handshake.c:917:client_setvolume_cbk] 0-hosdu-client-1: Connected to 10.1.11.30:24009, attached to remote volume '/data/brick'.

I will attach the jobtracker log and the client log from the ubuntu4 machine.
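
For anyone triaging the attached client log, here is a minimal sketch for pulling the outage window out of log lines in the format shown above. The function name `summarize` and the regular expression are assumptions made for illustration. They are not part of any GlusterFS tooling, and the match strings are taken only from the messages quoted in this comment.

```python
import re
from datetime import datetime

# Header of a client log line as it appears in this report:
# [timestamp] LEVEL [file.c:line:function] rest-of-message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) \[(?P<src>[^\]]+)\] (?P<rest>.*)$"
)


def summarize(lines):
    """Return (down_at, up_at, unwinds) for an iterable of client log lines.

    down_at: time of the first "All subvolumes are down" message, or None
    up_at:   time of the first "Connected to" message after that, or None
    unwinds: number of "forced unwinding frame" messages seen
    """
    down_at = None
    up_at = None
    unwinds = 0
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip blank lines and anything not in the expected format
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        rest = m["rest"]
        if "forced unwinding frame" in rest:
            unwinds += 1
        elif "All subvolumes are down" in rest and down_at is None:
            down_at = ts
        elif "Connected to" in rest and down_at is not None and up_at is None:
            up_at = ts
    return down_at, up_at, unwinds
```

Run against the excerpt above, this would count seven forced unwinds on hosdu-client-1 and report about 3 seconds between the "All subvolumes are down" message (18:33:44.405125) and the reconnect (18:33:47.400153).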

Comment 3 Kaleb KEITHLEY 2015-10-22 15:40:20 UTC
pre-release version is ambiguous and about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.