Description of problem:

1.1 Workload
~~~~~~~~~~~~

I was running the catalyst workload on our "standard" setup:

- 8 client nodes, each running a catalyst process with 64 threads sharing the load.
- 8 server nodes, each running gluster and gluster-swift (RHS2.1u2). Each server exports a single brick: a RAID6 volume over 12 disks (each disk is 938 GB). Each server runs a dozen proxy and object workers.
- The clients communicate with the proxy servers over a 10Gb/s front-end network. All other communication happens on a different 10Gb/s back-end network.

The workload consists of about 5 million files, split into 10000-file filesets. Each client splits each fileset among its threads, doing PUTs to send the files to the servers. After the end of the PUT phase, a GET phase reads back all the files. Normally, the PUT/GET phases are repeated three times. It takes a few days to go through the complete workload.

1.2 RHS version
~~~~~~~~~~~~~~~

The servers are all running RHS2.1-u2.

1.3 Problem
~~~~~~~~~~~

The run finished the first PUT and GET phases and was working on the second PUT phase. This phase never finished. Seven out of the eight clients finished PUTting all the files, but the eighth (gprfc007.sbu.lab.eng.bos.redhat.com) did not: it successfully finished 99 filesets, and 63 (out of 64) threads finished their part of the 100th fileset. The remaining thread never finished.

The last output from the client (successful conclusion of the 99th fileset) came about 22 hours after the beginning of the run, so the problem is not easily reproducible.

It turns out that there is an open TCP connection between the catalyst process on gprfc007 and a proxy worker on one of the server nodes (gprfs015): the catalyst thread is blocked on a futex, whereas the proxy worker is doing a poll, timing out and repeating endlessly. The only connections that the proxy worker has open are to memcached: there are no open connections to an object server at all. So it seems that the proxy worker thinks it's all done, but the client thread is still waiting with the connection open.

1.4 netstat output
~~~~~~~~~~~~~~~~~~

1.4.1 on client gprfc007
------------------------

,----
| [root@gprfc007 ~]# netstat -plant
| Active Internet connections (servers and established)
| Proto Recv-Q Send-Q Local Address       Foreign Address     State        PID/Program name
| ...
| tcp        0      0 172.17.10.7:39904   172.17.40.15:8080   ESTABLISHED  3561/python2.6
| ...
`----

3561 is the pid of the catalyst process:

,----
| [root@gprfc007 ~]# ps awlx | grep 3561
| 0 0 3561 2330 20 0 4473264 21508 futex_ Sl ? 4:51 /usr/bin/python2.6 /home/tools/catalyst/src/gluster-load.py 0 PUT 0 gprfs009-10ge,gprfs010-10ge,gprfs011-10ge,gprfs012-10ge,gprfs013-10ge,gprfs014-10ge,gprfs015-10ge,gprfs016-10ge 8080 AUTH_vol0 gprfc007.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf run2 64 1
`----

1.4.2 on server gprfs015
------------------------

,----
| [root@gprfs015 ~]# netstat -plant | grep 8080
| tcp        0      0 172.17.40.15:8080   0.0.0.0:*           LISTEN       11904/python
| tcp        0      0 172.17.40.15:8080   172.17.10.7:39904   ESTABLISHED  11945/python
`----

11945 is the pid of the proxy worker:

,----
| [root@gprfs015 ~]# ps awlx | grep 11945
| 1 0 11945 11904 20 0 232340 18168 poll_s S ? 81:53 /usr/bin/python /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf
`----

1.5 Traceback
~~~~~~~~~~~~~

There was one traceback in the server log files.
It is not clear whether this problem (which I have seen before on various occasions) has anything to do with the hung-thread problem.

,----
| Mar 27 19:32:48 gprfs013 object-server ERROR __call__ error with
| PUT /vol0/0/AUTH_vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005/3245500008127.pst.msg :
|
| Traceback (most recent call last):
|   File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 631, in __call__
|     res = method(req)
|   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1870, in wrapped
|     return func(*a, **kw)
|   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 686, in _timing_stats
|     resp = func(ctrl, *args, **kwargs)
|   File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 370, in PUT
|     with disk_file.create(size=fsize) as writer:
|   File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
|     return self.gen.next()
|   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 761, in create
|     self._create_dir_object(self._obj_path)
|   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 662, in _create_dir_object
|     ret, newmd = make_directory(cur_path, self.uid, self.gid, md)
|   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 163, in _make_directory_unlocked
|     str(serr)))
| DiskFileError: _make_directory_unlocked: os.mkdir failed because path /mnt/gluster-object/vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005 already exists, and a subsequent os.stat on that same path failed ([Errno 5] Input/output error: '/mnt/gluster-object/vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005') (txn: tx63d95e2d739d46b9b6cae-005334b520)
`----

1.6 Tarball with log files
~~~~~~~~~~~~~~~~~~~~~~~~~~

A tarball containing /var/log/messages and the gluster logs from each server, as well as the catalyst error/progress/results files from each client, can be found on gprfc009:/BZ/catalyst-hung-thread/logs.tar.xz. The tarball (xz-compressed) is about 2.4GB.

Version-Release number of selected component (if applicable):
RHS2.1-u2

How reproducible:
Not easily - I have seen the problem twice after running the catalyst workload for a day or two each time.

Steps to Reproduce:
1. See the description of the problem - the setup is complicated
2.
3.

Actual results:
Test hung because of a hung thread.

Expected results:
All threads should finish.

Additional info:
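To make the shape of the workload in section 1.1 concrete, here is a minimal sketch of the per-fileset PUT loop. This is not the actual gluster-load.py code; the container name, auth handling, helper names, and use of python-requests are illustrative assumptions. Only the proxy host list, port 8080, and the /v1/AUTH_vol0 URL scheme come from the outputs above.

,----
| # Hypothetical approximation of the catalyst PUT phase, for illustration
| # only: each client splits a 10000-file fileset among 64 worker threads
| # and PUTs every file to one of the eight proxies.  A single PUT that
| # never completes (the hung thread described above) blocks the final
| # join() and therefore the whole run.
| import threading
| import requests
|
| PROXIES = ["gprfs%03d-10ge" % n for n in range(9, 17)]  # gprfs009..gprfs016
| ACCOUNT = "AUTH_vol0"
| CONTAINER = "run2"        # placeholder container name
| NUM_THREADS = 64
|
| def put_files(thread_id, files, token):
|     """PUT this thread's share of the fileset through one proxy."""
|     proxy = PROXIES[thread_id % len(PROXIES)]
|     for path in files:
|         url = "http://%s:8080/v1/%s/%s/%s" % (proxy, ACCOUNT, CONTAINER, path)
|         with open(path, "rb") as f:
|             resp = requests.put(url, data=f, headers={"X-Auth-Token": token})
|         resp.raise_for_status()
|
| def run_fileset(fileset, token):
|     """Split one fileset across NUM_THREADS threads and wait for all of them."""
|     threads = []
|     for i in range(NUM_THREADS):
|         t = threading.Thread(target=put_files,
|                              args=(i, fileset[i::NUM_THREADS], token))
|         t.start()
|         threads.append(t)
|     for t in threads:
|         t.join()   # never returns if one PUT hangs, as seen on gprfc007
`----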
I should have added one more section to the initial report:

2.5 Errors in server logs
~~~~~~~~~~~~~~~~~~~~~~~~~

There are two errors in the server logs:

,----
| Mar 27 19:32:48 gprfs013 proxy-server Object PUT returning 503, 0/1 required connections (txn: tx63d95e2d739d46b9b6cae-005334b520) (client_ip: 172.17.10.1)
| Mar 27 19:32:48 gprfs013 proxy-server 172.17.10.1 172.17.10.1 27/Mar/2014/23/32/48 PUT /v1/AUTH_vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005/3245500008127.pst.msg HTTP/1.0 503 - pycurl-gluster_func-44 - - 118 - tx63d95e2d739d46b9b6cae-005334b520 - 0.1248 - -
| ...
| Mar 27 14:41:22 gprfs015 proxy-server ERROR Client read timeout (5s) (txn: txa14f459df3da49dc9f45f-00533470bf) (client_ip: 172.17.10.7)
| Mar 27 14:41:22 gprfs015 proxy-server 172.17.10.7 172.17.10.7 27/Mar/2014/18/41/22 PUT /v1/AUTH_vol0/gprfc007.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20120119_238696/job34772/0003/3477200005406.eml HTTP/1.0 408 - pycurl-gluster_func-22 - 65536 116 - txa14f459df3da49dc9f45f-00533470bf - 19.0962 - -
`----

The first error (the 503) seems to be related to the traceback in the initial report: it carries the same transaction id. The second error (the client read timeout and the resulting 408) seems to be related to the hung thread. These two problems are probably independent of each other.
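Since the 503 above and the object-server traceback in the initial report share a transaction id, grouping the collected logs by txn id makes it easy to confirm which lines belong to the same request. The script below is a hypothetical helper written for this purpose, not part of RHS or the catalyst tooling:

,----
| # Hypothetical log helper (not part of RHS or catalyst): group syslog
| # lines by Swift transaction id so that, e.g., the proxy-server 503 can
| # be matched with the object-server traceback carrying the same txn id,
| # and the 408/client-read-timeout lines can be inspected on their own.
| import re
| import sys
| from collections import defaultdict
|
| TXN_RE = re.compile(r"\btx[0-9a-f]+-[0-9a-f]+\b")
|
| def group_by_txn(paths):
|     groups = defaultdict(list)
|     for path in paths:
|         with open(path, errors="replace") as f:
|             for line in f:
|                 for txn in set(TXN_RE.findall(line)):
|                     groups[txn].append(line.rstrip("\n"))
|     return groups
|
| if __name__ == "__main__":
|     for txn, lines in group_by_txn(sys.argv[1:]).items():
|         if len(lines) > 1:          # only txns that appear more than once
|             print(txn)
|             for line in lines:
|                 print("    " + line)
`----

It can be pointed (with python3) at the unpacked /var/log/messages files from the tarball; the exact paths depend on how the tarball is laid out.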
Ack'd. I have looked at the first traceback, the os.mkdir-related issue. I will continue investigating the second point, the client read timeout, tomorrow.
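For reference, the failure mode in that traceback boils down to the pattern sketched below. This is not the actual gluster-swift diskfile.py code, just an illustration of how an EEXIST from os.mkdir followed by an EIO from os.stat on the same path (plausible on the glusterfs mount under concurrent PUTs) turns into the DiskFileError seen in the log:

,----
| # Sketch of the pattern behind the traceback (illustration only, not
| # the actual gluster-swift diskfile code): os.mkdir() reports EEXIST
| # because the directory was already created (e.g. by a concurrent PUT
| # on another object worker), but the follow-up os.stat() on the very
| # same path then fails with EIO on the glusterfs mount.
| import errno
| import os
| import stat
|
| class DiskFileError(Exception):
|     pass
|
| def make_directory_unlocked(path, uid, gid):
|     try:
|         os.mkdir(path)
|     except OSError as err:
|         if err.errno != errno.EEXIST:
|             raise
|         # Directory already exists; confirm it is really usable.
|         try:
|             st = os.stat(path)
|         except OSError as serr:
|             # This is the branch seen in the log: EEXIST from mkdir,
|             # then EIO ([Errno 5]) from stat on the same path.
|             raise DiskFileError(
|                 "make_directory_unlocked: os.mkdir failed because path "
|                 "%s already exists, and a subsequent os.stat on that "
|                 "same path failed (%s)" % (path, serr))
|         if not stat.S_ISDIR(st.st_mode):
|             raise DiskFileError("%s exists but is not a directory" % path)
|         return False, st
|     os.chown(path, uid, gid)
|     return True, os.stat(path)
`----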
I could not resolve the "gprfc009" hostname while trying to fetch the logs:

[Chetan@cbox ~]$ scp root.corp:/BZ/catalyst-hung-thread/logs.tar.xz .
ssh: Could not resolve hostname gprfc009.redhat.corp: Name or service not known

[Chetan@cbox ~]$ ssh gprfc009
ssh: Could not resolve hostname gprfc009: Name or service not known

[Chetan@cbox ~]$ ssh gprfc009.corp.redhat.com
ssh: Could not resolve hostname gprfc009.corp.redhat.com: Name or service not known
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/b6939d9f88ce34a9b8a7cef37481ea2740c3138d
Bug 1083197 - remove NULL's from environment variables
Apologies for the delay - I missed this until last week (and then tried to update it, but I got another reminder today and found out my update did not go through).

It is highly likely that the machine has been clobbered by now, so this information will not be immediately useful, but for future reference: these machines are in the BAGL lab, so the FQDN of gprfc009 is gprfc009.sbu.lab.eng.bos.redhat.com. See https://mojo.redhat.com/docs/DOC-142066 for more details.
Closing this bug as RHS 2.1 is EOL. If this bug persists in recent versions, it should be opened against RHGS 3.1.x.