Bug 1083197 - Hung thread while running catalyst workload on gluster-swift
Summary: Hung thread while running catalyst workload on gluster-swift
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-swift
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Thiago da Silva
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-04-01 16:01 UTC by Nick Dokos
Modified: 2015-11-20 06:16 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-20 06:16:36 UTC
Embargoed:



Description Nick Dokos 2014-04-01 16:01:51 UTC
Description of problem:


1.1 Workload
~~~~~~~~~~~~

  I was running the catalyst workload on our "standard" setup:

  - 8 client nodes, each running a catalyst process with 64 threads
    sharing the load.
  - 8 server nodes, each running gluster and gluster-swift
    (RHS2.1u2). Each server exports a single brick: a RAID6 volume over
    12 disks (each disk is 938 GB).  Each server runs a dozen proxy and
    object workers (an illustrative config fragment follows this list).
  - The clients communicate with the proxy servers over a 10Gb/s
    front-end network.  All other communication happens on a different
    10Gb/s back-end network.
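
  For orientation, the sketch below shows what the relevant bits of the
  server configuration might look like. This is a hedged illustration
  only; the actual config files were not attached to this bug.
  "workers = 12" simply matches the "dozen proxy and object workers"
  mentioned in the list above, the client timeout value is inferred
  from the "Client read timeout (5s)" log line quoted in a later
  comment, and the node timeout shown is the stock Swift default.

  ,----
  | # /etc/swift/proxy-server.conf (illustrative fragment, not the real file)
  | [DEFAULT]
  | bind_port = 8080
  | workers = 12
  | 
  | [app:proxy-server]
  | node_timeout = 10      # how long the proxy waits on the object server
  | client_timeout = 5     # how long the proxy waits for the client to send data
  `----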

  The workload consists of about 5 million files, split into 10000-file
  filesets. Each client splits each fileset among its threads, doing
  PUTs to send the files to the servers.  After the end of the PUT
  phase, a GET phase reads back all the files. Normally, the PUT/GET
  phases are repeated three times. It takes a few days to go through the
  complete workload.
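
  To make the PUT-phase structure concrete, here is a minimal sketch
  (Python 2, matching the environment in the logs) of how one client
  might split a 10000-file fileset across 64 threads and PUT each file
  through a proxy. This is not the actual catalyst/gluster-load.py code
  (which, judging by the user agent in the proxy logs, uses pycurl);
  the host, container and helper names are made up. The point is the
  final join(): a single thread that never returns keeps the whole
  fileset, and therefore the whole run, from finishing.

  ,----
  | # Illustrative sketch only, not the actual catalyst tool.
  | import httplib
  | import threading
  | 
  | PROXY_HOST = "gprfs015-10ge"   # hypothetical alias for one of the 8 proxies
  | PROXY_PORT = 8080
  | ACCOUNT = "AUTH_vol0"
  | CONTAINER = "run2"             # hypothetical; assumes it already exists
  | NUM_THREADS = 64
  | 
  | def put_files(names):
  |     conn = httplib.HTTPConnection(PROXY_HOST, PROXY_PORT)
  |     for name in names:
  |         body = open(name, "rb").read()
  |         conn.request("PUT", "/v1/%s/%s/%s" % (ACCOUNT, CONTAINER, name),
  |                      body, {"Content-Length": str(len(body))})
  |         conn.getresponse().read()   # drain the reply so the socket can be reused
  |     conn.close()
  | 
  | def run_fileset(filenames):
  |     # Split one 10000-file fileset across the worker threads.
  |     threads = [threading.Thread(target=put_files,
  |                                 args=(filenames[i::NUM_THREADS],))
  |                for i in range(NUM_THREADS)]
  |     for t in threads:
  |         t.start()
  |     for t in threads:
  |         t.join()   # one thread that never returns hangs the whole fileset
  `----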


1.2 RHS version
~~~~~~~~~~~~~~~

  The servers are all running RHS2.1-u2.


1.3 Problem
~~~~~~~~~~~

  The run finished the first PUT and GET phases and was working on the
  second PUT phase.  This phase never finished. Seven out of the eight
  clients finished PUTting all the files, but the eighth
  (gprfc007.sbu.lab.eng.bos.redhat.com) did not: it successfully
  finished 99 filesets, and 63 (out of 64) threads finished their part
  of the 100th fileset. The remaining thread never finished. The last
  output from the client (successful conclusion of the 99th fileset)
  came about 22 hours after the beginning of the run, so the problem is
  not easily reproducible.

  It turns out that there is an open TCP connection between the catalyst
  process on gprfc007 and a proxy worker on one of the server nodes
  (gprfs015): the catalyst thread is blocked on a futex, whereas the
  proxy worker is doing a poll, timing out and repeating endlessly. The
  only connections that the proxy worker has open are to memcached:
  there are no open connections to an object server at all.  So it seems
  that the proxy worker thinks it's all done, but the client thread is
  still waiting with the connection open.
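
  If the proxy side really has finished with the request while the
  client is still waiting for a response, an unbounded blocking read is
  enough to explain the symptom. The sketch below is only an assumption
  about the mechanism, not a confirmed root cause, and it uses httplib
  rather than the pycurl stack catalyst actually uses: the idea is
  simply that a wait with no timeout never returns if the proxy never
  sends (or never completes) its reply, whereas a bounded wait would
  turn the hang into an error the client could log and retry.

  ,----
  | # Hedged illustration of the suspected failure mode; not the real client code.
  | import httplib
  | import socket
  | 
  | def put_without_timeout(host):
  |     conn = httplib.HTTPConnection(host, 8080)   # no timeout at all
  |     conn.request("PUT", "/v1/AUTH_vol0/c/o", "payload",
  |                  {"Content-Length": "7"})
  |     return conn.getresponse()   # blocks in recv() forever if no reply arrives
  | 
  | def put_with_timeout(host):
  |     conn = httplib.HTTPConnection(host, 8080, timeout=300)
  |     try:
  |         conn.request("PUT", "/v1/AUTH_vol0/c/o", "payload",
  |                      {"Content-Length": "7"})
  |         return conn.getresponse()
  |     except socket.timeout:
  |         return None             # caller can log and retry instead of hanging
  `----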


1.4 netstat output
~~~~~~~~~~~~~~~~~~

1.4.1 on client gprfc007
------------------------

  ,----
  | [root@gprfc007 ~] # netstat -plant
  | Active Internet connections (servers and established)
  | Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
  | ...
  | tcp        0      0 172.17.10.7:39904           172.17.40.15:8080           ESTABLISHED 3561/python2.6      
  | ...
  `----

  3561 is the pid of the catalyst process:

  ,----
  | [root@gprfc007 ~]# ps awlx | grep 3561
  | 0     0  3561  2330  20   0 4473264 21508 futex_ Sl  ?          4:51 /usr/bin/python2.6 /home/tools/catalyst/src/gluster-load.py 0 PUT 0 gprfs009-10ge,gprfs010-10ge,gprfs011-10ge,gprfs012-10ge,gprfs013-10ge,gprfs014-10ge,gprfs015-10ge,gprfs016-10ge 8080 AUTH_vol0 gprfc007.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf run2 64 1
  `----


1.4.2 on server gprfs015
------------------------

  ,----
  | [root@gprfs015 ~]# netstat -plant | grep 8080
  | tcp        0      0 172.17.40.15:8080           0.0.0.0:*                   LISTEN      11904/python        
  | tcp        0      0 172.17.40.15:8080           172.17.10.7:39904           ESTABLISHED 11945/python
  `----

  11945 is the pid of the proxy worker:

  ,----
  | [root@gprfs015 ~]# ps awlx | grep 11945
  | 1     0 11945 11904  20   0 232340 18168 poll_s S    ?         81:53 /usr/bin/python /usr/bin/swift-proxy-server /etc/swift/proxy-server.conf
  `----


1.5 Traceback
~~~~~~~~~~~~~

  There was one traceback in the server log files. It is not clear
  whether this problem (which I have seen before on various occasions)
  has anything to do with the hung thread problem.

  ,----
  | Mar 27 19:32:48 gprfs013 object-server ERROR __call__ error with 
  | PUT /vol0/0/AUTH_vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005/3245500008127.pst.msg :
  |  
  | Traceback (most recent call last):
  |   File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 631, in __call__
  |     res = method(req)
  |   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1870, in wrapped
  |     return func(*a, **kw)
  |   File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 686, in _timing_stats
  |     resp = func(ctrl, *args, **kwargs)
  |   File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 370, in PUT
  |     with disk_file.create(size=fsize) as writer:
  |   File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
  |     return self.gen.next()
  |   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 761, in create
  |     self._create_dir_object(self._obj_path)
  |   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 662, in _create_dir_object
  |     ret, newmd = make_directory(cur_path, self.uid, self.gid, md)
  |   File "/usr/lib/python2.6/site-packages/gluster/swift/obj/diskfile.py", line 163, in _make_directory_unlocked
  |     str(serr)))
  | DiskFileError: _make_directory_unlocked: os.mkdir failed because path /mnt/gluster-object/vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005 already exists, and a subsequent os.stat on that same path failed ([Errno 5] Input/output error: '/mnt/gluster-object/vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005') (txn: tx63d95e2d739d46b9b6cae-005334b520)
  `----
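
  For readers unfamiliar with this code path, the traceback boils down
  to the pattern sketched below (a rough reconstruction from the error
  message, not the actual gluster/swift/obj/diskfile.py source):
  os.mkdir() fails with EEXIST because another worker has already
  created the directory, the follow-up os.stat() that should confirm
  the directory then fails with EIO on the GlusterFS mount, and the
  combination is surfaced as a DiskFileError.

  ,----
  | # Rough reconstruction of the pattern behind the traceback above; the real
  | # gluster-swift _make_directory_unlocked() differs in detail.
  | import errno
  | import os
  | 
  | class DiskFileError(Exception):
  |     pass
  | 
  | def make_directory(path, uid, gid):
  |     try:
  |         os.mkdir(path)
  |     except OSError as err:
  |         if err.errno != errno.EEXIST:
  |             raise
  |         # Another worker won the race; confirm the directory is usable.
  |         try:
  |             os.stat(path)
  |         except OSError as serr:
  |             # Here the stat itself failed (EIO in the log), which is what
  |             # produced the DiskFileError seen above.
  |             raise DiskFileError(
  |                 "make_directory: os.mkdir failed because path %s already "
  |                 "exists, and a subsequent os.stat on that same path "
  |                 "failed (%s)" % (path, str(serr)))
  |         return True
  |     os.chown(path, uid, gid)
  |     return True
  `----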


1.6 Tarball with log files
~~~~~~~~~~~~~~~~~~~~~~~~~~

  A tarball containing /var/log/messages and the gluster logs from each
  server, as well as the catalyst error/progress/results files from each
  client, can be found at gprfc009:/BZ/catalyst-hung-thread/logs.tar.xz.
  The tarball (xz-compressed) is about 2.4 GB.

Version-Release number of selected component (if applicable):
RHS2.1-u2

How reproducible:
Not easily - I have seen the problem twice after running the catalyst workload for a day or two each time.

Steps to Reproduce:
1. See the description of the problem - the setup is complicated

Actual results:
The test run hung because of a single hung client thread.

Expected results:
All threads should finish.

Additional info:

Comment 2 Nick Dokos 2014-04-01 17:35:41 UTC
I should have added one more section to the initial report:

2.5 Errors in server logs
~~~~~~~~~~~~~~~~~~~~~~~~~

  There are two errors in the server logs:

  ,----
  | Mar 27 19:32:48 gprfs013 proxy-server Object PUT returning 503, 0/1 required connections (txn: tx63d95e2d739d46b9b6cae-005334b520) (client_ip: 172.17.10.1)
  | Mar 27 19:32:48 gprfs013 proxy-server 172.17.10.1 172.17.10.1 27/Mar/2014/23/32/48 PUT /v1/AUTH_vol0/gprfc001.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20111107_0/job32455/0005/3245500008127.pst.msg HTTP/1.0 503 - pycurl-gluster_func-44 - - 118 - tx63d95e2d739d46b9b6cae-005334b520 - 0.1248 - -
  | ...
  | Mar 27 14:41:22 gprfs015 proxy-server ERROR Client read timeout (5s) (txn: txa14f459df3da49dc9f45f-00533470bf) (client_ip: 172.17.10.7)
  | Mar 27 14:41:22 gprfs015 proxy-server 172.17.10.7 172.17.10.7 27/Mar/2014/18/41/22 PUT /v1/AUTH_vol0/gprfc007.pmcDef.pw12.omc1.ow12.dcs64KB.ncs64KB.aw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run2/insightdemo12/docs/20120119_238696/job34772/0003/3477200005406.eml HTTP/1.0 408 - pycurl-gluster_func-22 - 65536 116 - txa14f459df3da49dc9f45f-00533470bf - 19.0962 - -
  `----

  The first one seems to be related to the traceback in the initial report; the
  second one seems to be related to the hung thread. The two problems are
  probably independent of each other.
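
  For context on the second error: the proxy bounds each read of the
  incoming PUT body with its client_timeout (5 seconds here, per the
  log line), and when that timer fires it logs "ERROR Client read
  timeout" and answers 408. The sketch below is a hedged paraphrase of
  that behaviour, not the actual Swift proxy code (which goes through
  ChunkReadTimeout and more). The 503 in the first error is the
  separate case where the proxy could not get a good response from the
  object server (0 of the 1 required connections), which lines up with
  the DiskFileError traceback in the initial report.

  ,----
  | # Hedged paraphrase of the proxy's client-read-timeout behaviour.
  | from eventlet import Timeout
  | 
  | CLIENT_TIMEOUT = 5        # seconds, matching the "(5s)" in the log line above
  | 
  | def relay_put_body(wsgi_input, object_conn, chunk_size=65536):
  |     while True:
  |         try:
  |             with Timeout(CLIENT_TIMEOUT):
  |                 chunk = wsgi_input.read(chunk_size)
  |         except Timeout:
  |             # Logged as "ERROR Client read timeout (5s)"; the client gets a 408.
  |             raise
  |         if not chunk:
  |             return
  |         object_conn.send(chunk)
  `----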

Comment 3 crisbud@redhat.com 2014-04-02 15:19:25 UTC
Ack'd. 

Looked at the first traceback, the os.mkdir-related issue. Will continue investigating the second point (the client read timeout) tomorrow.

Comment 4 crisbud@redhat.com 2014-04-03 08:01:42 UTC
I could not resolve the "gprfc009" hostname while trying to fetch the logs.


[Chetan@cbox ~]$ scp root.corp:/BZ/catalyst-hung-thread/logs.tar.xz .
ssh: Could not resolve hostname gprfc009.redhat.corp: Name or service not known
[Chetan@cbox ~]$ ssh gprfc009 
ssh: Could not resolve hostname gprfc009: Name or service not known
[Chetan@cbox ~]$ ssh gprfc009.corp.redhat.com
ssh: Could not resolve hostname gprfc009.corp.redhat.com: Name or service not known

Comment 5 openshift-github-bot 2014-05-06 03:01:56 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/b6939d9f88ce34a9b8a7cef37481ea2740c3138d
Bug 1083197 - remove NULL's from environment variables

Comment 7 Nick Dokos 2014-09-08 14:45:23 UTC
Apologies for the delay - I missed this until last week (I then tried to update it, but I got another reminder today and found that my update did not go through).

It's highly likely that the machine has been clobbered by now, so this information will not be immediately useful. For future reference, these machines are in the BAGL lab, so the FQDN of gprfc009 is

  gprfc009.sbu.lab.eng.bos.redhat.com

See https://mojo.redhat.com/docs/DOC-142066 for more details.

Comment 8 Prashanth Pai 2015-11-20 06:16:36 UTC
Closing this bug as RHS 2.1 is EOL.
If this bug persists in recent versions, it should be opened against RHGS 3.1.x

