Created attachment 800596 [details]
Tarball of client logs and ERROR lines of server logs.

Description of problem:

* Setup

Two clients (gprfc026, gprfc028) running the catalyst workload: first PUT, then GET, 4.78 million files in groups of 10000, with the whole sequence repeated three times.

Six servers (gprfs009-016, minus 013 and its partner 014; 013 kicked the bucket) in the standard configuration: 12 disks in RAID6, one gluster volume, two replicas. All six run the gluster server and client components as well as the swift services.

* Results

The clients were started on 2013/09/11 at 16:58 local time (UTC-5). They ran for a few hours, getting errors during the first PUT run in each case: run3 for gprfc026 and run1 for gprfc028 (the discrepancy in the file names was a careless error on my part - no deep significance). See the gprfc0{26,28}/gl-run{3,1}-PUT-{error,progress}.log files for the details. Note that these logs use UTC, so they show times of 21:58 and later.

The server logs (except for 009) show errors mostly clumped around two different times, 17:58:10 and 18:05:34, except for 010, which got all its errors at 17:31:14:

| Server | Number of errors | Time     |
|--------+------------------+----------|
| 009    |                0 |          |
| 010    |                8 | 17:31:14 |
| 011    |                2 | 17:58:09 |
| 011    |                8 | 18:05:34 |
| 011    |                1 | 20:41:17 |
| 012    |               11 | 17:58:10 |
| 015    |                4 | 17:58:11 |
| 016    |                2 | 17:58:09 |
| 016    |                7 | 18:05:34 |

The complaint is the same in each case, except for the path - sometimes it is the directory for the container, sometimes a file underneath it:

#+BEGIN_EXAMPLE
Sep 11 18:05:34 gprfs016 object-server ERROR __call__ error with PUT /vol0/0/AUTH_vol0/gprfc026.pmcDef.pw24.omc1.ow24.dcs2MB.ncs2MB.daw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run3/insightdemo12/docs/20111114_0/job33242/0001/3324200000050.htm :
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 928, in __call__
    res = method(req)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1558, in wrapped
    return func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 520, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/gluster/swift/obj/server.py", line 63, in PUT
    return server.ObjectController.PUT(self, request)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1558, in wrapped
    return func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 520, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 647, in PUT
    with file.mkstemp() as fd:
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 755, in mkstemp
    self._create_dir_object(self._obj_path)
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 430, in _create_dir_object
    ret, newmd = make_directory(cur_path, self.uid, self.gid, md)
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 149, in _make_directory_unlocked
    str(serr)))
DiskFileError: _make_directory_unlocked: os.mkdir failed because path /mnt/gluster-object/vol0/gprfc026.pmcDef.pw24.omc1.ow24.dcs2MB.ncs2MB.daw02.cw02.d8192.mtu9k.untuned.clth64.tpf already exists, and a subsequent os.stat on that same path failed ([Errno 2] No such file or directory: '/mnt/gluster-object/vol0/gprfc026.pmcDef.pw2
#+END_EXAMPLE

It's not clear to me why the path is truncated in the nested error message (the last line of the example above). A minimal sketch of the mkdir/stat sequence this error describes is included under Additional info below.

The errors somehow cleared up eventually, and the runs that started around 18:58 (23:58 UTC) finished with no further errors of any kind. The results were consistent with what Peter Portante obtained earlier with these bits. The client logs and the ERROR lines from the server logs are in the attached tarball.

Version-Release number of selected component (if applicable):
RHS2.1

How reproducible:
Unknown - I am about to start a run with more clients. If I see the same problem, I will update the BZ accordingly.

Steps to Reproduce:
1.
2.
3.

Actual results:
Backtrace (see the description above)

Expected results:
No backtrace

Additional info:
Client logs and selections from the server logs in the attached tarball.
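For reference on what the DiskFileError is reporting: os.mkdir on the container directory returned EEXIST, but an os.stat issued immediately afterwards on the same path returned ENOENT. The sketch below is not the gluster-swift code; it is only an illustration (with a made-up make_directory_sketch helper and a stand-in DiskFileError class) of that mkdir-then-stat sequence and of how the two calls could disagree, e.g. if another client on the shared GlusterFS mount removes or recreates the directory in between.

#+BEGIN_SRC python
import errno
import os


class DiskFileError(Exception):
    """Stand-in for the exception type named in the backtrace."""


def make_directory_sketch(path):
    """Illustrative only: approximate the mkdir-then-stat sequence that the
    error message describes.  This is NOT the gluster-swift implementation."""
    try:
        os.mkdir(path)
        return True
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise
    # mkdir claims the directory already exists ...
    try:
        os.stat(path)
    except OSError as serr:
        # ... but the follow-up stat finds nothing.  On a shared GlusterFS
        # mount this window can be hit if another client removes or
        # recreates the directory between the two calls, which would match
        # the "[Errno 2] No such file or directory" seen in the logs.
        raise DiskFileError(
            "_make_directory_unlocked: os.mkdir failed because path %s "
            "already exists, and a subsequent os.stat on that same path "
            "failed (%s)" % (path, str(serr)))
    return True
#+END_SRC

Whether a race of that kind is actually what happened on these servers is not something the logs establish; the sketch is only meant to make the error message concrete.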
I have now run catalyst many times without seeing this problem again. I'll close this BZ and reopen it if/when the problem recurs.