Created attachment 800596 [details]
Tarball of client logs and ERROR lines of server logs.

Description of problem:

* Setup

Two clients (gprfc026, gprfc028) running the catalyst workload: first PUT, then GET, 4.78 million files in groups of 10000, with the whole sequence repeated three times.

Six servers (gprfs009-016, minus 013 and its partner 014; 013 kicked the bucket) in the standard configuration: 12 disks in RAID6, one gluster volume, two replicas. All six run the gluster server and client components as well as the swift services.

* Results

The clients were started on 2013/09/11 at 16:58 local time (UTC-5). They ran for a few hours, getting errors during the first PUT run in each case: run3 for gprfc026 and run1 for gprfc028 (the discrepancy in the file names was a careless error on my part - no deep significance). See the gprfc0{26,28}/gl-run{3,1}-PUT-{error,progress}.log files for the details. Note that these logs use UTC, so they show times of 21:58 and later.

The server logs (except for 009) show errors mostly clumped around two different times, 17:58:10 and 18:05:34, except for 010, which got all its errors at 17:31:14:

| Server | Number of errors | Time     |
|--------+------------------+----------|
| 009    |                0 |          |
| 010    |                8 | 17:31:14 |
| 011    |                2 | 17:58:09 |
| 011    |                8 | 18:05:34 |
| 011    |                1 | 20:41:17 |
| 012    |               11 | 17:58:10 |
| 015    |                4 | 17:58:11 |
| 016    |                2 | 17:58:09 |
| 016    |                7 | 18:05:34 |

The complaint is the same in each case, except for the path - sometimes it is the directory for the container, sometimes a file underneath it:

#+BEGIN_EXAMPLE
Sep 11 18:05:34 gprfs016 object-server ERROR __call__ error with PUT /vol0/0/AUTH_vol0/gprfc026.pmcDef.pw24.omc1.ow24.dcs2MB.ncs2MB.daw02.cw02.d8192.mtu9k.untuned.clth64.tpf/run3/insightdemo12/docs/20111114_0/job33242/0001/3324200000050.htm :
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 928, in __call__
    res = method(req)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1558, in wrapped
    return func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 520, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/gluster/swift/obj/server.py", line 63, in PUT
    return server.ObjectController.PUT(self, request)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1558, in wrapped
    return func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 520, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/server.py", line 647, in PUT
    with file.mkstemp() as fd:
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 755, in mkstemp
    self._create_dir_object(self._obj_path)
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 430, in _create_dir_object
    ret, newmd = make_directory(cur_path, self.uid, self.gid, md)
  File "/usr/lib/python2.6/site-packages/gluster/swift/common/DiskFile.py", line 149, in _make_directory_unlocked
    str(serr)))
DiskFileError: _make_directory_unlocked: os.mkdir failed because path /mnt/gluster-object/vol0/gprfc026.pmcDef.pw24.omc1.ow24.dcs2MB.ncs2MB.daw02.cw02.d8192.mtu9k.untuned.clth64.tpf already exists, and a subsequent os.stat on that same path failed ([Errno 2] No such file or directory: '/mnt/gluster-object/vol0/gprfc026.pmcDef.pw2
#+END_EXAMPLE

It's not clear to me why the path is truncated in the nested error message (the last line of the example above). A minimal sketch of the mkdir/stat sequence this error describes is included under Additional info below.

The errors somehow cleared up eventually, and the runs that started around 18:58 (23:58 UTC) finished with no further errors of any kind. The results were consistent with what Peter Portante obtained earlier with these bits. The client logs and the ERROR lines from the server logs are in the attached tarball.

Version-Release number of selected component (if applicable):
RHS2.1

How reproducible:
Unknown - I am about to start a run with more clients. If I see the same problem, I will update the BZ accordingly.

Steps to Reproduce:
1.
2.
3.

Actual results:
Backtrace (see the description above)

Expected results:
No backtrace

Additional info:
Client logs and selections from the server logs in the attached tarball.
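For reference on what the DiskFileError is reporting: os.mkdir on the container directory returned EEXIST, but an os.stat issued immediately afterwards on the same path returned ENOENT. The sketch below is not the gluster-swift code; it is only an illustration (with a made-up make_directory_sketch helper and a stand-in DiskFileError class) of that mkdir-then-stat sequence and of how the two calls could disagree, e.g. if another client on the shared GlusterFS mount removes or recreates the directory in between.

#+BEGIN_SRC python
import errno
import os


class DiskFileError(Exception):
    """Stand-in for the exception type named in the backtrace."""


def make_directory_sketch(path):
    """Illustrative only: approximate the mkdir-then-stat sequence that the
    error message describes.  This is NOT the gluster-swift implementation."""
    try:
        os.mkdir(path)
        return True
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise
    # mkdir claims the directory already exists ...
    try:
        os.stat(path)
    except OSError as serr:
        # ... but the follow-up stat finds nothing.  On a shared GlusterFS
        # mount this window can be hit if another client removes or
        # recreates the directory between the two calls, which would match
        # the "[Errno 2] No such file or directory" seen in the logs.
        raise DiskFileError(
            "_make_directory_unlocked: os.mkdir failed because path %s "
            "already exists, and a subsequent os.stat on that same path "
            "failed (%s)" % (path, str(serr)))
    return True
#+END_SRC

Whether a race of that kind is actually what happened on these servers is not something the logs establish; the sketch is only meant to make the error message concrete.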
I have now run catalyst many times without seeing this problem again. I'll close this BZ and reopen it if/when the problem recurs.