Bug 1215430 - erasure coded volumes can't read large directory trees
Summary: erasure coded volumes can't read large directory trees
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Pranith Kumar K
QA Contact: Bhaskarakiran
URL:
Whiteboard:
Depends On:
Blocks: 1202842
 
Reported: 2015-04-26 14:12 UTC by Ben England
Modified: 2016-11-23 23:12 UTC
CC: 6 users

Fixed In Version: glusterfs-3.7.1-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-29 04:42:01 UTC
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHSA-2015:1495, status SHIPPED_LIVE: "Important: Red Hat Gluster Storage 3.1 update", last updated 2015-07-29 08:26:26 UTC

Description Ben England 2015-04-26 14:12:24 UTC
Description of problem:

I/O errors seen while doing equivalent of ls -lR in parallel on multiple directory trees in a Gluster volume.

Version-Release number of selected component (if applicable):

RHS 3.1 - glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64
RHEL6.6 - kernel-2.6.32-504.3.2.el6.x86_64

How reproducible:

Not reproducible at will; it seems to be triggered by doing a parallel readdir on all clients.

I suspect it's a race condition in erasure coding that is exposed in some way by multi-threaded epoll.  We can lower client.event-threads to 1 and see if it still happens (see the sketch below).
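A minimal sketch of that experiment, assuming the volume is named ec42 (matching the 0-ec42-* translator names in the client logs):

gluster volume set ec42 client.event-threads 1
# clients should pick up the new option automatically; otherwise remount, then rerun the smallfile ls-l test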

For example:

http://pbench.perf.lab.eng.bos.redhat.com/results/gprfc032-10ge/glfs37-ec-100kfpt-2015-04-25/256-KB--fpt-1024-fpd-2-thr-op-ls-l-2015-04-25-12-47/r-2015-04-25-12-47.log

Steps to Reproduce:
1. Create and mount a Gluster 3.7 erasure-coded (dispersed) volume on 6 hosts with 12 bricks per host (an illustrative sketch follows these steps).
2. Use the smallfile command shown below to create per-thread directory trees with 1024 files per directory.
3. Use the smallfile command to do the equivalent of "ls -lR" on the directory trees.
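A rough sketch of the volume in step 1. Hostnames and brick paths are placeholders, and the 4+2 (disperse 6, redundancy 2) geometry is inferred, not stated: it matches the disperse-0 through disperse-11 subvolume names and the 6-bit up/mask values in the logs.

hosts="gprfs022 gprfs024 gprfs037 gprfs038 gprfs039 gprfs040"
bricks=""
for b in $(seq -w 1 12); do              # 12 bricks per host
    for h in $hosts; do                  # host varies fastest so each 4+2 set spans all six hosts
        bricks="$bricks $h:/bricks/b$b/ec42"
    done
done
gluster volume create ec42 disperse 6 redundancy 2 $bricks
gluster volume start ec42
mount -t glusterfs gprfs022:/ec42 /mnt/ec42          # on each client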

Actual results:

EIO returned by some threads

Expected results:

No EIO errors

Additional info:

A ton of these messages appear in the client mount logs:

[2015-04-26 13:52:31.504545] W [ec-common.c:162:ec_check_status] 0-ec42-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=2F, bad=10)
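My reading of those fields (not stated in the log itself) is that up/mask/good/bad are hex bitmasks over the six bricks of one 4+2 disperse set, one bit per brick; for example:

# up=3F -> 111111 (all six bricks up); good=2F -> 101111; bad=10 -> 010000 (brick index 4 gave a mismatching answer)
for m in 3F 2F 10; do
    printf '%s -> %06d\n' "$m" "$(echo "obase=2; ibase=16; $m" | bc)"
done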

here's the command that produces the error:

[root@gprfc032-10ge smallfile-master]# /root/smallfile-master/smallfile_cli.py --host-set ../svrs.list --top /mnt/ec42/smf --files 102400 --files-per-dir 1024 --file-size 256 --threads 1 --operation ls-l
smallfile version 2.3
                           hosts in test : ['gprfs022', 'gprfs024', 'gprfs037', 'gprfs038', 'gprfs039', 'gprfs040']
                   top test directory(s) : ['/mnt/ec42/smf']
                               operation : ls-l
                            files/thread : 102400
                                 threads : 1
           record size (KB, 0 = maximum) : 0
                          file size (KB) : 256
                  file size distribution : fixed
                           files per dir : 1024
                            dirs per dir : 10
              threads share directories? : N
                         filename prefix :
                         filename suffix :
             hash file number into dir.? : N
                     fsync after modify? : N
          pause between files (microsec) : 0
                    finish all requests? : Y
                              stonewall? : Y
                 measure response times? : N
                            verify read? : Y
                                verbose? : False
                          log to stderr? : False
                           ext.attr.size : 0
                          ext.attr.count : 0
               permute host directories? : N
                remote program directory : /root/smallfile-master
               network thread sync. dir. : /mnt/ec42/smf/network_shared
starting all threads by creating starting gate file /mnt/ec42/smf/network_shared/starting_gate.tmp
host = gprfs022, thread = 00, elapsed sec. = 82.641262, total files = 97314, total_records = 0, status = ok
host = gprfs024, thread = 00, elapsed sec. = -1430055080.684132, total files = 0, total_records = 0, status = ERR: Input/output error
host = gprfs037, thread = 00, elapsed sec. = 80.640801, total files = 73794, total_records = 0, status = ok
host = gprfs038, thread = 00, elapsed sec. = 78.833245, total files = 102400, total_records = 0, status = ok
host = gprfs039, thread = 00, elapsed sec. = 77.121790, total files = 96334, total_records = 0, status = ok
host = gprfs040, thread = 00, elapsed sec. = 81.461536, total files = 78890, total_records = 0, status = ok
total threads = 6
total files = 448732
 73.04% of requested files processed, minimum is  70.00
82.641262 sec elapsed time
5429.878370 files/sec


here's what happened in the thread that saw the error:

[root@gprfs024 ~]# tail /var/tmp/invoke_logs-00.log 
00 2015-04-26 09:31:20,681 - INFO - do_workload:  opname=ls-l iterations=102400 top_dirs=['/mnt/ec42/smf'] src_dirs=['/mnt/ec42/smf/file_srcdir/gprfs024/thrd_00'] dest_dirs=['/mnt/ec42/smf/file_dstdir/gprfs024/thrd_00'] network_dir=/mnt/ec42/smf/network_shared shared=False record_sz_kb=0 total_sz_kb=256 filesize_distr=-1 files_per_dir=1024 dirs_per_dir=10 dirs_on_demand=False xattr_size=0 xattr_count=0 starting_gate=/mnt/ec42/smf/network_shared/starting_gate.tmp prefix= suffix= hash_to_dir=False fsync=False stonewall=True files_between_checks=20 verify_read=True incompressible=False finish_all_rq=True rsp_times=False tid=00 loglevel=20 filenum=0 filenum_final=0 rq=0 rq_final=0 start=None end=None elapsed=0 host=gprfs024 status=1 abort=False log_to_stderr=False verbose=False
00 2015-04-26 09:31:23,554 - ERROR - [Errno 5] Input/output error: '/mnt/ec42/smf/file_srcdir/gprfs024/thrd_00/d_000'
Traceback (most recent call last):
  File "/root/smallfile-master/smallfile.py", line 1285, in do_workload
    func(self) # call the do_ function for that workload type
  File "/root/smallfile-master/smallfile.py", line 1048, in do_ls_l
    dir_contents.extend(os.listdir(next_dir))
OSError: [Errno 5] Input/output error: '/mnt/ec42/smf/file_srcdir/gprfs024/thrd_00/d_000'
00 2015-04-26 09:31:25,280 - ERROR - invocation did not complete cleanly
00 2015-04-26 09:31:25,280 - INFO - recorded throughput after 1 files


And here are the log messages in the client log confirming that the error was in Gluster, not smallfile:


[root@gprfs039 ~]# tail -f /var/log/glusterfs/mnt-ec42.log | grep -v ' I '
[2015-04-26 13:24:05.804919] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-10: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.805044] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-10: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.805192] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-11: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.805423] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-10: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.805568] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-11: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.805625] W [ec-common.c:162:ec_check_status] 0-ec42-disperse-11: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1D, bad=22)
[2015-04-26 13:24:05.809082] W [ec-combine.c:867:ec_combine_check] 0-ec42-disperse-2: Mismatching xdata in answers of 'LOOKUP'
[2015-04-26 13:24:05.809198] W [MSGID: 109005] [dht-selfheal.c:1771:dht_selfheal_directory] 0-ec42-dht: Directory selfheal failed : 3 subvolumes have unrecoverable errors. path = /smf/file_srcdir/gprfs039/thrd_01/d_001/d_000/d_000, gfid = e7c4c580-11bf-47c1-9f23-e2cb92e6cee4


and I did see this:

The message "I [MSGID: 109018] [dht-common.c:772:dht_revalidate_cbk] 0-ec42-dht: Mismatching layouts for /smf/file_srcdir/gprfs039/thrd_00/d_001/d_000/d_000, gfid = 00000000-0000-0000-0000-000000000000" repeated 2 times between [2015-04-26 13:32:41.082975] and [2015-04-26 13:32:41.084756]

[root@gprfs024 ~]# tail -f /var/log/glusterfs/mnt-ec42.log | grep -v ' I '
[2015-04-26 13:39:47.077979] W [fuse-bridge.c:484:fuse_entry_cbk] 0-glusterfs-fuse: 11298875: LOOKUP() /smf/file_srcdir/gprfs024/thrd_00/d_001/d_000 => -1 (Input/output error)

Comment 2 Pranith Kumar K 2015-06-11 09:50:48 UTC
This bug is already fixed in glusterfs-3.7.1-1. Moving it to ON_QA.

Comment 5 Bhaskarakiran 2015-07-06 09:08:11 UTC
Verified this on 3.7.1-7 and did not see the issue. Marking this as fixed.

Comment 6 errata-xmlrpc 2015-07-29 04:42:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

