Bug 1602262

Summary: tests/bugs/posix/bug-990028.t fails for distributed regression framework
Product: [Community] GlusterFS
Reporter: Deepshikha khandelwal <dkhandel>
Component: tests
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Version: mainline
CC: bugs, dkhandel, srangana
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-07-31 06:46:35 UTC
Attachments:
Tar of generated logs for this failed test (includes cli logs also)
Updated tar of the failed test

Description Deepshikha khandelwal 2018-07-18 05:59:19 UTC
Created attachment 1459634 [details]
Tar of generated logs for this failed test(includes cli logs also)

Description of problem:

tests/bugs/posix/bug-990028.t is constantly failing in the distributed regression framework. A tar of the generated logs for this test is attached.

Comment 1 Shyamsundar 2018-07-18 14:49:51 UTC
This fails as follows,
=========================
TEST 52 (line 37): ln /mnt/glusterfs/0/file1 /mnt/glusterfs/0/file44
ln: failed to create hard link ‘/mnt/glusterfs/0/file44’: No space left on device
RESULT 52: 1
=========================
(this continues through the last file; IOW, creation of file44-file50 fails)
=========================
TEST 58 (line 37): ln /mnt/glusterfs/0/file1 /mnt/glusterfs/0/file50
ln: failed to create hard link ‘/mnt/glusterfs/0/file50’: No space left on device
RESULT 58: 1
=========================
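
(For context, the failing portion of the test boils down to a hard-link loop of roughly this shape; a paraphrase of tests/bugs/posix/bug-990028.t, not the literal script:)

# create one empty file, then hard-link it repeatedly; each iteration is the
# "ln" at test line 37 seen failing above once the brick returns ENOSPC
touch /mnt/glusterfs/0/file1
for i in $(seq 2 50); do
    ln /mnt/glusterfs/0/file1 /mnt/glusterfs/0/file$i
done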

The subsequent failures come from attempts to inspect these files for metadata, attrs, and such, so they are all consequences of the above.

At first I suspected the max-hardlink setting, but it is at its default of 100, and we do not use any specific site.h or tuning when running in the distributed environment (as far as I can tell).

Also, at the point of failure the test has only created 1 empty file and 42 links to it; this should not cause the bricks to run out of space.

So far the Gluster logs have not thrown up any surprises or causes.

@deepshika, are there any deltas on the dist-testing machines or builds that we need to be aware of?

Also, if we have to run this on one of those setups, would that be through softserve?

Dropping a needinfo for the above; analysis based on the data provided is still ongoing.

Comment 2 Shyamsundar 2018-07-18 15:02:11 UTC
One other wacky thing in the tarball of logs: the logs seem to be from the test tests/bugs/bug-1371806_acl.t, but the log tests-bugs-posix-bug-990028.t-3.log talks about a different test case failure.

The client is mounting a volume that is 1x6 whereas the test that fails creates a 1x1 volume.

Can we get the logs for the failed test? That will help the analysis, and we may not need an instance to debug on.

Stalling analysis for now.

Comment 3 Deepshikha khandelwal 2018-07-18 15:55:57 UTC
Created attachment 1459746 [details]
Updated tar of the failed test

Comment 4 Deepshikha khandelwal 2018-07-18 16:41:32 UTC
(In reply to Shyamsundar from comment #1)
> [...]
> @deepshika, are there any deltas on the dist-testing machines or builds that
> we need to be aware of?
> 
> Also, if we have to run this on one of those setups, would that be through
> softserve?

Thank you for pointing this out. I've updated the tar; it contains the full set of logs from the distributed machines, as well as the cli logs.
Yes, you can run it via softserve instance.

Comment 5 Shyamsundar 2018-07-18 17:26:06 UTC
The failure is in setxattr on the brick process: the local FS is running out of space when attempting to set the GFID backlink xattrs for hard links. Log snippet follows:

[2018-07-18 12:50:07.298478]:++++++++++ G_LOG:tests/bugs/posix/bug-990028.t: TEST: 37 ln /mnt/glusterfs/0/file1 /mnt/glusterfs/0/file45 ++++++++++
[2018-07-18 12:50:07.307101] W [MSGID: 113117] [posix-metadata.c:671:posix_set_parent_ctime] 0-patchy-posix: posix parent set mdata failed on file [No such file or directory]
[2018-07-18 12:50:07.322628] W [MSGID: 113093] [posix-gfid-path.c:51:posix_set_gfid2path_xattr] 0-patchy-posix: setting gfid2path xattr failed on /d/backends/brick/file45: key = trusted.gfid2path.4434be659b4d25e4  [No space left on device]
[2018-07-18 12:50:07.322813] I [MSGID: 115062] [server-rpc-fops_v2.c:1089:server4_link_cbk] 0-patchy-server: 333: LINK /file43 (40ef3115-f818-4cc2-a5c3-64875f7a273a) -> 00000000-0000-0000-0000-000000000001/file45, client: CTX_ID:98c24d79-4889-4aba-bc93-91e1d5d73abe-GRAPH_ID:0-PID:4993-HOST:distributed-testing.8b445247-2057-47e7-894f-41e4a91bb536-PC_NAME:patchy-client-0-RECON_NO:-0, error-xlator: patchy-posix [No space left on device]
[2018-07-18 12:50:07.335223]:++++++++++ G_LOG:tests/bugs/posix/bug-990028.t: TEST: 37 ln /mnt/glusterfs/0/file1 /mnt/glusterfs/0/file46 ++++++++++

We would need to determine what the backing FS for the brick is (assuming XFS), and whether its xfs_info shows different options than on the instances running the centos-7 regression tests; that would be the next step in understanding why this fails.

Also worth looking at XFS options in general, to understand which ones (if any) change the allowed size of extended attrs per inode.
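
(A quick manual check along these lines may help confirm whether the backing FS itself is rejecting the xattr; a sketch only, run as root, with an illustrative xattr key modeled on the one in the log:)

# inspect the geometry of the brick's backing filesystem
xfs_info /d

# try adding one more gfid2path-style xattr by hand on the hard-linked file;
# ENOSPC here would confirm the local FS, not gluster, is refusing the setxattr
setfattr -n trusted.gfid2path.test -v "00000000-0000-0000-0000-000000000001/file1" /d/backends/brick/file1
getfattr -d -m . -e hex /d/backends/brick/file1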

Comment 6 Shyamsundar 2018-07-18 21:45:14 UTC
The test works fine on a softserve instance. XFS and mount details from my setup and from softserve are below:

SoftServ:
[root@somari1 glusterfs]# xfs_info /d
meta-data=/dev/loop0             isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@somari1 glusterfs]# mount
/var/data on /d type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

Localbox:
# xfs_info /d
meta-data=/dev/mapper/fedora-d   isize=512    agcount=4, agsize=6189056 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=24756224, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=12088, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount
/dev/mapper/fedora-d on /d type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

If softserve is not the way to reproduce this and find out why the local FS is throwing the error, we need the actual distributed system, or some data from it (such as the output of xfs_info /d), to proceed. Marking this as needinfo again.
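
(For anyone recreating the softserve setup locally, a minimal sketch; the backing-file path and size are inferred from the /dev/loop0 and /var/data details above, so treat them as illustrative:)

# create and mount a loopback XFS at /d, mirroring the softserve layout
truncate -s 10G /var/data            # 2621440 x 4096-byte blocks, as in xfs_info above
mkfs.xfs -f -i size=512 /var/data    # isize=512, matching both xfs_info outputs
mkdir -p /d
mount -o loop /var/data /d
xfs_info /d                          # confirm the geometry before re-running the test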

Comment 7 Deepshikha khandelwal 2018-07-31 06:46:12 UTC
This test is now passing in the distributed regression framework.

RCA: The XFS partition /d, which is supposed to be created on the distributed servers, was not actually being created and mounted; the tests were running on the existing device partitions instead.

Fix for this is: https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/159/files
Earlier the state of the mount module was 'present', which only specifies that the device is to be configured in fstab; it does not trigger, create, or require a mount.
The state is now changed to 'mounted', in which the device is actively mounted and appropriately configured in fstab; if the mount point is not present, it is created.
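
(Roughly, the two states differ as follows when expressed as shell operations; this is a sketch of the semantics, not the module's actual implementation, and the device path is illustrative:)

# state=present: only ensure the fstab entry exists; nothing is mounted
grep -q ' /d ' /etc/fstab || echo '/var/data /d xfs defaults,loop 0 0' >> /etc/fstab

# state=mounted: additionally create the mount point and mount the device
mkdir -p /d
mountpoint -q /d || mount /d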

Thanks @Shyam for help in resolving this one.