Bug 1317653

Summary: EINVAL errors while aggregating the directory size by quotad
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: quota
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Raghavendra Bhat <rabhat>
Assignee: Raghavendra G <rgowdapp>
QA Contact: Anil Shah <ashah>
CC: amukherj, olim, rabhat, rcyriac, rhinduja, rhs-bugs, smohan, vbellur
Target Release: RHGS 3.2.0
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Type: Bug
Clones: 1332073 (view as bug list)
Bug Blocks: 1332073, 1351515, 1366746
Last Closed: 2017-03-23 05:27:33 UTC

Attachments: A working quota test script

Comment 2 Vijaikumar Mallikarjuna 2016-03-22 06:49:30 UTC
Below is our analysis of this bug, pasted from the email conversation:


> On Thu, Mar 10, 2016 at 6:39 AM, FNU Raghavendra Manjunath 
> <rabhat> wrote:

Hi Vijay,

Goldman says they are facing some issues (quota-related errors that they are not able to understand):
https://access.redhat.com/support/cases/#/case/01596072

Basically, for them the quota aggregator lookup is failing with EINVAL (unfortunately we don't know why):

"[2016-02-18 00:36:32.783229] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002969812 gfid:5a5d42eb-f4ca-497a-8d55-14a8443afbd6) [Invalid argument]"

NOTE: There are some AFR related errors mentioned in the case description. But that is being tracked in a different issue.

They want to know whether the close() failure has anything to do with the quota error, and why quota logs it this way.

As per my analysis,
the quotad_aggregator_lookup() function sometimes returns EINVAL if decoding of the XDR structure fails. Apart from that, it can fail in qd_nameless_lookup() due to:
i) failure to get the value for the key "volume-uuid" from the dict;
ii) being unable to find the subvolume for the volume-uuid.

Unfortunately, we do not log errors in those places.

Is there any other reason why the lookup can fail in quotad_aggregator_lookup()?
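
Since none of those failure paths log anything, one hedged way to narrow this down is to break on qd_nameless_lookup() in the live quotad. This is only a sketch: it assumes the glusterfs debuginfo packages are installed and that quotad's command line contains "volfile-id gluster/quotad".

# Attach to the running quotad; if the breakpoint never fires while the
# EINVAL messages keep appearing, the XDR decode path is the likelier culprit.
gdb -batch -p "$(pgrep -f 'volfile-id gluster/quotad')" \
    -ex 'break qd_nameless_lookup' \
    -ex 'continue' \
    -ex 'backtrace'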



Hi,

We are trying to analyze the problem. Since quotad is a client process, the EINVAL could have come from a DHT or AFR lookup as well.

Looking at the logs from the directory "/cases/01596072/2016-03-08-11xxxx/fc-rhs-153158-006.dc.gs.com/var/log/glusterfs/", I found the following:

[vmallika@collab-shell glusterfs]$ ls -lrt quotad.log-201* | tail
-rwxrwxrwx 1 rabhat rabhat     3898 Oct 31 23:30 quotad.log-20151025.gz
-rwxrwxrwx 1 rabhat rabhat     3370 Nov 14 22:11 quotad.log-20151108.gz
-rwxrwxrwx 1 rabhat rabhat     3699 Nov 21 22:25 quotad.log-20151115.gz
-rwxrwxrwx 1 rabhat rabhat     1660 Dec  5 22:46 quotad.log-20151129.gz
-rwxrwxrwx 1 rabhat rabhat     5585 Dec 12 22:07 quotad.log-20151206.gz
-rwxrwxrwx 1 rabhat rabhat     4870 Dec 19 22:16 quotad.log-20151213.gz
-rwxrwxrwx 1 rabhat rabhat     8003 Jan  9 22:32 quotad.log-20160103.gz
-rwxrwxrwx 1 rabhat rabhat     8914 Jan 23 22:33 quotad.log-20160117.gz
-rwxrwxrwx 1 rabhat rabhat 46707459 Feb 13 22:49 quotad.log-20160214
-rwxrwxrwx 1 rabhat rabhat        0 Mar  5 22:07 quotad.log-20160214.gz

[vmallika@collab-shell glusterfs]$ tail quotad.log-20160214
[2016-02-14 03:49:06.138945] I [MSGID: 100011] [glusterfsd.c:1291:reincarnate] 0-glusterfsd: Fetching the volume file from server...

There are no log messages in quotad after "2016-02-14 03:49:06".


But the first EINVAL error was from "2016-02-15":
[vmallika@collab-shell bricks]$ ls -rt | xargs zgrep 'directory failed' | head
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:04.408317] E [MSGID: 115070] [server-rpc-fops.c:1555:server_
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:40.097970] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:41.180331] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]


And the EINVAL was seen on "2016-03-08" as well:
rhs-bricksdb-v001-unrep.log:[2016-03-08 11:27:47.241728] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002978276 gfid:a8ae93db-9ced-420d-abbb-01e0b9e5a6b8) [Invalid argument]


It looks like the quotad process is in a bad state (it might have hung) after "2016-02-14".
Could you please try restarting the quotad process and see if this solves the problem?
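
For reference, a minimal sketch of one way to restart it (the pgrep pattern assumes quotad's command line contains "volfile-id gluster/quotad"; "gluster volume start ... force" is used because glusterd respawns the auxiliary daemons, quotad included, on that command):

# stop the current quotad and let glusterd respawn it
pkill -f 'volfile-id gluster/quotad'
gluster volume start v001-unrep force
# confirm a new quotad process came up
pgrep -af 'volfile-id gluster/quotad'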



Thanks,
Vijay 


> Regards,
> Raghavendra

Comment 3 Vijaikumar Mallikarjuna 2016-03-22 06:50:43 UTC
We found that quotad was in a bad state with memory corruption. After restarting quotad, the EINVAL was not seen.

Comment 4 Vijaikumar Mallikarjuna 2016-03-22 06:53:49 UTC
Hi Raghavendra,


Have we taken a core dump and a state dump of the quotad process? If so, could you please point to the location of these dump files?
We will check the core dump and see if we can find any clue as to why quotad had memory corruption.
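
In case it helps, a hedged sketch of how both could be captured from the live quotad (the pgrep pattern is an assumption about quotad's command line; glusterfs daemons write a statedump under /var/run/gluster when they receive SIGUSR1, and gcore ships with gdb):

# trigger a statedump of the running quotad
kill -USR1 "$(pgrep -f 'volfile-id gluster/quotad')"
ls -lrt /var/run/gluster/    # the newest *.dump.* file is the statedump
# snapshot a core of the live process without killing it
gcore -o /tmp/quotad-core "$(pgrep -f 'volfile-id gluster/quotad')"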

Thanks,
Vijay

Comment 20 Oonkwee Lim 2016-06-30 01:26:28 UTC
Created attachment 1174291 [details]
A working quota test script

Comment 21 Oonkwee Lim 2016-06-30 01:26:52 UTC
Hello,

I have attached the quota test script that works.  The customer would like an engineering review of the script.

Thanks & Regards

Oonkwee
Emerging Technologies
RedHat Global Support

Comment 23 Oonkwee Lim 2016-06-30 05:53:59 UTC
Hello Manikandan,

Patrick Ladd?

The script is already in the customer's hands. It was corrected in a remote session with them.

Thanks

Comment 24 Manikandan 2016-06-30 05:59:39 UTC
Hi Oonkwee,

Okay then. That's what I heard during the call; at least he was reviewing the scripts.

Thanks.

Comment 25 Manikandan 2016-08-02 10:40:53 UTC
The defensive fixes patch has been posted upstream: http://review.gluster.org/#/c/14102/

Comment 27 Atin Mukherjee 2016-09-17 12:09:56 UTC
Upstream mainline : http://review.gluster.org/14102
Upstream 3.8 : http://review.gluster.org/15325

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 31 Anil Shah 2017-01-31 12:41:51 UTC
Test 1a:
* Have a dist-rep volume mounted, start I/O on the mount point, and enable quota.
* While I/O is going on, kill one brick of each replica pair.
* Bring the brick back up and wait until self-heal completes on the bricks that were down.
* Once self-heal is done, go back to step two above and kill a different brick in each replica pair.
* Do this multiple times.
* Verify the directory sizes as shown by quota, both on the individual bricks and through the quota list command, and see whether there are any discrepancies.

Test 1b:

* Run a rename-heavy workload: just renames, or renames of files to which writes are happening. Verify that the quota size is accounted correctly.
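
For reference, Test 1a maps roughly to the following command sketch (VOL, the brick path, and <brick-pid> are placeholders, not values from this bug):

VOL=distrep
gluster volume quota $VOL enable
# while I/O runs on the mount, kill one brick of a replica pair
gluster volume status $VOL          # note the PID of the chosen brick
kill <brick-pid>
# bring the brick back and wait for self-heal to finish
gluster volume start $VOL force
gluster volume heal $VOL info       # repeat until no entries are pending
# compare quota's view with the on-brick sizes
gluster volume quota $VOL list
du -sh /bricks/brick1/<dir>         # run on each brick host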

As discussed with the developers, I performed the above tests.
Not seeing any issues with quota accounting.

Verified on build glusterfs-3.8.4-13.el7rhgs.x86_64.

Comment 33 errata-xmlrpc 2017-03-23 05:27:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html