Bug 1317653 - EINVAL errors while aggregating the directory size by quotad
Summary: EINVAL errors while aggregating the directory size by quotad
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quota
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Raghavendra G
QA Contact: Anil Shah
URL:
Whiteboard:
Depends On:
Blocks: 1332073 1351515 1366746
 
Reported: 2016-03-14 18:38 UTC by Raghavendra Bhat
Modified: 2019-12-16 05:31 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1332073 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:27:33 UTC
Embargoed:


Attachments (Terms of Use)
A working quota test script (60.00 KB, application/x-tar)
2016-06-30 01:26 UTC, Oonkwee Lim


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Comment 2 Vijaikumar Mallikarjuna 2016-03-22 06:49:30 UTC
Below is our analysis of this bug, pasted from the email conversation:


> On Thu, Mar 10, 2016 at 6:39 AM, FNU Raghavendra Manjunath 
> <rabhat> wrote:

Hi Vijay,

Goldman reports that they are facing some quota-related errors they are not able to understand:
https://access.redhat.com/support/cases/#/case/01596072

Basically, the quota aggregator lookup is failing for them with EINVAL (unfortunately we don't know why):

"[2016-02-18 00:36:32.783229] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002969812 gfid:5a5d42eb-f4ca-497a-8d55-14a8443afbd6) [Invalid argument]"

NOTE: There are some AFR-related errors mentioned in the case description, but those are being tracked in a different issue.

They want to know whether the close() failure has anything to do with the quota error, and why quota is logging this message.

As per my analysis, the quotad_aggregator_lookup() function returns EINVAL if decoding of the XDR structure fails. Apart from that, it can fail in qd_nameless_lookup() for:
i) failure to get the value for the key "volume-uuid" from the dict;
ii) failure to find the subvolume for that volume-uuid.

Unfortunately we do not log errors in those places.

Is there any other reason why the lookup can fail in quotad_aggregator_lookup()?
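
For reference, one quick way to enumerate the candidate failure paths is to grep the aggregator source for the error returns. This is just a sketch; it assumes a glusterfs source checkout with the standard layout:

# Assumes a glusterfs source tree checked out at the matching version.
$ grep -n 'EINVAL' xlators/features/quota/src/quotad-aggregator.c \
      xlators/features/quota/src/quotad-helpers.c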



Hi,

We are trying to analyze the problem. Since quotad is a client process, the EINVAL could have come from the DHT and AFR lookups as well.

While looking at the logs in the directory "/cases/01596072/2016-03-08-11xxxx/fc-rhs-153158-006.dc.gs.com/var/log/glusterfs/",
I found the following:

[vmallika@collab-shell glusterfs]$ ls -lrt quotad.log-201* | tail
-rwxrwxrwx 1 rabhat rabhat     3898 Oct 31 23:30 quotad.log-20151025.gz
-rwxrwxrwx 1 rabhat rabhat     3370 Nov 14 22:11 quotad.log-20151108.gz
-rwxrwxrwx 1 rabhat rabhat     3699 Nov 21 22:25 quotad.log-20151115.gz
-rwxrwxrwx 1 rabhat rabhat     1660 Dec  5 22:46 quotad.log-20151129.gz
-rwxrwxrwx 1 rabhat rabhat     5585 Dec 12 22:07 quotad.log-20151206.gz
-rwxrwxrwx 1 rabhat rabhat     4870 Dec 19 22:16 quotad.log-20151213.gz
-rwxrwxrwx 1 rabhat rabhat     8003 Jan  9 22:32 quotad.log-20160103.gz
-rwxrwxrwx 1 rabhat rabhat     8914 Jan 23 22:33 quotad.log-20160117.gz
-rwxrwxrwx 1 rabhat rabhat 46707459 Feb 13 22:49 quotad.log-20160214
-rwxrwxrwx 1 rabhat rabhat        0 Mar  5 22:07 quotad.log-20160214.gz

[vmallika@collab-shell glusterfs]$ tail quotad.log-20160214
[2016-02-14 03:49:06.138945] I [MSGID: 100011] [glusterfsd.c:1291:reincarnate] 0-glusterfsd: Fetching the volume file from server...

There are no log messages in quotad.log after "2016-02-14 03:49:06".


But the first EINVAL error was from "2016-02-15":
[vmallika@collab-shell bricks]$ ls -rt | xargs zgrep 'directory failed' | head
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:04.408317] E [MSGID: 115070] [server-rpc-fops.c:1555:server_... (entry truncated)
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:40.097970] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:41.180331] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]


And the EINVAL error was also seen on "2016-03-08":
rhs-bricksdb-v001-unrep.log:[2016-03-08 11:27:47.241728] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002978276 gfid:a8ae93db-9ced-420d-abbb-01e0b9e5a6b8) [Invalid argument]


It looks like the quotad process has been in a bad state (it might have hung) since "2016-02-14".
Could you please try restarting the quotad process and see if this solves the problem?
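
A minimal sketch of one way to do that, assuming RHGS defaults where glusterd manages quotad (the pid below is a placeholder):

# quotad runs as a glusterfs client process with volfile-id gluster/quotad.
$ ps aux | grep 'gluster/quotad' | grep -v grep

# Kill it; restarting glusterd should respawn quotad, since quota is
# enabled on at least one volume.
$ kill <quotad-pid>
$ service glusterd restart

# Confirm quotad is back and check that it is logging again.
$ tail -f /var/log/glusterfs/quotad.log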



Thanks,
Vijay 


> Regards,
> Raghavendra

Comment 3 Vijaikumar Mallikarjuna 2016-03-22 06:50:43 UTC
We found that quotad was in a bad state with memory corruption. After restarting quotad, the EINVAL errors were no longer seen.

Comment 4 Vijaikumar Mallikarjuna 2016-03-22 06:53:49 UTC
Hi Raghavendra,


Have we taken a core dump and a state dump of the quotad process? If so, could you please point to the location of these dump files?
We will check the core dump and see if we can find any clue as to why quotad had memory corruption.
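
For reference, a minimal sketch of capturing both dumps from a live quotad, assuming the default pid-file and statedump locations:

# Gluster processes write a statedump under /var/run/gluster on SIGUSR1.
$ kill -USR1 $(cat /var/lib/glusterd/quotad/run/quotad.pid)
$ ls -lrt /var/run/gluster/    # the newest *.dump.* file is the statedump

# gdb's gcore takes a core dump of the live process without killing it.
$ gcore -o /tmp/quotad.core $(cat /var/lib/glusterd/quotad/run/quotad.pid)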

Thanks,
Vijay

Comment 20 Oonkwee Lim 2016-06-30 01:26:28 UTC
Created attachment 1174291 [details]
A working quota test script

Comment 21 Oonkwee Lim 2016-06-30 01:26:52 UTC
Hello,

I have attached the quota test script that works.  The customer would like an engineering review of the script.

Thanks & Regards

Oonkwee
Emerging Technologies
RedHat Global Support

Comment 23 Oonkwee Lim 2016-06-30 05:53:59 UTC
Hello Manikandan,

Patrick Ladd?

The script is already in the customer's hands. It was corrected in a remote session with them.

Thanks

Comment 24 Manikandan 2016-06-30 05:59:39 UTC
Hi Oonkwee,

Okay then. That's what I heard during the call; at least, he was reviewing the scripts.

Thanks.

Comment 25 Manikandan 2016-08-02 10:40:53 UTC
A patch with defensive fixes has been posted upstream: http://review.gluster.org/#/c/14102/

Comment 27 Atin Mukherjee 2016-09-17 12:09:56 UTC
Upstream mainline : http://review.gluster.org/14102
Upstream 3.8 : http://review.gluster.org/15325

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

Comment 31 Anil Shah 2017-01-31 12:41:51 UTC
Test 1a:
* Have a dist-rep volume mounted, start I/O on the mount point, and enable quota.
* While I/O is going on, kill one brick of each replica pair.
* Bring the brick back up and wait till self-heal completes on the bricks that were down.
* Once self-heal is done, go back to the kill-brick step above and kill a different brick of each replica pair.
* Do this multiple times.
* Verify the sizes of directories as shown by quota, both on the individual bricks and through the quota list command (see the sketch below), and check whether there are any discrepancies.
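
For the verification step, a minimal sketch (the volume name and brick paths are placeholders):

# Cluster-wide usage as aggregated by quotad.
$ gluster volume quota <VOLNAME> list

# Per-brick view: the quota size xattr stored on the directory itself.
$ getfattr -d -m . -e hex /bricks/<brick>/<dir> | grep quota.size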

Test 1b:

* Have a rename-heavy workload: just renames, or renames of files to which writes are happening. See that the quota size is accounted correctly.

As discussed with the developers, I performed the above tests.
Not seeing any issue with quota accounting.

Verified on build glusterfs-3.8.4-13.el7rhgs.x86_64.

Comment 33 errata-xmlrpc 2017-03-23 05:27:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

