Below is our analysis of this bug, pasted from the email conversation.
> On Thu, Mar 10, 2016 at 6:39 AM, FNU Raghavendra Manjunath
> <email@example.com> wrote:
Goldman says they are facing some issues (quota-related errors they are not able to understand).
Basically, the quota aggregator lookup is failing for them with EINVAL (unfortunately, we do not yet know why).
"[2016-02-18 00:36:32.783229] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002969812 gfid:5a5d42eb-f4ca-497a-8d55-14a8443afbd6) [Invalid argument]"
NOTE: There are some AFR related errors mentioned in the case description. But that is being tracked in a different issue.
They want to know whether the close() failure has anything to do with the quota error, and why quota is logging it this way.
As per my analysis, the quotad_aggregator_lookup() function sometimes returns EINVAL if decoding of the XDR structure fails. Apart from that, it can fail in qd_nameless_lookup() due to:
i) failure to get the value for the key "volume-uuid" from the dict, or
ii) inability to find the subvolume for that volume-uuid.
Unfortunately, we do not log errors at those places.
Is there any other reason why a lookup can fail in quotad_aggregator_lookup()?
We are trying to analyze the problem. Since quotad is a client process, the EINVAL could also have come from a DHT or AFR lookup.
While looking at the log data from directory "/cases/01596072/2016-03-08-11xxxx/fc-rhs-153158-006.dc.gs.com/var/log/glusterfs/", I found the following:
[vmallika@collab-shell glusterfs]$ ls -lrt quotad.log-201* | tail
-rwxrwxrwx 1 rabhat rabhat 3898 Oct 31 23:30 quotad.log-20151025.gz
-rwxrwxrwx 1 rabhat rabhat 3370 Nov 14 22:11 quotad.log-20151108.gz
-rwxrwxrwx 1 rabhat rabhat 3699 Nov 21 22:25 quotad.log-20151115.gz
-rwxrwxrwx 1 rabhat rabhat 1660 Dec 5 22:46 quotad.log-20151129.gz
-rwxrwxrwx 1 rabhat rabhat 5585 Dec 12 22:07 quotad.log-20151206.gz
-rwxrwxrwx 1 rabhat rabhat 4870 Dec 19 22:16 quotad.log-20151213.gz
-rwxrwxrwx 1 rabhat rabhat 8003 Jan 9 22:32 quotad.log-20160103.gz
-rwxrwxrwx 1 rabhat rabhat 8914 Jan 23 22:33 quotad.log-20160117.gz
-rwxrwxrwx 1 rabhat rabhat 46707459 Feb 13 22:49 quotad.log-20160214
-rwxrwxrwx 1 rabhat rabhat 0 Mar 5 22:07 quotad.log-20160214.gz
[vmallika@collab-shell glusterfs]$ tail quotad.log-20160214
[2016-02-14 03:49:06.138945] I [MSGID: 100011] [glusterfsd.c:1291:reincarnate] 0-glusterfsd: Fetching the volume file from server...
There are no log messages in quotad after "2016-02-14 03:49:06".
But the first EINVAL error was from "2016-02-15":
[vmallika@collab-shell bricks]$ ls -rt | xargs zgrep 'directory failed' | head
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:04.408317] E [MSGID: 115070] [server-rpc-fops.c:1555:server_
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:40.097970] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]
rhs-bricksdb-v001-unrep.log:[2016-02-15 21:27:41.180331] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]
And the EINVAL was also seen on "2016-03-08":
rhs-bricksdb-v001-unrep.log:[2016-03-08 11:27:47.241728] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002978276 gfid:a8ae93db-9ced-420d-abbb-01e0b9e5a6b8) [Invalid argument]
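The timestamp correlation above can be reproduced with a small helper that brackets the EINVAL window in a brick log. This is an illustrative sketch, not part of the case data: the sample log is inlined so the snippet is self-contained, and on a real node you would point it at rhs-bricksdb-v001-unrep.log instead.

```shell
# Write a two-line sample of the brick log (placeholder data for illustration).
cat > /tmp/sample-brick.log <<'EOF'
[2016-02-15 21:27:40.097970] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1003249800 gfid:ba22f26b-df7d-4705-8399-a91a1abef033) [Invalid argument]
[2016-03-08 11:27:47.241728] W [MSGID: 120022] [quota-enforcer-client.c:236:quota_enforcer_lookup_cbk] 0-v001-unrep-quota: Getting cluster-wide size of directory failed (path: /1002978276 gfid:a8ae93db-9ced-420d-abbb-01e0b9e5a6b8) [Invalid argument]
EOF

# First and last occurrence: fields 1 and 2 are "[date" and "time]",
# so strip the brackets to get a plain timestamp.
first=$(grep 'directory failed' /tmp/sample-brick.log | head -1 | awk '{print $1, $2}' | tr -d '[]')
last=$(grep 'directory failed' /tmp/sample-brick.log | tail -1 | awk '{print $1, $2}' | tr -d '[]')
echo "first EINVAL: $first"
echo "last  EINVAL: $last"
```

Comparing `first` against the last quotad log entry ("2016-02-14 03:49:06") is what suggests quotad stopped responding before the errors began.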
It looks like the quotad process has been in a bad state (it might have hung) since "2016-02-14".
Could you please try restarting the quotad process and see whether that solves the problem?
We found that quotad was in a bad state with memory corruption. After restarting quotad, the EINVAL was no longer seen.
Have we taken a core-dump and a state-dump of the quotad process? If so, could you please point to the location of these dump files?
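A hedged sketch of how those dumps could be captured on the affected node: the PID value is a placeholder, and DRY_RUN=1 only prints each command rather than running it against a live quotad.

```shell
# Print commands instead of executing them unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

QUOTAD_PID=12345   # placeholder; glusterd usually records the real PID in
                   # /var/lib/glusterd/quotad/run/quotad.pid

run kill -USR1 "$QUOTAD_PID"                   # gluster daemons write a state-dump on SIGUSR1
run ls /var/run/gluster/                       # state-dumps land here by default
run gcore -o /tmp/quotad.core "$QUOTAD_PID"    # live core-dump via gdb's gcore
```

The state-dump is cheap and non-fatal; gcore pauses the process briefly but leaves it running, which matters if quotad is still serving lookups.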
We will check the core-dump and see if we can find any clue as to why quotad had memory corruption.
Created attachment 1174291 [details]
A working quota test script
I have attached the quota test script that works. The customer would like an engineering review of the script.
Thanks & Regards,
Red Hat Global Support
The script is already in the customer's hands. It was corrected in a remote session with them.
Okay then. That matches what I heard during the call; at least, he was reviewing the scripts.
A patch with defensive fixes has been posted upstream: http://review.gluster.org/#/c/14102/
Upstream mainline : http://review.gluster.org/14102
Upstream 3.8 : http://review.gluster.org/15325
And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.
* Have a dist-rep volume mounted, enable quota, and start I/O on the mount point.
* While I/O is going on, kill one brick of each replica pair.
* Bring the brick back up. Wait until self-heal completes on the bricks that were down.
* Once self-heal is done, go back to point two above and kill a different brick on each replica pair.
* Do this multiple times.
* Verify the sizes of directories as shown by quota, both on the individual bricks and through the quota list command. See whether there are any discrepancies.
* Run a rename-heavy workload: just renames, or renames of files to which writes are happening. Verify that the quota size is accounted correctly.
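The kill/heal cycle above could be sketched as a small dry-run script. The volume name, brick paths, and DRY_RUN wrapper are illustrative assumptions, not part of the original test plan; with DRY_RUN=1 each cluster command is only printed.

```shell
# Print commands instead of executing them unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

VOL=testvol                    # placeholder volume name
BRICKS="/bricks/b1 /bricks/b3" # placeholder: one local brick path per replica pair, per round

for brick in $BRICKS; do
    run pkill -f "$brick"                  # kill the glusterfsd serving this brick
    run gluster volume start "$VOL" force  # bring the killed brick back up
    run gluster volume heal "$VOL" info    # poll this until no entries are pending
done

# Compare with the directory sizes seen on the individual bricks.
run gluster volume quota "$VOL" list
```

Repeating the loop with a different brick per replica pair each round matches "do this multiple times" in the steps above.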
As discussed with the developers, I performed the above test.
No issues were seen with quota accounting.
Verified on build glusterfs-3.8.4-13.el7rhgs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.