Bug 1480138 - [GSS] Locking failed errors in ganesha.log "Unable to find fd for lock operation","Open for locking failed"
Status: POST
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: 3.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Kaleb KEITHLEY
QA Contact: Manisha Saini
Depends On: 1500704
Blocks:
Reported: 2017-08-10 06:01 EDT by Prerna Sony
Modified: 2017-11-16 20:03 EST (History)
CC: 13 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
test pgm (3.47 KB, text/plain), 2017-09-19 23:01 EDT, Jiffin
Description Prerna Sony 2017-08-10 06:01:51 EDT
Created attachment 1311651 [details]
ganesha-gfapi logs

Description of problem:

While performing lock tests on a GlusterFS volume exported via NFS-Ganesha, the tests fail and the following errors appear in ganesha.log:

 epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] fsal_find_fd :FSAL :CRIT :Open for locking failed
 epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] glusterfs_lock_op2 :FSAL :CRIT :Unable to find fd for lock operation
 epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] state_lock :STATE :MAJ :Unable to lock FSAL, error=STATE_NOT_FOUND

Version-Release number of selected component (if applicable):

     RHGS 3.2
     
rpm -qa | grep ganesha

nfs-ganesha-2.4.1-11.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-18.4.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-11.el7rhgs.x86_64

How reproducible:

The issue is reproducible in the customer's environment.

Actual results:

Tests fail with the following error messages in ganesha.log:

epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] fsal_find_fd :FSAL :CRIT :Open for locking failed
epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] glusterfs_lock_op2 :FSAL :CRIT :Unable to find fd for lock operation
epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-98] state_lock :STATE :MAJ :Unable to lock FSAL, error=STATE_NOT_FOUND
epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-20] fsal_find_fd :FSAL :CRIT :Open for locking failed
epoch 86f20000 : GlusterNode1 : ganesha.nfsd-51807[work-20] glusterfs_lock_op2 :FSAL :CRIT :Unable to find fd for lock

Expected results:

Lock tests should succeed without any error.

Additional info:
The customer is running a lock test that writes concurrently from 2 hosts to 1 file in the NFS-Ganesha export. The test takes an fcntl lock on the file; when the first machine finishes, the file should remain and not be deleted, since the other client still holds an open handle on it.
Comment 2 Prerna Sony 2017-08-10 06:04 EDT
Created attachment 1311652 [details]
Packet trace from Gluster Node
Comment 3 Prerna Sony 2017-08-10 06:04 EDT
Created attachment 1311653 [details]
ganesha-logs
Comment 4 Prerna Sony 2017-08-11 08:58:58 EDT
As discussed with Jiffin, I have asked the customer to provide a packet trace that includes the glusterfs packets.
Comment 15 Damian Wojsław 2017-08-29 08:18:33 EDT
I have asked the customer to perform the test below:

> Hello
> 
> I would like to ask you to conduct the test again, but with following xlators 
> disabled.
> 
> At first, try this command:
> 
> gluster volume set <volume name> performance.open-behind off
> 
> And repeat the test. If the above doesn't help, please run additionally the 
> below:
> 
> gluster volume set <volume name> performance.write-behind off

According to Customer the test failed the same.

Regards
Damian
Comment 16 Jiffin 2017-08-30 07:17:53 EDT
(In reply to Damian Wojsław from comment #15)
> I have asked Customer to perform the same test as below:
> 
> > Hello
> > 
> > I would like to ask you to conduct the test again, but with following xlators 
> > disabled.
> > 
> > At first, try this command:
> > 
> > gluster volume set <volume name> performance.open-behind off
> > 
> > And repeat the test. If the above doesn't help, please run additionally the 
> > below:
> > 
> > gluster volume set <volume name> performance.write-behind off
> 

That's exactly what I was trying to say: from the packet trace they shared, the failures look obvious to me. They either need to share the test script so that we can understand the problem a bit more, or at least provide a set of steps to reproduce the issue in our setup.

Sorry, I didn't get why the perf xlators would help here. In the packet trace the lock was rejected at the ganesha layer; it was never even passed down to the gluster layer. Am I missing something?


> According to Customer the test failed the same.
> 
> Regards
> Damian
Comment 20 Damian Wojsław 2017-09-01 03:01:05 EDT
Hi

Additional data from Customer:

According to the verbose output of the script the following system calls are performed:

System call: fcntl(fd,F_SETLKW,F_WRLCK)
System call: lseek(fd,0,SEEK_END)
System call: write(fd,data)
System call: fcntl(fd,F_SETLK,F_UNLCK)
System call: fcntl(fd,F_SETLKW,F_WRLCK)
System call: lseek(fd,0,SEEK_END)
System call: write(fd,data)
System call: fcntl(fd,F_SETLK,F_UNLCK)

These system calls are issued multiple times per second, and after the unlock the other client may take over (since that client might then acquire the lock), as both clients are attempting to write to the same file, amqmfsck.wrt.

Regarding the final task (after about one minute of processing, the point at which the issue occurs), IBM states the following:

"We use OS function unlink() to delete the file when the concurrent write
test is finished. that function deletes the file only when no more     
process has open the file.                                             
So the file is deleted only when both amqmfsck processes has finished  
their task. It is not expected to receive following error              
" AMQ5667: Error executing system call 'fcntl(F_SETLKW)[c]' on file    
'/MQHA/logging/amqmfsck.wrt', errno 2 ('No such file or directory'). " "

In the end, as far as I understand, this means that even though one of the clients still has the file open, the file is removed by the unlink anyway. Otherwise the client would not receive an error about the missing file, nor a "stale file" error.
Comment 31 Jiffin 2017-09-19 23:01 EDT
Created attachment 1328206 [details]
test pgm
Comment 49 Jiffin 2017-11-13 01:24:05 EST
Hi Abhishek/Soumya,

Sorry to inform you that I missed the last merge window for my ganesha patch. I need to address a few nitpicks, and I will refresh the patch today itself. But we need to wait until 18 Nov for merging (nfs-ganesha has a merge window every Friday).
