Bug 1263084
Summary: | nfs-ganesha crashes due to usage of invalid fd in glfs_close | ||||||
---|---|---|---|---|---|---|---|
Product: | [Retired] nfs-ganesha | Reporter: | Jiffin <jthottan> | ||||
Component: | FSAL_GLUSTER | Assignee: | Jiffin <jthottan> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | |||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | devel | CC: | akhakhar, jthottan, kkeithle, mzywusko, ndevos, skoduri | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | All | ||||||
Whiteboard: | |||||||
Fixed In Version: | nfs-ganesha-2.2-rc3 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1263094 (view as bug list) | Environment: | |||||
Last Closed: | 2016-02-17 07:09:54 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1263094 | ||||||
Attachments: |
|
This suggests that a previous glfs function returned EBADF or EBADFD (or should have done so). Once the API returned one of these errors, Ganesha should not use the FD anymore (gfapi will cleanup the invalid FD). I'm not sure how "glfd->fd = 0x20" (32 -> errno = EPIPE). [GUESSING] Possibly the glfs->fd is set to an errno at some point? Code snippet from file_close() from FSAL_GLUSTER/handle.c makes this more clear. rc = glfs_close(objhandle->glfd); if (rc != 0) { status = gluster2fsal_error(errno); goto out; } objhandle->glfd = NULL; objhandle->openflags = FSAL_O_CLOSED; out: #ifdef GLTIMING now(&e_time); latency_update(&s_time, &e_time, lat_file_close); #endif return status; Here it glfs_close() destroy glfd even if there is error while closing, but for ganesha that glfd is still valid and try to use that glfd again. This happens( or can happen) in above mentioned case as far as I understand. So as a fix I am planning to remove "goto out" from the if condition. Yes, that looks like a possible candidate. The NOTES section in 'man 2 close' contains some pointers that might be related too: Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a pre‐ vious write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota. Note that the return value should be used only for diagnostics. In particular close() should not be retried after an EINTR since this may cause a reused descriptor from another thread to be closed. This explains that a close() returning an error might as well have closed the file descriptor. Possibly we should add checks in the different procedures (like file_write) in FSAL_GLUSTER and check for EBADF(D). Once this error is returned, clear the objhandle->glfd and set the ->openflags to FSAL_O_CLOSED. Thanks Niels for the information. I have sent the fix in file_close() at https://review.gerrithub.io/#/c/246586/. I have few questions regarding second part. 1.) What should be the criteria to have additional check in other procedures? 2.) Should we need to call glfs_close() in those scenarios ? (In reply to Jiffin from comment #4) > I have few questions regarding second part. > > 1.) What should be the criteria to have additional check in other procedures? > 2.) Should we need to call glfs_close() in those scenarios ? I think Ganesha checks for the errors and handles a lot of things already. When any of the glfs_*() functions return an error, we only need to make sure to return a matching status, see src/cache_inode/cache_inode_rdwr.c:200 and after. For errors like EBADF(D) we should set the objhandle->openflags to FSAL_O_CLOSED: 218 if ((fsal_status.major != ERR_FSAL_NOT_OPENED) 219 && (obj_hdl->obj_ops.status(obj_hdl) != FSAL_O_CLOSED)) { When a glfs_*() function returns an error that marks the fd/handle invalid, we should prevent calling cache_inode_close(). This is already the case when the status returns ESTALE, but not so much for EBADF(D) for what I can see. |
Created attachment 1073498 [details] coredump Description of problem: Ganesha daemon crashes due to passage of invalid fd in glfs_close(), more specifically it crash on __GLFS_ENTRY_VALIDATE_FD Macro. Version-Release number of selected component (if applicable): glusterfs-3.8-devel nfs-ganesha-2.3-rc-1 How reproducible: 50% Steps to Reproduce: It is just a one method to produce the issue. 1.Create a volume and set quota limit for that volume 2.Export volume via ganesha (with acls enabled, not sure about other case) 3.Mount the volume using nfsv4 4.Perform I/O's on the mount until quota limit exceeds 5.Remove all the files from the mount (rm -rf on the mount) Actual results: Ganesha daemon crashes Expected results: Ganesha daemon should not crash Additional info: Backtrace of coredump #0 0x00007fd593ecfa4e in pub_glfs_close (glfd=0x7fd534223650) at glfs-fops.c:218 218 __GLFS_ENTRY_VALIDATE_FD (glfd, invalid_fs); Missing separate debuginfos, use: dnf debuginfo-install pcre-8.37-3.fc22.x86_64 (gdb) bt #0 0x00007fd593ecfa4e in pub_glfs_close (glfd=0x7fd534223650) at glfs-fops.c:218 #1 0x00007fd5942f60e0 in file_close (obj_hdl=0x7fd5341be488) at /root/nfs-ganesha/src/FSAL/FSAL_GLUSTER/handle.c:1329 #2 0x00000000004ea71e in cache_inode_close (entry=0x7fd53420d230, flags=128) at /root/nfs-ganesha/src/cache_inode/cache_inode_open_close.c:305 #3 0x00000000004d8731 in cache_inode_remove (entry=0x7fd5500c0d70, name=0x7fd544069d80 "file_661") at /root/nfs-ganesha/src/cache_inode/cache_inode_remove.c:135 #4 0x0000000000478465 in nfs4_op_remove (op=0x7fd55c00ae00, data=0x7fd56ef8ae40, resp=0x7fd54418e110) at /root/nfs-ganesha/src/Protocols/NFS/nfs4_op_remove.c:103 #5 0x000000000045c31a in nfs4_Compound (arg=0x7fd55c00a790, req=0x7fd55c00a5d0, res=0x7fd54418b270) at /root/nfs-ganesha/src/Protocols/NFS/nfs4_Compound.c:710 #6 0x0000000000442c69 in nfs_rpc_execute (reqdata=0x7fd55c00a5a0) at /root/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1289 #7 0x0000000000443598 in worker_run (ctx=0x1f8e400) at /root/nfs-ganesha/src/MainNFSD/nfs_worker_thread.c:1548 #8 0x0000000000515115 in fridgethr_start_routine (arg=0x1f8e400) at /root/nfs-ganesha/src/support/fridgethr.c:561 #9 0x00007fd5955d2555 in start_thread (arg=0x7fd56ef8c700) at pthread_create.c:333 #10 0x00007fd595105b9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 contents of glfd at frame 0 (gdb) p *glfd $1 = {openfds = {next = 0x7fd500000005, prev = 0x21}, fs = 0x7fd534099880, offset = 140553677477904, fd = 0x20, entries = {next = 0x44, prev = 0x7fd50001712c}, next = 0x7fd534223680, readdirbuf = 0x7fd534223680} Also one thing should be noted, crash happens when last written file( file which exceeds the quota size) is removed.