Bug 811552 - [glusterfs-3.3.0qa33]: fd migration failing in fuse_resolve
Summary: [glusterfs-3.3.0qa33]: fd migration failing in fuse_resolve
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Raghavendra Bhat
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 817967
TreeView+ depends on / blocked
 
Reported: 2012-04-11 12:00 UTC by Raghavendra Bhat
Modified: 2013-07-24 17:41 UTC (History)
3 users (show)

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:41:32 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: glusterfs-3.3.0qa40
Embargoed:


Attachments (Terms of Use)

Description Raghavendra Bhat 2012-04-11 12:00:30 UTC
Description of problem:
3 replica volume with 3 fuse and 3 nfs clients running different tests such as ping_pong, rdd, fd-perf-test, sanity etc. Some graph changes were also done. According to the log file of one of the fuse clients, while resolving fd migration to the new graph is failing. That log message is printed too many times in the log file.


[2012-03-11 21:59:26.240432] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 4-mirror-replicate-0: added root inode
[2012-03-11 21:59:26.241713] W [fd-lk.c:428:print_lock_list] 0-fd-lk: lock list:
[2012-03-11 22:00:52.130850] W [fuse-bridge.c:3561:fuse_migrate_fd] 0-glusterfs-fuse: name-less lookup of gfid (e39a36fd-066e-4f75-9947-e139b5
761c16) failed (Input/output error)
[2012-03-11 22:00:52.130950] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop
with EBADF
[2012-03-11 22:00:52.131203] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131321] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131398] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131463] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131517] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131573] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131627] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131681] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131751] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131809] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131905] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.131981] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132064] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132127] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132188] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132243] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132345] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132443] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
[2012-03-11 22:00:52.132512] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF
:

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:


gluster v i
 
Volume Name: mirror
Type: Replicate
Volume ID: e6423147-ee12-453f-bcf6-2fb09a9087c5
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Options Reconfigured:
features.lock-heal: off
performance.flush-behind: off
performance.stat-prefetch: off
performance.client-io-threads: on
cluster.self-heal-daemon: on

Comment 1 Raghavendra Bhat 2012-04-24 12:22:23 UTC
Just running ping_pong with graph changes is suffecient to hit the bug. It led to total fd count increasing to the VFS max file limit, and thus errors and some of the applications getting segfaulted.

When ran with valgrind ping_pong started getting EBADFD errors and logs indicated fd migration failing.




2012-04-24 16:05:38.852707] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.853246] D [socket.c:193:__socket_rwv] 3-mirror-client-2: EOF from peer 127.0.0.1:24013
[2012-04-24 16:05:38.854810] D [socket.c:1521:__socket_proto_state_machine] 3-mirror-client-2: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24013)
[2012-04-24 16:05:38.855269] D [socket.c:1807:socket_event_handler] 3-transport: disconnecting now
[2012-04-24 16:05:38.856020] I [client.c:2099:client_rpc_notify] 3-mirror-client-2: disconnected
[2012-04-24 16:05:38.856468] E [afr-common.c:3666:afr_notify] 3-mirror-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-04-24 16:05:38.856901] D [fuse-bridge.c:4208:notify] 0-fuse: got event 6 on graph 3
[2012-04-24 16:05:38.858771] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.875087] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected)
[2012-04-24 16:05:38.877005] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81528: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed
[2012-04-24 16:05:38.880052] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.881607] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.883238] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.886202] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected)
[2012-04-24 16:05:38.886846] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81532: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed
[2012-04-24 16:05:38.888388] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.889948] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.891540] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.894579] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected)
[2012-04-24 16:05:38.895180] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81536: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed
[2012-04-24 16:05:38.896698] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.898243] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.899826] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF
[2012-04-24 16:05:38.901319] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF


This is the dmesg output indicating the fd count hitting the VFS max limit.


Apr 24 12:29:09 hyperspace kernel: [ 3671.564994] VFS: file-max limit 385710 reached
Apr 24 12:29:09 hyperspace kernel: [ 3673.048036] VFS: file-max limit 385710 reached
Apr 24 12:29:09 hyperspace kernel: [ 3673.151796] VFS: file-max limit 385710 reached
Apr 24 12:29:09 hyperspace kernel: [ 3673.162949] VFS: file-max limit 385710 reached
Apr 24 12:29:10 hyperspace kernel: [ 3673.720373] VFS: file-max limit 385710 reached
Apr 24 12:29:10 hyperspace kernel: [ 3673.772767] VFS: file-max limit 385710 reached
Apr 24 12:29:10 hyperspace kernel: [ 3673.798308] VFS: file-max limit 385710 reached
Apr 24 12:29:10 hyperspace kernel: [ 3673.798956] VFS: file-max limit 385710 reached
Apr 24 12:29:10 hyperspace kernel: [ 3673.814634] VFS: file-max limit 385710 reached
Apr 24 12:29:11 hyperspace kernel: [ 3675.410128] show_signal_msg: 24 callbacks suppressed
Apr 24 12:29:11 hyperspace kernel: [ 3675.410134] compiz[1836]: segfault at 20 ip 00007f1139ecd58b sp 00007fffd2822970 error 4 in libdrm_intel.so.1.0.0[7f1139ec7000+a000]
Apr 24 12:29:11 hyperspace kernel: [ 3675.412261] VFS: file-max limit 385710 reached
Apr 24 12:29:13 hyperspace kernel: [ 3677.149905] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace kernel: [ 3680.831535] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace postfix/master[1426]: warning: master_wakeup_timer_event: service qmgr(public/qmgr): Too many open files in system
Apr 24 12:29:17 hyperspace postfix/master[1426]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Apr 24 12:29:17 hyperspace kernel: [ 3680.949343] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace kernel: [ 3680.953927] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace kernel: [ 3680.963900] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace postfix/qmgr[2082]: fatal: scan_dir_push: open directory deferred: Too many open files in system
Apr 24 12:29:17 hyperspace kernel: [ 3681.342443] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace kernel: [ 3681.370371] VFS: file-max limit 385710 reached
Apr 24 12:29:17 hyperspace kernel: [ 3681.373358] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3681.550102] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3681.666582] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3681.672847] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3681.781079] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3681.782683] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace postfix/master[1426]: warning: process /usr/lib/postfix/qmgr pid 2082 exit status 1
Apr 24 12:29:18 hyperspace kernel: [ 3682.341382] VFS: file-max limit 385710 reached
Apr 24 12:29:18 hyperspace kernel: [ 3682.521635] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.560183] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.584952] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.635927] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.675731] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.681351] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.691349] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.701340] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.711316] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3682.834319] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3683.000432] VFS: file-max limit 385710 reached
Apr 24 12:29:19 hyperspace kernel: [ 3683.350397] VFS: file-max limit 385710 reached
:

Comment 2 Anand Avati 2012-04-24 20:08:31 UTC
CHANGE: http://review.gluster.com/3222 (mount/fuse: unref the fds after they have been migrated to the new graph) merged in master by Anand Avati (avati)

Comment 3 Raghavendra Bhat 2012-05-09 09:48:02 UTC
Checked with glusterfs-3.3.0qa40. ran ping_pong with graph changes happening parallely. No such log messages are seen and VFS max limit is not reached since we are properly unrefing the fd after being migrated to the new graph.


Note You need to log in before you can comment on or make changes to this bug.