Description of problem: 3 replica volume with 3 fuse and 3 nfs clients running different tests such as ping_pong, rdd, fd-perf-test, sanity etc. Some graph changes were also done. According to the log file of one of the fuse clients, while resolving fd migration to the new graph is failing. That log message is printed too many times in the log file. [2012-03-11 21:59:26.240432] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 4-mirror-replicate-0: added root inode [2012-03-11 21:59:26.241713] W [fd-lk.c:428:print_lock_list] 0-fd-lk: lock list: [2012-03-11 22:00:52.130850] W [fuse-bridge.c:3561:fuse_migrate_fd] 0-glusterfs-fuse: name-less lookup of gfid (e39a36fd-066e-4f75-9947-e139b5 761c16) failed (Input/output error) [2012-03-11 22:00:52.130950] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131203] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131321] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131398] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131463] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131517] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131573] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131627] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131681] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131751] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131809] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131905] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.131981] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132064] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132127] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132188] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132243] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132345] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132443] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF [2012-03-11 22:00:52.132512] W [fuse-resolve.c:346:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0x11be5b0) did not complete, failing fop with EBADF : Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: gluster v i Volume Name: mirror Type: Replicate Volume ID: e6423147-ee12-453f-bcf6-2fb09a9087c5 Status: Started Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 10.16.156.9:/export/mirror Brick2: 10.16.156.12:/export/mirror Brick3: 10.16.156.15:/export/mirror Options Reconfigured: features.lock-heal: off performance.flush-behind: off performance.stat-prefetch: off performance.client-io-threads: on cluster.self-heal-daemon: on
Just running ping_pong with graph changes is suffecient to hit the bug. It led to total fd count increasing to the VFS max file limit, and thus errors and some of the applications getting segfaulted. When ran with valgrind ping_pong started getting EBADFD errors and logs indicated fd migration failing. 2012-04-24 16:05:38.852707] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.853246] D [socket.c:193:__socket_rwv] 3-mirror-client-2: EOF from peer 127.0.0.1:24013 [2012-04-24 16:05:38.854810] D [socket.c:1521:__socket_proto_state_machine] 3-mirror-client-2: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24013) [2012-04-24 16:05:38.855269] D [socket.c:1807:socket_event_handler] 3-transport: disconnecting now [2012-04-24 16:05:38.856020] I [client.c:2099:client_rpc_notify] 3-mirror-client-2: disconnected [2012-04-24 16:05:38.856468] E [afr-common.c:3666:afr_notify] 3-mirror-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up. [2012-04-24 16:05:38.856901] D [fuse-bridge.c:4208:notify] 0-fuse: got event 6 on graph 3 [2012-04-24 16:05:38.858771] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.875087] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected) [2012-04-24 16:05:38.877005] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81528: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed [2012-04-24 16:05:38.880052] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.881607] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.883238] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.886202] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected) [2012-04-24 16:05:38.886846] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81532: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed [2012-04-24 16:05:38.888388] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.889948] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.891540] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.894579] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: f8a743cd-5f91-4132-a870-09b2f9a57627: failed to resolve (Transport endpoint is not connected) [2012-04-24 16:05:38.895180] W [fuse-bridge.c:2845:fuse_getxattr_resume] 0-glusterfs-fuse: 81536: GETXATTR f8a743cd-5f91-4132-a870-09b2f9a57627/191002180 (security.capability) resolution failed [2012-04-24 16:05:38.896698] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.898243] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.899826] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF [2012-04-24 16:05:38.901319] W [fuse-resolve.c:360:fuse_resolve_fd] 0-fuse-resolve: migration of fd (0xb577af0) did not complete, failing fop with EBADF This is the dmesg output indicating the fd count hitting the VFS max limit. Apr 24 12:29:09 hyperspace kernel: [ 3671.564994] VFS: file-max limit 385710 reached Apr 24 12:29:09 hyperspace kernel: [ 3673.048036] VFS: file-max limit 385710 reached Apr 24 12:29:09 hyperspace kernel: [ 3673.151796] VFS: file-max limit 385710 reached Apr 24 12:29:09 hyperspace kernel: [ 3673.162949] VFS: file-max limit 385710 reached Apr 24 12:29:10 hyperspace kernel: [ 3673.720373] VFS: file-max limit 385710 reached Apr 24 12:29:10 hyperspace kernel: [ 3673.772767] VFS: file-max limit 385710 reached Apr 24 12:29:10 hyperspace kernel: [ 3673.798308] VFS: file-max limit 385710 reached Apr 24 12:29:10 hyperspace kernel: [ 3673.798956] VFS: file-max limit 385710 reached Apr 24 12:29:10 hyperspace kernel: [ 3673.814634] VFS: file-max limit 385710 reached Apr 24 12:29:11 hyperspace kernel: [ 3675.410128] show_signal_msg: 24 callbacks suppressed Apr 24 12:29:11 hyperspace kernel: [ 3675.410134] compiz[1836]: segfault at 20 ip 00007f1139ecd58b sp 00007fffd2822970 error 4 in libdrm_intel.so.1.0.0[7f1139ec7000+a000] Apr 24 12:29:11 hyperspace kernel: [ 3675.412261] VFS: file-max limit 385710 reached Apr 24 12:29:13 hyperspace kernel: [ 3677.149905] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace kernel: [ 3680.831535] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace postfix/master[1426]: warning: master_wakeup_timer_event: service qmgr(public/qmgr): Too many open files in system Apr 24 12:29:17 hyperspace postfix/master[1426]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system Apr 24 12:29:17 hyperspace kernel: [ 3680.949343] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace kernel: [ 3680.953927] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace kernel: [ 3680.963900] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace postfix/qmgr[2082]: fatal: scan_dir_push: open directory deferred: Too many open files in system Apr 24 12:29:17 hyperspace kernel: [ 3681.342443] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace kernel: [ 3681.370371] VFS: file-max limit 385710 reached Apr 24 12:29:17 hyperspace kernel: [ 3681.373358] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3681.550102] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3681.666582] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3681.672847] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3681.781079] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3681.782683] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace postfix/master[1426]: warning: process /usr/lib/postfix/qmgr pid 2082 exit status 1 Apr 24 12:29:18 hyperspace kernel: [ 3682.341382] VFS: file-max limit 385710 reached Apr 24 12:29:18 hyperspace kernel: [ 3682.521635] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.560183] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.584952] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.635927] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.675731] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.681351] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.691349] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.701340] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.711316] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3682.834319] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3683.000432] VFS: file-max limit 385710 reached Apr 24 12:29:19 hyperspace kernel: [ 3683.350397] VFS: file-max limit 385710 reached :
CHANGE: http://review.gluster.com/3222 (mount/fuse: unref the fds after they have been migrated to the new graph) merged in master by Anand Avati (avati)
Checked with glusterfs-3.3.0qa40. ran ping_pong with graph changes happening parallely. No such log messages are seen and VFS max limit is not reached since we are properly unrefing the fd after being migrated to the new graph.