Description of problem:

A reopen of an fd is performed after a brick comes back online. If the reopen is for an fd that is already marked as released and the reopen fails, reopen_fd_count is decremented, but the fdctx corresponding to the released fd is still added to the saved_fds list. That stale fdctx remains there forever, so reopen_fd_count is decremented below zero on a subsequent 'CHILD_DOWN' followed by 'CHILD_UP'. Because the counter is a uint64_t, it wraps around to 18446744073709551615 instead of going negative.

Please see the following gdb logs.

Here the reopen_fd_count becomes zero:

(gdb) s
decrement_reopen_fd_count (this=0x22768f0, conf=0x22beb70) at client-lk.c:591
591         uint64_t fd_count = 0;
(gdb) n
593         LOCK (&conf->rec_lock);
(gdb)
595         fd_count = --(conf->reopen_fd_count);
(gdb)
597         UNLOCK (&conf->rec_lock);
(gdb)
599         if (fd_count == 0) {
(gdb)
600         gf_log (this->name, GF_LOG_INFO,
(gdb)
602         client_set_lk_version (this);
(gdb)
603         client_notify_parents_child_up (this);
(gdb)
606         return fd_count;
(gdb)
607         }

The release callback for the stale fdctx then decrements the counter a second time and it wraps:

Breakpoint 10, clnt_release_reopen_fd_cbk (req=0x7f5f5f16733c, iov=0x7f5f5f16737c, count=1, myframe=0x7f5f677a4894) at client-handshake.c:595
595         xlator_t *this = NULL;
(gdb) n
596         call_frame_t *frame = NULL;
(gdb)
597         clnt_conf_t *conf = NULL;
(gdb)
598         clnt_fd_ctx_t *fdctx = NULL;
(gdb)
600         frame = myframe;
(gdb)
601         this = frame->this;
(gdb)
602         fdctx = (clnt_fd_ctx_t *) frame->local;
(gdb)
603         conf = (clnt_conf_t *) this->private;
(gdb)
605         clnt_fd_lk_reacquire_failed (this, fdctx, conf);
(gdb)
607         decrement_reopen_fd_count (this, conf);
(gdb) s
decrement_reopen_fd_count (this=0x22768f0, conf=0x22beb70) at client-lk.c:591
591         uint64_t fd_count = 0;
(gdb) n
593         LOCK (&conf->rec_lock);
(gdb)
595         fd_count = --(conf->reopen_fd_count);
(gdb)
597         UNLOCK (&conf->rec_lock);
(gdb) p fd_count
$18 = 18446744073709551615
(gdb) p conf->reopen_fd_count
$19 = 18446744073709551615

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The check for this exists only on the path taken when lock self-healing is enabled. The condition was not checked elsewhere because lock self-healing was expected to be always on. Now that it is optional, we must handle this case as well.
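As a rough illustration of the missing check, the sketch below tests the released flag after a failed reopen regardless of whether lock self-healing is enabled. The types and the function requeue_after_failed_reopen are hypothetical simplifications, not the real clnt_fd_ctx_t / clnt_conf_t definitions or the actual patch. The list is a bare singly linked list and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for clnt_fd_ctx_t. */
struct fd_ctx {
    bool           released;  /* fd was already released by the application */
    struct fd_ctx *next;
};

/* Hypothetical, simplified stand-in for clnt_conf_t. */
struct conf {
    struct fd_ctx *saved_fds;    /* fds to reopen on the next CHILD_UP */
    size_t         saved_count;
};

/* Called after a reopen has failed. Returns true if ctx was put back on
 * saved_fds, false if it was already released and must be destroyed by
 * the caller instead of being re-queued. */
static bool requeue_after_failed_reopen(struct conf *conf, struct fd_ctx *ctx)
{
    assert(conf != NULL && ctx != NULL);

    if (ctx->released) {
        /* Re-queueing a released ctx is what leaves a stale entry behind. */
        return false;
    }

    ctx->next = conf->saved_fds;
    conf->saved_fds = ctx;
    conf->saved_count++;
    return true;
}
```

With a check of this kind, a released fdctx never reaches saved_fds, so the next 'CHILD_DOWN' / 'CHILD_UP' cycle has no stale entry that could trigger an extra decrement.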
The "pre-release" version is ambiguous and is about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.