Description of problem:
While syncing/re-syncing multiple snapshots I am seeing these error messages continuously, which is flooding my log file; it grew to 4 GB in size.

librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown

Version-Release number of selected component (if applicable):
rbd-mirror-10.2.2-1.el7cp.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create an image with journaling enabled.
2. Create 25 snapshots.
3. While the snapshots are being synced, disable journaling.
4. After a while, enable journaling again; the remaining snapshots get synced.
5. Then delete 15 snapshots; while the deletion is happening, disable journaling.
6. After a while, enable journaling again; the remaining snapshots get synced.

While it was syncing (step 6), I saw the journal state showing as "disabling" for a while on the slave node, but later it was corrected and properly synced to the master node. However, it generated a core, and the logs and rbd-mirror status were continuously showing:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-06-17 09:44:59.999928 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:44:59.999933 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch
2016-06-17 09:45:00.000255 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: r=-108
2016-06-17 09:45:00.000263 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:45:00.000270 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch
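The reproduction steps above can be sketched with the rbd CLI. This is a minimal sketch, not the exact commands from the original run: the pool name, image name, image size, and the sleep intervals are assumptions, and journaling is toggled via `rbd feature disable/enable`.

```shell
# Assumed pool/image names -- adjust to your environment.
POOL=rbd
IMG=mirror-test

# Step 1: create an image with journaling (which requires exclusive-lock) enabled.
rbd create --size 1024 --image-feature exclusive-lock,journaling ${POOL}/${IMG}

# Step 2: create 25 snapshots.
for i in $(seq 1 25); do rbd snap create ${POOL}/${IMG}@snap${i}; done

# Steps 3-4: while the snapshots are syncing to the peer, disable
# journaling, wait a while, then re-enable it.
rbd feature disable ${POOL}/${IMG} journaling
sleep 60
rbd feature enable ${POOL}/${IMG} journaling

# Steps 5-6: delete 15 snapshots and toggle journaling again mid-deletion.
for i in $(seq 1 15); do rbd snap rm ${POOL}/${IMG}@snap${i}; done &
rbd feature disable ${POOL}/${IMG} journaling
sleep 60
rbd feature enable ${POOL}/${IMG} journaling
wait
```

These commands need a running Ceph cluster with rbd-mirror configured between two peers, so they cannot be run standalone.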
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-06-16 20:33:09.917456 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030e80 image watch failed: 139627306619536, (107) Transport endpoint is not connected
2016-06-16 20:33:11.649410 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84022530 image watch failed: 139627306695968, (107) Transport endpoint is not connected
2016-06-16 20:33:11.887698 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84047a50 image watch failed: 139627306561744, (107) Transport endpoint is not connected
2016-06-16 20:33:39.077626 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84048470 image watch failed: 139627306601040, (107) Transport endpoint is not connected
2016-06-16 20:33:40.255308 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030940 image watch failed: 139627306546144, (107) Transport endpoint is not connected
2016-06-16 20:33:40.533680 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84031d70 image watch failed: 139627306549744, (107) Transport endpoint is not connected
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Actual results:
Lots of error messages.

Expected results:
There should not be any error messages.

Additional info:
Had to restart the rbd-mirror daemon to stop the message flood.
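The restart mentioned under "Additional info" can be done through systemd. This is a sketch under assumptions: the instance name ("admin" here) and the pool name depend on the local deployment.

```shell
# Restart the rbd-mirror daemon to drop the dead watch and stop the log flood.
# The instance name after "@" is deployment-specific.
systemctl restart ceph-rbd-mirror@admin

# Confirm mirroring recovers afterwards (pool name is a placeholder).
rbd mirror pool status rbd --verbose
```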
@Jason: As the log files were too large to attach, please log in to the slave: magna003
Log file: /var/log/ceph/qemu-guest-20365.log
Core: /tmp/cores/core.tp_librbd.10600.magna003.1466156585
Master: 10.40.44.44 root/redhat
The cluster was overloaded and unable to service watch heartbeats within the default 30-second timeout period.
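For reference, the 30-second default mentioned above presumably corresponds to the OSD-side client watch timeout. On a chronically overloaded test cluster it could be raised as a mitigation; this is a hypothetical ceph.conf fragment, not a fix for the re-watch error handling itself, and it assumes `osd client watch timeout` is the relevant knob:

```
# ceph.conf -- hypothetical mitigation only
[osd]
# Seconds the OSD waits for a watch heartbeat before timing out the watcher
# (default 30).
osd client watch timeout = 60
```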
Upstream pull request: https://github.com/ceph/ceph/pull/9970
Verified with 10.2.3-10.el7cp.x86_64: no core was generated. Error messages were still observed, but they are not flooding the log, and the rbd-mirror daemon is not showing a faulty status. Hence moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2815.html