Bug 1347664

Summary: Seeing a continuous error message "librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown"
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Tanay Ganguly <tganguly>
Assignee: Jason Dillaman <jdillama>
QA Contact: Rachana Patel <racpatel>
CC: ceph-eng-bugs, flucifre, hnallurv, kurs, uboppana
Target Milestone: rc
Target Release: 2.1
Fixed In Version: RHEL: ceph-10.2.3-2.el7cp; Ubuntu: ceph_10.2.3-3redhat1xenial
Type: Bug
Last Closed: 2016-11-22 19:26:45 UTC

Description Tanay Ganguly 2016-06-17 11:12:15 UTC
Description of problem:
While multiple snapshots were syncing/re-syncing, I kept seeing this error message continuously; it flooded my log file, which grew to 4 GB in size.

librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown

Version-Release number of selected component (if applicable):
rbd-mirror-10.2.2-1.el7cp.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create an image with journaling enabled.
2. Create 25 snapshots.
3. While the snapshots are syncing, disable journaling.
4. After a while, re-enable journaling; the remaining snapshots get synced.
5. Delete 15 snapshots; while the deletion is in progress, disable journaling.
6. After a while, re-enable journaling again; the remaining snapshots get synced (see the illustrative commands below).
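
For reference, a minimal sketch of equivalent rbd commands; the pool name (mirrorpool), image name (image1), and size are assumptions, and pool-mode mirroring is presumed to be configured between both clusters already:

# 1. Create an image with journaling enabled (journaling requires exclusive-lock)
rbd create mirrorpool/image1 --size 10240 --image-feature exclusive-lock,journaling

# 2. Create 25 snapshots
for i in $(seq 1 25); do rbd snap create mirrorpool/image1@snap$i; done

# 3. Disable journaling while the snapshots are still syncing
rbd feature disable mirrorpool/image1 journaling

# 4. Re-enable journaling after a while
rbd feature enable mirrorpool/image1 journaling

# 5. Delete 15 snapshots, then disable journaling while the deletion is still replaying
for i in $(seq 1 15); do rbd snap rm mirrorpool/image1@snap$i; done
rbd feature disable mirrorpool/image1 journaling

# 6. Re-enable journaling again
rbd feature enable mirrorpool/image1 journaling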

While it was syncing (step 6), I saw the journal state showing as "disabling" for a while on the slave node.

It later got corrected and synced properly to the master node, but it generated a core dump, and the logs and rbd-mirror status continuously showed:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2016-06-17 09:44:59.999928 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:44:59.999933 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch
2016-06-17 09:45:00.000255 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: r=-108
2016-06-17 09:45:00.000263 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:45:00.000270 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2016-06-16 20:33:09.917456 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030e80 image watch failed: 139627306619536, (107) Transport endpoint is not connected
2016-06-16 20:33:11.649410 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84022530 image watch failed: 139627306695968, (107) Transport endpoint is not connected
2016-06-16 20:33:11.887698 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84047a50 image watch failed: 139627306561744, (107) Transport endpoint is not connected
2016-06-16 20:33:39.077626 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84048470 image watch failed: 139627306601040, (107) Transport endpoint is not connected
2016-06-16 20:33:40.255308 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030940 image watch failed: 139627306546144, (107) Transport endpoint is not connected
2016-06-16 20:33:40.533680 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84031d70 image watch failed: 139627306549744, (107) Transport endpoint is not connected


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Actual results:
Lots of error messages are logged.

Expected results:
There should not be any error messages.

Additional info:

Had to restart the rbd-mirror daemon to stop the message flood (a restart sketch is shown below).
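
A minimal restart sketch, assuming the daemon is managed by systemd and uses the default "admin" instance name (the actual instance name on the node may differ):

# restart a single named rbd-mirror instance
sudo systemctl restart ceph-rbd-mirror@admin
# or restart all rbd-mirror instances on the host
sudo systemctl restart ceph-rbd-mirror.target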

Comment 3 Tanay Ganguly 2016-06-17 11:15:58 UTC
@Jason:

As the log files were too large to attach, please log in to:

Slave: magna003

Log File : /var/log/ceph/qemu-guest-20365.log
Core:      /tmp/cores/core.tp_librbd.10600.magna003.1466156585

Master: 10.40.44.44

root/redhat

Comment 5 Jason Dillaman 2016-06-17 13:16:33 UTC
The cluster was overloaded and unable to service watch heartbeats within the default 30-second timeout period.
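
For context, the watch heartbeat timeout referred to here is presumably the OSD option osd_client_watch_timeout, which defaults to 30 seconds. One way to confirm the value on an OSD node via the admin socket (osd.0 is just an example daemon ID):

ceph daemon osd.0 config get osd_client_watch_timeout
# expected default: { "osd_client_watch_timeout": "30" }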

Comment 6 Jason Dillaman 2016-08-10 19:17:28 UTC
Upstream pull request: https://github.com/ceph/ceph/pull/9970

Comment 11 Rachana Patel 2016-10-31 14:31:41 UTC
Verified with 10.2.3-10.el7cp.x86_64

- No core was generated. Error messages are still observed, but they are not flooding the log and the rbd-mirror daemon is not showing a faulty status. Hence moving to VERIFIED.
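
For anyone re-verifying, the replication health can be spot-checked with the mirroring status subcommands, assuming they are available in this build (pool and image names below are placeholders):

rbd mirror pool status mirrorpool --verbose
rbd mirror image status mirrorpool/image1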

Comment 13 errata-xmlrpc 2016-11-22 19:26:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html