Bug 1347664

Summary: Seeing a continuous error message "librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown"
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RBD
Version: 2.0
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Tanay Ganguly <tganguly>
Assignee: Jason Dillaman <jdillama>
QA Contact: Rachana Patel <racpatel>
CC: ceph-eng-bugs, flucifre, hnallurv, kurs, uboppana
Target Milestone: rc
Target Release: 2.1
Fixed In Version: RHEL: ceph-10.2.3-2.el7cp; Ubuntu: ceph_10.2.3-3redhat1xenial
Type: Bug
Last Closed: 2016-11-22 19:26:45 UTC

Description Tanay Ganguly 2016-06-17 11:12:15 UTC
Description of problem:
While multiple snapshots were syncing/re-syncing, I kept seeing this error message continuously; it flooded my log file, which grew to 4 GB in size.

librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown

Version-Release number of selected component (if applicable):
rbd-mirror-10.2.2-1.el7cp.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create an image with journaling enabled.
2. Create 25 snapshots.
3. While the snapshots are syncing, disable journaling.
4. After a while, re-enable journaling; the remaining snapshots get synced.
5. Delete 15 snapshots; while the deletion is in progress, disable journaling.
6. After a while, re-enable journaling again; the remaining snapshots get synced (see the illustrative commands below).
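
For reference, a minimal sketch of equivalent rbd commands; the pool name (mirrorpool), image name (image1), and size are assumptions, and pool-mode mirroring is presumed to be configured between both clusters already:

# 1. Create an image with journaling enabled (journaling requires exclusive-lock)
rbd create mirrorpool/image1 --size 10240 --image-feature exclusive-lock,journaling

# 2. Create 25 snapshots
for i in $(seq 1 25); do rbd snap create mirrorpool/image1@snap$i; done

# 3. Disable journaling while the snapshots are still syncing
rbd feature disable mirrorpool/image1 journaling

# 4. Re-enable journaling after a while
rbd feature enable mirrorpool/image1 journaling

# 5. Delete 15 snapshots, then disable journaling while the deletion is still replaying
for i in $(seq 1 15); do rbd snap rm mirrorpool/image1@snap$i; done
rbd feature disable mirrorpool/image1 journaling

# 6. Re-enable journaling again
rbd feature enable mirrorpool/image1 journaling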

While it was syncing (step 6), I saw the journal state showing as "disabling" for a while on the slave node.

It later got corrected and synced properly to the master node, but it generated a core dump, and the logs and rbd-mirror status continuously showed:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2016-06-17 09:44:59.999928 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:44:59.999933 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch
2016-06-17 09:45:00.000255 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: r=-108
2016-06-17 09:45:00.000263 7fa7dd7fa700 -1 librbd::ObjectWatcher: rbd_mirroring: handle_rewatch: error encountered during re-watch: (108) Cannot send after transport endpoint shutdown
2016-06-17 09:45:00.000270 7fa7dd7fa700 20 librbd::ObjectWatcher: rbd_mirroring: rewatch


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2016-06-16 20:33:09.917456 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030e80 image watch failed: 139627306619536, (107) Transport endpoint is not connected
2016-06-16 20:33:11.649410 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84022530 image watch failed: 139627306695968, (107) Transport endpoint is not connected
2016-06-16 20:33:11.887698 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84047a50 image watch failed: 139627306561744, (107) Transport endpoint is not connected
2016-06-16 20:33:39.077626 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84048470 image watch failed: 139627306601040, (107) Transport endpoint is not connected
2016-06-16 20:33:40.255308 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84030940 image watch failed: 139627306546144, (107) Transport endpoint is not connected
2016-06-16 20:33:40.533680 7efddcff9700 -1 librbd::ImageWatcher: 0x7efd84031d70 image watch failed: 139627306549744, (107) Transport endpoint is not connected


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Actual results:
Lots of error messages are logged.

Expected results:
There should not be any error messages.

Additional info:

Had to restart the rbd-mirror daemon to stop the message flood (a restart sketch is shown below).
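
A minimal restart sketch, assuming the daemon is managed by systemd and uses the default "admin" instance name (the actual instance name on the node may differ):

# restart a single named rbd-mirror instance
sudo systemctl restart ceph-rbd-mirror@admin
# or restart all rbd-mirror instances on the host
sudo systemctl restart ceph-rbd-mirror.target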

Comment 3 Tanay Ganguly 2016-06-17 11:15:58 UTC
@Jason:

As the log files were too large to attach, please log in to:

Slave: magna003

Log File : /var/log/ceph/qemu-guest-20365.log
Core:      /tmp/cores/core.tp_librbd.10600.magna003.1466156585

Master: 10.40.44.44

root/redhat

Comment 5 Jason Dillaman 2016-06-17 13:16:33 UTC
The cluster was overloaded and unable to service watch heartbeats within the default 30-second timeout period.
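
For context, the watch heartbeat timeout referred to here is presumably the OSD option osd_client_watch_timeout, which defaults to 30 seconds. One way to confirm the value on an OSD node via the admin socket (osd.0 is just an example daemon ID):

ceph daemon osd.0 config get osd_client_watch_timeout
# expected default: { "osd_client_watch_timeout": "30" }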

Comment 6 Jason Dillaman 2016-08-10 19:17:28 UTC
Upstream pull request: https://github.com/ceph/ceph/pull/9970

Comment 11 Rachana Patel 2016-10-31 14:31:41 UTC
Verified with 10.2.3-10.el7cp.x86_64

- No core was generated. Error messages are still observed, but they are not flooding the log and the rbd-mirror daemon is not showing a faulty status. Hence moving to VERIFIED.
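
For anyone re-verifying, the replication health can be spot-checked with the mirroring status subcommands, assuming they are available in this build (pool and image names below are placeholders):

rbd mirror pool status mirrorpool --verbose
rbd mirror image status mirrorpool/image1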

Comment 13 errata-xmlrpc 2016-11-22 19:26:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html