Bug 1359613

Summary: [RFE] Geo-replication Logging Improvements
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Aravinda VK <avishwan>
Component: geo-replication
Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, asrivast, bugs, csaba, khiremat, rhinduja, rhs-bugs, rreddy
Target Milestone: ---
Keywords: FutureFeature, ZStream
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-19
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1359612
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1359612, 1387990
Bug Blocks: 1417138

Description Aravinda VK 2016-07-25 06:51:07 UTC
+++ This bug was initially created as a clone of Bug #1359612 +++

Description of problem:

Several improvements can be made to the Geo-replication logs so that the current status and other details are easier to understand.

- Session Creation/Recreation - will help to understand when the session was first created or recreated

- Log the push-pem status (since the hook script runs in the background)

- On Start,
    . Monitor Start Message
    . Number of local bricks
    . Worker Start Message with respective Slave node to which it is trying to connect
    . Agent Start Message

- Worker State Changes
    . Change in connected Slave Node
    . Change in Status(Initializing/Active/Passive/Faulty/Stopped/Paused/Resumed)
    . Change in Crawl(Hybrid/History/Changelog)

- Checkpoint Set and Completion Time

- Sync time logs:
  On rsync failure, a summary of errors (CHANGELOG start-end name and error)

- When a config value is changed, log CONFIG_NAME, OLD_VALUE and NEW_VALUE

- On worker restart,
   log the previous worker's and agent's uptime (because the monitor knows about it :P)

- Performance metrics (optional, only if enabled; see the sketch after this list):
    - After each Changelog batch is processed
        - Number of Changelogs processed, and the START and END changelog
        - Number of ENTRY, DATA and META operations
        - Time taken to parse Changelogs, and for Entry Operations, Meta and Sync
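
A minimal sketch (hypothetical, not the actual gsyncd implementation) of the kind of structured messages proposed above, using plain Python logging; the function names, fields and sample values are illustrative only:

import logging

log = logging.getLogger("georep.sketch")
logging.basicConfig(level=logging.INFO,
                    format="[%(asctime)s] %(levelname).1s [%(name)s] %(message)s")

def log_config_change(name, old_value, new_value):
    # "When a config value is changed, log CONFIG_NAME, OLD_VALUE and NEW_VALUE"
    log.info("Config Set: %s = %s (previous value: %s)", name, new_value, old_value)

def log_batch_metrics(changelogs, num_entry, num_data, num_meta,
                      parse_time, entry_time, meta_time, sync_time):
    # Per-batch performance metrics: changelog count, START/END changelog,
    # operation counts and timings.
    log.info("Changelog batch processed: count=%d start=%s end=%s "
             "entry=%d data=%d meta=%d "
             "parse=%.2fs entry_ops=%.2fs meta_ops=%.2fs sync=%.2fs",
             len(changelogs), changelogs[0], changelogs[-1],
             num_entry, num_data, num_meta,
             parse_time, entry_time, meta_time, sync_time)

if __name__ == "__main__":
    log_config_change("use-tarssh", "false", "true")
    log_batch_metrics(["CHANGELOG.1498735426", "CHANGELOG.1498735441"],
                      num_entry=10, num_data=4, num_meta=1,
                      parse_time=0.03, entry_time=0.12,
                      meta_time=0.01, sync_time=1.40)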

Comment 2 Kotresh HR 2017-02-22 06:11:40 UTC
Upstream Patch:
https://review.gluster.org/#/c/15684/  (master)
https://review.gluster.org/#/c/15710/  (3.9)

Comment 6 Atin Mukherjee 2017-03-24 08:41:33 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101284/

Comment 10 Rahul Hinduja 2017-06-29 11:48:17 UTC
Verified this bug with build: glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64

The following logging improvements were observed:

=> Worker and connected slave node details: 

The following logs illustrate the workflow below (a small log-parsing sketch follows the excerpt):

1. The monitor starts the worker and the agent.
2. The worker establishes the SSH connection between master and slave; gsyncd on the slave mounts the slave volume on the local slave host.
3. The worker mounts the master volume locally.
4. Upon successful completion of steps 2 and 3, the worker acknowledges back to the monitor.

[2017-06-29 11:33:33.563153] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://root.43.150:gluster://localhost:slave
[2017-06-29 11:33:33.588865] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.43.52:gluster://localhost:slave
[2017-06-29 11:33:33.815712] I [resource(/rhs/brick2/b4):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.818607] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:33.824904] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.840003] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:40.65862] I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2496 secs
[2017-06-29 11:33:40.66544] I [resource(/rhs/brick2/b4):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:40.139708] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.3142 secs
[2017-06-29 11:33:40.140361] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:45.215692] I [resource(/rhs/brick2/b4):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1487 secs
[2017-06-29 11:33:45.216334] I [gsyncd(/rhs/brick2/b4):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-06-29 11:33:45.280860] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1400 secs
[2017-06-29 11:33:45.281521] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor
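
The "Time taken" figures above make it easy to see where worker startup spends its time. A throwaway helper along these lines (illustrative only; the regex is tied to the message format shown in the excerpt, and the script itself is not part of gsyncd) can pull those figures out of a worker log:

import re
import sys

# Matches lines like:
#   ... I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection
#   between master and slave established. Time taken: 6.2496 secs
PATTERN = re.compile(
    r"I \[resource\((?P<brick>[^)]+)\):\d+:\w+\] \w+: "
    r"(?P<msg>.*?) Time taken: (?P<secs>[\d.]+) secs")

def timings(lines):
    """Yield (brick, message, seconds) for every 'Time taken' log line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group("brick"), m.group("msg"), float(m.group("secs"))

if __name__ == "__main__":
    src = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
    for brick, msg, secs in timings(src):
        print("%-18s %-60s %7.2fs" % (brick, msg, secs))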



=> Monitor, Worker, Crawl Status information:

Monitor:

[2017-06-29 11:00:47.744501] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Created
[2017-06-29 11:32:11.164010] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Paused
[2017-06-29 11:32:28.162978] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Started

Worker:

[2017-06-29 11:33:30.802614] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:30.805674] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:32.261878] I [gsyncd(conf):688:main_i] <top>: Config Set: session-owner = 689c5bf5-d4c4-4387-bd5b-12295f2c6203
[2017-06-29 11:33:33.254089] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-06-29 11:33:33.263180] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...

Crawl:

[2017-06-29 11:33:47.352823] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-06-29 11:34:02.415805] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Changelog Crawl
[2017-06-29 11:09:15.227492] I [gsyncdstatus(/rhs/brick2/b4):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Hybrid Crawl


=> Config set/reset

[2017-06-29 11:07:37.389173] I [gsyncd(conf):688:main_i] <top>: Config Set: use_meta_volume = true
[2017-06-29 11:33:30.253581] I [gsyncd(conf):688:main_i] <top>: Config Set: use-tarssh = true
[2017-06-29 11:40:26.599968] I [gsyncd(conf):693:main_i] <top>: Config Reset: use-tarssh


=> Checkpoint Set/Reset

[2017-06-29 11:34:32.176332] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:34:29
[2017-06-29 11:35:17.624742] I [gsyncd(conf):691:main_i] <top>: Checkpoint Reset
[2017-06-29 11:35:56.278028] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:35:54
[2017-06-29 11:37:06.418883] I [master(/rhs/brick2/b4):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.420852] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.485577] I [gsyncdstatus(/rhs/brick2/b4):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06
[2017-06-29 11:37:06.496103] I [gsyncdstatus(/rhs/brick1/b1):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06
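
The "slave's time" entries above are what drive the completion message: roughly, a checkpoint is reported complete once the slave's synced time (stime) has caught up with the checkpoint time. A simplified sketch of that comparison (my reading of the logs, not the actual gsyncd code):

import time
from datetime import datetime

def checkpoint_completed(checkpoint_epoch, slave_stime):
    """checkpoint_epoch: checkpoint time in epoch seconds.
    slave_stime: (seconds, nanoseconds) tuple, or None if nothing synced yet."""
    return slave_stime is not None and slave_stime[0] >= checkpoint_epoch

if __name__ == "__main__":
    checkpoint = int(time.time()) - 60       # checkpoint set a minute ago
    stime = (int(time.time()), 0)            # slave has synced up to "now"
    if checkpoint_completed(checkpoint, stime):
        print("Checkpoint completed. Checkpoint Time: %s, Completion Time: %s" % (
            datetime.utcfromtimestamp(checkpoint).strftime("%Y-%m-%d %H:%M:%S"),
            datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))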


Moving this bug to Verified, as it covers the logging improvements intended by the fix. Any new logging improvement will be raised as a new bug.

Comment 12 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
