Bug 1359613 - [RFE] Geo-replication Logging Improvements
Summary: [RFE] Geo-replication Logging Improvements
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Aravinda VK
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1359612 1387990
Blocks: 1417138
 
Reported: 2016-07-25 06:51 UTC by Aravinda VK
Modified: 2017-09-21 04:54 UTC
CC: 8 users

Fixed In Version: glusterfs-3.8.4-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1359612
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2017:2774 (normal, SHIPPED_LIVE): glusterfs bug fix and enhancement update. Last Updated: 2017-09-21 08:16:29 UTC

Description Aravinda VK 2016-07-25 06:51:07 UTC
+++ This bug was initially created as a clone of Bug #1359612 +++

Description of problem:

Several improvements can be made to the Geo-replication logs to make the current status and other details easier to understand (see the illustrative sketch after this list):

- Session Creation/Recreation - Will help to understand when the session was first created or recreated

- Log the push-pem status (since the hook script runs in the background)

- On Start,
    . Monitor Start Message
    . Number of local bricks
    . Worker Start Message with the respective Slave node it is trying to connect to
    . Agent Start Message

- Worker State Changes
    . Change in connected Slave Node
    . Change in Status(Initializing/Active/Passive/Faulty/Stopped/Paused/Resumed)
    . Change in Crawl(Hybrid/History/Changelog)

- Checkpoint Set and Completion Time

- Sync time logs:
  On rsync failure, a summary of errors (CHANGELOG start-end name and error)

- When a config value is changed, log CONFIG_NAME, OLD_VALUE and NEW_VALUE

- On worker restart,
   Log the last worker's and agent's uptime (the monitor knows about it :P)

- Performance metrics (optional, only if enabled):
    - After each Changelog batch is processed:
        - Number of Changelogs processed and the START and END changelog
        - Number of ENTRY, DATA and META operations
        - Time taken to parse Changelogs and to perform Entry Operations, Meta and Sync
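
As an illustration of the list above, here is a minimal Python sketch of the kind of status-change and config-change logging requested. This is a rough example only, not the actual gsyncd implementation; the class name GeorepStatusLogger and its methods are made up for this sketch.

import logging

logging.basicConfig(level=logging.INFO,
                    format="[%(asctime)s] %(levelname).1s [%(name)s] %(message)s")


class GeorepStatusLogger(object):
    """Hypothetical helper that logs worker/crawl status changes and config changes."""

    def __init__(self, brick):
        self.log = logging.getLogger("gsyncdstatus(%s)" % brick)
        self.worker_status = None
        self.crawl_status = None

    def set_worker_status(self, status):
        # Log only when the status actually changes
        # (Initializing/Active/Passive/Faulty/Stopped/Paused)
        if status != self.worker_status:
            self.worker_status = status
            self.log.info("Worker Status: %s", status)

    def set_crawl_status(self, status):
        # Crawl type transitions: Hybrid -> History -> Changelog
        if status != self.crawl_status:
            self.crawl_status = status
            self.log.info("Crawl Status: %s", status)

    def config_changed(self, name, old, new):
        # Log CONFIG_NAME with OLD_VALUE and NEW_VALUE, as requested above
        self.log.info("Config Set: %s = %s (was %s)", name, new, old)


if __name__ == "__main__":
    s = GeorepStatusLogger("/rhs/brick1/b1")
    s.set_worker_status("Initializing...")
    s.set_crawl_status("History Crawl")
    s.set_crawl_status("Changelog Crawl")
    s.config_changed("use-tarssh", "false", "true")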

Comment 2 Kotresh HR 2017-02-22 06:11:40 UTC
Upstream Patch:
https://review.gluster.org/#/c/15684/  (master)
https://review.gluster.org/#/c/15710/  (3.9)

Comment 6 Atin Mukherjee 2017-03-24 08:41:33 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101284/

Comment 10 Rahul Hinduja 2017-06-29 11:48:17 UTC
Verified this bug with build: glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64

Observed the logging improvements described below:

=> Worker and connected slave node details:

The following logs make the workflow below clearer (a rough sketch of this handshake follows the log excerpt):

1. The monitor starts the worker and agent.
2. The worker sets up the ssh connection between master and slave; gsyncd on the slave mounts the slave volume on the local slave host.
3. The worker mounts the master volume locally.
4. Upon successful completion of 2 and 3, it acknowledges back to the monitor.

[2017-06-29 11:33:33.563153] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://root.43.150:gluster://localhost:slave
[2017-06-29 11:33:33.588865] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.43.52:gluster://localhost:slave
[2017-06-29 11:33:33.815712] I [resource(/rhs/brick2/b4):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.818607] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:33.824904] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.840003] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:40.65862] I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2496 secs
[2017-06-29 11:33:40.66544] I [resource(/rhs/brick2/b4):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:40.139708] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.3142 secs
[2017-06-29 11:33:40.140361] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:45.215692] I [resource(/rhs/brick2/b4):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1487 secs
[2017-06-29 11:33:45.216334] I [gsyncd(/rhs/brick2/b4):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-06-29 11:33:45.280860] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1400 secs
[2017-06-29 11:33:45.281521] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor
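
A rough, self-contained Python sketch of the handshake shown by these logs: the monitor starts the worker, the worker connects and mounts, then writes to a feedback pipe to wake the monitor. This is only an illustration under those assumptions, not the actual gsyncd code, and the slave URL used is a placeholder. It relies on os.fork, so it is Linux-only.

import os
import time


def worker(brick, feedback_w):
    # Hypothetical worker: connect to the slave, mount the volume, then ack.
    print("SSH: Initializing SSH connection between master and slave...")
    time.sleep(0.1)                      # stand-in for the real SSH setup
    print("GLUSTER: Mounting gluster volume locally...")
    time.sleep(0.1)                      # stand-in for the real mount
    print("Closing feedback fd, waking up the monitor")
    os.write(feedback_w, b"ok")
    os.close(feedback_w)


def monitor(brick, slave_node):
    print("Monitor: starting gsyncd worker(%s). Slave node: %s" % (brick, slave_node))
    feedback_r, feedback_w = os.pipe()
    pid = os.fork()
    if pid == 0:                         # child process acts as the worker
        os.close(feedback_r)
        worker(brick, feedback_w)
        os._exit(0)
    os.close(feedback_w)                 # the monitor keeps only the read end
    if os.read(feedback_r, 2) == b"ok":  # blocks until the worker acknowledges
        print("Monitor: worker(%s) is up" % brick)
    os.waitpid(pid, 0)


if __name__ == "__main__":
    # Placeholder slave URL, not taken from the logs above
    monitor("/rhs/brick1/b1", "ssh://root@SLAVE_NODE:gluster://localhost:slave")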



=> Monitor, Worker, Crawl Status information:

Monitor:

[2017-06-29 11:00:47.744501] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Created
[2017-06-29 11:32:11.164010] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Paused
[2017-06-29 11:32:28.162978] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Started

Worker:

[2017-06-29 11:33:30.802614] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:30.805674] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:32.261878] I [gsyncd(conf):688:main_i] <top>: Config Set: session-owner = 689c5bf5-d4c4-4387-bd5b-12295f2c6203
[2017-06-29 11:33:33.254089] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-06-29 11:33:33.263180] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...

Crawl:

[2017-06-29 11:33:47.352823] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-06-29 11:34:02.415805] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Changelog Crawl
[2017-06-29 11:09:15.227492] I [gsyncdstatus(/rhs/brick2/b4):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Hybrid Crawl


=> Config set/reset

[2017-06-29 11:07:37.389173] I [gsyncd(conf):688:main_i] <top>: Config Set: use_meta_volume = true
[2017-06-29 11:33:30.253581] I [gsyncd(conf):688:main_i] <top>: Config Set: use-tarssh = true
[2017-06-29 11:40:26.599968] I [gsyncd(conf):693:main_i] <top>: Config Reset: use-tarssh


=> Checkpoint Set/Reset

[2017-06-29 11:34:32.176332] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:34:29
[2017-06-29 11:35:17.624742] I [gsyncd(conf):691:main_i] <top>: Checkpoint Reset
[2017-06-29 11:35:56.278028] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:35:54
[2017-06-29 11:37:06.418883] I [master(/rhs/brick2/b4):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.420852] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.485577] I [gsyncdstatus(/rhs/brick2/b4):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06
[2017-06-29 11:37:06.496103] I [gsyncdstatus(/rhs/brick1/b1):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06
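
As a rough illustration of the completion check these logs suggest (an assumption on my part, not the actual implementation): a checkpoint can be treated as complete once the slave's synced time, the "slave's time" above, reaches the checkpoint time, at which point a completion line is logged.

from datetime import datetime


def checkpoint_completed(checkpoint_time, slave_time):
    # Hypothetical check: the checkpoint completes once the slave's synced
    # time reaches the checkpoint time; log the completion when it does.
    if slave_time >= checkpoint_time:
        completion = datetime.utcnow().replace(microsecond=0)
        print("Checkpoint completed. Checkpoint Time: %s, Completion Time: %s"
              % (checkpoint_time, completion))
        return True
    return False


# Purely illustrative timestamps, not taken from the logs above
checkpoint_completed(datetime(2017, 6, 29, 11, 35, 54),
                     datetime(2017, 6, 29, 11, 37, 0))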


Moving this bug to verified as it covers the logging improvements intended by the fix. Any new logging improvement will be raised as a new bug.

Comment 12 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


