+++ This bug was initially created as a clone of Bug #1359612 +++

Description of problem:

Many improvements can be made to the Geo-replication logs to make the current status and other details easier to understand:

- Session Creation/Recreation - helps to understand when the session was first created/recreated
- Log the push-pem status (since the hook script runs in the background)
- On Start:
  . Monitor Start message
  . Number of local bricks
  . Worker Start message with the respective Slave node it is trying to connect to
  . Agent Start message
- Worker State Changes (a rough sketch of such change-only logging follows this list):
  . Change in connected Slave Node
  . Change in Status (Initializing/Active/Passive/Faulty/Stopped/Paused/Resumed)
  . Change in Crawl (Hybrid/History/Changelog)
- Checkpoint Set and Completion Time
- Sync time logs: on Rsync failure, a summary of errors (CHANGELOG start-end name and error)
- When a Config is changed, log CONFIG_NAME, OLD_VALUE and NEW_VALUE
- On worker restart, log the last worker's and agent's uptime (the monitor already knows it)
- Performance metrics (optional, only if enabled). After each Changelog batch is processed, log:
  . Number of Changelogs processed, with the START and END changelog
  . Number of ENTRY, DATA and META operations
  . Time taken to parse the Changelogs and to perform the Entry, Meta and Sync operations
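As a rough illustration of the change-only status logging proposed above, here is a minimal Python sketch. The names (WorkerStatusLogger, set_worker_status, set_worker_crawl_status) are hypothetical stand-ins, not the actual gsyncd implementation:

    import logging

    logging.basicConfig(level=logging.INFO,
                        format="[%(asctime)s] %(levelname).1s %(message)s")

    class WorkerStatusLogger(object):
        """Log worker status/crawl transitions only when the value actually
        changes, so repeated updates do not flood the log (sketch only)."""

        def __init__(self, brick):
            self.brick = brick
            self.status = None
            self.crawl = None

        def set_worker_status(self, status):
            if status != self.status:
                self.status = status
                logging.info("GeorepStatus(%s): Worker Status: %s",
                             self.brick, status)

        def set_worker_crawl_status(self, crawl):
            if crawl != self.crawl:
                self.crawl = crawl
                logging.info("GeorepStatus(%s): Crawl Status: %s",
                             self.brick, crawl)

    if __name__ == "__main__":
        w = WorkerStatusLogger("/rhs/brick1/b1")
        w.set_worker_status("Initializing...")
        w.set_worker_status("Active")
        w.set_worker_status("Active")   # duplicate update: not logged again
        w.set_worker_crawl_status("History Crawl")
        w.set_worker_crawl_status("Changelog Crawl")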
Upstream patches:
https://review.gluster.org/#/c/15684/ (master)
https://review.gluster.org/#/c/15710/ (3.9)

Downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/101284/
Verified this bug with build: glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64

Have seen the logging improvements, as shown below.

=> Worker and connected slave node details:

The following logs make this workflow clear:
1. The monitor starts the worker and the agent.
2. The worker sets up the SSH connection between master and slave; gsyncd on the slave mounts the slave volume on the local slave host.
3. The worker mounts the master volume locally.
4. Upon successful completion of steps 2 and 3, the worker acknowledges back to the monitor.

[2017-06-29 11:33:33.563153] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick2/b4). Slave node: ssh://root.43.150:gluster://localhost:slave
[2017-06-29 11:33:33.588865] I [monitor(monitor):275:monitor] Monitor: starting gsyncd worker(/rhs/brick1/b1). Slave node: ssh://root.43.52:gluster://localhost:slave
[2017-06-29 11:33:33.815712] I [resource(/rhs/brick2/b4):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.818607] I [changelogagent(/rhs/brick2/b4):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:33.824904] I [resource(/rhs/brick1/b1):1676:connect_remote] SSH: Initializing SSH connection between master and slave...
[2017-06-29 11:33:33.840003] I [changelogagent(/rhs/brick1/b1):73:__init__] ChangelogAgent: Agent listining...
[2017-06-29 11:33:40.65862] I [resource(/rhs/brick2/b4):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.2496 secs
[2017-06-29 11:33:40.66544] I [resource(/rhs/brick2/b4):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:40.139708] I [resource(/rhs/brick1/b1):1683:connect_remote] SSH: SSH connection between master and slave established. Time taken: 6.3142 secs
[2017-06-29 11:33:40.140361] I [resource(/rhs/brick1/b1):1416:connect] GLUSTER: Mounting gluster volume locally...
[2017-06-29 11:33:45.215692] I [resource(/rhs/brick2/b4):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1487 secs
[2017-06-29 11:33:45.216334] I [gsyncd(/rhs/brick2/b4):778:main_i] <top>: Closing feedback fd, waking up the monitor
[2017-06-29 11:33:45.280860] I [resource(/rhs/brick1/b1):1429:connect] GLUSTER: Mounted gluster volume. Time taken: 5.1400 secs
[2017-06-29 11:33:45.281521] I [gsyncd(/rhs/brick1/b1):778:main_i] <top>: Closing feedback fd, waking up the monitor

(Note the consistent begin/end "Time taken" pattern in the SSH and mount messages; a rough timing sketch follows at the end of this comment.)

=> Monitor, Worker and Crawl status information:

Monitor:
[2017-06-29 11:00:47.744501] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Created
[2017-06-29 11:32:11.164010] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Paused
[2017-06-29 11:32:28.162978] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Started

Worker:
[2017-06-29 11:33:30.802614] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:30.805674] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-06-29 11:33:32.261878] I [gsyncd(conf):688:main_i] <top>: Config Set: session-owner = 689c5bf5-d4c4-4387-bd5b-12295f2c6203
[2017-06-29 11:33:33.254089] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
[2017-06-29 11:33:33.263180] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Initializing...
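The "Time taken" messages above come from timing each step between a begin log and an end log. A minimal sketch of that pattern, assuming a hypothetical timed_step helper (not the actual resource.py code):

    import logging
    import time

    logging.basicConfig(level=logging.INFO,
                        format="[%(asctime)s] %(levelname).1s %(message)s")

    def timed_step(begin_msg, end_msg, func):
        """Log a begin message, run func, then log the elapsed time in the
        'Time taken: N secs' style seen above (illustrative only)."""
        logging.info(begin_msg)
        t0 = time.time()
        result = func()
        logging.info("%s Time taken: %.4f secs", end_msg, time.time() - t0)
        return result

    if __name__ == "__main__":
        # Hypothetical stand-ins for the real SSH and mount steps.
        timed_step("SSH: Initializing SSH connection between master and slave...",
                   "SSH: SSH connection between master and slave established.",
                   lambda: time.sleep(0.5))
        timed_step("GLUSTER: Mounting gluster volume locally...",
                   "GLUSTER: Mounted gluster volume.",
                   lambda: time.sleep(0.2))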
Crawl:
[2017-06-29 11:33:47.352823] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: History Crawl
[2017-06-29 11:34:02.415805] I [gsyncdstatus(/rhs/brick1/b1):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Changelog Crawl
[2017-06-29 11:09:15.227492] I [gsyncdstatus(/rhs/brick2/b4):244:set_worker_crawl_status] GeorepStatus: Crawl Status: Hybrid Crawl

=> Config set/reset:

[2017-06-29 11:07:37.389173] I [gsyncd(conf):688:main_i] <top>: Config Set: use_meta_volume = true
[2017-06-29 11:33:30.253581] I [gsyncd(conf):688:main_i] <top>: Config Set: use-tarssh = true
[2017-06-29 11:40:26.599968] I [gsyncd(conf):693:main_i] <top>: Config Reset: use-tarssh

=> Checkpoint Set/Reset:

[2017-06-29 11:34:32.176332] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:34:29
[2017-06-29 11:35:17.624742] I [gsyncd(conf):691:main_i] <top>: Checkpoint Reset
[2017-06-29 11:35:56.278028] I [gsyncd(conf):685:main_i] <top>: Checkpoint Set: 2017-06-29 11:35:54
[2017-06-29 11:37:06.418883] I [master(/rhs/brick2/b4):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.420852] I [master(/rhs/brick1/b1):1125:crawl] _GMaster: slave's time: (1498734558, 0)
[2017-06-29 11:37:06.485577] I [gsyncdstatus(/rhs/brick2/b4):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06
[2017-06-29 11:37:06.496103] I [gsyncdstatus(/rhs/brick1/b1):230:merger] GeorepStatus: Checkpoint completed. Checkpoint Time: 2017-06-29 11:35:54, Completion Time: 2017-06-29 11:37:06

(A rough sketch of the checkpoint completion check follows at the end of this comment.)

Moving this bug to Verified, as the above covers the logging improvements intended by this fix. Any further logging improvement will be raised as a new bug.
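A checkpoint is reported complete once the slave's synced time reaches the checkpoint time, as in the "Checkpoint completed" messages above. A minimal sketch of that comparison, with hypothetical names and times (check_checkpoint is not the actual gsyncdstatus code):

    import logging
    from datetime import datetime

    logging.basicConfig(level=logging.INFO,
                        format="[%(asctime)s] %(levelname).1s %(message)s")

    def check_checkpoint(checkpoint_time, slave_time):
        """If the slave has synced up to (or past) the checkpoint time, log
        the completion with both timestamps (illustrative sketch only)."""
        if slave_time >= checkpoint_time:
            logging.info(
                "GeorepStatus: Checkpoint completed. "
                "Checkpoint Time: %s, Completion Time: %s",
                checkpoint_time.strftime("%Y-%m-%d %H:%M:%S"),
                datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            return True
        return False

    if __name__ == "__main__":
        checkpoint = datetime(2017, 6, 29, 11, 35, 54)
        slave = datetime(2017, 6, 29, 11, 37, 6)   # hypothetical synced-up time
        check_checkpoint(checkpoint, slave)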
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774