Bug 1146397 - Dist-geo-rep : After glusterd restart, geo-rep logs get "ChangelogException: [Errno 2] No such file or directory"
Summary: Dist-geo-rep : After glusterd restart, geo-rep logs get ChangelogException: ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Importance: medium high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1118754
Blocks:
 
Reported: 2014-09-25 07:52 UTC by Aravinda VK
Modified: 2015-12-29 10:26 UTC (History)
10 users

Fixed In Version: glusterfs-3.7
Doc Type: Bug Fix
Doc Text:
Clone Of: 1118754
Environment:
Last Closed: 2015-12-29 10:26:45 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Aravinda VK 2014-09-25 07:52:52 UTC
+++ This bug was initially created as a clone of Bug #1118754 +++

Description of problem: After upgrading from RHS 2.1 (3.4.0.59rhs) to RHS 3.0 (3.6.0.24-1), the geo-rep logs show "ChangelogException: [Errno 2] No such file or directory". After this backtrace, geo-rep falls back to hybrid crawl, having failed to do a history crawl.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-11 18:12:16.920133] I [master(/bricks/brick1/master_b3):1222:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.43.122%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c236684c114c1c9f2bdbc3dabb727d2b/xsync
[2014-07-11 18:12:16.928941] I [master(/bricks/brick2/master_b7):452:crawlwrap] _GMaster: primary master with volume id 25a332b7-4569-4069-be16-1e107759d847 ...
[2014-07-11 18:12:16.952737] I [master(/bricks/brick2/master_b7):463:crawlwrap] _GMaster: crawl interval: 1 seconds
[2014-07-11 18:12:16.973531] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 51, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 94, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2014-07-11 18:12:16.975254] E [repce(/bricks/brick2/master_b7):207:__call__] RepceClient: call 2607:140481144624896:1405082536.97 (history) failed on peer with ChangelogException
[2014-07-11 18:12:16.979331] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up xsync change detection mode
[2014-07-11 18:12:16.980051] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine
[2014-07-11 18:12:16.982465] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up changelog change detection mode
[2014-07-11 18:12:16.983171] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
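The traceback above follows a common binding pattern: the Python wrapper checks the C call's return value and converts the library's errno into an exception. A minimal sketch of that pattern (all names and the hard-coded errno are hypothetical simplifications, not the real ctypes binding in libgfchangelog.py):

```python
import os

class ChangelogException(OSError):
    """Mirrors the syncdaemon exception class seen in the traceback."""
    pass

def raise_changelog_err():
    # The real binding reads errno from the C library; here we hard-code
    # ENOENT (2), which is what produces "[Errno 2] No such file or directory".
    errn = 2
    raise ChangelogException(errn, os.strerror(errn))

def cl_history_changelog(changelog_path, start, end, num_parallel):
    # Hypothetical stand-in for the C call: it returns -1 when no recorded
    # changelogs cover the requested [start, end] window.
    ret = -1
    if ret == -1:
        raise_changelog_err()
    return ret
```

The worker never sees a partial result in this case; the negative return is always converted into an exception, which is why the failure surfaces as a traceback rather than an error code in the logs.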

Version-Release number of selected component (if applicable): upgrade from RHS2.1(3.4.0.59rhs) to RHS3.0(3.6.0.24-1)


How reproducible: Didn't try to reproduce. 


Steps to Reproduce:
1. Create a geo-rep relationship between master and slave on the 2.1 (3.4.0.59rhs) version.
2. Create some data on the master and let it sync to the slave.
3. Stop geo-rep.
4. Keep creating data on the master.
5. Upgrade glusterfs on all the nodes, slave first and then master, using:
    pkill glusterfsd
    pkill glusterfs
    pkill glusterd
    yum update glusterfs -y
6. Start geo-rep.
7. Check the geo-rep log files.


Actual results: The geo-rep logs show "ChangelogException: [Errno 2] No such file or directory".


Expected results: There should be no such backtraces, and after geo-rep starts it should not fail to do a history crawl.


Additional info:

--- Additional comment from Vijaykumar Koppad on 2014-07-11 09:14:57 EDT ---

Since the history crawl fails after the upgrade, renames and deletes performed during the upgrade (while geo-rep was stopped) might be affected.

--- Additional comment from Venky Shankar on 2014-07-14 03:06:54 EDT ---

Vijaykumar, please upload sosreports.

--- Additional comment from Vijaykumar Koppad on 2014-07-14 06:42:45 EDT ---



--- Additional comment from Ajeet Jha on 2014-07-16 04:41:31 EDT ---

The reported behaviour is genuinely acceptable; it was misinterpreted because of the traceback and the errno.

EXPLANATION:
After the upgrade, geo-rep "start" calls history with a start time (the moment the master gluster was stopped) that is not recorded in HTIME (HTIME entries are recorded only by the upgraded version), hence no linkages are found. This causes history to return -1, which causes the agent to raise the exception.

What needs to be done: no functional code change, but logging improvements would help future debugging.
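The behaviour explained above can be sketched as follows. This is a hedged illustration with hypothetical names and timestamps, not the real syncdaemon code: when the requested start time falls outside every recorded HTIME range, history fails, and the caller falls back to the hybrid (xsync) crawl.

```python
class ChangelogException(Exception):
    pass

# Hypothetical HTIME index: (start, end) windows actually recorded on this
# brick. Entries exist only from the post-upgrade start onward.
HTIME_RANGES = [(1405082536, 1405090000)]

def history(start, end):
    # History can serve the request only if the start time lies inside a
    # recorded window; otherwise there are "no linkages" and it fails.
    for lo, hi in HTIME_RANGES:
        if lo <= start <= hi:
            return (start, min(end, hi))
    raise ChangelogException(2, "No such file or directory")

def choose_crawl(start, end):
    # Mirrors the fallback decision: try history crawl first, drop to
    # hybrid (xsync) crawl when history cannot cover the window.
    try:
        history(start, end)
        return "history"
    except ChangelogException:
        return "xsync"
```

With a pre-upgrade start time such as 1405000000, which precedes the first HTIME record, `choose_crawl` selects "xsync"; a start time inside the recorded window selects "history".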

--- Additional comment from Nagaprasad Sathyanarayana on 2014-07-17 08:00:51 EDT ---

Moving this to future release based on engineering discussion.

--- Additional comment from Vijaykumar Koppad on 2014-07-21 06:19:30 EDT ---

It happened in two other scenarios, neither of which involved an upgrade, though it does not reproduce consistently.

First scenario
========================================
1. Create and start a geo-rep relationship between master and slave.
2. Disable changelog.
3. Create data on the master.
4. Check the geo-rep logs; they may contain the traceback given in the description.
========================================

Second scenario
=========================================
1. Create and start a geo-rep relationship between master and slave.
2. Kill the monitor, feedback, and agent processes on one of the active nodes.
3. Create data on the master.
4. Start geo-rep with force.
5. Check the geo-rep logs for the traceback.
   It doesn't happen every time.
=========================================

--- Additional comment from John Skeoch on 2014-08-24 20:51:45 EDT ---

User vkoppad@redhat.com's account has been closed

--- Additional comment from Nagaprasad Sathyanarayana on 2014-09-12 02:18:39 EDT ---

When these BZs were deferred from 3.0 release, they were marked to be triaged in 3.0.1 release. Since 3.0.1 is now the Zero day errata (which are tracked with Target Milestone=3.0.1, Internal Whiteboard=Denali 0day), changing the target milestone of these to 3.0.2.  Engineers still must triage these BZs and pull only the feasible ones into 3.0.2 release.

--- Additional comment from Aravinda VK on 2014-09-25 03:49:39 EDT ---

The changelog agent (`ps -ax | grep gsyncd | grep agent`) interacts with the changelog API and raises an exception on any error. The geo-rep worker communicates with the agent over RPC. Changelog exceptions are handled in the worker, but since RPC propagates the traceback from the agent to the worker, the exception is logged in the log files. These exceptions have no effect, as they are handled in the worker, but they confuse users.
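The propagation described above can be sketched like this. It is a hedged, hypothetical simplification of the agent/worker RPC split (not the actual repce.py protocol): the agent catches the exception and ships the formatted traceback back over RPC, the worker logs it and recovers, so the traceback lands in the logs even though it has no functional effect.

```python
import traceback

class ChangelogException(Exception):
    pass

def agent_call(method, *args):
    # Agent side: run the requested method; on failure, transport the
    # formatted traceback back instead of a result.
    try:
        return ("ok", method(*args))
    except Exception:
        return ("error", traceback.format_exc())

def worker_call(method, *args):
    # Worker side: a transported error is logged and handled (here mapped
    # to a crawl fallback), so the scary traceback is cosmetic.
    status, payload = agent_call(method, *args)
    if status == "error":
        log_entry = payload  # this is the traceback users see in the logs
        return ("fallback-to-xsync", log_entry)
    return ("done", payload)

def history(*args):
    # Stand-in for the failing agent-side history call.
    raise ChangelogException(2, "No such file or directory")
```

Calling `worker_call(history)` returns the fallback status together with the traceback text, illustrating why the exception appears in the logs while the worker carries on.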

Comment 1 Anand Avati 2014-09-25 07:55:08 UTC
REVIEW: http://review.gluster.org/8846 (geo-rep: Clean RPC error handling between agent and worker) posted (#1) for review on master by Aravinda VK (avishwan@redhat.com)

Comment 2 Aravinda VK 2015-05-10 03:41:32 UTC
Some more work expected on the posted patch. Moving back to Assigned.

Comment 3 Aravinda VK 2015-12-29 10:26:45 UTC
No new changelog index file (HTIME) is created after an upgrade or a brick-node reboot. An HTIME file is created only when the changelog is disabled and re-enabled. (BZ 1211327)

This is fixed in the Gluster 3.7 release. Please reopen if seen again.

