Bug 1118754

Summary: Dist-geo-rep: after upgrade from RHS2.1 (3.4.0.59rhs) to RHS3.0 (3.6.0.24-1), geo-rep logs get "ChangelogException: [Errno 2] No such file or directory"
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED CURRENTRELEASE QA Contact: amainkar
Severity: high Docs Contact:
Priority: low    
Version: rhgs-3.0 CC: aavati, avishwan, csaba, david.macdonald, mzywusko, nlevinki, nsathyan, vagarwal, vshankar
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: usability
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1146397 Environment:
Last Closed: 2015-08-06 15:00:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1146397    
Attachments:
Description Flags
sosreport of the all the nodes. none

Description Vijaykumar Koppad 2014-07-11 13:09:10 UTC
Description of problem: After upgrade from RHS2.1 (3.4.0.59rhs) to RHS3.0 (3.6.0.24-1), geo-rep logs show "ChangelogException: [Errno 2] No such file or directory". After this backtrace, geo-rep falls back to hybrid crawl and fails to do a history crawl.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-11 18:12:16.920133] I [master(/bricks/brick1/master_b3):1222:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.43.122%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c236684c114c1c9f2bdbc3dabb727d2b/xsync
[2014-07-11 18:12:16.928941] I [master(/bricks/brick2/master_b7):452:crawlwrap] _GMaster: primary master with volume id 25a332b7-4569-4069-be16-1e107759d847 ...
[2014-07-11 18:12:16.952737] I [master(/bricks/brick2/master_b7):463:crawlwrap] _GMaster: crawl interval: 1 seconds
[2014-07-11 18:12:16.973531] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 51, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 94, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2014-07-11 18:12:16.975254] E [repce(/bricks/brick2/master_b7):207:__call__] RepceClient: call 2607:140481144624896:1405082536.97 (history) failed on peer with ChangelogException
[2014-07-11 18:12:16.979331] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up xsync change detection mode
[2014-07-11 18:12:16.980051] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine
[2014-07-11 18:12:16.982465] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up changelog change detection mode
[2014-07-11 18:12:16.983171] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Version-Release number of selected component (if applicable): upgrade from RHS2.1(3.4.0.59rhs) to RHS3.0(3.6.0.24-1)


How reproducible: Didn't try to reproduce. 


Steps to Reproduce:
1. Create a geo-rep relationship between master and slave on version 2.1 (3.4.0.59rhs).
2. Create some data on master and let it sync to slave.
3. Stop geo-rep.
4. Keep creating data on master.
5. Upgrade glusterfs on all the nodes, slave first and then master, using these steps:
    pkill glusterfsd

    pkill glusterfs

    pkill glusterd

    yum update glusterfs -y 

6. Then start geo-rep.
7. Check the geo-rep log files.
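
For step 7, a small script can pick the exception out of a geo-rep log instead of reading it by eye. The following is only a sketch: the line format is inferred from the log excerpt in this report, and the function name `find_changelog_errors` is made up for illustration. It reports the header of the log entry that precedes each traceback ending in ChangelogException.

```python
import re

# Header of a gsyncd log entry as seen in this report, e.g.
# [2014-07-11 18:12:16.973531] E [repce(agent):117:worker] <top>: call failed:
# (format inferred from the excerpt above; an assumption, not a spec)
LOG_HEADER = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) \[(?P<origin>[^\]]+)\]"
)


def find_changelog_errors(lines):
    """Return (timestamp, origin, message) for each traceback that ends
    in a ChangelogException line.

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    hits = []
    last_header = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_HEADER.match(line)
        if match:
            # Remember the most recent log entry; a traceback belongs to it.
            last_header = match
            continue
        # The final line of the traceback is unindented and names the exception.
        if line.startswith("ChangelogException:") and last_header is not None:
            hits.append(
                (last_header.group("ts"), last_header.group("origin"), line)
            )
    return hits
```

Pass it an open log file (the path depends on the setup and is not given here); each returned tuple points at one "call failed" entry.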


Actual results: geo-rep logs show "ChangelogException: [Errno 2] No such file or directory".


Expected results: There should be no such backtraces, and after geo-rep start the history crawl should not fail.


Additional info:

Comment 1 Vijaykumar Koppad 2014-07-11 13:14:57 UTC
Since it fails to do a history crawl after the upgrade, renames and deletes done during the upgrade (i.e., while geo-rep was stopped) might be affected.

Comment 2 Venky Shankar 2014-07-14 07:06:54 UTC
Vijaykumar, please upload sosreports.

Comment 3 Vijaykumar Koppad 2014-07-14 10:42:45 UTC
Created attachment 917734 [details]
sosreport of the all the nodes.

Comment 4 Ajeet Jha 2014-07-16 08:41:31 UTC
The bug is a genuinely acceptable issue; it was misunderstood because of the traceback and errno.

EXPLANATION:
After the upgrade, geo-rep "start" called history with a start time (the moment the master gluster was stopped) that is not recorded in HTIME, because HTIME entries are recorded only by the upgraded version. Hence no linkages were found. This causes history to return -1, which causes the agent to raise the exception.

What needs to be done: No logical change to the code base, but logging improvements could help with debugging in the future.
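
To illustrate the explanation, here is a minimal sketch of the failure path. It is not the syncdaemon code: `history_changelog` is a hypothetical stand-in for the underlying history call, and "start time not recorded in HTIME" is simplified to "start time earlier than the first recorded entry". Only the final `raise` mirrors the line shown in the traceback.

```python
import errno
import os


class ChangelogException(OSError):
    """Stand-in for the exception named in the traceback."""


def history_changelog(htime_index, start, end):
    """Hypothetical stand-in for the underlying history call.

    htime_index: sorted list of timestamps recorded in HTIME (assumed shape).
    Returns (status, errno). Status is -1 when `start` is not covered by
    the index, which is the case described for the upgrade.
    """
    if not htime_index or start < htime_index[0]:
        return -1, errno.ENOENT
    return 0, 0


def cl_history_changelog(htime_index, start, end):
    """Agent-side wrapper: turn a -1 status into an exception."""
    ret, errn = history_changelog(htime_index, start, end)
    if ret == -1:
        # Same form as the raise seen in the traceback.
        raise ChangelogException(errn, os.strerror(errn))
    return ret
```

With an index that begins only after the upgrade and a start time from before it, the wrapper raises ChangelogException with errno 2, which is the message seen in the geo-rep log.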

Comment 6 Vijaykumar Koppad 2014-07-21 10:19:30 UTC
It also happened in two other scenarios that didn't involve an upgrade, though not consistently.

First scenario
========================================
1. create and start geo-rep relationship between master and slave. 
2. disable changelog.
3. create data on master.
4. check geo-rep logs; there could be a traceback like the one given in the description.
========================================

Second scenario
=========================================
1. create and start geo-rep relationship between master and slave.
2. kill monitor, feedback and agent processes from one of the active nodes. 
3. create data on master.
4. start force geo-rep.
5. Check geo-rep logs for traceback. 
   It doesn't happen every time.
=========================================

Comment 10 Aravinda VK 2014-09-25 07:49:39 UTC
The changelog agent (`ps -ax | grep gsyncd | grep agent`) interacts with the changelog API and raises an exception in case of any error. The geo-rep worker communicates with the agent using RPC. Changelog exceptions are handled in the worker. Since RPC propagates the traceback from the agent to the worker, the exception is logged in the log files. These exceptions have no effect, as they are handled in the worker, but they confuse users.
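
A rough sketch of the worker-side handling described here, assuming the worker simply falls back to hybrid crawl when the history call fails (as the description reports). The names `pick_crawl` and `history_call` are illustrative, not the actual gsyncd functions. It logs a single informational line instead of a full traceback, which is the kind of logging improvement suggested in comment 4.

```python
import logging


class ChangelogException(OSError):
    """Stand-in for the exception propagated from the agent."""


def pick_crawl(history_call, start, end, log=logging.getLogger("georep-sketch")):
    """Try a history crawl; fall back to hybrid crawl if the agent call fails.

    history_call: callable standing in for the RPC call to the agent
    (hypothetical signature: history_call(start, end)).
    """
    try:
        history_call(start, end)
        return "history"
    except ChangelogException as exc:
        # Handled here, so the failure is harmless; log one line, no traceback.
        log.info("history crawl unavailable (%s); using hybrid crawl", exc)
        return "hybrid"
```

The caller only sees which crawl mode was chosen, so a handled failure in the agent no longer looks like a crash in the worker log.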

Comment 12 Aravinda VK 2015-08-06 15:00:43 UTC
No new changelog index file (HTIME) is created after an upgrade or brick node reboot. An HTIME file is created only when changelog is disabled and then enabled. (BZ 1211327)

This issue was not seen during upgrade tests of RHGS 3.1. Closing this bug. Please reopen if this issue is found again.