Bug 1118754

Summary:

Dist-geo-rep: after upgrade from RHS2.1 (3.4.0.59rhs) to RHS3.0 (3.6.0.24-1), geo-rep logs get "ChangelogException: [Errno 2] No such file or directory"

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Vijaykumar Koppad <vkoppad>

Component:

geo-replication

Assignee:

Bug Updates Notification Mailing List <rhs-bugs>

Status:

CLOSED CURRENTRELEASE

QA Contact:

amainkar

Severity:

high

Docs Contact:

Priority:

low

Version:

rhgs-3.0

CC:

aavati, avishwan, csaba, david.macdonald, mzywusko, nlevinki, nsathyan, vagarwal, vshankar

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

usability

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1146397

Environment:

Last Closed:

2015-08-06 15:00:43 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1146397

Attachments:

Description	Flags
sosreport of the all the nodes.	none

Description Vijaykumar Koppad 2014-07-11 13:09:10 UTC

Description of problem: After upgrading from RHS2.1 (3.4.0.59rhs) to RHS3.0 (3.6.0.24-1), the geo-rep logs show "ChangelogException: [Errno 2] No such file or directory". After this backtrace, geo-rep falls back to hybrid crawl and fails to do the history crawl.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[2014-07-11 18:12:16.920133] I [master(/bricks/brick1/master_b3):1222:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.43.122%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c236684c114c1c9f2bdbc3dabb727d2b/xsync
[2014-07-11 18:12:16.928941] I [master(/bricks/brick2/master_b7):452:crawlwrap] _GMaster: primary master with volume id 25a332b7-4569-4069-be16-1e107759d847 ...
[2014-07-11 18:12:16.952737] I [master(/bricks/brick2/master_b7):463:crawlwrap] _GMaster: crawl interval: 1 seconds
[2014-07-11 18:12:16.973531] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 51, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 94, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2014-07-11 18:12:16.975254] E [repce(/bricks/brick2/master_b7):207:__call__] RepceClient: call 2607:140481144624896:1405082536.97 (history) failed on peer with ChangelogException
[2014-07-11 18:12:16.979331] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up xsync change detection mode
[2014-07-11 18:12:16.980051] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine
[2014-07-11 18:12:16.982465] I [master(/bricks/brick3/master_b11):66:gmaster_builder] <top>: setting up changelog change detection mode
[2014-07-11 18:12:16.983171] I [master(/bricks/brick3/master_b11):387:__init__] _GMaster: using 'rsync' as the sync engine

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Version-Release number of selected component (if applicable): upgrade from RHS2.1(3.4.0.59rhs) to RHS3.0(3.6.0.24-1)


How reproducible: Didn't try to reproduce. 


Steps to Reproduce:
1. Create a geo-rep relationship between master and slave on 2.1 (3.4.0.59rhs).
2. Create some data on master and let it sync to slave.
3. Stop geo-rep.
4. Keep creating data on master.
5. Upgrade glusterfs on all the nodes, slave first and then master, using these steps:
    pkill glusterfsd
    pkill glusterfs
    pkill glusterd
    yum update glusterfs -y
6. Start geo-rep.
7. Check the geo-rep log files.


Actual results: The geo-rep logs show "ChangelogException: [Errno 2] No such file or directory".


Expected results: There should be no such backtraces, and after geo-rep starts it should not fail to do the history crawl.


Additional info:

Comment 1 Vijaykumar Koppad 2014-07-11 13:14:57 UTC

Since it fails to do the history crawl after the upgrade, it might affect renames and deletes done during the upgrade (while geo-rep was stopped).

Comment 2 Venky Shankar 2014-07-14 07:06:54 UTC

Vijaykumar, please upload sosreports.

Comment 3 Vijaykumar Koppad 2014-07-14 10:42:45 UTC

Created attachment 917734
sosreport of the all the nodes.

Comment 4 Ajeet Jha 2014-07-16 08:41:31 UTC

This behavior is genuinely acceptable; it was misunderstood because of the traceback and the errno.

EXPLANATION:
After the upgrade, geo-rep "start" called history with a start time (the moment the master gluster was stopped) that is not recorded in htime, because htimes are only recorded by the upgraded version. No linkages are found, so history returns -1, which causes the agent to raise the exception.

What needs to be done: No logical code-base change is needed, but logging improvements could help with debugging in the future.
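
A minimal sketch of the failure path described above, not the actual gsyncd code. `ChangelogException` and the `raise ChangelogException(errn, os.strerror(errn))` pattern come from the traceback in the description; `fake_history`, `recorded_starts` and the timestamps are hypothetical stand-ins for the libgfchangelog history call and the htime index.

```python
import errno
import os


class ChangelogException(OSError):
    """Same exception name as in the traceback."""


def fake_history(recorded_starts, start, end):
    """Hypothetical stand-in for the changelog history call.

    Returns (-1, ENOENT) when the requested start time is older than
    anything recorded in the index, otherwise (0, 0).
    """
    if not recorded_starts or start < min(recorded_starts):
        return -1, errno.ENOENT
    return 0, 0


def history(recorded_starts, start, end):
    ret, errn = fake_history(recorded_starts, start, end)
    if ret == -1:
        # Same pattern as raise_changelog_err() in the traceback.
        raise ChangelogException(errn, os.strerror(errn))
    return ret
```

With an index that only starts after the upgrade, a `start` taken from before the upgrade makes `history` raise, which is the exception the agent reports.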

Comment 6 Vijaykumar Koppad 2014-07-21 10:19:30 UTC

It happened in two other scenarios that didn't involve an upgrade, but it doesn't happen consistently.

First scenario
========================================
1. Create and start a geo-rep relationship between master and slave.
2. Disable changelog.
3. Create data on master.
4. Check the geo-rep logs; there may be a traceback like the one in the description.
========================================

Second scenario
=========================================
1. Create and start a geo-rep relationship between master and slave.
2. Kill the monitor, feedback and agent processes on one of the active nodes.
3. Create data on master.
4. Start geo-rep with force.
5. Check the geo-rep logs for the traceback.
   It doesn't happen every time.
=========================================
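
Since the traceback does not show up on every run, a small helper for scanning a log can make the "check geo-rep logs" step repeatable. This is only a sketch: `count_changelog_errors` is a hypothetical helper, and the sample lines are copied from the description.

```python
import re

# Matches the final line of the traceback shown in the description.
PATTERN = re.compile(r"^ChangelogException: \[Errno (\d+)\] (.*)$")


def count_changelog_errors(lines):
    """Return a dict mapping errno -> number of ChangelogException lines."""
    counts = {}
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            code = int(m.group(1))
            counts[code] = counts.get(code, 0) + 1
    return counts


sample = [
    "[2014-07-11 18:12:16.973531] E [repce(agent):117:worker] <top>: call failed:",
    "Traceback (most recent call last):",
    "ChangelogException: [Errno 2] No such file or directory",
]
print(count_changelog_errors(sample))  # {2: 1}
```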

Comment 10 Aravinda VK 2014-09-25 07:49:39 UTC

The changelog agent (`ps -ax | grep gsyncd | grep agent`) interacts with the changelog API and raises an exception on any error. The geo-rep worker communicates with the agent over RPC, and ChangelogExceptions are handled in the worker. Because RPC propagates the traceback from the agent to the worker, the exception is logged in the log files. These exceptions have no effect, since the worker handles them, but they confuse users.
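
One way the worker side could look, sketched under assumptions: the worker catches the propagated exception, logs a single informational line, and falls back. `register_history` and `crawl_mode_after_history` are hypothetical names, not gsyncd functions; only `ChangelogException` and the fallback to the hybrid (xsync) crawl come from this bug.

```python
import errno
import logging
import os

log = logging.getLogger("georep-sketch")


class ChangelogException(OSError):
    """Same exception name as in the traceback."""


def register_history(start, end):
    """Hypothetical agent call that always fails, as in this bug."""
    raise ChangelogException(errno.ENOENT, os.strerror(errno.ENOENT))


def crawl_mode_after_history(start, end, history_call=register_history):
    """Return the crawl mode the worker would use after trying history."""
    try:
        history_call(start, end)
        return "history"
    except ChangelogException as e:
        # One line instead of a full traceback; the error is already handled.
        log.info("history crawl unavailable (%s); falling back to xsync", e)
        return "xsync"
```

Logging at info level here is a design choice for the sketch: the fallback is expected behavior, so a traceback at error level overstates it.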

Comment 12 Aravinda VK 2015-08-06 15:00:43 UTC

No new changelog index file (HTIME) is created after an upgrade or a brick node reboot. An HTIME file is created only when changelog is disabled and then enabled. (BZ 1211327)

This issue was not seen during upgrade tests of RHGS 3.1. Closing this bug. Please reopen if the issue is found again.