Description of problem:
=======================
While running the geo-replication sanity check, which performs the following fops (create, chmod, chown, chgrp, hardlink, symlink, truncate, rename, remove), the following worker crash was observed. The checksum eventually matches between master and slave, so it is not known after which fop or crawl the crash occurred. The crash was seen only once, and the worker came back online afterwards.

[2017-05-04 17:24:29.679775] I [gsyncdstatus(/bricks/brick0/master_brick1):276:set_passive] GeorepStatus: Worker Status: Passive
[2017-05-04 17:24:30.635793] I [master(/bricks/brick1/master_brick4):1195:crawl] _GMaster: slave's time: (1493917690, 0)
[2017-05-04 17:24:35.496570] E [syncdutils(/bricks/brick1/master_brick4):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 326, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1575, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1702, in rsync
    log_err=log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 56, in sup
    sys._getframe(1).f_code.co_name)(*a, **kw)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1025, in rsync
    "log_rsync_performance", default_value=False))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 264, in get_realtime
    return self.get(opt, printValue=False, default_value=default_value)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 359, in update_to
    update_from_sect(sect, MultiDict(dct, mad, *self.auxdicts))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
[2017-05-04 17:24:35.563095] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.
[2017-05-04 17:24:35.572370] I [repce(/bricks/brick1/master_brick4):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-05-04 17:24:35.573046] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64

Steps to Reproduce:
==================
The exact steps are not known, since the crash was seen in an automation run. Will work to find out the specific steps and update this space later.

Actual results:
==============
Worker crashed and then came back online. The arequal checksum between master and slave matches.

Expected results:
=================
The worker should not crash.
Easy reproducer:
================
cd /usr/libexec/glusterfs/python/syncdaemon/
python

from configinterface import GConffile
conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
                 ["master", "slave"], {})
print conf.get()
conf.set("log-rsync-performance", 10)
print conf.get("log-rsync-performance")

The above script fails with the same traceback.

RCA:
====
The worker is not restarted for some of the configuration changes (for example, log-rsync-performance), which causes a non-string value to be passed to Template. A worker restart masks the problem, since the worker then reads the config values from the file as strings instead of their actual types.
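The RCA above can be demonstrated with the standard library alone, independent of gsyncd's code (a minimal sketch): string.Template accepts a non-string template at construction time and only fails later, when substitution runs the regex over it.

```python
from string import Template

# gsyncd keeps runtime-set config values with their native types; the
# template-expansion step assumes every value is a string. Passing a
# non-string reproduces the same failure class as the worker crash
# (the message reads "expected string or buffer" on Python 2):
try:
    Template(10).safe_substitute({"mastervol": "master"})
except TypeError as exc:
    print("TypeError:", exc)

# Coercing the value to str first -- effectively what a worker restart
# does by re-reading the config file -- avoids the error:
print(Template(str(10)).safe_substitute({"mastervol": "master"}))  # prints "10"
```

Note that `Template(10)` itself does not raise; the constructor merely stores the template, so the bad value crosses the config layer silently and only explodes inside `safe_substitute`.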
Upstream patch: https://review.gluster.org/17489
Upstream patches:
https://review.gluster.org/17489
https://review.gluster.org/17503

Downstream patches:
https://code.engineering.redhat.com/gerrit/108895
https://code.engineering.redhat.com/gerrit/108896
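The direction of the fix can be sketched as follows (a hypothetical helper for illustration only, not the actual patch contents): coerce runtime-set config values to strings before handing them to the template engine, mirroring what a worker restart achieves by re-reading the values from the config file as text.

```python
from string import Template

def safe_expand(value, context):
    """Expand ${...} placeholders in a config value.

    Hypothetical helper illustrating the fix direction: values set at
    runtime may be int/bool rather than str, so coerce to str before
    constructing string.Template, which requires a string template.
    """
    if not isinstance(value, str):
        value = str(value)
    return Template(value).safe_substitute(context)

print(safe_expand(10, {}))                                       # prints "10", no crash
print(safe_expand("${mastervol}/brick", {"mastervol": "vol0"}))  # prints "vol0/brick"
```

With this coercion in place, `conf.set("log-rsync-performance", 10)` followed by a `get` no longer feeds a bare int into template substitution.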
Thanks Aravinda for providing the easy reproducer.

Reproduced this issue with build glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64:

[root@dhcp41-189 ~]# cd /usr/libexec/glusterfs/python/syncdaemon/
[root@dhcp41-189 syncdaemon]# python
Python 2.7.5 (default, Aug 2 2016, 04:20:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
...                  ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "configinterface.py", line 361, in update_to
    update_from_sect(self.section(), MultiDict(dct, *self.auxdicts))
  File "configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
>>> quit()
[root@dhcp41-189 syncdaemon]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-27.el7rhgs.x86_64
glusterfs-rdma-3.8.4-27.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-cli-3.8.4-27.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-27.el7rhgs.x86_64
glusterfs-server-3.8.4-27.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-api-3.8.4-27.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
glusterfs-fuse-3.8.4-27.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-3.8.4-27.el7rhgs.x86_64

Validation of the same steps with build glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64 does not result in the same traceback, as shown below:

[root@dhcp41-160 syncdaemon]# python
Python 2.7.5 (default, May 3 2017, 07:55:04)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
...                  ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
log_rsync_performance: 10
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
10
None
>>> exit()
[root@dhcp41-160 syncdaemon]# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-28.el7rhgs.x86_64
glusterfs-rdma-3.8.4-28.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-28.el7rhgs.x86_64
glusterfs-api-3.8.4-28.el7rhgs.x86_64
glusterfs-cli-3.8.4-28.el7rhgs.x86_64
python-gluster-3.8.4-28.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
glusterfs-3.8.4-28.el7rhgs.x86_64
glusterfs-server-3.8.4-28.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-fuse-3.8.4-28.el7rhgs.x86_64

Also, I have run the automation test suite with rsync and fuse, which originally hit this issue as mentioned in the description. All the cases ran successfully. Moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774