Bug 1448386

Summary: [geo-rep]: Worker crashed with TypeError: expected string or buffer
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Rochelle <rallan>
Component: geo-replication Assignee: Aravinda VK <avishwan>
Status: CLOSED ERRATA QA Contact: Rochelle <rallan>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.3 CC: amukherj, csaba, rhs-bugs, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-28 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1459620 (view as bug list) Environment:
Last Closed: 2017-09-21 04:41:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1417151, 1459620    

Description Rochelle 2017-05-05 10:28:23 UTC
Description of problem:
=======================

While running the geo-replication sanity check, which performs the following FOPs (create, chmod, chown, chgrp, hardlink, symlink, truncate, rename, remove), the following worker crash was observed. The arequal checksum eventually matches between master and slave, so it is not known after which FOP or crawl the crash occurred. The crash was seen only once, and the worker came back online afterwards.

[2017-05-04 17:24:29.679775] I [gsyncdstatus(/bricks/brick0/master_brick1):276:set_passive] GeorepStatus: Worker Status: Passive
[2017-05-04 17:24:30.635793] I [master(/bricks/brick1/master_brick4):1195:crawl] _GMaster: slave's time: (1493917690, 0)
[2017-05-04 17:24:35.496570] E [syncdutils(/bricks/brick1/master_brick4):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 326, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1575, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1702, in rsync
    log_err=log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 56, in sup
    sys._getframe(1).f_code.co_name)(*a, **kw)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1025, in rsync
    "log_rsync_performance", default_value=False))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 264, in get_realtime
    return self.get(opt, printValue=False, default_value=default_value)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 359, in update_to
    update_from_sect(sect, MultiDict(dct, mad, *self.auxdicts))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
[2017-05-04 17:24:35.563095] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.
[2017-05-04 17:24:35.572370] I [repce(/bricks/brick1/master_brick4):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-05-04 17:24:35.573046] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.
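The TypeError in the traceback comes from the Python standard library itself: string.Template accepts any object at construction time, but safe_substitute() hands the stored template to the regex engine, which requires a string. A minimal standalone sketch of the failure mode, independent of gsyncd (the exact message wording differs between Python 2 and Python 3):

```python
from string import Template

# Template() does not validate its argument, so passing a non-string
# (e.g. an int config value) raises no error here.
t = Template(10)

try:
    # Substitution runs the template through the regex engine,
    # which rejects non-string input.
    t.safe_substitute({})
except TypeError as e:
    # Python 2: "expected string or buffer"
    # Python 3: "expected string or bytes-like object"
    print("TypeError: %s" % e)
```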

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64


Steps to Reproduce:
==================
The exact steps are not known since the crash was seen in an automation run.
Will work to find the specific steps and update this space later.

Actual results:
==============

Worker crashed and then came online.
Arequal checksum between master and slave matches.


Expected results:
=================

Worker should not crash.

Comment 5 Aravinda VK 2017-06-06 12:48:33 UTC
Easy reproducer:

cd /usr/libexec/glusterfs/python/syncdaemon/
python

    from configinterface import GConffile
    conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf", 
                     ["master", "slave"], {})
    print conf.get()
    conf.set("log-rsync-performance", 10)
    print conf.get("log-rsync-performance")

The above script fails with the same traceback.

RCA: We do not restart the worker for some configuration changes (for example, log-rsync-performance), which causes a non-string value to be passed as the template. A worker restart avoids the problem, since the worker then reads the config values back from file as strings instead of their actual types.
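As a rough sketch of the kind of fix the RCA suggests, coercing the value to a string before templating avoids the crash. The helper name below is illustrative only and is not taken from the actual upstream patch:

```python
from string import Template

def substitute_value(v, mapping):
    """Coerce a config value to str before templating, so a non-string
    value set in-memory (e.g. an int) cannot crash safe_substitute().
    Illustrative helper, not the actual upstream patch."""
    if not isinstance(v, str):
        v = str(v)
    return Template(v).safe_substitute(mapping)

# An int value is now templated without raising TypeError:
print(substitute_value(10, {}))                      # "10"
print(substitute_value("${x}/log", {"x": "/var"}))   # "/var/log"
```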

Comment 6 Atin Mukherjee 2017-06-08 04:00:12 UTC
upstream patch : https://review.gluster.org/17489

Comment 9 Rochelle 2017-06-24 17:21:37 UTC
Thanks Aravinda for providing the easy reproducer. I have reproduced this issue with build: glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64

[root@dhcp41-189 ~]# 
[root@dhcp41-189 ~]# cd /usr/libexec/glusterfs/python/syncdaemon/
[root@dhcp41-189 syncdaemon]# python
Python 2.7.5 (default, Aug  2 2016, 04:20:16) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf", 
...                  ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "configinterface.py", line 361, in update_to
    update_from_sect(self.section(), MultiDict(dct, *self.auxdicts))
  File "configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
>>> quit ()
[root@dhcp41-189 syncdaemon]# 
[root@dhcp41-189 syncdaemon]# 
[root@dhcp41-189 syncdaemon]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-27.el7rhgs.x86_64
glusterfs-rdma-3.8.4-27.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-cli-3.8.4-27.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-27.el7rhgs.x86_64
glusterfs-server-3.8.4-27.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-api-3.8.4-27.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
glusterfs-fuse-3.8.4-27.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-3.8.4-27.el7rhgs.x86_64


Validation of the same steps with build glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64 does not result in the traceback, as shown below:

[root@dhcp41-160 syncdaemon]# 
[root@dhcp41-160 syncdaemon]# 
[root@dhcp41-160 syncdaemon]# python
Python 2.7.5 (default, May  3 2017, 07:55:04) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf", 
...                      ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
log_rsync_performance: 10
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
10
None
>>> exit()
[root@dhcp41-160 syncdaemon]# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-28.el7rhgs.x86_64
glusterfs-rdma-3.8.4-28.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-28.el7rhgs.x86_64
glusterfs-api-3.8.4-28.el7rhgs.x86_64
glusterfs-cli-3.8.4-28.el7rhgs.x86_64
python-gluster-3.8.4-28.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
glusterfs-3.8.4-28.el7rhgs.x86_64
glusterfs-server-3.8.4-28.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-fuse-3.8.4-28.el7rhgs.x86_64




Also, I have run the automation test suite with rsync and fuse, which initially triggered this issue as mentioned in the description. All the cases ran successfully. Moving the bug to the verified state.

Comment 11 errata-xmlrpc 2017-09-21 04:41:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774