Description of problem:
=======================
While running the geo-replication sanity check, which performs the following fops (create, chmod, chown, chgrp, hardlink, symlink, truncate, rename, remove), the following worker crash was observed. The checksum eventually matches between master and slave, so it is not known after which fop or crawl the crash occurred. The crash was seen only once, and the worker came back online afterwards.

[2017-05-04 17:24:29.679775] I [gsyncdstatus(/bricks/brick0/master_brick1):276:set_passive] GeorepStatus: Worker Status: Passive
[2017-05-04 17:24:30.635793] I [master(/bricks/brick1/master_brick4):1195:crawl] _GMaster: slave's time: (1493917690, 0)
[2017-05-04 17:24:35.496570] E [syncdutils(/bricks/brick1/master_brick4):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 326, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1575, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1702, in rsync
    log_err=log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 56, in sup
    sys._getframe(1).f_code.co_name)(*a, **kw)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1025, in rsync
    "log_rsync_performance", default_value=False))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 264, in get_realtime
    return self.get(opt, printValue=False, default_value=default_value)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 359, in update_to
    update_from_sect(sect, MultiDict(dct, mad, *self.auxdicts))
  File "/usr/libexec/glusterfs/python/syncdaemon/configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
[2017-05-04 17:24:35.563095] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.
[2017-05-04 17:24:35.572370] I [repce(/bricks/brick1/master_brick4):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-05-04 17:24:35.573046] I [syncdutils(/bricks/brick1/master_brick4):237:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64

Steps to Reproduce:
==================
The exact steps are not known, since the crash was seen in an automation run. Will work to find out the specific steps and update this space later.

Actual results:
==============
Worker crashed and then came back online. The arequal checksum between master and slave matches.

Expected results:
=================
The worker should not crash.
Easy reproducer:
================
cd /usr/libexec/glusterfs/python/syncdaemon/
python

from configinterface import GConffile
conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
                 ["master", "slave"], {})
print conf.get()
conf.set("log-rsync-performance", 10)
print conf.get("log-rsync-performance")

The above script fails with the same traceback.

RCA:
====
The worker is not restarted for some of the configuration changes (for example, log-rsync-performance), which causes a non-string value to be passed to Template. A worker restart masks the problem, since the worker then reads the config values from the file as strings instead of their actual types.
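The RCA above can be demonstrated with the standard library alone, independent of gsyncd's code (a minimal sketch): string.Template accepts a non-string template at construction time and only fails later, when substitution runs the regex over it.

```python
from string import Template

# gsyncd keeps runtime-set config values with their native types; the
# template-expansion step assumes every value is a string. Passing a
# non-string reproduces the same failure class as the worker crash
# (the message reads "expected string or buffer" on Python 2):
try:
    Template(10).safe_substitute({"mastervol": "master"})
except TypeError as exc:
    print("TypeError:", exc)

# Coercing the value to str first -- effectively what a worker restart
# does by re-reading the config file -- avoids the error:
print(Template(str(10)).safe_substitute({"mastervol": "master"}))  # prints "10"
```

Note that `Template(10)` itself does not raise; the constructor merely stores the template, so the bad value crosses the config layer silently and only explodes inside `safe_substitute`.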
Upstream patch: https://review.gluster.org/17489
Upstream patches:
https://review.gluster.org/17489
https://review.gluster.org/17503

Downstream patches:
https://code.engineering.redhat.com/gerrit/108895
https://code.engineering.redhat.com/gerrit/108896
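The direction of the fix can be sketched as follows (a hypothetical helper for illustration only, not the actual patch contents): coerce runtime-set config values to strings before handing them to the template engine, mirroring what a worker restart achieves by re-reading the values from the config file as text.

```python
from string import Template

def safe_expand(value, context):
    """Expand ${...} placeholders in a config value.

    Hypothetical helper illustrating the fix direction: values set at
    runtime may be int/bool rather than str, so coerce to str before
    constructing string.Template, which requires a string template.
    """
    if not isinstance(value, str):
        value = str(value)
    return Template(value).safe_substitute(context)

print(safe_expand(10, {}))                                       # prints "10", no crash
print(safe_expand("${mastervol}/brick", {"mastervol": "vol0"}))  # prints "vol0/brick"
```

With this coercion in place, `conf.set("log-rsync-performance", 10)` followed by a `get` no longer feeds a bare int into template substitution.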
Thanks Aravinda for providing the easy reproducer.

Reproduced this issue with build glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64:

[root@dhcp41-189 ~]# cd /usr/libexec/glusterfs/python/syncdaemon/
[root@dhcp41-189 syncdaemon]# python
Python 2.7.5 (default, Aug 2 2016, 04:20:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
...                  ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "configinterface.py", line 369, in get
    self.update_to(d, allow_unresolved=True)
  File "configinterface.py", line 361, in update_to
    update_from_sect(self.section(), MultiDict(dct, *self.auxdicts))
  File "configinterface.py", line 343, in update_from_sect
    dct[k] = Template(v).safe_substitute(mud)
  File "/usr/lib64/python2.7/string.py", line 205, in safe_substitute
    return self.pattern.sub(convert, self.template)
TypeError: expected string or buffer
>>> quit()
[root@dhcp41-189 syncdaemon]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-27.el7rhgs.x86_64
glusterfs-rdma-3.8.4-27.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-cli-3.8.4-27.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-27.el7rhgs.x86_64
glusterfs-server-3.8.4-27.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-api-3.8.4-27.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
glusterfs-fuse-3.8.4-27.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-3.8.4-27.el7rhgs.x86_64

Validation of the same steps with build glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64 does not result in the same traceback, as shown below:

[root@dhcp41-160 syncdaemon]# python
Python 2.7.5 (default, May 3 2017, 07:55:04)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from configinterface import GConffile
>>> conf = GConffile("/var/lib/glusterd/geo-replication/gsyncd_template.conf",
...                  ["master", "slave"], {})
>>> print conf.get()
special_sync_mode: partial
state_socket_unencoded: /var/lib/glusterd/geo-replication/${mastervol}/${eSlave}.socket
gluster_log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}${local_id}.gluster.log
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ignore_deletes: true
change_detector: changelog
gluster_command_dir: /usr/sbin/
working_dir: /var/run/gluster/${mastervol}/${eSlave}
remote_gsyncd: /usr/libexec/glusterfs/gsyncd
log_file: /var/log/glusterfs/geo-replication/${mastervol}/${eSlave}.log
socketdir: /var/run/gluster
state_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.status
state_detail_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/${eSlave}-detail.status
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
pid_file: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/monitor.pid
georep_session_working_dir: /var/lib/glusterd/geo-replication/${mastervol}_${remotehost}_${slavevol}/
gluster_params: aux-gfid-mount acl
log_rsync_performance: 10
None
>>> conf.set("log-rsync-performance", 10)
>>> print conf.get("log-rsync-performance")
10
None
>>> exit()
[root@dhcp41-160 syncdaemon]# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-28.el7rhgs.x86_64
glusterfs-rdma-3.8.4-28.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-28.el7rhgs.x86_64
glusterfs-api-3.8.4-28.el7rhgs.x86_64
glusterfs-cli-3.8.4-28.el7rhgs.x86_64
python-gluster-3.8.4-28.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-4.el7.x86_64
glusterfs-3.8.4-28.el7rhgs.x86_64
glusterfs-server-3.8.4-28.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-28.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-fuse-3.8.4-28.el7rhgs.x86_64

Also, I have run the automation test suite with rsync and fuse, which originally hit this issue as mentioned in the description. All the cases ran successfully. Moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774