Description of problem:

I am running some tests on RHOS-16.1-RHEL-8-20200625.n.0 and found that swift_rsync is only running on controller-1. It has status "exited(11)" on controller-0 and controller-2.

One of the tests I run performs a hard reboot on the three controller nodes, executed in parallel. I reviewed the test logs and the issue seems to have started right after these reboots. I tried a soft reboot afterwards, but it did not fix the issue.

When restarting this container with podman restart, the following error is shown (tailing the /var/log/containers/stdouts/swift_rsync.log file):

2020-06-30T16:23:00.775833402+00:00 stderr F failed to create pid file /var/run/rsyncd.pid: File exists

I did not find that file either on the main file system or inside the swift_proxy container. I have not seen any other errors under /var/log/containers/swift/.

I am testing on two environments with the same OSP 16.1 version and the issue is only reproduced on one of them. The differences:

1) NO ISSUE
Virtualized environment (all OC nodes are VMs)
Installed with RHOS-16.1-RHEL-8-20200625.n.0 directly

2) ISSUE
Hybrid environment (two compute nodes are BM servers)
Installed with core_puddle: RHOS-16.1-RHEL-8-20200610.n.0 and then updated to RHOS-16.1-RHEL-8-20200625.n.0
(I did not test this with RHOS-16.1-RHEL-8-20200610.n.0)

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200625.n.0
openstack-swift-proxy-2.23.2-0.20200505123431.2e50b58.el8ost.noarch
python3-swift-2.23.2-0.20200505123431.2e50b58.el8ost.noarch
puppet-swift-15.4.1-0.20200524163422.cc79e4a.el8ost.noarch

How reproducible:
1/2 (see notes about the two different environments above)

Steps to Reproduce:
1. Apparently it happened after the controller nodes were hard rebooted in parallel.
2.
3.

Actual results:
Only 1/3 swift_rsync running

Expected results:
3/3 swift_rsync running
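For anyone triaging this, a quick way to see the container state and the error on a controller (standard podman/shell commands; nothing here beyond what the report above describes):

# Show the swift_rsync container and its current status:
sudo podman ps -a --filter name=swift_rsync --format '{{.Names}} {{.Status}}'

# Tail the container's stdout log for the pid-file error:
sudo tail -n 20 /var/log/containers/stdouts/swift_rsync.log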
I suspect it's the same underlying problem as this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1517548 The daemon configuration needs to have the pid file disabled (because it's in a container anyway). Christian, do you mind looking at this?
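For context, the relevant rsyncd.conf setting looks like this (the path is taken from the error above; this is a sketch of the config shape, not the exact shipped file):

# In /etc/rsyncd.conf, this global setting makes rsyncd refuse to start
# when the file already exists, e.g. left over after a hard reboot:
pid file = /var/run/rsyncd.pid

# Removing the line avoids the stale-pid failure; inside a container the
# pid file serves no purpose anyway.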
Can't be merged until the puppetlabs-rsync module is updated to include https://github.com/puppetlabs/puppetlabs-rsync/pull/120
Looking at the sosreports, 2 of the 3 rsyncd.conf files still have the pid setting applied (nodes 0 and 2). Both files are also older than the one on node 1, which suggests they missed a config file update.

Node 1 does not have that setting and behaves as expected, because we added a workaround in t-h-t two years ago that removes the setting: https://review.opendev.org/#/c/577403

The question now is why step 3 hasn't been applied on nodes 0 and 2. Any ideas?
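To confirm which nodes still carry the setting (the path is the one used in the workaround below; the timestamp comparison follows the observation above):

# A hit here means the node still has the problematic setting, i.e. the
# t-h-t step 3 workaround never ran on it:
sudo grep 'pid file' /var/lib/config-data/puppet-generated/swift/etc/rsyncd.conf

# Compare modification times across nodes to spot the missed update:
sudo stat -c '%y %n' /var/lib/config-data/puppet-generated/swift/etc/rsyncd.conf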
Are nodes 0 and 2 pre-deployed nodes? In that case the workaround would look like this:

1. Execute the following command on all controller nodes that are pre-deployed:

for d in $(podman inspect swift_rsync | jq '.[].GraphDriver.Data.UpperDir' | xargs) /var/lib/config-data/puppet-generated/swift; do sed -i -e '/pid file/d' $d/etc/rsyncd.conf; done

That should fix the issue until we have a permanent fix merged.
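The same workaround expanded for readability (functionally identical to the one-liner above; the final podman restart is my assumption for picking up the change, not part of the original workaround):

# Two config locations need the fix:
#  1) the running container's overlay upper dir (from podman inspect), and
#  2) the puppet-generated config directory on the host.
dirs="$(podman inspect swift_rsync | jq '.[].GraphDriver.Data.UpperDir' | xargs) /var/lib/config-data/puppet-generated/swift"
for d in $dirs; do
    # Delete any "pid file" line from rsyncd.conf in each location.
    sed -i -e '/pid file/d' "$d/etc/rsyncd.conf"
done
# Assumed follow-up: restart the container so rsyncd starts without the setting.
podman restart swift_rsync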
Hi Christian (and all),

Can you review the doc text draft that I added for technical accuracy?

Thanks, Naomi

______________________________________________

There is currently a known issue with the Object Storage service (swift). If you are using pre-deployed nodes, you might encounter the following error message in /var/log/containers/stdouts/swift_rsync.log:

"failed to create pid file /var/run/rsyncd.pid: File exists"

Workaround: Enter the following command on all Controller nodes that are pre-deployed:

for d in $(podman inspect swift_rsync | jq '.[].GraphDriver.Data.UpperDir'| xargs) /var/lib/config-data/puppet-generated/swift; do sed -i -e '/pid file/d' $d/etc/rsyncd.conf; done

______________________________________________
(In reply to ndeevy from comment #24)
> Can you review the doc text draft that I added for technical accuracy?

Thanks Naomi; Pete just stumbled upon a minor issue - the "| xargs" is not needed in the command, so I removed that part in the doc entry.

> ______________________________________________
>
> There is currently a known issue with the Object Storage service (swift). If
> you are using pre-deployed nodes, you might encounter the following error
> message in /var/log/containers/stdouts/swift_rsync.log:
>
> "failed to create pid file /var/run/rsyncd.pid: File exists"
>
> Workaround: Enter the following command on all Controller nodes that are
> pre-deployed:
>
> for d in $(podman inspect swift_rsync | jq '.[].GraphDriver.Data.UpperDir'|
> xargs) /var/lib/config-data/puppet-generated/swift; do sed -i -e '/pid
> file/d' $d/etc/rsyncd.conf; done
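One side note on the xargs point (my own observation, not from the thread): jq prints strings JSON-quoted by default, and xargs was presumably there to strip the quotes; jq's -r (raw output) flag achieves the same without the extra pipe:

# Default jq output is JSON-encoded, i.e. quoted:
podman inspect swift_rsync | jq '.[].GraphDriver.Data.UpperDir'
# "/var/lib/containers/storage/overlay/<id>/diff"   <- example shape only

# With -r the value is printed raw, no quotes, no xargs needed:
podman inspect swift_rsync | jq -r '.[].GraphDriver.Data.UpperDir'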
Verified on: RHOS-16.1-RHEL-8-20201130.n.0

[heat-admin@controller-0 ~]$ rpm -qa | grep puppet-rsync
puppet-rsync-1.1.1-0.20200311051621.a7d4f84.el8ost.noarch

1. "grep 'pid file' /var/lib/config-data/puppet-generated/swift/etc/rsyncd.conf" returned nothing on all of the controllers.
2. swift_rsync is up and running after rebooting the controllers.
3. No errors in /var/log/containers/stdouts/swift_rsync.log.
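To repeat that check from the undercloud in one go, something like the following should work (a sketch; assumes the heat-admin user and controller hostnames seen above):

# No output from grep means the "pid file" setting is gone, as expected.
for host in controller-0 controller-1 controller-2; do
  echo "== $host =="
  ssh heat-admin@$host "sudo grep 'pid file' /var/lib/config-data/puppet-generated/swift/etc/rsyncd.conf"
done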
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5413