Bug 1851906 - [Ceph] Increasing the number of RGW instances from 1 to 2 fails with "Failed to load environment files: No such file or directory"
Summary: [Ceph] Increasing the number of RGW instances from 1 to 2 fails with "Failed to load environment files: No such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: z2
Target Release: 4.1
Assignee: Guillaume Abrioux
QA Contact: Sunil Kumar Nagaraju
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-29 11:42 UTC by Mustafa Aydın
Modified: 2020-09-30 17:26 UTC
CC List: 10 users

Fixed In Version: ceph-ansible-4.0.29-1.el8cp, ceph-ansible-4.0.29-1.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-30 17:26:19 UTC
Embargoed:


Attachments
- ansible logs from management VM (7.98 MB, application/gzip), 2020-06-29 11:56 UTC, Mustafa Aydın
- sosreport from one of the rgw nodes (11.22 MB, application/x-xz), 2020-06-29 12:01 UTC, Mustafa Aydın


Links
- GitHub, ceph/ceph-ansible pull 5487 (closed): rgw: fix multi instances scaleout, last updated 2021-01-12 15:53:39 UTC
- Red Hat Product Errata RHBA-2020:4144, last updated 2020-09-30 17:26:44 UTC

Description Mustafa Aydın 2020-06-29 11:42:20 UTC
Description of problem:

Increasing the number of RGW instances from 1 to 2 by changing the radosgw_num_instances parameter to 2 fails. The newly added instances log the following errors:

 stdout: |-
    Socket file  could not be found, which means Rados Gateway is not running. Showing ceph-rgw unit logs now:
    -- Logs begin at Mon 2020-06-29 06:58:57 EDT, end at Mon 2020-06-29 07:28:35 EDT. --
    Jun 29 07:27:00 ceph4osd3.aydin.lab systemd[1]: Failed to load environment files: No such file or directory
    Jun 29 07:27:00 ceph4osd3.aydin.lab systemd[1]: ceph-radosgw.rgw1.service failed to run 'start-pre' task: No such file or directory
    Jun 29 07:27:00 ceph4osd3.aydin.lab systemd[1]: Failed to start Ceph RGW.
    Jun 29 07:27:00 ceph4osd3.aydin.lab systemd[1]: Unit ceph-radosgw.rgw1.service entered failed state.
    Jun 29 07:27:00 ceph4osd3.aydin.lab systemd[1]: ceph-radosgw.rgw1.service failed.
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: ceph-radosgw.rgw1.service holdoff time over, scheduling restart.
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: Stopped Ceph RGW.
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: Failed to load environment files: No such file or directory
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: ceph-radosgw.rgw1.service failed to run 'start-pre' task: No such file or directory
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: Failed to start Ceph RGW.
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: Unit ceph-radosgw.rgw1.service entered failed state.
    Jun 29 07:27:10 ceph4osd3.aydin.lab systemd[1]: ceph-radosgw.rgw1.service failed.
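
For anyone triaging this: the "Failed to load environment files" message typically means the EnvironmentFile= referenced by the new instance's unit did not exist when systemd first tried to start it. A quick way to confirm on the affected node (the unit name is copied from the logs above; the data directory path is the usual Ceph default and may differ per deployment):

    # Show the unit definition, including its EnvironmentFile= line(s):
    systemctl cat ceph-radosgw.rgw1.service
    # Check whether the referenced environment file was actually created:
    ls -l /var/lib/ceph/radosgw/
    # Full unit history for the failing instance:
    journalctl -u ceph-radosgw.rgw1.service --no-pager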


Version-Release number of selected component (if applicable):
RHCS 4.1

How reproducible:
Always

Steps to Reproduce:
- Install the Ceph cluster without setting the radosgw_num_instances parameter in all.yml
- Add "radosgw_num_instances: 2" to all.yml and rerun the playbook, as sketched below
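
A minimal sketch of the reproduction, assuming the stock ceph-ansible layout (the group_vars/all.yml path and the inventory file name "hosts" are assumptions; the playbook name is taken from the additional info below):

    # Enable two RGW instances per gateway host after the initial deploy:
    echo 'radosgw_num_instances: 2' >> group_vars/all.yml
    # Rerun the playbook; the first run fails with the unit errors shown above:
    ansible-playbook -i hosts site-docker.yaml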
 

Actual results:

The same unit failure repeats, exactly as in the log excerpt in the description above: ceph-radosgw.rgw1.service fails to start with "Failed to load environment files: No such file or directory" and cycles through the systemd restart holdoff.

Expected results:

- The second instance is provisioned successfully on the first attempt

Additional info:
- Running the site-docker.yaml playbook again without changing anything succeeds; see the verification sketch below.
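
As a sanity check after the second run, the new instances can be confirmed on the gateway node (a sketch; the unit name pattern is taken from the logs above):

    # All radosgw units should now be active:
    systemctl list-units 'ceph-radosgw*' --no-pager
    # Two radosgw processes should be running on the host:
    pgrep -a radosgw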

Comment 1 Mustafa Aydın 2020-06-29 11:56:50 UTC
Created attachment 1699123
ansible logs from management VM

Comment 2 Mustafa Aydın 2020-06-29 12:01:37 UTC
Created attachment 1699124
sosreport from one of the rgw nodes

Comment 14 errata-xmlrpc 2020-09-30 17:26:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144

