Bug 1518788 - [ceph-ansible] [ceph-container] : rolling update of dmcrypt OSDs failed searching for container expose_partitions_<disk>
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 3.0
Assignee: leseb
QA Contact: Vasishta
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1494421
 
Reported: 2017-11-29 14:59 UTC by Vasishta
Modified: 2018-03-08 15:46 UTC
CC: 16 users

Fixed In Version: rhceph:ceph-3.0-rhel-7-docker-candidate-53483-20180117211610
Doc Type: Bug Fix
Doc Text:
Previously, when upgrading a containerized cluster to Red Hat Ceph Storage 3, the ceph-ansible utility failed to upgrade encrypted OSD nodes. As a consequence, such OSDs could not come up, and the journald logs included the following error message: "Error response from daemon: No such container: expose_partitions_<disk>". With this update, the underlying source code has been fixed, and ceph-ansible upgrades containerized encrypted OSDs as expected.
Clone Of:
Environment:
Last Closed: 2018-03-08 15:46:22 UTC
Target Upstream Version:


Attachments (Terms of Use)
File contains contents of OSD journald log snippet after enabling verbose (123.84 KB, text/plain)
2017-11-29 14:59 UTC, Vasishta


Links
System ID Priority Status Summary Last Updated
Github ceph ceph-container pull 854 None None None 2017-11-29 19:13:01 UTC
Red Hat Product Errata RHBA-2018:0473 normal SHIPPED_LIVE updated rhceph-3.0-rhel7 container image 2018-03-08 20:45:28 UTC

Description Vasishta 2017-11-29 14:59:45 UTC
Created attachment 1360383 [details]
File contains contents of OSD journald log snippet after enabling verbose

Description of problem:
Rolling update of a containerized cluster from 2.4 to 3.0 failed on dmcrypt OSDs, which searched for a container named expose_partitions_<disk>.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.14-1.el7cp.noarch
Container image - rhceph:3-2

Steps to Reproduce:
1. Initialize a containerized Ceph 2.4 cluster.
2. Update ceph-ansible to 3.x and follow the documentation to upgrade the cluster to 3.0.


Actual results:
OSDs fail to come up with the error "No such container: expose_partitions_sdd".
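The failure signature can be checked for in the OSD node's journald output. A minimal sketch follows; the sample log line below is illustrative (only the error string itself comes from this report), and in practice you would grep a saved `journalctl` dump from the affected node:

```shell
# Write a sample journald excerpt containing the failure signature
# (illustrative line; on a real node, save `journalctl` output instead).
cat > /tmp/osd-journal-sample.log <<'EOF'
docker: Error response from daemon: No such container: expose_partitions_sdd.
EOF

# Extract the failure signature, including the affected disk name.
grep -o 'No such container: expose_partitions_[a-z]*' /tmp/osd-journal-sample.log
```

The trailing `[a-z]*` captures which disk's partition-exposure container was missing (here `sdd`).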

Expected results:
OSDs must be updated successfully.

Additional info:

OSD configurations as mentioned in inventory file -

<node> ceph_osd_docker_prepare_env="-e CLUSTER={{ cluster }} -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_FORCE_ZAP=1 -e OSD_DMCRYPT=1" ceph_osd_docker_extra_env="-e CLUSTER={{ cluster }} -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_DMCRYPT=1" devices="['/dev/sdb','/dev/sdc','/dev/sdd']"


Contents of all.yml -
$ grep -Ev '(^#|^$)' group_vars/all.yml
---
dummy:
ceph_docker_image_tag: "3-2"
fetch_directory: ~/ceph-ansible-keys
cluster: humpty
monitor_interface: "eno1"
radosgw_interface: "eno1"
public_network: 10.8.128.0/21
docker: true
ceph_docker_image: "rhceph"
mon_containerized_deployment: true
ceph_mon_docker_interface: "eno1"
ceph_mon_docker_subnet: "{{ public_network }}"
ceph_docker_registry: "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888" #registry.access.redhat.com

Comment 4 Vasishta 2017-11-29 15:46:55 UTC
Additional information:

As there were only three OSD nodes, the playbook failed after retrying 40 times on the task "waiting for clean pgs..." because all OSDs were down on one node.

Comment 5 leseb 2017-11-29 19:13:01 UTC
That's a nasty bug with no workaround at the moment :/
Only the patch will fix the issue.

Comment 6 Harish NV Rao 2017-11-30 09:27:32 UTC
@Drew, Can you please add the summary of the decision made in yesterday's program call regarding this bug?

Comment 7 Harish NV Rao 2017-12-01 06:31:45 UTC
(In reply to Harish NV Rao from comment #6)
> @Drew, Can you please add the summary of the decision made in yesterday's
> program call regarding this bug?

@Drew, a gentle reminder. Please note that this BZ still has a target release of 3.0.

Comment 8 Drew Harris 2017-12-01 16:01:11 UTC
We will release note this for 3.0 and target it for the next async. I'm tracking this bug manually for now until we have a new target location for it.

Comment 10 leseb 2017-12-06 17:01:25 UTC
lgtm

Comment 15 Vasishta 2018-02-23 15:33:36 UTC
Hi,

Working fine using ceph-ansible-3.0.26-1.el7cp.noarch to upgrade to
ceph-3.0-rhel-7-docker-candidate-38019-20180222163657 from rhceph-3-rhel7.

While upgrading a cluster having OSDs with collocated journals, I faced the issue reported in Bug 1548357 and followed the workaround mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=1548357#c4 (mention the mgr's name at
the top of the mons group in the inventory file).
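The workaround referenced above amounts to an inventory ordering like the following sketch (host names are placeholders, not from this report):

```ini
[mgrs]
node-a        # hypothetical host running the mgr daemon

[mons]
node-a        # the mgr's host listed first in the mons group (the workaround)
node-b
node-c
```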

Moving to VERIFIED state.

Regards,
Vasishta Shatsry
AQE, Ceph

Comment 18 errata-xmlrpc 2018-03-08 15:46:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0473

