Bug 1885528 - RHOSP 16.1 Upgrades - Ceph osd systemd service has reference to docker when ceph has been updated to RHEL8
Summary: RHOSP 16.1 Upgrades - Ceph osd systemd service has reference to docker when ceph has been updated to RHEL8
Keywords:
Status: CLOSED DUPLICATE of bug 1885558
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: zstream
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Lukas Bezdicka
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-06 09:52 UTC by Giovanni Battista Sciortino
Modified: 2020-11-23 16:44 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-23 16:44:42 UTC
Target Upstream Version:
Embargoed:


Attachments
output of the command "openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph0" (5.50 KB, text/plain)
2020-10-06 10:20 UTC, Giovanni Battista Sciortino

Description Giovanni Battista Sciortino 2020-10-06 09:52:00 UTC
Description of problem:

Following the FFU documentation [1], the command "openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph0" completes without errors (see attached output), but the systemd service "ceph-osd" on the first upgraded node fails to start, with the following errors:

[root@ceph0 ~]# journalctl -u ceph-osd |tail -n 7                                                                                                                                  
Oct 06 10:10:10 ceph0 podman[269786]: Error: no container with name or ID ceph-osd-2 found: no such container                                                                                
Oct 06 10:10:10 ceph0 podman[269796]: Error: Failed to evict container: "": Failed to find container "ceph-osd-2" in state: no container with name or ID ceph-osd-2 found: no such container 
Oct 06 10:10:10 ceph0 systemd[1]: Started Ceph OSD.
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: /usr/share/ceph-osd-run.sh: line 14: docker: command not found                                                                                
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: No data partition found for OSD
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE                                                                                     
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Failed with result 'exit-code'.

The file /usr/share/ceph-osd-run.sh has references to docker (but the node has already been upgraded to RHEL 8):

[root@ceph0 ~]# grep -n docker /usr/share/ceph-osd-run.sh                                                                                                                                    
14:  DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk hammer.ipa.gpslab.club:5000/gpslabrhosp-library-rhosp16_1_ffu-osp13_containers-rhceph-3-rhel7:3-46 list | grep ", osd\.${1}," | awk '{ print $1 }')                                                                                            
27:  DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=${1} hammer.ipa.gpslab.club:5000/gpslabrhosp-library-rhosp16_1_ffu-osp13_containers-rhceph-3-rhel7:3-46 disk_list)

The file /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 provided by the package ceph-ansible-3.2.49-1.el7cp.noarch also contains references to docker.

(undercloud) [stack@undercloud templates]$ grep -n "docker run" /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2                                                          
19:  DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
32:  DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)
61:    part=$(docker run --privileged=true -v /dev:/dev --entrypoint /usr/sbin/ceph-disk {{ ceph_docker_registry}}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list /dev/${1} | awk '/journal / {print $1}')



Version-Release number of selected component (if applicable):

ceph-ansible-3.2.49-1.el7cp.noarch


How reproducible:

Follow the documentation [1] up to the command "openstack overcloud upgrade run --stack STACK NAME --limit overcloud-cephstorage-0" in section "17.3. Upgrading the operating system for Ceph Storage nodes", then check the status of the Ceph OSD service after that command has been executed.
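
For example, the service state can be checked on the upgraded node as follows (the journalctl command is the one whose output is shown in this report; the exact unit name is an assumption and may be templated per OSD id on some deployments):

[root@ceph0 ~]# systemctl status ceph-osd
[root@ceph0 ~]# journalctl -u ceph-osd | tail -n 7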

Actual results:

The ceph-osd service on the upgraded nodes fails to start.

[root@ceph0 ~]# journalctl -u ceph-osd |tail -n 7                                                                                                                                  
Oct 06 10:10:10 ceph0 podman[269786]: Error: no container with name or ID ceph-osd-2 found: no such container                                                                                
Oct 06 10:10:10 ceph0 podman[269796]: Error: Failed to evict container: "": Failed to find container "ceph-osd-2" in state: no container with name or ID ceph-osd-2 found: no such container 
Oct 06 10:10:10 ceph0 systemd[1]: Started Ceph OSD.
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: /usr/share/ceph-osd-run.sh: line 14: docker: command not found                                                                                
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: No data partition found for OSD
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE                                                                                     
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Failed with result 'exit-code'.


Expected results:

The ceph-osd service on the upgraded nodes starts without errors.

Additional info:


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-the-operating-system-for-ceph-storage-nodes-upgrading-overcloud-standard

Comment 1 Giovanni Battista Sciortino 2020-10-06 10:20:35 UTC
Created attachment 1719323 [details]
output of the command "openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph0"

Comment 2 Giovanni Battista Sciortino 2020-10-06 13:52:28 UTC
If the Ceph OSD services are expected to start again after the commands in section 17.3 are executed, I found the following workaround. I don't know whether there is a better solution to this problem.

Before executing the commands in section 17.3, I applied the following changes to the file /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 on the undercloud:
1) replaced the string "docker" with "{{ container_binary }}"
2) added the option "--net=host" to the podman command related to the DATA_PART variable

A diff of the changes is shown below:

 [root@undercloud ~]# diff /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2.orig                      
19c19
<   DATA_PART=$({{ container_binary }} run --net=host --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
---
>   DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
32c32
<   DOCKER_ENV=$({{ container_binary }} run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)
---
>   DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)


After applying these changes, the OSD service returns to the running state.
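
The same edit can also be scripted on the undercloud before running the section 17.3 commands. The following is only a rough sketch that mirrors the diff above (it assumes GNU sed, backs up the template first, and deliberately touches only the DATA_PART and DOCKER_ENV lines):

cp /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2{,.orig}
sed -i \
  -e 's/DATA_PART=$(docker run --rm/DATA_PART=$({{ container_binary }} run --net=host --rm/' \
  -e 's/DOCKER_ENV=$(docker run/DOCKER_ENV=$({{ container_binary }} run/' \
  /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2

This only changes the Jinja2 template on the undercloud; the rendered /usr/share/ceph-osd-run.sh on the Ceph nodes should be regenerated the next time ceph-ansible runs against them.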

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-the-operating-system-for-ceph-storage-nodes-upgrading-overcloud-standard

Comment 5 Lukas Bezdicka 2020-10-21 10:16:07 UTC
I'm pretty sure this is a user error: most likely ceph-ansible was upgraded to the ceph4 version instead of the ceph3 one.

Comment 6 Giovanni Battista Sciortino 2020-10-21 10:53:33 UTC
(In reply to Lukas Bezdicka from comment #5)
> I'm pretty sure this is user error where most likely ceph-ansible was
> upgraded to ceph4 version instead of ceph3.

The version of ceph-ansible used, as described in the initial description of this case, is ceph-ansible-3.2.49-1.el7cp.noarch.rpm.

The same package can also be downloaded from https://access.redhat.com/downloads/content/ceph-ansible/3.2.49-1.el7cp/noarch/fd431d51/package; the file /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 provided by this package has references to docker, as described in this bugzilla.
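
For completeness, the installed version can be confirmed on the undercloud with a simple query (shown here as a sketch, without its output):

(undercloud) [stack@undercloud ~]$ rpm -q ceph-ansible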

