Description of problem:

Following the FFU documentation [1], the command "openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph0" completes without errors (see attached output), but the systemd service "ceph-osd" on the first node fails to start with the following errors:

[root@ceph0 ~]# journalctl -u ceph-osd | tail -n 7
Oct 06 10:10:10 ceph0 podman[269786]: Error: no container with name or ID ceph-osd-2 found: no such container
Oct 06 10:10:10 ceph0 podman[269796]: Error: Failed to evict container: "": Failed to find container "ceph-osd-2" in state: no container with name or ID ceph-osd-2 found: no such container
Oct 06 10:10:10 ceph0 systemd[1]: Started Ceph OSD.
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: /usr/share/ceph-osd-run.sh: line 14: docker: command not found
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: No data partition found for OSD
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Failed with result 'exit-code'.

The file /usr/share/ceph-osd-run.sh still references docker, even though the node has already been upgraded to RHEL 8:

[root@ceph0 ~]# grep -n docker /usr/share/ceph-osd-run.sh
14: DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk hammer.ipa.gpslab.club:5000/gpslabrhosp-library-rhosp16_1_ffu-osp13_containers-rhceph-3-rhel7:3-46 list | grep ", osd\.${1}," | awk '{ print $1 }')
27: DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=${1} hammer.ipa.gpslab.club:5000/gpslabrhosp-library-rhosp16_1_ffu-osp13_containers-rhceph-3-rhel7:3-46 disk_list)

The template /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2, provided by the package ceph-ansible-3.2.49-1.el7cp.noarch, also references docker:

(undercloud) [stack@undercloud templates]$ grep -n "docker run" /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2
19: DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
32: DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)
61: part=$(docker run --privileged=true -v /dev:/dev --entrypoint /usr/sbin/ceph-disk {{ ceph_docker_registry}}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list /dev/${1} | awk '/journal / {print $1}')

Version-Release number of selected component (if applicable):
ceph-ansible-3.2.49-1.el7cp.noarch

How reproducible:
Follow the documentation [1] up to the command "openstack overcloud upgrade run --stack STACK NAME --limit overcloud-cephstorage-0" in section "17.3. Upgrading the operating system for Ceph Storage nodes", then check the status of the ceph-osd service on the upgraded node.

Actual results:
The ceph-osd service on the upgraded nodes fails to start.
[root@ceph0 ~]# journalctl -u ceph-osd | tail -n 7
Oct 06 10:10:10 ceph0 podman[269786]: Error: no container with name or ID ceph-osd-2 found: no such container
Oct 06 10:10:10 ceph0 podman[269796]: Error: Failed to evict container: "": Failed to find container "ceph-osd-2" in state: no container with name or ID ceph-osd-2 found: no such container
Oct 06 10:10:10 ceph0 systemd[1]: Started Ceph OSD.
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: /usr/share/ceph-osd-run.sh: line 14: docker: command not found
Oct 06 10:10:10 ceph0 ceph-osd-run.sh[269807]: No data partition found for OSD
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Main process exited, code=exited, status=1/FAILURE
Oct 06 10:10:10 ceph0 systemd[1]: ceph-osd: Failed with result 'exit-code'.

Expected results:
The ceph-osd service on the upgraded nodes starts without errors.

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-the-operating-system-for-ceph-storage-nodes-upgrading-overcloud-standard
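For reference, a quick way to confirm the runtime/script mismatch on an already-upgraded Ceph node is sketched below; this is only a sanity check based on the paths quoted above, not an official diagnostic procedure:

  # On the upgraded Ceph node (RHEL 8): docker is no longer installed,
  # only podman is, yet the rendered helper script still invokes docker.
  cat /etc/redhat-release
  command -v docker || echo "docker not installed"
  command -v podman
  grep -n docker /usr/share/ceph-osd-run.sh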
Created attachment 1719323 [details] output of the command "openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph0"
If it is expected that the Ceph OSD services start again after executing the commands in section 17.3 [1], I found the following workaround. I don't know if there is a better solution to this problem.

Before executing the commands in section 17.3, I applied the following changes to the file /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 on the undercloud:

1) replaced the string "docker" with "{{ container_binary }}"
2) added the option "--net=host" to the podman command assigned to the DATA_PART variable

A diff of the changes is reported below:

[root@undercloud ~]# diff /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2.orig
19c19
< DATA_PART=$({{ container_binary }} run --net=host --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
---
> DATA_PART=$(docker run --rm --ulimit nofile=1024:4096 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} list | grep ", osd\.${1}," | awk '{ print $1 }')
32c32
< DOCKER_ENV=$({{ container_binary }} run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)
---
> DOCKER_ENV=$(docker run --rm --net=host --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER={{ cluster }} -e OSD_DEVICE=${1} {{ ceph_docker_registry }}/{{ ceph_docker_image }}:{{ ceph_docker_image_tag }} disk_list)

After applying these changes, the OSD service returns to the running state.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-the-operating-system-for-ceph-storage-nodes-upgrading-overcloud-standard
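A rough way to verify the result on the node after running the section 17.3 steps with the patched template could be the following; the OSD id "2" and the unit name ceph-osd@2 are assumptions taken from the "ceph-osd-2" container name in the log above, so adjust them to your environment:

  # On the upgraded Ceph node: the rendered helper script should now call
  # podman (via {{ container_binary }}) instead of docker.
  grep -n "podman run" /usr/share/ceph-osd-run.sh

  # Find the exact OSD unit name(s), restart one, and re-check the journal
  # with the same command used in the description.
  systemctl list-units 'ceph-osd*'
  systemctl restart ceph-osd@2
  journalctl -u ceph-osd | tail -n 7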
I'm pretty sure this is user error where most likely ceph-ansible was upgraded to ceph4 version instead of ceph3.
(In reply to Lukas Bezdicka from comment #5)
> I'm pretty sure this is user error where most likely ceph-ansible was
> upgraded to ceph4 version instead of ceph3.

The version of ceph-ansible in use, as stated in the first comment of this case, is ceph-ansible-3.2.49-1.el7cp.noarch.rpm. The same package can also be downloaded from https://access.redhat.com/downloads/content/ceph-ansible/3.2.49-1.el7cp/noarch/fd431d51/package, and the file /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2 provided by this package references docker, as described in this bugzilla.
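To make this easy to double-check, something along these lines on the undercloud should be enough to confirm the installed version and which package owns the template (just a sketch; the paths are the ones already quoted in this report):

  # Installed ceph-ansible version and owner of the OSD run-script template.
  rpm -q ceph-ansible
  rpm -qf /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2

  # Count of "docker run" invocations still present in the shipped template.
  grep -c "docker run" /usr/share/ceph-ansible/roles/ceph-osd/templates/ceph-osd-run.sh.j2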