Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2238930

Summary: cephadm hung on 'ceph orch status' before applying spec during FFU with director operator

Product: Red Hat OpenStack
Component: tripleo-ansible
Version: 17.1 (Wallaby)
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: z2
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Reporter: John Fulton <johfulto>
Assignee: Manoj Katari <mkatari>
QA Contact: Alfredo <alfrgarc>
CC: alfrgarc, mariel, mkatari, vhariria
Keywords: Triaged
Type: Bug
Fixed In Version: tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost
Doc Type: No Doc Update
Last Closed: 2024-01-16 14:31:00 UTC

Description John Fulton 2023-09-14 12:07:34 UTC
This bug is just like BZ 2222589, except that it happened earlier in the upgrade process.

In this context, DeployedCeph=false because we're using director Operator (not regular director), so the TripleO client is not calling the cephadm.yml playbook. Instead, cephadm.yml is invoked by this role, as executed by Heat:

  https://github.com/openstack/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_run_cephadm/tasks/main.yml
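
For illustration only, that invocation amounts to the role shelling out to ansible-playbook to run cephadm.yml. The sketch below is an assumption about its general shape, not the contents of the linked tasks/main.yml; the playbook path, inventory location, and extra-vars file name are illustrative.

  # Illustrative sketch only -- not the actual tasks/main.yml linked above.
  # The playbook path, ceph_working_dir variable, and extra-vars file name
  # are assumptions made for this example.
  - name: Run the cephadm.yml playbook (as tripleo_run_cephadm would)
    ansible.builtin.command:
      cmd: >-
        ansible-playbook
        -i {{ ceph_working_dir }}/inventory.yml
        -e @{{ ceph_working_dir }}/cephadm-extra-vars-heat.yml
        /usr/share/ansible/tripleo-playbooks/cephadm.yml
    changed_when: true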

The fix to BZ 2222589 was to ensure ceph orch doesn't hang while showing the ceph cluster status:

  https://review.opendev.org/c/openstack/tripleo-ansible/+/887565/3/tripleo_ansible/playbooks/cephadm.yml#110
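
The general shape of such a safeguard is to bound the status check so it cannot block indefinitely. A minimal sketch, assuming a ceph_cli variable that holds the "cephadm shell -- ceph" command prefix and an arbitrary 90-second limit (this is illustrative, not the literal change from the review above):

  # Minimal sketch: bound the orchestrator status check with coreutils timeout
  # so a hung "ceph orch status" fails fast instead of blocking the playbook.
  # ceph_cli and the 90-second limit are assumptions for illustration.
  - name: Get the ceph orchestrator status (bounded)
    ansible.builtin.command:
      cmd: "timeout 90 {{ ceph_cli }} orch status --format json"
    register: ceph_orch_status
    become: true
    failed_when: false

  - name: Fail if ceph orchestrator is not available
    ansible.builtin.fail:
      msg: "ceph orch status timed out or reported the orchestrator unavailable"
    when: ceph_orch_status.rc != 0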

However, ceph orch hung before we could get to line 110, as evidenced by the log.

The cephadm_command.log has the following:

2023-09-13 10:52:50,297 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.297311 | 0a580a82-004e-f885-4c3b-00000000009f |       TASK | set prometheus container image in ceph configuration
2023-09-13 10:52:50,351 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.351523 | 0a580a82-004e-f885-4c3b-00000000009f |    SKIPPED | set prometheus container image in ceph configuration | controller-0

So in the above, we've finished the "Run ceph config" import_role:

  https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/playbooks/cephadm.yml#L38

Next, we get to the "Apply Ceph spec" import_role.

2023-09-13 10:52:50,369 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.369428 | 0a580a82-004e-f885-4c3b-0000000000b9 |       TASK | Stat spec file on bootstrap node
2023-09-13 10:52:50,765 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.765799 | 0a580a82-004e-f885-4c3b-0000000000b9 |         OK | Stat spec file on bootstrap node | controller-0 | item=/home/ceph-admin/specs/ceph_spec.yaml
2023-09-13 10:52:50,782 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.782531 | 0a580a82-004e-f885-4c3b-0000000000ba |       TASK | Fail if spec file is missing
2023-09-13 10:52:50,853 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.852928 | 0a580a82-004e-f885-4c3b-0000000000ba |    SKIPPED | Fail if spec file is missing | controller-0 | item={'changed': False, 'stat': {'exists': True, 'path': '/home/ceph-admin/specs/ceph_spec.yaml', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 1001, 'gid': 1002, 'size': 1237, 'inode': 2073981, 'dev': 64515, 'nlink': 1, 'atime': 1694602336.5713072, 'mtime': 1694602336.120307, 'ctime': 1694602336.5843072, 'wusr': True, 'rusr': True, 'xusr': False, 'wgrp': False, 'rgrp': True, 'xgrp': False, 'woth': False, 'roth': True, 'xoth': False, 'isuid': False, 'isgid': False, 'blocks': 8, 'block_size': 4096, 'device_type': 0, 'readable': True, 'writeable': True, 'executable': False, 'pw_name': 'ceph-admin', 'gr_name': 'ceph-admin', 'checksum': '002e86d1aef045a674919ebb76aae634f0cbdb5e', 'mimetype': 'text/plain', 'charset': 'us-ascii', 'version': '3983780799', 'attributes': [], 'attr_flags': ''}, 'invocation': {'module_args': {'path': '/home/ceph-admin/specs/ceph_spec.yaml', 'follow': False, 'get_md5': False, 'get_checksum': True, 'get_mime': True, 'get_attributes': True, 'checksum_algorithm': 'sha1'}}, 'failed': False, 'item': '/home/ceph-admin/specs/ceph_spec.yaml', 'ansible_loop_var': 'item'}
2023-09-13 10:52:50,866 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.866369 | 0a580a82-004e-f885-4c3b-0000000000bb |       TASK | Get ceph_cli
2023-09-13 10:52:50,932 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.932478 | 460dee40-8a81-4066-8542-ff00f1a4dd3d |   INCLUDED | /usr/share/ansible/roles/tripleo_cephadm/tasks/ceph_cli.yaml | controller-0
2023-09-13 10:52:50,940 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:50.940855 | 0a580a82-004e-f885-4c3b-000000000311 |       TASK | Set ceph CLI
2023-09-13 10:52:51,136 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:51.135895 | 0a580a82-004e-f885-4c3b-000000000311 |         OK | Set ceph CLI | controller-0
2023-09-13 10:52:51,148 p=15897 u=cloud-admin n=ansible | 2023-09-13 10:52:51.148670 | 0a580a82-004e-f885-4c3b-0000000000bc |       TASK | Get the ceph orchestrator status
2023-09-14 08:39:26,487 p=15897 u=cloud-admin n=ansible | 2023-09-14 08:39:26.487215 | 0a580a82-004e-f885-4c3b-0000000000bc |    CHANGED | Get the ceph orchestrator status | controller-0
2023-09-14 08:39:26,498 p=15897 u=cloud-admin n=ansible | 2023-09-14 08:39:26.498187 | 0a580a82-004e-f885-4c3b-0000000000bd |       TASK | Fail if ceph orchestrator is not available

Note the timestamp! 

From Sep 13 10:52 to Sep 14 08:39, the ceph orchestrator was hung. It only recovered because someone manually restarted it; i.e., they applied the workaround themselves, since our fix at line 110 is needed earlier in the process.
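
For completeness, the manual restart described above typically amounts to failing over the active ceph-mgr so that the cephadm orchestrator module reloads. A hedged sketch of that workaround as an Ansible task; the "cephadm shell" wrapping is an assumption about how the CLI is invoked on the bootstrap controller:

  # Sketch of the manual workaround: fail over the active mgr so the cephadm
  # orchestrator module restarts. Assumes cephadm is installed on the target
  # host and the task runs with root privileges.
  - name: Fail over the active ceph-mgr to recover a hung orchestrator
    ansible.builtin.command:
      cmd: "cephadm shell -- ceph mgr fail"
    become: true
    changed_when: true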

Comment 16 errata-xmlrpc 2024-01-16 14:31:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209