Description of problem: Before running the overcloud upgrade prepare step we need to set ContainerCeph3DaemonImage to trigger the ceph-ansible docker-to-podman playbook using an RHCS 3 container image. On converge, however, we should unset it, or the condition at [1] will also trigger rolling_update with the RHCS 3 image.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L340
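For the prepare step, such an environment file might look like the following sketch. The file name, registry host, and image path here are illustrative assumptions rather than values taken from a real deployment (the tag 3-40 matches the image pulled in the log below):

```shell
# Hypothetical prepare-time environment file; ceph3_image.yaml and the
# registry host are placeholders. Pointing ContainerCeph3DaemonImage at
# an RHCS 3 image is what triggers the docker-to-podman playbook during
# the prepare run.
cat > ceph3_image.yaml <<'EOF'
parameter_defaults:
  ContainerCeph3DaemonImage: undercloud.ctlplane.example.com:8787/rhceph/rhceph-3-rhel7:3-40
EOF
```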
This issue presents after the converge step: when you run `openstack overcloud external-upgrade run --stack $STACK --tags ceph`, it fails with the following ceph-ansible error:

2020-06-15 11:08:47,273 p=264551 u=root n=ansible | TASK [container | disallow pre-nautilus OSDs and enable all new nautilus-only functionality] ***
2020-06-15 11:08:47,274 p=264551 u=root n=ansible | Monday 15 June 2020  11:08:47 -0400 (0:00:00.485)       0:20:09.714 ***********
2020-06-15 11:08:49,215 p=264551 u=root n=ansible | fatal: [osp-test-octopi-zorillas-controller-0 -> 10.10.0.116]: FAILED! => changed=true
  cmd:
  - podman
  - exec
  - ceph-mon-osp-test-octopi-zorillas-controller-0
  - ceph
  - osd
  - require-osd-release
  - nautilus
  delta: '0:00:01.537954'
  end: '2020-06-15 15:08:49.184682'
  msg: non-zero return code
  rc: 22
  start: '2020-06-15 15:08:47.646728'
  stderr: |-
    Invalid command: nautilus not in luminous
    osd require-osd-release luminous {--yes-i-really-mean-it} : set the minimum allowed OSD release to participate in the cluster
    Error EINVAL: invalid command
    Error: non zero exit code: 22: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-06-15 11:08:49,216 p=264551 u=root n=ansible | NO MORE HOSTS LEFT *************************************************************
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | PLAY RECAP *********************************************************************
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | localhost : ok=1 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-0 : ok=160 changed=14 unreachable=0 failed=0 skipped=251 rescued=0 ignored=0
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-1 : ok=160 changed=14 unreachable=0 failed=0 skipped=251 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-2 : ok=161 changed=13 unreachable=0 failed=0 skipped=250 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-0 : ok=420 changed=47 unreachable=0 failed=1 skipped=612 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-1 : ok=297 changed=29 unreachable=0 failed=0 skipped=494 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-2 : ok=293 changed=27 unreachable=0 failed=0 skipped=484 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-novacompute-0 : ok=114 changed=8 unreachable=0 failed=0 skipped=236 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-novacompute-1 : ok=111 changed=7 unreachable=0 failed=0 skipped=225 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | Monday 15 June 2020  11:08:49 -0400 (0:00:01.946)       0:20:11.661 ***********
2020-06-15 11:08:49,221 p=264551 u=root n=ansible | ===============================================================================
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | waiting for clean pgs... ----------------------------------------------- 36.07s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | gather and delegate facts ---------------------------------------------- 28.58s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | stop standby ceph mds -------------------------------------------------- 26.17s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | ceph-container-common : pulling osp-test-octopi-zorillas-undercloud.ctlplane.hextupleo.lab:8787/rhceph/rhceph-3-rhel7:3-40 image -- 17.95s
As per comment #2, the rolling_update playbook was using RHCS 3 containers rather than RHCS 4 containers, which is why the following task fails: https://github.com/ceph/ceph-ansible/blob/v4.0.23/infrastructure-playbooks/rolling_update.yml#L945 So we need a way in THT for the person doing the upgrade to specify that Ceph 4 containers should be used.
HOW TO AVOID THIS ISSUE

1. Before running the converge step, create a file called no_ceph3.yaml (or something similar) containing the following:

   parameter_defaults:
     ContainerCeph3DaemonImage: ''

2. When you run the converge step, include the file as the last argument of your openstack overcloud deploy command, e.g. "openstack overcloud deploy ... -e no_ceph3.yaml". If you've already run the converge step and encountered this bug, you may re-run it with this file included.

3. Proceed with the Ceph upgrade as usual by running a command like: `openstack overcloud external-upgrade run --stack $STACK --tags ceph`
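The steps above can be sketched as a short shell snippet. The elided "..." deploy arguments are placeholders for your existing environment files; the override must stay last on the command line so Heat parameter precedence lets it win:

```shell
# Step 1: create the override file that unsets ContainerCeph3DaemonImage
# so converge does not pass an RHCS 3 image into rolling_update.
cat > no_ceph3.yaml <<'EOF'
parameter_defaults:
  ContainerCeph3DaemonImage: ''
EOF

# Step 2: converge with the override as the LAST -e argument, e.g.
#   openstack overcloud deploy --stack $STACK ... -e no_ceph3.yaml
# Step 3: run the Ceph upgrade as usual:
#   openstack overcloud external-upgrade run --stack $STACK --tags ceph
```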
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148