Description of problem:
OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker on environments with radosgw enabled fails:

FAILED! => {"changed": false, "failed": true, "msg": "you must set radosgw_interface, radosgw_address or radosgw_address_block"}

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.8-1.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 with radosgw enabled:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/ceph-radosgw.yaml \
-e $THT/environments/tls-endpoints-public-ip.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/public_vip.yaml \
-e ~/openstack_deployment/environments/enable-tls.yaml \
-e ~/openstack_deployment/environments/inject-trust-anchor.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml

2. Upgrade to OSP12:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/ceph-ansible/ceph-ansible.yaml \
-e $THT/environments/ceph-radosgw.yaml \
-e $THT/environments/tls-endpoints-public-ip.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/public_vip.yaml \
-e ~/openstack_deployment/environments/enable-tls.yaml \
-e ~/openstack_deployment/environments/inject-trust-anchor.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml

Actual results:
Upgrade fails:

[root@undercloud-0 stack]# tail /var/log/mistral/ceph-install-workflow.log
2017-11-04 17:33:23,336 p=27936 u=mistral | TASK [ceph-docker-common : make sure radosgw_interface, radosgw_address or radosgw_address_block is defined] ***
2017-11-04 17:33:23,395 p=27936 u=mistral | fatal: [192.168.0.24]: FAILED! => {"changed": false, "failed": true, "msg": "you must set radosgw_interface, radosgw_address or radosgw_address_block"}
2017-11-04 17:33:23,396 p=27936 u=mistral | PLAY RECAP *********************************************************************
2017-11-04 17:33:23,396 p=27936 u=mistral | 192.168.0.13 : ok=4 changed=0 unreachable=0 failed=0
2017-11-04 17:33:23,396 p=27936 u=mistral | 192.168.0.17 : ok=4 changed=0 unreachable=0 failed=0
2017-11-04 17:33:23,397 p=27936 u=mistral | 192.168.0.19 : ok=4 changed=0 unreachable=0 failed=0
2017-11-04 17:33:23,397 p=27936 u=mistral | 192.168.0.20 : ok=4 changed=0 unreachable=0 failed=0
2017-11-04 17:33:23,397 p=27936 u=mistral | 192.168.0.23 : ok=4 changed=0 unreachable=0 failed=0
2017-11-04 17:33:23,397 p=27936 u=mistral | 192.168.0.24 : ok=24 changed=3 unreachable=0 failed=1
2017-11-04 17:33:23,397 p=27936 u=mistral | localhost : ok=0 changed=0 unreachable=0 failed=0

Expected results:
Upgrade completes fine.
Additional info:

[stack@undercloud-0 ~]$ cat /home/stack/ceph-ansible-env.yaml
parameter_defaults:
  CephAnsibleDisksConfig:
    devices:
      - '/dev/vdb'
      - '/dev/vdc'
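For anyone triaging the same failure: the quickest way to confirm it is this radosgw variable check is to grep the Mistral-driven ceph-ansible log on the undercloud. A minimal sketch; the log path is the one shown in the Actual results above, the grep pattern is just an assumption about what to search for:

# On the undercloud, as a user able to read the Mistral logs
# (path taken from the "Actual results" output above)
sudo grep -B 2 -A 12 'radosgw_interface' /var/log/mistral/ceph-install-workflow.log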
Created attachment 1347910 [details] ceph-install-workflow.log
I think the issue is with the list of environment files passed on upgrade. Specifically this:

  -e $THT/environments/ceph-radosgw.yaml \

should be:

  -e $THT/environments/ceph-ansible/ceph-rgw.yaml \

The same applies to the MDS service: the old environment file at environments/services/ceph-mds.yaml deploys using puppet-ceph; the new environment file to be used is environments/ceph-ansible/ceph-mds.yaml. A sketch of the corrected upgrade command follows below.

Should we turn this into an upgrade docs bug?
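For reference, a sketch of the corrected step 2 with the puppet-ceph radosgw environment swapped for its ceph-ansible counterpart. The other -e files are copied unchanged from the reproducer in the description; adjust to your own environment:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/ceph-ansible/ceph-ansible.yaml \
-e $THT/environments/ceph-ansible/ceph-rgw.yaml \
-e $THT/environments/tls-endpoints-public-ip.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/public_vip.yaml \
-e ~/openstack_deployment/environments/enable-tls.yaml \
-e ~/openstack_deployment/environments/inject-trust-anchor.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
-e $THT/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml

If MDS is enabled, swap environments/services/ceph-mds.yaml for environments/ceph-ansible/ceph-mds.yaml in the same way.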
(In reply to Giulio Fidente from comment #2)
> I think the issue is with the list of environment files passed on upgrade.
> Specifically this:
>
>   -e $THT/environments/ceph-radosgw.yaml \
>
> should be:
>
>   -e $THT/environments/ceph-ansible/ceph-rgw.yaml \
>
> The same applies to the MDS service: the old environment file at
> environments/services/ceph-mds.yaml deploys using puppet-ceph; the new
> environment file to be used is environments/ceph-ansible/ceph-mds.yaml.
>
> Should we turn this into an upgrade docs bug?

Sorry, I missed the environment files. I'm going to try using the ceph-ansible environments and see how it goes.
After switching the environment files to use the ceph-ansible ones, the upgrade completed OK, but several issues show up:

1. radosgw services are still running under systemd:

[root@overcloud-controller-0 heat-admin]# systemctl list-units -a | grep rados
ceph-radosgw.service            loaded active     running       Ceph rados gateway
ceph-radosgw.service            loaded activating auto-restart  Ceph RGW
system-ceph\x2dradosgw.slice    loaded active     active        system-ceph\x2dradosgw.slice
ceph-radosgw.target             loaded active     active        ceph target allowing to start/stop all ceph-radosgw@.service instances at once

[root@overcloud-controller-0 heat-admin]# systemctl status ceph-radosgw.service
● ceph-radosgw.service - Ceph rados gateway
   Loaded: loaded (/usr/lib/systemd/system/ceph-radosgw@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 18:39:43 UTC; 20h ago
 Main PID: 72610 (radosgw)
   CGroup: /system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw.service
           └─72610 /usr/bin/radosgw -f --cluster ceph --name client.radosgw.gateway --setuser ceph --setgroup ceph

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

[root@overcloud-controller-0 heat-admin]# systemctl status ceph-radosgw.service
● ceph-radosgw.service - Ceph RGW
   Loaded: loaded (/etc/systemd/system/ceph-radosgw@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2017-11-07 15:11:12 UTC; 8s ago
  Process: 137550 ExecStopPost=/usr/bin/docker stop ceph-rgw-overcloud-controller-0 (code=exited, status=1/FAILURE)
  Process: 137339 ExecStart=/usr/bin/docker run --rm --net=host --memory=1g --cpu-quota=100000 -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -e RGW_CIVETWEB_IP=10.0.0.145 -v /etc/localtime:/etc/localtime:ro -e CEPH_DAEMON=RGW -e CLUSTER=ceph -e RGW_CIVETWEB_PORT=8080 --name=ceph-rgw-overcloud-controller-0 docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest (code=exited, status=5)
  Process: 137331 ExecStartPre=/usr/bin/docker rm ceph-rgw-overcloud-controller-0 (code=exited, status=1/FAILURE)
  Process: 137323 ExecStartPre=/usr/bin/docker stop ceph-rgw-overcloud-controller-0 (code=exited, status=1/FAILURE)
 Main PID: 137339 (code=exited, status=5)

Nov 07 15:11:12 overcloud-controller-0 systemd[1]: Unit ceph-radosgw.service entered failed state.
Nov 07 15:11:12 overcloud-controller-0 systemd[1]: ceph-radosgw.service failed.

2. There is no radosgw container running after the upgrade completes:

[root@overcloud-controller-0 heat-admin]# docker ps | grep ceph
c4b3874e93ed  docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest  "/entrypoint.sh"  17 hours ago  Up 17 hours  ceph-mon-overcloud-controller-0

3.
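In case it helps triage, a minimal sketch of how one might check for and stop the leftover non-containerized radosgw so it cannot race the container for port 8080. The unit/instance names are assumptions based on the systemctl output above, not a documented upgrade step:

# Both a legacy unit (from /usr/lib/systemd/system) and the ceph-ansible docker
# wrapper (from /etc/systemd/system) are loaded; list them to see which is which
systemctl list-units -a | grep -i radosgw
systemctl cat 'ceph-radosgw@.service'

# Hypothetical cleanup: stop and disable the legacy (non-containerized) instance,
# substituting the real instance name shown by the listing above
systemctl stop 'ceph-radosgw@<legacy-instance>.service'
systemctl disable 'ceph-radosgw@<legacy-instance>.service'

# The containerized RGW should then be able to bind port 8080
docker ps | grep ceph-rgw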
After rebooting a controller node, the radosgw container starts but the haproxy container fails to start:

[root@overcloud-controller-2 heat-admin]# docker ps | grep ceph
9b76d42c4927  docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest  "/entrypoint.sh"  17 hours ago  Up 17 hours  ceph-rgw-overcloud-controller-2
e3b570004295  docker-registry.engineering.redhat.com/ceph/rhceph-2-rhel7:latest  "/entrypoint.sh"  17 hours ago  Up 17 hours  ceph-mon-overcloud-controller-2

Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp12/openstack-haproxy-docker:pcmklatest]
  haproxy-bundle-docker-0   (ocf::heartbeat:docker):   Started overcloud-controller-0
  haproxy-bundle-docker-1   (ocf::heartbeat:docker):   Started overcloud-controller-1
  haproxy-bundle-docker-2   (ocf::heartbeat:docker):   Stopped

Failed Actions:
* haproxy-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=89, status=complete, exitreason='Newly created docker container exited after start',
    last-rc-change='Mon Nov 6 21:51:59 2017', queued=0ms, exec=9021ms

The radosgw service binds on all addresses:

[root@overcloud-controller-2 heat-admin]# ps axu | grep radosgw
ceph      10068  0.1  0.2 3800048 33436 ?  Ssl  Nov06  2:01 /usr/bin/radosgw --cluster ceph --setuser ceph --setgroup ceph -d -n client.rgw.overcloud-controller-2 -k /var/lib/ceph/radosgw/overcloud-controller-2/keyring --rgw-socket-path= --rgw-zonegroup= --rgw-zone= --rgw-frontends=civetweb port=8080
root     140381  0.0  0.0  112664   972 pts/0  S+   15:14  0:00 grep --color=auto radosgw

[root@overcloud-controller-2 heat-admin]# netstat -tupan | grep radosgw
tcp   0   0 0.0.0.0:8080       0.0.0.0:*         LISTEN       10068/radosgw
tcp   0   0 10.0.0.153:38624   10.0.0.149:6800   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:35116   10.0.0.155:6802   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:34206   10.0.0.149:6802   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:59340   10.0.0.142:6802   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:52438   10.0.0.153:6789   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:60062   10.0.0.155:6800   ESTABLISHED  10068/radosgw
tcp   0   0 10.0.0.153:55256   10.0.0.142:6800   ESTABLISHED  10068/radosgw
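A minimal sketch of how one might confirm the port clash between the host radosgw (bound to 0.0.0.0:8080) and the haproxy bundle on the rebooted controller. The resource/container names are taken from the pcs output above; inspecting the container log this way is an assumption, not a documented procedure:

# Which process owns 8080? A radosgw on 0.0.0.0:8080 will shadow any haproxy bind
netstat -tupan | grep ':8080'

# Why did the haproxy bundle replica exit? Check pacemaker and the stopped container
pcs status | grep -A 5 haproxy-bundle
docker ps -a | grep haproxy
docker logs haproxy-bundle-docker-2 2>&1 | grep -iE 'bind|address already in use'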
Yes, it is fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1498183
Providing the QA ack.
A deployment with RGW failed. The RGW container fails to start because port 8080 is already in use:

civetweb: 0x55abc6c5adc0: set_ports_option: cannot bind to 172.17.3.20:8080: 98 (Address already in use)
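For completeness, a quick sketch of how to see what already holds 8080 when the RGW container fails with that civetweb bind error; plain ss/ps/docker, nothing deployment-specific:

# Show the listener currently bound to 8080 and the process behind it
ss -tlnp | grep ':8080'

# Cross-check whether a host radosgw (rather than the ceph-rgw container) owns it
ps axu | grep [r]adosgw
docker ps | grep ceph-rgw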
fix in https://github.com/ceph/ceph-ansible/releases/tag/v3.0.18
The verification failed with the same error using the rhceph:ceph-2-rhel-7-docker-candidate-43803-20180119213048 image.
Hi Yogev,

From what I've seen, rhceph:ceph-2-rhel-7-docker-candidate-43803-20180119213048 contains the fix for this issue. Any chance we can access the environment where you tried to deploy this image?
Tested it with the latest image and it passed.
*** Bug 1539192 has been marked as a duplicate of this bug. ***
Hi Aron,

Actually, this BZ was filed against the Ceph Storage product / Container component, but the actual solution here is more of a clarification of the procedure for upgrading OSP11 -> OSP12. I don't mind filling in the Doc Text field, but I'm not sure what the impact will be on this BZ's component assignment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0341
*** Bug 1536074 has been marked as a duplicate of this bug. ***