Bug 1576782

Summary: [UPDATE] update failed at Task [Retag pcmklatest to latest Cinder-Backup image]
Product: Red Hat OpenStack
Reporter: Raviv Bar-Tal <rbartal>
Component: openstack-tripleo-heat-templates
Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA
QA Contact: Raviv Bar-Tal <rbartal>
Severity: medium
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: dbecker, jschluet, jstransk, mbracho, mbultel, mburns, morazi
Target Milestone: rc
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-22.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 13:55:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                    Flags
controller sosreport part a    none
controller sosreport part b    none
controller sosreport part c    none
controller sosreport part d    none
controller sosreport part e    none
/home/stack files              none

Description Raviv Bar-Tal 2018-05-10 11:54:50 UTC
Description of problem:
The update from build 2018-05-07.2 failed during the controller update, in the task [Retag pcmklatest to latest Cinder-Backup image].
Error message:
"Error response from daemon: no such id: 192.168.24.1:8787/rhosp13/openstack-cinder-backup:2018-05-07.2"

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install OSP13 build 2018-05-07.2
2. Update the undercloud
3. Update the overcloud


Actual results:


Expected results:


Additional info:
See attached logs.
Automated job on the staging server:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-13-from-2018-05-07.2-HA-ipv4/1/console

Comment 1 Raviv Bar-Tal 2018-05-10 12:01:50 UTC
Created attachment 1434337 [details]
controller sosreport part a

Comment 2 Raviv Bar-Tal 2018-05-10 12:03:17 UTC
As a result of the error, controller-2 is offline:
[heat-admin@controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Thu May 10 11:59:37 2018
Last change: Wed May  9 16:26:09 2018 by root via cibadmin on controller-0

12 nodes configured
38 resources configured

Online: [ controller-0 controller-1 ]
OFFLINE: [ controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 redis-bundle-0@controller-0 redis-bundle-1@controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Stopped
 ip-192.168.24.8	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.13	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.3.10	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.19	(ocf::heartbeat:IPaddr2):	Started controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Stopped
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-0
 Docker container: openstack-cinder-backup [192.168.24.1:8787/rhosp13/openstack-cinder-backup:pcmklatest]
   openstack-cinder-backup-docker-0	(ocf::heartbeat:docker):	Started controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[heat-admin@controller-0 ~]$
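
When comparing several runs, the offline node and the stopped bundle replicas in the output above can be extracted mechanically. This is a minimal sketch in Python, assuming the `pcs status` text layout shown in this comment; the function name `summarize_pcs_status` is hypothetical and not part of pcs or TripleO, and the sample is a shortened excerpt of the output above.

```python
import re


def summarize_pcs_status(text):
    """Return (offline_nodes, stopped_resources) parsed from `pcs status` text.

    Assumes the layout shown above: one "OFFLINE: [ ... ]" line, and
    resource lines of the form "<name> (ocf::...): <state> [<node>]".
    """
    offline = []
    stopped = []
    for line in text.splitlines():
        stripped = line.strip()
        match = re.match(r"^OFFLINE:\s*\[(.*)\]$", stripped)
        if match:
            offline.extend(match.group(1).split())
            continue
        fields = stripped.split()
        # A resource line has the agent in field 1 and the state in field 2.
        if (len(fields) >= 3
                and fields[1].startswith("(ocf::")
                and fields[2] == "Stopped"):
            stopped.append(fields[0])
    return offline, stopped


# Shortened excerpt of the status output in this comment.
sample = """\
Online: [ controller-0 controller-1 ]
OFFLINE: [ controller-2 ]
   rabbitmq-bundle-1\t(ocf::heartbeat:rabbitmq-cluster):\tStarted controller-1
   rabbitmq-bundle-2\t(ocf::heartbeat:rabbitmq-cluster):\tStopped
   galera-bundle-2\t(ocf::heartbeat:galera):\tStopped
"""

offline_nodes, stopped_resources = summarize_pcs_status(sample)
print(offline_nodes, stopped_resources)
```

The parser only looks at the state column, so it would need adjusting if a different pcs version prints resources in another layout.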

Comment 3 Raviv Bar-Tal 2018-05-10 12:05:14 UTC
Created attachment 1434338 [details]
controller sosreport part b

Comment 4 Raviv Bar-Tal 2018-05-10 12:06:26 UTC
Created attachment 1434339 [details]
controller sosreport part c

Comment 5 Raviv Bar-Tal 2018-05-10 12:08:17 UTC
Created attachment 1434340 [details]
controller sosreport part d

Comment 6 Raviv Bar-Tal 2018-05-10 12:10:35 UTC
Created attachment 1434341 [details]
controller sosreport part e

Comment 7 Raviv Bar-Tal 2018-05-10 12:12:05 UTC
Created attachment 1434342 [details]
/home/stack files

Comment 9 Jiri Stransky 2018-05-11 14:59:31 UTC
Looking at the logs and the code, this probably affects the cinder-backup service specifically. I have a fix proposal but haven't been able to test it yet, as I hit unrelated issues with my upstream environment.

Raviv, to move forward with testing, I think you can either:

* apply the intended fix https://review.openstack.org/567806 to your environment (this would be nice as we'd also pre-validate the fix downstream),

or

* temporarily remove environments/cinder-backup.yaml from the command lines used when testing.
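
For the second option, the change amounts to dropping the `-e` pair that points at environments/cinder-backup.yaml from the command line. This is a minimal sketch, assuming the command is held as a Python argument list and each environment file is passed as a separate `-e <path>` or `--environment-file <path>` pair; the helper name `drop_environment` and the example command and paths are hypothetical, not taken from the job in this bug.

```python
def drop_environment(argv, basename="cinder-backup.yaml"):
    """Return a copy of argv without any environment-file pair
    whose path ends in the given file name."""
    result = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        is_env_flag = arg in ("-e", "--environment-file")
        if is_env_flag and i + 1 < len(argv):
            path = argv[i + 1]
            if path.rstrip("/").split("/")[-1] == basename:
                i += 2  # skip both the flag and its path
                continue
        result.append(arg)
        i += 1
    return result


# Illustrative command line only; the real job's arguments may differ.
cmd = [
    "openstack", "overcloud", "update", "prepare",
    "--templates",
    "-e", "/usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml",
    "-e", "/home/stack/network-environment.yaml",
]

filtered = drop_environment(cmd)
print(" ".join(filtered))
```

The `--environment-file=<path>` single-token form is not handled here and would need an extra check.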

Comment 10 Raviv Bar-Tal 2018-05-14 11:25:43 UTC
I manually applied the patch and the update passed this stage.
We should get this patch merged and landed downstream ASAP.

Comment 12 Jiri Stransky 2018-05-15 13:16:57 UTC
The patch is hitting instability in the upstream CI, but once it lands at least on master, I think we can propose a downstream backport without waiting for the upstream one.

Comment 21 errata-xmlrpc 2018-06-27 13:55:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086