Bug 1631687 - upgrade OCP on Atomic Host 7.4.5 failed
Summary: upgrade OCP on Atomic Host 7.4.5 failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Giuseppe Scrivano
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-21 09:42 UTC by Weihua Meng
Modified: 2019-06-26 09:08 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-26 09:07:51 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:1605 | 0 | None | None | None | 2019-06-26 09:07:59 UTC

Description Weihua Meng 2018-09-21 09:42:05 UTC
Description of problem:
Upgrading OCP on Atomic Host 7.4.5 fails.
The upgrade succeeds on Atomic Host 7.5.3.
I did not see anything in the docs saying that upgrading the AH OS is necessary: https://docs.openshift.com/container-platform/3.10/upgrading/index.html
If it is necessary, it would be better for the playbook to do it, or to print a message about it.
OCP v3.9 on AH 7.4.5 upgraded successfully to v3.10.
 
Version-Release number of the following components:
openshift-ansible-3.11.12-1.git.0.0c64f7a.el7.noarch

Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.10 on Atomic Host 7.4.5
2. Upgrade to v3.11


Actual results:
Upgrade failed

Failure summary:


  1. Hosts:    wmengugah745ol-node-1.0921-hb2.qe.rhcloud.com, wmengugah745ol-node-registry-router-1.0921-hb2.qe.rhcloud.com
     Play:     Update registry authentication credentials
     Task:     Install or Update node system container
     Message:  time="2018-09-21T08:10:07Z" level=fatal msg="Error: blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b is already present, but with size 200670683 instead of 74930327" 
               
               

  2. Hosts:    wmengugah745ol-master-etcd-1.0921-hb2.qe.rhcloud.com
     Play:     Update registry authentication credentials
     Task:     Install or Update node system container
     Message:  time="2018-09-21T08:10:09Z" level=fatal msg="Error: blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b is already present, but with size 200670683 instead of 74930327" 
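
The fatal message is identical in both failure entries, so it can be picked out of playbook output mechanically. A minimal Python sketch, assuming the message wording stays exactly as quoted above; the names `find_blob_mismatch`, `present_size` and `wanted_size` are made up for illustration and are not part of openshift-ansible, atomic, or skopeo:

```python
import re

# Pattern modeled on the fatal message quoted in the failure summary.
# The group names are illustrative labels, not terms used by the tools.
BLOB_MISMATCH = re.compile(
    r"blob (?P<digest>sha256:[0-9a-f]{64}) is already present, "
    r"but with size (?P<present_size>\d+) instead of (?P<wanted_size>\d+)"
)


def find_blob_mismatch(message):
    """Return (digest, present_size, wanted_size), or None if no match."""
    match = BLOB_MISMATCH.search(message)
    if match is None:
        return None
    return (
        match["digest"],
        int(match["present_size"]),
        int(match["wanted_size"]),
    )


# The message reported for the node hosts, copied from the summary.
sample = (
    'time="2018-09-21T08:10:07Z" level=fatal msg="Error: blob '
    "sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b "
    'is already present, but with size 200670683 instead of 74930327"'
)
print(find_blob_mismatch(sample))
```

Any line for which this returns a tuple matches the failure reported here; lines without the message return `None`.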

Expected results:
Upgrade succeeded
Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 4 Scott Dodson 2018-09-21 12:22:22 UTC
This is failing in a module call that updates the system container using the atomic command. Moving over to containers team.

Here's the module call

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/node_system_container_install.yml#L2-L28

Here's the source for that module

https://github.com/openshift/openshift-ansible/blob/master/roles/lib_openshift/library/oc_atomic_container.py

Comment 5 Scott Dodson 2018-09-21 13:21:12 UTC
We should test this using Atomic Host 7.5 as the minimum version, since that was required by 3.10.

https://access.redhat.com/articles/2176281#comment-1326561

Comment 6 Antonio Murdaca 2018-09-21 15:15:50 UTC
This is failing in the containers/image Copy method; I'm not sure where skopeo or containers/image is being used. Does anyone know? Miloslav, do you know what's happening?

Comment 7 Antonio Murdaca 2018-09-21 16:11:51 UTC
Failure happens during this call to "atomic install" https://github.com/openshift/openshift-ansible/blob/master/roles/lib_openshift/library/oc_atomic_container.py#L81

which in turn calls into "skopeo copy" (iirc, Giuseppe?).

Figuring out why we're hitting this corner case and how to solve it.

Comment 8 Giuseppe Scrivano 2018-09-21 20:13:28 UTC
I think the issue is caused by the old version of skopeo present on AH 7.4.5 that didn't correctly report the layer size from the ostree storage.

As a workaround, the metadata of the system container branches can be deleted, which forces the images to be fully re-fetched: "ostree refs --delete ociimage"
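
Because the workaround deletes refs, it may be worth wrapping it in a dry-run guard. A minimal Python sketch, assuming a Python 3 interpreter is available wherever it is run; the helper names `ociimage_cleanup_command` and `run_cleanup` are hypothetical. The comment above does not say which ostree repository the command targets or which privileges it needs, so the sketch passes exactly the arguments quoted and nothing more:

```python
import subprocess


def ociimage_cleanup_command(prefix="ociimage"):
    """Build the argv for the workaround command quoted in the comment."""
    return ["ostree", "refs", "--delete", prefix]


def run_cleanup(dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = ociimage_cleanup_command()
    print(" ".join(cmd))
    if dry_run:
        return None
    # check=True raises CalledProcessError if ostree exits non-zero.
    return subprocess.run(cmd, check=True)


run_cleanup()  # dry run: prints the command without touching any refs
```

Calling `run_cleanup(dry_run=False)` on the affected host would actually delete the refs, after which the upgrade has to re-fetch the system container images.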

Comment 9 N. Harrison Ripps 2018-09-21 20:48:25 UTC
Per discussion with Mrunal: now that a workaround has been identified, we will defer this to 3.11.z.

Comment 10 Antonio Murdaca 2018-09-24 07:30:30 UTC
Alright, so for 3.11.z this is just a matter of using a newer skopeo, correct? Lokesh, could you look into building a newer skopeo?

Comment 11 Giuseppe Scrivano 2018-09-24 12:44:01 UTC
It works if the skopeo used to install OCP and the skopeo used to upgrade it are both updated. An updated skopeo will still fail to upgrade if OCP was installed using the old version.

Comment 17 Giuseppe Scrivano 2019-04-12 16:41:34 UTC
This has been fixed.

Comment 18 weiwei jiang 2019-04-29 11:07:12 UTC
Checked an upgrade from v3.10.127 to v3.11.98 on Atomic Host 7.4.5 and did not hit this issue, so moving to VERIFIED.

Comment 20 errata-xmlrpc 2019-06-26 09:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

