Bug 1631687

Summary: upgrade OCP on Atomic Host 7.4.5 failed
Product: OpenShift Container Platform Reporter: Weihua Meng <wmeng>
Component: Containers Assignee: Giuseppe Scrivano <gscrivan>
Status: CLOSED ERRATA QA Contact: weiwei jiang <wjiang>
Severity: high Docs Contact:
Priority: high    
Version: 3.11.0 CC: amurdaca, aos-bugs, gscrivan, jokerman, mitr, mmccomas, mpatel
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-26 09:07:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Weihua Meng 2018-09-21 09:42:05 UTC
Description of problem:
Upgrading OCP from v3.10 to v3.11 on Atomic Host 7.4.5 fails.
The same upgrade succeeds on Atomic Host 7.5.3.
The upgrade documentation does not say that upgrading the Atomic Host OS is necessary: https://docs.openshift.com/container-platform/3.10/upgrading/index.html
If it is necessary, the playbook should either do it or print a message about it.
OCP v3.9 on AH 7.4.5 upgraded successfully to v3.10.
 
Version-Release number of the following components:
openshift-ansible-3.11.12-1.git.0.0c64f7a.el7.noarch

Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.10 on Atomic Host 7.4.5
2. Upgrade to v3.11


Actual results:
Upgrade failed

Failure summary:


  1. Hosts:    wmengugah745ol-node-1.0921-hb2.qe.rhcloud.com, wmengugah745ol-node-registry-router-1.0921-hb2.qe.rhcloud.com
     Play:     Update registry authentication credentials
     Task:     Install or Update node system container
     Message:  time="2018-09-21T08:10:07Z" level=fatal msg="Error: blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b is already present, but with size 200670683 instead of 74930327" 
               
               

  2. Hosts:    wmengugah745ol-master-etcd-1.0921-hb2.qe.rhcloud.com
     Play:     Update registry authentication credentials
     Task:     Install or Update node system container
     Message:  time="2018-09-21T08:10:09Z" level=fatal msg="Error: blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b is already present, but with size 200670683 instead of 74930327" 

Expected results:
Upgrade succeeded
Additional info:
Please attach logs from ansible-playbook with the -vvv flag
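
For triage, the fatal message in the failure summary above can be picked out of ansible-playbook output mechanically. The following is only a minimal sketch: the function name and the regular expression are hypothetical, written against the exact wording of the message quoted in this report, and are not part of openshift-ansible, atomic, or skopeo.

```python
import re

# Hypothetical pattern matching the wording of the fatal message quoted in
# this report; other tool versions may phrase the error differently.
BLOB_SIZE_MISMATCH = re.compile(
    r"blob (?P<digest>sha256:[0-9a-f]+) is already present, "
    r"but with size (?P<present>\d+) instead of (?P<expected>\d+)"
)


def parse_blob_size_mismatch(message):
    """Return (digest, present_size, expected_size), or None if no match."""
    match = BLOB_SIZE_MISMATCH.search(message)
    if match is None:
        return None
    return (
        match.group("digest"),
        int(match.group("present")),
        int(match.group("expected")),
    )
```

Calling parse_blob_size_mismatch() on either Message line above returns the blob digest together with the two sizes (200670683 and 74930327), which makes it easy to confirm that all failing hosts report the same blob.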

Comment 4 Scott Dodson 2018-09-21 12:22:22 UTC
This is failing in a module call that updates the system container using the atomic command. Moving over to containers team.

Here's the module call

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/node_system_container_install.yml#L2-L28

Here's the source for that module

https://github.com/openshift/openshift-ansible/blob/master/roles/lib_openshift/library/oc_atomic_container.py

Comment 5 Scott Dodson 2018-09-21 13:21:12 UTC
We should test this using Atomic Host 7.5 as the minimum version, since that was required by 3.10.

https://access.redhat.com/articles/2176281#comment-1326561

Comment 6 Antonio Murdaca 2018-09-21 15:15:50 UTC
This is failing in the containers/image Copy method. I'm not sure where skopeo or containers/image is being used here. Does anyone know? Miloslav, do you know what's happening?

Comment 7 Antonio Murdaca 2018-09-21 16:11:51 UTC
The failure happens during this call to "atomic install": https://github.com/openshift/openshift-ansible/blob/master/roles/lib_openshift/library/oc_atomic_container.py#L81

which in turn calls into "skopeo copy" (iirc, Giuseppe?).

Figuring out why we're hitting this corner case and how to solve it.

Comment 8 Giuseppe Scrivano 2018-09-21 20:13:28 UTC
I think the issue is caused by the old version of skopeo present on AH 7.4.5 that didn't correctly report the layer size from the ostree storage.

As a workaround, the metadata of the system container branches can be deleted, which forces the images to be fully re-fetched: "ostree refs --delete ociimage"
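
If the workaround has to be applied on several hosts, it could be wrapped as below. This is a minimal sketch under assumptions: the function names and the dry_run parameter are hypothetical, the only command used is the one quoted in this comment, and it is assumed that ostree is on PATH, that the caller has sufficient privileges, and that ostree's default repository is the one holding the system container images.

```python
import subprocess


def ociimage_refs_delete_command():
    """Build the workaround command exactly as quoted in the comment."""
    return ["ostree", "refs", "--delete", "ociimage"]


def delete_ociimage_refs(dry_run=True):
    """Delete the ociimage refs so the images are fully re-fetched.

    With dry_run=True (the default) nothing is executed and the command
    line is only returned, so it can be reviewed first.
    """
    command = ociimage_refs_delete_command()
    if not dry_run:
        # Assumption: run on the affected Atomic Host with enough privileges.
        subprocess.run(command, check=True)
    return " ".join(command)
```

Calling delete_ociimage_refs() with the default dry_run=True only returns the command line; pass dry_run=False on the affected host to run it, then retry the upgrade.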

Comment 9 N. Harrison Ripps 2018-09-21 20:48:25 UTC
Per discussion with Mrunal: now that a workaround has been identified, we will defer this to 3.11.z.

Comment 10 Antonio Murdaca 2018-09-24 07:30:30 UTC
Alright, so for 3.11.z this is just going to be a matter of using a newer skopeo, correct? Lokesh, could you look into building a newer skopeo?

Comment 11 Giuseppe Scrivano 2018-09-24 12:44:01 UTC
It works if the skopeo used to install OCP and the skopeo used to upgrade it are both updated. An updated skopeo will still fail to upgrade if OCP was installed using the old version.

Comment 17 Giuseppe Scrivano 2019-04-12 16:41:34 UTC
This has been fixed.

Comment 18 weiwei jiang 2019-04-29 11:07:12 UTC
Checked an upgrade from v3.10.127 to v3.11.98 on Atomic Host 7.4.5 and did not hit this issue, so moving to verified.

Comment 20 errata-xmlrpc 2019-06-26 09:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605