Bug 1707928

Summary: Installs failing on latest 4.1 Beta 5 nightly (OCP) builds: controller version mismatch for rendered-master
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Release Assignee: Vikas Laad <vlaad>
Status: CLOSED CURRENTRELEASE QA Contact: Peter Ruan <pruan>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: amurdaca, aos-bugs, calfonso, ccoleman, jokerman, lserven, mmccomas, sbatsche, smunilla, walters, wking, wsun
Target Milestone: --- Keywords: BetaBlocker, TestBlocker
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-10 11:48:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1706606    
Attachments:
Description                                  Flags
oc -n origin get -o yaml imagestream 4.1     none
install log                                  none

Description Mike Fiedler 2019-05-08 17:24:59 UTC
Description of problem:

Installs of the latest 4.1 nightly (OCP) builds failed with this error:

Installing from release registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-150359
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-wf3h7rx7-0e31a.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.13.4+2e01d67 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-wf3h7rx7-0e31a.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-0.nightly-2019-05-08-150359 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-4dff26ada6de23170e5f280a5b37b927 expected 4.1.0-201905071832-dirty has 4.1.0-201905081021-dirty, retrying: timed out waiting for the condition"

This blocks verification of bug 1706606
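
For triage, the two versions can be pulled out of the fatal line with a small filter. This is only a sketch: mismatch_versions is a hypothetical helper name, and the pattern assumes the message wording stays exactly as in the log above.

```shell
# Sketch: extract the "expected" and "has" controller versions from a
# machine-config failure message read on stdin.
# Assumption: the wording "controller version mismatch for <name> expected X has Y"
# matches the log quoted above; other releases may phrase it differently.
mismatch_versions() {
  sed -n 's/.*controller version mismatch for [^ ]* expected \([^ ]*\) has \([^ ,]*\).*/expected=\1 has=\2/p'
}
```

Piping the installer log through mismatch_versions prints one "expected=... has=..." line per matching message and nothing for other lines.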


Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-08-150359	
4.1.0-0.nightly-2019-05-08-133845

How reproducible: 2/2 builds so far.   Trying a newer one and will capture full install logs

Comment 2 Colin Walters 2019-05-08 17:54:37 UTC
Is there a CI job or something with at least openshift/must-gather results from this?

Comment 3 W. Trevor King 2019-05-08 17:56:13 UTC
We had two of these in CI; attaching my notes so far:

$ cat /tmp/timeline.txt 
* 2019-05-08T00:27:55Z machine-config-operator tag at generation 458, commit 335f82e64973a3b8ddea5cd50a8b7506f4e4cefe [1]
* ...additional machine-config-operator tag bumps...
* 2019-05-08T13:04:26Z machine-config-operator tag at generation 474, commit fe5ae490aa3a339f2f1865ba368a133f8ec85f39 [2]
* 2019-05-08T13:42:35Z machine-config-operator in e2e-aws-4.1 job 172 at commit 335f82e64973a3b8ddea5cd50a8b7506f4e4cefe [4]
* 2019-05-08T15:08:08Z machine-config-operator in e2e-aws-4.1 job 173 at commit 335f82e64973a3b8ddea5cd50a8b7506f4e4cefe [5]
* 2019-05-08T16:59:48Z machine-config-operator tag at generation 476, commit 51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf [3].  There is no generation 475 in the history [2,3].

[0]: All of the following is.yaml consumers are fed by a single:
$ oc -n origin get -o yaml imagestream 4.1 >/tmp/is.yaml

[1]:
$ yaml2json </tmp/is.yaml | jq '.status.tags[] | select(.tag == "machine-config-operator").items[4]'
{
 "image": "sha256:6ac2fcaa74f4fab8859e55ec5bb6cf5b29376ed064ec78c04f9ffde9fa6326ea",
 "generation": "458",
 "dockerImageReference": "docker-registry.default.svc:5000/origin/4.1@sha256:6ac2fcaa74f4fab8859e55ec5bb6cf5b29376ed064ec78c04f9ffde9fa6326ea",
 "created": "2019-05-08T00:27:55Z"
}
$ oc image info registry.svc.ci.openshift.org/origin/4.1@sha256:6ac2fcaa74f4fab8859e55ec5bb6cf5b29376ed064ec78c04f9ffde9fa6326ea | grep commit.id
             io.openshift.build.commit.id=335f82e64973a3b8ddea5cd50a8b7506f4e4cefe

[2]:
$ yaml2json </tmp/is.yaml | jq '.status.tags[] | select(.tag == "machine-config-operator").items[1]'
{
  "image": "sha256:0aab5df0cc6889bb94a993bd01ecfa02bdcafc1f532a1dbe82a6c252285b0c14",
  "generation": "474",
  "created": "2019-05-08T13:04:26Z",
  "dockerImageReference": "docker-registry.default.svc:5000/origin/4.1@sha256:0aab5df0cc6889bb94a993bd01ecfa02bdcafc1f532a1dbe82a6c252285b0c14"
}
$ oc image info registry.svc.ci.openshift.org/origin/4.1@sha256:0aab5df0cc6889bb94a993bd01ecfa02bdcafc1f532a1dbe82a6c252285b0c14 | grep commit.id
             io.openshift.build.commit.id=fe5ae490aa3a339f2f1865ba368a133f8ec85f39

[3]:
$ yaml2json </tmp/is.yaml | jq '.status.tags[] | select(.tag == "machine-config-operator").items[0]'
{
  "created": "2019-05-08T16:59:48Z",
  "dockerImageReference": "docker-registry.default.svc:5000/origin/4.1@sha256:b10b49b0845eb1b8632eb6cf4b6ff42882d233582f0ff38c42797b677d066fe0",
  "image": "sha256:b10b49b0845eb1b8632eb6cf4b6ff42882d233582f0ff38c42797b677d066fe0",
  "generation": "476"
}
$ oc image info registry.svc.ci.openshift.org/origin/4.1@sha256:b10b49b0845eb1b8632eb6cf4b6ff42882d233582f0ff38c42797b677d066fe0 | grep commit.id
             io.openshift.build.commit.id=51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf

[4]:
$ date --iso=s --utc --date="@$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/172/started.json | jq -r .timestamp)"
2019-05-08T13:42:35+0000
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/172/artifacts/release-images-latest/release-images-latest | jq '.spec.tags[] | select(.name == "machine-config-operator")'
{
  "name": "machine-config-operator",
  "annotations": {
    "io.openshift.build.commit.id": "335f82e64973a3b8ddea5cd50a8b7506f4e4cefe",
    "io.openshift.build.commit.ref": "",
    "io.openshift.build.source-location": "https://github.com/openshift/machine-config-operator"
  },
  "from": {
    "kind": "DockerImage",
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79229c53da661e9f1dcf5624ff9e7eb0b995a4bc53d24268a37251d75830452c"
  },
  "generation": 2,
  "importPolicy": {},
  "referencePolicy": {
    "type": "Source"
  }
}

[5]:
$ date --iso=s --utc --date="@$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/173/started.json | jq -r .timestamp)"
2019-05-08T15:08:08+0000
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/173/artifacts/release-images-latest/release-images-latest | jq '.spec.tags[] | select(.name == "machine-config-operator")'
{
  "name": "machine-config-operator",
  "annotations": {
    "io.openshift.build.commit.id": "335f82e64973a3b8ddea5cd50a8b7506f4e4cefe",
    "io.openshift.build.commit.ref": "",
    "io.openshift.build.source-location": "https://github.com/openshift/machine-config-operator"
  },
  "from": {
    "kind": "DockerImage",
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79229c53da661e9f1dcf5624ff9e7eb0b995a4bc53d24268a37251d75830452c"
  },
  "generation": 2,
  "importPolicy": {},
  "referencePolicy": {
    "type": "Source"
  }
}

Comment 4 W. Trevor King 2019-05-08 17:57:10 UTC
Created attachment 1565756 [details]
oc -n origin get -o yaml imagestream 4.1

Comment 6 W. Trevor King 2019-05-08 18:02:22 UTC
Not really new info, but just underlining the failure to pull in the current machine-config-operator tag:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/172/build-log.txt
...
2019/05/08 13:42:58 Resolved release:latest registry.svc.ci.openshift.org/ocp/release@sha256:8b798278fa803005f8d24300261a89905f3f328fb68d4228324c7e4737423428
...
2019/05/08 13:42:58 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-op-ssm1lmkd/stable:${component}
2019/05/08 13:43:06 Importing release image latest
2019/05/08 13:44:48 Imported release 4.1.0-0.nightly-2019-05-08-133845 created at 2019-05-08 13:40:47 +0000 UTC with 83 images to tag release:latest
...
2019/05/08 13:44:50 Running pod e2e-aws
Installing from release registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-133845
...
$ oc adm release info --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release@sha256:8b798278fa803005f8d24300261a89905f3f328fb68d4228324c7e4737423428
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79229c53da661e9f1dcf5624ff9e7eb0b995a4bc53d24268a37251d75830452c
$ oc adm release info --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-133845
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79229c53da661e9f1dcf5624ff9e7eb0b995a4bc53d24268a37251d75830452c
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79229c53da661e9f1dcf5624ff9e7eb0b995a4bc53d24268a37251d75830452c | grep commit.id
             io.openshift.build.commit.id=335f82e64973a3b8ddea5cd50a8b7506f4e4cefe

Comment 7 Mike Fiedler 2019-05-08 18:03:40 UTC
Created attachment 1565758 [details]
install log

4.1.0-0.nightly-2019-05-08-161156 failed as well.  Full install log attached.

Comment 9 W. Trevor King 2019-05-08 18:11:14 UTC
Simple test for a given release image:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-161156 | grep machine-config-operator
  machine-config-controller                     https://github.com/openshift/machine-config-operator                       fe5ae490aa3a339f2f1865ba368a133f8ec85f39  <- this mismatch is not good
  machine-config-daemon                         https://github.com/openshift/machine-config-operator                       335f82e64973a3b8ddea5cd50a8b7506f4e4cefe  <- these are older commits
  machine-config-operator                       https://github.com/openshift/machine-config-operator                       335f82e64973a3b8ddea5cd50a8b7506f4e4cefe
  machine-config-server                         https://github.com/openshift/machine-config-operator                       335f82e64973a3b8ddea5cd50a8b7506f4e4cefe
  setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       335f82e64973a3b8ddea5cd50a8b7506f4e4cefe

Vikas is kicking things to get those rebuilt for nightlies so they all match.  But we should figure out the bug behind this that is probably also the source of the mismatches in CI.
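
The same check can be turned into a pass/fail test for scripting. A minimal sketch, assuming the three-column layout (image name, repository URL, commit) shown in the output above; repo_commits_match is a hypothetical helper name, not an existing tool.

```shell
# Sketch: succeed only if every image built from the given repository reports
# exactly one distinct commit. Reads `oc adm release info --commits` output on
# stdin. Assumption: column 2 is the repository URL and column 3 is the commit.
# Also fails if no line matches the repository at all.
repo_commits_match() {
  awk -v repo="$1" '$2 == repo && !seen[$3]++ {n++} END {exit !(n == 1)}'
}
```

For example, piping the --commits output into "repo_commits_match https://github.com/openshift/machine-config-operator" would return non-zero for the release above, because two different commits are listed for that repository.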

Comment 10 W. Trevor King 2019-05-08 18:15:01 UTC
Current CI tip no longer exhibits this bug:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-05-08-180426 | grep machine-config-operator
  machine-config-controller                     https://github.com/openshift/machine-config-operator                       51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf
  machine-config-daemon                         https://github.com/openshift/machine-config-operator                       51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf
  machine-config-operator                       https://github.com/openshift/machine-config-operator                       51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf
  machine-config-server                         https://github.com/openshift/machine-config-operator                       51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf
  setup-etcd-environment                        https://github.com/openshift/machine-config-operator                       51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf

but is likely still vulnerable to whatever the issue is for future releases.

Comment 12 W. Trevor King 2019-05-08 19:30:31 UTC
I've filed https://github.com/openshift/ci-operator/pull/343 to help with debugging 'oc adm release new ...' in the CI context, but it's probably going to be a bit before that lands, gets deployed, and we hit whatever the issue is again in CI.  I dunno if we want to keep digging into 'oc adm release new' in the meantime, or just wait for a localized nightly fix.  As far as release blocking, probably just wait until we have a consistent nightly that addresses our other remaining beta blockers.

Comment 13 W. Trevor King 2019-05-08 19:36:16 UTC
Checking for this issue more broadly in the most recent nightly [1], here's checking for repositories which back multiple images:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-190528 | grep github | awk '{print $2}' | sort | uniq -c | grep -v '^ *1 '
      2 https://github.com/openshift/containernetworking-plugins
      2 https://github.com/openshift/installer
      3 https://github.com/openshift/jenkins
      2 https://github.com/openshift/kubecsr
      5 https://github.com/openshift/machine-config-operator
      7 https://github.com/openshift/ose
      2 https://github.com/openshift/prometheus-operator

And checking their versions:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-190528 | grep github | awk '{print $2, $3, $1}' | sort | grep 'github.com/openshift/\(containernetworking-plugins\|installer\|jenkins\|kubecst\|machine-config-operator\|ose\|prometheus-operator\)'
https://github.com/openshift/containernetworking-plugins a59efc83a90bf0eb25d56de88532d627132580e4 container-networking-plugins-supported
https://github.com/openshift/containernetworking-plugins a59efc83a90bf0eb25d56de88532d627132580e4 container-networking-plugins-unsupported
https://github.com/openshift/installer 6e5093d4e4d0e2069957a54db95c69b9eaa2b3a2 installer
https://github.com/openshift/installer 6e5093d4e4d0e2069957a54db95c69b9eaa2b3a2 installer-artifacts
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins-agent-maven
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins-agent-nodejs
https://github.com/openshift/machine-config-operator 51e89e958ed5a5e0ae7a0d433c3b0ec7520fe6bf machine-config-operator    <- not a match; Vikas is rebuilding to fix
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-controller
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-daemon
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-server
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 setup-etcd-environment
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 cli-artifacts
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 deployer
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 tests
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 cli
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hyperkube
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hypershift
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 node
https://github.com/openshift/ose-ovn-kubernetes 4105e7303a13114932b7c2d1ad376b5162383c78 ovn-kubernetes
https://github.com/openshift/prometheus-operator 73ed419e5254d39ad301403ddd48e190d8e02fd8 prometheus-config-reloader
https://github.com/openshift/prometheus-operator 73ed419e5254d39ad301403ddd48e190d8e02fd8 prometheus-operator

So we look good outside of the rebuild Vikas has running now.

[1]: https://openshift-release.svc.ci.openshift.org/

Comment 15 W. Trevor King 2019-05-08 19:40:35 UTC
Ah, I guess there are also OSE mismatches, e.g. 4b4690a2806792061afcec394eadba7363736e02 cli-artifacts vs. 4f60fbe4aa866d23974d44499d84f1d370e51f94 cli.  Dunno if that's an issue or not, but:

$ git log --oneline --graph --decorate 4b4690a28^...4f60fbe4aa86
* 4f60fbe4aa (tag: v4.1.0-201905080833, origin/master, origin/HEAD) Automatic commit of package [openshift] release [4.1.0-201905080833].
* d34b8bd345 (tag: v4.1.0-201905080646) Automatic commit of package [openshift] release [4.1.0-201905080646].
*   8f5651d200 Merge remote-tracking branch master
|\  
| *   21f1917756 Merge pull request #22759 from dcbw/proxy-init-node-network-ready
| |\  
| | * bd17f3337e sdn: wait for proxy initialization before declaring node network ready
| | * 90701751c4 sdn: move CNI config file handling to pkg/cmd/openshift-sdn
* | 01d26fe197 (tag: v4.1.0-201905072245) Automatic commit of package [openshift] release [4.1.0-201905072245].
* | 516fd08dc3 (tag: v4.1.0-201905072232) Automatic commit of package [openshift] release [4.1.0-201905072232].
* |   925e23b037 Merge remote-tracking branch master
|\ \  
| |/  
| * e8a2d41602 Merge pull request #22796 from bparees/disable_reg
| * 8bfec95f52 disable broken test
* 4b4690a280 (tag: v4.1.0-201905071548) Automatic commit of package [openshift] release [4.1.0-201905071548].
$ git diff --stat 4b4690a28..4f60fbe4aa86
 .tito/packages/openshift                      |  2 +-
 origin.spec                                   |  6 +-
 pkg/cmd/openshift-sdn/cmd.go                  | 32 ++++++++--
 pkg/cmd/openshift-sdn/proxy.go                | 90 ++++++++++++++++++++++++---
 pkg/cmd/openshift-sdn/sdn.go                  | 25 +++++---
 pkg/network/node/node.go                      | 38 +----------
 test/integration/dockerregistryclient_test.go |  2 +-
 7 files changed, 135 insertions(+), 60 deletions(-)

doesn't look too bad for the older deployer or cli-artifacts.  There is the one test change, but I don't think folks care about the referenced tests image anyway.

Comment 16 W. Trevor King 2019-05-08 19:41:45 UTC
Moving back to POST.  Per comment 13, 4.1.0-0.nightly-2019-05-08-190528 is still broken and we're in the process of building another attempt.

Comment 17 W. Trevor King 2019-05-08 19:45:32 UTC
4.1.0-0.nightly-2019-05-08-194149 resolves the MCO divergence:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-194149 | grep github | awk '{print $2}' | sort | uniq -c | grep -v '^ *1 '
      2 https://github.com/openshift/containernetworking-plugins
      2 https://github.com/openshift/installer
      3 https://github.com/openshift/jenkins
      2 https://github.com/openshift/kubecsr
      5 https://github.com/openshift/machine-config-operator
      7 https://github.com/openshift/ose
      2 https://github.com/openshift/prometheus-operator
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-194149 | grep github | awk '{print $2, $3, $1}' | sort | grep 'github.com/openshift/\(containernetworking-plugins\|installer\|jenkins\|kubecst\|machine-config-operator\|ose\|prometheus-operator\)'
https://github.com/openshift/containernetworking-plugins a59efc83a90bf0eb25d56de88532d627132580e4 container-networking-plugins-supported
https://github.com/openshift/containernetworking-plugins a59efc83a90bf0eb25d56de88532d627132580e4 container-networking-plugins-unsupported
https://github.com/openshift/installer 6e5093d4e4d0e2069957a54db95c69b9eaa2b3a2 installer
https://github.com/openshift/installer 6e5093d4e4d0e2069957a54db95c69b9eaa2b3a2 installer-artifacts
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins-agent-maven
https://github.com/openshift/jenkins 599ecd4cf25fc809446e5746afb23b95f5940bc3 jenkins-agent-nodejs
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-controller
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-daemon
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-operator
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 machine-config-server
https://github.com/openshift/machine-config-operator 6e7615ad4926831ddaa63fdb054fcd9cf6517b46 setup-etcd-environment
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 cli-artifacts
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 deployer
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 tests
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 cli
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hyperkube
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hypershift
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 node
https://github.com/openshift/ose-ovn-kubernetes 4105e7303a13114932b7c2d1ad376b5162383c78 ovn-kubernetes
https://github.com/openshift/prometheus-operator 73ed419e5254d39ad301403ddd48e190d8e02fd8 prometheus-config-reloader
https://github.com/openshift/prometheus-operator 73ed419e5254d39ad301403ddd48e190d8e02fd8 prometheus-operator

Neither Vikas nor I are concerned about the OSE divergence.  Back to ON_QA

Comment 18 W. Trevor King 2019-05-08 20:32:01 UTC
For future reference, here's an improved one-liner for identifying diverged repos:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-194149 | grep github | awk '{print $2, $3}' | sort | uniq | grep -v '^ *1 ' | awk '{print $1}' | uniq -c | grep -v '^ *1 '
      2 https://github.com/openshift/ose

Then you can drill in to just those:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-08-194149 | grep github | awk '{print $2, $3, $1}' | grep '/openshift/ose ' | sort
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 cli-artifacts
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 deployer
https://github.com/openshift/ose 4b4690a2806792061afcec394eadba7363736e02 tests
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 cli
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hyperkube
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 hypershift
https://github.com/openshift/ose 4f60fbe4aa866d23974d44499d84f1d370e51f94 node
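
A single-pass awk variant of the divergence check, as a sketch only. diverged_repos is a hypothetical helper name, and the column layout (image, repository URL, commit) is assumed from the output above.

```shell
# Sketch: list repositories that appear with more than one distinct commit.
# Reads `oc adm release info --commits` output on stdin and prints
# "<distinct-commit-count> <repository>" for each diverged repository.
# Assumption: column 2 is the repository URL and column 3 is the commit.
diverged_repos() {
  awk '$2 ~ /github/ && !seen[$2 " " $3]++ {commits[$2]++}
       END {for (repo in commits) if (commits[repo] > 1) print commits[repo], repo}' |
    sort -k2
}
```

Counting distinct (repository, commit) pairs inside awk avoids the chain of sort/uniq/grep stages, at the cost of being a little less obvious to read.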

Comment 19 Mike Fiedler 2019-05-08 23:27:49 UTC
05-08-220133 still has the issue

Comment 20 W. Trevor King 2019-05-09 00:44:18 UTC
There are some full-rebuilds going on now

Comment 21 lserven 2019-05-09 08:16:38 UTC
*** Bug 1708127 has been marked as a duplicate of this bug. ***

Comment 22 lserven 2019-05-09 08:19:04 UTC
4 more builds have been executed since 2019-05-09 00:44:18 UTC; all of them are still failing.

Comment 23 Wei Sun 2019-05-09 08:43:21 UTC
4.1.0-0.nightly-2019-05-09-072847 still has the issue

Comment 24 Antonio Murdaca 2019-05-09 08:52:30 UTC
I'm not yet sure who's building the nightlies, but the MCO hasn't changed that code path in months. What we've been seeing is that the operators are now built with a version that includes a timestamp, which is essentially what's causing this. The version used to be just a git hash, without relying on times. Is the build setting VERSION_OVERRIDE to the payload's version with a timestamp? If so, why did it change?

Comment 25 Clayton Coleman 2019-05-09 11:39:48 UTC
Can someone explain succinctly what the problem is?  I have limited time today and there’s a lot of data but no summary.

Comment 26 W. Trevor King 2019-05-09 13:11:43 UTC
MCO and MCC compare version tags to see if they are speaking the same language.  When ART's build drops divergent tags, they freak out:

  ...controller version mismatch for rendered-master-4dff26ada6de23170e5f280a5b37b927 expected 4.1.0-201905071832-dirty has 4.1.0-201905081021-dirty...

With minute-granularity on the tags, they are now often divergent.  Comparing by commit hash instead would be more robust.

Comment 27 W. Trevor King 2019-05-09 13:19:56 UTC
ART is not setting VERSION_OVERRIDE.  They are dropping new tags in their MCO checkout, and those tags get picked up by [1].  Replacing that default with:

  VERSION_OVERRIDE="$(git rev-parse --verify 'HEAD^{commit}')"

would fix this, at the expense of losing the human-recognizable versions (4.1.0..., etc.).

[1]: https://github.com/openshift/machine-config-operator/blob/dcb37138e6c781379e4a5cbe3ce84fc190656340/hack/build-go.sh#L20
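
As a sketch of that default, assuming the build script takes the version from the environment and only falls back to Git when nothing is set (resolve_version is a hypothetical helper name; the real hack/build-go.sh may be structured differently):

```shell
# Sketch: prefer an explicit VERSION_OVERRIDE, otherwise use the full commit hash.
# Assumption: this runs inside a Git checkout when VERSION_OVERRIDE is unset;
# the fallback command is only expanded in that case.
resolve_version() {
  printf '%s\n' "${VERSION_OVERRIDE:-$(git rev-parse --verify 'HEAD^{commit}')}"
}
```

With this shape, every image built from the same checkout gets the same version string regardless of when the build ran or which tags were dropped into the checkout.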

Comment 28 Antonio Murdaca 2019-05-09 13:28:33 UTC
MCC still needs at least a git hash to be 100% sure that we're speaking the same language or we're gonna bail. Just version and timestamp isn't enough and timestamps can drift in minutes of course...

Comment 29 lserven 2019-05-09 14:04:48 UTC
VERSION_OVERRIDE="$(printf '%s-%s' $(git describe 2>/dev/null || true) $(git rev-parse --verify 'HEAD^{commit}') | sed 's/\(.*\)-/\1/')"

To transform "4.1.0-0.nightly-2019-05-08-150359" -> "4.1.0-<HASH>"
And with some safety to transform something without a tag to "<HASH>"
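
One way to get that transformation with plain parameter expansion, as a sketch only. version_with_hash is a hypothetical helper name, and the inputs are passed as arguments so the function can be tried without a Git checkout.

```shell
# Sketch: build "<leading-version>-<hash>" from a describe-style string and a
# commit hash, or just "<hash>" when there is no tag to describe.
# Assumption: the human-readable part is everything before the first hyphen.
version_with_hash() {
  describe="$1"
  hash="$2"
  if [ -n "$describe" ]; then
    # ${describe%%-*} strips the longest suffix starting at a hyphen.
    printf '%s-%s\n' "${describe%%-*}" "$hash"
  else
    printf '%s\n' "$hash"
  fi
}
```

In a build script the arguments could come from "$(git describe 2>/dev/null || true)" and "$(git rev-parse --verify 'HEAD^{commit}')".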

Comment 30 Clayton Coleman 2019-05-09 14:28:57 UTC
Do not depend on git tags in your repos for human readable versions.  Use git commit shas.  The tags ART sets will never be exposed to customers and the git commit is what automation will use to correlate code to output.  Git tags create extra complexity for little value, except in very constrained use cases.

Comment 31 W. Trevor King 2019-05-09 14:42:05 UTC
Do ART clones always preserve history?  Might be more reliable to use the root tree-ish hash.

Comment 33 W. Trevor King 2019-05-09 19:12:48 UTC
mco#728 is still in flight, but 4.1.0-0.nightly-2019-05-09-182710 has matching tags:

$ oc image info $(oc adm release info --image-for=machine-config-controller registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-09-182710) | grep access.redhat.com/
             url=https://access.redhat.com/containers/#/registry.access.redhat.com/openshift/ose-machine-config-controller/images/v4.1.0-201905091018
$ oc image info $(oc adm release info --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-09-182710) | grep access.redhat.com/
             url=https://access.redhat.com/containers/#/registry.access.redhat.com/openshift/ose-machine-config-operator/images/v4.1.0-201905091018

so it won't exhibit this issue.
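
To compare the two labels in a script rather than by eye, the build tag can be cut out of the url= line. A sketch, assuming the label keeps the ".../images/<tag>" shape shown above; image_build_tag is a hypothetical helper name.

```shell
# Sketch: print the build tag that follows "/images/" in the url= label.
# Reads `oc image info` output on stdin.
# Assumption: the url label ends in ".../images/<tag>" as in the output above.
image_build_tag() {
  sed -n 's|.*url=.*/images/\(.*\)$|\1|p'
}
```

Running it once per image and comparing the two results with a string test gives a quick yes/no on whether the controller and operator tags match.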

Comment 34 Antonio Murdaca 2019-05-09 19:21:23 UTC
PR https://github.com/openshift/machine-config-operator/pull/728 has merged

Comment 35 W. Trevor King 2019-05-09 19:28:41 UTC
It's not clear to me what the scope should be to make this VERIFIED, or even ON_QA.  4.1.0-0.nightly-2019-05-09-182710 has the narrow fix (make the tags match, comment 33).  mco#728 landed, but is not in a nightly yet, and it has a more robust fix (compare Git commits).  There's also ART/CI bug(s) that can lead to different commits of the same repository entirely being used for different referenced images within the same release image (comment 3, comment 9, nobody has a fix yet).

Comment 36 Peter Ruan 2019-05-09 19:34:05 UTC
verified with 4.1.0-0.nightly-2019-05-09-182710

Comment 38 W. Trevor King 2019-05-10 17:16:59 UTC
The broader ART/CI commit-consistency issue now has its own bug 1708648.