As a build cop today, I saw the 4.1 to 4.2 upgrade fail in CI. The clusteroperator/network reports the following conditions:

  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-09-05T07:55:17Z",
    "generation": 1,
    "name": "network",
    "resourceVersion": "34496",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/network",
    "uid": "806d6150-cfb2-11e9-8569-12e6de07b346"
  },
  "spec": {},
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2019-09-05T08:00:54Z",
        "status": "False",
        "type": "Degraded"
      },
      {
        "lastTransitionTime": "2019-09-05T08:39:42Z",
        "message": "DaemonSet \"openshift-sdn/sdn\" is not available (awaiting 3 nodes)",
        "reason": "Deploying",
        "status": "True",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2019-09-05T07:56:44Z",
        "status": "True",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2019-09-05T08:39:41Z",
        "status": "True",
        "type": "Upgradeable"
      }
    ],

The sdn pod log has the following entry:

  rm: cannot remove '/etc/cni/net.d/80-openshift-network.conf': Permission denied

CI: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/336
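For reference, if someone wants to pull just these conditions from a live cluster (assuming oc access to the affected cluster), something like this should work:

  oc get clusteroperator network -o json | jq '.status.conditions'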
*** Bug 1749448 has been marked as a duplicate of this bug. ***
One theory that's been discussed is that the host directory does not exist before the containers mount it as a volume, so it gets created on the fly. It's been mentioned that this should create the directory with 0755 permissions, and the openshift-sdn/sdn, Multus, and Kuryr daemonsets all specify that their pods run as privileged -- so they should be able to rwx files in the directory that's been created. As an educated guess, I have a WIP PR up for the MCO to create the directory in advance, available @ https://github.com/openshift/machine-config-operator/pull/1105
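If anyone wants to double-check those two assumptions (what hostPath volumes the sdn pods mount, and that the containers really run privileged), this sketch should work against a live cluster, assuming the daemonset is still named openshift-sdn/sdn:

  # hostPath volumes the sdn pods mount from the node
  oc -n openshift-sdn get daemonset sdn -o json \
    | jq '.spec.template.spec.volumes[] | select(.hostPath != null) | .hostPath'

  # confirm the containers run privileged
  oc -n openshift-sdn get daemonset sdn -o json \
    | jq '.spec.template.spec.containers[] | {name, privileged: .securityContext.privileged}'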
This error shows up in ~75% of our ^release-.*-upgrade failures from the past 24 hours, and as a result those jobs are passing less than 50% of the time [1]. It's also broader than 4.1->4.2:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=%5Erelease-.*-upgrade&maxAge=24h&search=openshift-sdn/sdn.*is%20not%20available' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/336
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/337
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/140
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/141
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/223
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6653
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6663
...
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6700
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6702

It occasionally turns up outside of upgrade CI, but is more rare there:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=24h&context=0&search=openshift-sdn/sdn.*is%20not%20available' | jq -r '. | keys[]' | grep -v upgrade
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23712/pull-ci-openshift-origin-master-e2e-aws-serial/9715
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/2602/pull-ci-openshift-console-master-e2e-aws/8184
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2162/pull-ci-openshift-installer-master-e2e-libvirt/1348
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/21/pull-ci-openshift-openshift-apiserver-master-e2e-aws/64

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=^release-.*-upgrade&search=Cluster%20did%20not%20complete%20upgrade:%20timed%20out%20waiting%20for%20the%20condition&search=openshift-sdn/sdn.*is%20not%20available
Removed "4.1->4.2" from the title based on my earlier comment. For example, the 6702 job linked from the comment is a 4.2.0-0.ci-2019-09-05-150846 -> 4.2.0-0.ci-2019-09-05-174936 upgrade [1]. [1]: https://openshift-release.svc.ci.openshift.org/releasetag/4.2.0-0.ci-2019-09-05-183300?from=4.2.0-0.ci-2019-09-05-161239
Still working on https://github.com/openshift/machine-config-operator/pull/1105. Currently, creating `/var/run/multus/cni/net.d/.dummy` causes openshift-sdn/sdn to fail with:

  healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

Dan Winship has noted that "it looks like the ovs pod mounts /run/openvswitch while the sdn pod mounts /var/run/openvswitch. And those should be the same, but apparently aren't now. So it seems like maybe [this change] is causing /var/run and /run to become separate directories". When attempting to instead create `/run/multus/cni/net.d/.dummy`, another issue appears in which the API fails to come up -- details available @ https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1105/pull-ci-openshift-machine-config-operator-master-e2e-aws/5053/artifacts/
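A quick way to check whether /var/run and /run have actually diverged on a node (my understanding is that /var/run is normally a symlink to ../run on RHEL CoreOS, so this is a sketch under that assumption):

  # on the node, e.g. via `oc debug node/<node>` and chroot /host:
  ls -ld /var/run /run
  # if they are the same directory, device:inode should match:
  stat -c '%d:%i %n' /var/run/openvswitch /run/openvswitch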
This likely needs a 4.1 backport, as the majority of 4.1->4.2 runs fail on the 4.1 side (e.g. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/347)
Our MCO PR was merged; however, after it made it into the current build, Michal Dulko from the Kuryr team assessed that it was not sufficient -- they are still encountering problems. In the meantime, we have submitted a revert of our alternate configuration directory changes to the CNO @ https://github.com/openshift/cluster-network-operator/pull/310, which will cause https://bugzilla.redhat.com/show_bug.cgi?id=1732598 to remain open.
Thanks to the Kuryr guys for pulling up some SELinux logs for us and identifying some denials; I have pasted those here @ https://pastebin.com/iGvBec6u

Luis Tomas has also noted that the process being denied has the container_t label, while the directory has var_run_t:

  scontext=system_u:system_r:container_t:s0:c0,c912 tcontext=system_u:object_r:var_run_t:s0

and yes, the denial is on the directory itself --> tclass=dir
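For anyone who wants to reproduce that inspection on an affected node, a sketch (assuming auditd is logging AVCs and using the directories mentioned earlier in this bug):

  # show the SELinux labels on the directories in question
  ls -Zd /etc/cni/net.d /var/run/multus/cni/net.d
  # dump recent AVC denials
  ausearch -m avc -ts recent
  # summarize what policy the denials would imply (for diagnosis only, not a fix)
  ausearch -m avc -ts recent | audit2allow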
Checked that the latest builds are succeeding in https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2

I also tried an upgrade from 4.1.15 to 4.2.0-0.nightly-2019-09-10-181551, and it also upgraded successfully:

    history:
    - completionTime: "2019-09-11T02:20:05Z"
      image: registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-10-181551
      startedTime: "2019-09-11T02:14:02Z"
      state: Completed
      verified: false
      version: 4.2.0-0.nightly-2019-09-10-181551
    - completionTime: "2019-09-11T02:14:02Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T01:15:06Z"
      state: Completed
      verified: false
      version: 4.1.15
    observedGeneration: 2
    versionHash: 1aelqjLy9Eo=
  kind: List
  metadata:

Verified this bug
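(For reference, the history above comes from the clusterversion object; assuming a current oc client, it can be reproduced with:

  oc get clusterversion version -o yaml
  # or just the upgrade history entries:
  oc get clusterversion version -o json | jq '.status.history'
)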
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
4.1.28 -> 4.2.12 hit [1]:

  fail [k8s.io/kubernetes/test/e2e/framework/service_util.go:915]: Dec 19 21:05:43.729: Could not reach HTTP service through adce4ee1722a211eabcc41299b3d4af0-198512352.us-east-1.elb.amazonaws.com:80 after 3m0s

which is the failure mentioned in bug 1749448, which was closed as a dup of this one. Do we need to reopen something on this front? The update completed successfully [2], but still, a 3m outage seems like something we want to understand.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/980
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/980/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a0dbe73b7831a8ddb9a2c58a560461d7c2c23a92231289a2104b93e7723c0eff/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
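(If someone wants to measure the outage window on a live upgrade run, a rough sketch -- this is not the e2e framework's actual check, and the ELB hostname is just the one from the failure above:

  while ! curl -sf -m 5 http://adce4ee1722a211eabcc41299b3d4af0-198512352.us-east-1.elb.amazonaws.com:80/ >/dev/null; do
    date; sleep 5   # log each failed probe until the service responds again
  done
)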
And hit this in a 4.1.18 -> 4.1.28 run too [1].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/991