Bug 1749446 - upgrade failure - openshift-sdn/sdn Daemonset is not available
Summary: upgrade failure - openshift-sdn/sdn Daemonset is not available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1749448
Depends On:
Blocks:
 
Reported: 2019-09-05 15:57 UTC by Abu Kashem
Modified: 2019-12-20 03:48 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:40:33 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
  Github openshift/cluster-network-operator pull 310 (closed): Bug 1749446: Revert "Merge pull request #299 from dougbtv/alternate-config-dir" (last updated 2021-01-15 19:57:40 UTC)
  Github openshift/machine-config-operator pull 1105 (closed): Bug 1749446: Creates alternative CNI configuration directory for the cluster network operator (last updated 2021-01-15 19:57:40 UTC)
  Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:40:44 UTC)

Description Abu Kashem 2019-09-05 15:57:43 UTC
As a build cop today, I saw the 4.1 to 4.2 upgrade fail in CI. 
The clusteroperator/network reports the following conditions:

 "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-09-05T07:55:17Z",
                "generation": 1,
                "name": "network",
                "resourceVersion": "34496",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/network",
                "uid": "806d6150-cfb2-11e9-8569-12e6de07b346"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-09-05T08:00:54Z",
                        "status": "False",
                        "type": "Degraded"
                    },
                    {
                        "lastTransitionTime": "2019-09-05T08:39:42Z",
                        "message": "DaemonSet \"openshift-sdn/sdn\" is not available (awaiting 3 nodes)",
                        "reason": "Deploying",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-09-05T07:56:44Z",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-09-05T08:39:41Z",
                        "status": "True",
                        "type": "Upgradeable"
                    }
                ],



The sdn pod log has the following entry:
rm: cannot remove '/etc/cni/net.d/80-openshift-network.conf': Permission denied

CI: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/336
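
For reference, the conditions and the pod log above can be pulled from a live cluster along these lines (a sketch; <sdn-pod-name> is a placeholder):

$ oc get clusteroperator network -o json
$ oc -n openshift-sdn get pods -o wide          # find the sdn pod on the affected node
$ oc -n openshift-sdn logs <sdn-pod-name>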

Comment 1 Ben Bennett 2019-09-05 17:19:38 UTC
*** Bug 1749448 has been marked as a duplicate of this bug. ***

Comment 2 Douglas Smith 2019-09-05 20:15:04 UTC
One theory that's been discussed is that there's some issue with the host directory not existing prior to the containers mounting it as a volume. Thus, the directory is created on the fly. It's been mentioned that this should create the directory with 0755 permissions, and all of openshift-sdn/sdn, Multus & Kuryr daemonsets specify that pods should be run as privileged -- so they should be able to rwx files in the directory that's been created. 

As an educated guess, I have a WIP PR up for the MCO to create the directory in advance available @ https://github.com/openshift/machine-config-operator/pull/1105
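
A quick way to check both halves of that theory on a live cluster is sketched below (illustrative only; the exact volume layout is whatever the shipped manifests define, and <node-name> is a placeholder):

# List the hostPath volumes the sdn DaemonSet mounts and where they point on the host
$ oc -n openshift-sdn get daemonset sdn \
    -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\t"}{.hostPath.path}{"\n"}{end}'

# Check the mode and ownership of the CNI config directory on an affected node
$ oc debug node/<node-name> -- chroot /host ls -ld /etc/cni/net.d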

Comment 3 W. Trevor King 2019-09-05 21:28:11 UTC
This error shows up in ~75% of our ^release-.*-upgrade failures from the past 24 hours, and as a result those jobs are passing less than 50% of the time [1].  It's also broader than 4.1->4.2:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=%5Erelease-.*-upgrade&maxAge=24h&search=openshift-sdn/sdn.*is%20not%20available' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/336
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/337
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/140
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/141
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/223
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6653
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6663
...
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6700
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6702

It occasionally turns up outside of upgrade CI, but is rarer there:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=24h&context=0&search=openshift-sdn/sdn.*is%20not%20available' | jq -r '. | keys[]' | grep -v upgrade 
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23712/pull-ci-openshift-origin-master-e2e-aws-serial/9715
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/2602/pull-ci-openshift-console-master-e2e-aws/8184
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2162/pull-ci-openshift-installer-master-e2e-libvirt/1348
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/21/pull-ci-openshift-openshift-apiserver-master-e2e-aws/64

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=^release-.*-upgrade&search=Cluster%20did%20not%20complete%20upgrade:%20timed%20out%20waiting%20for%20the%20condition&search=openshift-sdn/sdn.*is%20not%20available

Comment 4 W. Trevor King 2019-09-05 21:33:39 UTC
Removed "4.1->4.2" from the title based on my earlier comment.  For example, the 6702 job linked from the comment is a 4.2.0-0.ci-2019-09-05-150846 -> 4.2.0-0.ci-2019-09-05-174936 upgrade [1].

[1]: https://openshift-release.svc.ci.openshift.org/releasetag/4.2.0-0.ci-2019-09-05-183300?from=4.2.0-0.ci-2019-09-05-161239

Comment 5 Douglas Smith 2019-09-06 16:32:00 UTC
Still working on https://github.com/openshift/machine-config-operator/pull/1105. Currently, creating `/var/run/multus/cni/net.d/.dummy` causes openshift-sdn/sdn to fail with:

healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

Dan Winship has noted that "it looks like the ovs pod mounts /run/openvswitch while the sdn pod mounts /var/run/openvswitch. And those should be the same, but apparently aren't now. So it seems like maybe [this change] is causing /var/run and /run to become separate directories".
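
For context, on a systemd-based host such as RHEL CoreOS /var/run is normally a symlink to /run, so the two mount sources should resolve to the same host directory. A quick sanity check on a node (a sketch; <node-name> is a placeholder):

$ oc debug node/<node-name> -- chroot /host readlink /var/run
# expected to print ../run (or /run) on a typical systemd host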

Attempting to instead create `/run/multus/cni/net.d/.dummy` causes another issue in which the API fails to come up; details are available at https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1105/pull-ci-openshift-machine-config-operator-master-e2e-aws/5053/artifacts/

Comment 7 Antonio Murdaca 2019-09-09 10:14:23 UTC
This likely needs a 4.1 backport, as the majority of 4.1->4.2 upgrades fail on the 4.1 version (e.g. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/347)

Comment 8 Douglas Smith 2019-09-09 15:27:02 UTC
Our MCO PR was merged; however, after it made it into the current build, Michal Dulko from the Kuryr team assessed that it was not sufficient. They are still encountering problems.

In the meantime, we have submitted a revert of our alternate configuration directory changes to the CNO at https://github.com/openshift/cluster-network-operator/pull/310, which will cause https://bugzilla.redhat.com/show_bug.cgi?id=1732598 to remain open.

Comment 9 Douglas Smith 2019-09-09 16:04:21 UTC
Thanks to the Kuryr team for pulling up some SELinux logs for us; they have identified some denials, which I have pasted here: https://pastebin.com/iGvBec6u

Luis Tomas has also noted:

It seems the process being denied has the container_t label, while the directory has var_run_t.

scontext=system_u:system_r:container_t:s0:c0,c912 tcontext=system_u:object_r:var_run_t:s0

and yes... is to the dir --> tclass=dir
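
A sketch of how the labels and denials can be inspected on an affected node (standard RHEL tooling, assuming ausearch is present on the host; <node-name> is a placeholder):

# SELinux context of the directory the process is being denied access to
$ oc debug node/<node-name> -- chroot /host ls -dZ /run/multus/cni/net.d

# Recent AVC denials from the audit log
$ oc debug node/<node-name> -- chroot /host ausearch -m avc -ts recent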

Comment 11 zhaozhanqi 2019-09-11 03:31:43 UTC
Checked that the latest builds are succeeding in https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2

I also tried upgrading from 4.1.15 to 4.2.0-0.nightly-2019-09-10-181551, and it also upgraded successfully.
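
The ClusterVersion upgrade history excerpted below can be read directly from the cluster, for example (a sketch):

$ oc get clusterversion -o yaml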

    history:
    - completionTime: "2019-09-11T02:20:05Z"
      image: registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-10-181551
      startedTime: "2019-09-11T02:14:02Z"
      state: Completed
      verified: false
      version: 4.2.0-0.nightly-2019-09-10-181551
    - completionTime: "2019-09-11T02:14:02Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T01:15:06Z"
      state: Completed
      verified: false
      version: 4.1.15
    observedGeneration: 2
    versionHash: 1aelqjLy9Eo=
kind: List
metadata:

Verified this bug

Comment 12 errata-xmlrpc 2019-10-16 06:40:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 13 W. Trevor King 2019-12-19 23:28:02 UTC
4.1.28 -> 4.2.12 hit [1]:

fail [k8s.io/kubernetes/test/e2e/framework/service_util.go:915]: Dec 19 21:05:43.729: Could not reach HTTP service through adce4ee1722a211eabcc41299b3d4af0-198512352.us-east-1.elb.amazonaws.com:80 after 3m0s

This failure is mentioned in bug 1749448, which was closed as a dup of this one.  Do we need to reopen something on this front?  The update completed successfully [2], but still, a 3m outage seems like something we want to understand.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/980
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/980/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a0dbe73b7831a8ddb9a2c58a560461d7c2c23a92231289a2104b93e7723c0eff/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml

Comment 14 W. Trevor King 2019-12-20 03:48:07 UTC
And hit this in a 4.1.18 -> 4.1.28 run too [1].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/991

