1907644 – fix up handling of non-critical annotations on daemonsets/deployments

Summary: fix up handling of non-critical annotations on daemonsets/deployments

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Dan Winship
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-14 22:25 UTC by Dan Winship
Modified:	2021-02-24 15:44 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:43:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 911	0	None	closed	Bug 1907644: fix up non-critical / Progressing status handling	2021-01-25 10:06:24 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:44:00 UTC

Description Dan Winship 2020-12-14 22:25:58 UTC

We need to make sure CNO doesn't erroneously show up as Degraded during the install. See PR. https://github.com/openshift/cluster-network-operator/pull/911

Comment 2 zhaozhanqi 2021-01-06 08:21:29 UTC

Verified this bug on 4.7.0-0.nightly-2021-01-06-012750

Check the openshift-network-operator pod logs:

    #oc logs network-operator-55496d8847-9thwc -n openshift-network-operator | grep "Deployment \"openshift-network-diagnostics/network-check-source\""

    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
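
The log above repeats four distinct messages in sequence. A minimal Go sketch for collapsing such consecutive duplicates into one line per phase is shown here; the type and function names are illustrative only and are not part of CNO or `oc`, and the input lines are shortened stand-ins for the real messages.

```go
package main

import "fmt"

// run is one stretch of identical consecutive log lines.
type run struct {
	Line  string
	Count int
}

// collapseRuns groups consecutive identical lines, preserving their order.
// (Hypothetical helper for reading operator logs; not taken from CNO.)
func collapseRuns(lines []string) []run {
	var out []run
	for _, l := range lines {
		if n := len(out); n > 0 && out[n-1].Line == l {
			out[n-1].Count++
			continue
		}
		out = append(out, run{Line: l, Count: 1})
	}
	return out
}

func main() {
	// Shortened stand-ins for the messages in the log above.
	lines := []string{
		"to be created", "to be created",
		"not yet scheduled on any nodes",
		"waiting for other operators to become ready",
		"waiting for other operators to become ready",
		"not available (awaiting 1 nodes)",
	}
	for _, r := range collapseRuns(lines) {
		fmt.Printf("%3dx %s\n", r.Count, r.Line)
	}
}
```

Read this way, the verification shows the Deployment moving through "to be created", "not yet scheduled", "waiting for other operators", and "not available" in that order.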

Comment 3 zhaozhanqi 2021-01-25 10:19:19 UTC

Hi, @danw 

We may see CNO report Degraded as 'True' when upgrading from 4.6 to 4.7, because DaemonSet "openshift-network-diagnostics/network-check-target" is not available: a worker node has insufficient resources to schedule the pod. See:

[2021-01-08T02:44:41.529Z] Name:         network
[2021-01-08T02:44:41.529Z] Namespace:    
[2021-01-08T02:44:41.529Z] Labels:       <none>
[2021-01-08T02:44:41.529Z] Annotations:  network.operator.openshift.io/last-seen-state:
[2021-01-08T02:44:41.529Z]                 {"DaemonsetStates":[{"Namespace":"openshift-network-diagnostics","Name":"network-check-target","LastSeenStatus":{"currentNumberScheduled":...
[2021-01-08T02:44:41.529Z] API Version:  config.openshift.io/v1
[2021-01-08T02:44:41.529Z] Kind:         ClusterOperator
...
[2021-01-08T02:44:41.529Z] Spec:
[2021-01-08T02:44:41.529Z] Status:
[2021-01-08T02:44:41.529Z]   Conditions:
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:28:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-01-08T00:14:26Z
[2021-01-08T02:44:41.529Z]     Reason:                RolloutHung
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Degraded
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-07T22:38:25Z
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Upgradeable
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
[2021-01-08T02:44:41.529Z]     Reason:                Deploying
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Progressing
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               The network is starting up
[2021-01-08T02:44:41.529Z]     Reason:                Startup
[2021-01-08T02:44:41.529Z]     Status:                False
[2021-01-08T02:44:41.529Z]     Type:                  Available

Check the pod status:

lastTransitionTime: "2021-01-22T19:58:18Z"
message: '0/6 nodes are available: 1 Insufficient memory, 5 node(s) didn''t match Pod''s node affinity.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
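
To make the Unschedulable message above easier to reason about, here is a rough Go sketch that splits it into per-reason node counts. It assumes the message keeps the exact "N/M nodes are available: <count> <reason>, ..." shape seen here; that wording is not a stable API, and `parseUnschedulable` and `schedSummary` are hypothetical names, not part of any Kubernetes library.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	headerRE = regexp.MustCompile(`^(\d+)/(\d+) nodes are available: (.*)$`)
	reasonRE = regexp.MustCompile(`^(\d+) (.+)$`)
)

// schedSummary holds the node counts extracted from one message.
type schedSummary struct {
	Available int
	Total     int
	Reasons   map[string]int // reason text -> number of nodes
}

// parseUnschedulable parses a message of the assumed shape. The input is the
// decoded YAML string, so the doubled quotes ('') are already single quotes.
func parseUnschedulable(msg string) (schedSummary, error) {
	m := headerRE.FindStringSubmatch(strings.TrimSpace(msg))
	if m == nil {
		return schedSummary{}, fmt.Errorf("unrecognized message: %q", msg)
	}
	// The groups are digits only, so Atoi errors are ignored in this sketch.
	avail, _ := strconv.Atoi(m[1])
	total, _ := strconv.Atoi(m[2])
	s := schedSummary{Available: avail, Total: total, Reasons: map[string]int{}}
	for _, part := range strings.Split(strings.TrimSuffix(m[3], "."), ", ") {
		r := reasonRE.FindStringSubmatch(part)
		if r == nil {
			continue // skip fragments that do not start with a count
		}
		n, _ := strconv.Atoi(r[1])
		s.Reasons[r[2]] += n
	}
	return s, nil
}

func main() {
	msg := "0/6 nodes are available: 1 Insufficient memory, 5 node(s) didn't match Pod's node affinity."
	s, err := parseUnschedulable(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d nodes available\n", s.Available, s.Total)
	for reason, n := range s.Reasons {
		fmt.Printf("%d node(s): %s\n", n, reason)
	}
}
```

For this message, five of the six nodes are excluded by node affinity, so the single remaining candidate node being short on memory is enough to leave the pod Pending.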


This is the network operator status:

network                                    4.7.0-0.nightly-2021-01-22-134922   False       True          True       148m

I think the network operator should not report Degraded as 'True' even if the openshift-network-diagnostics/network-check-target pod is not running, since this affects the upgrade flow.

So I am reopening this issue.

Comment 4 Dan Winship 2021-01-25 14:27:44 UTC

The "don't mark CNO Degraded because of non-critical DaemonSets" hack only operates at cluster install time, because we know the cluster isn't fully functional at that point (no worker nodes, no Service CA Operator) and so some pods won't be able to be started. But during an *update*, the cluster is expected to remain fully operational at all times, so if network-check-target is not rolling out, that's actually a problem and *should* be reported.

(And "0/6 nodes are available: 1 Insufficient memory" makes it sound like there's something wrong with this cluster.)

Re-closing this bz, because CNO error reporting is working as expected. If you have must-gather from that cluster, or if you can reproduce this problem later, then please open a new bug about network-check-target not deploying successfully.
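
The install-versus-update distinction described in this comment can be sketched roughly as follows. This is an illustration only, not the actual CNO code (see the linked PR for that); the `workload` type, its fields, and the `installComplete` flag are assumptions made for the example.

```go
package main

import "fmt"

// workload is a hypothetical, simplified view of a DaemonSet or Deployment.
type workload struct {
	Name        string
	NonCritical bool // assumed to come from a "non-critical" annotation
	RolloutHung bool // rollout has stopped making progress
}

// degraded reports whether any hung rollout should mark the operator Degraded.
// installComplete is an assumed flag: false during initial cluster install,
// true afterwards (including during updates).
func degraded(ws []workload, installComplete bool) bool {
	for _, w := range ws {
		if !w.RolloutHung {
			continue
		}
		if w.NonCritical && !installComplete {
			// During install the cluster is known to be incomplete
			// (no worker nodes, no Service CA Operator), so a stuck
			// non-critical workload is tolerated.
			continue
		}
		// After install, any hung rollout is a real problem to report.
		return true
	}
	return false
}

func main() {
	ws := []workload{{
		Name:        "openshift-network-diagnostics/network-check-target",
		NonCritical: true,
		RolloutHung: true,
	}}
	fmt.Println("during install:", degraded(ws, false))
	fmt.Println("during update: ", degraded(ws, true))
}
```

Under these assumptions, the same hung network-check-target rollout is ignored at install time but reported as Degraded during an update, which matches the behavior described above.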

Comment 5 zhaozhanqi 2021-01-26 08:00:12 UTC

OK, thanks for the reply, @Dan. I thought the network-diagnostics pods were not very important and should not block the CNO status.

Anyway, the root cause is that a node has insufficient memory, so the pod cannot be scheduled.

Comment 8 errata-xmlrpc 2021-02-24 15:43:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
