Bug 1907644

Summary: fix up handling of non-critical annotations on daemonsets/deployments
Product: OpenShift Container Platform
Component: Networking
Sub Component: openshift-sdn
Version: 4.7
Target Release: 4.7.0
Status: CLOSED ERRATA
Severity: medium
Priority: high
Hardware: Unspecified
OS: Unspecified
Reporter: Dan Winship <danw>
Assignee: Dan Winship <danw>
QA Contact: zhaozhanqi <zzhao>
CC: aconstan
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:43:42 UTC

Description Dan Winship 2020-12-14 22:25:58 UTC
We need to make sure CNO doesn't erroneously show up as Degraded during the install. See PR. https://github.com/openshift/cluster-network-operator/pull/911

Comment 2 zhaozhanqi 2021-01-06 08:21:29 UTC
Verified this bug on 4.7.0-0.nightly-2021-01-06-012750

Check the openshift-network-operator pod logs:

    # oc logs network-operator-55496d8847-9thwc -n openshift-network-operator | grep 'Deployment "openshift-network-diagnostics/network-check-source"'

    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
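
The progression in the output above is easier to read with repeated lines collapsed. The following is only a minimal sketch of one way to do that, assuming the log lines are byte-for-byte identical as shown here (no per-line timestamps); `collapse_repeats` is a hypothetical helper name and is not part of `oc` or the operator.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse runs of identical consecutive lines into (line, count) pairs."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # groupby only groups *adjacent* equal items, so the order of phases is kept.
    return [(text, sum(1 for _ in run)) for text, run in groupby(non_empty)]


if __name__ == "__main__":
    # Shortened sample built from the messages in the log above.
    name = 'Deployment "openshift-network-diagnostics/network-check-source"'
    sample = [
        f"Waiting for {name} to be created",
        f"Waiting for {name} to be created",
        f"{name} is not yet scheduled on any nodes",
        f"{name} is waiting for other operators to become ready",
        f"{name} is waiting for other operators to become ready",
        f"{name} is not available (awaiting 1 nodes)",
    ]
    for text, count in collapse_repeats(sample):
        print(f"{count:3d}x {text}")
```

Applied to the full output above, this would leave one entry for each of the four distinct messages, in the order they were logged.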

Comment 3 zhaozhanqi 2021-01-25 10:19:19 UTC
Hi, @danw 

We may hit CNO Degraded being 'True' when upgrading from 4.6 to 4.7, because DaemonSet "openshift-network-diagnostics/network-check-target" is not available: the worker node has insufficient resources to schedule the pod. See:

[2021-01-08T02:44:41.529Z] Name:         network
[2021-01-08T02:44:41.529Z] Namespace:    
[2021-01-08T02:44:41.529Z] Labels:       <none>
[2021-01-08T02:44:41.529Z] Annotations:  network.operator.openshift.io/last-seen-state:
[2021-01-08T02:44:41.529Z]                 {"DaemonsetStates":[{"Namespace":"openshift-network-diagnostics","Name":"network-check-target","LastSeenStatus":{"currentNumberScheduled":...
[2021-01-08T02:44:41.529Z] API Version:  config.openshift.io/v1
[2021-01-08T02:44:41.529Z] Kind:         ClusterOperator
...
[2021-01-08T02:44:41.529Z] Spec:
[2021-01-08T02:44:41.529Z] Status:
[2021-01-08T02:44:41.529Z]   Conditions:
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:28:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-01-08T00:14:26Z
[2021-01-08T02:44:41.529Z]     Reason:                RolloutHung
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Degraded
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-07T22:38:25Z
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Upgradeable
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
[2021-01-08T02:44:41.529Z]     Reason:                Deploying
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Progressing
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               The network is starting up
[2021-01-08T02:44:41.529Z]     Reason:                Startup
[2021-01-08T02:44:41.529Z]     Status:                False
[2021-01-08T02:44:41.529Z]     Type:                  Available
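
For scripted checks, the same conditions can be read from the JSON form of the ClusterOperator rather than from the describe output. This is only a rough sketch: `condition_status` is a hypothetical helper, and the dictionary below is a hand-trimmed copy of the conditions shown above, not a full object from the cluster.

```python
def condition_status(cluster_operator, cond_type):
    """Return (status, reason, message) for one condition type, or None if absent."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond.get("status"), cond.get("reason"), cond.get("message")
    return None


# Hand-trimmed copy of the conditions in the output above (illustrative only).
network_co = {
    "kind": "ClusterOperator",
    "metadata": {"name": "network"},
    "status": {
        "conditions": [
            {
                "type": "Degraded",
                "status": "True",
                "reason": "RolloutHung",
                "message": 'DaemonSet "openshift-network-diagnostics/network-check-target" '
                           "rollout is not making progress - last change 2021-01-08T00:14:26Z",
            },
            {"type": "Upgradeable", "status": "True"},
            {
                "type": "Progressing",
                "status": "True",
                "reason": "Deploying",
                "message": 'DaemonSet "openshift-network-diagnostics/network-check-target" '
                           "is not available (awaiting 2 nodes)",
            },
            {
                "type": "Available",
                "status": "False",
                "reason": "Startup",
                "message": "The network is starting up",
            },
        ]
    },
}

if __name__ == "__main__":
    for cond_type in ("Available", "Progressing", "Degraded"):
        print(cond_type, condition_status(network_co, cond_type))
```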

Check the status of the pod:

lastTransitionTime: "2021-01-22T19:58:18Z"
message: '0/6 nodes are available: 1 Insufficient memory, 5 node(s) didn''t match Pod''s node affinity.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable


This is the network operator status (columns: NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE):

network                                    4.7.0-0.nightly-2021-01-22-134922   False       True          True       148m

I think the network operator should not report Degraded as 'True' even if the openshift-network-diagnostics/network-check-target pod is not running, because this will affect the upgrade flow.

So I am reopening this issue.

Comment 4 Dan Winship 2021-01-25 14:27:44 UTC
The "don't mark CNO Degraded because of non-critical DaemonSets" hack only operates at cluster install time, because we know the cluster isn't fully functional at that point (no worker nodes, no Service CA Operator) and so some pods won't be able to be started. But during an *update*, the cluster is expected to remain fully operational at all times, so if network-check-target is not rolling out, that's actually a problem and *should* be reported.

(And "0/6 nodes are available: 1 Insufficient memory" makes it sound like there's something wrong with this cluster.)

Re-closing this bz, because CNO error reporting is working as expected. If you have must-gather from that cluster, or if you can reproduce this problem later, then please open a new bug about network-check-target not deploying successfully.
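
The install-versus-update distinction described in this comment can be sketched roughly as follows. This is an illustration of the rule only and not the operator's actual code: the annotation key, the `Workload` type, and `should_report_degraded` are all hypothetical names made up for this example.

```python
from dataclasses import dataclass

# Hypothetical annotation key, used for illustration only.
NON_CRITICAL_ANNOTATION = "example.com/non-critical"


@dataclass
class Workload:
    """Minimal stand-in for a DaemonSet or Deployment being rolled out."""
    name: str
    annotations: dict
    rollout_hung: bool


def is_non_critical(workload):
    """True if the workload carries the (hypothetical) non-critical annotation."""
    return NON_CRITICAL_ANNOTATION in workload.annotations


def should_report_degraded(workload, installing):
    """Decide whether a hung rollout should mark the operator Degraded.

    A hung non-critical workload is tolerated only while the cluster is
    still installing; during an update every hung rollout is reported.
    """
    if not workload.rollout_hung:
        return False
    if installing and is_non_critical(workload):
        return False
    return True


if __name__ == "__main__":
    target = Workload(
        name="openshift-network-diagnostics/network-check-target",
        annotations={NON_CRITICAL_ANNOTATION: ""},
        rollout_hung=True,
    )
    print("install:", should_report_degraded(target, installing=True))
    print("update: ", should_report_degraded(target, installing=False))
```

Under these assumptions, the same hung DaemonSet is ignored at install time but reported during an update, which matches the behavior seen in comment 3.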

Comment 5 zhaozhanqi 2021-01-26 08:00:12 UTC
OK, thanks for the reply, @Dan. I thought the network-diagnostics pods were not very important and should not block the CNO status.

Anyway, the root cause is that a node has insufficient memory, so the pod cannot be scheduled.

Comment 8 errata-xmlrpc 2021-02-24 15:43:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633