Bug 1907644 - fix up handling of non-critical annotations on daemonsets/deployments
Summary: fix up handling of non-critical annotations on daemonsets/deployments
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-14 22:25 UTC by Dan Winship
Modified: 2021-02-24 15:44 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:43:42 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 911 | 0 | None | closed | Bug 1907644: fix up non-critical / Progressing status handling | 2021-01-25 10:06:24 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:44:00 UTC

Description Dan Winship 2020-12-14 22:25:58 UTC
We need to make sure CNO doesn't erroneously show up as Degraded during the install. See PR. https://github.com/openshift/cluster-network-operator/pull/911

Comment 2 zhaozhanqi 2021-01-06 08:21:29 UTC
Verified this bug on 4.7.0-0.nightly-2021-01-06-012750

Check the openshift-network-operator pod logs:

    # oc logs network-operator-55496d8847-9thwc -n openshift-network-operator | grep "Deployment \"openshift-network-diagnostics/network-check-source\""

    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Waiting for Deployment "openshift-network-diagnostics/network-check-source" to be created
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is not yet scheduled on any nodes
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
    Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
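
The repeated lines in the output above make the sequence of states hard to see at a glance. As a minimal sketch (not part of the original verification, and assuming the grep output has been saved as a list of lines), consecutive duplicates can be collapsed into one line plus a count. The function name `collapse_repeats` and the `sample` list are illustrative only; the counts in `sample` mirror the output quoted in this comment.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse runs of identical consecutive lines into (line, count) pairs."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # groupby only groups *adjacent* equal items, which preserves state order.
    return [(line, sum(1 for _ in run)) for line, run in groupby(non_empty)]


# Illustrative input mirroring the log excerpt quoted in this comment.
name = 'Deployment "openshift-network-diagnostics/network-check-source"'
sample = (
    ["Waiting for " + name + " to be created"] * 2
    + [name + " is not yet scheduled on any nodes"] * 2
    + [name + " is waiting for other operators to become ready"] * 20
    + [name + " is not available (awaiting 1 nodes)"] * 3
)

for line, count in collapse_repeats(sample):
    print(f"{count:3d}x {line}")
```

Read this way, the Deployment moves through four states in order: not yet created, not yet scheduled, waiting for other operators, and not available.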

Comment 3 zhaozhanqi 2021-01-25 10:19:19 UTC
Hi @danw@redhat.com,

CNO may report Degraded as 'True' when upgrading from 4.6 to 4.7, because DaemonSet "openshift-network-diagnostics/network-check-target" is not available: a worker node has insufficient resources to schedule the pod. See:

[2021-01-08T02:44:41.529Z] Name:         network
[2021-01-08T02:44:41.529Z] Namespace:    
[2021-01-08T02:44:41.529Z] Labels:       <none>
[2021-01-08T02:44:41.529Z] Annotations:  network.operator.openshift.io/last-seen-state:
[2021-01-08T02:44:41.529Z]                 {"DaemonsetStates":[{"Namespace":"openshift-network-diagnostics","Name":"network-check-target","LastSeenStatus":{"currentNumberScheduled":...
[2021-01-08T02:44:41.529Z] API Version:  config.openshift.io/v1
[2021-01-08T02:44:41.529Z] Kind:         ClusterOperator
...
[2021-01-08T02:44:41.529Z] Spec:
[2021-01-08T02:44:41.529Z] Status:
[2021-01-08T02:44:41.529Z]   Conditions:
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:28:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-01-08T00:14:26Z
[2021-01-08T02:44:41.529Z]     Reason:                RolloutHung
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Degraded
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-07T22:38:25Z
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Upgradeable
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
[2021-01-08T02:44:41.529Z]     Reason:                Deploying
[2021-01-08T02:44:41.529Z]     Status:                True
[2021-01-08T02:44:41.529Z]     Type:                  Progressing
[2021-01-08T02:44:41.529Z]     Last Transition Time:  2021-01-08T00:13:10Z
[2021-01-08T02:44:41.529Z]     Message:               The network is starting up
[2021-01-08T02:44:41.529Z]     Reason:                Startup
[2021-01-08T02:44:41.529Z]     Status:                False
[2021-01-08T02:44:41.529Z]     Type:                  Available

Check the pod status:

lastTransitionTime: "2021-01-22T19:58:18Z"
message: '0/6 nodes are available: 1 Insufficient memory, 5 node(s) didn''t match Pod''s node affinity.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable


This is the network operator status:

network                                    4.7.0-0.nightly-2021-01-22-134922   False       True          True       148m

I think the network operator's Degraded condition should not be 'True' even if the openshift-network-diagnostics/network-check-target pod is not running, because this affects the upgrade flow.

So I am reopening this issue.

Comment 4 Dan Winship 2021-01-25 14:27:44 UTC
The "don't mark CNO Degraded because of non-critical DaemonSets" hack only operates at cluster install time, because we know the cluster isn't fully functional at that point (no worker nodes, no Service CA Operator) and so some pods won't be able to be started. But during an *update*, the cluster is expected to remain fully operational at all times, so if network-check-target is not rolling out, that's actually a problem and *should* be reported.

(And "0/6 nodes are available: 1 Insufficient memory" makes it sound like there's something wrong with this cluster.)

Re-closing this bz, because CNO error reporting is working as expected. If you have must-gather from that cluster, or if you can reproduce this problem later, then please open a new bug about network-check-target not deploying successfully.
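
The install-time versus update-time distinction described in this comment can be sketched roughly as follows. This is an illustration of the stated policy only, not the code from the PR (CNO itself is written in Go). The `Workload` type, its field names, and `should_report_degraded` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical stand-in for a DaemonSet or Deployment tracked by CNO."""
    name: str
    non_critical: bool  # assumed to reflect the non-critical annotation
    rolled_out: bool    # True once the rollout has completed
    hung: bool          # True if the rollout has stopped making progress


def should_report_degraded(workload: Workload, installing: bool) -> bool:
    """Return True if a stalled rollout should mark the operator Degraded."""
    if workload.rolled_out or not workload.hung:
        return False
    # During install the cluster is known to be incomplete (no worker nodes,
    # no Service CA Operator), so non-critical workloads are tolerated.
    if installing and workload.non_critical:
        return False
    # During an update the cluster is expected to stay fully operational,
    # so a hung rollout is reported even for non-critical workloads.
    return True


target = Workload("network-check-target", non_critical=True,
                  rolled_out=False, hung=True)
print(should_report_degraded(target, installing=True))
print(should_report_degraded(target, installing=False))
```

Under these assumptions, the hung network-check-target rollout is ignored during install but reported as Degraded during the 4.6 to 4.7 update, which matches the behavior described as expected here.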

Comment 5 zhaozhanqi 2021-01-26 08:00:12 UTC
OK, thanks for the reply, @Dan. I thought the network-diagnostics pods were not very important and should not block the CNO status.

Anyway, the root cause is that a node has insufficient memory, so the pod cannot be scheduled.

Comment 8 errata-xmlrpc 2021-02-24 15:43:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

