Bug 1792914 - MachineConfigDaemonReasonAnnotationKey gives node description beyond 262144 characters [NEEDINFO]
Summary: MachineConfigDaemonReasonAnnotationKey gives node description beyond 262144 characters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Erica von Buelow
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1809018
Depends On:
Blocks: 1809693
Reported: 2020-01-20 11:41 UTC by Ravi Trivedi
Modified: 2020-05-04 11:25 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1797790 1809693
Environment:
Last Closed: 2020-05-04 11:25:32 UTC
Target Upstream Version:
travi: needinfo? (evb)
igreen: needinfo? (evb)
travi: needinfo? (evb)
travi: needinfo? (evb)
igreen: needinfo? (evb)



Links
- GitHub openshift/machine-config-operator pull 1404 (closed): "Bug 1792914: prevent hitting annotation max size limit on nodes". Last updated 2020-11-02 10:40:29 UTC.
- Red Hat Product Errata RHBA-2020:0581. Last updated 2020-05-04 11:25:55 UTC.

Internal Links: 1798677

Description Ravi Trivedi 2020-01-20 11:41:38 UTC
Description of problem:

The MCO is unable to set the Degraded annotation on a node because the MachineConfigDaemonReasonAnnotationKey ("machineconfiguration.openshift.io/reason") annotation is populated with a dump of the node object, which pushes the node's annotations past the 262144-character limit enforced by the API server.

Sample log:

E0114 15:30:26.852292    3832 writer.go:142] Error setting Degraded annotation for node NODENAME: unable to update node "&Node{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:NodeSpec{PodCIDR:,DoNotUse_ExternalID:,ProviderID:,Unschedulable:false,Taints:[],ConfigSource:nil,},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[],Addresses:[],DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[],VolumesInUse:[],VolumesAttached:[],Config:nil,},}": Node "NODENAME" is invalid: metadata.annotations: Too long: must have at most 262144 characters


Version-Release number of selected component (if applicable):
OpenShift 4.2.8

Additional info:
- The 'MachineConfigDaemonReasonAnnotationKey' (i.e. "machineconfiguration.openshift.io/reason") annotation is populated with the raw error, which contains a complete dump of the node object, if the "machineconfiguration.openshift.io/desiredConfig" annotation is not set.

- This annotation is set only when 'MachineConfigDaemonStateAnnotationKey' (i.e. "machineconfiguration.openshift.io/state") is set to "Degraded" or "Unreconcilable".

- If "machineconfiguration.openshift.io/desiredConfig" is set, the reason annotation holds a much more useful message, for example:

machineconfiguration.openshift.io/currentConfig : rendered-master-2bcec19576ffe7462de176a3e46f64c3
machineconfiguration.openshift.io/desiredConfig : rendered-master-2bcec19576ffe7462de176a3e46f64c3
machineconfiguration.openshift.io/reason : unexpected on-disk state validating against rendered-master-2bcec19576ffe7462de176a3e46f64c3 ==> Better output
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Degraded
volumes.kubernetes.io/controller-managed-attach-detach : true
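The rule in the second point can be written as a small Go helper. The constants reuse the annotation keys quoted above; `setStateAnnotations` is hypothetical and, as an assumption of this sketch, leaves any existing reason untouched for other states such as "Done".

```go
package main

import "fmt"

// Annotation keys quoted in this report.
const (
	stateAnnotationKey  = "machineconfiguration.openshift.io/state"
	reasonAnnotationKey = "machineconfiguration.openshift.io/reason"
)

// setStateAnnotations is a hypothetical helper: it always records the daemon
// state, but writes a reason only for the "Degraded" and "Unreconcilable"
// states. Other states leave the reason annotation as it was.
func setStateAnnotations(annotations map[string]string, state, reason string) {
	annotations[stateAnnotationKey] = state
	switch state {
	case "Degraded", "Unreconcilable":
		annotations[reasonAnnotationKey] = reason
	}
}

func main() {
	annotations := map[string]string{}
	setStateAnnotations(annotations, "Degraded",
		"unexpected on-disk state validating against rendered-master-2bcec19576ffe7462de176a3e46f64c3")
	fmt.Println(annotations[reasonAnnotationKey])
}
```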

Comment 7 Stephen Cuppett 2020-01-20 13:01:39 UTC
Setting to current development branch (4.4). For fixes, if any, required/requested for prior versions, clones of this BZ will be created targeting those z-streams.

Comment 8 Erica von Buelow 2020-01-21 07:56:42 UTC
I'm working on a fix that first addresses the way-too-long annotations by:
1) removing the entire node object from that particular error
2) truncating any errors before putting them in annotations
3) (still thinking on this one) capping the length before setting any annotation on nodes

More speculative improvements to consider once those are in:
- putting that type of debug data somewhere else (e.g. a configmap) so that our annotations always have clear, known content
- moving away from using node annotations like this! Especially the multiple writers (MCD and MCC), which cause a world of hurt

If the problem comes back, yeah you could try deleting the machineconfiguration.openshift.io/reason annotation.
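The workaround above amounts to deleting one annotation from the Node. One way to express that is a JSON merge patch (RFC 7386) in which the key is set to null. The Go sketch below only builds the patch body; `removeAnnotationPatch` is a hypothetical helper. Applying it, for example with `oc patch node NODENAME --type=merge -p '<patch>'` or the equivalent `oc annotate node NODENAME machineconfiguration.openshift.io/reason-`, is not a procedure confirmed in this bug, and the annotation may be rewritten if the underlying error persists.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// removeAnnotationPatch builds a JSON merge patch in which a null value
// deletes the given annotation key from the object's metadata.
func removeAnnotationPatch(key string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{
				key: nil, // marshals to null, meaning "remove this key"
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, err := removeAnnotationPatch("machineconfiguration.openshift.io/reason")
	if err != nil {
		panic(err)
	}
	// prints: {"metadata":{"annotations":{"machineconfiguration.openshift.io/reason":null}}}
	fmt.Println(string(p))
}
```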

Comment 9 Ilan Green 2020-01-21 08:38:02 UTC
(In reply to Erica von Buelow from comment #8)
> If the problem comes back, yeah you could try deleting the
> machineconfiguration.openshift.io/reason annotation.

Could you please outline the exact steps?
Are they similar to the steps outlined here: https://bugzilla.redhat.com/show_bug.cgi?id=1717970#c4
Currently this happens for one worker node and one master node in the same cluster.
Scheduling is disabled on the worker node, so I gather we can safely ask the customer to try these steps on it.

Comment 41 Antonio Murdaca 2020-03-04 13:44:23 UTC
*** Bug 1809018 has been marked as a duplicate of this bug. ***

Comment 45 errata-xmlrpc 2020-05-04 11:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

