Description of problem:
The MCO is not able to set the Degraded annotation for a node because the MachineConfigDaemonReasonAnnotationKey, i.e. the "machineconfiguration.openshift.io/reason" annotation key, is populated with node details that push the annotations past the 262144-character limit.

Sample log:
E0114 15:30:26.852292 3832 writer.go:142] Error setting Degraded annotation for node NODENAME: unable to update node "&Node{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:NodeSpec{PodCIDR:,DoNotUse_ExternalID:,ProviderID:,Unschedulable:false,Taints:[],ConfigSource:nil,},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[],Addresses:[],DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[],VolumesInUse:[],VolumesAttached:[],Config:nil,},}": Node "NODENAME" is invalid: metadata.annotations: Too long: must have at most 262144 characters

Version-Release number of selected component (if applicable):
OpenShift 4.2.8

Additional info:
- We see that the 'MachineConfigDaemonReasonAnnotationKey', i.e. the "machineconfiguration.openshift.io/reason" annotation key, is populated with the error itself, which fully describes the node details, if the "machineconfiguration.openshift.io/desiredConfig" annotation is not set.
- This annotation is set only when 'MachineConfigDaemonStateAnnotationKey', i.e. "machineconfiguration.openshift.io/state", is set to the "Degraded" or "Unreconcilable" state.
- If the "machineconfiguration.openshift.io/desiredConfig" annotation is set, then the reason annotation gives a better output, such as:

machineconfiguration.openshift.io/currentConfig : rendered-master-2bcec19576ffe7462de176a3e46f64c3
machineconfiguration.openshift.io/desiredConfig : rendered-master-2bcec19576ffe7462de176a3e46f64c3
machineconfiguration.openshift.io/reason : unexpected on-disk state validating against rendered-master-2bcec19576ffe7462de176a3e46f64c3 ==> Better output
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Degraded
volumes.kubernetes.io/controller-managed-attach-detach : true
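For anyone trying to reason about the 262144 limit in the log above, here is a minimal sketch of a size check, assuming (as far as I understand the API server validation) that the limit applies to the combined byte length of all annotation keys and values on the object, not to a single value. The names totalAnnotationSizeLimit, annotationsSize and fitsAnnotationLimit are made up for this illustration and are not taken from the MCO or Kubernetes source.

```go
package main

import "fmt"

// totalAnnotationSizeLimit mirrors the 262144 (256 KiB) cap quoted in the
// error message. The constant name is hypothetical, chosen for this sketch.
const totalAnnotationSizeLimit = 256 * 1024

// annotationsSize returns the combined byte length of all keys and values.
// Assumption: the limit is applied to this sum across the whole map.
func annotationsSize(annotations map[string]string) int {
	total := 0
	for k, v := range annotations {
		total += len(k) + len(v)
	}
	return total
}

// fitsAnnotationLimit reports whether the annotation set stays within the cap.
func fitsAnnotationLimit(annotations map[string]string) bool {
	return annotationsSize(annotations) <= totalAnnotationSizeLimit
}

func main() {
	annotations := map[string]string{
		"machineconfiguration.openshift.io/state":  "Degraded",
		"machineconfiguration.openshift.io/reason": "unexpected on-disk state",
	}
	fmt.Println(annotationsSize(annotations), fitsAnnotationLimit(annotations))
}
```

If that assumption holds, one oversized reason value is enough to make every later update of the node's annotations fail, which matches the behaviour described in this report.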
Setting to current development branch (4.4). For fixes, if any, required/requested for prior versions, clones of this BZ will be created targeting those z-streams.
I'm working on a fix to first address the way-too-long annotations:
1) removing the entire node object from that particular error
2) truncating any errors before putting them in annotations
3) (still thinking on this one) capping the length before setting any annotation on nodes

More speculative improvements to think about after getting the above in would be:
- putting that type of debug data somewhere else (e.g. a configmap) so that our annotations always have clear and known content
- moving away from using node annotations like this, especially with multiple writers (mcd and mcc), which causes a world of hurt

If the problem comes back, yeah, you could try deleting the machineconfiguration.openshift.io/reason annotation.
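To make items 1) and 2) above concrete, here is a rough sketch of what they could look like. This is not the actual MCO change: maxReasonLen, truncationSuffix, truncateReason and updateError are hypothetical names, and the 2048-byte cap is an arbitrary value picked for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// maxReasonLen is a hypothetical cap for this sketch, not a value from the MCO.
const maxReasonLen = 2048

// truncationSuffix marks a reason string that was cut short.
const truncationSuffix = "... (truncated)"

// truncateReason limits reason to at most max bytes, cutting on a UTF-8 rune
// boundary and appending a marker when there is room for one.
func truncateReason(reason string, max int) string {
	if max <= 0 {
		return ""
	}
	if len(reason) <= max {
		return reason
	}
	suffix := truncationSuffix
	if max <= len(suffix) {
		suffix = "" // no room for the marker, cut hard
	}
	cut := max - len(suffix)
	// Step back until cut points at the first byte of a rune.
	for cut > 0 && !utf8.RuneStart(reason[cut]) {
		cut--
	}
	return reason[:cut] + suffix
}

// updateError reports a failed node update using only the node name, so the
// whole node object is never formatted into the message.
func updateError(nodeName string, err error) error {
	return fmt.Errorf("unable to update node %q: %v", nodeName, err)
}

func main() {
	err := updateError("NODENAME", errors.New("metadata.annotations: Too long"))
	fmt.Println(truncateReason(err.Error(), maxReasonLen))
}
```

Truncating on a rune boundary keeps the annotation value valid UTF-8, and the cap is counted in bytes rather than characters on the assumption that this is what the annotation size validation measures.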
(In reply to Erica von Buelow from comment #8)
> I'm working on a fix to first address the way-too-long annotations:
> 1) removing the entire node object from that particular error
> 2) truncating any errors before putting them in annotations
> 3) (still thinking on this one) capping the length before setting any
> annotation on nodes
>
> More speculative improvements to think about after getting the above in
> would be:
> - putting that type of debug data somewhere else (e.g. a configmap) so that
> our annotations always have clear and known content
> - moving away from using node annotations like this, especially with multiple
> writers (mcd and mcc), which causes a world of hurt
>
> If the problem comes back, yeah, you could try deleting the
> machineconfiguration.openshift.io/reason annotation.

Would it be possible to outline the exact steps, please? Would they be similar to the steps outlined at https://bugzilla.redhat.com/show_bug.cgi?id=1717970#c4 ?

Currently this happens for one worker node and one master node in the same cluster. Scheduling is disabled for the worker node, so I gather we can safely have the customer try these steps there.
*** Bug 1809018 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days