Bug 2030562 - MachineHealthCheck's field .spec.unhealthyConditions[0].timeout hits "SHOULD NOT HAPPEN" error "expected string, got &value" in kube-apiserver logs
Summary: MachineHealthCheck's field .spec.unhealthyConditions[0].timeout hits "SHOULD NOT HAPPEN" error "expected string, got &value" in kube-apiserver logs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-09 06:51 UTC by Rahul Gangwar
Modified: 2022-01-12 03:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2026562
Environment:
Last Closed: 2022-01-11 12:40:50 UTC
Target Upstream Version:
Embargoed:



Description Rahul Gangwar 2021-12-09 06:51:09 UTC
Bug 2026562 fixed this error for the field ".spec.maxUnhealthy". We have now found a similar error for another field, ".spec.unhealthyConditions[0].timeout", in the latest 4.8 payload, so I am creating this separate bug to track the new field. Details are given in the next comment.

Comment 1 Rahul Gangwar 2021-12-09 06:52:58 UTC
Reproduced again in:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.24    True        False         7m15s   Cluster version is 4.8.24

/var/log/pods/openshift-kube-apiserver_kube-apiserver-qeci-..-lftfh-master-1_b..e-8c..-..b4-a9ba-ae0aa..be../kube-apiserver/..log..-..-..+.. stderr F E.. ..      .. fieldmanager.go..] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (machine.openshift.io/v1beta1, Kind=MachineHealthCheck) to smd typed: .spec.unhealthyConditions[0].timeout: expected string, got &value.valueUnstructured{Value..}" VersionKind="machine.openshift.io/v1beta1, Kind=MachineHealthCheck"
1
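
For reference, one way to search for this line directly on a master node is sketched below (the node name is a placeholder; this assumes cluster access):

```
# Search the node-local kube-apiserver pod logs for the managedFields error.
# <master-node> is a placeholder, not a name taken from this cluster.
oc debug node/<master-node> -- chroot /host \
  sh -c 'grep -r "SHOULD NOT HAPPEN" /var/log/pods/openshift-kube-apiserver_*/kube-apiserver/'
```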

Comment 2 Huali Liu 2021-12-09 10:19:24 UTC
(In reply to Rahul Gangwar from comment #1)
> Reproduced again in 
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.8.24    True        False         7m15s   Cluster version is
> 4.8.24

Hi Rahul, can you please show me how to get this log?
 
> /var/log/pods/openshift-kube-apiserver_kube-apiserver-qeci-..-lftfh-master-
> 1_b..e-8c..-..b4-a9ba-ae0aa..be../kube-apiserver/..log..-..-..+.. stderr F
> E.. ..      .. fieldmanager.go..] "[SHOULD NOT HAPPEN] failed to update
> managedFields" err="failed to convert new object
> (machine.openshift.io/v1beta1, Kind=MachineHealthCheck) to smd typed:
> .spec.unhealthyConditions[0].timeout: expected string, got
> &value.valueUnstructured{Value..}"
> VersionKind="machine.openshift.io/v1beta1, Kind=MachineHealthCheck"
> 1

I checked like this but didn't get the error.

1. vSphere 7 IPI-installed cluster with windows_server_ver_2004 workers and OVNKubernetes network

2. Run this test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-25691

3. Check the kube-apiserver logs on all master nodes.

liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-kube-apiserver              
NAME                                        READY   STATUS      RESTARTS   AGE
installer-2-huliu131-gsm4h-master-1         0/1     Completed   0          85m
installer-3-huliu131-gsm4h-master-0         0/1     Completed   0          84m
installer-3-huliu131-gsm4h-master-2         0/1     Completed   0          82m
installer-4-huliu131-gsm4h-master-2         0/1     Completed   0          81m
installer-5-huliu131-gsm4h-master-1         0/1     Completed   0          80m
installer-6-huliu131-gsm4h-master-0         0/1     Completed   0          72m
installer-6-huliu131-gsm4h-master-1         0/1     Completed   0          79m
installer-6-huliu131-gsm4h-master-2         0/1     Completed   0          75m
installer-7-huliu131-gsm4h-master-0         0/1     Completed   0          63m
installer-7-huliu131-gsm4h-master-1         0/1     Completed   0          66m
installer-7-huliu131-gsm4h-master-2         0/1     Completed   0          59m
kube-apiserver-huliu131-gsm4h-master-0      5/5     Running     0          61m
kube-apiserver-huliu131-gsm4h-master-1      5/5     Running     0          64m
kube-apiserver-huliu131-gsm4h-master-2      5/5     Running     0          57m
revision-pruner-3-huliu131-gsm4h-master-0   0/1     Completed   0          82m
revision-pruner-6-huliu131-gsm4h-master-0   0/1     Completed   0          67m
revision-pruner-6-huliu131-gsm4h-master-1   0/1     Completed   0          76m
revision-pruner-6-huliu131-gsm4h-master-2   0/1     Completed   0          72m
revision-pruner-7-huliu131-gsm4h-master-0   0/1     Completed   0          60m
revision-pruner-7-huliu131-gsm4h-master-1   0/1     Completed   0          64m
revision-pruner-7-huliu131-gsm4h-master-2   0/1     Completed   0          56m
liuhuali@Lius-MacBook-Pro huali-test % oc logs   kube-apiserver-huliu131-gsm4h-master-0   -n openshift-kube-apiserver|grep "SHOULD NOT HAPPEN" 
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/kube-apiserver-huliu131-gsm4h-master-0. Please use `kubectl.kubernetes.io/default-container` instead
liuhuali@Lius-MacBook-Pro huali-test % oc logs    kube-apiserver-huliu131-gsm4h-master-2   -n openshift-kube-apiserver|grep "SHOULD NOT HAPPEN"
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/kube-apiserver-huliu131-gsm4h-master-2. Please use `kubectl.kubernetes.io/default-container` instead
liuhuali@Lius-MacBook-Pro huali-test % oc logs   kube-apiserver-huliu131-gsm4h-master-1    -n openshift-kube-apiserver|grep "SHOULD NOT HAPPEN"
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/kube-apiserver-huliu131-gsm4h-master-1. Please use `kubectl.kubernetes.io/default-container` instead
liuhuali@Lius-MacBook-Pro huali-test %

Comment 3 Rahul Gangwar 2021-12-09 11:18:02 UTC
We got the logs from a test run.
Cluster template - ipi-on-osp/versioned-installer-fips-ovn-ci
must-gather link - https://drive.google.com/file/d/1q3zugFCdOAsWlGqTXyE1ApmY3OdEFddT/view?usp=sharing

 find . -name "*.log"|grep -inr "SHOULD NOT HAPPEN"
./must-gather.local.3960990593128029605/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a4dc776b581ba6c4b2d1a5a08ccd09598177e2934e9159d0eac4067b1bdcce01/audit_logs/kube-apiserver/qeci-30541-fvs2b-master-1.c.openshift-qe.internal-termination.log:2814:E1208 23:07:04.288418      20 fieldmanager.go:191] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (certificates.k8s.io/v1, Kind=CertificateSigningRequest) to smd typed: .status.conditions: duplicate entries for key [type=\"Approved\"]" VersionKind="/, Kind="
./must-gather.local.3960990593128029605/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a4dc776b581ba6c4b2d1a5a08ccd09598177e2934e9159d0eac4067b1bdcce01/audit_logs/kube-apiserver/qeci-30541-fvs2b-master-1.c.openshift-qe.internal-termination.log:3605:E1208 23:07:10.082625      20 fieldmanager.go:191] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (certificates.k8s.io/v1, Kind=CertificateSigningRequest) to smd typed: .status.conditions: duplicate entries for key [type=\"Approved\"]" VersionKind="/, Kind="

Comment 4 Huali Liu 2021-12-09 11:22:16 UTC
OK, thanks for the information. I'll check again.

Comment 5 Huali Liu 2021-12-10 01:13:29 UTC
Hi Rahul, 

(In reply to Rahul Gangwar from comment #3)
> We got the logs from test run.
> Cluster template - ipi-on-osp/versioned-installer-fips-ovn-ci
> must-gather link -
> https://drive.google.com/file/d/1q3zugFCdOAsWlGqTXyE1ApmY3OdEFddT/
> view?usp=sharing
>
In Comment 3 the error shows "(certificates.k8s.io/v1, Kind=CertificateSigningRequest)",

>  find . -name "*.log"|grep -inr "SHOULD NOT HAPPEN"
> ./must-gather.local.3960990593128029605/quay-io-openshift-release-dev-ocp-v4-
> 0-art-dev-sha256-
> a4dc776b581ba6c4b2d1a5a08ccd09598177e2934e9159d0eac4067b1bdcce01/audit_logs/
> kube-apiserver/qeci-30541-fvs2b-master-1.c.openshift-qe.internal-termination.
> log:2814:E1208 23:07:04.288418      20 fieldmanager.go:191] "[SHOULD NOT
> HAPPEN] failed to update managedFields" err="failed to convert new object
> (certificates.k8s.io/v1, Kind=CertificateSigningRequest) to smd typed:
> .status.conditions: duplicate entries for key [type=\"Approved\"]"
> VersionKind="/, Kind="
> ./must-gather.local.3960990593128029605/quay-io-openshift-release-dev-ocp-v4-
> 0-art-dev-sha256-
> a4dc776b581ba6c4b2d1a5a08ccd09598177e2934e9159d0eac4067b1bdcce01/audit_logs/
> kube-apiserver/qeci-30541-fvs2b-master-1.c.openshift-qe.internal-termination.
> log:3605:E1208 23:07:10.082625      20 fieldmanager.go:191] "[SHOULD NOT
> HAPPEN] failed to update managedFields" err="failed to convert new object
> (certificates.k8s.io/v1, Kind=CertificateSigningRequest) to smd typed:
> .status.conditions: duplicate entries for key [type=\"Approved\"]"
> VersionKind="/, Kind="


(In reply to Rahul Gangwar from comment #1)
> Reproduced again in 
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.8.24    True        False         7m15s   Cluster version is
> 4.8.24
> 
but in the title and Comment 1 the error shows "(machine.openshift.io/v1beta1, Kind=MachineHealthCheck)",
 
> /var/log/pods/openshift-kube-apiserver_kube-apiserver-qeci-..-lftfh-master-
> 1_b..e-8c..-..b4-a9ba-ae0aa..be../kube-apiserver/..log..-..-..+.. stderr F
> E.. ..      .. fieldmanager.go..] "[SHOULD NOT HAPPEN] failed to update
> managedFields" err="failed to convert new object
> (machine.openshift.io/v1beta1, Kind=MachineHealthCheck) to smd typed:
> .spec.unhealthyConditions[0].timeout: expected string, got
> &value.valueUnstructured{Value..}"
> VersionKind="machine.openshift.io/v1beta1, Kind=MachineHealthCheck"
> 1

These are two different errors. Can you please confirm which is the real error?

Comment 6 Huali Liu 2021-12-10 11:03:51 UTC
So it's another error, and the above test steps cannot be applied to it. Sorry for mistaking it for Bug 2026562.

Comment 8 Joel Speed 2021-12-13 14:20:29 UTC
Looking through the existing logs and gathers attached to this BZ, there is no evidence that this issue has actually occurred.
While I appreciate that it's hard to capture the logs, if we could capture the MachineHealthCheck resources from the clusters that this error has occurred on/is occurring on, it would be helpful to understand what values are triggering this error.

Without a particular value that we know triggers the issue, it's difficult to reproduce it, and therefore difficult to validate that we are fixing the actual problem.
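
For example, a sketch of the kind of gather that would help, assuming access to an affected cluster (the file name and jsonpath expression are illustrative):

```
# Dump every MachineHealthCheck in the cluster for attachment to this BZ.
oc get machinehealthchecks -A -o yaml > machinehealthchecks.yaml

# List only the unhealthyConditions timeouts per MHC, to spot any value that
# is not a duration string such as "300s".
oc get machinehealthchecks -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{": "}{.spec.unhealthyConditions[*].timeout}{"\n"}{end}'
```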

Comment 9 Rahul Gangwar 2021-12-13 14:33:24 UTC
Please check whether the link below is helpful to you.
https://drive.google.com/file/d/1RaNO8Qk96slVIk0-tBnw1ondfq__y2bh/view?usp=sharing

Comment 10 Joel Speed 2021-12-13 14:40:25 UTC
Looking at that link, there is only one MHC installed, which is the default.

It has a timeout value of `0s`, which is a string, so I'm not sure why the error would be tripping.

Notably, looking at the logs again, there is no mention of the MHC error, just job and batch.

Comment 11 Rahul Gangwar 2021-12-13 14:47:02 UTC
Logs from old pods can't be gathered once they have been rotated away. That's why we don't have this MHC error in the must-gather logs. We will try to upload the logs if the error is captured again.

Comment 12 Rahul Gangwar 2021-12-17 06:47:44 UTC
@Joel The same error was detected again; the must-gather for that run is at https://drive.google.com/file/d/1hfsa33AC6DB24BcSVpIEBcYEBeLBortG/view?usp=sharing

Comment 13 Joel Speed 2021-12-17 09:26:38 UTC
Looking into these logs, we see:

2021-12-17T03:46:01.363296436Z E1217 03:46:01.363130      16 fieldmanager.go:197] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (machine.openshift.io/v1beta1, Kind=MachineHealthCheck) to smd typed: .spec.unhealthyConditions[0].timeout: expected string, got &value.valueUnstructured{Value:3}" VersionKind="machine.openshift.io/v1beta1, Kind=MachineHealthCheck"

However, the only MHC installed within the cluster, from the gather at least, has a timeout value of `0s`. This implies to me that either we aren't actually seeing all of the machinehealthchecks or someone is trying to create a new one, and the API server is rightfully rejecting the invalid MHC.

Looking at the MHC validation (which notably hasn't changed between releases: https://github.com/openshift/machine-api-operator/blob/383c9b959b69044ec533118cf5d41f17101137f1/pkg/apis/machine/v1beta1/machinehealthcheck_types.go#L111-L112):
```
	// +kubebuilder:validation:Pattern="^([0-9]+(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$"
	// +kubebuilder:validation:Type:=string
```
The value 3 wouldn't be valid according to the regex. The regex forces a unit suffix, so the string validation seems appropriate to me.
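
As a rough local check of that claim, the pattern can be tried with grep -E (a sketch; the pattern is copied from the marker above with the Go string escaping removed):

```
# Check candidate timeout values against the kubebuilder validation pattern.
PATTERN='^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$'

echo "300s"  | grep -qE "$PATTERN" && echo "300s: matches"  || echo "300s: no match"   # matches
echo "3"     | grep -qE "$PATTERN" && echo "3: matches"     || echo "3: no match"      # no match: missing unit suffix
echo "1h30m" | grep -qE "$PATTERN" && echo "1h30m: matches" || echo "1h30m: no match"  # matches: repeated groups are allowed
```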

Comment 14 Joel Speed 2022-01-06 13:16:50 UTC
@rgangwar Do you have any response to my previous comment? Not sure how we can move forward at the moment, as we aren't able to reproduce this and the must-gathers have thus far not aligned with the errors reported.

Comment 15 Rahul Gangwar 2022-01-07 07:37:29 UTC
@joel The same error was detected again, but these must-gathers don't contain it because the pods had already been rotated by the time the must-gather was taken. Not sure whether these logs will help you or not.

https://drive.google.com/file/d/1GMBkIJh-ZAgSJ5qy4Vf8gxGizLUyvxHJ/view?usp=sharing, https://drive.google.com/file/d/1uKnEIWRIsWxJgPtHCh8V9QCQkW7fy0SO/view?usp=sharing

Comment 16 Joel Speed 2022-01-10 17:52:26 UTC
Nothing useful in those either. Do you have access to the clusters that these errors are occurring on? Could you run `oc get machinehealthchecks -A -o yaml` into a file and upload some examples here from clusters experiencing the issue?

Comment 17 Rahul Gangwar 2022-01-11 08:09:20 UTC
We got the above error in automated test runs; those clusters get pruned after the test run completes. I will try to share the details if it is reported again.

Comment 18 Joel Speed 2022-01-11 09:22:39 UTC
Do you have a link to the test suite that is being run in these automated tests? Perhaps looking at the test source code might help us to identify what's happening.

Comment 21 Rahul Gangwar 2022-01-11 11:02:15 UTC
@zhsun Please share with Joel the details of the MHC test cases for the above logs; the testrun-id is 20220103-0541.

Comment 23 sunzhaohua 2022-01-11 12:09:52 UTC
I tried with the MHC below, and I could get the SHOULD NOT HAPPEN error from the must-gather.

apiVersion: "machine.openshift.io/v1beta1"
kind: "MachineHealthCheck"
metadata:
  name: mhc2
  namespace: openshift-machine-api
spec:
  selector: {}
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 3
  maxUnhealthy: 3

$ find . -name "*.log"|grep -inr "SHOULD NOT HAPPEN"                                                 [19:55:43]
./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-207543a62af5243a05ab3fe972768d534d2d21bc3c29acb36bdf80691e841ce6/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-129-90.us-east-2.compute.internal/kube-apiserver/kube-apiserver/logs/current.log:1526:2022-01-11T11:52:50.058691868Z E0111 11:52:50.058606      17 fieldmanager.go:197] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (machine.openshift.io/v1beta1, Kind=MachineHealthCheck) to smd typed: .spec.unhealthyConditions[1].timeout: expected string, got &value.valueUnstructured{Value:3}" VersionKind="machine.openshift.io/v1beta1, Kind=MachineHealthCheck"

Comment 24 Joel Speed 2022-01-11 12:40:50 UTC
I've just tried to reproduce this given the example above.

When I create the MHC I get:

```
The MachineHealthCheck "mhc2" is invalid: spec.unhealthyConditions.timeout: Invalid value: "integer": spec.unhealthyConditions.timeout in body must be of type string: "integer"
```

This error is expected and intended; we have restricted the type so that users have to set a string. Notably, the MHC object is not stored and is not created.

As far as I can tell, this is working as expected and, unfortunately, this log line is a bit of a red herring. I don't know why the log line occurs when the MHC object is not actually created. This error will occur whenever a user attempts to create an invalid object.

I agree with Huali Liu, who mentioned on Slack that this isn't actually a bug; the OpenAPI validations are working as expected.
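
For completeness, a version of the MHC from comment 23 that passes this validation expresses every timeout as a duration string, e.g. (a sketch; the "300s" values are illustrative):

```
# Same shape as the MHC in comment 23, but with the second timeout written as
# a duration string instead of the bare integer 3 that the API server rejects.
oc apply -f - <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: mhc2
  namespace: openshift-machine-api
spec:
  selector: {}
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: "300s"
  - type: Ready
    status: Unknown
    timeout: "300s"
  maxUnhealthy: 3
EOF
```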

