Bug 2008926 - [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial]
Summary: [sig-api-machinery] API data in etcd should be stored at the correct location...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Stefan Schimanski
QA Contact: Rahul Gangwar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-29 13:55 UTC by Devan Goodwin
Modified: 2022-03-10 16:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:13:56 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 26490 | 0 | None | open | Bug 2008926: etcd-storage-test: mark node ready to save it for 5min from cloud provider | 2021-09-29 16:24:28 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:14:21 UTC

Description Devan Goodwin 2021-09-29 13:55:45 UTC
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial]

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-api-machinery%5D%20API%20data%20in%20etcd%20should%20be%20stored%20at%20the%20correct%20location%20and%20version%20for%20all%20resources%20%5BSerial%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D

TRT is concerned about this test, given: https://search.ci.openshift.org/?search=API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*serial&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

We can see that, at present, 40% of the failures in periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial are this failure.

The problem appears to affect *only* AWS; other platforms pass 100% of the time.

The affected AWS jobs are: e2e-aws-serial and e2e-aws-techpreview-serial.

A sample job run failing this test today: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443049181332639744

It appears the test started noticeably deteriorating on Sep 27.

40% of our failures is too large to ignore, and getting to the bottom of this would be a big improvement for CI signal.

Comment 1 Stefan Schimanski 2021-09-29 14:32:39 UTC
The test fails with

fail [github.com/openshift/origin/test/extended/etcd/etcd_storage_path.go:518]: test failed:
failed to clean up etcd: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"Status", APIVersion:"v1"}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"nodes \"node1\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0017c0840), Code:404}}

i.e. the test is unable to clean up after it runs.

When looking at the audit logs, it's visible that the AWS node controller removes the node object:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"84279591-3c64-41df-918d-c5e0a0ed06e3","stage":"ResponseComplete","requestURI":"/api/v1/nodes/node1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:node-controller","uid":"81dd1878-5c8a-402a-8600-0e8d88a253a8","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.199.142"],"userAgent":"kube-controller-manager/v1.22.1 (linux/amd64) kubernetes/91b30ca/system:serviceaccount:kube-system:node-controller","objectRef":{"resource":"nodes","name":"node1","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2021-09-29T04:18:49.289256Z","stageTimestamp":"2021-09-29T04:18:49.310260Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:node-controller\" of ClusterRole \"system:controller:node-controller\" to ServiceAccount \"node-controller/kube-system\""}}

This node object is essential for the test to verify proper serialization of the object.
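
To check other runs for the same pattern, delete requests against the test node can be filtered out of the audit log. The following is a minimal sketch, assuming the log is available as newline-delimited audit.k8s.io/v1 events. The helper name findNodeDeletes and the sampleLog input are hypothetical: the first sample line is a trimmed copy of the event above, and the second is a made-up non-delete request added for contrast.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// auditEvent keeps only the audit.k8s.io/v1 Event fields this sketch needs.
type auditEvent struct {
	Stage string `json:"stage"`
	Verb  string `json:"verb"`
	User  struct {
		Username string `json:"username"`
	} `json:"user"`
	ObjectRef struct {
		Resource string `json:"resource"`
		Name     string `json:"name"`
	} `json:"objectRef"`
	RequestReceivedTimestamp string `json:"requestReceivedTimestamp"`
}

// sampleLog is illustrative input only. Line 1 is trimmed from the event in
// this bug; line 2 is invented to show a request that must not match.
const sampleLog = `{"stage":"ResponseComplete","verb":"delete","user":{"username":"system:serviceaccount:kube-system:node-controller"},"objectRef":{"resource":"nodes","name":"node1"},"requestReceivedTimestamp":"2021-09-29T04:18:49.289256Z"}
{"stage":"ResponseComplete","verb":"get","user":{"username":"example-user"},"objectRef":{"resource":"nodes","name":"node1"}}`

// findNodeDeletes (hypothetical helper) returns the completed delete requests
// for the node called nodeName found in a newline-delimited audit log.
func findNodeDeletes(log, nodeName string) ([]auditEvent, error) {
	var hits []auditEvent
	sc := bufio.NewScanner(strings.NewReader(log))
	// Audit lines can be long, so raise the scanner's per-line limit to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev auditEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return nil, fmt.Errorf("parse audit line: %w", err)
		}
		if ev.Stage == "ResponseComplete" && ev.Verb == "delete" &&
			ev.ObjectRef.Resource == "nodes" && ev.ObjectRef.Name == nodeName {
			hits = append(hits, ev)
		}
	}
	return hits, sc.Err()
}

func main() {
	hits, err := findNodeDeletes(sampleLog, "node1")
	if err != nil {
		panic(err)
	}
	for _, ev := range hits {
		fmt.Printf("%s: node1 deleted by %s\n", ev.RequestReceivedTimestamp, ev.User.Username)
	}
}
```

Matching on the ResponseComplete stage counts each request once, even if the audit policy also records earlier stages of the same request.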

Comment 2 Stefan Schimanski 2021-09-29 15:19:10 UTC
The node controller deletes nodes that are unknown to the cloud provider; compare k8s.io/cloud-provider/controllers/nodelifecycle:

  klog.V(2).Infof("deleting node since it is no longer present in cloud provider: %s", node.Name)

Maybe one can delay that by setting the node to ready. Then at least the node is kept until it is set not ready. This might be enough for it to survive long enough.
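
The idea can be illustrated with a simplified model of the decision. This is a sketch under stated assumptions, not the controller's actual code: it assumes the controller skips nodes that report Ready and only looks up the remaining ones in the cloud provider. The node type, the nodesToDelete function, and the existsInCloud callback are hypothetical stand-ins for the real v1.Node object and cloud provider lookup.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a v1.Node; Ready models the NodeReady
// condition being true.
type node struct {
	Name  string
	Ready bool
}

// nodesToDelete models the assumed decision: nodes reporting Ready are
// skipped, and the remaining nodes are deleted if the cloud provider does not
// know them.
func nodesToDelete(nodes []node, existsInCloud func(name string) bool) []string {
	var doomed []string
	for _, n := range nodes {
		if n.Ready {
			continue // assumed: healthy nodes are not checked against the cloud
		}
		if !existsInCloud(n.Name) {
			doomed = append(doomed, n.Name)
		}
	}
	return doomed
}

func main() {
	// The test's fake node has no backing cloud instance.
	noInstances := func(string) bool { return false }

	fmt.Println(nodesToDelete([]node{{Name: "node1", Ready: false}}, noInstances))
	fmt.Println(nodesToDelete([]node{{Name: "node1", Ready: true}}, noInstances))
}
```

In this model, a node with no cloud instance is only deleted once it stops reporting Ready, which is the window the workaround relies on.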

Comment 3 Stefan Schimanski 2021-09-29 15:45:07 UTC
https://github.com/kubernetes/kubernetes/pull/105349

Comment 5 Rahul Gangwar 2021-10-01 07:24:17 UTC
Checked the Prow CI jobs; the JUnit test passed:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443778864760229888

: [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial]	22s


https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443718659464761344

: [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial]	12s

Comment 9 errata-xmlrpc 2022-03-10 16:13:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

