[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial] is failing frequently in CI, see: https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-api-machinery%5D%20API%20data%20in%20etcd%20should%20be%20stored%20at%20the%20correct%20location%20and%20version%20for%20all%20resources%20%5BSerial%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D

TRT is concerned about this test, given: https://search.ci.openshift.org/?search=API+data+in+etcd+should+be+stored+at+the+correct+location+and+version+for+all+resources&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10.*serial&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

At present, over 40% of failures in periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial are this test. The problem appears to *only* affect AWS; other platforms pass 100% of the time. The affected AWS jobs are e2e-aws-serial and e2e-aws-techpreview-serial.

A sample job run failing this test today: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443049181332639744

It appears the test started noticeably deteriorating on Sep 27. A 40% share of our failures is too large to ignore, and getting to the bottom of this would be a big improvement for CI signal.
The test fails with:

fail [github.com/openshift/origin/test/extended/etcd/etcd_storage_path.go:518]: test failed: failed to clean up etcd: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"Status", APIVersion:"v1"}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"nodes \"node1\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0017c0840), Code:404}}

i.e. the test is unable to clean up after it runs. The audit logs show that the AWS node controller removes the node object:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"84279591-3c64-41df-918d-c5e0a0ed06e3","stage":"ResponseComplete","requestURI":"/api/v1/nodes/node1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:node-controller","uid":"81dd1878-5c8a-402a-8600-0e8d88a253a8","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.199.142"],"userAgent":"kube-controller-manager/v1.22.1 (linux/amd64) kubernetes/91b30ca/system:serviceaccount:kube-system:node-controller","objectRef":{"resource":"nodes","name":"node1","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2021-09-29T04:18:49.289256Z","stageTimestamp":"2021-09-29T04:18:49.310260Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:node-controller\" of ClusterRole \"system:controller:node-controller\" to ServiceAccount \"node-controller/kube-system\""}}

This node object is essential for the test to verify proper serialization of the object.
The node controller deletes unknown nodes; compare k8s.io/cloud-provider/controllers/nodelifecycle:

klog.V(2).Infof("deleting node since it is no longer present in cloud provider: %s", node.Name)

We may be able to delay that deletion by setting the node's Ready condition to true. The node would then be kept at least until it is marked not ready, which might be enough for it to survive the test.
https://github.com/kubernetes/kubernetes/pull/105349
Checking recent Prow CI jobs, the test now passes:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443778864760229888 : [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial] 22s

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443718659464761344 : [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial] 12s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056