Description of problem:

The samples operator is found to be degraded with a confusing message.

Version-Release number of selected component (if applicable):

4.4.3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

  "type": "Degraded",
  "reason": "APIServerError",
  "message": "open : no such file or directory error reading file [];"

Expected results:

A message that helps the cluster admin understand what is broken.

Additional info:
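For context on where a message like this can come from: in Go, opening a file with an empty path produces exactly this error string, and formatting an empty slice with %v prints "[]". A minimal sketch illustrating the pattern (not the operator's actual code):

    package main

    import (
    	"fmt"
    	"os"
    )

    func main() {
    	// os.Open("") fails with a *PathError{Op: "open", Path: "", Err: ENOENT},
    	// whose Error() renders as "open : no such file or directory" --
    	// note the missing path between "open" and the colon.
    	_, err := os.Open("")

    	// Formatting an empty slice with %v prints "[]", which would explain
    	// the trailing "error reading file []" in the reported condition.
    	var files []string
    	fmt.Printf("%s error reading file %v\n", err, files)
    	// Output: open : no such file or directory error reading file []
    }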
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup

How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Removing the upgrade blocker, as the issue might be with the etcd error we are seeing alongside the samples operator degraded status:

  {
    "type": "Degraded",
    "status": "True",
    "lastTransitionTime": "2020-05-06T02:38:46Z",
    "reason": "APIServerDeployment_UnavailablePod::EncryptionKeyController_Error::OpenshiftAPIServerStaticResources_SyncError::Workload_SyncError",
    "message": "EncryptionKeyControllerDegraded: etcdserver: leader changed\nWorkloadDegraded: \"image-import-ca\": etcdserver: leader changed\nWorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/ns.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/svc.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
  }

The investigation is still ongoing and a bug will be filed soon. Another bug was also created to make the insights operator grab events from openshift-etcd-operator in case the openshift apiserver is degraded: https://bugzilla.redhat.com/show_bug.cgi?id=1832220
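For anyone retracing this triage, a condition like the one above can be pulled straight from the clusteroperator resource; a sketch using standard oc jsonpath filtering (not a command taken from this bug):

  oc get clusteroperator openshift-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'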
operator logs:

2020-05-06T11:23:09.278800618Z time="2020-05-06T11:23:09Z" level=info msg="updated template rhdm75-optaweb-employee-rostering"
2020-05-06T11:23:09.303330589Z time="2020-05-06T11:23:09Z" level=info msg="updated template amq63-ssl"
2020-05-06T11:23:09.303421715Z time="2020-05-06T11:23:09Z" level=info msg="CRDERROR event temp udpate err"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="open : no such file or directory error reading file []"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="CRDUPDATE event temp udpate err"
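If anyone needs to pull these logs themselves, the operator runs in the openshift-cluster-samples-operator namespace; something like the following should work (deployment and container names assumed from a default 4.x install):

  oc logs -n openshift-cluster-samples-operator \
    deployment/cluster-samples-operator -c cluster-samples-operator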
Lowering severity on this to "medium". The samples operator is degraded because the openshift apiserver is itself degraded and returning errors. That said, the message reported to the clusteroperator (and in the logs) is not helpful to a cluster admin.
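To illustrate what a more helpful message could look like: wrap the underlying error with the file path and operation before it reaches the clusteroperator condition. A minimal sketch, not the operator's actual code (readTemplateFile is a hypothetical helper):

    package main

    import (
    	"fmt"
    	"os"
    )

    // readTemplateFile is a hypothetical helper illustrating the pattern:
    // annotate the low-level error with the path and the operation, so the
    // degraded condition reads "error reading sample template ..." instead
    // of a bare "open : no such file or directory".
    func readTemplateFile(path string) ([]byte, error) {
    	data, err := os.ReadFile(path)
    	if err != nil {
    		return nil, fmt.Errorf("error reading sample template %q: %w", path, err)
    	}
    	return data, nil
    }

    func main() {
    	if _, err := readTemplateFile(""); err != nil {
    		// Prints: error reading sample template "": open : no such file or directory
    		fmt.Println(err)
    	}
    }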
An intermediate update:

1) https://bugzilla.redhat.com/show_bug.cgi?id=1828065 was fixed in 4.5 a week ago and, to a large degree, addresses what I think was uncovered while I was out.
2) I have just cloned that bug to 4.4.z; the 4.4.z bug is https://bugzilla.redhat.com/show_bug.cgi?id=1832344.
3) The automated cherrypick did not work; I have manually picked and am about to submit a 4.4 PR.
4) It is possible that fix fully handles the appearance of 'no such file or directory error reading file []' in cluster operator status, but I'll circle back and revisit/clarify once 3) is off and running.
5) If more is warranted, I'll handle it with this bug; otherwise, I'll close this bug as a duplicate of the cloned bug in 2).

Of course, if folks here disagree with that roadmap, please advise.
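For reference, the manual pick in 3) amounts to the usual backport workflow; a sketch with a placeholder commit SHA (branch and remote names are assumptions, and the actual commit is the one from the 4.5 fix for bug 1828065):

  git fetch upstream
  git checkout -b bug-1832344-release-4.4 upstream/release-4.4
  git cherry-pick <sha-of-4.5-fix>   # resolve any conflicts, then:
  git push origin bug-1832344-release-4.4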
*** This bug has been marked as a duplicate of bug 1832344 ***