Bug 1832193

Summary: incomprehensible error message from samples operator
Product: OpenShift Container Platform
Reporter: Oleg Bulatov <obulatov>
Component: Samples
Assignee: Gabe Montero <gmontero>
Status: CLOSED DUPLICATE
QA Contact: XiuJuan Wang <xiuwang>
Severity: medium
Docs Contact:
Priority: urgent
Version: 4.4
CC: adam.kaplan, lmohanty, vrutkovs
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-06 15:59:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Oleg Bulatov 2020-05-06 10:14:14 UTC
Description of problem:

The samples operator is found to be degraded with a confusing message.

Version-Release number of selected component (if applicable):

4.4.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

    "type": "Degraded",
    "reason": "APIServerError",
    "message": "open : no such file or directory error reading file [];"

Expected results:

A message that helps a cluster admin understand what is broken.

Additional info:

Comment 2 Lalatendu Mohanty 2020-05-06 10:38:45 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 4 Lalatendu Mohanty 2020-05-06 11:11:01 UTC
Removing the upgrade blocker, as the underlying issue might be the etcd error we are seeing alongside the samples operator's degraded status.

{
  "type": "Degraded",
  "status": "True",
  "lastTransitionTime": "2020-05-06T02:38:46Z",
  "reason": "APIServerDeployment_UnavailablePod::EncryptionKeyController_Error::OpenshiftAPIServerStaticResources_SyncError::Workload_SyncError",
  "message": "EncryptionKeyControllerDegraded: etcdserver: leader changed\nWorkloadDegraded: \"image-import-ca\": etcdserver: leader changed\nWorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/ns.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/svc.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
}

The investigation is still going on and a bug will be filed soon. 

Also, another bug was created to make the insights operator grab events from openshift-etcd-operator in case the openshift apiserver is degraded: https://bugzilla.redhat.com/show_bug.cgi?id=1832220
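
For anyone reading the Degraded condition above: its "message" field packs several per-controller errors into one newline-separated string, and some of those lines carry only a prefix with no detail. A minimal Go sketch for splitting such a message into readable lines follows. The condition struct and the splitMessage helper are hypothetical names used for illustration only, not part of any operator's code, and the sample input is a shortened copy of the message above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// condition mirrors only the fields shown in the status snippet above.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// splitMessage (hypothetical helper) breaks a multi-line condition message
// into its individual lines, dropping blank lines and lines that consist of
// a prefix only, such as "WorkloadDegraded: ".
func splitMessage(msg string) []string {
	var out []string
	for _, line := range strings.Split(msg, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasSuffix(line, ":") {
			continue
		}
		out = append(out, line)
	}
	return out
}

func main() {
	// Shortened sample taken from the condition quoted in this comment.
	raw := `{"type":"Degraded","status":"True","message":"EncryptionKeyControllerDegraded: etcdserver: leader changed\nWorkloadDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"}`

	var c condition
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, line := range splitMessage(c.Message) {
		fmt.Println(line)
	}
}
```

Read this way, most of the lines in the full message point at the same cause, "etcdserver: leader changed".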

Comment 5 Vadim Rutkovsky 2020-05-06 11:55:24 UTC
operator logs:

2020-05-06T11:23:09.278800618Z time="2020-05-06T11:23:09Z" level=info msg="updated template rhdm75-optaweb-employee-rostering"
2020-05-06T11:23:09.303330589Z time="2020-05-06T11:23:09Z" level=info msg="updated template amq63-ssl"
2020-05-06T11:23:09.303421715Z time="2020-05-06T11:23:09Z" level=info msg="CRDERROR event temp udpate err"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="open : no such file or directory error reading file []"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="CRDUPDATE event temp udpate err"

Comment 6 Adam Kaplan 2020-05-06 13:32:02 UTC
Lowering severity on this to "medium". The Samples operator is degraded because the openshift apiserver is degrading/returning errors. That said, the message reported to the clusteroperator (and the logs) is not helpful to a cluster admin.
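
To illustrate why the message is unhelpful: the text "open : no such file or directory" has the shape Go's os package produces when a file is opened with an empty path, which is what you would see on Linux. Whether an empty path is what actually reaches the operator's file read is an assumption here, not something confirmed in this bug. The sketch below shows that message and one way a caller could add context; readSample is a hypothetical helper, not the samples operator's code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// readSample is a hypothetical helper (not taken from the samples operator).
// It rejects an empty path up front and wraps any read error with the path,
// so the resulting message says what was being attempted.
func readSample(path string) ([]byte, error) {
	if path == "" {
		return nil, errors.New("sample file path is empty; the lookup that produced it likely failed")
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading sample file %q: %w", path, err)
	}
	return data, nil
}

func main() {
	// Opening an empty path: the error text has no file name in it,
	// which is why the reported message reads as "open : ...".
	_, err := os.Open("")
	fmt.Println(err)
	fmt.Println(errors.Is(err, os.ErrNotExist))

	// With the guard in place, the same situation yields a message
	// that names the real problem.
	_, err = readSample("")
	fmt.Println(err)
}
```

The design point is only that the error should be wrapped or replaced where the context is still known, before it is copied into the clusteroperator status.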

Comment 8 Gabe Montero 2020-05-06 15:46:12 UTC
An intermediate update:

1) https://bugzilla.redhat.com/show_bug.cgi?id=1828065 was fixed in 4.5 a week ago and to a large degree addresses what I think was uncovered while I was out

2) I have just cloned that bug to 4.4.z ... the 4.4.z bug is https://bugzilla.redhat.com/show_bug.cgi?id=1832344

3) automated cherrypick did not work ... I have manually picked it and am about to submit a 4.4 PR

4) it is possible that the fix fully handles the appearance of 'no such file or directory error reading file []' in cluster operator status, but I'll circle back and revisit / clarify once 3) is off and running ...

5) if more is warranted, I'll handle with this bug; otherwise, I'll close this bug as a dup of the cloned bug in 2)

Of course, if folks here disagree with that roadmap, please advise.

Comment 9 Vadim Rutkovsky 2020-05-06 15:59:03 UTC

*** This bug has been marked as a duplicate of bug 1832344 ***