Bug 1832193 - incomprehensible error message from samples operator
Summary: incomprehensible error message from samples operator
Keywords:
Status: CLOSED DUPLICATE of bug 1832344
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-06 10:14 UTC by Oleg Bulatov
Modified: 2020-05-06 15:59 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-06 15:59:03 UTC
Target Upstream Version:
Embargoed:


Description Oleg Bulatov 2020-05-06 10:14:14 UTC
Description of problem:

The samples operator is found to be degraded with a confusing message.

Version-Release number of selected component (if applicable):

4.4.3

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

    "type": "Degraded",
    "reason": "APIServerError",
    "message": "open : no such file or directory error reading file [];"

Expected results:

A message that helps a cluster admin understand what is broken.

Additional info:

Comment 2 Lalatendu Mohanty 2020-05-06 10:38:45 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression?
  No, it’s always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 4 Lalatendu Mohanty 2020-05-06 11:11:01 UTC
Removing the upgrade blocker, as the underlying issue may be the etcd error we are seeing alongside the samples operator's degraded status.

{
        "type": "Degraded",
        "status": "True",
        "lastTransitionTime": "2020-05-06T02:38:46Z",
        "reason": "APIServerDeployment_UnavailablePod::EncryptionKeyController_Error::OpenshiftAPIServerStaticResources_SyncError::Workload_SyncError",
        "message": "EncryptionKeyControllerDegraded: etcdserver: leader changed\nWorkloadDegraded: \"image-import-ca\": etcdserver: leader changed\nWorkloadDegraded: \"deployments\": invalid dependency reference: \"etcdserver: leader changed\"\nWorkloadDegraded: \nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/ns.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/svc.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
      }, 

The investigation is still going on and a bug will be filed soon. 

Another bug has also been created to make the insights operator grab events from openshift-etcd-operator when the openshift apiserver is degraded: https://bugzilla.redhat.com/show_bug.cgi?id=1832220
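
The Degraded condition quoted above packs several per-controller messages into one string separated by `\n`. As a reading aid, here is a small Go sketch that decodes one such condition and splits the message into lines. The `condition` struct and the `degradedLines` helper are hypothetical and mirror only the fields shown in the excerpt; the sample input in `main` is shortened from it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// condition mirrors only the fields visible in the status excerpt.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// degradedLines returns the non-empty message lines of a condition
// when it is Degraded=True, and nil for any other condition.
func degradedLines(raw []byte) ([]string, error) {
	var c condition
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, fmt.Errorf("decoding condition: %w", err)
	}
	if c.Type != "Degraded" || c.Status != "True" {
		return nil, nil
	}
	var lines []string
	for _, l := range strings.Split(c.Message, "\n") {
		if l = strings.TrimSpace(l); l != "" {
			lines = append(lines, l)
		}
	}
	return lines, nil
}

func main() {
	// Shortened copy of the condition quoted in this comment.
	raw := []byte(`{"type":"Degraded","status":"True","reason":"Workload_SyncError","message":"EncryptionKeyControllerDegraded: etcdserver: leader changed\nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"}`)
	lines, err := degradedLines(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Read this way, most lines in the quoted condition carry the same cause, `etcdserver: leader changed`.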

Comment 5 Vadim Rutkovsky 2020-05-06 11:55:24 UTC
operator logs:

2020-05-06T11:23:09.278800618Z time="2020-05-06T11:23:09Z" level=info msg="updated template rhdm75-optaweb-employee-rostering"
2020-05-06T11:23:09.303330589Z time="2020-05-06T11:23:09Z" level=info msg="updated template amq63-ssl"
2020-05-06T11:23:09.303421715Z time="2020-05-06T11:23:09Z" level=info msg="CRDERROR event temp udpate err"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="open : no such file or directory error reading file []"
2020-05-06T11:23:09.306865727Z time="2020-05-06T11:23:09Z" level=info msg="CRDUPDATE event temp udpate err"

Comment 6 Adam Kaplan 2020-05-06 13:32:02 UTC
Lowering severity on this to "medium". The Samples operator is degraded because the openshift apiserver is degrading/returning errors. That said, the message reported to the clusteroperator (and the logs) is not helpful to a cluster admin.

Comment 8 Gabe Montero 2020-05-06 15:46:12 UTC
An intermediate update:

1) https://bugzilla.redhat.com/show_bug.cgi?id=1828065 was fixed in 4.5 a week ago and to a large degree addresses what I think was uncovered while I was out

2) I have just cloned that bug to 4.4.z ... the 4.4.z bug is https://bugzilla.redhat.com/show_bug.cgi?id=1832344

3) the automated cherry-pick did not work ... I have manually picked it and am about to submit a 4.4 PR

4) it is possible that fix fully handles the appearance of 'no such file or directory error reading file []' in cluster operator status, but I'll circle back and revisit / clarify once 3) is off and running ... 

5) if more is warranted, I'll handle with this bug; otherwise, I'll close this bug as a dup of the cloned bug in 2)

Of course if folks here disagree with that roadmap, please advise

Comment 9 Vadim Rutkovsky 2020-05-06 15:59:03 UTC

*** This bug has been marked as a duplicate of bug 1832344 ***

