Bug 1744560 - RHHI cluster lost API connectivity for some time after graceful shut down of master node
Status: CLOSED DUPLICATE of bug 1751978
Alias: None
Product: Kubernetes-native Infrastructure
Classification: Red Hat
Component: Management
Version: 1.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ohad Levy
QA Contact: Arik Chernetsky
Blocks: 1741265
 
Reported: 2019-08-22 12:32 UTC by Artem Hrechanychenko
Modified: 2020-04-06 13:15 UTC
CC: 9 users

Last Closed: 2019-12-13 11:18:54 UTC




Links:
Github openshift baremetal-runtimecfg pull 18 (closed): "Do not consider unresolvable backends", last updated 2021-01-14 17:46:22 UTC

Description Artem Hrechanychenko 2019-08-22 12:32:09 UTC
Description of problem:

Test of the bare metal host shutdown operation. It consists of two steps:
1) Start node maintenance from the UI/CLI and wait until the NodeMaintenance reaches the "Succeeded" phase
2) Gracefully shut down the host from the UI
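
Step 1 can also be driven from the CLI by creating a NodeMaintenance custom resource. The following is a sketch assuming the node-maintenance-operator's CR shape; the API group/version and the node name are illustrative and may differ between releases:

```yaml
# Sketch of a NodeMaintenance CR (shape assumed from the node-maintenance-operator;
# apiVersion and nodeName here are illustrative, not taken from this report).
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: maintenance-master-0
spec:
  nodeName: master-0              # the master being taken down
  reason: "Graceful shutdown test"
```

Applying this with `oc apply -f` should cordon and drain the node; the CR's `.status.phase` can then be watched until it reports "Succeeded", matching step 1 above.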

Some time after shutting down the node, the cluster is not operable from the UI or via the `oc` CLI:

[cloud-user@rhhi-node-worker-0 dev-scripts]$ oc status
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deploymentconfigs.apps.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get builds.build.openshift.io)

Meanwhile, I am able to log in to another master:

oc status
In project default on server https://api.rhhi-ahrechan-tlv.qe.lab.redhat.com:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 172.30.0.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.


After some time, API connectivity returns:

[cloud-user@rhhi-node-worker-0 dev-scripts]$ oc status
In project default on server https://api.rhhi-ahrechan-tlv.qe.lab.redhat.com:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 172.30.0.1:443 -> 6443



Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-22-043819

How reproducible:
Always

Steps to Reproduce:
1. Deploy RHHI
2. Start maintenance for the bare metal host
3. Shut down the bare metal host
4. Check API connectivity
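
Step 4 can be automated with a small polling helper that retries a probe command until it succeeds or a timeout expires. A minimal POSIX sh sketch; `wait_for` is a hypothetical helper, and the 600s/10s values used against `oc status` are arbitrary choices, not from this report:

```shell
#!/bin/sh
# wait_for TIMEOUT INTERVAL COMMAND...
# Retries COMMAND every INTERVAL seconds; succeeds as soon as COMMAND does,
# fails once TIMEOUT seconds have elapsed without a success.
wait_for() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s"
  return 1
}

# For this report's scenario one would probe the API server, e.g.:
#   wait_for 600 10 oc status
# Self-contained demo with a command that always succeeds:
wait_for 3 1 true   # prints "ready after 0s"
```

The helper deliberately swallows the probe's output so that only the ready/timeout line is reported; dropping the redirection shows the `ServiceUnavailable` errors while the API is down.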

Actual results:
After shutdown: Error from server (ServiceUnavailable): the server is currently unable to handle the request

Expected results:
The cluster remains available the entire time while 1 of the 3 masters is shut down.

Additional info:

Comment 1 Doug Hellmann 2019-08-23 23:00:57 UTC
This seems like an OpenShift issue, rather than a RHHI issue. Does OpenShift support removing one of the masters from a 3-node cluster?

Comment 2 Steven Hardy 2019-09-23 13:24:51 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1751978 looks related, possibly this is the same issue?

Comment 3 Bob Fournier 2019-12-06 16:25:07 UTC
Is this still occurring? If not, can we close this?

Comment 4 Steven Hardy 2019-12-13 11:18:54 UTC
Let's close this; it's either a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1751978 or we didn't get sufficient information to determine the root cause.

*** This bug has been marked as a duplicate of bug 1751978 ***

