Bug 2177317 - [GSS] OpenShift Data Foundation odf-console plugin stuck in degraded/failed state [NEEDINFO]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: management-console
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bipul Adhikari
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-10 20:56 UTC by khover
Modified: 2023-08-09 16:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-28 12:46:37 UTC
Embargoed:
skatiyar: needinfo? (khover)



Description khover 2023-03-10 20:56:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Upon installing ODF 4.11, the OpenShift Data Foundation operator configuration does not show the Backing Storage page.

There was no pop-up prompting a web console refresh, so the console was refreshed manually. The documentation states:
================
After the operator is successfully installed, a pop-up with a message, Web console update is available appears on the user interface. Click Refresh web console from this pop-up for the console changes to reflect.
===============

Still no Backing Storage Page

odf-console plugin stuck in degraded/failed state

Verified the odf-console pod was running and the plugin was enabled.

# oc get consoles.operator.openshift.io cluster -o jsonpath='{.spec.plugins}{"\n"}'
["odf-console"]

Already tried a hard refresh of the browser and clearing the cache as per [1].
[1] https://access.redhat.com/solutions/6824581

Also tried to patch the console plugin configuration.
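
The exact patch attempted is not recorded in the case; a typical sketch of (re-)enabling the plugin on the console operator would be (illustrative only, not necessarily the patch that was tried):

# oc patch consoles.operator.openshift.io cluster --type=json \
    -p '[{"op": "add", "path": "/spec/plugins/-", "value": "odf-console"}]'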



After installing the StorageCluster via YAML, the install completed successfully.

Still no ODF dashboard. Also, the "Object Buckets" and "Object Bucket Claims" menu entries are not displayed in the "Storage" menu.


Version of all relevant components (if applicable):

ODF 4.11.5


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes, the ODF dashboard and NooBaa functionality are needed.

Is there any workaround available to the best of your knowledge?

Not outside of what has already been tried

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes, repeatedly in the customer environment.


Can this issue be reproduced from the UI?

Yes, repeatedly in the customer environment.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 1 khover 2023-03-10 21:01:03 UTC
supportshell case 03448212 ...

|   IDX |  PRFX  | FILENAME                               |   SIZE (KB) | DATE                 | SOURCE   |   CACHED |
|-------|--------|----------------------------------------|-------------|----------------------|----------|----------|
|     1 |  0010  | ODF_Issue.PNG                          |       39.51 | 2023-02-27 23:18 UTC | S3       |     Yes  |
|     2 |  0020  | must-gather_2023-03-01-1900-UTC.tar.xz |    15880.94 | 2023-03-01 21:58 UTC | S3       |     Yes  |
|     3 |  0030  | odf-plugin-fail.PNG                    |       21.54 | 2023-03-03 19:19 UTC | S3       |     Yes  |
|     4 |  0040  | odf-storage-warning.PNG                |       17.95 | 2023-03-03 19:19 UTC | S3       |     Yes  |
|     5 |  0050  | 2023-03-06-0900-MT_sanitized.tar.xz    |    11015.93 | 2023-03-06 16:39 UTC | S3       |     Yes  |
|     6 |  0060  | AJ ODF Issue.pdf                       |      548.97 | 2023-03-09 20:59 UTC | S3       |     Yes  |
|     7 |  0070  | CS ODF Issue.pdf                       |      549.73 | 2023-03-09 20:59 UTC | S3       |     Yes  |
|     8 |  0080  | odf-operator_failure.PNG               |       18.22 | 2023-03-10 17:49 UTC | S3       |     Yes  |
|     9 |  0090  | 2023-03-03-10-1100-MT.tar.xz           |     2552.44 | 2023-03-10 18:14 UTC | S3       |     Yes  |

Comment 2 Sanjal Katiyar 2023-03-11 05:36:58 UTC
Hi, a few questions (a sketch of commands to gather these follows the list):
1. What is the OCP version ?
2. What is the "basePath" value in "consolePlugin" CR for odf-console ?
3. Is ODF operator install successful ?
4. Is StorageCluster healthy ?
5. Is NooBaa CR healthy/ready as well ?
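
A sketch of commands that would answer 1-5 (resource names assumed to be the ODF defaults):

$ oc version
$ oc get consoleplugin odf-console -o jsonpath='{.spec.service.basePath}{"\n"}'
$ oc get csv -n openshift-storage
$ oc get storagecluster -n openshift-storage
$ oc get noobaa -n openshift-storage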

Also, I can see a bunch of file names in https://bugzilla.redhat.com/show_bug.cgi?id=2177317#c1, but can you please tell me how to find the must-gathers? I am not sure where to find them; could you either paste the link or attach the must-gathers to this BZ as an attachment?

Thanks.

Comment 3 khover 2023-03-11 19:36:44 UTC
Hi,

Answering in line.

1. OCP 4.11 

2. basePath: /. This was patched to /compatibility/ and the plugin went green for about 30 seconds, then reverted to the failed state; the YAML was reverted to basePath: /.

   Attempts were made to clear the cache, refresh manually, restart the plugin pod, and even delete the plugin instance.

3. The ODF operator 4.11.5 was installed, but the "Web console update is available" pop-up never appeared on the user interface.

   Ran oc get pods in the openshift-storage namespace and all the operator pods were running.

4. The StorageCluster is healthy (HEALTH_OK) with 3 OSDs up and in, after installing via YAML to sidestep the inability to get the Backing Storage page.

5. NooBaa is showing the following errors (a sketch of follow-up checks appears after the log snippet):

2023-03-10T17:53:56.361969088Z time="2023-03-10T17:53:56Z" level=info msg="✅ Exists: CephObjectStoreUser \"noobaa-ceph-objectstore-user\"\n"
2023-03-10T17:53:56.361969088Z time="2023-03-10T17:53:56Z" level=info msg="SetPhase: \"Connecting\"" sys=openshift-storage/noobaa
2023-03-10T17:53:56.362099266Z time="2023-03-10T17:53:56Z" level=info msg="Collected addresses: &{NodePorts:[https://172.27.204.71:31844] PodPorts:[https://10.130.4.190:8443] InternalIP:[https://172.30.128.16:443] InternalDNS:[https://noobaa-mgmt.openshift-storage.svc:443] ExternalIP:[] ExternalDNS:[https://noobaa-mgmt-openshift-storage.apps.ocp.redacted.gov]}" func=CheckServiceStatus service=noobaa-mgmt sys=openshift-storage/noobaa
2023-03-10T17:53:56.362107118Z time="2023-03-10T17:53:56Z" level=info msg="✈️  RPC: auth.read_auth() Request: <nil>"
2023-03-10T17:53:59.187063481Z time="2023-03-10T17:53:59Z" level=info msg="RPC: Connecting websocket (0xc001e03680) &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:9 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:53:59.192982534Z time="2023-03-10T17:53:59Z" level=error msg="RPC: closing connection (0xc001e03680) &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:9 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:53:59.192982534Z time="2023-03-10T17:53:59Z" level=warning msg="RPC: RemoveConnection wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ current=0xc001e03680 conn=0xc001e03680"
2023-03-10T17:53:59.192982534Z time="2023-03-10T17:53:59Z" level=error msg="RPC: Reconnect - got error: failed to websocket dial: expected handshake response status code 101 but got 503"
2023-03-10T17:53:59.193031985Z time="2023-03-10T17:53:59Z" level=error msg="⚠️  RPC: auth.read_auth() Call failed: RPC: connection (0xc001e03680) already closed &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:53:59.193031985Z time="2023-03-10T17:53:59Z" level=info msg="SetPhase: temporary error during phase \"Connecting\"" sys=openshift-storage/noobaa
2023-03-10T17:53:59.193031985Z time="2023-03-10T17:53:59Z" level=warning msg="RPC: GetConnection creating connection to wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ 0xc0016221e0"
2023-03-10T17:53:59.193042304Z time="2023-03-10T17:53:59Z" level=info msg="RPC: Reconnect (0xc0016221e0) delay &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:53:59.193088703Z time="2023-03-10T17:53:59Z" level=warning msg="⏳ Temporary Error: RPC: connection (0xc001e03680) already closed &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}" sys=openshift-storage/noobaa
2023-03-10T17:53:59.204702775Z time="2023-03-10T17:53:59Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
2023-03-10T17:54:02.193922860Z time="2023-03-10T17:54:02Z" level=info msg="RPC: Connecting websocket (0xc0016221e0) &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:54:02.199189383Z time="2023-03-10T17:54:02Z" level=error msg="RPC: closing connection (0xc0016221e0) &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
2023-03-10T17:54:02.199189383Z time="2023-03-10T17:54:02Z" level=warning msg="RPC: RemoveConnection wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ current=0xc0016221e0 conn=0xc0016221e0"
2023-03-10T17:54:02.199219970Z time="2023-03-10T17:54:02Z" level=error msg="RPC: Reconnect - got error: failed to websocket dial: expected handshake response status code 101 but got 503"
2023-03-10T17:54:02.199235639Z time="2023-03-10T17:54:02Z" level=warning msg="RPC: GetConnection creating connection to wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ 0xc001622ae0"
2023-03-10T17:54:02.199252366Z time="2023-03-10T17:54:02Z" level=info msg="RPC: Reconnect (0xc001622ae0) delay &{RPC:0xc000481a90 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s cancelPings:<nil>}"
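
A hedged sketch of checks that would narrow down the 503 on the websocket dial (standard resource names assumed; the noobaa-mgmt Service name is taken from the log above):

$ oc get noobaa -n openshift-storage
$ oc get pods -n openshift-storage | grep noobaa
$ oc get endpoints noobaa-mgmt -n openshift-storage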

The must gathers are uploaded to support shell case 03448212.

Please let me know if you do not have access to support shell.

Additionally, we do not have an OCP must-gather, but I can get one generated if needed.

Comment 4 khover 2023-03-13 20:02:10 UTC
Any updates on possible solutions, or is additional data needed?

This is a blocker for the customer.

Comment 5 Sanjal Katiyar 2023-03-14 06:24:57 UTC
I can see the StorageCluster and NooBaa CRs are not in a Ready/Succeeded state; still, that should not affect the UI pod, and the ODF UI should still be visible.
Make sure of these two things (a sketch of checks follows the list):
1. Nodes have sufficient resources for the pod to run (though I guess this is not an issue),
2. This is not caused by https://bugzilla.redhat.com/show_bug.cgi?id=2139785 (i.e., neither IPv4 nor IPv6 is disabled on the nodes).
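
A sketch of how these could be checked (the node name is a placeholder; the sysctl check for point 2 is an assumption based on that BZ involving IPv6 being disabled on nodes):

$ oc adm top nodes
$ oc get nodes -o wide
$ oc debug node/<node-name> -- chroot /host sysctl net.ipv6.conf.all.disable_ipv6   # <node-name> is a placeholder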

If both things are fine, let's connect via screen share; I would like to explore the live cluster.

Comment 7 khover 2023-03-14 11:44:20 UTC
Thanks Sanjal,

That is a big time zone diff :) 

I will reach out to the customer.

I'm sure they will be available, as they are eager to get the ODF dashboard working.

All nodes have plenty of resources (highest consumer is at 50%).

I will try to verify your action item #2.

Comment 8 khover 2023-03-14 19:36:09 UTC
Hi Sanjal,

Info provided by the customer.

1.  We have no known issues with IPv4 communication in the clusters.  All worker and master nodes are on the same layer 2, and our load balancer configuration is correct as far as we know.
2. IPv6 is set to "Automatic" on all worker and master nodes.

I informed them of the meeting you scheduled for 8:30 PM IST, and they will be there.

Comment 9 khover 2023-03-15 21:24:20 UTC
Hi Sanjal,

Customer just updated case with the following:

The following error shows "odf-console-service.openshift-storage.svc.cluster.local" being resolved to a virtual IP on our load balancer, and TCP 9001 connections to that IP being refused:
`Get "https://odf-console-service.openshift-storage.svc.cluster.local:9001/plugin-manifest.json": dial tcp 172.X.X.X:9001: connect: connection refused`
Our team confirmed that IP is not configured to load balance TCP 9001.
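
A hedged sketch of how the in-cluster resolution can be compared against the Service's ClusterIP (the UBI image choice is just an example):

$ oc run dns-test -n openshift-storage --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    getent hosts odf-console-service.openshift-storage.svc.cluster.local
$ oc get svc odf-console-service -n openshift-storage -o jsonpath='{.spec.clusterIP}{"\n"}'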

Comment 10 Sanjal Katiyar 2023-03-16 05:24:18 UTC
(In reply to khover from comment #9)
> Hi Sanjal,
> 
> Customer just updated case with the following:
> 
> The following error shows
> "odf-console-service.openshift-storage.svc.cluster.local" being resolved to
> a virtual IP on our load balancer, and TCP 9001 connections to that IP being
> refused:
> `Get
> "https://odf-console-service.openshift-storage.svc.cluster.local:9001/plugin-
> manifest.json": dial tcp 172.X.X.X:9001: connect: connection refused`
> Our team confirmed that IP is not configured to load balance TCP 9001.

I installed the exact OCP 4.11.27 and ODF 4.11.6 versions and was able to see the ODF UI on my cluster, so this does not seem like an immediate bug in the product.
There is definitely some configuration interfering with the Service for the "odf-console" pod (or, as Kevan pointed out, NooBaa is giving a similar error as well), as it is returning "502 Bad Gateway" instead of the static UI contents.
Investigating further for the root cause or a possible workaround; I will keep this BZ posted.

Comment 12 khover 2023-03-18 11:51:58 UTC
Hello,

OCP must gather is uploaded to support shell.

They have 2 mirrored clusters with the same issue.

The must gathers are uploaded to support shell case 03448212.

Please let me know if you do not have access to support shell.

Also collected the following; if I missed something in these steps, please advise.

 $ oc get svc/odf-console-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: odf-console-serving-cert
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1674099240
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1674099240
  creationTimestamp: "2023-03-16T21:56:57Z"
  labels:
    app: odf-console
  name: odf-console-service
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: Deployment
    name: odf-console
    uid: 53b12b8a-c78e-48b2-b3aa-8698d5031c99
  resourceVersion: "38522260"
  uid: f1b903d0-35a5-4ba6-8042-85ef9a18638c
spec:
  clusterIP: 172.30.X.X
  clusterIPs:
  - 172.30.X.X
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: console-port
    port: 9001
    protocol: TCP
    targetPort: 9001
  selector:
    app: odf-console
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
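
The Service itself looks as expected; a natural next check is whether it has endpoints backing it (a sketch, using the names above):

$ oc get endpoints odf-console-service -n openshift-storage
$ oc get pods -n openshift-storage -l app=odf-console -o wide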

> Ssh to a node (or debug to a node) after getting the pod IP that the service is forwarding to.
ssh core.gov

> Then curl directly the pod IP:<port>
curl -k https://172.30.X.X:9001                                               
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.20.1</center>
</body>
</html>

> pod replies OK, then move over and curl the service IP of the service:<port>
curl -k https://10.128.X.X:9001                                               
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.20.1</center>
</body>
</html>

curl -k https://172.26.X.X:9001
curl: (7) Failed to connect to 172.26.X.X port 9001: Connection refused

> is pod responding ok if you rsh in and curl localhost:<port>? --> application OK
curl -k https://localhost:9001
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.20.1</center>
</body>
</html>

> is pod responding OK from a host node? <pod-network OK>
curl -k https://10.128.X.X:9001                                               
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.20.1</center>
</body>
</html>

> is service forwarding requests to backends from host node? <service OK>
> is service routable from other pods outside namespace? <policy/nftables OK>
curl -k https://odf-console-service.openshift-storage.svc.cluster.local
<html>
  <head>
    ...
  </head>
  <body>
    <div>
      <h1>Application is not available</h1>
      <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>

      <div class="alert alert-info">
        <p class="info">
          Possible reasons you are seeing this page:
        </p>
        <ul>
          <li>
            <strong>The host doesn't exist.</strong>
            Make sure the hostname was typed correctly and that a route matching this hostname exists.
          </li>
          <li>
            <strong>The host exists, but doesn't have a matching path.</strong>
            Check if the URL path was typed correctly and that the route was created using the desired path.
          </li>
          <li>
            <strong>Route and path matches, but all pods are down.</strong>
            Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
          </li>
        </ul>
      </div>
    </div>
  </body>
</html>

> if route exposed; can you hit the route bypassing the loadbalancer? (*haproxy OK --- requires use of parameter --resolve with curl)
No route exposed.
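
A quick check to confirm (the console fetches the plugin manifest through the in-cluster Service, so the absence of a Route for odf-console is expected here):

$ oc get routes -n openshift-storage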

Comment 13 khover 2023-03-20 13:43:54 UTC
Hello Team,

Any updates on this blocker for the customer?

Comment 16 khover 2023-03-22 18:49:31 UTC
Hello Team,

Thanks for all your help on this!

I think we can close this as not a bug.

Will Russell's (OpenShift networking team) notes from the customer call:

After some additional debugging, we checked with a few ndots validation queries and observed that the lookup was succeeding against the upstream LB with the request for odf-console-service.openshift-storage.svc.cluster.local.cs.ocp.services.stamp.tsa.dhs.gov.


We force-resolved with ndots:4 and observed that it did plumb correctly to the internal address.


We then discovered that there was a wildcard entry for *.cs.ocp.services.stamp.tsa.dhs.gov which WAS allowing things to succeed, but was ALSO force-resolving this ndots lookup at the external nameservers behind the external load balancer:


(.cluster.local.cs.ocp.) matches the rule above.


After removing this wildcard entry and creating an entry for .apps.cs.ocp.services.stamp.tsa.dhs.gov, passthrough resolution succeeded, because the pods can no longer resolve against the upstream address (.cluster.local.cs.ocp* does not match *.apps.cs.ocp).
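
The ndots behaviour can be inspected from inside any pod; a sketch follows (the output shown is illustrative, the search list and nameserver differ per cluster). With the Kubernetes default of ndots:5, a name with fewer than five dots, such as odf-console-service.openshift-storage.svc.cluster.local, is first tried with the search suffixes appended, which is what produced the ...cs.ocp... lookup mentioned above:

$ oc rsh -n openshift-storage deploy/odf-console cat /etc/resolv.conf
search openshift-storage.svc.cluster.local svc.cluster.local cluster.local <cluster-search-domain>
nameserver 172.30.0.10
options ndots:5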


We are now able to route correctly through to the backends/service and issues with storage/networking appear resolved.

ODF console is working as expected now.

Creating a KCS article for future reference.

