Bug 1831693 - OCS 4.3 on OCP 4.4 with single-stack IPv6, install timeouts
Summary: OCS 4.3 on OCP 4.4 with single-stack IPv6, install timeouts
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rohan CJ
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-05 13:55 UTC by Boris Deschenes
Modified: 2023-09-15 00:31 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-24 04:54:30 UTC
Embargoed:


Attachments
rook-ceph-operator logs (11.05 KB, text/plain)
2020-05-05 13:55 UTC, Boris Deschenes


Links
Github rook/rook pull 6283 (closed): ceph: support IPv6 single-stack (last updated 2021-01-29 17:31:09 UTC)

Description Boris Deschenes 2020-05-05 13:55:35 UTC
Created attachment 1685219 [details]
rook-ceph-operator logs

Description of problem:
When deploying OCS 4.3 on a single-stack IPv6 OCP 4.4, the installation of the operator times out and some strange error messages appear in the logs of rook-ceph-operator (see attachment for the full logs):

here is the alarming part:
2020-05-04 20:37:21.564322 I | exec: Running command: ceph orchestrator set backend rook --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/253648401
2020-05-04 20:37:21.849183 I | exec: no valid command found; 10 closest matches:
pg stat
pg getmap
pg dump {all|summary|sum|delta|pools|osds|pgs|pgs_brief [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]}
pg dump_json {all|summary|sum|pools|osds|pgs [all|summary|sum|pools|osds|pgs...]}
pg dump_pools_json
pg ls-by-pool <poolstr> {<states> [<states>...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls {<int>} {<states> [<states>...]}
pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} {<int>}
Error EINVAL: invalid command

Version-Release number of selected component (if applicable):
OCS 4.3.0 on OCP 4.4 bare-metal single-stack IPv6

How reproducible:
re-deployed this cluster multiple times, always the same result

Steps to Reproduce:
1. deploy OCP 4.4 single-stack IPv6 on bare-metal
2. deploy OCS according to the official doc (local-storage operator, then OCS)
3. installation of OCS never completes because the rook-ceph-operator never signals completion to the ocs-operator, which in turn never reports that it is healthy

Actual results:
installation stalls and timeout

Expected results:
installation completes with a functioning OCS cluster

Additional info:
I believe this is related to the fact that this cluster is running single-stack IPv6. I cannot explain the invalid command in the logs, but I don't see anything related to IPv6 in the ceph config file, and I believe there should be something like "ms bind ipv6" in there. See below for the config generated on a single-stack IPv6 cluster (note the absence of any "ms bind" parameter):

[global]
fsid                     = 98a0cc4b-3318-4506-91c0-7b4fd145b00b
mon initial members      = a b c d e
mon host                 = v1:[2001:4958:a:3e00:0:1:3:18e6]:6789,v1:[2001:4958:a:3e00:0:1:3:5fab]:6789,v1:[2001:4958:a:3e00:0:1:3:ac1d]:6789,v1:[2001:4958:a:3e00:0:1:3:ffae]:6789,v1:[2001:4958:a:3e00:0:1:3:f04c]:6789
mon keyvaluedb           = rocksdb
mon_allow_pool_delete    = true
mon_max_pg_per_osd       = 1000
debug default            = 0
debug rados              = 0
debug mon                = 0
debug osd                = 0
debug bluestore          = 0
debug filestore          = 0
debug journal            = 0
debug leveldb            = 0
filestore_omap_backend   = rocksdb
osd pg bits              = 11
osd pgp bits             = 11
osd pool default size    = 1
osd pool default pg num  = 100
osd pool default pgp num = 100
rbd_default_features     = 3
fatal signal handlers    = false
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
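
For what it's worth, a minimal sketch of what I would expect the IPv6-enabled [global] section to contain, based on the Ceph messenger "ms bind" options (this is my assumption; I have not verified it is everything OCS would need):

[global]
# assumed fix: bind Ceph daemons to IPv6 and disable the IPv4 binding
ms bind ipv6 = true
ms bind ipv4 = false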

Comment 2 Yaniv Kaul 2020-05-06 08:07:39 UTC
- Is it IPv4 or IPv6? It's unclear.
- Severity?

I assume resolving works (DNS, etc.).

Comment 3 Boris Deschenes 2020-05-06 13:35:12 UTC
Hi Yaniv,

This is for a single-stack IPv6 cluster, but the ceph configuration generated by OCS does not seem to be IPv6-enabled (it is missing the "ms bind ipv6 = true" config parameter in the ceph conf file pasted above). As for the invalid pg command being issued (see logs), I have no idea, but it could also be related to IPv6.

so to answer your questions:
* this is for IPv6
* severity: OCS 4.3 is unusable with an IPv6 clusterNetwork

Comment 4 Yaniv Kaul 2020-05-06 14:44:57 UTC
(In reply to Boris Deschenes from comment #3)
> Hi Yaniv,
> 
> This is for a single-stack IPv6 cluster, but the ceph configuration
> generated by OCS does not seem to be IPv6-enabled (missing the "ms bind ipv6
> = true" config parameter in the ceph conf file (pasted above).  As for the
> invalid pg command being issued (see logs), I have no idea but it could also
> be related to IPv6.
> 
> so to answer your questions:
> * this is for IPv6
> * severity: OCS 4.3 is unusable with an IPv6 clusterNetwork

Severity is still not set.
I'm happy that you've tested, but IPv6 is indeed not supported for OCS (yet).
It'd be great if you could gather the logs (via the ocs must-gather utility as well as OCP's) so we can confirm we are actually listening on IPv6 addresses. I'm really unsure whether we do...

Comment 5 Boris Deschenes 2020-05-07 12:52:38 UTC
rook-ceph was deployed successfully on this cluster (OCP 4.4 bare metal IPv6 single stack)

Comment 6 Yaniv Kaul 2020-05-07 14:59:21 UTC
Boris, severity?

Comment 7 Boris Deschenes 2020-05-11 13:39:32 UTC
OK, I have set severity to high, since OCS is completely unusable with single-stack IPv6.

Please note that when deploying rook-ceph on this cluster, we encountered the same issue, and it was solved with the following:

apiVersion: v1
data:
  config: |
    [global]
    ms bind ipv6 = true
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph

so it looks like the only issue was the missing "ms bind ipv6 = true", which is the exact same issue we're encountering in another part of the platform (CNI, unrelated BZ 1831006): basically, the operators generate IPv4-only configurations, sometimes as simple as binding to 0.0.0.0 instead of ::/0, and that prevents us from progressing with single-stack IPv6.
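
In case it helps anyone else, a sketch of the equivalent override for OCS would presumably be the same rook-config-override ConfigMap created in the openshift-storage namespace (I have only verified this workaround on plain rook-ceph, not on OCS itself):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    # assumed: same messenger setting as the rook-ceph workaround above
    ms bind ipv6 = true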

Comment 8 Yaniv Kaul 2020-05-11 15:18:10 UTC
Travis - please see comment above - is that something we need in Rook?

Comment 9 Travis Nielsen 2020-05-11 16:52:50 UTC
Yes, we need to make a change in Rook to support setting this configuration automatically. There is an upstream issue tracking this.
https://github.com/rook/rook/issues/3850

Comment 10 Santosh Pillai 2021-01-12 05:14:40 UTC
https://github.com/rook/rook/pull/6283/ is the upstream PR in rook to support IPv6. We added a new field to the ceph cluster CRD called `IPFamily` (https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#ipfamily). 
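
For reference, a minimal sketch of how the new field is set on the upstream CephCluster CR (field name and value taken from the linked ceph-cluster-crd.md; the downstream StorageCluster wiring is what still needs to be decided):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    # single-stack IPv6; Rook then generates the corresponding Ceph network settings
    ipFamily: "IPv6"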

This should be supported from the OCS operator (or from the UI) as well.

@Jose @Umanga Any plans/suggestions to support this from OCS operator?

Comment 13 Travis Nielsen 2021-05-17 15:46:35 UTC
Rohan, can you update the latest status on this?

Comment 14 Rohan CJ 2021-05-24 04:54:30 UTC
I have a PR pending in NooBaa and am about to send one to ocs-operator.

NooBaa PR to change from listening on 127.0.0.1 to listening on localhost (IIUC): https://github.com/noobaa/noobaa-core/pull/6352
OCS Operator PR to pass through full network spec instead of partial: pending

There is a Jira for this as it is a new feature: https://issues.redhat.com/browse/KNIP-1467
I'll close this as it seems unnecessary overhead to track it twice.

Please reopen if I am mistaken.

Comment 15 Red Hat Bugzilla 2023-09-15 00:31:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

