Created attachment 1685219 [details]
rook-ceph-operator logs

Description of problem:
When deploying OCS 4.3 on a single-stack IPv6 OCP 4.4 cluster, the installation of the operator times out, and some strange error messages appear in the logs of rook-ceph-operator (see the attachment for the full logs). Here is the alarming part:

2020-05-04 20:37:21.564322 I | exec: Running command: ceph orchestrator set backend rook --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/253648401
2020-05-04 20:37:21.849183 I | exec: no valid command found; 10 closest matches:
pg stat
pg getmap
pg dump {all|summary|sum|delta|pools|osds|pgs|pgs_brief [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]}
pg dump_json {all|summary|sum|pools|osds|pgs [all|summary|sum|pools|osds|pgs...]}
pg dump_pools_json
pg ls-by-pool <poolstr> {<states> [<states>...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states> [<states>...]}
pg ls {<int>} {<states> [<states>...]}
pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} {<int>}
Error EINVAL: invalid command

Version-Release number of selected component (if applicable):
OCS 4.3.0 on OCP 4.4, bare metal, single-stack IPv6

How reproducible:
Always; this cluster was re-deployed multiple times with the same result.

Steps to Reproduce:
1. Deploy OCP 4.4 single-stack IPv6 on bare metal.
2. Deploy OCS according to the official documentation (local-storage operator, then OCS).
3. The installation of OCS never completes because rook-ceph-operator never signals completion to ocs-operator, which in turn never reports itself healthy.

Actual results:
The installation stalls and times out.

Expected results:
The installation completes with a functioning OCS cluster.

Additional info:
I believe this is related to the fact that this cluster is running single-stack IPv6. I cannot explain the invalid command in the logs, but I don't see anything related to IPv6 in the generated Ceph config file, and I believe there should be something like "ms bind ipv6" in there. See below for the config generated on a single-stack IPv6 cluster (note the absence of any ms bind parameter):

[global]
fsid = 98a0cc4b-3318-4506-91c0-7b4fd145b00b
mon initial members = a b c d e
mon host = v1:[2001:4958:a:3e00:0:1:3:18e6]:6789,v1:[2001:4958:a:3e00:0:1:3:5fab]:6789,v1:[2001:4958:a:3e00:0:1:3:ac1d]:6789,v1:[2001:4958:a:3e00:0:1:3:ffae]:6789,v1:[2001:4958:a:3e00:0:1:3:f04c]:6789
mon keyvaluedb = rocksdb
mon_allow_pool_delete = true
mon_max_pg_per_osd = 1000
debug default = 0
debug rados = 0
debug mon = 0
debug osd = 0
debug bluestore = 0
debug filestore = 0
debug journal = 0
debug leveldb = 0
filestore_omap_backend = rocksdb
osd pg bits = 11
osd pgp bits = 11
osd pool default size = 1
osd pool default pg num = 100
osd pool default pgp num = 100
rbd_default_features = 3
fatal signal handlers = false

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
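For comparison, on an IPv6 single-stack cluster I would expect the generated [global] section to additionally carry the messenger bind option, roughly like this (this is my expectation of what OCS should generate, not output from any cluster):

[global]
ms bind ipv6 = true
(... remainder of the generated config as above ...)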
- Is it IPv4 or IPv6? It's unclear.
- Severity? I assume name resolution (DNS, etc.) works.
Hi Yaniv,

This is a single-stack IPv6 cluster, but the Ceph configuration generated by OCS does not seem to be IPv6-enabled: it is missing the "ms bind ipv6 = true" parameter in the ceph.conf pasted above. As for the invalid pg command being issued (see the logs), I have no idea, but it could also be related to IPv6.

So, to answer your questions:
* This is IPv6.
* Severity: OCS 4.3 is unusable with an IPv6 clusterNetwork.
(In reply to Boris Deschenes from comment #3)
> This is a single-stack IPv6 cluster, but the Ceph configuration generated
> by OCS does not seem to be IPv6-enabled: it is missing the
> "ms bind ipv6 = true" parameter in the ceph.conf pasted above.
>
> So, to answer your questions:
> * This is IPv6.
> * Severity: OCS 4.3 is unusable with an IPv6 clusterNetwork.

Severity is still not set. I'm happy that you've tested, but IPv6 is indeed not supported by OCS (yet). It would be great if you could gather the logs (via the OCS must-gather utility as well as OCP's) so we can verify that we are actually listening on IPv6 addresses. I'm really not sure we do.
rook-ceph was deployed successfully on this cluster (OCP 4.4, bare metal, IPv6 single-stack).
Boris, severity?
OK, I have set severity to high, since OCS is completely unusable with single-stack IPv6.

Please note that when deploying upstream rook-ceph on this cluster, we encountered the same issue, and it was solved with the following override:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    ms bind ipv6 = true

So it looks like the only issue is the missing "ms bind ipv6 = true". This is the exact same class of problem we are encountering in another part of the platform (CNI, unrelated BZ 1831006): the operators generate IPv4-only configurations, sometimes as simple as binding to 0.0.0.0 instead of ::/0, and that prevents us from progressing with single-stack IPv6.
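For the OCS deployment in this bug, the equivalent workaround would presumably be the same override placed in the openshift-storage namespace (an untested sketch; rook-config-override is the ConfigMap name Rook watches for ceph.conf overrides):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  # OCS runs the Rook cluster in openshift-storage rather than rook-ceph
  namespace: openshift-storage
data:
  config: |
    [global]
    ms bind ipv6 = true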
Travis - please see comment above - is that something we need in Rook?
Yes, we need to make a change in Rook to support setting this configuration automatically. There is an upstream issue tracking this: https://github.com/rook/rook/issues/3850
https://github.com/rook/rook/pull/6283/ is the upstream Rook PR to support IPv6. It adds a new field to the CephCluster CRD called `IPFamily` (https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#ipfamily). This should be supported from the OCS operator (or from the UI) as well.

@Jose @Umanga Any plans/suggestions for supporting this from the OCS operator?
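For illustration, a minimal sketch of a CephCluster spec using the new field, based on the linked CRD documentation (the surrounding fields are placeholder values; only the network section is the point here):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # placeholder values; see the CRD doc for a complete example
  cephVersion:
    image: ceph/ceph:v15
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  network:
    # new field from PR 6283: tells Rook to configure the Ceph
    # daemons for IPv6 instead of the IPv4 default
    ipFamily: "IPv6"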
Rohan, can you update the latest status on this?
I have a PR pending in NooBaa and am about to send one to ocs-operator.

NooBaa PR to change from listening on 127.0.0.1 to listening on localhost (IIUC): https://github.com/noobaa/noobaa-core/pull/6352
OCS operator PR to pass through the full network spec instead of a partial one: pending

There is a Jira for this, as it is a new feature: https://issues.redhat.com/browse/KNIP-1467

I'll close this bug, since it seems unnecessary overhead to track it twice. Please reopen if I am mistaken.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days