Bug 1979293

Summary: noobaa health check failed after running the node replacement test
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Itzhak <ikave>
Component: rook
Assignee: Sébastien Han <shan>
Status: VERIFIED
QA Contact: Ben Eli <belimele>
Severity: high
Priority: unspecified
Version: 4.8
CC: belimele, bniver, ebenahar, mbenjamin, mbukatov, muagarwa, nberry, shan
Target Milestone: ---
Keywords: Automation, Regression
Target Release: OCS 4.8.0
Flags: shan: needinfo? (mbenjamin)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.8.0-452.ci
Doc Type: No Doc Update
Type: Bug

Description Itzhak 2021-07-05 13:52:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When running the node replacement test, the noobaa health check in the sanity check at the end of the test found the default backingstore in an error state:
'Connection name already exists: noobaa-default-backing-store'.


Version of all relevant components (if applicable):
OCS 4.8, OCP 4.8, vSphere, LSO cluster.
All relevant component versions can be found here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-vm-lso48/ikave-vm-lso48_20210630T112148/logs/test_report_1625120299.html

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. The default backingstore is in an error state after the node replacement process.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2. 

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
Not sure.

If this is a regression, please provide more details to justify this:
Yes. I did not face this issue in OCS 4.7.

Steps to Reproduce:
1. Deploy a cluster with the following configuration: vSphere, LSO, OCP 4.8, OCS 4.8.
2. Run the node replacement test.



Actual results:
At the end of the test, the noobaa health check failed with the error: "Cluster health is NOT OK"

Expected results:
At the end of the test, the noobaa health check should pass without any errors.


Additional info:

I got this error in the sanity check at the end of the PR validation https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4247/.
This is the failed test: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4247/testReport/tests.manage.z_cluster.nodes.test_node_replacement_proactive/TestNodeReplacement/test_nodereplacement_proactive/. 

The test is located here: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_node_replacement_proactive.py
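
For manual verification, a minimal sketch of the checks the sanity/health check performs (assuming the default openshift-storage namespace and the standard OCS resource names):

# Overall NooBaa health, using a noobaa CLI that matches the cluster version
noobaa status -n openshift-storage

# Phase of all backing stores; the default one should be Ready, not stuck in Creating/Rejected
oc get backingstore -n openshift-storage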

Comment 3 Raz Tamir 2021-07-07 07:13:12 UTC
Based on the description, the test used to pass, so this is not new functionality.
If the automation framework changed, this will require re-evaluation.

Comment 10 Danny 2021-07-08 11:46:58 UTC
Hi Sebastien, 

We encountered a scenario where the ceph-object-store-user access_key (for the noobaa default backing store) is updated in the secret after a node replacement. We currently do not support updating the underlying resource of a backing store in noobaa-core, and the backing store is failing due to this change.

Is this new behavior in rook/rgw? The test seems to have passed before, but now it fails.

Comment 12 Sébastien Han 2021-07-08 13:23:37 UTC
Hum, are you saying that the access_key changed? Can we go into the toolbox and run 'radosgw-admin user info --uid=<noobaa user>'?
Thanks

Comment 13 Danny 2021-07-08 14:43:31 UTC
The output of radosgw-admin is below.
It looks like there are many keys for the user. The one stored in noobaa-core is WHFNMBA9NZCNYW8XK0ED; this is the original key that was set in the secret.
The secret now contains the key 2S9I9CHR09QL1PB1A3BV.

sh-4.4# radosgw-admin user info --uid='noobaa-ceph-objectstore-user'
{
    "user_id": "noobaa-ceph-objectstore-user",
    "display_name": "my display name",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "2S9I9CHR09QL1PB1A3BV",
            "secret_key": "XhzXpfTF4YXLDmQI0TLwgRaquAdzWJjRUVha5FM7"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "67TRKIAADTLLTV11V7B8",
            "secret_key": "mAlPALmEbq0K8AIDTaDPKhit46kxoTsuH943Y54s"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "6SR53EYLE81FWG6F5SUS",
            "secret_key": "Jve2hIZA9yr4Fb4lIPXA1942MIC7GHEC0z1Vbu3F"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "93L4C26ARDXNEEYPNEJF",
            "secret_key": "Cl2TPvP05NAqCleYV8X3D8XRtDR4KKcyCbLjBM6a"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "KVTEQX4UCYAP8M64WNBW",
            "secret_key": "QKnjz8tAW5ADsvEJ8BU4jT7SD6GxhARYFkuHC1DL"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "L0Z8C3L5G2Y18V8939U3",
            "secret_key": "rjwmQ6ztCCuchDQKy6RyW7kCycuRAdD3MBZWrrFZ"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "PGU1ZRHK9U9Q1AKPVO0K",
            "secret_key": "30QMNZhvEQEP4OhcHb7ytO3DJHbLNoYFDdw3tYZt"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "PLH66N2Q71Q74ODUNV7J",
            "secret_key": "9y6DA1wQq3Z5LTzXeG2MRSODsj5SDdB2dDgdnIPo"
        },
        {
            "user": "noobaa-ceph-objectstore-user",
            "access_key": "WHFNMBA9NZCNYW8XK0ED",
            "secret_key": "lzIBaFBULuutWU0qJfDMygVEySgsHuRRwnR1vpdS"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

Comment 14 Sébastien Han 2021-07-08 16:35:11 UTC
Hum, OK, that's what I thought, but regardless, the key is still valid. I'm not sure what the issue is exactly.

Is it that noobaa has saved the access_key WHFNMBA9NZCNYW8XK0ED internally but reads the secret to find the secret_key?

Can someone describe in more detail what the problem is?

Danny?

Comment 15 Alexander Indenbaum 2021-07-08 18:55:25 UTC
@Sébastien, nice to bump into you again.

This is the flow (a diagnostic sketch follows the list):

1. The S3 access_key of Rook's ceph object user, used by the noobaa backing store, is updated.
2. noobaa core still holds the old S3 credentials, as reflected by "noobaa api system read_system".
3. During the backing store reconcile cycle, the noobaa operator fetches the system info ("noobaa api system") from noobaa core and the updated access_key of Rook's ceph object user from the k8s secret, and detects this state as a new S3 connection.
4. The noobaa operator tries to register the new connection with noobaa core. noobaa core matches external connections by name, i.e. noobaa-default-backing-store, and therefore returns the RPC error 'Connection name already exists: noobaa-default-backing-store'.
5. The noobaa backing store gets stuck in the "Creating" phase, never reaching readiness.
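
A minimal diagnostic sketch for steps 1-3, assuming the openshift-storage namespace and that the Rook object-user secret stores the key under the "AccessKey" data field (both assumptions, adjust to your environment):

# Access key currently published in the Rook-generated secret
oc get secret rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user \
    -n openshift-storage -o jsonpath='{.data.AccessKey}' | base64 -d; echo

# Credentials noobaa core still holds, as reflected by the system API (step 2 above)
noobaa api system read_system -n openshift-storage | grep access_key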

Comment 16 Danny 2021-07-08 18:58:38 UTC
We need to distinguish between the backing store, which is a CRD representing a storage resource in noobaa, and the cloud resource in noobaa, which is connected to the cloud (or RGW in this case) and uses it as storage.
The cloud resource will continue to function as long as the credentials are valid, and it is probably still functioning in this case.

The problem is that the backing store is stuck in the "Creating" phase. This is because the noobaa operator sees different credentials in the secret and assumes this is a new connection. It then tries to create a new connection but uses the same name (the backing store name). noobaa core returns an error that the connection already exists, and the operator bails out of the reconcile.

The main trigger for this bug is the change in the secret. I think we first want to understand what causes this change, so we know how common it can be.
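
A quick way to observe this state from the CLI (a sketch; the namespace and the noobaa-operator deployment name are assumptions):

# Backing store phase, stuck in "Creating"
oc get backingstore noobaa-default-backing-store -n openshift-storage -o jsonpath='{.status.phase}'; echo

# The RPC error surfaced during reconcile
oc logs deploy/noobaa-operator -n openshift-storage | grep 'Connection name already exists'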

Comment 17 Sébastien Han 2021-07-09 13:45:49 UTC
OK, thanks Danny and Alexander for the input. I logged into the environment and I have evidence that our Rook code did this.
For reference, the RGW logs show 9 user creations, just like we have 9 keys...

2021-06-17 12:45:35.722 7f2990fc1700  1 beast: 0x7f29de078660: 10.129.2.12 - rgw-admin-ops-user [17/Jun/2021:12:45:35.714 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 692 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-20 14:43:19.793 7f2995fcb700  1 beast: 0x7f29de078660: 10.128.3.222 - rgw-admin-ops-user [20/Jun/2021:14:43:19.785 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 824 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-20 14:43:20.002 7f28c3e27700  1 beast: 0x7f29de078660: 10.128.3.222 - rgw-admin-ops-user [20/Jun/2021:14:43:19.995 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 956 - "Go-http-client/1.1" - latency=0.00700027s
2021-06-23 06:44:27.356 7f2930700700  1 beast: 0x7f29de078660: 10.131.1.119 - rgw-admin-ops-user [23/Jun/2021:06:44:27.347 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1088 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-23 06:44:27.591 7f28fde9b700  1 beast: 0x7f29de078660: 10.131.1.119 - rgw-admin-ops-user [23/Jun/2021:06:44:27.574 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1220 - "Go-http-client/1.1" - latency=0.0160006s
2021-06-27 07:46:52.838 7f2981fa3700  1 beast: 0x7f29de078660: 10.128.3.53 - rgw-admin-ops-user [27/Jun/2021:07:46:52.831 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1352 - "Go-http-client/1.1" - latency=0.00700027s
2021-07-01 10:49:23.251 7f2934f09700  1 beast: 0x7f29de078660: 10.129.2.61 - rgw-admin-ops-user [01/Jul/2021:10:49:23.237 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1484 - "Go-http-client/1.1" - latency=0.0140013s
2021-07-01 10:49:23.615 7f299a7d4700  1 beast: 0x7f29de078660: 10.129.2.61 - rgw-admin-ops-user [01/Jul/2021:10:49:23.607 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1616 - "Go-http-client/1.1" - latency=0.00800075s
2021-07-04 09:33:01.109 7fd9c0bc7700  1 beast: 0x7fda7fd62660: 10.129.2.47 - rgw-admin-ops-user [04/Jul/2021:09:33:01.090 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1748 - "Go-http-client/1.1" - latency=0.0180017s

All of them were done by the Rook Operator.
BUT the other 8 are older, and only the last one corresponds to the creation of the CR, which happened here:

2021-07-04 09:33:01.087450 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-07-04 09:33:01.111073 I | ceph-object-store-user-controller: created ceph object user "noobaa-ceph-objectstore-user"
2021-07-04 09:33:01.144674 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-07-04 09:33:01.144695 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"

Now the secret rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user as well as the CR are indeed 22 days old. So the secret has been updated regularly with new keys.

So 2021-07-04 is the key date. Also, the operator pod seems to have been running since Sun, 04 Jul 2021, this date again :) So it is only 5 days old, while the older pods are 22 days old.
So I believe we have lost the logs from the previous operator instance?

So far this looks like a Rook bug, so I'm moving it to Rook. It looks like a new key gets created each time the operator restarts.
Something bugs me though: on Ceph Pacific (16.2.4), when I call the admin ops API repeatedly to create a user, I get an HTTP 409 saying the user already exists, which is what we want.

On Nautilus 14.2.11-181.el8cp it looks like we get a 200 each time.
So I'm thinking the bug is in Ceph; moving to Ceph for investigation by the RGW team.
A local repro with 14.2.22 gives a bunch of 200s as well.
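
For reference, a small sketch to confirm the key accumulation from the toolbox pod (assuming jq is available there; otherwise count the entries by hand):

# Each 200 PUT to /admin/user appended another key pair, so the count should
# match the number of user-creation requests seen in the RGW access log.
radosgw-admin user info --uid='noobaa-ceph-objectstore-user' | jq '.keys | length'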


Matt, PTAL ASAP: has the admin ops API changed from Nautilus to Pacific? (I know it's a big jump...)
If the fix is not backportable in time, I'll try to work around it in Rook in parallel.

Comment 18 Sébastien Han 2021-07-09 14:10:23 UTC
OK, I'm moving this back to Rook since a simple fix is possible there.

Comment 19 Michael Adam 2021-07-09 14:34:44 UTC
Re-adding lost acks.

Comment 23 Sébastien Han 2021-07-26 13:05:56 UTC
*** Bug 1983192 has been marked as a duplicate of this bug. ***