Bug 1979293
| Summary: | noobaa health check failed after running the node replacement test | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Itzhak <ikave> |
| Component: | rook | Assignee: | Sébastien Han <shan> |
| Status: | VERIFIED --- | QA Contact: | Ben Eli <belimele> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | belimele, bniver, ebenahar, mbenjamin, mbukatov, muagarwa, nberry, shan |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Target Release: | OCS 4.8.0 | Flags: | shan: needinfo? (mbenjamin) |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.8.0-452.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Itzhak, 2021-07-05 13:52:04 UTC)
Based on the description, the test used to pass, so there is no new functionality involved. If the automation framework changed, this will require re-evaluation.

Hi Sébastien, we encountered a scenario where the ceph-object-store-user access_key (for the noobaa default backing store) is updated in the secret after a node replacement. We currently do not support updating the backing store's underlying resource in noobaa-core, and the backing store is failing due to this change. Is this new behavior in rook/rgw? The test seems to have passed before, but now it fails.

Hum, are you saying that the access_key changed? Can we go into the toolbox and run 'radosgw-admin user info --uid=<noobaa user>'? Thanks.

The output of radosgw-admin is below. It looks like there are many keys for the user. The one stored in noobaa-core is WHFNMBA9NZCNYW8XK0ED; this is the original key that was set in the secret. The secret now contains the key 2S9I9CHR09QL1PB1A3BV.
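As a cross-check, the key currently held by the Rook-managed secret can be compared against what noobaa-core reports. This is a minimal sketch: the secret data key name `AccessKey` and the grep over the `read_system` output are assumptions, not confirmed by this bug.

```sh
# Read the access key currently held by the Rook-managed object user secret
# (data key name "AccessKey" is an assumption; adjust if the secret uses a different field)
oc -n openshift-storage get secret \
  rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user \
  -o jsonpath='{.data.AccessKey}' | base64 -d; echo

# Compare with the credentials noobaa-core still holds (output format may vary per version)
noobaa -n openshift-storage api system read_system | grep -i access_key
```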
    sh-4.4# radosgw-admin user info --uid='noobaa-ceph-objectstore-user'
    {
        "user_id": "noobaa-ceph-objectstore-user",
        "display_name": "my display name",
        "email": "",
        "suspended": 0,
        "max_buckets": 1000,
        "subusers": [],
        "keys": [
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "2S9I9CHR09QL1PB1A3BV",
                "secret_key": "XhzXpfTF4YXLDmQI0TLwgRaquAdzWJjRUVha5FM7"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "67TRKIAADTLLTV11V7B8",
                "secret_key": "mAlPALmEbq0K8AIDTaDPKhit46kxoTsuH943Y54s"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "6SR53EYLE81FWG6F5SUS",
                "secret_key": "Jve2hIZA9yr4Fb4lIPXA1942MIC7GHEC0z1Vbu3F"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "93L4C26ARDXNEEYPNEJF",
                "secret_key": "Cl2TPvP05NAqCleYV8X3D8XRtDR4KKcyCbLjBM6a"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "KVTEQX4UCYAP8M64WNBW",
                "secret_key": "QKnjz8tAW5ADsvEJ8BU4jT7SD6GxhARYFkuHC1DL"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "L0Z8C3L5G2Y18V8939U3",
                "secret_key": "rjwmQ6ztCCuchDQKy6RyW7kCycuRAdD3MBZWrrFZ"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "PGU1ZRHK9U9Q1AKPVO0K",
                "secret_key": "30QMNZhvEQEP4OhcHb7ytO3DJHbLNoYFDdw3tYZt"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "PLH66N2Q71Q74ODUNV7J",
                "secret_key": "9y6DA1wQq3Z5LTzXeG2MRSODsj5SDdB2dDgdnIPo"
            },
            {
                "user": "noobaa-ceph-objectstore-user",
                "access_key": "WHFNMBA9NZCNYW8XK0ED",
                "secret_key": "lzIBaFBULuutWU0qJfDMygVEySgsHuRRwnR1vpdS"
            }
        ],
        "swift_keys": [],
        "caps": [],
        "op_mask": "read, write, delete",
        "default_placement": "",
        "default_storage_class": "",
        "placement_tags": [],
        "bucket_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        },
        "temp_url_keys": [],
        "type": "rgw",
        "mfa_ids": []
    }
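For quick triage, the same toolbox session can count and list the accumulated keys. A minimal sketch, assuming jq is available in the toolbox image; otherwise the keys array can be inspected by eye as above.

```sh
# Count the S3 key pairs that have accumulated on the NooBaa object store user
radosgw-admin user info --uid='noobaa-ceph-objectstore-user' | jq '.keys | length'

# List only the access keys, to match them against the secret and noobaa-core
radosgw-admin user info --uid='noobaa-ceph-objectstore-user' | jq -r '.keys[].access_key'
```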
Hum, ok, that's what I thought, but regardless, the key is always valid. I'm not sure what the issue is exactly. Is it that noobaa has saved the access_key WHFNMBA9NZCNYW8XK0ED internally but reads the secret to find the secret_key? Can someone describe in more detail what the problem is? Danny?

@Sébastien, nice to bump into you again. This is the flow:

1. The s3 access_key of Rook's ceph object user used by the noobaa backing store is updated.
2. noobaa-core still holds the old s3 credentials, as reflected by "noobaa api system read_system".
3. During the backing store reconcile cycle, the noobaa operator fetches "noobaa api system" from noobaa-core and the updated access_key of Rook's ceph object user from the k8s secret, and detects this state as a new S3 connection.
4. The noobaa operator tries to register the new connection with noobaa-core. noobaa-core matches external connections by name, i.e. noobaa-default-backing-store, and returns the RPC error 'Connection name already exists: noobaa-default-backing-store'.
5. The noobaa backing store gets stuck in the "Creating" phase and never reaches readiness.

We need to distinguish between the backing store, which is a CRD representing a storage resource in noobaa, and the cloud resource in noobaa, which is connected to the cloud (or to RGW in this case) and uses it as storage. The cloud resource will continue to function as long as the credentials are valid, and it is probably functioning in this case as well. The problem is the phase of the backing store, which is stuck in "Creating". This happens because the noobaa operator sees different credentials in the secret and assumes this is a new connection. It then tries to create a new connection but uses the same name (the backing store name). noobaa-core returns an error that the connection already exists, and the operator bails out of the reconcile. The main trigger for this bug is the change in the secret; I think we first want to understand what causes this change, so we can know how common it can be.

Ok, thanks Danny and Alexander for the inputs. I logged into the env and I have evidence that our Rook code did this. For reference, the RGW logs show 9 user creations, just like we have 9 keys:
2021-06-17 12:45:35.722 7f2990fc1700 1 beast: 0x7f29de078660: 10.129.2.12 - rgw-admin-ops-user [17/Jun/2021:12:45:35.714 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 692 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-20 14:43:19.793 7f2995fcb700 1 beast: 0x7f29de078660: 10.128.3.222 - rgw-admin-ops-user [20/Jun/2021:14:43:19.785 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 824 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-20 14:43:20.002 7f28c3e27700 1 beast: 0x7f29de078660: 10.128.3.222 - rgw-admin-ops-user [20/Jun/2021:14:43:19.995 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 956 - "Go-http-client/1.1" - latency=0.00700027s
2021-06-23 06:44:27.356 7f2930700700 1 beast: 0x7f29de078660: 10.131.1.119 - rgw-admin-ops-user [23/Jun/2021:06:44:27.347 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1088 - "Go-http-client/1.1" - latency=0.00800031s
2021-06-23 06:44:27.591 7f28fde9b700 1 beast: 0x7f29de078660: 10.131.1.119 - rgw-admin-ops-user [23/Jun/2021:06:44:27.574 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1220 - "Go-http-client/1.1" - latency=0.0160006s
2021-06-27 07:46:52.838 7f2981fa3700 1 beast: 0x7f29de078660: 10.128.3.53 - rgw-admin-ops-user [27/Jun/2021:07:46:52.831 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1352 - "Go-http-client/1.1" - latency=0.00700027s
2021-07-01 10:49:23.251 7f2934f09700 1 beast: 0x7f29de078660: 10.129.2.61 - rgw-admin-ops-user [01/Jul/2021:10:49:23.237 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1484 - "Go-http-client/1.1" - latency=0.0140013s
2021-07-01 10:49:23.615 7f299a7d4700 1 beast: 0x7f29de078660: 10.129.2.61 - rgw-admin-ops-user [01/Jul/2021:10:49:23.607 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1616 - "Go-http-client/1.1" - latency=0.00800075s
2021-07-04 09:33:01.109 7fd9c0bc7700 1 beast: 0x7fda7fd62660: 10.129.2.47 - rgw-admin-ops-user [04/Jul/2021:09:33:01.090 +0000] "PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 200 1748 - "Go-http-client/1.1" - latency=0.0180017s

All of them were done by the Rook operator. But the first eight are old, and only the last one corresponds to the creation logged by the operator here:

2021-07-04 09:33:01.087450 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-07-04 09:33:01.111073 I | ceph-object-store-user-controller: created ceph object user "noobaa-ceph-objectstore-user"
2021-07-04 09:33:01.144674 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-07-04 09:33:01.144695 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"

Now, the secret rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user as well as the CR are indeed 22 days old, so the secret has been updated regularly with new keys.
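To correlate the user creations with operator restarts from the cluster itself, something along these lines can be run. This is a rough sketch: the label selectors and log retention are assumptions about this particular environment.

```sh
# Count the PUT /admin/user calls for this user in the RGW pod logs
# (label selector is an assumption; adjust to the actual rook-ceph-rgw pod labels)
oc -n openshift-storage logs -l app=rook-ceph-rgw --tail=-1 \
  | grep -c 'PUT /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user'

# Check when the current Rook operator pod started, to line it up with the last creation
oc -n openshift-storage get pods -l app=rook-ceph-operator \
  -o jsonpath='{.items[*].status.startTime}'; echo
```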
So 2021-07-04 is the key date. Also, the operator pod seems to have been running since Sun, 04 Jul 2021, that date again :) So it is just 5 days old, while the older pods are 22 days old. So I believe we have lost the logs from the previous operator instance? So far this is a Rook bug, so moving to Rook. It looks like each time the operator restarts, a new key gets created.

Something bugs me though: on Ceph Pacific (16.2.4), when I call the admin ops API repeatedly to create a user, I get an HTTP 409 saying the user already exists, which is what we want. On Nautilus 14.2.11-181.el8cp it looks like we get a bunch of 200s each time. So I'm thinking the bug is in Ceph; moving to ceph for investigation by the rgw team. A local repro with 14.2.22 gives a bunch of 200s as well. Matt, PTAL ASAP: has the admin ops API changed from Nautilus to Pacific? (I know it's a big jump...) If the fix is not backportable in time, I'll try to work around it in Rook in parallel.

Ok, I'm moving this back to Rook since we can get a simple fix there.

re-adding lost acks

*** Bug 1983192 has been marked as a duplicate of this bug. ***
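For verification once a fixed build lands (or as a manual cleanup in the meantime), the backing store phase and the accumulated keys can be checked from the toolbox and the bastion. This is a hedged sketch rather than part of the official fix; removing a key assumes it is not the one referenced by the secret or by noobaa-core.

```sh
# Verify the default backing store has left the "Creating" phase
oc -n openshift-storage get backingstore noobaa-default-backing-store \
  -o jsonpath='{.status.phase}'; echo

# Optionally remove a stale access key from the RGW user (repeat per stale key).
# Only remove keys that are NOT referenced by the secret or by noobaa-core.
radosgw-admin key rm --uid='noobaa-ceph-objectstore-user' --access-key='<stale access key>'
```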