Bug 1986016 - [RFE] allow rook-ceph-operator-config env variables (like CSI_LOG_LEVEL ) to change defaults on upgrade
Summary: [RFE] allow rook-ceph-operator-config env variables (like CSI_LOG_LEVEL ) to ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ODF 4.12.0
Assignee: Malay Kumar parida
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 2011326 2107226
TreeView+ depends on / blocked
 
Reported: 2021-07-26 14:08 UTC by Madhu Rajanna
Modified: 2023-08-09 17:00 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
.When upgrading from one version of {product-name} operator to another, you can now change the `rook-ceph-operator` configuration variable defaults In case you have modified some `rook-ceph-operator` configuration variables, then during operator upgrade those variables will not change. Also, if you have not changed the default variables, you are now able to change them from one version to another. To make this possible, the default variables in the rook CSV are now set as ENV variables, so the changes you make will have higher precedence, and they will be in the `rook-ceph-operator` configmap.
Clone Of:
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:
rar: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1648 0 None Merged Moving the defaults for rookceph operator from configmap to csv 2022-09-07 05:23:15 UTC
Red Hat Bugzilla 2054952 1 unspecified CLOSED [MetroDR] Add `CSI_ENABLE_CSIADDONS: "true"` option in rook-ceph-operator-override configmap for upgraded clusters. 2023-12-08 04:27:48 UTC

Internal Links: 2054952

Description Madhu Rajanna 2021-07-26 14:08:41 UTC
Description of problem (please be detailed as possible and provide log
snippets):

rook-ceph-operator-config configmap is not updated when OCS is upgraded from 4.5 to other version

Version of all relevant components (if applicable):


affected version upgraded from < 4.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Manual editing of the configmap to add a new configuration


If any new configurations are adding to the rook-ceph-operator-config configmap https://github.com/openshift/ocs-operator/blob/master/controllers/ocsinitialization/ocsinitialization_controller.go#L324 the ocs-operator is choosing not to update the configmap if the configmap already exists https://github.com/openshift/ocs-operator/blob/master/controllers/ocsinitialization/ocsinitialization_controller.go#L346-L361

Looks like this is by design. CSI_LOG_LEVEL was added in 4.6. As the configmap is not updated the cephcsi is running with a default log level which is `0` and no info or debug logs are present due to this analyzing the cephcsi BZ is difficult.

Expected results:


Additional info:

Comment 2 Mudit Agarwal 2021-07-26 14:11:10 UTC
Not a 4.8 blocker but we should consider this fixing in 4.8.z

Comment 3 Martin Bukatovic 2021-07-27 09:55:14 UTC
Could you post a diff of the config map between:

- 4.5 and 4.6
- 4.5 and 4.6
- 4.5 and 4.7

so that we can evaluate an impact of this properly?

We are making these decision too easily without due diligence.

Comment 5 Martin Bukatovic 2021-07-27 09:57:29 UTC
And off course, 4.5 and 4.8.

Comment 8 Mudit Agarwal 2021-07-27 10:42:55 UTC
(In reply to Martin Bukatovic from comment #3)
> Could you post a diff of the config map between:
> 
> - 4.5 and 4.6
> - 4.5 and 4.6
> - 4.5 and 4.7
> 
> so that we can evaluate an impact of this properly?

What difference will it make? 
We don't reconcile the rook-config map during upgrade and because of that whatever config map you have before upgrade you will have the same.
This is a design decision https://github.com/openshift/ocs-operator/blob/release-4.8/controllers/ocsinitialization/ocsinitialization_controller.go#L355-L358

We added CSI_LOG_LEVEL to 5 in 4.6, but if an user upgrades from a previous version this value will not be set.
It has nothing to do with this value, everything which is added in a later release will not appear in the map after upgrade because the map is not reconciled.
This might be called an incomplete fix for this variable from csi point of view but not a regression.

> 
> We are making these decision too easily without due diligence.
I think we are jumping on conclusion too fast because this is not something for which we should delay the release which is scheduled for next week:
1. This is not a regression.
2. One can always manually add the value to the config map.

Comment 10 Mudit Agarwal 2021-07-27 11:27:56 UTC
(In reply to Martin Bukatovic from comment #9)

>> Thank you for the explanation here, now I agree it's not a regression itself.

No regression means no blocker because
1. The blocker flag was added because of the regression keyword.
2. This has a simple workaround and not something which should delay the release.
3. Can be added in next z-stream.
And no blocker means no 4.8.0 at this point of time.

> > 2. One can always manually add the value to the config map.
> 
> This doesn't look ok, even if we have it documented (btw do we?).
No we haven't documented it because we found this bug now (after 3 releases)

> 
> I would like to understand the impact of this bug exactly. And it seems that
> we haven't done necessary analysis, or we are failing to communicate it in
> the bug so that I can easily understand it (maybe others can).
The impact of this particular bug is that log level for csi won't be 5 which means we won't be having all debug logs.

> > It has nothing to do with this value, everything which is added in a later release will not appear in the map after upgrade because the map is not reconciled.
> > This might be called an incomplete fix for this variable from csi point of view but not a regression.
> 
> If I read it right, it means that any updates in the config map won't reach
> updated cluster instances.
This means that if a customer has added any value in the config map it will be present in the upgraded cluster but
any new value which is added as part of the code in the new release will not be present in the upgraded cluster.


> So I would like to know:
> 
> - which customers are affected exactly
> - which features are affected (eg. there is BZ 1926525)
We don't know, we need to check all the values which were added in different releases.


> 
> What I would like to really avoid is the situation, when a significant
> portion of OCS customers won't have expected values in their config map
> because of this problem.
> 
> If we conclude that this all means that recent updates in rook ceph
> configmap won't reach most of our existing userbase, I don't think it's
> acceptable and that it creates regressions for these customers, even though
> the problem is not a regression itself.
I still don't think that is the definition of regression, because they will have the config map which was present before upgrade. 
So, its a case of incomplete fix not regression. Regression means any change in the existing behaviour.

And we CAN NOT reconcile the config map because that would mean that if a customer has added anything explicitly to the configmap 
then that value will be lost and that would be a real regression for that customer.

This has to be taken case to case basis.

Comment 13 Jose A. Rivera 2021-10-11 16:06:46 UTC
This is certainly needed, but at the moment it's really not a release blocker. As such, moving to ODF 4.10, and we can backport it if desired.

In general the rough solution is to specify our defaults in the rook-ceph-operator Deployment itself, rather than using the ConfigMap. This would require a change in how the OCS CSV is generated. Nothing too bad, just needs to be accounted for.

Comment 15 Martin Bukatovic 2021-10-11 21:37:15 UTC
The config map in question is defined in function newRookCephOperatorConfig() from the following file:

controllers/ocsinitialization/ocsinitialization_controller.go

And it seems that between 4.8 and 4.9 release, we are not introducing any *new* regressions,
as there are no updates in this function. One can check this via eg.: 

```
git diff origin/release-4.8..HEAD -- controllers/ocsinitialization/ocsinitialization_controller.go
```

That said, I still believe that it's not a good idea to have such traps introduced,
and that it should have been fixed in 4.8.

Comment 16 Mudit Agarwal 2021-10-12 03:14:13 UTC
Bipin, as Jose mentioned we need more work for this and will definitely take this in 4.10 (can be backported too)

Travis, Can anyone from rook pick it up so that we give it some soak time in master?

Comment 18 Mudit Agarwal 2021-10-12 07:56:19 UTC
1) I see you have mentioned that there is a workaround but couldn't find it. Is it about deleting the configMap?
[Mudit] Just manually edit the configmap and add what is required, support does that all the time.
2) How complicated is the fix?
[Mudit] As Jose mentioned, it needs a csv change, so even though the fix might not be complicated it might be risky for 4.9

>>  what is the impact if not fixed in 4.9.0?
[Mudit] For 4.9, this is the only fix which gets affected (Bug #1964055). You will see the fix on new deployments but not on the upgraded clusters.

Comment 21 Mudit Agarwal 2021-10-12 09:16:17 UTC
Adding it as a known issue for 4.9

Aiming it to get fixed in master/4.10 asap so that we can decide on the backport.

Comment 22 Travis Nielsen 2021-10-12 18:15:05 UTC
Jose What about the following approach?
- Reconcile the configmap regularly, instead of only once
- During Reconcile of the configmap
  - Get the current configmap contents (if it exists)
  - Apply the OCS settings **if they are not specified**
  - Update the configmap

This way, all existing settings stay intact in the configmap and the user can override the values if needed.

The problem with requiring the settings to be included in the CSV generation (at least with current implementation) is that it moves the downstream settings into the upstream rook repo, which isn't appropriate. But if there was a way the CSV generation could be updated by OCS to apply the downstream settings in the OCS operator repo, this would certainly be preferred.

Comment 23 Jose A. Rivera 2022-01-20 15:40:02 UTC
(In reply to Travis Nielsen from comment #22)
> Jose What about the following approach?
> - Reconcile the configmap regularly, instead of only once
> - During Reconcile of the configmap
>   - Get the current configmap contents (if it exists)
>   - Apply the OCS settings **if they are not specified**
>   - Update the configmap
> 
> This way, all existing settings stay intact in the configmap and the user
> can override the values if needed.

This could be a viable solution, yes, but at that point we're basically duplicating functionality that already exists in Rook.

> The problem with requiring the settings to be included in the CSV generation
> (at least with current implementation) is that it moves the downstream
> settings into the upstream rook repo, which isn't appropriate. But if there
> was a way the CSV generation could be updated by OCS to apply the downstream
> settings in the OCS operator repo, this would certainly be preferred.

Yes, we can easily do this. Our csv-merger tool extracts the Rook-Ceph CSV into memory, so we can apply any customizations before merging it into ocs-operator.

At this point, especially with the incoming csi-addons, I think this should get a bump in severity.

Comment 24 Elad 2022-02-13 09:26:51 UTC
Hi,

The requirement here is being discussed for quite some time but with no productive outcome. I think that the main reason is that this is an RFE but its boundaries remain here in the bug discussion and not as part of release content. 
Furthermore, this RFE is now devel acked for 4.10.0 while we way post feature freeze and the behaviour + expected results are still being discussed/
I suggest transforming this to a Jira item with a clear requirement and having a proper design discussion for it.

Comment 30 Mudit Agarwal 2022-05-24 05:41:02 UTC
Epic is targeted for 4.12

Comment 31 Mudit Agarwal 2022-10-11 12:15:12 UTC
The epic is in QE review for 4.12, please test with any of the latest 4.12 builds


Note You need to log in before you can comment on or make changes to this bug.