- What is the nature and description of the request?
As an admin, I need the ability to configure how long events stay in the cluster. By default, I believe it is around 4 hours (please correct me if I am wrong), but sometimes we need the events after that time has passed. Making the setting configurable will allow for better investigation.

- Why does the customer need this? (List the business requirements here)
To improve troubleshooting support. Very frequently a customer runs into a problem and does not collect the events immediately, and by the time they report it to Support the events are gone. Unless the issue is high enough priority (sev1 or sev2 in the support SLAs), Support might not even reach the customer to request the events in time to collect them.

- How would the customer like to achieve this? (List the functional requirements here)
Provide a setting in the master-config.yaml file that changes how long events stay around.

- For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Make sure that performance of etcd and/or the cluster is not impacted, and if it is (most importantly, if it scales), document this heavily so users are aware of the potential impact.

- Is there already an existing RFE upstream or in Red Hat Bugzilla?
Not that I am aware of.

EX: My cluster is having issues that I cannot reliably reproduce, but I need to leave for the weekend shortly. Instead of waiting for the next work day to test this behavior, I set the events lifetime to 3 days, start the process that was having issues, and leave. On Monday the events are still there, so I can see in more detail what went wrong (vs. now, where I would run it and the events would be gone before the next day).

Additional discussion points:
How are events deleted in the first place?
Was 4 hours chosen because it is the longest retention time before an impact is noticed?
Are you asking for something other than event-ttl? https://docs.openshift.org/latest/install_config/master_node_configuration.html#master-configuration-files
I was not aware of this feature at all. The example mentioned there is specifically about _decreasing_ that time; does it also work for increasing it? Has any testing been done to see what kind of impact increasing the time might have? I opened this very intentionally with the focus on increasing the time, because troubleshooting issues a couple of days after the fact is sometimes more difficult when the events no longer exist.
Increasing event-ttl is allowed. It has the effect of consuming more memory in etcd and on the masters, but as long as you can afford the memory, there isn't any issue in doing so. We are also looking into storing events as part of metrics or logging: https://trello.com/c/IQ5gFOrY/ So I suggest we either close this bug, since the existing ttl setting is already there, or leave it open and refer to the existing trello card.
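For reference, a minimal sketch of what raising the ttl to 3 days (the weekend example from the description) might look like in master-config.yaml. It assumes the apiServerArguments layout shown in the linked documentation. The "72h" value is illustrative, not a tested recommendation, and is written in hours on the assumption that the duration format has no day unit.

```yaml
# Sketch only: assumes the master-config.yaml layout from the linked docs.
kubernetesMasterConfig:
  apiServerArguments:
    event-ttl:
    - "72h"   # illustrative value: keep events for 3 days
```

The master services would presumably need a restart for the change to take effect, and etcd and master memory usage should be watched afterwards, given the memory cost noted above.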
https://bugzilla.redhat.com/show_bug.cgi?id=1287512 is a possible duplicate.