Bug 2022664
| Summary: | Object service remains in an unhealthy state in a fresh deployment. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Bipin Kunal <bkunal> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | CLOSED WORKSFORME | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | dzaken, etamir, mmuench, nbecker, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-15 12:41:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | screenshot (attachment 1841415) | | |
Hi Bipin, can we get a must-gather?

---

Hi Bipin. I see in the logs that the default backingstore is probably stuck in the INITIALIZING state, but I cannot find the root cause. If this happens again and we can get a live cluster, that would help to debug and find the RCA.

---

Hi Danny,

A couple of things here:

1) How can we triage incoming bugs quicker and check whether all the debug data needed for RCA is present? If bugs are triaged early, there is a better chance of getting a reproducer or a live setup for debugging as well. I have often observed delays in triaging and in providing initial analysis on NooBaa bugs.
2) It is always good to add a detailed analysis; it helps the reporter understand the issue better and do a better analysis on the next occurrence. Can you add what you saw, where you saw it, and what your debugging approach was?
3) It is unlikely that you will always get a reproducer or a live system, so what are we doing to improve the logging so that a must-gather alone is sufficient for debugging?

-Bipin Kunal

---

(In reply to Bipin Kunal from comment #5)

Hi Bipin,

I agree that triaging the bugs early increases the chances of finding the root cause. As for what can be done to improve the process, I think that getting a good initial analysis from the reporter is key to quick triaging. The following can help us handle it better:

1. Provide a full description of the problem: what is not working, and what is expected?
2. Try to look a bit lower level into the different components and CRs to identify what is not working and why the system is in an unhealthy state. This can give us a sense of what might be wrong and who should look at the bug.
3. Attach a must-gather to the BZ. Many times a must-gather is not attached to the bug, and this creates a big delay before we can start to investigate the issue.

I agree that getting a live system is not always an option, but many times the logs themselves are not enough, and just adding stuff to the logs is not a good solution. This is another reason why the initial analysis is important.

As to what I see in this specific issue: in the default backingstore CR the mode is INITIALIZING, which means the storage resource in noobaa-core is not initialized yet. I could not find the root cause for that in the logs, so increasing the log level or adding log messages in the core pod could help.
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-2022664/quay-io-rhceph-dev-ocs-must-gather-sha256-2cfd1466b584aba7c3d94a6d2397267443416fe7db8a7fc132e4137ed640ca8f/noobaa/namespaces/openshift-storage/noobaa.io/backingstores/noobaa-default-backing-store.yaml
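For later readers, a minimal way to inspect the state described above, assuming the default `openshift-storage` namespace and backingstore name taken from the must-gather path in this bug; the exact status field paths are an assumption and may differ between ODF versions:

```sh
# Dump the default backingstore CR and look at its reported status.
oc get backingstore noobaa-default-backing-store -n openshift-storage -o yaml

# Shorter view of just the phase and mode (field paths are assumptions;
# a healthy store typically reports something other than INITIALIZING).
oc get backingstore noobaa-default-backing-store -n openshift-storage \
  -o jsonpath='{.status.phase}{" "}{.status.mode.modeCode}{"\n"}'
```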
A few other questions:
1. Did the issue occur once, or do you see it in other clusters as well?
2. Does it happen on a specific platform?
3. Did the system eventually become healthy after some time, or did it remain unhealthy the whole time?

---

(In reply to Danny from comment #7)
> 1. Did the issue occur once, or do you see it in other clusters as well?

This was just a one-time occurrence. I think we may hit this issue if there is a long delay in Ceph cluster creation; at least that was a bit unique in my setup. It was not an intentional delay, but somehow it happened... I don't know why... maybe the network (but I doubt that).

> 2. Does it happen on a specific platform?

Can't say. In my case it was AWS.

> 3. Did the system eventually become healthy after some time, or did it remain unhealthy the whole time?

No, it did not become healthy; I waited for several hours.

---

(In reply to Danny from comment #6)
> The following can help us handle it better:
> 1. Provide a full description of the problem: what is not working, and what is expected?
> 2. Try to look a bit lower level into the different components and CRs to identify what is not working and why the system is in an unhealthy state.
> 3. Attach a must-gather to the BZ.

Ack, thanks

> As to what I see in this specific issue: in the default backingstore CR the mode is INITIALIZING, which means the storage resource in noobaa-core is not initialized yet. I could not find the root cause for that in the logs, so increasing the log level or adding log messages in the core pod could help.

Ack

---

Just adding something to consider here: there have been issues (from customers), pushed by GSS, asking to reduce the amount of logging. This comes in direct conflict with the request here to add more logging...
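Since much of the thread hinges on having a must-gather attached to the BZ, here is a rough sketch of collecting one. The image and tag below are assumptions inferred from the CI must-gather path referenced in comment #6; a released cluster would normally use the official ODF must-gather image for its version instead.

```sh
# Collect OCS/ODF debug data and attach the resulting directory to the BZ.
# Image and tag are assumptions based on the CI build referenced in this bug.
oc adm must-gather \
  --image=quay.io/rhceph-dev/ocs-must-gather:latest \
  --dest-dir=./must-gather-bz2022664
```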
---

Created attachment 1841415 [details]
screenshot

Description of problem (please be detailed as possible and provide log snippets):

The object service has been in an unhealthy state for several hours after a fresh ODF deployment.

Version of all relevant components (if applicable):

OCP: 4.9.0
OCS: quay.io/rhceph-dev/ocs-registry:4.9.0-233.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Not sure why it shows unhealthy; all the pods are up and running.
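To narrow down where the "unhealthy" status comes from when all pods appear to be running, checks along the lines below could help. The resource and pod names assume a default `openshift-storage` deployment like the one described in this report and are not taken from the bug itself.

```sh
# High-level MCG / object-service resources (names assume an ODF default install).
oc get noobaa,backingstore,bucketclass -n openshift-storage

# Rule out a slow Ceph bring-up, which the reporter suspected above.
oc get cephcluster,cephobjectstore -n openshift-storage

# Logs around backingstore initialization (workload names are assumptions).
oc logs -n openshift-storage deployment/noobaa-operator --tail=200
oc logs -n openshift-storage noobaa-core-0 --all-containers --tail=200
```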