Bug 1889734
| Summary: | Nodes become unavailable during CNV performance tests | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | guy chen <guchen> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, danken, fdeutsch, jokerman, jsafrane, pkliczew |
| Version: | 4.6.z | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-10 15:09:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
guy chen
2020-10-20 13:31:38 UTC
Kubernetes / OCP does not monitor storage / mount health; its work ends at mount(). Applications (pods) are supposed to have their own readiness / liveness probes to ensure the application as a whole works, including its mounted storage volumes. Implementing a mount health check in the kubelet would be *very* complicated. You can file an RFE.

Jan, from what you say I understand why an NFS-dependent workload running on the node would stop performing. But why is it acceptable that the node becomes unknown? If the node stayed up, health probes could report the specific problematic workload and have it rescheduled; other workloads could keep running; new workloads could be added. Is there a way to address this bug other than mount health checks?

Nodes showing as <unknown> in "oc top nodes" is odd; it may not be related to NFS issues at all. Do you have a cluster must-gather and kubelet logs from the node? Is there anything interesting in the kubelet logs? What does "oc describe node" say?

This bug will not be fixed in the upcoming sprint.
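For illustration, a minimal sketch of the kind of per-pod liveness probe suggested above, assuming a workload that mounts an NFS export at /data; the pod name, image, NFS server, and paths are hypothetical. A stat of the mount point errors or exceeds the probe timeout when the NFS server stops responding, so the failure surfaces at the pod level instead of the workload hanging silently.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-workload          # hypothetical example names throughout
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: data
      mountPath: /data
    livenessProbe:
      exec:
        # stat hangs or errors when the NFS server is unreachable;
        # timeoutSeconds bounds how long the probe may take
        command: ["stat", "/data"]
      initialDelaySeconds: 30
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
  volumes:
  - name: data
    nfs:
      server: nfs.example.com
      path: /exports/data
```

This only reports the health of the pod's own mount; it does not address the node going <unknown>, which is the separate question raised above.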
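The data requested above can be gathered with standard OpenShift 4.x client commands; a sketch, assuming cluster-admin access and that the affected node name is known (the node name is a placeholder):

```shell
# Cluster-wide diagnostic archive for the bug report
oc adm must-gather

# Node conditions, taints, and resource pressure for the affected node
oc describe node <affected-node>

# Kubelet journal from the affected node
oc adm node-logs <affected-node> -u kubelet
```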