Bug 2009078
| Summary: | NetworkPodsCrashLooping alerts in upgrade CI jobs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jamo luhrsen <jluhrsen> |
| Component: | Networking | Assignee: | Nadia Pinaeva <npinaeva> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | deads, mkennell, npinaeva, wking |
| Version: | 4.10 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:13:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
jamo luhrsen
2021-09-29 21:11:22 UTC
This is expected. I performed an upgrade. Upgrade of ovn-k was completed successfully with 0 restarts on all pods. MCO restarts each node individually. The high restart count is due to the high number of containers for the ovn-kubernetes pods. I saw MCO going through each node to reboot. I saw the restart count jump from 0 to the pod's container count usually when kubelet came back up.

I recommend closing this if it isn't directly causing the test to fail and it looks like that is the case right now. OVN-kubernetes restarting is expected during a reboot.

> This is expected. I performed an upgrade. Upgrade of ovn-k was completed successfully with 0 restarts on all pods.
> MCO restarts each node individually. The high restart count is due to the high number of containers for the ovn-kubernetes pods.
> I saw MCO going through each node to reboot. I saw the restart count jump from 0 to the pod's container count usually when kubelet came back up.

ok, the pod restart count makes sense now, but does it correlate to the alerts as well? In other words, is it ok/expected that something like the sbdb container in an ovnkube-master pod would be marked as crashlooping for almost 10 minutes?

> I recommend closing this if it isn't directly causing the test to fail and it looks like that is the case right now.
> OVN-kubernetes restarting is expected during a reboot.

the failure is because some service was not responding for several minutes:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:161]: Sep 28 21:28:46.157: Service was unreachable during disruption for at least 2m7s of 1h17m5s (3%):

if the crashlooping alerts are expected though, and not part of the reason why we have an unreachable service, we can close this bug with that explanation.

closing this as not a bug. the alerts are expected, as Martin pointed out. Another thread discussing the same: https://coreos.slack.com/archives/CDCP2LA9L/p1633109908134100

This error is hiding CI signal for alerts on 4.10 upgrade jobs. We need a PR to avoid reporting this in our tests.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
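For context on the restart-count jump discussed above: a minimal sketch (assuming client-go, a local kubeconfig, and the openshift-ovn-kubernetes namespace) of listing per-container restart counts for the ovn-kubernetes pods. After an MCO-driven node reboot, each container typically restarts once, so the pod-level count jumps by roughly the number of containers in the pod.

```go
// Sketch: print per-container restart counts for ovn-kubernetes pods.
// Assumes a kubeconfig at the default location and the
// openshift-ovn-kubernetes namespace; not part of the bug itself.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("openshift-ovn-kubernetes").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		var total int32
		for _, cs := range pod.Status.ContainerStatuses {
			total += cs.RestartCount
			fmt.Printf("  %s/%s restarts=%d\n", pod.Name, cs.Name, cs.RestartCount)
		}
		// A reboot usually bumps every container once, so total ~= container count.
		fmt.Printf("%s total=%d (containers=%d)\n", pod.Name, total, len(pod.Status.ContainerStatuses))
	}
}
```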
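On "We need a PR to avoid reporting this in our tests": a hypothetical sketch, not the actual origin change, of how a CI check could query the cluster Prometheus for firing alerts and tolerate the crash-looping ones that are expected while MCO reboots nodes. The Prometheus address, the alert name (KubePodCrashLooping), and the namespace label used for filtering are illustrative assumptions.

```go
// Hypothetical CI-side filter: list firing alerts and skip the ones that are
// expected during an upgrade (ovn-kubernetes containers restarting with the
// node reboots). Alert name, namespace, and address are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// expectedDuringUpgrade reports whether a firing alert is tolerated.
func expectedDuringUpgrade(metric model.Metric) bool {
	return metric["alertname"] == "KubePodCrashLooping" &&
		metric["namespace"] == "openshift-ovn-kubernetes"
}

func main() {
	client, err := api.NewClient(api.Config{Address: "https://prometheus.example.com"}) // placeholder
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, _, err := promAPI.Query(ctx, `ALERTS{alertstate="firing"}`, time.Now())
	if err != nil {
		panic(err)
	}
	vector, ok := result.(model.Vector)
	if !ok {
		panic("unexpected result type from Prometheus query")
	}
	for _, sample := range vector {
		if expectedDuringUpgrade(sample.Metric) {
			fmt.Printf("tolerating expected alert: %v\n", sample.Metric)
			continue
		}
		fmt.Printf("unexpected firing alert: %v\n", sample.Metric)
	}
}
```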