Bug 2099794
| Field | Value |
|---|---|
| Summary | crio occasionally fails to start during deployment |
| Product | Red Hat Enterprise Linux 8 |
| Component | NetworkManager |
| Version | 8.6 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Reporter | Alexander Chuzhoy <sasha> |
| Assignee | Beniamino Galvani <bgalvani> |
| QA Contact | David Jaša <djasa> |
| CC | bgalvani, dornelas, jligon, lrintel, mfilanov, mko, mrussell, nmagnezi, nstielau, otuchfel, pehunt, pmaidmen, rkhan, rravaiol, rravanel, sfaye, sizucchi, sukulkar, thaller, till, toneata, trwest, vbenes, wking |
| Target Milestone | rc |
| Target Release | --- |
| Keywords | TestBlocker, Triaged, ZStream |
| Flags | rravaiol: needinfo- |
| Fixed In Version | NetworkManager-1.40.2-1.el8 |
| Doc Type | No Doc Update |
| Story Points | --- |
| Clones | 2123412, 2132059, 2132281, 2132282 |
| Bug Blocks | 2123412, 2132059, 2132281, 2132282 |
| Last Closed | 2023-05-16 09:04:54 UTC |
| Type | Bug |
| Regression | --- |
Description
Alexander Chuzhoy
2022-06-21 17:21:13 UTC
*** Bug 2101490 has been marked as a duplicate of this bug. ***

Hey @pehunt, here are the full logs:

Jul 11 19:44:38 spoke-rwn-0-1 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.013451971Z" level=info msg="Starting CRI-O, version: 1.24.1-11.rhaos4.11.gitb0d2ef3.el8, git: ()"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.014017663Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.014108765Z" level=info msg="Node configuration value for pid cgroup is true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.014193717Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.014237574Z" level=info msg="Node configuration value for cgroup v2 is false"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.022080843Z" level=info msg="Node configuration value for systemd CollectMode is true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.029460658Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.031461785Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.035196028Z" level=info msg="Conmon does support the --sync option"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.035272468Z" level=info msg="Conmon does support the --log-global-size-max option"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.036460758Z" level=info msg="Using seccomp default profile when unspecified: true"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.036728910Z" level=info msg="No seccomp profile specified, using the internal default"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.036787570Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.036829129Z" level=info msg="No blockio config file specified, blockio not configured"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.036878867Z" level=info msg="RDT not available in the host system"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.037042906Z" level=info msg="Updated default CNI network name to "
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.067318702Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory"
Jul 11 19:44:39 spoke-rwn-0-1 crio[2144]: time="2022-07-11 19:44:39.068212424Z" level=fatal msg="Failed to start streaming server: listen tcp [fd2e:6f44:5dd8:6::38]:10010: bind: cannot assign requested address"
Jul 11 19:44:39 spoke-rwn-0-1 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 19:44:39 spoke-rwn-0-1 systemd[1]: crio.service: Failed with result 'exit-code'.
Jul 11 19:44:39 spoke-rwn-0-1 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
Jul 11 19:44:39 spoke-rwn-0-1 systemd[1]: crio.service: Consumed 78ms CPU time

We often run into this issue. It looks like the cri-o that was previously running is still somehow holding the port open. Are we sure there aren't two cri-o instances running at the same time? How does bootstrap-in-place work with respect to kubelet/cri-o?

The error for binding a port that is already in use is "bind: address already in use", which is not the error we see in the logs here. However, the error when attempting to bind to an address that is not present on any interface appears to be the same as the one in the logs: "bind: cannot assign requested address". (A minimal sketch illustrating the difference follows the reproduction steps below.)

------------------------------------------------------------------------------------------------------------------
Steps to reproduce:
------------------------------------------------------------------------------------------------------------------
Using a Beaker machine and assisted-test-infra:

TEST_TEARDOWN=false KUBE_API=true ENABLE_KUBE_API=true NUM_MASTERS=1 ISO_IMAGE_TYPE=minimal-iso STATIC_IPS=true TEST_FUNC=test_kubeapi make setup run test_kube_api_parallel

------------------------------------------------------------------------------------------------------------------
When finished, you should be able to query KVM to find the host file entries to add for the host addresses.
------------------------------------------------------------------------------------------------------------------

# virsh list
 Id   Name                                   State
------------------------------------------------------
 1    minikube                               running
 4    test-infra-cluster-c0f6efad-master-0   running
 5    test-infra-cluster-20358370-master-0   running

From here you can query to find the IP addresses to add to your hosts file:

# virsh net-list
 Name                                     State    Autostart   Persistent
--------------------------------------------------------------------------
 default                                  active   yes         yes
 mk-minikube                              active   yes         yes
 test-infra-net-20358370                  active   yes         yes
 test-infra-net-c0f6efad                  active   yes         yes
 test-infra-secondary-network-20358370    active   yes         yes
 test-infra-secondary-network-c0f6efad    active   yes         yes

[root@localhost assisted-test-infra]# virsh net-dhcp-leases test-infra-secondary-network-20358370
 Expiry Time           MAC address         Protocol   IP address           Hostname   Client ID or DUID
---------------------------------------------------------------------------------------------------------------------
 2022-08-17 17:02:03   02:00:00:68:a3:81   ipv6       3001:db9:1::e2/120   -          00:03:00:01:02:00:00:68:a3:81

[root@localhost assisted-test-infra]# virsh net-dhcp-leases test-infra-secondary-network-c0f6efad
 Expiry Time           MAC address         Protocol   IP address          Hostname                               Client ID or DUID
---------------------------------------------------------------------------------------------------------------------------------------
 2022-08-17 17:06:21   02:00:00:ed:a2:e8   ipv4       192.168.145.10/24   test-infra-cluster-c0f6efad-master-0   01:02:00:00:ed:a2:e8

------------------------------------------------------------------------------------------------------------------
Add some entries to /etc/hosts:
------------------------------------------------------------------------------------------------------------------

3001:db9:1::e2    api.test-infra-cluster-20358370.redhat.com
192.168.145.10    api.test-infra-cluster-c0f6efad.redhat.com

------------------------------------------------------------------------------------------------------------------
Log into the command line of the IPv6 host:
------------------------------------------------------------------------------------------------------------------

ssh core.redhat.com
journalctl | grep crio
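To make the distinction above concrete, here is a minimal Go sketch (not part of CRI-O; the IPv6 address is copied from the log and is assumed not to be configured on the host running the sketch). Binding a port that is already taken fails with EADDRINUSE ("address already in use"), while binding an address that is not assigned to any local interface, or is still tentative, fails with EADDRNOTAVAIL ("cannot assign requested address"), which matches the fatal error in the crio log.

```go
// bindcheck.go: minimal sketch of the two bind failures discussed above.
// Assumption: fd2e:6f44:5dd8:6::38 (taken from the crio log) is NOT
// configured on the machine running this sketch.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

func tryListen(addr string) {
	ln, err := net.Listen("tcp", addr)
	switch {
	case err == nil:
		fmt.Printf("%-35s -> bound OK\n", addr)
		ln.Close()
	case errors.Is(err, syscall.EADDRINUSE):
		// Another socket already owns this address:port pair.
		fmt.Printf("%-35s -> bind: address already in use\n", addr)
	case errors.Is(err, syscall.EADDRNOTAVAIL):
		// The IP is not (yet) usable on any local interface; this is the
		// error CRI-O reports in the logs above.
		fmt.Printf("%-35s -> bind: cannot assign requested address\n", addr)
	default:
		fmt.Printf("%-35s -> %v\n", addr, err)
	}
}

func main() {
	// Occupy a port, then try to bind it again: EADDRINUSE.
	first, err := net.Listen("tcp", "127.0.0.1:10010")
	if err != nil {
		fmt.Println("setup:", err)
		return
	}
	defer first.Close()
	tryListen("127.0.0.1:10010")

	// Bind an address that is not present on any interface: EADDRNOTAVAIL.
	tryListen("[fd2e:6f44:5dd8:6::38]:10010")
}
```

The second failure mode is consistent with the node's DHCPv6 address not yet being usable when crio.service starts, rather than with a second cri-o still holding the port.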
Running journalctl | grep crio on the host, you will see something similar to the following:

Aug 17 18:41:31 test-infra-cluster-20358370-master-0 crio[2319]: time="2022-08-17 18:41:31.447622177Z" level=fatal msg="Failed to start streaming server: listen tcp [1001:db9:1::12]:10010: bind: cannot assign requested address"

Hi guys, we are hitting the same issue, where the signature is:

```
Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.635765832Z" level=fatal msg="Failed to start streaming server: listen tcp [fd2e:6f44:5dd8:c956::1a]:10010: bind: cannot assign requested address"
```

Luckily we have the full journal attached as part of https://issues.redhat.com/browse/OCPBUGS-511.

Raising the priority: if the workaround is to reboot the node, that is not an acceptable solution for ZTP; nobody will be able to reboot the nodes in a ZTP flow.

The issue reproduces with 4.12:

NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202208292256-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202208292256-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202208292256-0"

The only difference is that we don't see anything in the output of journalctl -u crio:

[core@api ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
[core@api ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─01-kubens.conf, 10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[core@api ~]$ sudo journalctl -u crio
-- Logs begin at Wed 2022-08-31 02:10:13 UTC, end at Wed 2022-08-31 13:17:08 UTC. --
-- No entries --

Tests are broken because of this issue, and it is considered a test blocker. Are there any updates on the status or on proposed fixes?

This seems to be a duplicate of bug 2096386. In RHEL 8.7, NM waits for DHCPv6 addresses to become non-tentative before signaling that the interface is ready. However, the original bug is still open because, in case of a DAD collision, NM doesn't send a DHCPDECLINE, so the fix is considered incomplete. Still, it should be enough to solve the "bind: cannot assign requested address" problem. (A small diagnostic sketch for inspecting the tentative/DAD state of an address follows the backport discussion below.)

This issue was initially observed by the Assisted Installer team in OCP 4.11; my understanding is that fixing it in RHEL 8.7 makes the fix available only in OCP 4.12. If this is correct, can we please track the backport?

I think 4.12 will stay on 8.6 and 4.13 will move to 8.8. @bgalvani, is a backport to 8.6 possible?

Hi Peter, yes, a backport to 8.6 is possible. Let me set the correct flag for that.

There might be more to this. This is a duplicate of bug 2096386, which was already partly fixed in rhel-8.7-GA, but there was a follow-up fix that will make bug 2096386 fully fixed in the upcoming rhel-8.8. That means rhel-8.7 still partly has the problem. We should probably reopen the rhel-8.7.z bug 2132281. See also https://bugzilla.redhat.com/show_bug.cgi?id=2096386#c31. In any case, this will be fixed in rhel-8.8.
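As a companion to the DAD explanation above, here is a small diagnostic sketch (not something NetworkManager or CRI-O ships): it parses /proc/net/if_inet6 and reports IPv6 addresses that are still tentative or have failed duplicate address detection. The flag bits IFA_F_DADFAILED (0x08) and IFA_F_TENTATIVE (0x40) come from the kernel's linux/if_addr.h; a service that tries to bind such an address gets "cannot assign requested address" until DAD completes.

```go
// dadcheck.go: minimal diagnostic sketch (not part of NetworkManager or
// CRI-O). It lists IPv6 addresses that are still tentative or that failed
// duplicate address detection by parsing /proc/net/if_inet6.
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
)

const (
	ifaFDadFailed = 0x08 // IFA_F_DADFAILED from linux/if_addr.h
	ifaFTentative = 0x40 // IFA_F_TENTATIVE from linux/if_addr.h
)

// expandHex turns the 32-hex-digit address from /proc/net/if_inet6 into
// colon-separated groups so net.ParseIP can read it.
func expandHex(s string) string {
	var groups []string
	for i := 0; i+4 <= len(s); i += 4 {
		groups = append(groups, s[i:i+4])
	}
	return strings.Join(groups, ":")
}

func main() {
	data, err := os.ReadFile("/proc/net/if_inet6")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		// Fields: address(32 hex) ifindex prefixlen scope flags ifname
		f := strings.Fields(line)
		if len(f) != 6 {
			continue
		}
		flags, err := strconv.ParseUint(f[4], 16, 32)
		if err != nil || flags&(ifaFTentative|ifaFDadFailed) == 0 {
			continue
		}
		state := "tentative"
		if flags&ifaFDadFailed != 0 {
			state = "dadfailed"
		}
		fmt.Printf("%s on %s is %s\n", net.ParseIP(expandHex(f[0])), f[5], state)
	}
}
```

With the NetworkManager-1.40.2-1.el8 fix described above, the connection should no longer be signaled as ready while the DHCPv6 address still shows up here as tentative, so by the time crio.service starts, the bind should succeed.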
Moving this bug forward.

VERIFIED in NetworkManager-1.40.2-1.el8. The duplicate address is removed from the host once kernel DAD finds that another interface in the network segment already has it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968