Created attachment 1786978 [details]
agent.service.logs

Description of problem:

While running WAN emulation scenarios on the stage scale environment, one host failed to run the agent after booting the discovery ISO. The current WAN emulation introduces 150 ms of latency and 0.02% packet loss on outgoing traffic from the hypervisor hosting the VM. This occurred on 1 of 72 simulated bare-metal machines.

Version-Release number of selected component (if applicable):

ACM hub cluster: 4.8.0-0.nightly-2021-05-19-123944
SNO cluster: 4.8.0-fc.3
Assisted-installer version: stable.17.05.2021-17.46

How reproducible:

1 of 72 clusters in an environment with WAN emulation

Steps to Reproduce:
1.
2.
3.

Actual results:

May 25 17:22:28 localhost systemd[1]: Starting agent.service...
May 25 17:22:28 localhost podman[1966]: time="2021-05-25T17:22:28Z" level=warning msg="Can't read link \"/var/lib/containers/storage/overlay/l/RVGJ3YR6IUTVNUEKJVWG6HVCUQ\" because it does not exist. Going through storage to recreate the missing symlinks."
May 25 17:22:28 localhost podman[1966]: Error: readlink /var/lib/containers/storage/overlay/l/RVGJ3YR6IUTVNUEKJVWG6HVCUQ: no such file or directory
May 25 17:22:28 localhost systemd[1]: agent.service: Control process exited, code=exited status=125
May 25 17:22:28 localhost systemd[1]: agent.service: Failed with result 'exit-code'.
May 25 17:22:28 localhost systemd[1]: Failed to start agent.service.

Expected results:

The agent should attempt to re-pull the image in the event the WAN network corrupts it.

Additional info:
It seems that the WAN emulation corrupted the image during download:

# ssh core.8.123
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:2Rk3V6eb0S2xe+lLDjQ9XXJzuiRP3SpZNakriGxvwbA.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /root/.ssh/known_hosts:18
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
This is a host being installed by the OpenShift Assisted Installer.
It will be installed from scratch during the installation.
The primary service is agent.service. To watch its status run e.g
sudo journalctl -u agent.service
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **
Last login: Tue May 25 18:49:53 2021 from 198.18.8.1

[core@localhost ~]$ sudo podman images
Error: error retrieving size of image "7a0d8af3d3e80fbe430e174f363eafa4fa5534bc5720b7e1a6116681c37b8791": you may need to remove the image to resolve the error: unable to determine size: error locating layer with ID "a5d40be26b41a76fa7c8305e5f9fa00ccd7819b9078d016258b904d7dd1bc2f7": layer not known

Also in the logs:

May 25 17:18:48 localhost systemd[1]: Starting agent.service...
May 25 17:18:49 localhost podman[1538]: Trying to pull quay.io/ocpmetal/assisted-installer-agent:stable.17.05.2021-17.46...
May 25 17:18:59 localhost podman[1538]: Getting image source signatures
May 25 17:19:01 localhost podman[1538]: Copying blob sha256:0181ea251735fe65ed58c00c99a112e48638371f0f8789f7383de3dddc0e2fd1
...
May 25 17:19:20 localhost podman[1538]: Copying blob sha256:8068da8437ef14577019c074663828182a2732043bd86ec88a298951d925abe9
May 25 17:20:19 localhost podman[1538]: time="2021-05-25T17:20:19Z" level=warning msg="failed, retrying in 1s ... (1/3). Error: Error writing blob: error storing blob to file \"/var/tmp/storage389343244/6\": error happened during read: read tcp 198.18.8.123:60700->13.32.204.13:443: read: connection reset by peer"
May 25 17:20:26 localhost podman[1538]: Getting image source signatures
May 25 17:20:28 localhost podman[1538]: Copying blob sha256:0181ea251735fe65ed58c00c99a112e48638371f0f8789f7383de3dddc0e2fd1
...
Running `podman rmi --all --force` to remove the corrupted image allowed the agent to start (after restarting it manually). I can now see the SNO host as discovered.
We should try to recover from such failures. agent.service is configured to always restart, with a delay of 3 s. We may be able to add clean-up logic for the failure case (e.g. using `OnFailure=`, so that the agent image is removed before the agent restarts). The agent.service logs won't be lost because they go to `journald`.
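A minimal sketch of that idea as systemd unit fragments. The `agent-cleanup.service` name and the exact directives are illustrative, not the actual units shipped on the discovery ISO; note also that with `Restart=always` the unit only enters the failed state (and so triggers `OnFailure=`) once its start-rate limit is exhausted:

```ini
# agent.service (relevant lines only, illustrative)
[Unit]
OnFailure=agent-cleanup.service

[Service]
Restart=always
RestartSec=3

# agent-cleanup.service (hypothetical clean-up unit)
[Unit]
Description=Remove a possibly corrupted agent image before the next restart

[Service]
Type=oneshot
# Best-effort removal; the next agent start will re-pull the image.
ExecStart=/usr/bin/podman rmi --force quay.io/ocpmetal/assisted-installer-agent:stable.17.05.2021-17.46
```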
Taking this issue for now until we verify the idea and find someone to work on it.
After some investigation it seems there is a missing symlink in `/var/lib/containers` that we can recover from without removing and re-pulling the image. Moving `/var/lib/containers/storage/overlay-images/images.json` aside makes podman recreate the symlinks so that `podman images` works again. A simple workaround using this:

```
# mv /var/lib/containers/storage/overlay-images/images.json /var/lib/containers/storage/overlay-images/images.json.temp
# podman images
# mv /var/lib/containers/storage/overlay-images/images.json.temp /var/lib/containers/storage/overlay-images/images.json
```

A related issue in c/storage appears to be https://github.com/containers/storage/pull/822 (https://bugzilla.redhat.com/show_bug.cgi?id=1921128), but based on the status there I cannot fully confirm that we are still hitting the same problem. For now, the proposed workaround (https://github.com/openshift/assisted-service/pull/1836) will remove the image so that it is pulled again. Based on the outcome of further testing we will decide whether another approach is needed.
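The image-removal approach can be sketched as a small pre-restart clean-up step. This is an illustrative sketch, not the actual code in the PR above; it mirrors the manual recovery from comment 4 and is deliberately best-effort so that a clean-up failure never blocks the service restart:

```shell
#!/bin/sh
# Illustrative clean-up sketch: force-remove locally cached images so the
# agent image is pulled fresh on the next service start. Guarded so the
# script succeeds even on a machine without podman installed.
cleanup_agent_images() {
    if command -v podman >/dev/null 2>&1; then
        # Ignore failures: a broken storage state must not abort clean-up.
        podman rmi --all --force || true
    fi
}

cleanup_agent_images
echo "cleanup done"
```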
On my most recent tests:
* egress hypervisor latency 75 ms, 0.02% packet loss
* ingress hypervisor latency 75 ms, 0.02% packet loss
* hub cluster OCP 4.8.0-fc.8
* clusterimageset / SNO OCP is 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02

I saw 0/60 clusters fail because of this issue over the last two iterations.
(In reply to Alex Krzos from comment #8)
> On my most recent tests:
> * egress hypervisor latency 75ms, 0.02% packet loss
> * ingress hjypervisor latency 75ms, 0.02% packet loss
> * hub cluster ocp 4.8.0-fc.8
> * clusterimageset / sno ocp is ocp 4.8.0-fc.8
> * ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02
>
> I saw 0/60 cluster fail because of this issue over the last two iterations.

This is great news, thanks!

A better, longer-term solution is being worked on as part of https://bugzilla.redhat.com/show_bug.cgi?id=1966872

A package will soon be available for 8.4 that we can (and should) depend on to prevent similar failures in other workflows that require podman.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759