Bug 1964591 - [master] ACM/ZTP with Wan emulation fails to start the agent service
Summary: [master] ACM/ZTP with Wan emulation fails to start the agent service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mat Kowalski
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform KNI-EDGE-4.8
Depends On:
Blocks: 1971630
 
Reported: 2021-05-25 18:42 UTC by Alex Krzos
Modified: 2021-10-18 17:32 UTC
CC List: 7 users

Fixed In Version: OCP-Metal-v1.0.21.3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1971630
Environment:
Last Closed: 2021-10-18 17:31:45 UTC
Target Upstream Version:
Embargoed:


Attachments
agent.service.logs (607.81 KB, text/plain)
2021-05-25 18:42 UTC, Alex Krzos


Links
System ID Private Priority Status Summary Last Updated
Github containers podman pull 10637 0 None Merged [3.0.1-rhel] handle corrupted images 2022-02-17 19:22:25 UTC
Github openshift assisted-service pull 1836 0 None Merged Bug 1964591: Remove AI Agent image in case of service failure 2022-02-17 19:22:32 UTC
Red Hat Bugzilla 1966872 1 urgent CLOSED podman's image index corrupted during WAN emulation tests 2023-09-29 09:32:06 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:32:04 UTC


Description Alex Krzos 2021-05-25 18:42:55 UTC
Created attachment 1786978 [details]
agent.service.logs

Description of problem:
While running through some WAN emulation scenarios on the stage scale environment, one host failed to run the agent after booting the discovery ISO. The current WAN emulation introduces 150 ms of latency and 0.02% packet loss on outgoing traffic from the hypervisor hosting the VM. This occurred on 1/72 simulated bare metal machines.
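
For reference, this kind of WAN emulation is commonly applied with `tc netem` on the hypervisor's egress interface. A minimal sketch only; the interface name `eno1` and the use of `tc` here are assumptions, since the exact tooling used in this environment is not recorded in this report:

```
# Assumed example: add 150ms latency and 0.02% packet loss to egress traffic on eno1
tc qdisc add dev eno1 root netem delay 150ms loss 0.02%

# Remove the emulation again
tc qdisc del dev eno1 root netem
```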

Version-Release number of selected component (if applicable):
ACM hub cluster: 4.8.0-0.nightly-2021-05-19-123944
SNO cluster: 4.8.0-fc.3
Assisted-installer version: stable.17.05.2021-17.46

How reproducible:
1/72 clusters in an environment with "WAN emulation"

Steps to Reproduce:
1.
2.
3.

Actual results:
May 25 17:22:28 localhost systemd[1]: Starting agent.service...
May 25 17:22:28 localhost podman[1966]: time="2021-05-25T17:22:28Z" level=warning msg="Can't read link \"/var/lib/containers/storage/overlay/l/RVGJ3YR6IUTVNUEKJVWG6HVCUQ\" because it does not exist. Going through storage to recreate the missing symlinks."
May 25 17:22:28 localhost podman[1966]: Error: readlink /var/lib/containers/storage/overlay/l/RVGJ3YR6IUTVNUEKJVWG6HVCUQ: no such file or directory                                          
May 25 17:22:28 localhost systemd[1]: agent.service: Control process exited, code=exited status=125                                                                                          
May 25 17:22:28 localhost systemd[1]: agent.service: Failed with result 'exit-code'.
May 25 17:22:28 localhost systemd[1]: Failed to start agent.service.


Expected results:
Agent should attempt to re-pull the image in the event the WAN network corrupts the image.

Additional info:

Comment 1 Alex Krzos 2021-05-25 18:52:16 UTC
It seems that the WAN emulation has corrupted the image during download:

# ssh core.8.123
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:2Rk3V6eb0S2xe+lLDjQ9XXJzuiRP3SpZNakriGxvwbA.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /root/.ssh/known_hosts:18
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.

**  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  ** **  **  **  **  **  **  **
This is a host being installed by the OpenShift Assisted Installer.
It will be installed from scratch during the installation.
The primary service is agent.service.  To watch its status run e.g
sudo journalctl -u agent.service
**  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  ** **  **  **  **  **  **  **
Last login: Tue May 25 18:49:53 2021 from 198.18.8.1
[core@localhost ~]$ sudo podman images
Error: error retrieving size of image "7a0d8af3d3e80fbe430e174f363eafa4fa5534bc5720b7e1a6116681c37b8791": you may need to remove the image to resolve the error: unable to determine size: error locating layer with ID "a5d40be26b41a76fa7c8305e5f9fa00ccd7819b9078d016258b904d7dd1bc2f7": layer not known


Also in the logs:
May 25 17:18:48 localhost systemd[1]: Starting agent.service...
May 25 17:18:49 localhost podman[1538]: Trying to pull quay.io/ocpmetal/assisted-installer-agent:stable.17.05.2021-17.46...
May 25 17:18:59 localhost podman[1538]: Getting image source signatures
May 25 17:19:01 localhost podman[1538]: Copying blob sha256:0181ea251735fe65ed58c00c99a112e48638371f0f8789f7383de3dddc0e2fd1
...
May 25 17:19:20 localhost podman[1538]: Copying blob sha256:8068da8437ef14577019c074663828182a2732043bd86ec88a298951d925abe9
May 25 17:20:19 localhost podman[1538]: time="2021-05-25T17:20:19Z" level=warning msg="failed, retrying in 1s ... (1/3). Error: Error writing blob: error storing blob to file \"/var/tmp/storage389343244/6\": error happened during read: read tcp 198.18.8.123:60700->13.32.204.13:443: read: connection reset by peer"
May 25 17:20:26 localhost podman[1538]: Getting image source signatures
May 25 17:20:28 localhost podman[1538]: Copying blob sha256:0181ea251735fe65ed58c00c99a112e48638371f0f8789f7383de3dddc0e2fd1
...

Comment 2 Alex Krzos 2021-05-25 19:05:13 UTC
Running `podman rmi --all --force` to remove the corrupted image allowed the agent to start (after restarting it manually). I can now see the SNO host as discovered.
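
Roughly, the manual recovery sequence was the following; the `systemctl restart` invocation is an assumption about how the restart was done, the unit name `agent.service` is taken from the logs above:

```
# Remove all local images, including the corrupted one, then restart the agent service
sudo podman rmi --all --force
sudo systemctl restart agent.service
```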

Comment 3 Flavio Percoco 2021-05-26 09:02:12 UTC
We should try to recover from such failures. 

The agent.service is configured to always restart, with a delay of 3s. We may be able to add clean-up logic in case of failures (using `OnFailure=` so that the agent image is removed before restarting the agent).

The agent.service logs won't be lost because they go to `journald`.
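
A minimal sketch of how that could be wired up, purely illustrative: the `agent-cleanup.service` unit name is an assumption, not the actual implementation, and the image reference is the tag seen in the logs above.

```
# Drop-in for agent.service (illustrative)
[Unit]
# Run the clean-up unit when agent.service ends up in a failed state
OnFailure=agent-cleanup.service

[Service]
Restart=always
RestartSec=3

# agent-cleanup.service (illustrative)
[Unit]
Description=Remove the possibly corrupted agent image

[Service]
Type=oneshot
# Force-remove the local copy of the agent image so the next start re-pulls it
ExecStart=/usr/bin/podman rmi --force quay.io/ocpmetal/assisted-installer-agent:stable.17.05.2021-17.46
```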

Comment 4 Flavio Percoco 2021-05-26 09:07:00 UTC
Taking this issue for now until we verify the idea and find someone to work on it.

Comment 5 Mat Kowalski 2021-05-26 14:45:08 UTC
After some investigation it seems there is a missing symlink in `/var/lib/containers` that we can recover without removing and re-pulling the image. Moving `/var/lib/containers/storage/overlay-images/images.json` aside makes podman recreate the symlinks so that `podman images` works again. A simple workaround using this:

```
# mv /var/lib/containers/storage/overlay-images/images.json /var/lib/containers/storage/overlay-images/images.json.temp
# podman images
# mv /var/lib/containers/storage/overlay-images/images.json.temp /var/lib/containers/storage/overlay-images/images.json
```

A related issue in c/storage seems to be https://github.com/containers/storage/pull/822 (https://bugzilla.redhat.com/show_bug.cgi?id=1921128), but based on the status there I cannot fully confirm that we are still hitting the same problem. For now the proposed workaround (https://github.com/openshift/assisted-service/pull/1836) will remove the image so that it is pulled again. Based on the outcome of further testing we will decide if another approach is needed.

Comment 8 Alex Krzos 2021-06-10 12:58:02 UTC
On my most recent tests:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* hub cluster ocp 4.8.0-fc.8
* clusterimageset / sno ocp is ocp 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02


I saw 0/60 clusters fail because of this issue over the last two iterations.

Comment 9 Flavio Percoco 2021-06-11 07:45:42 UTC
(In reply to Alex Krzos from comment #8)
> On my most recent tests:
> * egress hypervisor latency 75ms, 0.02% packet loss
> * ingress hypervisor latency 75ms, 0.02% packet loss
> * hub cluster ocp 4.8.0-fc.8
> * clusterimageset / sno ocp is ocp 4.8.0-fc.8
> * ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02
> 
> 
> I saw 0/60 clusters fail because of this issue over the last two iterations.

This is great news, thanks!

A better, longer-term, solution is being worked on as part of https://bugzilla.redhat.com/show_bug.cgi?id=1966872

There will soon be a package available for 8.4 that we can/should depend on to prevent similar failures in other workflows that depend on or require podman.

Comment 13 errata-xmlrpc 2021-10-18 17:31:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

