Bug 1971632 - [4.8.0] ACM/ZTP with Wan emulation, several clusters fail to step past discovery
Summary: [4.8.0] ACM/ZTP with Wan emulation, several clusters fail to step past discovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ori Amizur
QA Contact: Chad Crum
URL:
Whiteboard: AI-Team-Platform KNI-EDGE-4.8
Depends On: 1964688
Blocks:
 
Reported: 2021-06-14 13:18 UTC by Ronnie Lazar
Modified: 2021-07-27 23:13 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1964688
Environment:
Last Closed: 2021-07-27 23:12:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-service pull 2057 0 None open [ocm-2.3] Bug 1971632: ACM/ZTP with Wan emulation, several clusters fail to step past discovery 2021-06-23 08:38:19 UTC
Red Hat Bugzilla 1964591 1 urgent CLOSED [master] ACM/ZTP with Wan emulation fails to start the agent service 2021-10-18 17:32:04 UTC
Red Hat Bugzilla 1964688 1 urgent CLOSED [master] ACM/ZTP with Wan emulation, several clusters fail to step past discovery 2022-08-28 08:47:34 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:13:09 UTC

Description Ronnie Lazar 2021-06-14 13:18:27 UTC
+++ This bug was initially created as a clone of Bug #1964688 +++

Created attachment 1786987 [details]
Assisted-service logs

Description of problem:
Running through WAN emulation scenarios on the stage scale environment, several hosts failed to install after booting the discovery agent. The current WAN emulation introduces 150ms of latency and 0.02% packet loss on outgoing traffic from the hypervisor hosting the VMs. This occurred on 3/72 simulated bare-metal machines. One more machine joined after it was forced to re-pull the container images that the agent pulls upon running.


Version-Release number of selected component (if applicable):
ACM hub cluster: 4.8.0-0.nightly-2021-05-19-123944
SNO cluster: 4.8.0-fc.3
Assisted-installer version: stable.17.05.2021-17.46


How reproducible:
3/72 during this test

Steps to Reproduce:
1.
2.
3.

Actual results:
The clusters never move further in the provisioning cycle and stay stuck at the discovery agent stage.

Expected results:
Discovery to complete and installation of the machines to begin.


Additional info:
It seems it may be related to this log line emitted from the assisted-service on the hub cluster:

time="2021-05-25T20:16:26Z" level=error msg="Failed decode <free-network-addresses-ba74f63c> reply for host <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd> cluster <4a0a9e18-2630-4108-ae17-d280f0ff2f5c>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2779" cluster_id=4a0a9e18-2630-4108-ae17-d280f0ff2f5c error="unexpected end of JSON input" go-id=14637052 host_id=e02e0774-5e2a-41ec-8bcf-5b3f62988ccd pkg=Inventory request_id=398d4db2-091c-45e7-b4e6-72987f683253

All of the host machines stuck in the discovery state during this test have similar log lines referencing their respective host IDs:

# oc logs -n assisted-installer assisted-service-fc55dd8bb-5vf96 -c assisted-service | grep "Failed decode" | awk '{print $9}' | sort | uniq -c
     27 <afd90092-47d3-46c6-bbdb-9f9845360bdb>
     32 <cdd9f30f-9761-441f-8550-6ce2813d1f85>
     38 <d79186a7-131f-4c51-8f6f-b068e81d290c>
     36 <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd>

* Note: there are 4 host IDs here; originally 3 hosts were stuck in this state, and another joined after its corrupted image was replaced per this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1964591)

Attaching journal logs from sno00031 and assisted-service container to help diagnose here

--- Additional comment from akrzos on 20210525T20:43:51

Created attachment 1786988 [details]
sno00031 journal logs

--- Additional comment from fpercoco on 20210601T09:42:27

Alex, any chance we can have a new run of this and collect more logs (including a tcpdump)?

--- Additional comment from akrzos on 20210602T18:14:36

(In reply to Flavio Percoco from comment #2)
> Alex, any chance we can have a new run of this and collect more logs
> (including a tcpdump)

I have re-run this with HTTPS disabled, since last time trying to decode what was being sent through SSL seemed impossible with the current setup, and I was unable to reproduce with HTTPS disabled (however, I also switched to the fc.7 OCP version). Reading back, I ran with fc.3, so I'll retry with that build.

--- Additional comment from akrzos on 20210610T13:38:19

A quick update: I am still seeing this failure type. I have re-run with newer bits and adjusted the test to more accurately resemble WAN emulation:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* hub cluster ocp 4.8.0-fc.8
* clusterimageset / sno ocp is ocp 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02


1st run:
3/60 failed with this failure type
2nd run:
2/60 failed with this failure type.

I expect to get a bit more time to dig into this again today. I suspect something is wrong with the free_addresses command:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name akrzos_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses ["198.18.8.0/21"]
invalid character '.' after array element
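This particular error looks like a shell quoting artifact rather than an agent problem: with the argument left unquoted, the shell strips the inner double quotes, so the string the tool receives is [198.18.8.0/21], which is not valid JSON; the same command single-quoted in the next comment parses fine. A minimal sketch of the two parses, assuming the agent decodes its argument with Go's encoding/json:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var subnets []string

	// Without quoting, the shell strips the double quotes, so the tool sees:
	err := json.Unmarshal([]byte(`[198.18.8.0/21]`), &subnets)
	fmt.Println(err) // invalid character '.' after array element

	// With the argument single-quoted, the quotes survive and it parses fine:
	err = json.Unmarshal([]byte(`["198.18.8.0/21"]`), &subnets)
	fmt.Println(err, subnets) // <nil> [198.18.8.0/21]
}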

--- Additional comment from akrzos on 20210610T20:05:07

It is still unclear to me what exactly is failing here.  Reviewing a cluster that is actively failing because of this, I see these things:

Following this request_id = 0e39f985-ebe1-4b47-98d5-c4ba2707a61b

I put a snippet of logs here - https://gist.github.com/akrzos/4db55dd43109212320504eba50d76bba since they are lengthy lines and tend to get chopped up when pasted into a bz.

It seems the assisted-service is saying that the host status is insufficient ("host: <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5>, host status: <insufficient>") and instructing it to run some commands to gather data/host information; it also asks it to scan for free IP addresses. At first I thought there was an error with invoking this command, however it produces valid output when invoked on the SNO itself:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]'
[{"free_addresses":["198.18.8.0","198.18.8.2","198.18.8.10","198.18.8.11",.....
...,"198.18.15.254","198.18.15.255"],"network":"198.18.8.0/21"}]

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]' | jq -r ".[].free_addresses|length"
1973

This is the expected number (1973) of free addresses after accounting for all the hosts on this subnet.
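As a quick sanity check on that count (simple arithmetic on my part, not taken from the agent): a /21 spans 2^(32-21) = 2048 addresses, so 1973 free leaves 75 addresses in use or excluded, which is in the right range for 72 SNOs plus the hub and infrastructure addresses on the subnet.

package main

import "fmt"

func main() {
	total := 1 << (32 - 21)   // addresses in a /21
	fmt.Println(total)        // 2048
	fmt.Println(total - 1973) // 75 addresses not reported as free in this run
}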


The SNO sends the free addresses over:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=info msg="Sending step <free-network-addresses-18ddd3dc> reply error <> exit-code <0>" file="step_processor.go:48" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

However, the assisted-service reports a JSON decoding issue with the data:

time="2021-06-10T18:41:04Z" level=error msg="Failed decode <free-network-addresses-18ddd3dc> reply for host <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5> cluster <9a797e9e-e37e-44a9-bc05-48c9b6d5d654>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/remote-source/app/internal/bminventory/inventory.go:2828" cluster_id=9a797e9e-e37e-44a9-bc05-48c9b6d5d654 error="unexpected end of JSON input" go-id=635517 host_id=4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5 pkg=Inventory request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

And the SNO then receives a reply indicating that the data it sent was bad:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=warning msg="Assisted service returned status code Bad Request after processing step reply.  Reason: unexpected end of JSON input" file="step_processor.go:74" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

It is unclear what "packaging" the agent does to the free-address data that could have corrupted it. I would suspect that, since we have a /21 configured for all SNOs and the hub cluster, the large number of free addresses might be "too much" data; however, only 2/60 SNOs hit this failure.
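To get a rough sense of how large that step reply actually is, here is a back-of-the-envelope sketch (my own estimate, using the reply shape shown in the podman output above, not the agent's code). For a mostly-free /21 it comes out to roughly 30 KB of JSON, large but not outlandish:

package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// freeNetworkAddresses mirrors the shape seen in the free_addresses output
// above; it is a simplified stand-in, not the real assisted-service model.
type freeNetworkAddresses struct {
	FreeAddresses []string `json:"free_addresses"`
	Network       string   `json:"network"`
}

// nextIP returns a copy of ip incremented by one.
func nextIP(ip net.IP) net.IP {
	out := make(net.IP, len(ip))
	copy(out, ip)
	for i := len(out) - 1; i >= 0; i-- {
		out[i]++
		if out[i] != 0 {
			break
		}
	}
	return out
}

func main() {
	_, ipnet, err := net.ParseCIDR("198.18.8.0/21")
	if err != nil {
		panic(err)
	}

	// Pretend every address in the /21 is free to bound the payload size.
	reply := freeNetworkAddresses{Network: ipnet.String()}
	for ip := ipnet.IP; ipnet.Contains(ip); ip = nextIP(ip) {
		reply.FreeAddresses = append(reply.FreeAddresses, ip.String())
	}

	b, _ := json.Marshal([]freeNetworkAddresses{reply})
	fmt.Printf("%d addresses, %d bytes of JSON\n", len(reply.FreeAddresses), len(b))
}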

So I dug into more of the objects on the hub cluster and found the events log for these failed clusters. The most notable events that could indicate an error are:

  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:27.481Z",
    "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
    "message": "Host sno00017: validation 'container-images-available' that used to succeed is now failing",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:27.490Z",
    "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
    "message": "Host sno00017: updated status from \"preparing-for-installation\" to \"insufficient\" (Host cannot be installed due to following failing validation(s): Failed to fetch container images needed for installation from quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a ; Host couldn't synchronize with any NTP server)",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:35.478Z",
    "message": "Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:35.487Z",
    "message": "Updated status of cluster sno00017 to insufficient",
    "severity": "info"
  },


It seems to me that something "told" the assisted-installer that the SNO was having trouble pulling images, so it gave up and prevented the cluster from installing further in this case (rather than this being a corrupted HTTPS packet causing the decode failure).

--- Additional comment from akrzos on 20210610T20:06:37

Created attachment 1790017 [details]
sno00017 events log

--- Additional comment from akrzos on 20210610T20:07:35

Created attachment 1790018 [details]
sno00037 events log

--- Additional comment from akrzos on 20210611T20:06:58

I believe the smoking gun for this is:

Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37" level=info msg="Sending step <container-image-availability-6a5ee7e6> reply output <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",\"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},{\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",\"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no such file or directory\"\n> exit-code <0>" file="step_processor.go:50" request_id=1df87a8d-ee20-4a58-b934-8811150766c7


{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"}

Looks like the container-image-availability step does not have any mechanism to retry.
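For illustration only (this is not the assisted-installer-agent code, and the helper names are hypothetical), a retry-with-backoff wrapper of the sort the image-availability check appears to be missing could look roughly like this:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pullImage is a hypothetical stand-in for the agent's image-availability probe.
func pullImage(ctx context.Context, image string) error {
	return errors.New("transient pull failure") // simulate a flaky WAN link
}

// withRetry retries fn with exponential backoff until it succeeds, the
// attempts are exhausted, or the context is cancelled.
func withRetry(ctx context.Context, attempts int, initial time.Duration, fn func(context.Context) error) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= attempts; attempt++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if attempt == attempts {
			break
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	image := "quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a"
	err := withRetry(context.Background(), 3, 2*time.Second, func(ctx context.Context) error {
		return pullImage(ctx, image)
	})
	fmt.Println(err)
}

Whether a retry like this belongs in the agent step or in the service-side validation is a separate design question; the sketch only shows the shape of the mechanism.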

--- Additional comment from fpercoco on 20210614T11:35:24

(In reply to Alex Krzos from comment #8)
> I believe the smoking gun for this is:
> 
> Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37"
> level=info msg="Sending step <container-image-availability-6a5ee7e6> reply
> output
> <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:
> 834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",
> \"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.
> io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> 75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",
> \"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},
> {\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-
> dev/ocp-v4.0-art-dev@sha256:
> 83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",
> \"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}>
> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat
> /sys/fs/cgroup/devices/machine.slice/libpod-
> 08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no
> such file or directory\"\n> exit-code <0>" file="step_processor.go:50"
> request_id=1df87a8d-ee20-4a58-b934-8811150766c7
> 
> 
> {\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:
> 834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",
> \"result\":\"failure\"}
> 
> Looks like the container-image-availability step does not have any mechanism
> to retry.

Great investigation, @akrzos 

I think we should investigate both options. From conversations over chat, it looks like cri-o may have a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1966872

However, it is true that sending `free_addresses` may end up producing quite a big JSON payload. It would be nice to verify whether there is retry logic for it and that it is, indeed, working as expected. Maybe we can find this info in the assisted-service logs.

Is the testing host still up? Any chance we can take a look at it and run some criocli commands?

--- Additional comment from alazar on 20210614T13:07:45

@oamizur Let's disable the Free-List tool on operator-based deployments.

Comment 3 nshidlin 2021-06-29 17:38:02 UTC
This bug doesn't block the OCP 4.8.0 release - the fix will be delivered through the ACM channel on the AI operator and verified then.

Comment 9 errata-xmlrpc 2021-07-27 23:12:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

