Bug 1964688 - [master] ACM/ZTP with Wan emulation, several clusters fail to step past discovery
Summary: [master] ACM/ZTP with Wan emulation, several clusters fail to step past discovery
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Ori Amizur
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform KNI-EDGE-4.8
Depends On:
Blocks: 1971632
 
Reported: 2021-05-25 20:39 UTC by Alex Krzos
Modified: 2022-08-28 08:45 UTC
CC List: 2 users

Fixed In Version: OCP-Metal-v1.0.23.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1971632 (view as bug list)
Environment:
Last Closed: 2022-08-28 08:45:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Assisted-service logs (2.27 MB, application/gzip)
2021-05-25 20:39 UTC, Alex Krzos
sno00031 journal logs (141.46 KB, application/gzip)
2021-05-25 20:43 UTC, Alex Krzos
sno00017 events log (7.81 KB, text/plain)
2021-06-10 20:06 UTC, Alex Krzos
sno00037 events log (7.83 KB, text/plain)
2021-06-10 20:07 UTC, Alex Krzos


Links
Red Hat Bugzilla 1964591 (urgent, CLOSED): [master] ACM/ZTP with Wan emulation fails to start the agent service (last updated 2021-10-18 17:32:04 UTC)

Internal Links: 1971632

Description Alex Krzos 2021-05-25 20:39:01 UTC
Created attachment 1786987 [details]
Assisted-service logs

Description of problem:
While running WAN emulation scenarios on the stage scale environment, several hosts failed to install after booting the discovery agent. The current WAN emulation introduces 150 ms of latency and 0.02% packet loss on outgoing traffic from the hypervisor hosting the VMs. This occurred on 3 of 72 simulated bare-metal machines. One more machine joined after it was forced to re-pull the container images that the agent pulls when it runs.


Version-Release number of selected component (if applicable):
ACM hub cluster: 4.8.0-0.nightly-2021-05-19-123944
SNO cluster: 4.8.0-fc.3
Assisted-installer version: stable.17.05.2021-17.46


How reproducible:
3/72 during this test

Steps to Reproduce:
1.
2.
3.

Actual results:
The clusters never move further in the provisioning cycle and stay "stuck" at the discovery agent stage.

Expected results:
Discovery completes and installation of the machines begins.


Additional info:
It seems it may be related to this log line emitted by the assisted-service on the hub cluster:

time="2021-05-25T20:16:26Z" level=error msg="Failed decode <free-network-addresses-ba74f63c> reply for host <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd> cluster <4a0a9e18-2630-4108-ae17-d280f0ff2f5c>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2779" cluster_id=4a0a9e18-2630-4108-ae17-d280f0ff2f5c error="unexpected end of JSON input" go-id=14637052 host_id=e02e0774-5e2a-41ec-8bcf-5b3f62988ccd pkg=Inventory request_id=398d4db2-091c-45e7-b4e6-72987f683253
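
For context, "unexpected end of JSON input" is the generic error Go's encoding/json returns when the document it is asked to decode is empty or cut off before it is complete, which fits a reply body that was truncated or lost in transit. A minimal, self-contained sketch that reproduces the message (the struct below is illustrative, not the actual assisted-service model):

package main

import (
    "encoding/json"
    "fmt"
)

// freeNetworkAddresses mirrors the rough shape of a free-network-addresses
// reply; the field names here are illustrative only.
type freeNetworkAddresses struct {
    Network       string   `json:"network"`
    FreeAddresses []string `json:"free_addresses"`
}

func main() {
    // A reply body that was cut off in transit: the closing brackets are missing.
    truncated := []byte(`[{"free_addresses":["198.18.8.0","198.18.8.2"`)
    var reply []freeNetworkAddresses
    fmt.Println(json.Unmarshal(truncated, &reply)) // unexpected end of JSON input

    // A completely empty body produces the same error, so a dropped or
    // zero-length step reply surfaces as this exact message in the service log.
    fmt.Println(json.Unmarshal([]byte(``), &reply)) // unexpected end of JSON input
}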

All of the host machines stuck in the discovery state during this test have similar log lines referencing their respective host IDs:

# oc logs -n assisted-installer assisted-service-fc55dd8bb-5vf96 -c assisted-service | grep "Failed decode" | awk '{print $9}' | sort | uniq -c
     27 <afd90092-47d3-46c6-bbdb-9f9845360bdb>
     32 <cdd9f30f-9761-441f-8550-6ce2813d1f85>
     38 <d79186a7-131f-4c51-8f6f-b068e81d290c>
     36 <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd>

* Note there are 4 host IDs; originally 3 hosts were stuck in this state, and another joined after its corrupted image was replaced per this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1964591).

Attaching journal logs from sno00031 and the assisted-service container to help diagnose this.

Comment 1 Alex Krzos 2021-05-25 20:43:51 UTC
Created attachment 1786988 [details]
sno00031 journal logs

Comment 2 Flavio Percoco 2021-06-01 09:42:27 UTC
Alex, any chance we can have a new run of this and collect more logs (including a tcpdump)?

Comment 3 Alex Krzos 2021-06-02 18:14:36 UTC
(In reply to Flavio Percoco from comment #2)
> Alex, any chance we can have a new run of this and collect more logs
> (including a tcpdump)

I have re-run this with HTTPS disabled, since last time trying to decode what was being sent over SSL seemed impossible with the current setup, and I was unable to reproduce the issue with HTTPS disabled (however, I also switched to the fc.7 OCP version). Reading back, I originally ran with fc.3, so I'll retry with that build.

Comment 4 Alex Krzos 2021-06-10 13:38:19 UTC
A quick update: I am still seeing this failure type. I have re-run with newer bits and adjusted the test to more accurately resemble WAN emulation:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* hub cluster ocp 4.8.0-fc.8
* clusterimageset / sno ocp is ocp 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02


1st run: 3/60 failed with this failure type.
2nd run: 2/60 failed with this failure type.

I expect to get a bit more time to dig into this again today. I suspect something is wrong with the free_addresses command:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name akrzos_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses ["198.18.8.0/21"]
invalid character '.' after array element
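
This particular failure looks like a shell-quoting artifact rather than an agent problem: without single quotes around the argument, the shell strips the inner double quotes, so the tool receives [198.18.8.0/21], which is not valid JSON (the quoted invocation in comment 5 succeeds). A minimal sketch using plain encoding/json, not the agent's own code, that reproduces both behaviors:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var subnets []string

    // Unquoted on the command line, ["198.18.8.0/21"] reaches the tool as
    // [198.18.8.0/21]: the decoder reads 198.18 as a number and then hits
    // the next '.', producing the error seen above.
    fmt.Println(json.Unmarshal([]byte(`[198.18.8.0/21]`), &subnets))
    // invalid character '.' after array element

    // Properly quoted, the argument is a JSON array of strings and parses fine.
    fmt.Println(json.Unmarshal([]byte(`["198.18.8.0/21"]`), &subnets), subnets)
    // <nil> [198.18.8.0/21]
}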

Comment 5 Alex Krzos 2021-06-10 20:05:07 UTC
It is still unclear to me what exactly is failing here.  Reviewing a cluster that is actively failing because of this, I see these things:

Following this request_id = 0e39f985-ebe1-4b47-98d5-c4ba2707a61b

I put a snippet of logs here - https://gist.github.com/akrzos/4db55dd43109212320504eba50d76bba since the lines are lengthy and tend to get chopped up when pasted into a BZ.

It seems the assisted-service is saying that the host status is insufficient ("host: <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5>, host status: <insufficient>") and asking the host to run some commands to gather data/host information; it also asks it to scan for free IP addresses. At first I thought there was an error when invoking this command, however it produces valid output when invoked on the SNO itself:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]'
[{"free_addresses":["198.18.8.0","198.18.8.2","198.18.8.10","198.18.8.11",.....
...,"198.18.15.254","198.18.15.255"],"network":"198.18.8.0/21"}]

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]' | jq -r ".[].free_addresses|length"
1973

This is the expected number (1973) of free addresses after accounting for all the hosts on this subnet.


The SNO sends the free addresses over:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=info msg="Sending step <free-network-addresses-18ddd3dc> reply error <> exit-code <0>" file="step_processor.go:48" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

However, the assisted-service reports a JSON decoding issue with the data:

time="2021-06-10T18:41:04Z" level=error msg="Failed decode <free-network-addresses-18ddd3dc> reply for host <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5> cluster <9a797e9e-e37e-44a9-bc05-48c9b6d5d654>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/remote-source/app/internal/bminventory/inventory.go:2828" cluster_id=9a797e9e-e37e-44a9-bc05-48c9b6d5d654 error="unexpected end of JSON input" go-id=635517 host_id=4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5 pkg=Inventory request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

And the SNO then gets a reply indicating that the data it sent was bad:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=warning msg="Assisted service returned status code Bad Request after processing step reply.  Reason: unexpected end of JSON input" file="step_processor.go:74" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

It is unclear what "packaging" the agent does to the free-address data that could have corrupted it. I would suspect that, since we have a /21 configured for all SNOs and the hub cluster, the large number of free addresses might be "too much" data; however, only 2/60 SNOs have this failure.
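
To put a number on the "too much data" theory: a /21 holds 2^11 = 2048 addresses, and listing roughly 2000 of them as dotted-quad strings serializes to about 30 KB of JSON. A rough, self-contained estimate of that payload size (an illustration of scale only, not the agent's serialization code):

package main

import (
    "encoding/json"
    "fmt"
    "net"
)

// nextIP returns the IP address immediately after ip.
func nextIP(ip net.IP) net.IP {
    next := make(net.IP, len(ip))
    copy(next, ip)
    for i := len(next) - 1; i >= 0; i-- {
        next[i]++
        if next[i] != 0 {
            break
        }
    }
    return next
}

func main() {
    _, network, _ := net.ParseCIDR("198.18.8.0/21")

    // Enumerate every address in the /21 (2048 of them) as a worst case.
    var addrs []string
    for ip := network.IP.Mask(network.Mask); network.Contains(ip); ip = nextIP(ip) {
        addrs = append(addrs, ip.String())
    }

    reply := []map[string]interface{}{
        {"network": "198.18.8.0/21", "free_addresses": addrs},
    }
    b, _ := json.Marshal(reply)
    fmt.Printf("addresses: %d, serialized reply: ~%d KB\n", len(addrs), len(b)/1024)
}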

So I dug into more of the objects on the hub cluster and found the events log for these failed clusters. The most notable events that could indicate an error are:

  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:27.481Z",
    "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
    "message": "Host sno00017: validation 'container-images-available' that used to succeed is now failing",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:27.490Z",
    "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
    "message": "Host sno00017: updated status from \"preparing-for-installation\" to \"insufficient\" (Host cannot be installed due to following failing validation(s): Failed to fetch container images needed for installation from quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a ; Host couldn't synchronize with any NTP server)",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:35.478Z",
    "message": "Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing",
    "severity": "warning"
  },
  {
    "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
    "event_time": "2021-06-09T22:43:35.487Z",
    "message": "Updated status of cluster sno00017 to insufficient",
    "severity": "info"
  },


It seems to me that something "told" the assisted-installer that the SNO was having trouble pulling images, and it gave up and prevented the cluster from installing further in this case (rather than this being caused by a corrupted HTTPS packet / failure to decode).

Comment 6 Alex Krzos 2021-06-10 20:06:37 UTC
Created attachment 1790017 [details]
sno00017 events log

Comment 7 Alex Krzos 2021-06-10 20:07:35 UTC
Created attachment 1790018 [details]
sno00037 events log

Comment 8 Alex Krzos 2021-06-11 20:06:58 UTC
I believe the smoking gun for this is:

Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37" level=info msg="Sending step <container-image-availability-6a5ee7e6> reply output <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",\"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},{\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",\"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no such file or directory\"\n> exit-code <0>" file="step_processor.go:50" request_id=1df87a8d-ee20-4a58-b934-8811150766c7


{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"}

Looks like the container-image-availability step does not have any mechanism to retry.
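
For illustration only, and not the agent's actual implementation: a bounded retry with backoff around the pull would let the step ride out the transient packet loss the WAN emulation introduces, instead of immediately failing the 'container-images-available' validation. A hypothetical sketch of such a wrapper (the pullWithRetry helper and the direct podman invocation are assumptions, not existing agent code):

package main

import (
    "context"
    "fmt"
    "os/exec"
    "time"
)

// pullWithRetry is a hypothetical bounded-retry wrapper around an image pull.
// The real container-image-availability step works differently; this only
// illustrates the kind of retry/backoff the comment above suggests is missing.
func pullWithRetry(ctx context.Context, image string, attempts int) error {
    backoff := 5 * time.Second
    var err error
    for i := 1; i <= attempts; i++ {
        if err = exec.CommandContext(ctx, "podman", "pull", image).Run(); err == nil {
            return nil
        }
        if i == attempts {
            break
        }
        fmt.Printf("pull of %s failed (attempt %d/%d): %v; retrying in %s\n",
            image, i, attempts, err, backoff)
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(backoff):
        }
        backoff *= 2 // back off further to ride out transient WAN packet loss
    }
    return fmt.Errorf("image %s unavailable after %d attempts: %w", image, attempts, err)
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
    defer cancel()
    img := "quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a"
    if err := pullWithRetry(ctx, img, 3); err != nil {
        fmt.Println(err)
    }
}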

Comment 9 Flavio Percoco 2021-06-14 11:35:24 UTC
(In reply to Alex Krzos from comment #8)
> I believe the smoking gun for this is:
> 
> Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37"
> level=info msg="Sending step <container-image-availability-6a5ee7e6> reply
> output
> <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:
> 834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",
> \"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.
> io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> 75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",
> \"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},
> {\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-
> dev/ocp-v4.0-art-dev@sha256:
> 83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",
> \"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}>
> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat
> /sys/fs/cgroup/devices/machine.slice/libpod-
> 08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no
> such file or directory\"\n> exit-code <0>" file="step_processor.go:50"
> request_id=1df87a8d-ee20-4a58-b934-8811150766c7
> 
> 
> {\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:
> 834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",
> \"result\":\"failure\"}
> 
> Looks like the container-image-availability step does not have any mechanism
> to retry.

Great investigation, @akrzos 

I think we should investigate both options. From conversations over chat, it looks like cri-o may have a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1966872

However, it is true that sending `free_addresses` results may produce quite a big JSON payload. It would be nice to verify whether there is retry logic for it and that it is, indeed, working as expected. Maybe we can find this info in the assisted-service logs.

Is the testing host still up? Any chance we can take a look at it and run some criocli commands?

Comment 10 Ronnie Lazar 2021-06-14 13:07:45 UTC
@oamizur Let's disable the Free-List tool on operator-based deployments.

Comment 11 Ronnie Lazar 2021-06-14 14:09:35 UTC
There are two separate issues here:
1 - The large free-addresses list is getting truncated, and the assisted-service is not able to decode the JSON.
2 - Some images are not being pulled correctly.

The second issue is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1964591. The long-term solution should come from the podman fix for https://bugzilla.redhat.com/show_bug.cgi?id=1966872.

Comment 13 Alex Krzos 2021-07-12 17:28:21 UTC
Removing needinfo since the bug is verified.

