Created attachment 1786987 [details]
Assisted-service logs

Description of problem:
Running through WAN emulation scenarios on the stage scale environment, several hosts failed to install after booting the discovery agent. The current WAN emulation introduces 150ms of latency and 0.02% packet loss on outgoing traffic from the hypervisor hosting the VMs. This occurred on 3/72 simulated bare metal machines. One more machine joined after it was forced to re-pull the container images that the agent pulls upon running.

Version-Release number of selected component (if applicable):
ACM hub cluster: 4.8.0-0.nightly-2021-05-19-123944
SNO cluster: 4.8.0-fc.3
Assisted-installer version: stable.17.05.2021-17.46

How reproducible:
3/72 during this test

Steps to Reproduce:
1.
2.
3.

Actual results:
The clusters never move further in the provisioning cycle and stay "stuck" on the discovery agent.

Expected results:
Discovery completes and installation of the machines begins.

Additional info:
It seems it may be related to this log line emitted from the assisted-service on the hub cluster:

time="2021-05-25T20:16:26Z" level=error msg="Failed decode <free-network-addresses-ba74f63c> reply for host <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd> cluster <4a0a9e18-2630-4108-ae17-d280f0ff2f5c>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2779" cluster_id=4a0a9e18-2630-4108-ae17-d280f0ff2f5c error="unexpected end of JSON input" go-id=14637052 host_id=e02e0774-5e2a-41ec-8bcf-5b3f62988ccd pkg=Inventory request_id=398d4db2-091c-45e7-b4e6-72987f683253

Of all the host machines stuck in a discovery state during this test, every one has similar log lines for its respective host id:

# oc logs -n assisted-installer assisted-service-fc55dd8bb-5vf96 -c assisted-service | grep "Failed decode" | awk '{print $9}' | sort | uniq -c
     27 <afd90092-47d3-46c6-bbdb-9f9845360bdb>
     32 <cdd9f30f-9761-441f-8550-6ce2813d1f85>
     38 <d79186a7-131f-4c51-8f6f-b068e81d290c>
     36 <e02e0774-5e2a-41ec-8bcf-5b3f62988ccd>

* Note there are 4 host ids; originally there were 3 hosts stuck in this state, and another joined after its corrupted image was replaced per this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1964591)

Attaching journal logs from sno00031 and the assisted-service container to help diagnose here.
Created attachment 1786988 [details]
sno00031 journal logs
Alex, any chance we can have a new run of this and collect more logs (including a tcpdump)?
(In reply to Flavio Percoco from comment #2)
> Alex, any chance we can have a new run of this and collect more logs
> (including a tcpdump)?

I have re-run this with HTTPS disabled, since last time trying to decode what was being sent over SSL seemed impossible with the current setup, and I was unable to reproduce with HTTPS disabled (however, I also switched to the fc.7 OCP version). Reading back, I originally ran with fc.3, so I'll retry with that build.
A quick update: I am still seeing this failure type. I have re-run with newer bits and adjusted the test to more accurately resemble WAN emulation:

* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* hub cluster OCP 4.8.0-fc.8
* clusterimageset / SNO OCP is 4.8.0-fc.8
* ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02

1st run: 3/60 failed with this failure type
2nd run: 2/60 failed with this failure type

I expect to get a bit more time to dig at this again today. I suspect something is wrong with the free_addresses command:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name akrzos_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses ["198.18.8.0/21"]
invalid character '.' after array element
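For what it's worth, the error above appears to be a quoting artifact rather than an agent bug: without single quotes the shell strips the inner double quotes, so the tool receives [198.18.8.0/21], which is not valid JSON, and the error text matches what Go's encoding/json reports for that input (the properly quoted invocation in the next comment succeeds). A minimal sketch of the two parses, assuming the tool decodes its argument with encoding/json:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Without single quotes around the argument, the shell strips the inner
	// double quotes, so the tool receives this (not valid JSON):
	unquoted := `[198.18.8.0/21]`
	// With '["198.18.8.0/21"]' the tool receives valid JSON:
	quoted := `["198.18.8.0/21"]`

	var nets []string
	fmt.Println(json.Unmarshal([]byte(unquoted), &nets))
	// -> invalid character '.' after array element
	fmt.Println(json.Unmarshal([]byte(quoted), &nets), nets)
	// -> <nil> [198.18.8.0/21]
}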
It is still unclear to me what exactly is failing here. Reviewing a cluster that is actively failing because of this, I see the following.

Following request_id = 0e39f985-ebe1-4b47-98d5-c4ba2707a61b, I put a snippet of logs here - https://gist.github.com/akrzos/4db55dd43109212320504eba50d76bba - since the lines are lengthy and tend to get chopped up when pasted into a bz.

The assisted-service reports that the host status is insufficient ("host: <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5>, host status: <insufficient>") and asks the host to run some commands to gather data/host information, including a scan for free IP addresses. At first I thought there was an error with invoking this command, however it produces valid output when invoked off the SNO itself:

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]'
[{"free_addresses":["198.18.8.0","198.18.8.2","198.18.8.10","198.18.8.11",..... ...,"198.18.15.254","198.18.15.255"],"network":"198.18.8.0/21"}]

[root@localhost ~]# podman run --privileged --net=host --rm --quiet --name test_free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket quay.io:443/acm-d/assisted-installer-agent-rhel8@sha256:5f97714289d1dffe43db44056a8d04bffde18957745ebdf6675eda53d7b0dfc7 free_addresses '["198.18.8.0/21"]' | jq -r ".[].free_addresses|length"
1973

1973 is the expected number of free addresses after accounting for all the hosts on this subnet.

The SNO sends the free addresses over:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=info msg="Sending step <free-network-addresses-18ddd3dc> reply error <> exit-code <0>" file="step_processor.go:48" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

However, the assisted-service reports a JSON decoding issue with the data:

time="2021-06-10T18:41:04Z" level=error msg="Failed decode <free-network-addresses-18ddd3dc> reply for host <4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5> cluster <9a797e9e-e37e-44a9-bc05-48c9b6d5d654>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/remote-source/app/internal/bminventory/inventory.go:2828" cluster_id=9a797e9e-e37e-44a9-bc05-48c9b6d5d654 error="unexpected end of JSON input" go-id=635517 host_id=4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5 pkg=Inventory request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

And the SNO then gets the reply that the data it sent was bad:

Jun 10 18:41:04 localhost next_step_runne[1965]: time="10-06-2021 18:41:04" level=warning msg="Assisted service returned status code Bad Request after processing step reply. Reason: unexpected end of JSON input" file="step_processor.go:74" request_id=0e39f985-ebe1-4b47-98d5-c4ba2707a61b

It is unclear what "packaging" the agent does to the free-address data that could have corrupted it. Since we have a /21 configured for all SNOs and the hub cluster, I would suspect the large number of free addresses is "too much" data, however only 2/60 SNOs hit this failure. So I dug into more of the objects off the hub cluster and found the events log for these failed clusters.
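As a side note, "unexpected end of JSON input" is the error Go's encoding/json returns when a document is cut off mid-stream, which is consistent with the reply being truncated in transit rather than malformed by the agent. A minimal sketch, assuming the service unmarshals the step reply with encoding/json (the struct below is illustrative, not the actual assisted-service model):

package main

import (
	"encoding/json"
	"fmt"
)

// freeNetworkAddresses mirrors the shape of the reply shown above;
// the type name itself is illustrative, not the real service model.
type freeNetworkAddresses struct {
	Network       string   `json:"network"`
	FreeAddresses []string `json:"free_addresses"`
}

func main() {
	full := `[{"free_addresses":["198.18.8.0","198.18.8.2"],"network":"198.18.8.0/21"}]`

	var reply []freeNetworkAddresses
	fmt.Println(json.Unmarshal([]byte(full), &reply)) // <nil>

	// Cut the payload off mid-document, as a truncated request body would be.
	truncated := full[:len(full)/2]
	fmt.Println(json.Unmarshal([]byte(truncated), &reply))
	// -> unexpected end of JSON input
}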
The most notable events that could indicate an error are:

    {
        "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
        "event_time": "2021-06-09T22:43:27.481Z",
        "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
        "message": "Host sno00017: validation 'container-images-available' that used to succeed is now failing",
        "severity": "warning"
    },
    {
        "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
        "event_time": "2021-06-09T22:43:27.490Z",
        "host_id": "4a0c6719-e600-49ef-9f1a-5bcc1a31d1c5",
        "message": "Host sno00017: updated status from \"preparing-for-installation\" to \"insufficient\" (Host cannot be installed due to following failing validation(s): Failed to fetch container images needed for installation from quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a ; Host couldn't synchronize with any NTP server)",
        "severity": "warning"
    },
    {
        "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
        "event_time": "2021-06-09T22:43:35.478Z",
        "message": "Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing",
        "severity": "warning"
    },
    {
        "cluster_id": "9a797e9e-e37e-44a9-bc05-48c9b6d5d654",
        "event_time": "2021-06-09T22:43:35.487Z",
        "message": "Updated status of cluster sno00017 to insufficient",
        "severity": "info"
    },

It seems to me that something "told" the assisted-installer the SNO was having trouble pulling images, so it gave up and now prevents the cluster from installing further in this case (rather than a corrupted HTTPS packet / failure to decode).
Created attachment 1790017 [details]
sno00017 events log
Created attachment 1790018 [details]
sno00037 events log
I believe the smoking gun for this is:

Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37" level=info msg="Sending step <container-image-availability-6a5ee7e6> reply output <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",\"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},{\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",\"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no such file or directory\"\n> exit-code <0>" file="step_processor.go:50" request_id=1df87a8d-ee20-4a58-b934-8811150766c7

In particular:

{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"}

It looks like the container-image-availability step does not have any mechanism to retry.
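For illustration only, a retry wrapper around the availability check would mask transient pull failures like this one. A rough sketch, under the assumption that the check ultimately shells out to podman (checkImage and checkImageWithRetry are hypothetical names, not the agent's actual API):

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// checkImage is a stand-in for however the agent tests that an image can be
// pulled; here it simply shells out to podman. Hypothetical, not agent code.
func checkImage(ref string) error {
	return exec.Command("podman", "pull", "--quiet", ref).Run()
}

// checkImageWithRetry retries a failed availability check a few times with a
// fixed backoff before reporting the image as a "failure".
func checkImageWithRetry(ref string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = checkImage(ref); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return err
}

func main() {
	ref := "quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a"
	if err := checkImageWithRetry(ref, 3, 10*time.Second); err != nil {
		fmt.Println("image unavailable after retries:", err)
	}
}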
(In reply to Alex Krzos from comment #8)
> I believe the smoking gun for this is:
>
> Jun 10 21:06:37 localhost next_step_runne[1940]: time="10-06-2021 21:06:37" level=info msg="Sending step <container-image-availability-6a5ee7e6> reply output <{\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"},{\"download_rate\":6.553431398919053,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75dcac253d32eeb10fba6fc760a55c23b424dbe075946f7c16ae9491902c61dc\",\"result\":\"success\",\"size_bytes\":475449430,\"time\":72.549692071},{\"download_rate\":15.100748670432615,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83276e1421645322d9abdf21752c170645fac26456b05b19b1870b7c0b808dbb\",\"result\":\"success\",\"size_bytes\":428752695,\"time\":28.392810473}]}> error <time=\"2021-06-10T21:06:37Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-08e75a959d2036bbf7df221acb80e17c1d05e105f69ff7466e7a2dbbfd7cade2.scope: no such file or directory\"\n> exit-code <0>" file="step_processor.go:50" request_id=1df87a8d-ee20-4a58-b934-8811150766c7
>
> {\"images\":[{\"name\":\"quay.io:443/acm-d/assisted-installer-rhel8@sha256:834c3259f30c5b139e02712aa9c56413535749bf8e8ab9adfd4908902d7e651a\",\"result\":\"failure\"}
>
> Looks like the container-image-availability step does not have any mechanism to retry.

Great investigation, @akrzos. I think we should investigate both options.

From conversations over chat, it looks like cri-o may have a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1966872

However, it is true that sending `free_addresses` may end up in quite a big JSON payload. It would be nice to verify whether there's retry logic for it and that it is, indeed, working as expected. Maybe we can find this info in the assisted-service logs.

Is the testing host still up? Any chance we can take a look at it and run some criocli commands?
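To put "quite a big JSON payload" in numbers, a back-of-the-envelope estimate based on the 1973 free addresses reported by jq above (just arithmetic, not measured on the host):

package main

import "fmt"

func main() {
	const freeAddresses = 1973 // the count reported by jq above
	// Each entry is a quoted dotted-quad plus a comma, e.g. "198.18.15.254",
	// which is roughly 16 bytes including punctuation.
	const bytesPerEntry = 16
	fmt.Printf("~%d KB free_addresses payload\n", freeAddresses*bytesPerEntry/1024)
	// -> ~30 KB: sizeable for a step reply, but not absurd, which fits the
	// observation that only 2/60 SNOs hit the decode failure.
}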
@oamizur Let's disable the Free-List tool on operator-based deployments.
There are two separate issues here:

1 - The large free-addresses list is getting truncated and the assisted-service is not able to decode the JSON.
2 - Some images are not being pulled correctly.

The second issue is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1964591. The long-term solution should come from the podman fix to https://bugzilla.redhat.com/show_bug.cgi?id=1966872.
PR https://github.com/openshift/assisted-service/pull/2009
Removing needinfo since bug is verified.