Created attachment 1719947 [details] service log A failure in running the installer happened but the Agent log did not contain the StepReply. (install-a9bea63b) Service log: time="2020-10-08T10:34:35Z" level=info msg="Submitting step <install> id <install-a9bea63b> to cluster <d52a0eca-d47b-4305-9dc2-2b675e9daf65> host <89ce12aa-b43e-4226-9807-f6e55dbcf8b7> Command: <bash> Arguments: <[-c podman run -v /dev:/dev:rw -v /opt:/opt:rw -v /run/systemd/journal/socket:/run/systemd/journal/socket --privileged --pid=host --net=host -v /var/log:/var/log:rw --env PULL_SECRET_TOKEN --name assisted-installer registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21 --role master --cluster-id d52a0eca-d47b-4305-9dc2-2b675e9daf65 --boot-device /dev/vda --host-id 89ce12aa-b43e-4226-9807-f6e55dbcf8b7 --openshift-version 4.6 --controller-image registry.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-17 --url https://api.openshift.com --insecure=false --agent-image registry.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-17 --host-name liat10981-test-cluster-assisted-installer-master-2 || ( returnCode=$?; podman run --rm --privileged -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log --env PULL_SECRET_TOKEN --name logs-sender registry.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-17 logs_sender -url https://api.openshift.com -cluster-id d52a0eca-d47b-4305-9dc2-2b675e9daf65 -host-id 89ce12aa-b43e-4226-9807-f6e55dbcf8b7 --insecure=false -bootstrap=false; exit $returnCode; )]>" func=github.com/openshift/assisted-service/internal/host.logSteps file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:125" cluster_id=d52a0eca-d47b-4305-9dc2-2b675e9daf65 go-id=1519033 host_id=89ce12aa-b43e-4226-9807-f6e55dbcf8b7 pkg=instructions request_id=b9968c7f-d9b6-4176-b9f2-f8408e80786f time="2020-10-08T10:35:09Z" level=error msg="Exit code is <125> " func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:1537" cluster_id=d52a0eca-d47b-4305-9dc2-2b675e9daf65 error="Received step reply <install-a9bea63b> from cluster <d52a0eca-d47b-4305-9dc2-2b675e9daf65> host <89ce12aa-b43e-4226-9807-f6e55dbcf8b7> exit-code <125> stdout <Logs were sent\n> stderr <Trying to pull registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21...\nGetting image source signatures\nCopying blob sha256:0fd3b5213a9b4639d32bf2ef6a3d7cc9891c4d8b23639ff7ae99d66ecb490a70\nCopying blob sha256:aebb8c5568533b57ee3da86262f7bff81383a2a624b9f54b9da3418705009901\n Get \"https://registry.redhat.io/v2/openshift4/assisted-installer-rhel8/blobs/sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2\": dial tcp 184.27.156.130:443: i/o timeout\nError: unable to pull registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21: unable to pull image: Error reading blob sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2: Get \"https://registry.redhat.io/v2/openshift4/assisted-installer-rhel8/blobs/sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2\": dial tcp 184.27.156.130:443: i/o timeout\n>" go-id=1519033 host_id=89ce12aa-b43e-4226-9807-f6e55dbcf8b7 pkg=Inventory request_id=b9968c7f-d9b6-4176-b9f2-f8408e80786f
Created attachment 1719948 [details] hosts logs
Please add the install command even if there is no error
We should at least print the step.StepID before executing it. Here, I think: https://github.com/openshift/assisted-installer-agent/blob/master/src/commands/step_processor.go#L67
Print the step.StepID before executing it: https://github.com/openshift/assisted-installer-agent/pull/51
The commit that Fred made is just for adding a log before starting the command. We still need to understand why we don't have the steps reply in the log
we should add a log after stdout, stderr, exitCode := handler(command, args...) when the exitCode != 0
(In reply to Fred Rolland from comment #4) > Print the step.StepID before executing it: > > https://github.com/openshift/assisted-installer-agent/pull/51 Not sure how to verify the bug. What is correct failure scenario? My cluster failed http://10.19.131.3:8080/clusters/58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be But I did not find <install-****> step Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-c88a45f8>, command: </usr/bin/podman>, args: <[stop -i -t 5 assisted-installer]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-49dd511d>, command: <podman>, args: <[run --rm --privileged --net=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log --env PULL_SECRET_TOKEN --name logs-sender --pid=host quay.io/ocpmetal/assisted-installer-agent:latest logs_sender -url http://10.19.131.3:8090 -cluster-id 58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be -host-id aafd9871-ff57-4fe4-be81-695ea2b753d2 --insecure=false -bootstrap=false -with-installer-gather-logging=true]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Sending step <execute-c88a45f8> reply output <> error <> exit-code <0>" file="step_processor.go:46" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd Dec 14 17:56:45 master-0-0 logs_sender[4333]: time="14-12-2020 17:56:45" level=info msg="Start gathering journalctl logs with tags [agent installer], services [] and installer-gather" file="send_logs.go:238" Thanks
There is more than one PR relevant for resolving this ticket: (1) https://github.com/openshift/assisted-installer-agent/pull/51 (2) https://github.com/openshift/assisted-installer-agent/pull/91/files The first one you mentioned yourself. It adds the "Executing step:" log and it seems like from your logs that it appears correctly as expected. The second PR makes it so when the *step itself* fails (exit code != 0), a "Step execution failed" log should occur, with the step's exitcode/stdout/stderr also included in the log. I'll try and look into why exactly your cluster failed, but it might be for a reason other than a failed step, so it's not good enough for the test. In order to test PR #91 (the second one), specifically a step failure needs to occur.
Ok, reproduced cluster failure https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31 We can see Executing step: <install-7e44a08d> However it happened even error <> exit-code <0> Not sure we expect this behaviour Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Executing step: <install-7e44a08d>, command: <bash>, args: <[-c podman run -v /dev:/dev:rw -v /opt:/opt:rw -v /run/systemd/journal/socket:/run/systemd/journal/socket --privileged --pid=host --net=host -v /var/log:/var/log:rw --env PULL_SECRET_TOKEN --name assisted-installer quay.io/ocpmetal/assisted-installer:latest --role worker --cluster-id 96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31 --boot-device /dev/sda --host-id ec8a2162-2927-4a16-b31d-cd00af78f93f --openshift-version 4.6 --mco-image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c8f0b9a2852b15b45487c08c158e10f3b803d7a77538d6dbc1d991994f58bcee --controller-image quay.io/ocpmetal/assisted-installer-controller:latest --url https://api.stage.openshift.com --insecure=false --agent-image quay.io/ocpmetal/assisted-installer-agent:latest]>" file="step_processor.go:78" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5 Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Sending step <dhcp-lease-allocate-0e2ba009> reply output <{\"api_vip_address\":\"192.168.123.131\",\"api_vip_lease\":\"lease {\\n interface \\\"api\\\";\\n fixed-address 192.168.123.131;\\n option subnet-mask 255.255.255.0;\\n option routers 192.168.123.1;\\n option dhcp-lease-time 3600;\\n option dhcp-message-type 5;\\n option domain-name-servers 192.168.123.1;\\n option dhcp-server-identifier 192.168.123.1;\\n option dhcp-renewal-time 1800;\\n option broadcast-address 192.168.123.255;\\n option dhcp-rebinding-time 3150;\\n option host-name \\\"00-1a-4a-33-3c-6f-api\\\";\\n option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n renew 2 2020/12/15 20:40:11;\\n rebind 2 2020/12/15 21:04:48;\\n expire 2 2020/12/15 21:12:18;\\n}\",\"ingress_vip_address\":\"192.168.123.109\",\"ingress_vip_lease\":\"lease {\\n interface \\\"ingress\\\";\\n fixed-address 192.168.123.109;\\n option subnet-mask 255.255.255.0;\\n option routers 192.168.123.1;\\n option dhcp-lease-time 3600;\\n option dhcp-message-type 5;\\n option domain-name-servers 192.168.123.1;\\n option dhcp-server-identifier 192.168.123.1;\\n option dhcp-renewal-time 1800;\\n option broadcast-address 192.168.123.255;\\n option dhcp-rebinding-time 3150;\\n option host-name \\\"00-1a-4a-54-a6-00-ingress\\\";\\n option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n renew 2 2020/12/15 20:41:27;\\n rebind 2 2020/12/15 21:04:48;\\n expire 2 2020/12/15 21:12:18;\\n}\"}> error <> exit-code <0>" file="step_processor.go:46" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5
I don't seem to have access to see the latest cluster you linked. However, I can see from the logs you posted that the step (the one that happens to be in the logs, there should be many other ones) has exit-code 0. This issue's PRs deal with adding logs to step failures specifically, not cluster failures. The cluster failure might be interesting to investigate, but I'm not sure that it's relevant to discuss it as part of this issue's comments.
Created attachment 1739467 [details] Installer log
Added logs. Still do not know how to reproduce Step failure Give me clear scenario Thank you
yobshans I think that this bug is not really worth the testing effort that is required from you
(In reply to Ronnie Lazar from comment #13) > yobshans I think that this bug is not really worth the testing > effort that is required from you Agreed, let's close it