Bug 1886426 - Missing step reply log in Agent log
Summary: Missing step reply log in Agent log
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ---
: ---
Assignee: Omer Tuchfeld
QA Contact: Yuri Obshansky
URL:
Whiteboard: OCP-Metal-v1.0.12
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-08 12:32 UTC by Fred Rolland
Modified: 2022-08-28 08:45 UTC (History)
2 users (show)

Fixed In Version: OCP-Metal-v1.0.12.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-28 08:45:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
service log (307.19 KB, text/plain)
2020-10-08 12:32 UTC, Fred Rolland
no flags Details
hosts logs (27.00 KB, application/x-tar)
2020-10-08 12:33 UTC, Fred Rolland
no flags Details
Installer log (33.00 KB, application/x-tar)
2020-12-15 21:18 UTC, Yuri Obshansky
no flags Details

Description Fred Rolland 2020-10-08 12:32:02 UTC
Created attachment 1719947 [details]
service log

A failure in running the installer happened but the Agent log did not contain the StepReply. (install-a9bea63b)

Service log:

time="2020-10-08T10:34:35Z" level=info msg="Submitting step <install> id <install-a9bea63b> to cluster <d52a0eca-d47b-4305-9dc2-2b675e9daf65> host <89ce12aa-b43e-4226-9807-f6e55dbcf8b7> Command: <bash> Arguments: <[-c podman run -v /dev:/dev:rw -v /opt:/opt:rw -v /run/systemd/journal/socket:/run/systemd/journal/socket --privileged --pid=host --net=host -v /var/log:/var/log:rw --env PULL_SECRET_TOKEN --name assisted-installer registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21 --role master --cluster-id d52a0eca-d47b-4305-9dc2-2b675e9daf65 --boot-device /dev/vda --host-id 89ce12aa-b43e-4226-9807-f6e55dbcf8b7 --openshift-version 4.6 --controller-image registry.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-17 --url https://api.openshift.com --insecure=false --agent-image registry.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-17 --host-name liat10981-test-cluster-assisted-installer-master-2 || ( returnCode=$?; podman run --rm --privileged -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log --env PULL_SECRET_TOKEN --name logs-sender registry.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-17 logs_sender -url https://api.openshift.com -cluster-id d52a0eca-d47b-4305-9dc2-2b675e9daf65 -host-id 89ce12aa-b43e-4226-9807-f6e55dbcf8b7 --insecure=false -bootstrap=false; exit $returnCode; )]>" func=github.com/openshift/assisted-service/internal/host.logSteps file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:125" cluster_id=d52a0eca-d47b-4305-9dc2-2b675e9daf65 go-id=1519033 host_id=89ce12aa-b43e-4226-9807-f6e55dbcf8b7 pkg=instructions request_id=b9968c7f-d9b6-4176-b9f2-f8408e80786f
time="2020-10-08T10:35:09Z" level=error msg="Exit code is <125> " func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).PostStepReply" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:1537" cluster_id=d52a0eca-d47b-4305-9dc2-2b675e9daf65 error="Received step reply <install-a9bea63b> from cluster <d52a0eca-d47b-4305-9dc2-2b675e9daf65> host <89ce12aa-b43e-4226-9807-f6e55dbcf8b7>  exit-code <125> stdout <Logs were sent\n> stderr <Trying to pull registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21...\nGetting image source signatures\nCopying blob sha256:0fd3b5213a9b4639d32bf2ef6a3d7cc9891c4d8b23639ff7ae99d66ecb490a70\nCopying blob sha256:aebb8c5568533b57ee3da86262f7bff81383a2a624b9f54b9da3418705009901\n  Get \"https://registry.redhat.io/v2/openshift4/assisted-installer-rhel8/blobs/sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2\": dial tcp 184.27.156.130:443: i/o timeout\nError: unable to pull registry.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-21: unable to pull image: Error reading blob sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2: Get \"https://registry.redhat.io/v2/openshift4/assisted-installer-rhel8/blobs/sha256:76e3aaebfbf2b79c06c65c7ff745800d5523c9e41d31141843ad9c0904fdbac2\": dial tcp 184.27.156.130:443: i/o timeout\n>" go-id=1519033 host_id=89ce12aa-b43e-4226-9807-f6e55dbcf8b7 pkg=Inventory request_id=b9968c7f-d9b6-4176-b9f2-f8408e80786f

Comment 1 Fred Rolland 2020-10-08 12:33:54 UTC
Created attachment 1719948 [details]
hosts logs

Comment 2 liat gamliel 2020-10-08 12:36:52 UTC
Please add the install command even if there is no error

Comment 3 Fred Rolland 2020-10-11 09:45:38 UTC
We should at least print the step.StepID before executing it.
Here, I think:
https://github.com/openshift/assisted-installer-agent/blob/master/src/commands/step_processor.go#L67

Comment 4 Fred Rolland 2020-10-13 05:48:42 UTC
Print the step.StepID before executing it:

https://github.com/openshift/assisted-installer-agent/pull/51

Comment 5 Ronnie Lazar 2020-10-15 07:33:55 UTC
The commit that Fred made is just for adding a log before starting the command.
We still need to understand why we don't have the steps reply in the log

Comment 6 liat gamliel 2020-10-19 06:43:14 UTC
we should add a log after stdout, stderr, exitCode := handler(command, args...) 
when the exitCode != 0

Comment 7 Yuri Obshansky 2020-12-14 18:07:19 UTC
(In reply to Fred Rolland from comment #4)
> Print the step.StepID before executing it:
> 
> https://github.com/openshift/assisted-installer-agent/pull/51

Not sure how to verify the bug. What is correct failure scenario?
My cluster failed http://10.19.131.3:8080/clusters/58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be
But I did not find <install-****> step

Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-c88a45f8>, command: </usr/bin/podman>, args: <[stop -i -t 5 assisted-installer]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd

Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Executing step: <execute-49dd511d>, command: <podman>, args: <[run --rm --privileged --net=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log --env PULL_SECRET_TOKEN --name logs-sender --pid=host quay.io/ocpmetal/assisted-installer-agent:latest logs_sender -url http://10.19.131.3:8090 -cluster-id 58f01af4-bdf0-4dc5-ac05-fd5d8e6da9be -host-id aafd9871-ff57-4fe4-be81-695ea2b753d2 --insecure=false -bootstrap=false -with-installer-gather-logging=true]>" file="step_processor.go:78" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd

Dec 14 17:56:45 master-0-0 next_step_runne[2062]: time="14-12-2020 17:56:45" level=info msg="Sending step <execute-c88a45f8> reply output <> error <> exit-code <0>" file="step_processor.go:46" request_id=a83bb15c-44ab-4e8e-a4a1-0119aec2aadd

Dec 14 17:56:45 master-0-0 logs_sender[4333]: time="14-12-2020 17:56:45" level=info msg="Start gathering journalctl logs with tags [agent installer], services [] and installer-gather" file="send_logs.go:238"

Thanks

Comment 8 Omer Tuchfeld 2020-12-15 09:32:15 UTC
There is more than one PR relevant for resolving this ticket:
(1) https://github.com/openshift/assisted-installer-agent/pull/51
(2) https://github.com/openshift/assisted-installer-agent/pull/91/files

The first one you mentioned yourself. It adds the "Executing step:" log and it seems like from your logs that it appears correctly as expected.

The second PR makes it so when the *step itself* fails (exit code != 0), a "Step execution failed" log should occur, with the step's exitcode/stdout/stderr also included in the log.

I'll try and look into why exactly your cluster failed, but it might be for a reason other than a failed step, so it's not good enough for the test. In order to test PR #91 (the second one), specifically a step failure needs to occur.

Comment 9 Yuri Obshansky 2020-12-15 20:59:47 UTC
Ok, reproduced cluster failure 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31

We can see Executing step: <install-7e44a08d>
However it happened even error <> exit-code <0>
Not sure we expect this behaviour

Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Executing step: <install-7e44a08d>, command: <bash>, args: <[-c podman run -v /dev:/dev:rw -v /opt:/opt:rw -v /run/systemd/journal/socket:/run/systemd/journal/socket --privileged --pid=host --net=host -v /var/log:/var/log:rw --env PULL_SECRET_TOKEN --name assisted-installer quay.io/ocpmetal/assisted-installer:latest --role worker --cluster-id 96af0ccc-3e3e-41d5-bf65-f08e4e7c8e31 --boot-device /dev/sda --host-id ec8a2162-2927-4a16-b31d-cd00af78f93f --openshift-version 4.6 --mco-image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c8f0b9a2852b15b45487c08c158e10f3b803d7a77538d6dbc1d991994f58bcee --controller-image quay.io/ocpmetal/assisted-installer-controller:latest --url https://api.stage.openshift.com --insecure=false --agent-image quay.io/ocpmetal/assisted-installer-agent:latest]>" file="step_processor.go:78" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5
Dec 15 20:12:18 worker-0-1 next_step_runne[1933]: time="15-12-2020 20:12:18" level=info msg="Sending step <dhcp-lease-allocate-0e2ba009> reply output <{\"api_vip_address\":\"192.168.123.131\",\"api_vip_lease\":\"lease {\\n  interface \\\"api\\\";\\n  fixed-address 192.168.123.131;\\n  option subnet-mask 255.255.255.0;\\n  option routers 192.168.123.1;\\n  option dhcp-lease-time 3600;\\n  option dhcp-message-type 5;\\n  option domain-name-servers 192.168.123.1;\\n  option dhcp-server-identifier 192.168.123.1;\\n  option dhcp-renewal-time 1800;\\n  option broadcast-address 192.168.123.255;\\n  option dhcp-rebinding-time 3150;\\n  option host-name \\\"00-1a-4a-33-3c-6f-api\\\";\\n  option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n  renew 2 2020/12/15 20:40:11;\\n  rebind 2 2020/12/15 21:04:48;\\n  expire 2 2020/12/15 21:12:18;\\n}\",\"ingress_vip_address\":\"192.168.123.109\",\"ingress_vip_lease\":\"lease {\\n  interface \\\"ingress\\\";\\n  fixed-address 192.168.123.109;\\n  option subnet-mask 255.255.255.0;\\n  option routers 192.168.123.1;\\n  option dhcp-lease-time 3600;\\n  option dhcp-message-type 5;\\n  option domain-name-servers 192.168.123.1;\\n  option dhcp-server-identifier 192.168.123.1;\\n  option dhcp-renewal-time 1800;\\n  option broadcast-address 192.168.123.255;\\n  option dhcp-rebinding-time 3150;\\n  option host-name \\\"00-1a-4a-54-a6-00-ingress\\\";\\n  option domain-name \\\"ocp-cluster-sealusa47-0.qe.lab.redhat.com\\\";\\n  renew 2 2020/12/15 20:41:27;\\n  rebind 2 2020/12/15 21:04:48;\\n  expire 2 2020/12/15 21:12:18;\\n}\"}> error <> exit-code <0>" file="step_processor.go:46" request_id=fdbe18ce-0740-4ba9-adf9-5ae52fb0a9c5

Comment 10 Omer Tuchfeld 2020-12-15 21:11:03 UTC
I don't seem to have access to see the latest cluster you linked. 
However, I can see from the logs you posted that the step (the one that happens to be in the logs, there should be many other ones) has exit-code 0. This issue's PRs deal with adding logs to step failures specifically, not cluster failures. 
The cluster failure might be interesting to investigate, but I'm not sure that it's relevant to discuss it as part of this issue's comments.

Comment 11 Yuri Obshansky 2020-12-15 21:18:14 UTC
Created attachment 1739467 [details]
Installer log

Comment 12 Yuri Obshansky 2020-12-15 21:19:33 UTC
Added logs. 
Still do not know how to reproduce Step failure
Give me clear scenario
Thank you

Comment 13 Ronnie Lazar 2020-12-16 16:52:25 UTC
yobshans I think that this bug is not really worth the testing effort that is required from you

Comment 14 Yuri Obshansky 2020-12-16 17:01:06 UTC
(In reply to Ronnie Lazar from comment #13)
> yobshans I think that this bug is not really worth the testing
> effort that is required from you

Agreed, let's close it


Note You need to log in before you can comment on or make changes to this bug.