Bug 1371274
| Summary: | Osp-Director-10 : Undercloud upgrade OSP9 -> OSP10 fails (Error: /etc/systemd/system.conf:#DefaultTimeoutStartSec=90s too slow? ) | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Omri Hochman <ohochman> | ||||
| Component: | rhosp-director | Assignee: | Angus Thomas <athomas> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Omri Hochman <ohochman> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 10.0 (Newton) | CC: | augol, dbecker, jcoufal, mandreou, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, sathlang | ||||
| Target Milestone: | --- | Keywords: | Triaged | ||||
| Target Release: | 10.0 (Newton) | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-10-14 15:48:50 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1337794 | ||||||
| Attachments: |
|
||||||
|
Description
Omri Hochman
2016-08-29 18:24:44 UTC
Created attachment 1195461 [details]
Full-Nova-Compute.log
The error was : Error: Duplicate declaration: Ironic_config[DEFAULT/my_ip] is already declared in file /etc/puppet/modules/ironic/manifests/init.pp:417; cannot redeclare at /tmp/tmp.M48wkq881P/puppet-stack- (01:48:34 PM) ohochman: undercloud72 Hi, When logged into the server the nova-compute service was running fine. Then, running /usr/share/instack-undercloud/puppet-stack-config/os-refresh-config/configure.d/50-puppet-stack-config a second time on the server worked well. Here the timeline of the events: - 2016-08-23 23:44:00.437: - restart of the nova-compute with the wrong value for scheduler_host_manager - puppet has not changed it yet but the package is updated | - 2016-08-24 00:12:03.283 5383 ERROR nova - last message from nova before restart - 2016-08-24 00:12:03 - systemctl gives up on restarting nova-compute, and puppet fails accordingly - 2016-08-24 00:12:08.525 - nova-compute starting successfuly (this is 5s too late !) So this really looks like a timeout issue. By default this is 90s, if I interpret this line correctly: /etc/systemd/system.conf:#DefaultTimeoutStartSec=90s For the error mentionned in the title. It happens at 23:15 and could be related to the packages being updated but not completely. This error disappeared later when nova-compute get updated. The timeline for this one is : 2016-08-23 23:15:24.260 6256 CRITICAL -> first error 2016-08-23 23:15:40.875 6391 CRITICAL -> last error 2016-08-23 23:15:45.369 -> restart of nova-compute no more error, but the ConfigFileValueError seen above. This prove that nova-compute is updated, match the ServiceList version, but as puppet has not run yet the value of scheduler_host_manager is wrong. All in all: 1. the error in the title is a transient error due to package being updated, it disappears when nova-compute is updated (this is the yum upgrade -y step) 2. the nova-compute service takes more than 90s to restart, but restart. We should attempt to increase the DefaultTimeoutStartSec=90s using crudini --set /etc/systemd/system.conf Manager DefaultTimeoutStartSec 180s Let me know, if using this workaround, you still have the error, then we will have another look at what takes nova-compute so long to restart. Maybe #DefaultTimeoutStopSec=90s should be increased also. I cannot infer which one was slowest and made the restart failed. Maybe a specialist of systemd could help to debug here. As a side note, to help debugging, you cloud modify this this file:
/usr/share/instack-undercloud/puppet-stack-config/os-refresh-config/configure.d/50-puppet-stack-config
19c19
< puppet_apply puppet apply --detailed-exitcodes /etc/puppet/manifests/puppet-stack-config.pp
---
> puppet_apply puppet apply --debug --detailed-exitcodes /etc/puppet/manifests/puppet-stack-config.pp
after "yum upgrade -y", and before "openstack undercloud upgrade"
We should see the systemctl restart command used by puppet and at what time and the time it takes.
The workaround seems to be valid: Execute before upgrade: ------------------------ sudo crudini --set /etc/systemd/system.conf Manager DefaultTimeoutStartSec 180s sudo crudini --set /etc/systemd/system.conf Manager DefaultTimeoutStopSec 180s Result: ############################################################################# Undercloud install complete. The file containing this installation's passwords is at /home/stack/undercloud-passwords.conf. There is also a stackrc file at /home/stack/stackrc. These files are needed to interact with the OpenStack services, and should be secured. ############################################################################# After re-run the all test from the beginning and applying the workaround , the undercloud upgrade still fails at the first attempt, then it succeeds after running 'openstack undercloud upgrade' for the second time. Hi,
digging more into the log we can see that:
2016-08-31 14:10:27.515 11426 ERROR oslo.messaging._drivers.impl_rabbit [-] [94a8d6a7-0238-4408-9ad3-c352015e5f21] AMQP server on 192.0.2.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 6 seconds
. Client port: None
2016-08-31 14:10:33.883 11426 INFO oslo.messaging._drivers.impl_rabbit [-] [94a8d6a7-0238-4408-9ad3-c352015e5f21] Reconnected to AMQP server on 192.0.2.1:5672 via [amqp] clientwith port 41228.
2016-08-31 14:20:07.279 11426 WARNING nova.conductor.api [req-438297e9-4d36-49c8-855e-fc329f13d1f3 - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...
2016-08-31 14:20:07.289 11426 DEBUG oslo_messaging._drivers.amqpdriver [req-438297e9-4d36-49c8-855e-fc329f13d1f3 - - - - -] CALL msg_id: 0cfa0ef8c553448d9163fea06c927c75 exchange 'nova' topic 'conductor' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:448
2016-08-31 14:20:07.292 11426 ERROR oslo.messaging._drivers.impl_rabbit [req-438297e9-4d36-49c8-855e-fc329f13d1f3 - - - - -] [a2241e20-b695-4f38-9855-ba31addec64f] AMQP server on 192.0.2.1:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds. Client port: 40514
2016-08-31 14:20:08.314 11426 INFO oslo.messaging._drivers.impl_rabbit [req-438297e9-4d36-49c8-855e-fc329f13d1f3 - - - - -] [a2241e20-b695-4f38-9855-ba31addec64f] Reconnected
Well in case formating don't work:
14:10:33 -> message of nova-compute
14:20:07 ->
WARNING nova.conductor.api ... Timed out waiting for nova-conductor.
Is it running? Or did this service start before nova-conductor?
Reattempting establishment of nova-conductor connection...
and then it works fine.
Meanwhile the nova-conductor has restarted just before and was fine.
Since we are now adding the service stop before upgrading the undercloud, as per https://bugzilla.redhat.com/show_bug.cgi?id=1372040#c6 - is this still a valid bug? I suspect you may not hit this if you add the service stop first? I am going to go ahead and close this as not a bug for now (assuming comment #10 is true), omri please reopen if you disagree/are still hitting this, thanks closed, no need for needinfo. |