Bug 1586171 - [Update] Update of OpenStack and OpenDaylight fails
Summary: [Update] Update of OpenStack and OpenDaylight fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: z1
: 13.0 (Queens)
Assignee: Tim Rozet
QA Contact: Noam Manos
URL:
Whiteboard: Update
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-06-05 16:02 UTC by Itzik Brown
Modified: 2022-07-09 10:01 UTC (History)
14 users (show)

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-40.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, overcloud updates failed due to stale cache in OpenDaylight. With this update, OpenDaylight is stopped and the stale cache is removed before upgrading to a new version. Level 1 updates work with OpenDaylight deployments. Level 2 updates are currently unsupported.
Clone Of:
Environment:
N/A
Last Closed: 2018-07-19 14:27:12 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Karaf log with FileNotFoundExceptionetc/opendaylight/karaf/*.xml (290.15 KB, application/x-gzip)
2018-06-07 11:35 UTC, Michael Vorburger
no flags Details
Level 2 update logs with cache diff and odl log (475.79 KB, application/x-gzip)
2018-06-07 21:51 UTC, Tim Rozet
no flags Details
ansible logs for updating overcloud (985.86 KB, application/x-gzip)
2018-06-07 22:54 UTC, Tim Rozet
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1775919 0 None None None 2018-06-08 19:08:53 UTC
OpenStack gerrit 573720 0 None master: MERGED tripleo-heat-templates: Fix update/upgrade or config change for ODL (Ia457b90b765617822e9adbf07485c9ea1fe179e5) 2018-06-29 17:29:22 UTC
OpenStack gerrit 574039 0 None stable/queens: MERGED tripleo-heat-templates: Fix update/upgrade or config change for ODL (Ia457b90b765617822e9adbf07485c9ea1fe179e5) 2018-06-29 17:29:16 UTC
OpenStack gerrit 575111 0 None master: MERGED tripleo-heat-templates: Set ODL API restart policy to unless-stopped (Id033cbeef46bded13f4fd6c06eb3a244634f0e7b) 2018-06-29 17:29:10 UTC
OpenStack gerrit 578555 0 None stable/queens: MERGED tripleo-heat-templates: Set ODL API restart policy to unless-stopped (Id033cbeef46bded13f4fd6c06eb3a244634f0e7b) 2018-06-29 17:29:05 UTC
Red Hat Issue Tracker ODL-254 0 None None None 2021-12-10 16:23:35 UTC
Red Hat Issue Tracker OSP-11412 0 None None None 2021-12-10 16:23:38 UTC
Red Hat Product Errata RHSA-2018:2214 0 None None None 2018-07-19 14:28:01 UTC

Description Itzik Brown 2018-06-05 16:02:18 UTC
Description of problem:
During Controller update the ODL controller is not loading any features.
karaf@root()> feature:list                                                                                                                                                                                         
Name | Version | Required | State | Repository | Description
-----+---------+----------+-------+------------+------------
No features available
karaf@root()>   

The log:
Jun 05 13:53:22 controller-1 dockerd-current[820801]: 2018-06-05T12:53:22,965 | INFO  | paxweb-extender-1-thread-1 | ServletContainerInitializerScanner | 438 - org.ops4j.pax.web.pax-web-api - 6.0.7 | will add org.eclipse.jetty.websocket.jsr356.server.deploy.WebSocketServerContainerInitializer to ServletContainerInitializers
Jun 05 13:53:23 controller-1 dockerd-current[820801]: [311B blob data]
Jun 05 13:53:23 controller-1 dockerd-current[820801]: java.io.IOException: /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml (No such file or directory) : file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:91) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:80) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.apache.karaf.features.internal.service.FeaturesServiceImpl.getFeatures(FeaturesServiceImpl.java:780) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.apache.karaf.features.internal.service.FeaturesServiceImpl.listInstalledFeatures(FeaturesServiceImpl.java:822) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.opendaylight.controller.configpusherfeature.internal.FeatureServiceCustomizer.addingService(FeatureServiceCustomizer.java:47) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.opendaylight.controller.configpusherfeature.internal.FeatureServiceCustomizer.addingService(FeatureServiceCustomizer.java:27) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.AbstractTracked.trackInitial(AbstractTracked.java:183) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:318) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:261) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.opendaylight.controller.configpusherfeature.internal.ConfigPusherCustomizer.addingService(ConfigPusherCustomizer.java:32) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.opendaylight.controller.configpusherfeature.internal.ConfigPusherCustomizer.addingService(ConfigPusherCustomizer.java:19) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:229) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:901) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.serviceregistry.FilteredServiceListener.serviceChanged(FilteredServiceListener.java:109) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.framework.BundleContextImpl.dispatchEvent(BundleContextImpl.java:915) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:230) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.framework.eventmgr.ListenerQueue.dispatchEventSynchronous(ListenerQueue.java:148) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.publishServiceEventPrivileged(ServiceRegistry.java:862) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.publishServiceEvent(ServiceRegistry.java:801) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.register(ServiceRegistrationImpl.java:127) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.registerService(ServiceRegistry.java:225) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.framework.BundleContextImpl.registerService(BundleContextImpl.java:464) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.eclipse.osgi.internal.framework.BundleContextImpl.registerService(BundleContextImpl.java:482) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.opendaylight.controller.config.persist.impl.osgi.ConfigPersisterActivator.lambda$startPusherThread$0(ConfigPersisterActivator.java:128) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.lang.Thread.run(Thread.java:748) [?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]: Caused by: java.io.FileNotFoundException: /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml (No such file or directory)
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.io.FileInputStream.open0(Native Method) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.io.FileInputStream.<init>(FileInputStream.java:93) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at java.net.URL.openStream(URL.java:1045) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:86) ~[?:?]
Jun 05 13:53:23 controller-1 dockerd-current[820801]:         ... 30 more


Version-Release number of selected component (if applicable):
8.0.0-11

How reproducible:


Steps to Reproduce:
1. Follow the OpenStack update procedure.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Itzik Brown 2018-06-05 16:03:24 UTC
According to Tim it's because the UUID in file /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml is hard coded.

Comment 2 Tim Rozet 2018-06-05 16:14:17 UTC
The problem is the random UUID changes between ODL RPMs and the path is hardcoded into the features boot.  This needs to be a consistent file name and not use a UUID.

Comment 3 Stephen Kitt 2018-06-05 16:55:49 UTC
(In reply to Tim Rozet from comment #2)
> The problem is the random UUID changes between ODL RPMs and the path is
> hardcoded into the features boot.  This needs to be a consistent file name
> and not use a UUID.

I’m not sure using a consistent file name is the right approach. The existence of this bug suggests that the following is happening:

* we install opendaylight <old version>
* we run it, so it populates its caches, datastore etc.
* we stop it (I hope)
* we upgrade it to opendaylight <new version>
* we restart it, which fails because it reloads its bundles from its caches

The RPM is always internally consistent, so I think the appropriate fix is to clear data/cache on upgrade.

Comment 5 Tim Rozet 2018-06-05 19:05:04 UTC
(In reply to Stephen Kitt from comment #3)
> (In reply to Tim Rozet from comment #2)
> > The problem is the random UUID changes between ODL RPMs and the path is
> > hardcoded into the features boot.  This needs to be a consistent file name
> > and not use a UUID.
> 
> I’m not sure using a consistent file name is the right approach. The

What is the value in having a randomly named file between each RPM?  Seems like this should be fixed anyway if there is no advantage here.

> existence of this bug suggests that the following is happening:
> 
> * we install opendaylight <old version>
> * we run it, so it populates its caches, datastore etc.
> * we stop it (I hope)
> * we upgrade it to opendaylight <new version>
> * we restart it, which fails because it reloads its bundles from its caches
> 
> The RPM is always internally consistent, so I think the appropriate fix is
> to clear data/cache on upgrade.

That may be a better approach in case there is other stale cache from the previous ODL.  If that is the case then we need to change this to be THT component.

Comment 9 Michael Vorburger 2018-06-06 13:21:57 UTC
>> The RPM is always internally consistent, so I think the appropriate fix is
>> to clear data/cache on upgrade.

> That may be a better approach in case there is other stale cache from the 
> previous ODL.  If that is the case then we need to change this to be THT 
> component.

that semems like a good idea to me.  In fact, I'm wondering if a level 1 upgrade should even wipe more than only data/cache; but also e.g. data/generated-bundles/ (which is empty for me) and data/tmp/ ?  Or just "rm -rf data/" should actually be safe; because journal/ and snapshots/ directories are at the Karaf root and not inside data/ anyway.

What surprises me is that we just "overlay" a new version; I thought originally the idea was that we would move (backup) the entire old version, remove it entirely, lay down the new one, and cp only journal/ and snapshots/ directories back from old to new.  Because otherwise, in theory, there could be "left over" files from old which we don't want in new.

Comment 10 Tim Rozet 2018-06-06 16:43:36 UTC
Deleting all of data is not a good idea imo.  If you do that you are removing potential data/logs/ directory when we log to file.  Also it is removing the user/password idm db.  Let's just remove only what is necessary (data/cache) and see if it fixes the problem.

Comment 11 Stephen Kitt 2018-06-07 08:10:50 UTC
(In reply to Tim Rozet from comment #5)
> (In reply to Stephen Kitt from comment #3)
> > (In reply to Tim Rozet from comment #2)
> > > The problem is the random UUID changes between ODL RPMs and the path is
> > > hardcoded into the features boot.  This needs to be a consistent file name
> > > and not use a UUID.
> > 
> > I’m not sure using a consistent file name is the right approach. The
> 
> What is the value in having a randomly named file between each RPM?  Seems
> like this should be fixed anyway if there is no advantage here.

Perhaps we should clone this as a wishlist bug against opendaylight then... There’s no advantage AFAICT, at least for us; but the way Karaf 4 is built means that that’s what we currently end up with. I’ll see if we can return to the Karaf 3 style or at least produce a consistently-named feature file.

Comment 12 Michael Vorburger 2018-06-07 11:35:13 UTC
Created attachment 1448691 [details]
Karaf log with FileNotFoundExceptionetc/opendaylight/karaf/*.xml

> Deleting all of data is not a good idea imo.

but deleting only data/cache on upgrade leads to this error seen in attached log.txt :

2018-06-07T08:51:04,057 | ERROR | ConfigFeatureListener - ConfigPusher | AbstractFeatureWrapper           | 50 - config-persister-feature-adapter - 0.8.0.redhat-10 | Could not parse XML file /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml
java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)

I'm trying to figure out the exact sequence of events here, and how to best proceed.

Comment 13 Michael Vorburger 2018-06-07 12:00:19 UTC
I have attempted to reproduce this locally with the following steps which mimics what I think happens here during an upgrade, but surprisingly this actually does works perfectly…. So what is the upgrade doing differently from this:

cd netvirt/karaf
mvn -o -Pq clean package

http://pkgs.devel.redhat.com/cgit/rpms/opendaylight/tree/opendaylight.spec.tmpl?h=rhos-13.0-rhel-7#n73 
cp target/assembly/etc/org.apache.karaf.features.cfg{,.orig}
sed -i '/^featuresBoot *=/ s/$/,odl-netvirt-openstack/' target/assembly/etc/org.apache.karaf.features.cfg
diff target/assembly/etc/org.apache.karaf.features.cfg.orig target/assembly/etc/org.apache.karaf.features.cfg

cp -R target/assembly target/assembly.before-upgrade
cp -R target/assembly target/assembly.after-upgrade

./target/assembly.before-upgrade/bin/karaf
diagstatus:showSvcStatus
diagstatus:showSvcStatus
diagstatus:showSvcStatus
shutdown

https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/opendaylight-api.yaml#L139 
cp -R target/assembly.before-upgrade/journal target/assembly.after-upgrade/journal
cp -R target/assembly.before-upgrade/snapshots target/assembly.after-upgrade/snapshots
cp -R target/assembly.before-upgrade/data target/assembly.after-upgrade/data
rm -rf target/assembly.after-upgrade/data/cache

./target/assembly.after-upgrade/bin/karaf
diagstatus:showSvcStatus
diagstatus:showSvcStatus
diagstatus:showSvcStatus
shutdown
ls target/assembly.after-upgrade/etc/opendaylight/karaf/

Comment 14 Michael Vorburger 2018-06-07 14:41:20 UTC
> but surprisingly this actually does work

I made a small mistake, here’s how above, using branch stable/oxygen (which probably makes no difference) and fixed cp commands (without trailing directory repeated; duh) *DOES* reproduce those exact FileNotFoundException: .../etc/opendaylight/… problems:

cd netvirt/karaf
git checkout stable/oxygen
mvn -o -Pq clean package

cp target/assembly/etc/org.apache.karaf.features.cfg{,.orig}
sed -i '/^featuresBoot *=/ s/$/,odl-netvirt-openstack/' target/assembly/etc/org.apache.karaf.features.cfg
diff target/assembly/etc/org.apache.karaf.features.cfg.orig target/assembly/etc/org.apache.karaf.features.cfg

cp -R target/assembly target/assembly.before-upgrade
./target/assembly.before-upgrade/bin/karaf
diagstatus:showSvcStatus
diagstatus:showSvcStatus
diagstatus:showSvcStatus
shutdown


rm -rf target/assembly.after-upgrade
cp -R target/assembly target/assembly.after-upgrade
cp -R target/assembly.before-upgrade/journal target/assembly.after-upgrade/
cp -R target/assembly.before-upgrade/snapshots target/assembly.after-upgrade/
cp -R target/assembly.before-upgrade/data target/assembly.after-upgrade/
./target/assembly.after-upgrade/bin/karaf
tail -f target/assembly.after-upgrade/data/log/karaf.log

NB: This is if we DO NOT rm -rf data/cache! Now if you redo above but also:

rm -rf target/assembly.after-upgrade/data/cache
./target/assembly.after-upgrade/bin/karaf

Then it works just fine. Conclusion: I think this is just a big confusion - above clearly proves that the problem in attached log is from an environment where, for some reason, data/cache was simple NOT deleted.

Comment 15 Tim Rozet 2018-06-07 21:48:13 UTC
I ran a successful deploy and then a level 2 update. This update will remove data, snapshots, and journal directory.  The update failed because ODL failed when we check to ensure the ODL rest is responding in deployment.  There were multiple exceptions, before and after the update, including a run time exception with pax web service.  Additionally the original error of missing the random uuid file was not present (i.e. file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml).  However the FileNotFoundException did show up for multiple karaf/ and datastore/ files:

Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/06-clustered-entity-ownership.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/legacy-openflow-connection-config.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/05-clustering.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/06-clustered-entity-ownership.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/genius-itm-config.xml (No such file or directory)
Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/genius-ifm-config.xml (No such file or directory)



I could validate that the update did indeed remove all 3 directories in the ansible log.  I also saved the previous cache of pre-update so that I could diff them.  I'm fairly confident the directory was removed during update as the original OSGI log is not there post update.  You can also see in the cache diff that the feature repositories UUID is different:

>   "org.apache.felix.fileinstall.ca43cb45-1f64-4ddd-ad7e-dd6ada173ce1", \
diff -r /tmp/odl_cache/org.eclipse.osgi/5/data/config/org/apache/karaf/features.config /var/lib/opendaylight/data/cache/org.eclipse.osgi/5/data/config/org/apache/karaf/features.config
3c3
< featuresRepositories="file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml"
---
> featuresRepositories="file:/opt/opendaylight/etc/d91733ba-0875-4c67-9040-3976a0562592.xml"

Going to attach the full cache diff and ODL log.

Pax runtime exception:
Jun 07 21:32:04 controller-0 dockerd-current[19266]: 2018-06-07T20:32:04,571 | WARN  | paxweb-extender-1-thread-1 | ServletContainerInitializerScanner | 438 - org.ops4j.pax.web.pax-web-api - 6.0
Jun 07 21:32:04 controller-0 dockerd-current[19266]: java.lang.RuntimeException: null
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.objectweb.asm.MethodVisitor.visitParameter(Unknown Source) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.objectweb.asm.ClassReader.b(Unknown Source) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.objectweb.asm.ClassReader.accept(Unknown Source) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.objectweb.asm.ClassReader.accept(Unknown Source) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.finder.AbstractFinder.readClassDef(AbstractFinder.java:581) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.finder.BundleAnnotationFinder$AnnotationFindingCallback.foundInDirectory(BundleAnnotationFinder.java:86) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.scanDirectory(BundleResourceFinder.java:151) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.scanBundleClassPath(BundleResourceFinder.java:120) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.find(BundleResourceFinder.java:88) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:56) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:50) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:46) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.utils.ServletContainerInitializerScanner.scanBundles(ServletContainerInitializerScanner.java:107) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.jetty.internal.JettyServerWrapper.addContext(JettyServerWrapper.java:284) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.jetty.internal.JettyServerWrapper.getOrCreateContext(JettyServerWrapper.java:209) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.jetty.internal.JettyServerImpl.getContext(JettyServerImpl.java:289) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.jetty.internal.ServerControllerImpl$Started.getContext(ServerControllerImpl.java:346) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.jetty.internal.ServerControllerImpl.getContext(ServerControllerImpl.java:175) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.internal.HttpServiceStarted.begin(HttpServiceStarted.java:1187) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.service.internal.HttpServiceProxy.begin(HttpServiceProxy.java:440) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.RegisterWebAppVisitorWC.visit(RegisterWebAppVisitorWC.java:176) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.model.WebApp.accept(WebApp.java:619) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.register(WebAppPublisher.java:228) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(WebAppPublisher.java:173) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(WebAppPublisher.java:129) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.AbstractTracked.trackInitial(AbstractTracked.java:183) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:318) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:261) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebAppPublisher.publish(WebAppPublisher.java:98) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebObserver.deploy(WebObserver.java:217) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.WebObserver$1.doStart(WebObserver.java:172) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.extender.SimpleExtension.start(SimpleExtension.java:59) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at org.ops4j.pax.web.extender.war.internal.extender.AbstractExtender.lambda$createExtension$0(AbstractExtender.java:277) ~[?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
Jun 07 21:32:04 controller-0 dockerd-current[19266]:         at java.lang.Thread.run(Thread.java:748) [?:?]

Missing File exception:
Jun 07 21:32:05 controller-0 dockerd-current[19266]: 2018-06-07T20:32:05,086 | ERROR | ConfigFeatureListener - ConfigPusher | AbstractFeatureWrapper           | 50 - config-persister-feature-ada
Jun 07 21:32:05 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at java.io.FileInputStream.open0(Native Method) ~[?:?]
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at java.io.FileInputStream.open(FileInputStream.java:195) [?:?]
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at java.io.FileInputStream.<init>(FileInputStream.java:138) [?:?]
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.AbstractFeatureWrapper.isConfigSnapshot(AbstractFeatureWrapper.java:116)
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.AbstractFeatureWrapper.getFeatureConfigSnapshotHolders(AbstractFeatureWra
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfig(FeatureConfigPusher.java:101) [50:config-p
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfigs(FeatureConfigPusher.java:69) [50:config-p
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.processFeatureEvent(ConfigPushingRunnable.java:70)
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.run(ConfigPushingRunnable.java:52) [50:config-persi
Jun 07 21:32:05 controller-0 dockerd-current[19266]:         at java.lang.Thread.run(Thread.java:748) [?:?]
Jun 07 21:32:05 controller-0 dockerd-current[19266]: [478B blob data]

Comment 16 Tim Rozet 2018-06-07 21:51:07 UTC
Created attachment 1448897 [details]
Level 2 update logs with cache diff and odl log

Comment 17 Tim Rozet 2018-06-07 22:37:06 UTC
As another test, I removed the cache on this opendaylight container and restarted it.  After that the container started fine and did not show the previous errors.  Need to investigate further...

Comment 18 Tim Rozet 2018-06-07 22:54:48 UTC
Created attachment 1448904 [details]
ansible logs for updating overcloud

Comment 19 Tim Rozet 2018-06-08 02:21:58 UTC
Looking at the log what is surprising is that ODL container started 3 times:
Jun 07 17:22:50 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs
Jun 07 21:09:25 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs
Jun 07 21:31:38 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs

I would expect that the container would start twice (once during first deployment, and then a second time during update).  Right before the 2nd container is started I can see that there are exceptions thrown about reading the cache:
Jun 07 20:49:34 controller-0 dockerd-current[19266]: !ENTRY io.netty.handler 4 0 2018-06-07 19:49:34.373
Jun 07 20:49:34 controller-0 dockerd-current[19266]: !MESSAGE FrameworkEvent ERROR
Jun 07 20:49:34 controller-0 dockerd-current[19266]: !STACK 0
Jun 07 20:49:34 controller-0 dockerd-current[19266]: java.io.IOException: Exception in opening zip file: /opt/opendaylight/data/cache/org.eclipse.osgi/61/0/bundleFile

This tells me that the directories were being removed while the original deployment container was still up.  Note, I also see before this that there are other exceptions around communicating with the other ODLs and NoShardLeaderException.  Do the updates happen in a rolling fashion (1 controller at a time)?

Still not sure what the root cause is.  One possible workaround until we know the root cause is in post update tasks rm the cache and restart the container.

Comment 20 Janki 2018-06-08 06:20:33 UTC
Yes, updates happen 1 controller at a time. Update and post update steps are ansible books executed 1 node at a time.

Comment 21 Janki 2018-06-08 06:54:41 UTC
I suspect that data/cache is deleted but then again repopulated becasue it is mounted and container is still up during this time. And during deploy_step when container is started, new data/cache info from container to appended to data/cache on host. But then data/cache files should have both pre and post update info, which I dont think is happening.

1 other option we can try is, explicitly deleting containers before deleting data/cache in update_tasks and then let pauch start a new container (like all pacemaker managed services are doing).

Comment 22 Tim Rozet 2018-06-08 14:38:07 UTC
Posted a potential fix.  I think there is some corruption happening when the cache folder is removed while the original container is up.  The option of restarting the container and removing cache post update will not work, because the opendaylight_ovs service relies on ODL responding in step 4 of the deployment steps, which will happen outside of update_tasks or post_update_tasks.  Therefore the potential fix is to stop the odl container before we remove the files.  Testing it out...

Comment 23 Tim Rozet 2018-06-08 18:29:49 UTC
So the fix didn't work, but it produced some more information. In comment 19 I asked why I saw extra container restart.  This is a bug, tracked here:
https://bugzilla.redhat.com/show_bug.cgi?id=1589346

If we then look at the log of the container with my previous fix, we can see that the corruption did not occur in the originally updated container (16:54).  It only happens with the extra container restart (17:42).

Jun 08 13:25:01 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs
Jun 08 16:54:41 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs
Jun 08 17:41:42 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs
Jun 08 17:42:06 controller-1 dockerd-current[19336]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory)

From talking with Jiri, he thinks there is an extra deployment happening in this order:

<jistr> so what i *think* right now, with the double-execution issue, it probably looks like:
<jistr> 1) update_tasks
<jistr> 2) deploy tasks (puppet, containers)
<jistr> 3) host_prep_tasks
<jistr> 4) deploy tasks again

If this is the case, then we can workaround this extra deployment by removing the cache in the host prep tasks as well.  I'm going to try that out now.

Comment 24 Tim Rozet 2018-06-08 19:03:31 UTC
Now I realize based on Jiri's theory, this is exactly what is happening:
1) With the patch update works fine.
2) With the extra deployment, cache is mounted a new ODL container is brought up without karaf/ files.  This triggers the problem.

Note, now this is much bigger than update/upgrade.  It has the implication that even a configuration change to a deployment will result in this error.  The solution I think is still the same (also remove cache in host_prep_tasks) until we can figure out a better way to mount all the karaf/ and other files and then remove them in update/upgrade.

Comment 25 Jiri Stransky 2018-06-08 20:18:47 UTC
If we'd have the issue on config change too (just recreating the ODL API container without a real minor update), then fixing bug 1589346 wouldn't fully solve the problem. I'm mentioning it mainly because i'd like to avoid rushing a fix for that bug if possible, as the fix means quite a significant change with possible side effects which may be hard to predict, and would likely prompt a lot of updates retesting.

As for how to always clear the cache when starting the ODL API container, i've commented with some possibilities that come to my mind, at https://review.openstack.org/#/c/573720/2/docker/services/opendaylight-api.yaml

Comment 26 Tim Rozet 2018-06-09 01:52:23 UTC
I agree Jiri, no reason to rush bug 1589346.  The patch I posted works, and we never really needed to stop the ODL container (other than to avoid massive logs of exceptions spewing out when the current ODL fails to read the deleted cache).  I commented in the gerrit but really what we want to do is mount the missing files on the host that the cache looks for when we do a redeployment/config update.  That way the cache will speed up the ODL boot process.  *However*, on an update we want to remove those files *and* the cache and then let the new ODL generate its own new config files.

The fix I proposed achieves that for update/upgrade, but it also removes cache on redeploy/config update.  This isn't ideal, but it works and we can optimize it in a future patch if you are OK with that.

Comment 33 Itzik Brown 2018-06-12 09:33:50 UTC
It fails with the patch. We are looking for the cause.

Comment 34 Tim Rozet 2018-06-14 05:20:49 UTC
The problem was we were moving a cache before update, but there was a corner case where dockerd would be restarted between the cache delete and the update of ODL.  That would cause  a new stale cache to be introduced and update to fail.  The fix is to move the docker restart policy on the ODL container to be "unless-stopped" which means that docker will not restart this container if it was manually stopped.

Since this problem will be resolved in the undercloud and this is the first supported release of ODL.  It is safe to re-target this fix into Z stream.

Comment 49 Janki 2018-07-17 04:25:51 UTC
Minor update works fine with these patches.
1. cache is removed before update which solves the file UUID issue.
2. Setting ODL container policy solves restarting container issue.

Comment 51 errata-xmlrpc 2018-07-19 14:27:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2214


Note You need to log in before you can comment on or make changes to this bug.