Description of problem: During Controller update the ODL controller is not loading any features. karaf@root()> feature:list Name | Version | Required | State | Repository | Description -----+---------+----------+-------+------------+------------ No features available karaf@root()> The log: Jun 05 13:53:22 controller-1 dockerd-current[820801]: 2018-06-05T12:53:22,965 | INFO | paxweb-extender-1-thread-1 | ServletContainerInitializerScanner | 438 - org.ops4j.pax.web.pax-web-api - 6.0.7 | will add org.eclipse.jetty.websocket.jsr356.server.deploy.WebSocketServerContainerInitializer to ServletContainerInitializers Jun 05 13:53:23 controller-1 dockerd-current[820801]: [311B blob data] Jun 05 13:53:23 controller-1 dockerd-current[820801]: java.io.IOException: /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml (No such file or directory) : file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:91) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:80) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.apache.karaf.features.internal.service.FeaturesServiceImpl.getFeatures(FeaturesServiceImpl.java:780) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.apache.karaf.features.internal.service.FeaturesServiceImpl.listInstalledFeatures(FeaturesServiceImpl.java:822) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.opendaylight.controller.configpusherfeature.internal.FeatureServiceCustomizer.addingService(FeatureServiceCustomizer.java:47) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.opendaylight.controller.configpusherfeature.internal.FeatureServiceCustomizer.addingService(FeatureServiceCustomizer.java:27) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.AbstractTracked.trackInitial(AbstractTracked.java:183) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:318) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:261) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.opendaylight.controller.configpusherfeature.internal.ConfigPusherCustomizer.addingService(ConfigPusherCustomizer.java:32) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.opendaylight.controller.configpusherfeature.internal.ConfigPusherCustomizer.addingService(ConfigPusherCustomizer.java:19) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:229) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:901) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.serviceregistry.FilteredServiceListener.serviceChanged(FilteredServiceListener.java:109) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.framework.BundleContextImpl.dispatchEvent(BundleContextImpl.java:915) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:230) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.framework.eventmgr.ListenerQueue.dispatchEventSynchronous(ListenerQueue.java:148) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.publishServiceEventPrivileged(ServiceRegistry.java:862) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.publishServiceEvent(ServiceRegistry.java:801) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.register(ServiceRegistrationImpl.java:127) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.registerService(ServiceRegistry.java:225) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.framework.BundleContextImpl.registerService(BundleContextImpl.java:464) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.eclipse.osgi.internal.framework.BundleContextImpl.registerService(BundleContextImpl.java:482) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.opendaylight.controller.config.persist.impl.osgi.ConfigPersisterActivator.lambda$startPusherThread$0(ConfigPersisterActivator.java:128) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.lang.Thread.run(Thread.java:748) [?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: Caused by: java.io.FileNotFoundException: /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml (No such file or directory) Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.io.FileInputStream.open0(Native Method) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.io.FileInputStream.<init>(FileInputStream.java:93) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at java.net.URL.openStream(URL.java:1045) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: at org.apache.karaf.features.internal.service.RepositoryImpl.load(RepositoryImpl.java:86) ~[?:?] Jun 05 13:53:23 controller-1 dockerd-current[820801]: ... 30 more Version-Release number of selected component (if applicable): 8.0.0-11 How reproducible: Steps to Reproduce: 1. Follow the OpenStack update procedure. 2. 3. Actual results: Expected results: Additional info:
According to Tim it's because the UUID in file /opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml is hard coded.
The problem is the random UUID changes between ODL RPMs and the path is hardcoded into the features boot. This needs to be a consistent file name and not use a UUID.
(In reply to Tim Rozet from comment #2) > The problem is the random UUID changes between ODL RPMs and the path is > hardcoded into the features boot. This needs to be a consistent file name > and not use a UUID. I’m not sure using a consistent file name is the right approach. The existence of this bug suggests that the following is happening: * we install opendaylight <old version> * we run it, so it populates its caches, datastore etc. * we stop it (I hope) * we upgrade it to opendaylight <new version> * we restart it, which fails because it reloads its bundles from its caches The RPM is always internally consistent, so I think the appropriate fix is to clear data/cache on upgrade.
(In reply to Stephen Kitt from comment #3) > (In reply to Tim Rozet from comment #2) > > The problem is the random UUID changes between ODL RPMs and the path is > > hardcoded into the features boot. This needs to be a consistent file name > > and not use a UUID. > > I’m not sure using a consistent file name is the right approach. The What is the value in having a randomly named file between each RPM? Seems like this should be fixed anyway if there is no advantage here. > existence of this bug suggests that the following is happening: > > * we install opendaylight <old version> > * we run it, so it populates its caches, datastore etc. > * we stop it (I hope) > * we upgrade it to opendaylight <new version> > * we restart it, which fails because it reloads its bundles from its caches > > The RPM is always internally consistent, so I think the appropriate fix is > to clear data/cache on upgrade. That may be a better approach in case there is other stale cache from the previous ODL. If that is the case then we need to change this to be THT component.
>> The RPM is always internally consistent, so I think the appropriate fix is >> to clear data/cache on upgrade. > That may be a better approach in case there is other stale cache from the > previous ODL. If that is the case then we need to change this to be THT > component. that semems like a good idea to me. In fact, I'm wondering if a level 1 upgrade should even wipe more than only data/cache; but also e.g. data/generated-bundles/ (which is empty for me) and data/tmp/ ? Or just "rm -rf data/" should actually be safe; because journal/ and snapshots/ directories are at the Karaf root and not inside data/ anyway. What surprises me is that we just "overlay" a new version; I thought originally the idea was that we would move (backup) the entire old version, remove it entirely, lay down the new one, and cp only journal/ and snapshots/ directories back from old to new. Because otherwise, in theory, there could be "left over" files from old which we don't want in new.
Deleting all of data is not a good idea imo. If you do that you are removing potential data/logs/ directory when we log to file. Also it is removing the user/password idm db. Let's just remove only what is necessary (data/cache) and see if it fixes the problem.
(In reply to Tim Rozet from comment #5) > (In reply to Stephen Kitt from comment #3) > > (In reply to Tim Rozet from comment #2) > > > The problem is the random UUID changes between ODL RPMs and the path is > > > hardcoded into the features boot. This needs to be a consistent file name > > > and not use a UUID. > > > > I’m not sure using a consistent file name is the right approach. The > > What is the value in having a randomly named file between each RPM? Seems > like this should be fixed anyway if there is no advantage here. Perhaps we should clone this as a wishlist bug against opendaylight then... There’s no advantage AFAICT, at least for us; but the way Karaf 4 is built means that that’s what we currently end up with. I’ll see if we can return to the Karaf 3 style or at least produce a consistently-named feature file.
Created attachment 1448691 [details] Karaf log with FileNotFoundExceptionetc/opendaylight/karaf/*.xml > Deleting all of data is not a good idea imo. but deleting only data/cache on upgrade leads to this error seen in attached log.txt : 2018-06-07T08:51:04,057 | ERROR | ConfigFeatureListener - ConfigPusher | AbstractFeatureWrapper | 50 - config-persister-feature-adapter - 0.8.0.redhat-10 | Could not parse XML file /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) I'm trying to figure out the exact sequence of events here, and how to best proceed.
I have attempted to reproduce this locally with the following steps which mimics what I think happens here during an upgrade, but surprisingly this actually does works perfectly…. So what is the upgrade doing differently from this: cd netvirt/karaf mvn -o -Pq clean package http://pkgs.devel.redhat.com/cgit/rpms/opendaylight/tree/opendaylight.spec.tmpl?h=rhos-13.0-rhel-7#n73 cp target/assembly/etc/org.apache.karaf.features.cfg{,.orig} sed -i '/^featuresBoot *=/ s/$/,odl-netvirt-openstack/' target/assembly/etc/org.apache.karaf.features.cfg diff target/assembly/etc/org.apache.karaf.features.cfg.orig target/assembly/etc/org.apache.karaf.features.cfg cp -R target/assembly target/assembly.before-upgrade cp -R target/assembly target/assembly.after-upgrade ./target/assembly.before-upgrade/bin/karaf diagstatus:showSvcStatus diagstatus:showSvcStatus diagstatus:showSvcStatus shutdown https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/opendaylight-api.yaml#L139 cp -R target/assembly.before-upgrade/journal target/assembly.after-upgrade/journal cp -R target/assembly.before-upgrade/snapshots target/assembly.after-upgrade/snapshots cp -R target/assembly.before-upgrade/data target/assembly.after-upgrade/data rm -rf target/assembly.after-upgrade/data/cache ./target/assembly.after-upgrade/bin/karaf diagstatus:showSvcStatus diagstatus:showSvcStatus diagstatus:showSvcStatus shutdown ls target/assembly.after-upgrade/etc/opendaylight/karaf/
> but surprisingly this actually does work I made a small mistake, here’s how above, using branch stable/oxygen (which probably makes no difference) and fixed cp commands (without trailing directory repeated; duh) *DOES* reproduce those exact FileNotFoundException: .../etc/opendaylight/… problems: cd netvirt/karaf git checkout stable/oxygen mvn -o -Pq clean package cp target/assembly/etc/org.apache.karaf.features.cfg{,.orig} sed -i '/^featuresBoot *=/ s/$/,odl-netvirt-openstack/' target/assembly/etc/org.apache.karaf.features.cfg diff target/assembly/etc/org.apache.karaf.features.cfg.orig target/assembly/etc/org.apache.karaf.features.cfg cp -R target/assembly target/assembly.before-upgrade ./target/assembly.before-upgrade/bin/karaf diagstatus:showSvcStatus diagstatus:showSvcStatus diagstatus:showSvcStatus shutdown rm -rf target/assembly.after-upgrade cp -R target/assembly target/assembly.after-upgrade cp -R target/assembly.before-upgrade/journal target/assembly.after-upgrade/ cp -R target/assembly.before-upgrade/snapshots target/assembly.after-upgrade/ cp -R target/assembly.before-upgrade/data target/assembly.after-upgrade/ ./target/assembly.after-upgrade/bin/karaf tail -f target/assembly.after-upgrade/data/log/karaf.log NB: This is if we DO NOT rm -rf data/cache! Now if you redo above but also: rm -rf target/assembly.after-upgrade/data/cache ./target/assembly.after-upgrade/bin/karaf Then it works just fine. Conclusion: I think this is just a big confusion - above clearly proves that the problem in attached log is from an environment where, for some reason, data/cache was simple NOT deleted.
I ran a successful deploy and then a level 2 update. This update will remove data, snapshots, and journal directory. The update failed because ODL failed when we check to ensure the ODL rest is responding in deployment. There were multiple exceptions, before and after the update, including a run time exception with pax web service. Additionally the original error of missing the random uuid file was not present (i.e. file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml). However the FileNotFoundException did show up for multiple karaf/ and datastore/ files: Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/06-clustered-entity-ownership.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/legacy-openflow-connection-config.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/01-md-sal.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/05-clustering.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/06-clustered-entity-ownership.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/genius-itm-config.xml (No such file or directory) Jun 07 22:15:53 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/datastore/initial/config/genius-ifm-config.xml (No such file or directory) I could validate that the update did indeed remove all 3 directories in the ansible log. I also saved the previous cache of pre-update so that I could diff them. I'm fairly confident the directory was removed during update as the original OSGI log is not there post update. You can also see in the cache diff that the feature repositories UUID is different: > "org.apache.felix.fileinstall.ca43cb45-1f64-4ddd-ad7e-dd6ada173ce1", \ diff -r /tmp/odl_cache/org.eclipse.osgi/5/data/config/org/apache/karaf/features.config /var/lib/opendaylight/data/cache/org.eclipse.osgi/5/data/config/org/apache/karaf/features.config 3c3 < featuresRepositories="file:/opt/opendaylight/etc/f3025925-1189-449c-bcfc-f6fa88bb00cc.xml" --- > featuresRepositories="file:/opt/opendaylight/etc/d91733ba-0875-4c67-9040-3976a0562592.xml" Going to attach the full cache diff and ODL log. Pax runtime exception: Jun 07 21:32:04 controller-0 dockerd-current[19266]: 2018-06-07T20:32:04,571 | WARN | paxweb-extender-1-thread-1 | ServletContainerInitializerScanner | 438 - org.ops4j.pax.web.pax-web-api - 6.0 Jun 07 21:32:04 controller-0 dockerd-current[19266]: java.lang.RuntimeException: null Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.objectweb.asm.MethodVisitor.visitParameter(Unknown Source) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.objectweb.asm.ClassReader.b(Unknown Source) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.objectweb.asm.ClassReader.accept(Unknown Source) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.objectweb.asm.ClassReader.accept(Unknown Source) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.finder.AbstractFinder.readClassDef(AbstractFinder.java:581) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.finder.BundleAnnotationFinder$AnnotationFindingCallback.foundInDirectory(BundleAnnotationFinder.java:86) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.scanDirectory(BundleResourceFinder.java:151) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.scanBundleClassPath(BundleResourceFinder.java:120) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.osgi.bundle.util.BundleResourceFinder.find(BundleResourceFinder.java:88) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:56) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:50) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.apache.xbean.finder.BundleAnnotationFinder.<init>(BundleAnnotationFinder.java:46) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.utils.ServletContainerInitializerScanner.scanBundles(ServletContainerInitializerScanner.java:107) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.jetty.internal.JettyServerWrapper.addContext(JettyServerWrapper.java:284) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.jetty.internal.JettyServerWrapper.getOrCreateContext(JettyServerWrapper.java:209) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.jetty.internal.JettyServerImpl.getContext(JettyServerImpl.java:289) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.jetty.internal.ServerControllerImpl$Started.getContext(ServerControllerImpl.java:346) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.jetty.internal.ServerControllerImpl.getContext(ServerControllerImpl.java:175) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.internal.HttpServiceStarted.begin(HttpServiceStarted.java:1187) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.service.internal.HttpServiceProxy.begin(HttpServiceProxy.java:440) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.RegisterWebAppVisitorWC.visit(RegisterWebAppVisitorWC.java:176) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.model.WebApp.accept(WebApp.java:619) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.register(WebAppPublisher.java:228) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(WebAppPublisher.java:173) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(WebAppPublisher.java:129) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:941) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:870) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.AbstractTracked.trackInitial(AbstractTracked.java:183) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:318) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.osgi.util.tracker.ServiceTracker.open(ServiceTracker.java:261) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebAppPublisher.publish(WebAppPublisher.java:98) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebObserver.deploy(WebObserver.java:217) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.WebObserver$1.doStart(WebObserver.java:172) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.extender.SimpleExtension.start(SimpleExtension.java:59) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at org.ops4j.pax.web.extender.war.internal.extender.AbstractExtender.lambda$createExtension$0(AbstractExtender.java:277) ~[?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?] Jun 07 21:32:04 controller-0 dockerd-current[19266]: at java.lang.Thread.run(Thread.java:748) [?:?] Missing File exception: Jun 07 21:32:05 controller-0 dockerd-current[19266]: 2018-06-07T20:32:05,086 | ERROR | ConfigFeatureListener - ConfigPusher | AbstractFeatureWrapper | 50 - config-persister-feature-ada Jun 07 21:32:05 controller-0 dockerd-current[19266]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) Jun 07 21:32:05 controller-0 dockerd-current[19266]: at java.io.FileInputStream.open0(Native Method) ~[?:?] Jun 07 21:32:05 controller-0 dockerd-current[19266]: at java.io.FileInputStream.open(FileInputStream.java:195) [?:?] Jun 07 21:32:05 controller-0 dockerd-current[19266]: at java.io.FileInputStream.<init>(FileInputStream.java:138) [?:?] Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.AbstractFeatureWrapper.isConfigSnapshot(AbstractFeatureWrapper.java:116) Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.AbstractFeatureWrapper.getFeatureConfigSnapshotHolders(AbstractFeatureWra Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ChildAwareFeatureWrapper.getFeatureConfigSnapshotHolders(ChildAwareFeatur Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfig(FeatureConfigPusher.java:101) [50:config-p Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfigs(FeatureConfigPusher.java:69) [50:config-p Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.processFeatureEvent(ConfigPushingRunnable.java:70) Jun 07 21:32:05 controller-0 dockerd-current[19266]: at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.run(ConfigPushingRunnable.java:52) [50:config-persi Jun 07 21:32:05 controller-0 dockerd-current[19266]: at java.lang.Thread.run(Thread.java:748) [?:?] Jun 07 21:32:05 controller-0 dockerd-current[19266]: [478B blob data]
Created attachment 1448897 [details] Level 2 update logs with cache diff and odl log
As another test, I removed the cache on this opendaylight container and restarted it. After that the container started fine and did not show the previous errors. Need to investigate further...
Created attachment 1448904 [details] ansible logs for updating overcloud
Looking at the log what is surprising is that ODL container started 3 times: Jun 07 17:22:50 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs Jun 07 21:09:25 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs Jun 07 21:31:38 controller-0 dockerd-current[19266]: + sudo -E kolla_set_configs I would expect that the container would start twice (once during first deployment, and then a second time during update). Right before the 2nd container is started I can see that there are exceptions thrown about reading the cache: Jun 07 20:49:34 controller-0 dockerd-current[19266]: !ENTRY io.netty.handler 4 0 2018-06-07 19:49:34.373 Jun 07 20:49:34 controller-0 dockerd-current[19266]: !MESSAGE FrameworkEvent ERROR Jun 07 20:49:34 controller-0 dockerd-current[19266]: !STACK 0 Jun 07 20:49:34 controller-0 dockerd-current[19266]: java.io.IOException: Exception in opening zip file: /opt/opendaylight/data/cache/org.eclipse.osgi/61/0/bundleFile This tells me that the directories were being removed while the original deployment container was still up. Note, I also see before this that there are other exceptions around communicating with the other ODLs and NoShardLeaderException. Do the updates happen in a rolling fashion (1 controller at a time)? Still not sure what the root cause is. One possible workaround until we know the root cause is in post update tasks rm the cache and restart the container.
Yes, updates happen 1 controller at a time. Update and post update steps are ansible books executed 1 node at a time.
I suspect that data/cache is deleted but then again repopulated becasue it is mounted and container is still up during this time. And during deploy_step when container is started, new data/cache info from container to appended to data/cache on host. But then data/cache files should have both pre and post update info, which I dont think is happening. 1 other option we can try is, explicitly deleting containers before deleting data/cache in update_tasks and then let pauch start a new container (like all pacemaker managed services are doing).
Posted a potential fix. I think there is some corruption happening when the cache folder is removed while the original container is up. The option of restarting the container and removing cache post update will not work, because the opendaylight_ovs service relies on ODL responding in step 4 of the deployment steps, which will happen outside of update_tasks or post_update_tasks. Therefore the potential fix is to stop the odl container before we remove the files. Testing it out...
So the fix didn't work, but it produced some more information. In comment 19 I asked why I saw extra container restart. This is a bug, tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1589346 If we then look at the log of the container with my previous fix, we can see that the corruption did not occur in the originally updated container (16:54). It only happens with the extra container restart (17:42). Jun 08 13:25:01 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs Jun 08 16:54:41 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs Jun 08 17:41:42 controller-1 dockerd-current[19336]: + sudo -E kolla_set_configs Jun 08 17:42:06 controller-1 dockerd-current[19336]: java.io.FileNotFoundException: /opt/opendaylight/etc/opendaylight/karaf/00-netty.xml (No such file or directory) From talking with Jiri, he thinks there is an extra deployment happening in this order: <jistr> so what i *think* right now, with the double-execution issue, it probably looks like: <jistr> 1) update_tasks <jistr> 2) deploy tasks (puppet, containers) <jistr> 3) host_prep_tasks <jistr> 4) deploy tasks again If this is the case, then we can workaround this extra deployment by removing the cache in the host prep tasks as well. I'm going to try that out now.
Now I realize based on Jiri's theory, this is exactly what is happening: 1) With the patch update works fine. 2) With the extra deployment, cache is mounted a new ODL container is brought up without karaf/ files. This triggers the problem. Note, now this is much bigger than update/upgrade. It has the implication that even a configuration change to a deployment will result in this error. The solution I think is still the same (also remove cache in host_prep_tasks) until we can figure out a better way to mount all the karaf/ and other files and then remove them in update/upgrade.
If we'd have the issue on config change too (just recreating the ODL API container without a real minor update), then fixing bug 1589346 wouldn't fully solve the problem. I'm mentioning it mainly because i'd like to avoid rushing a fix for that bug if possible, as the fix means quite a significant change with possible side effects which may be hard to predict, and would likely prompt a lot of updates retesting. As for how to always clear the cache when starting the ODL API container, i've commented with some possibilities that come to my mind, at https://review.openstack.org/#/c/573720/2/docker/services/opendaylight-api.yaml
I agree Jiri, no reason to rush bug 1589346. The patch I posted works, and we never really needed to stop the ODL container (other than to avoid massive logs of exceptions spewing out when the current ODL fails to read the deleted cache). I commented in the gerrit but really what we want to do is mount the missing files on the host that the cache looks for when we do a redeployment/config update. That way the cache will speed up the ODL boot process. *However*, on an update we want to remove those files *and* the cache and then let the new ODL generate its own new config files. The fix I proposed achieves that for update/upgrade, but it also removes cache on redeploy/config update. This isn't ideal, but it works and we can optimize it in a future patch if you are OK with that.
It fails with the patch. We are looking for the cause.
The problem was we were moving a cache before update, but there was a corner case where dockerd would be restarted between the cache delete and the update of ODL. That would cause a new stale cache to be introduced and update to fail. The fix is to move the docker restart policy on the ODL container to be "unless-stopped" which means that docker will not restart this container if it was manually stopped. Since this problem will be resolved in the undercloud and this is the first supported release of ODL. It is safe to re-target this fix into Z stream.
Minor update works fine with these patches. 1. cache is removed before update which solves the file UUID issue. 2. Setting ODL container policy solves restarting container issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2214