Created attachment 1779004 [details]
The nohup output file is also attached to show the logs in more detail.

Description of problem:
After executing the upgrade command with no tags on a Ceph node, the network config files vanished from that node.

Version-Release number of selected component (if applicable):

How reproducible:
During the RHOSP 13 to 16 FFU, the Ceph node upgrade fails because openvswitch is in an inactive state.

Steps to Reproduce:
We executed the following command for the Ceph node upgrade:

nohup openstack overcloud upgrade run --stack overcloud --limit overcloud-cephstorage-0 -y &

The Ceph node upgrade failed with the following error:

os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Below is the "ip a" output of the Ceph node:

[root@overcloud-cephstorage-0 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 14:18:77:43:ab:c0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.221/24 brd 192.168.100.255 scope global dynamic em1
       valid_lft 81907sec preferred_lft 81907sec
    inet6 fe80::1618:77ff:fe43:abc0/64 scope link
       valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 14:18:77:43:ab:c1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1618:77ff:fe43:abc1/64 scope link
       valid_lft forever preferred_lft forever
4: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 14:18:77:43:ab:c2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1618:77ff:fe43:abc2/64 scope link
       valid_lft forever preferred_lft forever
5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 14:18:77:43:ab:c3 brd ff:ff:ff:ff:ff:ff
6: p1p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:ec:4c:44 brd ff:ff:ff:ff:ff:ff
7: p1p2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:ec:4c:46 brd ff:ff:ff:ff:ff:ff
8: p4p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:d3:ce:48 brd ff:ff:ff:ff:ff:ff
9: p4p2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:d3:ce:4a brd ff:ff:ff:ff:ff:ff

Below is the "ip r" output of the affected node:

[root@overcloud-cephstorage-0 ~]# ip r
default via 192.168.100.34 dev em1
192.168.100.0/24 dev em1 proto kernel scope link src 192.168.100.221

Actual results:
The upgrade was failing because the ovs-vswitchd service was inactive during the Ceph node upgrade, which in turn broke networking on the node.

[root@overcloud-cephstorage-0 ~]# systemctl list-unit-files | grep -i ovs
ovs-delete-transient-ports.service    static
ovs-vswitchd.service                  static
ovsdb-server.service                  static

[root@overcloud-cephstorage-0 ~]# systemctl status ovs-vswitchd.service
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)

After activating the OVS service, the Ceph node upgrade completed successfully.
[root@overcloud-cephstorage-0 ~]# systemctl status ovs-vswitchd.service
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2021-04-29 15:09:47 UTC; 17s ago
  Process: 96792 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVS_USER_OPT} s>
  Process: 96789 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 96786 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 96844 (ovs-vswitchd)
    Tasks: 1 (limit: 822668)
   Memory: 20.4M
   CGroup: /system.slice/ovs-vswitchd.service
           └─96844 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswit>

Apr 29 15:09:47 overcloud-cephstorage-0 systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 29 15:09:47 overcloud-cephstorage-0 ovs-ctl[96792]: Inserting openvswitch module [  OK  ]
Apr 29 15:09:47 overcloud-cephstorage-0 ovs-ctl[96792]: Starting ovs-vswitchd [  OK  ]
Apr 29 15:09:47 overcloud-cephstorage-0 ovs-vsctl[96851]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait add Open_vSwitch . exter>
Apr 29 15:09:47 overcloud-cephstorage-0 ovs-ctl[96792]: Enabling remote OVSDB managers [  OK  ]
Apr 29 15:09:47 overcloud-cephstorage-0 systemd[1]: Started Open vSwitch Forwarding Unit.
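The manual workaround above (starting OVS before re-running the upgrade) can be scripted so each node is checked first. This is only a sketch of the workaround, not a supported TripleO mechanism; `ensure_ovs_active` is a hypothetical helper name:

```shell
# Hypothetical helper: start Open vSwitch on a node if it is not already
# running, so os-net-config can attach ports to the OVS bridge.
ensure_ovs_active() {
  local unit="ovs-vswitchd.service"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit already active"
  else
    # openvswitch.service pulls in ovsdb-server and ovs-vswitchd.
    systemctl start openvswitch.service
    echo "started openvswitch.service"
  fi
}
```

This could be run on each ceph/compute node (e.g. over ssh as heat-admin) before invoking `openstack overcloud upgrade run` for that node.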
Unset noout flag ------------------------------------------------------- 14.97s
tripleo-podman : Purge /var/lib/docker ---------------------------------- 8.87s
tripleo-podman : Uninstall Docker rpm ----------------------------------- 5.46s
Gathering Facts --------------------------------------------------------- 3.32s
Render all_nodes data as group_vars for overcloud ----------------------- 2.53s
Gathering Facts --------------------------------------------------------- 2.04s
tripleo-podman : Check docker service state ----------------------------- 1.38s
tripleo-podman : Check if docker has some data -------------------------- 0.96s
tripleo-podman : Refresh hardware facts --------------------------------- 0.90s
tripleo-podman : Clean podman images ------------------------------------ 0.38s
tripleo-podman : Clean podman images ------------------------------------ 0.38s
include_tasks ----------------------------------------------------------- 0.35s
tripleo-podman : Clean podman volumes ----------------------------------- 0.33s
include_tasks ----------------------------------------------------------- 0.33s
Stop docker ------------------------------------------------------------- 0.33s
Purge everything about docker on the host ------------------------------- 0.28s
Unset noout flag -------------------------------------------------------- 0.27s
include_tasks ----------------------------------------------------------- 0.27s
include_tasks ----------------------------------------------------------- 0.25s
Stop docker ------------------------------------------------------------- 0.23s

Updated nodes - overcloud-cephstorage-0 Success
2021-04-29 21:05:17.529 528361 INFO tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Completed Overcloud Upgrade Run for overcloud-cephstorage-0 with playbooks ['upgrade_steps_playbook.yaml', 'deploy_steps_playbook.yaml', 'post_upgrade_steps_playbook.yaml']
2021-04-29 21:05:17.535 528361 INFO osc_lib.shell [-] END return value: None

I am getting the same error on a compute node as well. I just wanted to know whether there is any mechanism to keep the openvswitch service active without manual intervention, so that the RHOSP upgrade is not hampered by this state.

Expected results:
The openvswitch service should be active during the upgrade.

Additional info:
RHEL Version: Red Hat Enterprise Linux release 8.2 (Ootpa)
RHOSP Version: 16.2

Below is the upgrade prepare command used:

nohup openstack overcloud upgrade prepare --templates /home/stack/openstack-tripleo-heat-templates-rendered_16 -r /home/stack/templates/roles_data.yaml -n /home/stack/templates/network_data.yaml -e /home/stack/containers-prepare-parameter.yaml -e /home/stack/templates/upgrades-environment.yaml -e /home/stack/templates/rhsm.yml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/network-isolation.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/node-info.yaml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/services/neutron-sriov.yaml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/services/neutron-ovs.yaml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/cinder-backup.yaml -e /home/stack/templates/storage-config.yaml -e /home/stack/openstack-tripleo-heat-templates-rendered_16/environments/host-config-and-reboot.yaml --libvirt-type kvm --ntp-server pool.ntp.org -v -y &

I am using the following Red Hat document for the RHOSP 13 to 16.1 upgrade:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index
Hi OVS team, please advise; I am waiting for your suggestions. Let me know if you need any other information.
Can you attach the NIC config templates? This looks like a configuration issue or an environmental issue, because the logs suggest there was an error with em2 and em3 not getting an IP assigned by DHCP:

"[2021/04/29 02:03:51 PM] [ERROR] Failure(s) occurred when applying configuration",
"[2021/04/29 02:03:51 PM] [ERROR] stdout: ",
"Determining IP information for em2... failed.",
", stderr: WARN      : [ifup] You are using 'ifup' script provided by 'network-scripts', which are now deprecated.",
"WARN      : [ifup] 'network-scripts' will be removed in one of the next major releases of RHEL.",
"WARN      : [ifup] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.",
"",
"[2021/04/29 02:03:51 PM] [ERROR] stdout: ",
"Determining IP information for em3... failed.",
", stderr: WARN      : [ifup] You are using 'ifup' script provided by 'network-scripts', which are now deprecated.",
"WARN      : [ifup] 'network-scripts' will be removed in one of the next major releases of RHEL.",
"WARN      : [ifup] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.",
"",
"Traceback (most recent call last):",
"  File \"/bin/os-net-config\", line 10, in <module>",
"    sys.exit(main())",
"  File \"/usr/lib/python3.6/site-packages/os_net_config/cli.py\", line 349, in main",
"    activate=not opts.no_activate)",
"  File \"/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py\", line 1806, in apply",
"    raise os_net_config.ConfigurationError(message)",
"os_net_config.ConfigurationError: Failure(s) occurred when applying configuration",
"+ RETVAL=1",
"+ set -e",
"+ [[ 1 == 2 ]]",
"+ [[ 1 != 0 ]]",
"+ echo 'ERROR: configuration of safe defaults failed.'"
Created attachment 1784160 [details] ceph-storage.yaml file
Created attachment 1784161 [details] controller.yaml file
Created attachment 1784162 [details] computesriov.yaml file
(In reply to Giulio Fidente from comment #3)
> can you attach the NIC config templates; this looks like a configuration
> issue or an environmental issue because the logs suggest there was an error
> with em2 and em3 not getting the IP assigned by DHCP
> 
> "[2021/04/29 02:03:51 PM] [ERROR] Failure(s) occurred when applying
> configuration",
> "[2021/04/29 02:03:51 PM] [ERROR] stdout: ",
> "Determining IP information for em2... failed.",
> ", stderr: WARN      : [ifup] You are using 'ifup' script provided
> by 'network-scripts', which are now deprecated.",
> "WARN      : [ifup] 'network-scripts' will be removed in one of the
> next major releases of RHEL.",
> "WARN      : [ifup] It is advised to switch to 'NetworkManager'
> instead - it provides 'ifup/ifdown' scripts as well.",
> "",
> "[2021/04/29 02:03:51 PM] [ERROR] stdout: ",
> "Determining IP information for em3... failed.",
> ", stderr: WARN      : [ifup] You are using 'ifup' script provided
> by 'network-scripts', which are now deprecated.",
> "WARN      : [ifup] 'network-scripts' will be removed in one of the
> next major releases of RHEL.",
> "WARN      : [ifup] It is advised to switch to 'NetworkManager'
> instead - it provides 'ifup/ifdown' scripts as well.",
> "",
> "Traceback (most recent call last):",
> "  File \"/bin/os-net-config\", line 10, in <module>",
> "    sys.exit(main())",
> "  File \"/usr/lib/python3.6/site-packages/os_net_config/cli.py\",
> line 349, in main",
> "    activate=not opts.no_activate)",
> "  File
> \"/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py\", line 1806,
> in apply",
> "    raise os_net_config.ConfigurationError(message)",
> "os_net_config.ConfigurationError: Failure(s) occurred when applying
> configuration",
> "+ RETVAL=1",
> "+ set -e",
> "+ [[ 1 == 2 ]]",
> "+ [[ 1 != 0 ]]",
> "+ echo 'ERROR: configuration of safe defaults failed.'"

Hi Giulio Fidente,
I have attached the nic-config files for the controller, computesriov, and ceph-storage roles. Let me know if any other info is required from my side.
Hi, there seems to be a validation issue with the NIC template config/indentation:

"[2021/04/29 02:00:28 PM] [WARNING] Config file failed schema validation at network_config/1:",
"    {'dns_servers': ['8.8.8.8', '8.8.4.4'], 'domain': [], 'members': [{'bonding_options': 'bond_mode=active-backup', 'members': [{'name': 'em2', 'primary': True, 'type': 'interface'}, {'name': 'em3', 'type': 'interface'}], 'name': 'bond1', 'ovs_options': None, 'type': 'ovs_bond'}, {'addresses': [{'ip_netmask': '192.168.23.161/24'}], 'type': 'vlan', 'vlan_id': 23}, {'addresses': [{'ip_netmask': '192.168.24.200/24'}], 'type': 'vlan', 'vlan_id': 24}], 'name': 'br-bond', 'type': 'ovs_bridge', 'nic_mapping': None, 'persist_mapping': False} is not valid under any of the given schemas",
"    Sub-schemas tested and not matching:",
"    - items/oneOf/ovs_bridge/members/items/oneOf: {'bonding_options': 'bond_mode=active-backup', 'members': [{'name': 'em2', 'primary': True, 'type': 'interface'}, {'name': 'em3', 'type': 'interface'}], 'name': 'bond1', 'ovs_options': None, 'type': 'ovs_bond'} is not valid under any of the given schemas",
"    -- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/additionalProperties: Additional properties are not allowed ('bonding_options' was unexpected)",
"    -- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/ovs_options/oneOf: None is not valid under any of the given schemas",
"    --- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/ovs_options/oneOf/ovs_options_string/type: 'None' is not of type 'string'",
"    --- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/ovs_options/oneOf/param/oneOf: None is not valid under any of the given schemas",
"    ---- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/ovs_options/oneOf/param/oneOf/0/type: 'None' is not of type 'object'",
"    ---- items/oneOf/ovs_bridge/members/items/oneOf/ovs_bond/ovs_options/oneOf/param/oneOf/1/type: 'None' is not of type 'object'",

We're looking into that to try to find the exact problem.
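To illustrate what the validator is complaining about, here is a minimal, self-contained sketch (not the real os-net-config validator, whose full rules live in schema.yaml): it reproduces only the two failures reported above, namely the unexpected 'bonding_options' key on an ovs_bond and 'ovs_options' being None instead of a string. The allowed-key set is an illustrative assumption, not the real schema.

```python
# Minimal sketch of the two schema failures reported above; NOT the real
# os-net-config validator. ALLOWED_OVS_BOND_KEYS is an illustrative subset.
ALLOWED_OVS_BOND_KEYS = {"type", "name", "ovs_options", "members"}

def check_ovs_bridge(bridge):
    """Return a list of human-readable schema violations for an ovs_bridge."""
    errors = []
    for member in bridge.get("members", []):
        if member.get("type") != "ovs_bond":
            continue
        name = member.get("name", "?")
        for key in member:
            if key not in ALLOWED_OVS_BOND_KEYS:
                # Mirrors: "Additional properties are not allowed"
                errors.append("ovs_bond %s: unexpected property %r" % (name, key))
        if "ovs_options" in member and not isinstance(member["ovs_options"], (str, dict)):
            # Mirrors: "'None' is not of type 'string'"
            errors.append("ovs_bond %s: ovs_options must be a string or param, got %r"
                          % (name, member["ovs_options"]))
    return errors

# The failing fragment from the log above, as parsed data:
bridge = {
    "type": "ovs_bridge", "name": "br-bond",
    "members": [{
        "type": "ovs_bond", "name": "bond1",
        "ovs_options": None,
        "bonding_options": "bond_mode=active-backup",
        "members": [{"name": "em2", "type": "interface", "primary": True},
                    {"name": "em3", "type": "interface"}],
    }],
}
for err in check_ovs_bridge(bridge):
    print(err)
```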
Hi,

Looking at the ceph-storage.yaml file, it appears you're using bonding_options, which is intended to be used with Linux bonds:

      - type: ovs_bridge
        name: br-bond
        dns_servers:
          get_param: DnsServers
        domain:
          get_param: DnsSearchDomains
        members:
        - type: ovs_bond
          name: bond1
          ovs_options: null
          bonding_options:
            get_param: BondInterfaceOvsOptions
          members:
          - type: interface
            name: em2
            primary: true
          - type: interface
            name: em3

The correct way to do this with an ovs_bond would be:

      - type: ovs_bridge
        name: br-bond
        dns_servers:
          get_param: DnsServers
        domain:
          get_param: DnsSearchDomains
        members:
        - type: ovs_bond
          name: bond1
          ovs_options:
            get_param: BondInterfaceOvsOptions
          members:
          - type: interface
            name: em2
            primary: true
          - type: interface
            name: em3

The schema is defined here, and we can see that bonding_options is only used for the linux_bond and linux_team interface types:
https://github.com/openstack/os-net-config/blob/stable/train/os_net_config/schema.yaml#L1165-L1181

Whereas ovs_bond uses ovs_options:
https://github.com/openstack/os-net-config/blob/stable/train/os_net_config/schema.yaml#L606-L623

Here are some examples from the default network config files for reference:

linux_bond example:
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/network/config/bond-with-vlans/role.role.j2.yaml#L195-L200

ovs_bond example:
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/network/config/bond-with-vlans/role.role.j2.yaml#L159-L164

I believe that is the reason the schemas are not matching and you're getting those errors.
I would recommend the following changes:

- Convert from OVS bond to Linux bond

- Add "device: bond1" to the VLANs that are attached to the bond

- Remove the OVS bridge (there is no reason for it if you are using Linux bonds)

- Remove the line that has "ovs_options: null"

Alternately, you can remove the "bonding_options:" line and put the "get_param: BondInterfaceOvsOptions" under the "ovs_options:" line (without the "null").

So it would look like this:

  network_config:
    - type: interface
      name: em1
      use_dhcp: false
      dns_servers:
        get_param: DnsServers
      domain:
        get_param: DnsSearchDomains
      addresses:
        - ip_netmask:
            list_join:
              - /
              - - get_param: ControlPlaneIp
                - get_param: ControlPlaneSubnetCidr
      routes:
        - ip_netmask: 169.254.169.254/32
          next_hop:
            get_param: EC2MetadataIp
        - default: true
          next_hop:
            get_param: ControlPlaneDefaultRoute
    - type: linux_bond
      name: bond1
      bonding_options:
        get_param: BondInterfaceOvsOptions
      members:
        - type: interface
          name: em2
          primary: true
        - type: interface
          name: em3
    - type: vlan
      device: bond1
      vlan_id:
        get_param: StorageNetworkVlanID
      addresses:
        - ip_netmask:
            get_param: StorageIpSubnet
    - type: vlan
      device: bond1
      vlan_id:
        get_param: StorageMgmtNetworkVlanID
      addresses:
        - ip_netmask:
            get_param: StorageMgmtIpSubnet
Note that in order to apply the network configuration, you will have to have NetworkDeploymentActions set with "UPDATE" in the list in an environment file:

parameter_defaults:
  NetworkDeploymentActions: ["CREATE","UPDATE"]

Then run a stack update; on subsequent stack updates you shouldn't need to set NetworkDeploymentActions.
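As a concrete example, the one-off environment file could be created like this (the file name network-deployment-actions.yaml is an arbitrary choice, not a TripleO convention):

```shell
# Write a one-off environment file that tells TripleO to re-apply the NIC
# configuration on the next stack update (not only on node creation).
cat > network-deployment-actions.yaml <<'EOF'
parameter_defaults:
  NetworkDeploymentActions: ["CREATE","UPDATE"]
EOF
```

It would then be passed with an extra `-e network-deployment-actions.yaml` on the deploy/update command, and dropped again after the update completes so node reconfiguration stays opt-in.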
(In reply to Dan Sneddon from comment #10)
> I would recommend the following changes:
> 
> - Convert from OVS bond to Linux bond
> 
> - Add "device: bond1" to the VLANs that are attached to the bond
> 
> - Remove OVS bridge (there is no reason for it if you are using Linux bonds)
> 
> - Remove line that has: "ovs_options: null"
> 
> 
> Alternately, you can remove the "bonding_options:" line and put the
> "get_param: BondInterfaceOvsOptions" under the ovs_options: line (without
> the "null").
> 
> So it would look like this:
> 
>   network_config:
>     - type: interface
>       name: em1
>       use_dhcp: false
>       dns_servers:
>         get_param: DnsServers
>       domain:
>         get_param: DnsSearchDomains
>       addresses:
>         - ip_netmask:
>             list_join:
>               - /
>               - - get_param: ControlPlaneIp
>                 - get_param: ControlPlaneSubnetCidr
>       routes:
>         - ip_netmask: 169.254.169.254/32
>           next_hop:
>             get_param: EC2MetadataIp
>         - default: true
>           next_hop:
>             get_param: ControlPlaneDefaultRoute
>     - type: linux_bond
>       name: bond1
>       bonding_options:
>         get_param: BondInterfaceOvsOptions
>       members:
>         - type: interface
>           name: em2
>           primary: true
>         - type: interface
>           name: em3
>     - type: vlan
>       vlan_id:
>         get_param: StorageNetworkVlanID
>       addresses:
>         - ip_netmask:
>             get_param: StorageIpSubnet
>     - type: vlan
>       vlan_id:
>         get_param: StorageMgmtNetworkVlanID
>       addresses:
>         - ip_netmask:
>             get_param: StorageMgmtIpSubnet

Hi Dan,
I will apply the changes you suggested and verify again during the RHOSP fast forward upgrade. One thing that comes to mind: why am I not getting this error for the controller nodes? I am using the same bonding_options parameter in the controller nic-configs file as well, like below:

      - type: ovs_bridge
        name: br-ex
        dns_servers:
          get_param: DnsServers
        domain:
          get_param: DnsSearchDomains
        members:
        - type: ovs_bond
          name: bond1
          ovs_options: null
          bonding_options:
            get_param: BondInterfaceOvsOptions
          members:
          - type: interface
            name: em2
            primary: true
          - type: interface
            name: em3