Bug 2207812
| Summary: | Adding an ENI to an EC2 instance sometimes breaks the network because ip rules/tables get ruined | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Curtis Taylor <cutaylor> |
| Component: | NetworkManager | Assignee: | Thomas Haller <thaller> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Pokryvka <fpokryvk> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.1 | CC: | andavis, bgalvani, ferferna, libhe, linl, lrintel, qzhang, rkhan, sfaye, sukulkar, thaller, till, vbenes, vkuznets, xiliang, ymao |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | aarch64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | NetworkManager-1.43.10-1.el9 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-07 08:38:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2221903 | | |
Comment 1
Thomas Haller
2023-05-19 14:48:10 UTC
Would it also be possible to enable debug logging in nm-cloud-setup and get logs? Those show what configuration was received. You could do that via:

mkdir -p /etc/systemd/system/nm-cloud-setup.service.d
cat > /etc/systemd/system/nm-cloud-setup.service.d/90-log.conf << EOF
[Service]
Environment=NM_CLOUD_SETUP_LOG=TRACE
EOF
systemctl daemon-reload

(or, if you prefer, `systemctl edit nm-cloud-setup.service`)

Thomas, thank you for replying. I am working on collecting those logs, and setting this BZ Needinfo on myself.

Thomas, it appears someone (or something) has deleted my test instance, so I will have to reproduce the results. Will it be sufficient to see one test where the addition of the ENI causes the issue, or do you want to see logs where the same ENI adds successfully too? (I assume I will get it set up again, collect the logs, and before I get back to it again it will be destroyed, so I'd rather collect as much as possible.)

It took a lot of tries to reproduce the issue after rebuilding the instance. After about 20 tries, including twice rebuilding the ENI I was adding, I recreated the 3rd ENI a third time but manually assigned an IP address on the same subnet as the first interface with a lower host address (i.e. eth0 had 172.31.33.44, eth2 was assigned 172.31.33.43). After assigning the 3rd ENI's IP address, the hang happened. I did this because every previous attempt had all 3 interfaces on different subnets, whereas when I opened the BZ the 3rd interface was automatically assigned an address on the same subnet as eth0 with a lower host address. And it doesn't seem as though the manual assignment was intrinsically wrong, because after a second reproduction I rebooted while forgetting to detach the 3rd interface, and after boot everything worked. (The customer reported the same behavior.)

I'm attaching the messages from all of the attempts today. I waited to turn on debug until I had the configuration that reproduced the issue. (Detaching the 3rd interface that reproduces the issue, then rebooting, allows it to immediately reproduce again after reboot.) Once I was able to reproduce the issue I did:

# logger "Try 1: Begin"
- Using the web portal, attach the 3rd interface to the instance
  [instance becomes unreachable in a previously working ssh session]
- detach the 3rd interface from the instance
- wait some period, hoping not to lose critical messages
- select the instance and reboot
- after boot, ssh
# logger "Try 1: (After reboot) END"

# logger "Try 2: Begin"
- Using the web portal, attach the 3rd interface to the instance
  [instance becomes unreachable in a previously working ssh session]
- (distracted and forgot to detach the 3rd interface)
- wait 10+ minutes
- in the web portal, select the instance and reboot
- after boot, ssh
# logger "Try 2: (After reboot) END"

These messages stood out. I rebooted pretty quickly after the first reproduction, and left it for a good 10 minutes after the second.
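(For reference, trace output produced by the drop-in above can be followed live with standard systemd tooling while reproducing; generic commands, not specific to this bug:)

systemctl restart nm-cloud-setup.service   # re-run nm-cloud-setup with NM_CLOUD_SETUP_LOG=TRACE from the drop-in
journalctl -u nm-cloud-setup.service -f    # follow its log output during the attach/detach attempts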
May 26 21:20:30 ip-172-31-33-44 root[1654]: Try 1: Begin
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00188] <info> provider ec2 detected
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01950] <info> some changes were applied for provider ec2
May 26 21:21:02 ip-172-31-33-44 nm-cloud-setup[1737]: [121.03473] <debug> http-request[5ed727cc776d97bb, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:21:05 ip-172-31-33-44 nm-cloud-setup[1737]: [124.03517] <debug> http-request[b27b8d3c0ab3d37d, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:21:06 ip-172-31-33-44 nm-cloud-setup[1737]: [125.03451] <info> provider ec2 not detected: failure to get EC2 metadata: timeout expired
[reboot]
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[949]: [4.05820] <info> provider ec2 detected
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[949]: [4.07014] <info> some changes were applied for provider ec2
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[1149]: [4.07530] <info> provider ec2 detected
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[1149]: [4.07956] <debug> no changes were applied for provider ec2
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[1227]: [4.08432] <info> provider ec2 detected
May 26 21:24:05 ip-172-31-33-44 nm-cloud-setup[1227]: [4.08514] <debug> no changes were applied for provider ec2
May 26 21:24:42 ip-172-31-33-44 root[1656]: Try 1: (After reboot) END
May 26 21:48:25 ip-172-31-33-44 root[1775]: Try 2: Begin
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.06204] <info> provider ec2 detected
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.07430] <info> some changes were applied for provider ec2
May 26 21:48:52 ip-172-31-33-44 nm-cloud-setup[1859]: [236.09058] <debug> http-request[b74abdcbd2a6f15b, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:48:55 ip-172-31-33-44 nm-cloud-setup[1859]: [239.09101] <debug> http-request[37572ddd118ec054, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:48:56 ip-172-31-33-44 nm-cloud-setup[1859]: [240.09044] <info> provider ec2 not detected: failure to get EC2 metadata: timeout expired
May 26 21:50:13 ip-172-31-33-44 nm-cloud-setup[1870]: [317.06569] <debug> http-request[7982e6aba6a0d335, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:50:16 ip-172-31-33-44 nm-cloud-setup[1870]: [320.06612] <debug> http-request[eaa10bb409470ccf, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:50:17 ip-172-31-33-44 nm-cloud-setup[1870]: [321.06545] <info> provider ec2 not detected: failure to get EC2 metadata: timeout expired
May 26 21:56:13 ip-172-31-33-44 nm-cloud-setup[1940]: [677.06585] <debug> http-request[4a796857d4f31601, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:56:16 ip-172-31-33-44 nm-cloud-setup[1940]: [680.06629] <debug> http-request[3f1317c186c62dca, "http://169.254.169.254/latest/meta-data/"]: failed with HTTP request timed out
May 26 21:56:17 ip-172-31-33-44 nm-cloud-setup[1940]: [681.06562] <info> provider ec2 not detected: failure to get EC2 metadata: timeout expired
[reboot]
May 26 22:01:31 ip-172-31-33-44 nm-cloud-setup[971]: [4.07026] <info> provider ec2 detected
May 26 22:01:31 ip-172-31-33-44 nm-cloud-setup[971]: [4.08754] <info> some changes were applied for provider ec2
May 26 22:01:31 ip-172-31-33-44 nm-cloud-setup[1229]: [4.09160] <info> provider ec2 detected
May 26 22:01:31 ip-172-31-33-44 nm-cloud-setup[1229]: [4.09794] <info> some changes were applied for provider ec2
May 26 22:01:57 ip-172-31-33-44 root[1687]: Try 2: (After reboot) END

The interfaces are as follows, with eth2 being the interface whose addition caused the disruption:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 06:39:8b:2a:54:f3 brd ff:ff:ff:ff:ff:ff
inet 172.31.33.44/20 brd 172.31.47.255 scope global noprefixroute eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 06:5f:6c:8d:9e:95 brd ff:ff:ff:ff:ff:ff
inet 172.31.36.25/20 brd 172.31.47.255 scope global noprefixroute eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 06:22:71:ac:b6:37 brd ff:ff:ff:ff:ff:ff
inet 172.31.34.43/20 brd 172.31.47.255 scope global noprefixroute eth2
Each time eth2 was added in those two failure cases, eth2 took v_mac_data->iface_idx 0:
May 26 21:19:05 ip-172-31-33-44 nm-cloud-setup[1214]: [4.08694] <debug> get-config: start fetching meta data for #0, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 21:19:05 ip-172-31-33-44 nm-cloud-setup[1214]: [4.08694] <debug> get-config: start fetching meta data for #1, 06:5F:6C:8D:9E:95 (06:5f:6c:8d:9e:95/)
May 26 21:19:05 ip-172-31-33-44 nm-cloud-setup[1214]: [4.08703] <debug> get-config: success
May 26 21:20:30 ip-172-31-33-44 root[1654]: Try 1: Begin
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00252] <debug> get-config: starting
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00257] <debug> get-config: start fetching meta data for #0, 06:22:71:AC:B6:37 (06:22:71:ac:b6:37/)
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00257] <debug> get-config: start fetching meta data for #2, 06:5F:6C:8D:9E:95 (06:5f:6c:8d:9e:95/)
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00258] <debug> get-config: start fetching meta data for #1, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 21:47:48 ip-172-31-33-44 nm-cloud-setup[1758]: [172.01562] <debug> get-config: start fetching meta data for #1, 06:5F:6C:8D:9E:95 (06:5f:6c:8d:9e:95/)
May 26 21:47:48 ip-172-31-33-44 nm-cloud-setup[1758]: [172.01562] <debug> get-config: start fetching meta data for #0, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 21:47:48 ip-172-31-33-44 nm-cloud-setup[1758]: [172.01571] <debug> get-config: success
May 26 21:48:25 ip-172-31-33-44 root[1775]: Try 2: Begin
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.06267] <debug> get-config: starting
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.06272] <debug> get-config: start fetching meta data for #0, 06:22:71:AC:B6:37 (06:22:71:ac:b6:37/)
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.06272] <debug> get-config: start fetching meta data for #1, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 21:48:49 ip-172-31-33-44 nm-cloud-setup[1825]: [233.06272] <debug> get-config: start fetching meta data for #2, 06:5F:6C:8D:9E:95 (06:5f:6c:8d:9e:95/)
When adding that same ENI as the second ENI to the VM, it still takes index 0, but adding the next interface afterwards still works without a flaw:
BOOT:
May 26 23:15:15 ip-172-31-33-44 nm-cloud-setup[923]: [4.04260] <debug> get-config: start fetching meta data for #0, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
Add what was the 3rd ENI, which made the VM unreachable, eth2:
May 26 23:15:57 ip-172-31-33-44 nm-cloud-setup[1636]: [46.03554] <debug> get-config: start fetching meta data for #0, 06:22:71:AC:B6:37 (06:22:71:ac:b6:37/)
May 26 23:15:57 ip-172-31-33-44 nm-cloud-setup[1636]: [46.03554] <debug> get-config: start fetching meta data for #1, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 23:15:57 ip-172-31-33-44 nm-cloud-setup[1636]: [46.04909] <info> some changes were applied for provider ec2
Add what was originally the 2nd ENI, which didn't and still doesn't make the VM unreachable, eth1:
May 26 23:16:15 ip-172-31-33-44 nm-cloud-setup[1737]: [64.09736] <debug> get-config: start fetching meta data for #1, 06:39:8B:2A:54:F3 (06:39:8b:2a:54:f3/)
May 26 23:16:15 ip-172-31-33-44 nm-cloud-setup[1737]: [64.09736] <debug> get-config: start fetching meta data for #2, 06:5F:6C:8D:9E:95 (06:5f:6c:8d:9e:95/)
May 26 23:16:15 ip-172-31-33-44 nm-cloud-setup[1737]: [64.09737] <debug> get-config: start fetching meta data for #0, 06:22:71:AC:B6:37 (06:22:71:ac:b6:37/)
May 26 23:16:15 ip-172-31-33-44 nm-cloud-setup[1737]: [64.09780] <info> some changes were applied for provider ec2
So is it that we lost 169.254.169.254, or is this an AWS bug due to their failure to provide the config data in the HTTP GET?
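(For what it's worth, the per-interface data the metadata service hands out can be checked by hand; a quick look, assuming IMDSv1 is reachable on the instance and using eth2's MAC from above:)

curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/
  # lists one MAC per attached interface
curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/06:22:71:ac:b6:37/device-number
  # the device index AWS assigns to that ENI
curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/06:22:71:ac:b6:37/local-ipv4s
  # the private addresses assigned to that ENI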
I am not sure I am looking at the right place, but in the log from comment 6 I see:

May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00269] <debug> meta data received
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00269] <debug> config device 06:22:71:AC:B6:37: configuring "eth2" (/org/freedesktop/NetworkManager/Devices/4)...
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00275] <debug> config device 06:22:71:AC:B6:37: reapply connection "Wired connection 2" (21b7eb9c-9966-3273-ac76-f38dad746580)
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00289] <debug> config device 06:22:71:AC:B6:37: connection "Wired connection 2" (21b7eb9c-9966-3273-ac76-f38dad746580) reapplied
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.00290] <debug> config device 06:39:8B:2A:54:F3: configuring "eth0" (/org/freedesktop/NetworkManager/Devices/2)...
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.0412] audit: op="device-reapply" interface="eth2" ifindex=4 args="ipv4.routing-rules,ipv4.addresses,ipv4.routes" pid=1700 uid=0 result="success"
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1823] dhcp4 (eth2): canceled DHCP transaction
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1824] dhcp4 (eth2): activation: beginning transaction (timeout in 45 seconds)
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1824] dhcp4 (eth2): state changed no lease
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1829] dhcp4 (eth2): activation: beginning transaction (timeout in 45 seconds)
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01750] <debug> config device 06:39:8B:2A:54:F3: reapply connection "System eth0" (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1871] dhcp4 (eth2): state changed new lease, address=172.31.34.43
May 26 21:20:59 ip-172-31-33-44 nm-dispatcher[1691]: req:4 'dhcp4-change' [eth2]: new request (5 scripts)
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01813] <debug> config device 06:39:8B:2A:54:F3: connection "System eth0" (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03) reapplied
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01813] <debug> config device 06:5F:6C:8D:9E:95: configuring "eth1" (/org/freedesktop/NetworkManager/Devices/3)...
May 26 21:20:59 ip-172-31-33-44 NetworkManager[887]: <info> [1685136059.1936] audit: op="device-reapply" interface="eth0" ifindex=2 args="ipv4.routing-rules,ipv4.routes" pid=1700 uid=0 result="success"
May 26 21:20:59 ip-172-31-33-44 nm-dispatcher[1691]: req:4 'dhcp4-change' [eth2]: start running ordered scripts...
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01949] <debug> SIGTERM received
May 26 21:20:59 ip-172-31-33-44 nm-cloud-setup[1700]: [118.01950] <info> some changes were applied for provider ec2

Afterwards, at the next run of nm-cloud-setup, we see that we are unable to fetch the meta-data (and fix the situation):

May 26 21:21:06 ip-172-31-33-44 nm-cloud-setup[1737]: [125.03451] <info> no provider detected

nm-cloud-setup gets stopped via SIGTERM. The "proper way" to run it is via `systemctl restart nm-cloud-setup`, which will kill any running instance and start again. nm-cloud-setup first fetches all information. If it gets killed at that point, there is no problem (it didn't do anything yet).

However, at `May 26 21:20:59` we see that nm-cloud-setup gets killed in the middle, while re-configuring the system. It leaves the system broken, and the subsequent run cannot fix it (because the metadata service is now unreachable). I think the (simple) solution is to ignore SIGTERM during the "reconfiguration" phase.

Thank you Curtis, for all the effort. Very helpful!!

I think, when the bad thing happens, we see that nm-cloud-setup logs "SIGTERM received" while only having configured some of the interfaces. And on the next run, we see nm-cloud-setup bailing out with "<info> no provider detected".

Curtis, would it be possible to test https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52995591 ? It's a rhel-9.1 rebuild, with a patch that (hopefully) fixes this issue. You would just need to install it and try whether you can still reproduce the problem (the expectation being that you are unable to reproduce).
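(A minimal shell sketch of the behaviour proposed above, ignoring SIGTERM only while changes are being applied; fetch_metadata and apply_config are hypothetical placeholders for illustration, not the actual nm-cloud-setup code, which handles this internally:)

#!/bin/sh
# Fetch phase: being killed here is harmless, nothing has been changed yet.
config="$(fetch_metadata)"     # hypothetical helper that queries 169.254.169.254

# Reconfiguration phase: ignore SIGTERM so a `systemctl restart` can no longer
# interrupt a half-applied set of addresses, rules and routes.
trap '' TERM
apply_config "$config"         # hypothetical helper that reapplies the connections
trap - TERM                    # restore default SIGTERM handling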
How I tested:
- It took several attempts to get into the failure state, which seems to require adding an ENI adapter that takes over index 0 from the first adapter in the ip rule list. (I've no idea what decides this, but it has required building different ENIs in EC2 until finally this is the outcome.)
- Once in the failure state, I am able to detach the most recently added (3rd) adapter, force reboot, and then get back to the same failure state with a subsequent add of the same adapter.
- After confirming 3 times that re-adding the same adapter triggers the failure, I updated NM to the test rpms.
- Made 3 attempts to reproduce the issue, and every time the ip rules looked like this prior to adding the 3rd ENI:
0: from all lookup local
30200: from 172.31.100.244 lookup 30200
30201: from 172.31.100.55 lookup 30201
30350: from all lookup main suppress_prefixlength 0
30400: from 172.31.100.244 lookup 30400
30401: from 172.31.100.55 lookup 30401
32766: from all lookup main
32767: from all lookup default
Then, after the successful addition of a 3rd ENI, the ip rules became:
0: from all lookup local
30200: from 172.31.45.189 lookup 30200
30201: from 172.31.100.244 lookup 30201
30202: from 172.31.100.55 lookup 30202
30350: from all lookup main suppress_prefixlength 0
30400: from 172.31.45.189 lookup 30400
30401: from 172.31.100.244 lookup 30401
30402: from 172.31.100.55 lookup 30402
32766: from all lookup main
32767: from all lookup default
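(For completeness, the tables behind those rule priorities can be inspected with plain iproute2; a quick check, assuming the same table numbers as listed above:)

ip rule show                   # the policy rules shown above
ip route show table 30200      # routes reachable via the 30200-priority rule
ip route show table 30400      # routes reachable via the 30400-priority rule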
Using:
NetworkManager-libnm-1.40.0-2.1.rh2207812.el9_1.aarch64
NetworkManager-1.40.0-2.1.rh2207812.el9_1.aarch64
NetworkManager-cloud-setup-1.40.0-2.1.rh2207812.el9_1.aarch64
NetworkManager-team-1.40.0-2.1.rh2207812.el9_1.aarch64
NetworkManager-tui-1.40.0-2.1.rh2207812.el9_1.aarch64
Conclusion: The issue seems to be fixed.
(In reply to Curtis Taylor from comment #11)
> Conclusion: The issue seems to be fixed.

Thank you for your efforts. That's very helpful!!

This bug is also observed in RHEL 8.9; another bug, https://bugzilla.redhat.com/show_bug.cgi?id=2221903, was opened to track it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6585