@MoeSalah1999
Contributor

@MoeSalah1999 MoeSalah1999 commented Dec 31, 2025

Proposed Commit Message


- Summary

On EC2 instances with multiple network interfaces, cloud-init may misidentify
the primary NIC when the OS interface enumeration order does not match the EC2
primary interface. This can lead to delayed network configuration and slow
boot times while waiting on DHCP for a non-primary interface.

This change introduces a small helper to reliably identify the EC2 primary
network interface directly from instance metadata.

- Details

EC2 metadata explicitly exposes the primary interface via:

- network-card == 0
- device-number == 0

However, this information was previously only used indirectly and not exposed
as a reusable helper. In scenarios where ENIs are attached in a non-standard
order, cloud-init may select the wrong interface as primary based on OS
enumeration.

This patch adds `get_primary_mac_from_metadata()` to the EC2 helpers module,
providing a deterministic and metadata-driven way to identify the primary NIC.
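
As a rough illustration of the selection rule described above, the following is a minimal sketch and not the code from this PR. It assumes the crawled metadata is a nested dict shaped like `network -> interfaces -> macs -> <mac> -> {...}`, that values may arrive as strings, and that a missing `network-card` key can be treated as card 0. The name `find_primary_mac` is hypothetical.

```python
import logging
from collections.abc import Mapping
from typing import Any, Optional

LOG = logging.getLogger(__name__)


def _as_int(value: Any) -> Optional[int]:
    """Return value as an int, or None if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def find_primary_mac(metadata: Any) -> Optional[str]:
    """Hypothetical sketch: return the MAC whose network-card and
    device-number are both 0, or None when no such entry is found."""
    try:
        # Assumed layout: metadata["network"]["interfaces"]["macs"][mac]
        macs = metadata["network"]["interfaces"]["macs"]
    except (KeyError, TypeError):
        LOG.debug("No per-MAC network metadata available")
        return None
    if not isinstance(macs, Mapping):
        LOG.debug("Unexpected type for macs metadata: %s", type(macs))
        return None

    candidates = []
    for mac, nic in macs.items():
        if not isinstance(nic, Mapping):
            continue  # skip malformed entries instead of raising
        # Assumption: an absent network-card key means card 0.
        card = _as_int(nic.get("network-card", 0))
        device = _as_int(nic.get("device-number"))
        if card == 0 and device == 0:
            candidates.append(str(mac).lower())

    if not candidates:
        LOG.debug("No primary NIC candidate found in metadata")
        return None
    # Sorting keeps the result independent of dict iteration order.
    return sorted(candidates)[0]
```

Missing or malformed metadata yields `None` with only debug logging, so a caller could fall back to whatever selection it already uses.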


- Motivation

This helper enables future fixes to NIC ordering and network bring-up logic
without relying on OS-level interface ordering, which is unreliable on
Nitro-based EC2 instances.


Fixes GH-6618

Additional Context

Behavior

  • No behavior change for single-NIC instances
  • No behavior change when metadata is missing or malformed
  • Deterministic selection when multiple candidates are present
  • Debug logging only; no warnings or exceptions introduced

Test Steps

Unit tests added to validate:

  • Empty and malformed metadata
  • Correct identification of primary NIC
  • Correct behavior when primary NIC is not first
  • Deterministic behavior with multiple candidates

Run `pytest tests/unittests/sources/helpers/test_ec2.py`
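
The cases listed above could be exercised along these lines. This is a sketch only: `primary_mac` is a compact, hypothetical stand-in for the helper under test, the metadata layout is assumed, and the MAC addresses are made up. The real tests live in the file named above and may be structured differently.

```python
import unittest


def primary_mac(metadata):
    """Compact stand-in for the helper under test (hypothetical)."""
    try:
        macs = metadata["network"]["interfaces"]["macs"]
        hits = [
            mac
            for mac, nic in macs.items()
            if int(nic.get("network-card", 0)) == 0
            and int(nic["device-number"]) == 0
        ]
    except (AttributeError, KeyError, TypeError, ValueError):
        return None  # treat any malformed shape as "no answer"
    return min(hits) if hits else None


def wrap(macs):
    """Build the assumed metadata nesting around a macs mapping."""
    return {"network": {"interfaces": {"macs": macs}}}


PRIMARY = {"device-number": "0", "network-card": "0"}
SECONDARY = {"device-number": "1", "network-card": "0"}
MAC_A = "0a:00:00:00:00:01"
MAC_B = "0a:00:00:00:00:02"


class TestPrimaryMac(unittest.TestCase):
    def test_empty_and_malformed(self):
        for bad in ({}, None, {"network": "oops"}, wrap([])):
            with self.subTest(metadata=bad):
                self.assertIsNone(primary_mac(bad))

    def test_primary_not_first(self):
        # The secondary ENI is listed before the primary one.
        md = wrap({MAC_B: SECONDARY, MAC_A: PRIMARY})
        self.assertEqual(primary_mac(md), MAC_A)

    def test_multiple_candidates_is_deterministic(self):
        # Same entries, opposite insertion order, same answer.
        first = wrap({MAC_B: PRIMARY, MAC_A: PRIMARY})
        second = wrap({MAC_A: PRIMARY, MAC_B: PRIMARY})
        self.assertEqual(primary_mac(first), primary_mac(second))
```

Both `pytest` and `python -m unittest` collect `unittest.TestCase` classes, so the sketch runs under either.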

This change is covered by new unit tests exercising EC2 metadata shapes where the primary ENI is not first in the metadata ordering, including deterministic selection behavior.

The fix has not yet been validated on a live EC2 instance with multiple attached ENIs where the primary interface is not enumerated first by IMDS. While the logic follows documented EC2 metadata semantics (network-card == 0, device-number == 0) and matches observed metadata layouts, confirmation on a real EC2 datasource would further validate the behavior under actual IMDS timing and attachment conditions.

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@holmanb
Member

holmanb commented Jan 5, 2026

Hi @MoeSalah1999, thanks for taking a look at this. I think that there is an issue with this approach. The problem in #6618 occurs before this code runs - it is the ephemeral interface code that fails, which occurs before the metadata is queried.

A change like this might still be desirable as a part of the solution for #6618, but as it is I do not think that this is complete.

Either way, we will need testing on the specific interface type to verify the solution to this problem.

@holmanb holmanb self-assigned this Jan 5, 2026
@holmanb holmanb added the incomplete Action required by submitter label Jan 5, 2026
@MoeSalah1999
Contributor Author

@holmanb Thanks for the clarification — that makes sense.

I agree that this helper is, at best, a partial improvement that may still be useful later in the pipeline once metadata is available, but not a complete fix on its own.

Unfortunately I wasn’t able to reproduce the issue locally or on a real EC2 instance with the specific multi-NIC + ephemeral interface configuration described in the report. As you noted, proper validation will require testing against that interface type.

Please let me know if you’d prefer this PR to be parked as a partial improvement, extended to explore the earlier ephemeral interface code path, or closed until testing on the specific EC2 configuration can be performed.

@MoeSalah1999
Contributor Author

MoeSalah1999 commented Jan 6, 2026

@holmanb I’m going to trace the ephemeral interface heuristic and look at deferring or softening classification until metadata is available. I’ll follow up with a revised approach.

@MoeSalah1999 MoeSalah1999 force-pushed the ec2-slow-boot branch 4 times, most recently from 3598b59 to 4191bce January 10, 2026 13:09
