@ppoage ppoage commented Jan 1, 2026

To be edited~

  • Code is formatted (go fmt ./...)
  • Linter passes (golangci-lint run)
  • All tests pass with race detector (go test -race ./...)
  • Benchmarks don't regress (FFI overhead < 200ns)
  • New code has tests (minimum 70% coverage, current: 89.1%)
  • Platform-specific code tested on target OS
  • Assembly changes validated on real hardware
  • Documentation updated (if applicable)
  • Commit messages follow conventions
  • No sensitive data (credentials, tokens, etc.)

Note: I added some checks and panics because we were hitting segfaults that were very hard to pin down, silent failures that caused corruption later on, and race conditions. At one point, running with -race actually stopped the segfault.

@ppoage ppoage requested a review from kolkov as a code owner January 1, 2026 02:40
@ppoage ppoage changed the title Arm64 debug ARM64 Darwin Fixes Jan 1, 2026

kolkov commented Jan 1, 2026

Hey @ppoage, thanks for the comprehensive work on ARM64!

Quick note: we merged v0.3.6 on Dec 29 with similar HFA/sret fixes, so there's a struct layout conflict now:

v0.3.6 layout:

  • r8 (X8 sret) at offset 184 (end of struct)
  • Separate fr1-fr4 for D0-D3 returns (offsets 152-176)

Your layout:

  • sret at offset 72 (shifts all float offsets)
  • Reuses f1-f4 for returns
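A layout conflict of this kind can be caught at test time by asserting the Go struct offsets that the assembly relies on. The following is a minimal sketch, not the project's actual struct: `callArgs` and its opaque `prefix` field are hypothetical, and only the offsets quoted above (fr1-fr4 at 152-176, r8 at 184) come from this thread.

```go
package main

import (
	"fmt"
	"unsafe"
)

// callArgs is a hypothetical stand-in for the call-frame struct shared by Go
// and the ARM64 assembly trampoline. The prefix is an assumption that pads the
// struct so the named fields land on the offsets quoted in the review.
type callArgs struct {
	prefix [19]uint64 // 152 bytes of fields not described in this thread
	fr1    uint64     // D0 return bits, offset 152
	fr2    uint64     // D1 return bits, offset 160
	fr3    uint64     // D2 return bits, offset 168
	fr4    uint64     // D3 return bits, offset 176
	r8     uintptr    // X8 sret pointer, offset 184
}

// layoutOK reports whether the Go struct still matches the offsets that the
// assembly is assumed to hard-code.
func layoutOK() bool {
	var a callArgs
	return unsafe.Offsetof(a.fr1) == 152 &&
		unsafe.Offsetof(a.fr2) == 160 &&
		unsafe.Offsetof(a.fr3) == 168 &&
		unsafe.Offsetof(a.fr4) == 176 &&
		unsafe.Offsetof(a.r8) == 184
}

func main() {
	fmt.Println("layout matches assembly offsets:", layoutOK())
}
```

Moving a field such as the sret pointer to the middle of the struct would shift every later offset, and a check like this would fail before the assembly read the wrong slot.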

What we'd like to adopt from your PR:

  • r2 return (X1) — we're missing this for 9-16 byte structs
  • darwin example update

Issues:

  • .DS_Store file — should not be committed
  • Inline HFA detection in call_unix.go — we use classification.go

Options:

  1. Rebase on current main and adapt to v0.3.6 struct layout
  2. Or we can cherry-pick the r2 handling into v0.3.7

Preferred: Option 1 — please rebase on main, adapt to our struct layout, and force-push. This keeps the fix properly attributed to you.

Note: This is preliminary feedback — I'm still out of town without my laptop. Will do a deeper review when I'm back.

Will review wgpu/naga/gogpu PRs next — the type-safe ObjC messaging looks critical and great!

…ests

Changes
- Return path: read X1 for 9–16 byte struct returns; interpret D0–D3 as raw bits for float/HFA.
- Arg path: AAPCS64 mixed struct packing into GPR/FPR; HFA structs routed to FP regs; by‑ref fallback for >16 bytes.
- Classification: nested HFA detection; shared struct‑reg counting for mixed int/float.
- Example: darwin uses libSystem puts with correct return type and arg storage.
- Ignore Go build cache (.gocache/).

Fixes
- 9–16 byte struct returns now capture X0+X1.
- Correct float/HFA return decoding and mixed struct arg packing on ARM64.

Tests
- Add nested HFA cases and struct‑arg packing tests.
- Add Darwin ObjC/CoreGraphics FFI coverage.
- Refresh callback/coverage/benchmark/ffi tests.

All tests pass on Darwin. Current benchmark stats: M3 Pro

BenchmarkGoffiOverhead: 58 ns/op
BenchmarkGoffiIntArgs: 67 ns/op
BenchmarkGoffiStringOutput: crashes
BenchmarkGoffiMultipleArgs: 73 ns/op
BenchmarkDirectGo: 9 ns/op
BenchmarkPrepareCallInterface: 26 ns/op
BenchmarkLoadLibrary: 27 us/op
BenchmarkGetSymbol: 498 ns/op

BenchmarkPrepCIF: 7 ns/op
BenchmarkNewCallback: 12 ns/op
BenchmarkCallbackInvoke: 146 ns/op
BenchmarkCallbackFloat: 148 ns/op

ppoage commented Jan 2, 2026

Ok, rebased.

I have some extra changes in this rebase, but most of them should be necessary. After the initial ObjC FFI fix (HFA, etc.), I had to fix nested struct handling, because that was the root cause of why we couldn't send or receive the window size. I was then seeing some odd behavior between floats and ints, so I added mixed int/float support.

FYI, BenchmarkGoffiStringOutput segfaults on macOS. It looks like the benchmark passes the string pointer itself as the argument rather than a pointer to it. I didn't want to modify your benchmark, but it does run once that's fixed.

Edit: Also, while trying to add textures to Metal, I discovered that oversized structs (size > 16 bytes) aren't being handled properly. I fixed that and added a test for coverage (which now passes). That will come in a future commit.
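For readers unfamiliar with the rule behind the oversized-struct case, the sketch below shows the AAPCS64 decision in simplified form: HFAs go in floating-point registers, other structs of up to 16 bytes go in one or two general-purpose registers, and anything larger is copied by the caller and passed as a pointer. `classifyStructArg` and `passKind` are hypothetical names, and register exhaustion is deliberately ignored.

```go
package main

import "fmt"

type passKind int

const (
	passInFPRegs passKind = iota // HFA: one FP register per member
	passInGPRegs                 // up to 16 bytes: one or two X registers
	passByRef                    // larger: caller copies, passes a pointer
)

// classifyStructArg returns how a struct argument is passed and how many
// registers it uses. hfaMembers is the HFA member count, or 0 if the struct is
// not an HFA. The HFA check comes first because an HFA of four float64 values
// is 32 bytes and is still passed in registers.
func classifyStructArg(size uintptr, hfaMembers int) (passKind, int) {
	switch {
	case hfaMembers >= 1 && hfaMembers <= 4:
		return passInFPRegs, hfaMembers
	case size <= 16:
		return passInGPRegs, int((size + 7) / 8)
	default:
		return passByRef, 1 // the pointer itself occupies one X register
	}
}

func main() {
	kind, regs := classifyStructArg(24, 0)
	fmt.Println(kind == passByRef, regs)
}
```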

…ent classification for ARM64. Added asm shim to test function call through ABI interface.

ppoage commented Jan 2, 2026

Ok, I fixed the oversized struct handling. I also added an assembly shim for testing, to verify that the whole chain through the FFI call doesn't mangle or drop anything.

FYI, the current FFI doesn't accept extra stack-based arguments. I'm seeing whether I can add the texture pipeline for Metal, but the current ABI implementation isn't complete and I don't know which approach you'd prefer to take here.
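To make the stack-argument limitation concrete: on ARM64 the first eight integer or pointer arguments travel in X0-X7, and any further ones are written to the stack. The following sketch only computes the assignment and assumes every argument is 8 bytes wide; `assignIntArgs` is a hypothetical helper, not part of the library.

```go
package main

import "fmt"

const numIntArgRegs = 8 // X0-X7

// assignIntArgs returns, for n 8-byte integer or pointer arguments, how many
// are passed in registers and the stack byte offset of each remaining one.
func assignIntArgs(n int) (inRegs int, stackOffsets []int) {
	for i := 0; i < n; i++ {
		if i < numIntArgRegs {
			inRegs++
			continue
		}
		stackOffsets = append(stackOffsets, (i-numIntArgRegs)*8)
	}
	return inRegs, stackOffsets
}

func main() {
	inRegs, offsets := assignIntArgs(10)
	fmt.Println(inRegs, offsets)
}
```

The 8-byte assumption matters: Apple's arm64 convention packs smaller stack arguments to their natural alignment rather than into 8-byte slots, so a complete implementation would need per-type sizes.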

kolkov commented Jan 2, 2026

Thanks for the excellent rebase! I've done a deep review and the PR looks great.

Key improvements I verified:

  • Struct layout compatible with v0.3.6 (r8@184, fr1-fr4@152-176)
  • r2 (X1) return for 9-16 byte structs - this was missing in v0.3.6
  • uint64 bit patterns instead of float64 - cleaner for mixed float32/float64
  • Nested struct handling via placeStructRegisters()
  • Mixed int/float struct support via countStructRegUsage()
  • Comprehensive test coverage (darwin_objc_test.go, abi_capture_test.s)
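The point about uint64 bit patterns can be shown in a few lines. A D register image stored as raw bits can be decoded as either float32 or float64 once the return type is known, whereas storing it as float64 up front misreads float32 results. This is a minimal sketch with hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// decodeFloat32 interprets the low 32 bits of a D register image as float32.
func decodeFloat32(d uint64) float32 {
	return math.Float32frombits(uint32(d))
}

// decodeFloat64 interprets all 64 bits of a D register image as float64.
func decodeFloat64(d uint64) float64 {
	return math.Float64frombits(d)
}

func main() {
	// 0x3FC00000 is the float32 encoding of 1.5.
	raw := uint64(0x3FC00000)
	fmt.Println(decodeFloat32(raw))        // decoded with the correct type
	fmt.Println(decodeFloat64(raw) == 1.5) // same bits read as float64
}
```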

One small fix needed for the benchmark:

In ffi/benchmark_test.go line ~152 (darwin case for BenchmarkGoffiStringOutput), the fix should be:

// args contains pointers to argument storage, not the values themselves
_ = CallFunction(cif, sym, unsafe.Pointer(&result), []unsafe.Pointer{unsafe.Pointer(&strPtr)})

Once that's fixed, I'll approve and merge. Great work on the ARM64 darwin support!

Re: stack-based extra arguments - that's a known limitation, we can address it in a future PR.

benchmark results: M3 Pro
BenchmarkGoffiStringOutput: 64 ns/op

ppoage commented Jan 2, 2026

Fix for the benchmark pushed!

Note: I haven't run any of the tests on other OSes. I assume they pass, since ARM64 was the only target that changed.

@kolkov kolkov left a comment

Thanks for the quick fix! Benchmark fix looks correct.

Verified:

  • strPtr now passed as unsafe.Pointer(&strPtr) - correct API usage
  • 64 ns/op on M3 Pro - excellent performance

Approving. Will merge once CI passes.

@kolkov kolkov merged commit 90eca7a into go-webgpu:main Jan 3, 2026
8 checks passed
kolkov added a commit that referenced this pull request Jan 3, 2026
- Update platform support table (macOS ARM64 now tested on M3 Pro)
- Add v0.3.7 changelog entry with PR #9 changes
- Update roadmap: ARM64 support marked as released
- Credit @ppoage for comprehensive ARM64 darwin fixes