
Remote/Colab server not publishing on port 5558 → Mac client crashes at first recv_json() with JSONDecodeError #796

@diegobaldin

Describe the bug

When running Avatarify with the Colab “remote server” + local Mac client setup, the Mac client connects to the ngrok endpoints successfully but then crashes immediately on the first metadata read:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Diagnosis shows the Colab server never publishes any JSON on its local 5558 (the “out” port). If I temporarily publish a fake JSON to tcp://127.0.0.1:5558 from Colab, the Mac client stops crashing — so the network path is fine and the failure is due to the server not emitting.

To Reproduce

Open the official Colab notebook: avatarify.ipynb (from this repo).

Run setup cells (install deps, download models) — no visible errors.

Start ngrok TCP tunnels (as in the notebook). Colab prints something like:

tcp://4.tcp.eu.ngrok.io:17400 -> 5557 [input]
tcp://5.tcp.eu.ngrok.io:17638 -> 5558 [output]
Tunnel opened

On the Mac (local client):

Confirm connectivity:

nc -vz 4.tcp.eu.ngrok.io 17400 # succeeds
nc -vz 5.tcp.eu.ngrok.io 17638 # succeeds

Run the client:

./run_mac.sh --is-client \
  --in-addr tcp://4.tcp.eu.ngrok.io:17400 \
  --out-addr tcp://5.tcp.eu.ngrok.io:17638

Observe the crash in the Mac client when it tries to read the first JSON message from the “out” socket.
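Because the ngrok hosts change on every session, one easy mistake is swapping or mistyping the two addresses. A minimal sketch of a helper that turns the tunnel lines printed by Colab into the client arguments; `client_args` is a hypothetical name, and the line format is assumed to match the sample output above (`tcp://host:port -> local_port [input|output]`).

```python
import re

# Matches lines such as: tcp://4.tcp.eu.ngrok.io:17400 -> 5557 [input]
TUNNEL_RE = re.compile(r"^(tcp://\S+)\s*->\s*(\d+)\s*\[(input|output)\]\s*$")


def client_args(tunnel_output: str) -> list[str]:
    """Map the [input]/[output] tunnel lines to --in-addr/--out-addr arguments."""
    addrs = {}
    for line in tunnel_output.splitlines():
        m = TUNNEL_RE.match(line.strip())
        if m:
            addrs[m.group(3)] = m.group(1)  # role -> public ngrok address
    if set(addrs) != {"input", "output"}:
        raise ValueError("expected one [input] and one [output] tunnel line")
    return ["--is-client", "--in-addr", addrs["input"], "--out-addr", addrs["output"]]
```

The returned list can be appended to `./run_mac.sh`, for example via `subprocess.run(["./run_mac.sh", *client_args(text)])`.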

Expected behavior

Once the Colab worker/render is started, the server should publish a first JSON metadata message on port 5558 so that the client’s initial recv_json() succeeds and the stream starts.

Actual behavior (Mac client)
[1759420307.292758] Loading Predictor
[1759420307.817827] Receiving from tcp://5.tcp.eu.ngrok.io:17638
[1759420323.067542] Trying camera with id 0
2025-10-02 12:52:04.342 python[...] WARNING: AVCaptureDeviceTypeExternal is deprecated...
[1759420325.931305] Trying camera with id 1
[1759420327.986817] Could not read from camera with id 1
[1759420328.020803] Trying camera with id 2
OpenCV: out device of bound (0-1): 2
OpenCV: camera failed to properly initialize!
[...]
Process recv_process:
Traceback (most recent call last):
  File ".../predictor_remote.py", line 181, in recv_worker
    msg = receiver.recv_data()
  File ".../networking.py", line 122, in recv_data
    md = self.recv_json(flags=flags) # metadata text
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

(The AVCapture/camera warnings on macOS look unrelated; they are deprecation notices. The crash happens in ZeroMQ's recv_json() because the first message is empty or absent.)

Server-side diagnostics (Colab)

Local subscribe to 5558 (out):

import zmq

ctx = zmq.Context.instance()
s = ctx.socket(zmq.SUB)
s.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to all topics
s.connect("tcp://127.0.0.1:5558")
s.RCVTIMEO = 5000  # give up after 5 s
print("Reading 5558 local...")
print(s.recv_json())  # <-- TIMES OUT with Again: Resource temporarily unavailable

Output:

Again: Resource temporarily unavailable

➜ Indicates that nothing is being published on 5558: either no publisher is bound, or one is bound but silent.
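The same probe can be wrapped so that a timeout is reported as a value instead of an exception. This is a minimal sketch, not part of Avatarify; `probe_endpoint` is a hypothetical helper. Note that a SUB socket only receives from a PUB-type peer, so this assumes the worker's output socket is a publisher. If the worker uses a different pattern (for example PUSH), the matching type (PULL) would have to be passed instead, and a SUB probe would see nothing even with a healthy worker.

```python
import zmq


def probe_endpoint(endpoint: str, timeout_ms: int = 5000, socket_type: int = zmq.SUB):
    """Return the first JSON frame seen on `endpoint`, or None if nothing arrives."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(socket_type)
    sock.setsockopt(zmq.LINGER, 0)  # do not block when closing
    if socket_type == zmq.SUB:
        sock.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to every topic
    sock.connect(endpoint)
    try:
        # poll() returns 0 when no message arrives within timeout_ms
        if sock.poll(timeout_ms, zmq.POLLIN) == 0:
            return None
        return sock.recv_json()
    finally:
        sock.close()
```

Calling `probe_endpoint("tcp://127.0.0.1:5558")` in a Colab cell while the worker cell is running would return the first metadata dict, or `None` after five seconds of silence.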

Control test (fake publisher) — success
If I temporarily publish JSON to 5558 from Colab:

import zmq, time
ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5558")
time.sleep(0.5)  # give subscribers time to connect
t0 = time.time()
while time.time() - t0 < 10:
    pub.send_json({"ok": True, "ts": time.time()})
    time.sleep(0.5)

…then the Mac client does not crash during those 10 seconds.
➜ Confirms network/tunnels/client are OK; the real worker/render is simply not publishing.

Environment

Server (Colab)

Google Colab, default Python reported in paths as 3.12 (linux x86_64)

Notebook: avatarify.ipynb from alievk/avatarify

ngrok TCP, EU region (transient hosts like 4.tcp.eu.ngrok.io:17400 → 5557, 5.tcp.eu.ngrok.io:17638 → 5558)

Client (Mac)

macOS on Apple Silicon (Homebrew path suggests arm64): /opt/homebrew/...

Conda env: avatarify, Python 3.10

Key packages (after adjusting old pins in requirements.txt):

opencv-python==4.12.0.88

pyzmq==25.1.2 (updated from 20.0.0 to avoid build issues)

PyYAML==6.0.2 (updated from 5.4 to avoid build issues)

Command:

./run_mac.sh --is-client --in-addr tcp://<ngrok_in> --out-addr tcp://<ngrok_out>

What I tried

Verified tunnels with nc -vz (both succeed).

Confirmed no JSON on Colab’s tcp://127.0.0.1:5558 (SUB with 5s timeout).

Published a fake JSON on 5558 to validate path — client then does not crash.

Re-ran notebook cells multiple times (setup → worker → tunnels), keeping the worker cell running.

Updated opencv, pyzmq, PyYAML pins on the Mac to versions with prebuilt wheels for Apple Silicon.

Tried different ngrok endpoints (EU TCP hosts).

Ensured client uses the exact in_addr (→5557) and out_addr (→5558) printed by Colab.

Hypotheses / Questions

Is the Colab notebook missing a step to start the worker/render that actually publishes to 5558? (e.g., choosing a source image and a driver/test video and pressing a “Start/Render” step?)

Is there a known race where the worker only starts publishing after a handshake on 5557, and the first publish is delayed enough to cause the immediate recv_json() failure?

Are there version constraints (Python/pyzmq) expected on the server that could prevent the worker from binding/publishing to 5558 silently?

Could the current Colab script be launching only the tunnels and not the worker process (e.g., missing ./run.sh --is-worker --in-port 5557 --out-port 5558)?
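The last hypothesis (tunnels up, worker not running) can be checked from a Colab cell without ZeroMQ: if no process is listening on 5557/5558, a plain TCP connect is refused. A minimal sketch using only the standard library; `port_is_listening` is a hypothetical helper, and it assumes the worker would bind those ports on the Colab VM itself.

```python
import socket


def port_is_listening(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connect to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise
        return s.connect_ex((host, port)) == 0
```

Running `port_is_listening("127.0.0.1", 5557)` and `port_is_listening("127.0.0.1", 5558)` separates the two cases: `False` suggests the worker never bound the port, while `True` with no messages points more toward the handshake/race hypothesis.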

Request

Guidance on the exact sequence required on Colab to ensure the worker publishes on 5558 (and keeps doing so).

Any known fixes (e.g., delay/retry before first recv_json(), explicit command to start worker) or supported versions for the Colab server side.

Minimal “health-check” snippet maintainers recommend to confirm the server is alive/publishing before connecting the client.
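As a starting point for the delay/retry idea, here is a minimal sketch of a guarded first read on the client side. It is not Avatarify's `recv_data()`; `recv_json_guarded` is a hypothetical helper that polls with a timeout, skips frames that are empty or not valid JSON, and returns `None` instead of raising. It reads only the metadata frame and discards any payload frames that follow, so it illustrates the retry logic rather than a drop-in replacement.

```python
import json

import zmq


def recv_json_guarded(sock: zmq.Socket, timeout_ms: int = 1000, retries: int = 10):
    """Wait for the first message whose leading frame parses as JSON, else return None."""
    for _ in range(retries):
        if sock.poll(timeout_ms, zmq.POLLIN) == 0:
            continue  # nothing arrived in this window; try again
        frames = sock.recv_multipart()  # consume the whole message
        try:
            return json.loads(frames[0].decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue  # empty or non-JSON frame; skip it instead of crashing
    return None
```

With the defaults the client would wait up to roughly ten seconds for valid metadata before giving up, which also makes it possible to log a clear "server not publishing" message instead of a traceback.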

Additional context (minor)

macOS logs show AVCaptureDeviceTypeExternal is deprecated... — appears unrelated.

Some “Trying camera with id …” messages on the client precede the crash — also likely unrelated to the missing first JSON.

Thanks a lot! Happy to provide a minimal repro notebook or run any extra diagnostics you suggest.
