This project uses Large Language Models (LLMs) to generate code for medical image segmentation tasks based on U-Net architectures. We compare how well various LLMs generate code for U-Net segmentation models and track both the interactions with each LLM and the results on multiple datasets. The models are evaluated on their ability to generate functional code out of the box, segmentation accuracy, and error frequency during development. Additionally, reasoning vs. non-reasoning models and 2024 vs. 2025 model generations are compared.
This repository explores how Large Language Models (LLMs) perform when tasked with generating code for medical image segmentation. We used eight state-of-the-art LLMs from 2024, including:
- GPT-4, 4o and o1 Preview (closed source)
- Claude 3.5 Sonnet (closed source)
- Gemini 1.5 Pro (closed source)
- GitHub Copilot (closed source)
- Bing Microsoft Copilot (closed source)
- LLAMA 3.1 405B (open source)
with GPT o1-preview being the only reasoning model, as well as eleven state-of-the-art LLMs from 2025, including the closed-source models:
- Claude 4 Sonnet
- DeepSeek R1 and V3
- GPT o3 and o4-mini-high
- Gemini 2.5 Pro
- Grok 3 mini and Grok 3
and the open source models:
- Llama 4 Maverick
- Mistral Medium 3
- Qwen 3_235B
with DeepSeek V3 and Grok 3 (without thinking mode enabled) being the only non-reasoning models. We compare these models for U-Net-based semantic segmentation on several medical image datasets, focusing on ease of generation, the minimum required modifications, and error frequency, and we compare performance across the different LLM-generated models. Each model was given the same engineered prompt and asked to generate a dataloader, training, model, and main script. Errors were fed back to the model, and the suggested fixes were applied to debug the code until it ran error-free.
We use the Dice coefficient as the key evaluation metric for segmentation quality, and track error rates, error types, and the interactions required with each LLM to get the code running. The final models are evaluated on three datasets to measure their performance on real-world medical image segmentation tasks.
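For reference, a minimal sketch of how a Dice coefficient for binary masks can be computed in PyTorch is shown below; the function name and threshold are illustrative and not taken from the generated scripts:

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor,
                     threshold: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    # Dice = 2 * |A ∩ B| / (|A| + |B|) for binary segmentation masks.
    pred = (pred > threshold).float()   # binarize predicted probabilities
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```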
For benchmarking model performance, we used nnU-Net v2 as a well-established baseline in biomedical image segmentation. nnU-Net provides an automated, self-configuring framework that enables fair and reliable comparisons across various segmentation tasks (Isensee et al., 2021).
- Multiple LLM Comparisons: Assess how different LLMs perform in generating code for a U-Net-based segmentation model.
- Engineered Prompt: A detailed prompt to generate the dataset, train, model and main scripts is provided in the Prompt_final.txt file, and can be modified and tailored based on needs or to test any additional LLM.
- Medical Datasets: Utilized real 2D medical image datasets for training and evaluation.
- Generated U-Net Architectures: Each U-Net model is trained using the out-of-the-box architecture generated by the different LLMs.
- Error Tracking: Record the number of errors, error types and debugging steps for each LLM.
- Convergence Evaluation: Compare convergence speed and behaviour of models on different datasets.
- Dice Score Evaluation: Compare Dice scores for each model on different datasets.
- Run-to-run Variability Testing: Compare output consistency of the same model under multiple runs of the same prompt.
- Cohort Comparison and Statistical Analysis: Compare performance of 2024 vs. 2025 models, as well as reasoning vs. non-reasoning models, on the same datasets.
Ensure you have the following installed:
- Python 3.6 or higher
- PyTorch
- CUDA (optional for GPU support)
- Clone the repository:

```bash
git clone https://github.com/ankilab/LLM_based_Segmentation.git
cd LLM_based_Segmentation
```

- Install the required packages:

```bash
pip install -r requirements.txt
```
You can train and evaluate the models using the scripts and main.py provided for each LLM-generated architecture in its respective folder. The prompt text in the Prompt_final.txt file can be used directly, or tailored as needed, to generate the dataset, train, model and main scripts with any other LLM. The dataloader in Dataset.py can also be modified to match dataset requirements. To compare performance across models, the scripts in Models Comparison can be used to visualize and compare the validation and test Dice scores, as well as the train and validation losses across models, and to run inference on a single example image from the dataset for each model.
- The full chat transcript with each LLM, from prompt input to final error correction, is included as a .json file in each LLM's respective folder.
- The prompt engineering process documentation is included as a .docx file in the Models Comparison/Results folder.
- The tables for the initial comparison of all features of the dataloader, model, train and main scripts, as well as the architectures and hyperparameters across models, are included as Excel sheets in the Models Comparison/Results folder.
This project uses the following LLMs to generate U-Net architectures:
- GPT-4 (closed source)
- GPT-4o (closed source)
- Claude 3.5 Sonnet (closed source)
- LLAMA 3.1 405B (open source)
- Gemini 1.5 Pro (closed source)
- Bing Microsoft Copilot (closed source)
- GitHub Copilot (closed source)
- GPT-o1 Preview (closed source)
Key differences between these models include the number of encoder/decoder stages, the convolutional block design, use of batch normalization, skip connections, bottleneck size, and final activation functions, as well as the choice of hyperparameters such as number of epochs, batch size, and image resizing. These variations contribute to differences in segmentation performance and ease of model generation.
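To make these variations concrete, the following is a minimal, hypothetical sketch of the kind of convolutional block the generated U-Nets differ on (block depth and presence of BatchNorm); it is illustrative only and not taken from any particular LLM's output:

```python
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions; BatchNorm usage is one of the main points of variation across models."""
    def __init__(self, in_ch: int, out_ch: int, use_batchnorm: bool = True):
        super().__init__()
        layers = []
        for c_in, c_out in [(in_ch, out_ch), (out_ch, out_ch)]:
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1))
            if use_batchnorm:
                layers.append(nn.BatchNorm2d(c_out))
            layers.append(nn.ReLU(inplace=True))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```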
| Company | 2024 Models | 2025 Models | #Parameters | Open Source | Token Limit | Approx. Character Limit |
|---|---|---|---|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-o1-Preview (reasoning) | GPT o3, GPT o4-mini-high (reasoning) | GPT-4 ~1.8T, GPT-4o ~200B | No | 128K | ~512K |
| Anthropic | Claude 3.5 Sonnet | Claude 4 Sonnet | ~175B | No | 200K | ~800K |
| Google DeepMind | Gemini 1.5 Pro | Gemini 2.5 Pro | ~200B | No | 1M | ~4M |
| GitHub | Copilot | – | – | No | – | – |
| Microsoft | Bing Microsoft Copilot | – | – | No | – | – |
| Meta (LLaMA) | Llama 3.1 | Llama 4 Maverick | 405B | Yes | 1M | ~4M |
| DeepSeek | – | DeepSeek V3 (non-reasoning), DeepSeek R1 (reasoning) | 671B | Yes | – | – |
| xAI | – | Grok 3 mini (reasoning), Grok 3 (non-reasoning) | 2.7T | No | 128K | ~512K |
| Mistral AI | – | Mistral Medium 3 | – | Yes | – | – |
| Qwen (Alibaba) | – | Qwen 3 235B | 235B | Yes | 128K | ~512K |
Table: Comparison of the selected LLMs by input character limit, token limit, and open-source status.
We evaluated the LLM-generated models on six 2D medical image segmentation datasets:
- the Benchmark for Automatic Glottis Segmentation (BAGLS), containing endoscopic video frames from laryngoscopy exams annotated for glottis segmentation;
- an internal Bolus Swallowing dataset (without pathological conditions), containing videofluoroscopic swallowing studies annotated for bolus segmentation;
- a Brain Tumor MRI dataset of single-slice grayscale MR images with segmentation masks for Glioma, Meningioma, and Pituitary tumors, of which we only used the Meningioma images and their corresponding masks;
- the HAM10000 ISIC Skin Cancer dataset, consisting of dermoscopic RGB images annotated for skin lesion segmentation;
- the UMD Uterine Myoma MRI dataset, comprising pelvic MRI scans labeled "3" for uterine myoma;
- and a combined Retinal Vessel segmentation dataset.
Collectively, the datasets encompass a wide variety of imaging modalities (endoscopy, fluoroscopy, dermoscopy, MRI, and fundus imaging) and segmentation difficulties, spanning fine anatomical structures, large irregular regions, and motion-induced artifacts. This breadth enables a robust evaluation of the ability of LLM-generated pipelines to generalize across modality-dependent preprocessing steps, network architectures, and segmentation tasks.
All datasets were preprocessed to fit the input requirements of the LLM-generated models. A total of 5002 images were selected for the BAGLS, Swallowing and Skin Cancer datasets, 999 images for the Brain Tumor dataset, 5457 images for the Uterine Myoma dataset, and 73 retinal fundus images for the Retinal Vessel dataset. Images were resized and normalized to [0,1] in the LLM-generated code. Each dataset was split into training (80%), validation (10%), and test (10%) sets, with the splitting method chosen by each LLM. The batch size, learning rate, and number of epochs for training were also determined by the LLM. For a fair evaluation of model performance, a separate, randomly selected test subset (10%) was held out from each dataset prior to any training, on which all models were ultimately re-evaluated for comparison.
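A minimal sketch of the common preprocessing and splitting logic described above (resizing, [0,1] normalization, 80/10/10 split) is given below; the exact implementations differ per LLM, and the function names, OpenCV resizing, and fixed seed are assumptions for illustration:

```python
import cv2
import torch
from torch.utils.data import random_split

def preprocess(image, size: int = 256) -> torch.Tensor:
    # Resize a NumPy image and scale it to [0, 1]; most generated pipelines used 256x256 inputs.
    image = cv2.resize(image, (size, size))
    tensor = torch.from_numpy(image).float() / 255.0
    # Grayscale images get a channel axis; RGB images are reordered to C x H x W.
    return tensor.unsqueeze(0) if tensor.ndim == 2 else tensor.permute(2, 0, 1)

def split_dataset(dataset, seed: int = 42):
    # 80% / 10% / 10% train/val/test split with a fixed seed for reproducibility.
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    lengths = [n_train, n_val, n - n_train - n_val]
    return random_split(dataset, lengths, generator=torch.Generator().manual_seed(seed))
```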
- Model Architecture and Hyperparameter Differences:
For the 2024 models:
| Feature | GPT-4 | GPT-4o | GPT-o1 | Claude 3.5 | Copilot | Bing Copilot | LLAMA 3.1 | Gemini 1.5 Pro |
|---|---|---|---|---|---|---|---|---|
| Batch Size | 16 | 8 | 16 | 16 | 16 | 16 | 32 | 8 |
| Epochs | 25 | 25 | 10 | 50 | 25 | 25 | 10 | 20 |
| Optimizer and Learning Rate | Adam (lr=0.001) | Adam (lr=1e-4) | Adam (lr=1e-4) | Adam (lr=0.001) | Adam (lr=1e-4) | Adam (lr=1e-4) | Adam (lr=0.001) | Adam (lr=1e-4) |
| Loss Function | BCEWithLogitsLoss | BCELoss | BCELoss | BCELoss | BCELoss | BCELoss | BCELoss | BCELoss |
| Image Size | 256x256 | 256x256 | 256x256 | 256x256 | 128x128 | 256x256 | 256x256 | 256x256 |
| Number of Encoder Stages | 4 | 5 | 4 | 4 | 4 | 4 | 3 | 4 |
| Number of Decoder Stages | 3 | 4 | 4 | 4 | 4 | 4 | 3 | 4 |
| Convolutional Block | Double Conv (Conv2d + ReLU ×2) | Conv2d + BatchNorm2d + ReLU ×2 | Double Conv (Conv2d + BatchNorm + ReLU ×2) | Double Conv (Conv2d + BatchNorm2d + ReLU ×2) | Conv2d + BatchNorm2d + ReLU ×2 | Conv2d + ReLU ×2 | Conv2d + ReLU | Conv2d + BatchNorm2d + ReLU ×2 |
| Bottleneck Layer | 512 channels | 1024 channels | 1024 channels | 1024 channels | 512 channels | 512 channels | 256 channels | 1024 channels |
| Final Layer | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) | Conv2d(64, 1, 1) |
| Encoder Channels | 64, 128, 256, 512 | 64, 128, 256, 512, 1024 | 64, 128, 256, 512, 1024 | 64, 128, 256, 512, 1024 | 64, 128, 256, 512 | 64, 128, 256, 512 | 64, 128, 256 | 64, 128, 256, 512 |
| Decoder Channels | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 256, 128, 64 | 512, 256, 128, 64 |
| Total Trainable Model Parameters | 7,696,193 | 31,042,369 | 31,042,369 | 31,042,369 | 6,153,297 | 6,147,659 | 533,953 | 31,042,369 |
| Total Training Time (sec) | 1474.03 | 949.33 | 780.13 | 5285.17 | 184.16 | 497.04 | 234.27 | 1066.50 |
Table: Architectural and training configurations of 2024 LLM–generated segmentation pipelines.
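The trainable parameter counts reported in the table can be reproduced for any of the generated models with a short helper of the following form (the helper name and the `unet` variable in the usage comment are illustrative):

```python
def count_trainable_parameters(model) -> int:
    # Sum over all parameters that require gradients,
    # matching the "Total Trainable Model Parameters" row in the table.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example usage (hypothetical model instance):
# n_params = count_trainable_parameters(unet)
```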
And for the 2025 models:
| Feature | GPT o3 | GPT o4-mini-high | DeepSeek V3 | DeepSeek R1 | Claude 4 Sonnet | Gemini 2.5 Pro | Grok 3 mini R | Grok 3 | LLaMA 4 Maverick | Mistral Medium 3 | Qwen 3 235B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Batch Size | 8 | 8 | 8 | 8 | 8 | 8 | 16 | 16 | 32 | 16 | 16 |
| Epochs | 25 | 50 | 25 | 30 | 50 | 25 | 20 | 20 | 10 | 50 | 20 |
| Optimizer & LR | Adam (1e-3) | Adam (1e-4) | Adam (1e-3) | Adam (1e-3, wd=1e-5) | Adam (1e-4, wd=1e-5) | Adam (1e-4) | Adam (1e-4) | Adam (1e-3) | Adam (1e-3) | Adam (1e-3) | Adam (1e-3) |
| Loss Function | BCEWithLogits | BCEWithLogits | BCE + Dice | Dice + BCE (70/30) | Combined Loss (α = 0.5) | BCE + Dice | BCEWithLogits | BCE | BCE | BCE | BCEWithLogits |
| # Encoder Stages | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| # Decoder Stages | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 | 4 |
| Conv Block | (Conv → BN → ReLU) ×2 | DoubleConv | DoubleConv | DoubleConv | DoubleConv | DoubleConv | DoubleConv | DoubleConv | DoubleConv | DoubleConv | CBR |
| Bottleneck | 256 → 512 ch | 1024 ch | 512 ch | 1024 ch | 1024 ch | 1024 ch | 512 → 1024 → 1024 ch | 512 → 1024 ch | 256 → 512 ch | 512 → 512 ch | 512 → 1024 ch |
| Final Layer | Conv2d (32 → 1) | Conv2d (64 → 1) | Conv2d (64 → 1) | Conv2d (64 → 1) | Conv2d (64 → 1) | Conv2d (64 → 1) | Conv2d (64 → 1) + Sigmoid | Conv2d (64 → 1) + Sigmoid | Conv2d (64 → 1) + Sigmoid | 1×1 Conv (64 → 1) + Sigmoid | 1×1 Conv (64 → 1) |
| Encoder Channels | 32, 64, 128, 256 | 64, 128, 256, 512 | 64, 128, 256, 512, 512 | 64, 128, 256, 512, 1024 | 64, 128, 256, 512, 1024 | 64, 128, 256, 512 | 64, 128, 256, 512 | 64, 128, 256, 512 | 64, 128, 256, 512 | 64, 128, 256, 512 | 1, 64, 128, 256, 512 |
| Decoder Channels | 256, 128, 64, 32 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 256, 128, 64 | 512, 256, 128, 64 | 1024, 512, 256, 128, 64 |
| Total Parameters | 7.8 M | 31.0 M | 31.4 M | 31.4 M | 31.0 M | 17.3 M | 31.0 M | 31.0 M | 7.7 M | 13.4 M | 31.0 M |
| Training Time (mean ± std) | 0:40:44 ± 14.0 min | 1:06:08 ± 23.4 min | 0:26:34 ± 16.4 min | 0:43:49 ± 25.6 min | 1:30:01 ± 32.7 min | 0:29:21 ± 18.1 min | 0:24:21 ± 14.9 min | 0:35:13 ± 13.8 min | 0:12:24 ± 7.8 min | 0:54:15 ± 41.8 min | 0:22:48 ± 11.3 min |
Table: Architectural and training configurations of 2025 LLM–generated segmentation pipelines.
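Several of the 2025 pipelines combined binary cross-entropy with a Dice term; a minimal sketch of such a combined loss is shown below, with the weighting and smoothing constants chosen purely for illustration rather than taken from any generated script:

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Weighted sum of BCEWithLogits and a soft Dice loss for binary segmentation."""
    def __init__(self, bce_weight: float = 0.5, smooth: float = 1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.bce_weight = bce_weight
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, target)
        probs = torch.sigmoid(logits)
        intersection = (probs * target).sum()
        dice = (2.0 * intersection + self.smooth) / (probs.sum() + target.sum() + self.smooth)
        # Weighted combination: bce_weight * BCE + (1 - bce_weight) * Dice loss.
        return self.bce_weight * bce + (1.0 - self.bce_weight) * (1.0 - dice)
```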
When the 2024 models were additionally asked to use TensorFlow instead of PyTorch, purely to compare whether the same architectures, losses, and optimizers would be chosen, the models generally used larger batch sizes and more epochs, prioritizing stability over speed. TensorFlow implementations frequently incorporated Dice Loss alongside Binary Cross-Entropy to handle class imbalance, whereas PyTorch versions relied primarily on Binary Cross-Entropy or BCEWithLogitsLoss. Additionally, TensorFlow models often included Batch Normalization in the convolutional blocks to improve convergence. Despite these variations, the core architecture remained consistent across frameworks.
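For illustration, here is a hypothetical Keras-style convolutional block of the kind the TensorFlow variants tended to produce (two Conv2D layers, each with BatchNormalization and ReLU); it is a sketch assuming 3x3 kernels and is not taken from any generated script:

```python
import tensorflow as tf

def conv_block(x, filters: int):
    # Two Conv2D layers, each followed by BatchNormalization and ReLU,
    # as the TensorFlow variants often generated (3x3 kernels assumed).
    for _ in range(2):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    return x
```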
- Model Architecture and Hyperparameter Differences when asked to use TensorFlow, for the 2024 models:
| Feature | GPT-4 | GPT-4o | GPT-o1 | Claude 3.5 Sonnet | Copilot | Bing Copilot | LLAMA 3.1 405B | Gemini 1.5 Pro |
|---|---|---|---|---|---|---|---|---|
| Batch Size | 32 | 16 | 16 | 16 | 16 | 8 | 32 | 16 |
| Epochs | 50 | 50 | 10 | 50 | 25 | 20 | 50 | 25 |
| Optimizer and Learning Rate | Adam, default learning rate (0.001) | Adam, Learning Rate: 0.001 | Adam, Default Learning Rate 0.001 (TensorFlow) | Adam, Learning Rate: 0.0001 | Adam (lr=1e-4) | Adam, Learning Rate: 0.0001 | Adam, Learning Rate: 0.0001 | Adam, Learning Rate: 0.0001 |
| Loss Function | Binary Crossentropy | Binary Cross-Entropy + Dice Loss | Combined Loss (Binary Cross-Entropy + Dice Loss) | Binary Cross-Entropy | BCELoss | Binary Cross-Entropy | Binary Cross-Entropy | Dice Loss |
| Image Size | 256x256 | (256, 256) | (256, 256) | (256, 256) | 128x128 | (256, 256) | (256, 256) | (256, 256) |
| Number of Encoder Stages | 4 (typical in U-Net) | 4 | 4 | 4 | 4 | 4 | 4 | 3 |
| Number of Decoder Stages | 4 (symmetrical to encoder) | 4 | 4 | 4 | 4 | 4 | 4 | 3 |
| Convolutional Block | 2 Conv2D + ReLU per block | 2 Conv layers + BatchNorm + ReLU | 2 Conv layers + ReLU Activation | 2 Conv layers + BatchNorm + ReLU | Conv2d + BatchNorm2d + ReLU ×2 | 2 Conv layers + BatchNorm + ReLU | 2 Conv layers + ReLU Activation | 2 Conv layers + ReLU Activation |
| Bottleneck Layer | Conv2D, depth varies | 1024 filters with Conv block | 1024 filters with Conv Block | 1024 filters with Conv Block | 512 channels | 1024 filters with Conv Block | 512 filters with Conv Block | 512 filters with Conv Block |
| Final Layer | 1x1 Conv2D, Sigmoid activation | 1 Conv2D with Sigmoid Activation | Conv2D (1 filter, Sigmoid Activation) | Conv2D (1 filter, Sigmoid Activation) | Conv2d(64, 1, 1) | Conv2D (1 filter, Sigmoid Activation) | Conv2D (1 filter, Sigmoid Activation) | Conv2D (1 filter, Sigmoid Activation) |
| Encoder Channels | [64, 128, 256, 512] | 64, 128, 256, 512, 1024 (in bottleneck) | 64, 128, 256, 512, 1024 (in bottleneck) | 64, 128, 256, 512, 1024 (in bottleneck) | 64, 128, 256, 512 | 64, 128, 256, 512, 1024 (in bottleneck) | 32, 64, 128, 256, 512 (in bottleneck) | 64, 128, 256, 512 (in bottleneck) |
| Decoder Channels | [256, 128, 64, 32] | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 512, 256, 128, 64 | 256, 128, 64, 32 | 256, 128, 64 |
Table: Comparison of Features Across Different 2024 LLM-based Models, when asked to use TensorFlow instead of PyTorch.
2024 LLM-Generated Models
- Architecture: Loosely followed a U-Net template, mostly with 4 encoder–decoder stages, DoubleConv blocks (BatchNorm used inconsistently), and ConvTranspose2d upsampling.
- Capacity: Bottleneck widths ranged from 256 to 1024 channels, with parameter counts spanning ~0.5M to over 31M, reflecting high architectural variability.
- Training Setup: All models relied on a simple binary cross-entropy (BCE) loss with relatively lightweight training configurations.
- Outcome: Faster training and simpler architectures, but lower overall segmentation performance and less architectural consistency.
2025 LLM-Generated Models
- Architecture: Strong convergence toward a standardized U-Net design with 256×256 inputs, four encoder–decoder stages (except LLaMA 4 Maverick with three stages), and DoubleConv blocks with BatchNorm and ReLU across all models.
- Capacity: Reasoning models consolidated around 1024-channel bottlenecks (~31M parameters), while lighter non-reasoning variants used 512-channel bottlenecks (~7–8M parameters). Qwen 3 235B uniquely employed CBR blocks.
- Training Setup: All models used the Adam optimizer, with heterogeneous learning rates (1e-3 vs. 1e-4), batch sizes ranging from 8 to 32, and epoch counts between 10 and 50. Loss functions diversified into BCE, BCE+Dice, and Dice-weighted combinations.
- Outcome: Higher architectural consistency, longer training times for high-capacity models, and improved final Dice scores, indicating a shift toward more expressive architectures and task-aware loss design.
Key Takeaway
- From 2024 to 2025, LLM-generated segmentation pipelines evolved from heterogeneous, lightweight designs to more consistent, high-capacity architectures with mixed loss functions, trading training efficiency for improved segmentation accuracy.
This comparative analysis highlights how architectural and hyperparameter choices across LLM-generated U-Net models impact computational complexity, model depth, and training efficiency. These differences provide insights into balancing model complexity with training duration, essential for selecting models suited to varying computational resources.
Errors and the interactions with the LLM to fix them were tracked and logged. Each error was fed back to the LLM and the suggested fix was applied, until the code ran through. For LLAMA 3.1 and Gemini 1.5 Pro, some additional explanation and input was needed to resolve certain errors. Claude, GPT-o1 Preview, GPT o3, Gemini 2.5 Pro, and Grok 3 had zero errors and ran successfully out-of-the-box without modifications, while others such as Gemini 1.5 Pro and LLAMA 3.1 required more bug fixes.
Figure: The graphs illustrate the validation loss per epoch for each model on the BAGLS dataset as an example, for the 2024 models (left) and 2025 models (right), with losses plotted on a logarithmic scale for improved clarity.
- Most models, such as Claude 3.5 Sonnet, GPT-o1 Preview, GPT-4o, and Gemini 1.5 Pro, exhibit rapid convergence in training loss within the first 10 epochs, indicating efficient learning.
- Copilot and Bing Microsoft Copilot show a slower and minimal reduction in training loss, which may suggest underfitting or optimization issues.
- The nnUnet baseline demonstrates a stable and low training loss across epochs, showcasing its reliability as a baseline model for comparison.
- Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-o1 Preview, and GPT o3 consistently achieve the lowest validation loss, suggesting strong generalization performance on the validation set.
- GPT-4 and GPT-4o display higher fluctuations in validation loss, possibly due to instability in the training process.
- Copilot and Bing Microsoft Copilot showed almost no convergence at very high loss values. LLAMA 3.1 405B displayed gradual convergence but remained less converged compared to other models.
- 2025 models and reasoning-enabled models showed faster convergence and lower final validation losses compared to the non-reasoning and 2024 models.
In summary, while all models demonstrate some level of convergence, the best convergence and lowest validation losses were achieved by GPT o4-mini-high, GPT o3 and o1, Claude 4 Sonnet, Gemini 2.5 Pro, Mistral Medium 3, and Qwen 3, all closely matching or surpassing nnU-Net behavior. In contrast, the DeepSeek models showed moderate convergence, LLaMA and Grok 3 mini were less stable, and several 2024 models, especially Bing and GitHub Copilot, exhibited poor or negligible convergence. (Note that the absolute y-values are not directly comparable, as the curves show values from different loss formulations; they should be interpreted only in terms of relative convergence speed and cohort-level trends.)
Although the number of epochs set by each LLM is itself a hyperparameter and part of the evaluation, for fairness we additionally evaluated all LLM-generated models under a fixed training budget (120 epochs on BAGLS). Reasoning-enabled 2025 models still converged faster and reached lower validation losses than the non-reasoning and 2024 models. The relative convergence behavior remained consistent with the results from the self-selected epoch schedules, indicating that differing epoch counts do not bias the overall performance comparison.
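A sketch of such a fixed-budget re-evaluation loop is shown below, under stated assumptions: the model, data loaders, loss criterion, and Adam learning rate are placeholders, and only the per-epoch validation loss is recorded for the convergence plots:

```python
import torch

def train_fixed_budget(model, train_loader, val_loader, criterion,
                       epochs: int = 120, lr: float = 1e-4):
    # Train every generated architecture for the same number of epochs so that
    # convergence curves are directly comparable across models.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    val_history = []
    for _ in range(epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        # Mean validation loss per epoch, used for the convergence plots.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / len(val_loader)
        val_history.append(val_loss)
    return val_history
```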
Figure: Validation loss per epoch for 2024 and 2025 LLM-generated segmentation pipelines, shown exemplarily on the BAGLS dataset for a constant number of epochs. The number of epochs initially chosen by each model is marked with a dot on the respective curve.

Dice score evaluation across six datasets shows clear gains from 2024 to 2025, with reasoning-enabled models consistently achieving higher and more stable segmentation accuracy. Greater variability in Dice scores was seen on the more complex Retina and Uterine Myoma datasets, reflecting the increased difficulty of fine, thin, or ambiguous structure boundaries.
Among the 2024 models, GPT-o1-Preview, Claude 3.5, and Gemini 1.5 Pro performed closest to nnU-Net, while Copilot, Bing Copilot, and LLaMA 3.1 lagged significantly. From the 2025 models, models such as GPT o3, GPT o4-mini-high, and Claude 4 Sonnet matched or occasionally surpassed nnU-Net, confirming that reasoning-augmented LLMs now deliver near–expert-level segmentation performance across modalities.
Prompting GPT-4o and GPT o4-mini-high ten times each to repeat the full pipeline generation showed that both models produced largely consistent architectures and training setups, with minor variations in epoch count, learning rate, and (for GPT o4-mini-high) network depth. While GPT-4o exhibited greater variability in training dynamics, occasional errors, and a wider spread in Dice scores, GPT o4-mini-high was more stable across runs, self-corrected errors more reliably, and achieved higher and more consistent segmentation performance. Overall, reasoning-enabled generation yielded more robust and reproducible pipelines despite small architectural variations.
Figure: Run-to-run model variability testing across 10 independent runs for one reasoning model (GPT o4-mini-high, red) and one non-reasoning model (GPT-4o, blue) on one exemplary dataset, showing validation losses per epoch.

Figure: Run-to-run model variability testing across 10 independent runs for one reasoning model (GPT o4-mini-high, red) and one non-reasoning model (GPT-4o, blue) on one exemplary dataset, showing test Dice scores across runs (sorted by medians), with nnU-Net as the baseline (gray).

The performance differences observed quantitatively in the loss and Dice score comparisons can also be seen in the inference visualizations of each model's predictions.
Figure: Inference prediction masks compared across the 2024 LLM-generated models along with the nnU-Net baseline. Each horizontal set shows the performance of one model on a sample image from each of the three datasets.
Figure: Inference prediction masks compared across the 2025 LLM-generated models along with the nnU-Net baseline. Each horizontal set shows the performance of one model on a sample image from each of the three datasets.
This project is licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.