Take a look at this model zoo. If you find the CoreML model you want, download it from the Google Drive link and bundle it in your project. If the model has a sample project link, try it to see how the model is used in a project. You are free to use them or not.
If you like this repository, please give me a star so I can do my best.
Stable Diffusion: text2image
You can get the model converted to CoreML format from the Google Drive link. See the section below for how to use it in Xcode. The license for each model conforms to the license of the original project.
| Google Drive Link | Size | Dataset | Original Project | License |
|---|---|---|---|---|
| Efficientnetb0 | 22.7 MB | ImageNet | TensorFlowHub | Apache2.0 |
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| Efficientnetv2 | 85.8 MB | ImageNet | Google/autoML | Apache2.0 | 2021 |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| VisionTransformer-B16 | 347.5 MB | ImageNet | google-research/vision_transformer | Apache2.0 | 2021 |
Local Features Coupling Global Representations for Visual Recognition.
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| Conformer-tiny-p16 | 94.1 MB | ImageNet | pengzhiliang/Conformer | Apache2.0 | 2021 |
Data-efficient Image Transformers
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| DeiT-base384 | 350.5 MB | ImageNet | facebookresearch/deit | Apache2.0 | 2021 |
Making VGG-style ConvNets Great Again
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| RepVGG-A0 | 33.3 MB | ImageNet | DingXiaoH/RepVGG | MIT | 2021 |
Designing Network Design Spaces
| Google Drive Link | Size | Dataset | Original Project | License | Year |
|---|---|---|---|---|---|
| regnet_y_400mf | 16.5 MB | ImageNet | TORCHVISION.MODELS | MIT | 2020 |
CVNets: A library for training computer vision networks
| Google Drive Link | Size | Dataset | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| MobileViTv2 | 18.8 MB | ImageNet | apple/ml-cvnets | apple | 2022 | |
| Download Link | Size | Output | Original Project | License | Note | Sample Project |
|---|---|---|---|---|---|---|
| dfine-n-coco | 13MB | Confidence(MultiArray (Float32 300 × 80)), Coordinates (MultiArray (Float32 300 × 4)) | Peterande/D-FINE | Apache 2.0 | Input 640×640. Coordinates are normalized cxcywh. No NMS — filter by confidence threshold. | peaceofcake DFINEDemo |
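
The note in the table (normalized cxcywh boxes, no NMS, filter by confidence) can be sketched as a small post-processing step. This is only an illustration in Python, assuming the confidence values are already probabilities and that the two outputs have been copied into plain nested lists; `decode_detections` and the 0.5 threshold are hypothetical names and values, not part of the model or the demo app.

```python
def decode_detections(confidence, coordinates, image_w, image_h, threshold=0.5):
    """Turn raw query outputs into pixel-space boxes.

    confidence:  one row per query, one score per class (e.g. 300 x 80)
    coordinates: one [cx, cy, w, h] row per query, normalized to 0..1
    """
    detections = []
    for scores, (cx, cy, w, h) in zip(confidence, coordinates):
        class_id = max(range(len(scores)), key=scores.__getitem__)
        score = scores[class_id]
        if score < threshold:
            continue  # no NMS here: a plain confidence cut is the only filter
        # center/size -> corner coordinates, then scale to pixels
        x1 = (cx - w / 2) * image_w
        y1 = (cy - h / 2) * image_h
        x2 = (cx + w / 2) * image_w
        y2 = (cy + h / 2) * image_h
        detections.append((class_id, score, (x1, y1, x2, y2)))
    return detections


# Two made-up queries over three classes.
conf = [[0.1, 0.8, 0.05], [0.2, 0.1, 0.3]]
coords = [[0.5, 0.5, 0.5, 0.25], [0.1, 0.1, 0.1, 0.1]]
result = decode_detections(conf, coords, image_w=640, image_h=640)
```

Only the first query survives the threshold; the second is dropped because its best score is below 0.5.
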
| Download Link | Size | Output | Original Project | License | Note | Sample Project |
|---|---|---|---|---|---|---|
| rfdetr-n-coco | 95MB | Confidence(MultiArray (Float32 300 × 91)), Coordinates (MultiArray (Float32 300 × 4)) | roboflow/rf-detr | Apache 2.0 | Input 384×384. 91 classes (index 0 = background, 1-90 = COCO category IDs). Coordinates are normalized cxcywh. No NMS. | peaceofcake DFINEDemo |
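
Because index 0 is background and indices 1-90 are COCO category IDs, the class lookup differs from an 80-class model. A minimal sketch of that indexing, assuming one row of 91 scores as a plain list; `best_category` and the partial name table are illustrative only. COCO category IDs are not contiguous, so a dictionary lookup is safer than indexing an 80-entry name array.

```python
BACKGROUND = 0  # per the table note, index 0 is background

# Partial, illustrative name table keyed by COCO category ID.
COCO_NAMES = {1: "person", 17: "cat", 18: "dog"}


def best_category(scores):
    """Return (coco_category_id, score) for one query row of 91 scores,
    skipping the background slot. Indices 1..90 are used directly as
    COCO category IDs."""
    best = max(range(BACKGROUND + 1, len(scores)), key=scores.__getitem__)
    return best, scores[best]


row = [0.0] * 91
row[0] = 0.9    # background, ignored
row[18] = 0.7   # COCO category ID 18
category_id, score = best_category(row)
name = COCO_NAMES.get(category_id, "unknown")
```
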
| Google Drive Link | Size | Output | Original Project | License | Note | Sample Project |
|---|---|---|---|---|---|---|
| YOLOv5s | 29.3MB | Confidence(MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/yolov5 | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5 |
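
"Non Maximum Suppression has been added" means the model already returns de-duplicated boxes, so no extra post-processing is needed. For readers unfamiliar with the step, here is a generic greedy NMS sketch in Python. It illustrates the idea only and is not the code embedded in the model; the 0.45 IoU threshold is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed and indices 0 and 2 remain.
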
| Google Drive Link | Size | Output | Original Project | License | Note | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|
| YOLOv7 | 147.9MB | Confidence(MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | WongKinYiu/yolov7 | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5 | |
| Google Drive Link | Size | Output | Original Project | License | Note | Sample Project |
|---|---|---|---|---|---|---|
| YOLOv8s | 45.1MB | Confidence(MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/ultralytics | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5 |
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. Uses PGI and GELAN architecture for efficient object detection.
| Download Link | Size | Output | Original Project | License | Year | Note | Sample Project |
|---|---|---|---|---|---|---|---|
| yolov9s.mlpackage.zip | 14 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | WongKinYiu/yolov9 | GPL-3.0 | 2024 | Non Maximum Suppression has been added. | YOLOv9Demo |
YOLOv10: Real-Time End-to-End Object Detection. NMS-free architecture using consistent dual assignments — no post-processing needed.
| Download Link | Size | Output | Original Project | License | Year | Note | Sample Project |
|---|---|---|---|---|---|---|---|
| yolov10s.mlpackage.zip | 14 MB | MultiArray (1 × 300 × 6) | THU-MIG/yolov10 | AGPL-3.0 | 2024 | NMS-free end-to-end detection. | YOLO26Demo |
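
Reading the single `1 × 300 × 6` output could look like the sketch below. The table does not state the meaning of the 6 values; the layout `[x1, y1, x2, y2, score, class_id]` used here is an assumption, so check it against the sample project before relying on it. `parse_end_to_end` and the 0.25 threshold are hypothetical.

```python
def parse_end_to_end(output, threshold=0.25):
    """output: nested list shaped [1][300][6].
    ASSUMED row layout: x1, y1, x2, y2, score, class_id."""
    rows = output[0]  # drop the batch dimension
    detections = []
    for x1, y1, x2, y2, score, class_id in rows:
        if score >= threshold:  # NMS-free: thresholding is the only filter
            detections.append({
                "box": (x1, y1, x2, y2),
                "score": score,
                "class_id": int(class_id),
            })
    return detections


# Two made-up rows instead of the full 300.
output = [[[10.0, 20.0, 110.0, 220.0, 0.9, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.01, 3.0]]]
dets = parse_end_to_end(output)
```
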
YOLO11: Ultralytics latest YOLO with improved backbone and neck architecture. 22% fewer parameters than YOLOv8 with higher mAP.
| Download Link | Size | Output | Original Project | License | Year | Note | Sample Project |
|---|---|---|---|---|---|---|---|
| yolo11s.mlpackage.zip | 18 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/ultralytics | AGPL-3.0 | 2024 | Non Maximum Suppression has been added. | YOLOv9Demo |
YOLO26: Edge-first vision AI with NMS-free end-to-end detection. Up to 43% faster CPU inference vs YOLO11 with DFL removal and ProgLoss.
| Download Link | Size | Output | Original Project | License | Year | Note | Sample Project |
|---|---|---|---|---|---|---|---|
| yolo26s.mlpackage.zip | 18 MB | MultiArray (1 × 300 × 6) | ultralytics/ultralytics | AGPL-3.0 | 2026 | NMS-free end-to-end detection. | YOLO26Demo |
YOLO-World: Real-Time Open-Vocabulary Object Detection. Type any text query and detect it — no fixed class list. Uses CLIP text encoder for open-vocabulary matching.
| Download Link | Size | Description | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|
| yoloworld_detector.mlpackage.zip | 25 MB | YOLO-World V2-S visual detector | AILab-CVC/YOLO-World | GPL-3.0 | 2024 | YOLOWorldDemo |
| clip_text_encoder.mlpackage.zip | 121 MB | CLIP ViT-B/32 text encoder | openai/CLIP | MIT | 2021 | — |
| clip_vocab.json.zip | 1.6 MB | BPE vocabulary for tokenizer | — | — | — | — |
YOLOE: Real-Time Open-Vocabulary Detection + Instance Segmentation. Detect and segment anything from a text query or a visual prompt (box an example object) — no fixed class list. Available in S (fast) and L (accurate). See YOLOEDemo for the region-embedding + MobileCLIP pipeline.
| Download Link | Size | Description | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|
| yoloe_detector_s.mlpackage.zip | 20 MB | YOLOE-11s-seg region-embedding detector + segmentation | THU-MIG/yoloe | AGPL-3.0 | 2025 | YOLOEDemo |
| yoloe_detector_l.mlpackage.zip | 54 MB | YOLOE-11l-seg region-embedding detector + segmentation | THU-MIG/yoloe | AGPL-3.0 | 2025 | YOLOEDemo |
| reprta_s.mlpackage.zip | 6 MB | YOLOE RepRTA text-refinement MLP (S) | THU-MIG/yoloe | AGPL-3.0 | 2025 | — |
| reprta_l.mlpackage.zip | 6 MB | YOLOE RepRTA text-refinement MLP (L) | THU-MIG/yoloe | AGPL-3.0 | 2025 | — |
| visual_prompt_encoder_s.mlpackage.zip | 20 MB | YOLOE SAVPE visual-prompt encoder (S): image + box → query | THU-MIG/yoloe | AGPL-3.0 | 2025 | YOLOEDemo |
| visual_prompt_encoder_l.mlpackage.zip | 54 MB | YOLOE SAVPE visual-prompt encoder (L): image + box → query | THU-MIG/yoloe | AGPL-3.0 | 2025 | YOLOEDemo |
| mobileclip_blt_text.mlpackage.zip | 121 MB | Apple MobileCLIP B-LT text encoder (shared) | apple/ml-mobileclip | Apple | 2024 | — |
| clip_vocab.json.zip | 1.6 MB | BPE vocabulary for tokenizer (shared) | — | — | — | — |
ByteTrack: Multi-Object Tracking by Associating Every Detection Box. Pure-Swift on-device tracker that adds persistent IDs on top of any detector above — an 8D Kalman filter plus two-stage IoU association, no appearance / ReID network.
| Implementation | Source | Paper | License | Year | Note | Sample Project |
|---|---|---|---|---|---|---|
| Pure Swift (no download) | Tracker.swift | ByteTrack (arXiv 2110.06864) | MIT (this port) / Original | 2022 | 8D Kalman + two-stage IoU association, class-aware, greedy matching, lost-track buffer of 30 frames. Drop-in on top of any [Detection] stream. | YOLO26Demo |
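
The two-stage IoU association with greedy matching can be sketched as follows. This is a simplified Python illustration, not a transcription of Tracker.swift: the Kalman prediction, class-aware gating, and lost-track buffer are left out, and the `high` and `min_iou` thresholds are assumed values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Pair track boxes with detection boxes, best IoU first."""
    pairs = sorted(
        ((iou(t, d), ti, di) for ti, t in tracks.items() for di, d in detections.items()),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # pairs are sorted, so nothing better follows
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


def associate(tracks, detections, high=0.5, min_iou=0.3):
    """tracks: {track_id: predicted box}; detections: list of (box, score).
    Stage 1 matches high-confidence detections; stage 2 gives the
    still-unmatched tracks a chance against low-confidence detections."""
    high_dets = {i: b for i, (b, s) in enumerate(detections) if s >= high}
    low_dets = {i: b for i, (b, s) in enumerate(detections) if s < high}
    first = greedy_match(tracks, high_dets, min_iou)
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low_dets, min_iou)
    return {**first, **second}


tracks = {1: (0, 0, 10, 10), 2: (100, 100, 120, 120)}
detections = [((1, 1, 11, 11), 0.9), ((101, 101, 121, 121), 0.2)]
matches = associate(tracks, detections)
```

Track 2 keeps its ID through the low-confidence detection in the second stage, which is the point of associating every detection box rather than discarding weak ones.
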
| Google Drive Link | Size | Output | Original Project | License |
|---|---|---|---|---|
| U2Net | 175.9 MB | Image(GRAYSCALE 320 × 320) | xuebinqin/U-2-Net | Apache |
| U2Netp | 4.6 MB | Image(GRAYSCALE 320 × 320) | xuebinqin/U-2-Net | Apache |
| Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| IS-Net | 176.1 MB | Image(GRAYSCALE 1024 × 1024) | xuebinqin/DIS | Apache | 2022 | |
| IS-Net-General-Use | 176.1 MB | Image(GRAYSCALE 1024 × 1024) | xuebinqin/DIS | Apache | 2022 | |
RMBG1.4 - The IS-Net enhanced with our unique training scheme and proprietary dataset.
| Download Link | Size | Output | Original Project | License | year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|
| RMBG_1_4.mlpackage.zip | 42 MB (INT8) | Alpha mask 1024x1024 | briaai/RMBG-1.4 | Creative Commons | 2024 | RMBGDemo | convert_rmbg.py |
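
Once the alpha mask is available, removing the background is a per-pixel blend. The sketch below assumes the mask has already been resized to the image size and scaled to 0..1; `composite` is a hypothetical helper written with plain lists for clarity, not the code used in RMBGDemo.

```python
def composite(foreground, alpha, background=(255, 255, 255)):
    """Blend each RGB pixel over a solid background colour.

    foreground: rows of (r, g, b) tuples
    alpha:      rows of floats in 0..1, same size as foreground
    """
    out = []
    for fg_row, a_row in zip(foreground, alpha):
        out.append([
            tuple(round(a * f + (1 - a) * b) for f, b in zip(px, background))
            for px, a in zip(fg_row, a_row)
        ])
    return out


image = [[(200, 0, 0), (0, 200, 0)]]
mask = [[1.0, 0.0]]          # keep the first pixel, drop the second
result = composite(image, mask)
```
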
| Google Drive Link | Size | Output | Original Project | License | Sample Project |
|---|---|---|---|---|---|
| face-Parsing | 53.2 MB | MultiArray(1 x 512 × 512) | zllrunning/face-parsing.PyTorch | MIT | CoreML-face-parsing |
Simple and Efficient Design for Semantic Segmentation with Transformers
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| SegFormer_mit-b0_1024x1024_cityscapes | 14.9 MB | MultiArray(512 × 1024) | NVlabs/SegFormer | NVIDIA | 2021 |
Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| BiSeNetV2_1024x1024_cityscapes | 12.8 MB | MultiArray | ycszen/BiSeNet | Apache2.0 | 2021 |
Disentangled Non-Local Neural Networks
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| dnl_r50-d8_512x512_80k_ade20k | 190.8 MB | MultiArray[512x512] | ADE20K | yinmh17/DNL-Semantic-Segmentation | Apache2.0 | 2020 |
Interlaced Sparse Self-Attention for Semantic Segmentation
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| isanet_r50-d8_512x512_80k_ade20k | 141.5 MB | MultiArray[512x512] | ADE20K | openseg-group/openseg.pytorch | MIT | ArXiv'2019/IJCV'2021 |
Rethinking Dilated Convolution in the Backbone for Semantic Segmentation
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| fastfcn_r50-d32_jpu_aspp_512x512_80k_ade20k | 326.2 MB | MultiArray[512x512] | ADE20K | wuhuikai/FastFCN | MIT | ArXiv'2019 |
Non-local Networks Meet Squeeze-Excitation Networks and Beyond
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| gcnet_r50-d8_512x512_20k_voc12aug | 189 MB | MultiArray[512x512] | PascalVOC | xvjiarui/GCNet | Apache License 2.0 | ICCVW'2019/TPAMI'2020 |
Dual Attention Network for Scene Segmentation(CVPR2019)
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| danet_r50-d8_512x1024_40k_cityscapes | 189.7 MB | MultiArray[512x1024] | CityScapes | junfu1115/DANet | MIT | CVPR2019 |
Panoptic Feature Pyramid Networks
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| fpn_r50_512x1024_80k_cityscapes | 108.6 MB | MultiArray[512x1024] | CityScapes | facebookresearch/detectron2 | Apache License 2.0 | 2019 |
Code for binary segmentation of various clothes.
| Google Drive Link | Size | Output | Dataset | Original Project | License | year |
|---|---|---|---|---|---|---|
| clothSegmentation | 50.1 MB | Image(GrayScale 640x960) | fashion-2019-FGVC6 | facebookresearch/detectron2 | MIT | 2020 |
EasyPortrait - Face Parsing and Portrait Segmentation Dataset.
| Google Drive Link | Size | Output | Original Project | License | year | Swift sample | Conversion Script |
|---|---|---|---|---|---|---|---|
| easyportrait-segformer512-fp | 7.6 MB | Image(GrayScale 512x512) * 9 | hukenovs/easyportrait | Creative Commons | 2023 | easyportrait-coreml | |
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. MobileSAM replaces the heavy ViT-H image encoder with a lightweight ViT-Tiny encoder via decoupled knowledge distillation, making it ~60x smaller and ~40x faster than the original SAM.

| Download Link | Size | Output | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|
| MobileSAM.zip | 23 MB (Encoder 13 MB + Decoder 9.8 MB) | Segmentation Mask | ChaoningZhang/MobileSAM | Apache 2.0 | 2023 | SamKit |
SAM 2: Segment Anything in Images and Videos. SAM 2 extends promptable segmentation from images to videos using a streaming architecture with memory. The Tiny variant uses a Hiera-T backbone for efficient on-device inference.
| Download Link | Size | Output | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|
| SAM2Tiny.zip | 76 MB (ImageEncoder 64 MB + PromptEncoder 2 MB + MaskDecoder 9.8 MB) | Segmentation Mask | facebookresearch/sam2 | Apache 2.0 | 2024 | SamKit |
Fast Segment Anything — a YOLOv8-seg instance segmenter (not a SAM encoder/decoder): one forward pass segments everything and point/box prompts just select among them, the fastest SAM-family option for real-time use. FastSAM-s (light) / FastSAM-x (quality).
| Download Link | Size | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|
| FastSAM_s.mlpackage.zip | ~23 MB FP16 | Instance masks | CASIA-IVA-Lab/FastSAM | AGPL-3.0 | 2023 | FastSAMDemo · SamKit | convert_fastsam.py |
| FastSAM_x.mlpackage.zip | ~138 MB FP16 | Instance masks | CASIA-IVA-Lab/FastSAM | AGPL-3.0 | 2023 | FastSAMDemo · SamKit | convert_fastsam.py |
Note: AGPL-3.0 (Ultralytics YOLOv8), unlike the Apache-2.0 SAM family.
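
Since the model segments everything in one pass, a prompt only has to pick among the returned masks. The sketch below shows one plausible selection rule in Python: a point picks the smallest mask that covers it, and a box picks the mask with the highest IoU against the box. These heuristics are assumptions for illustration and may differ from what FastSAMDemo or SamKit do.

```python
def mask_area(mask):
    return sum(sum(row) for row in mask)


def select_by_point(masks, x, y):
    """Among masks covering pixel (x, y), return the index of the smallest."""
    hits = [i for i, m in enumerate(masks) if m[y][x]]
    if not hits:
        return None
    return min(hits, key=lambda i: mask_area(masks[i]))


def select_by_box(masks, box):
    """Return the index of the mask with the highest IoU against the box.
    box = (x1, y1, x2, y2) with x2 and y2 exclusive."""
    x1, y1, x2, y2 = box
    box_area = (x2 - x1) * (y2 - y1)

    def overlap(mask):
        inside = sum(sum(row[x1:x2]) for row in mask[y1:y2])
        union = mask_area(mask) + box_area - inside
        return inside / union if union else 0.0

    return max(range(len(masks)), key=lambda i: overlap(masks[i]))


# Two toy 4x4 binary masks: one covers everything, one covers the centre.
whole = [[1] * 4 for _ in range(4)]
centre = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
masks = [whole, centre]
picked_by_point = select_by_point(masks, 1, 1)
picked_by_box = select_by_box(masks, (1, 1, 3, 3))
```
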
pq-yang/MatAnyone (CVPR 2025) — temporally consistent video matting with object-level memory propagation. From a first-frame mask it tracks and refines an alpha matte across the whole clip, holding sharp edges (hair, semitransparent regions) far better than per-frame baselines.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| MatAnyone (5 mlpackages, ~111 MB FP16 total) | 111 MB | image [1,3,432,768] (per-frame state in Swift) | alpha matte [1,1,432,768] | pq-yang/MatAnyone | NTU S-Lab 1.0 | 2025 | MatAnyoneDemo | convert_matanyone.py |
See sample_apps/MatAnyoneDemo/README.md for the per-frame state machine, the 5-module split, and conversion details.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| Real ESRGAN4x | 66.9 MB | Image(RGB 2048x2048) | xinntao/Real-ESRGAN | BSD 3-Clause License | 2021 |
| Real ESRGAN Anime4x | 66.9 MB | Image(RGB 2048x2048) | xinntao/Real-ESRGAN | BSD 3-Clause License | 2021 |
Towards Real-World Blind Face Restoration with Generative Facial Prior
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| GFPGAN | 337.4 MB | Image(RGB 512x512) | TencentARC/GFPGAN | Apache2.0 | 2021 |
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| BSRGAN | 66.9 MB | Image(RGB 2048x2048) | cszn/BSRGAN | | 2021 |
| Google Drive Link | Size | Output | Original Project | License | year | Conversion Script |
|---|---|---|---|---|---|---|
| A-ESRGAN | 63.8 MB | Image(RGB 1024x1024) | aesrgan/A-ESRGAN | BSD 3-Clause License | 2021 | |
Best-Buddy GANs for Highly Detailed Image Super-Resolution
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| Beby-GAN | 66.9 MB | Image(RGB 2048x2048) | dvlab-research/Simple-SR | MIT | 2021 |
The Residual in Residual Dense Network for image super-scaling.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| RRDN | 16.8 MB | Image(RGB 2048x2048) | idealo/image-super-resolution | Apache2.0 | 2018 |
Fast-SRGAN.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| Fast-SRGAN | 628 KB | Image(RGB 1024x1024) | HasnainRaz/Fast-SRGAN | MIT | 2019 |
Enhanced-SRGAN.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| ESRGAN | 66.9 MB | Image(RGB 2048x2048) | xinntao/ESRGAN | Apache 2.0 | 2018 |
Pretrained: 4xESRGAN
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| UltraSharp | 34 MB | Image(RGB 1024x1024) | Kim2019/ | CC-BY-NC-SA-4.0 | 2021 |
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| SRGAN | 6.1 MB | Image(RGB 2048x2048) | dongheehand/SRGAN-PyTorch | | 2017 |
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network.
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| SRResNet | 6.1 MB | Image(RGB 2048x2048) | dongheehand/SRGAN-PyTorch | | 2017 |
Lightweight Image Super-Resolution with Enhanced CNN.
| Google Drive Link | Size | Output | Original Project | License | year | Conversion Script |
|---|---|---|---|---|---|---|
| LESRCNN | 4.3 MB | Image(RGB 512x512) | hellloxiaotian/LESRCNN | | 2020 | |
Metric Learning based Interactive Modulation for Real-World Super-Resolution
| Google Drive Link | Size | Output | Original Project | License | year | Conversion Script |
|---|---|---|---|---|---|---|
| MMRealSRGAN | 104.6 MB | Image(RGB 1024x1024) | TencentARC/MM-RealSR | BSD 3-Clause | 2022 | |
| MMRealSRNet | 104.6 MB | Image(RGB 1024x1024) | TencentARC/MM-RealSR | BSD 3-Clause | 2022 | |
PyTorch implementation of "Unsupervised Degradation Representation Learning for Blind Super-Resolution", CVPR 2021
| Google Drive Link | Size | Output | Original Project | License | year |
|---|---|---|---|---|---|
| DASR | 12.1 MB | Image(RGB 1024x1024) | The-Learning-And-Vision-Atelier-LAVA/DASR | MIT | 2022 |
wyf0912/SinSR — single-step diffusion-based super-resolution (CVPR 2024, ~113M params). Distilled from ResShift for one-step 4x upscaling. Uses a Swin Transformer UNet with VQ-VAE latent space.
Left: bicubic 4x upscale, Right: SinSR single-step diffusion SR (128x128 → 512x512)
3 CoreML models: VQ-VAE encoder, Swin-UNet denoiser (single step), and VQ-VAE decoder with vector quantization.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| SinSR_Encoder.mlpackage.zip | 39 MB | image [1,3,1024,1024] | latent [1,3,256,256] | wyf0912/SinSR | S-Lab | 2024 | SinSRDemo | convert_sinsr.py |
| SinSR_Denoiser.mlpackage.zip | 420 MB | input [1,6,256,256] | predicted_latent [1,3,256,256] | | | | | |
| SinSR_Decoder.mlpackage.zip | 58 MB | latent [1,3,256,256] | image [1,3,1024,1024] | | | | | |
See sample_apps/SinSRDemo/README.md for the inference pipeline and conversion details.
Learning Temporal Consistency for Low Light Video Enhancement from Single Images.
| Google Drive Link | Size | Output | Original Project | License | Year |
|---|---|---|---|---|---|
| StableLLVE | 17.3 MB | Image(RGB 512x512) | zkawfanx/StableLLVE | MIT | 2021 |
Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement
| Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| Zero-DCE | 320KB | Image(RGB 512x512) | Li-Chongyi/Zero-DCE | See Repo | 2021 | |
Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement
| Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| ZRetinexformer FiveK | 3.4MB | Image(RGB 512x512) | caiyuanhao1998/Retinexformer | MIT | 2023 | |
| ZRetinexformer NTIRE | 3.4MB | Image(RGB 512x512) | caiyuanhao1998/Retinexformer | MIT | 2023 | |
Multi-Stage Progressive Image Restoration.
Deblurring
Denoising
Deraining
| Google Drive Link | Size | Output | Original Project | License | Year |
|---|---|---|---|---|---|
| MPRNetDebluring | 137.1 MB | Image(RGB 512x512) | swz30/MPRNet | MIT | 2021 |
| MPRNetDeNoising | 108 MB | Image(RGB 512x512) | swz30/MPRNet | MIT | 2021 |
| MPRNetDeraining | 24.5 MB | Image(RGB 512x512) | swz30/MPRNet | MIT | 2021 |
Learning Enriched Features for Fast Image Restoration and Enhancement.
Denoising
Super Resolution
Contrast Enhancement
Low Light Enhancement
| Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| MIRNetv2Denoising | 42.5 MB | Image(RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | |
| MIRNetv2SuperResolution | 42.5 MB | Image(RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | |
| MIRNetv2ContrastEnhancement | 42.5 MB | Image(RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | |
| MIRNetv2LowLightEnhancement | 42.5 MB | Image(RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | |
| Google Drive Link | Size | Output | Original Project | License | Sample Project |
|---|---|---|---|---|---|
| MobileStyleGAN | 38.6MB | Image(Color 1024 × 1024) | bes-dev/MobileStyleGAN.pytorch | Nvidia Source Code License-NC | CoreML-StyleGAN |
| Google Drive Link | Size | Output | Original Project |
|---|---|---|---|
| DCGAN | 9.2MB | MultiArray | TensorFlowCore |
| Google Drive Link | Size | Output | Original Project | License | Usage |
|---|---|---|---|---|---|
| Anime2Sketch | 217.7MB | Image(Color 512 × 512) | Mukosame/Anime2Sketch | MIT | Drop an image to preview |
| Google Drive Link | Size | Output | Original Project | Conversion Script |
|---|---|---|---|---|
| AnimeGAN2Face_Paint_512_v2 | 8.6MB | Image(Color 512 × 512) | bryandlee/animegan2-pytorch | |
| Google Drive Link | Size | Output | Original Project | License | Note |
|---|---|---|---|---|---|
| Photo2Cartoon | 15.2 MB | Image(Color 256 × 256) | minivision-ai/photo2cartoon | MIT | The output differs slightly from the original model because some operations were replaced manually during conversion. |
| Google Drive Link | Size | Output | Original Project | Sample |
|---|---|---|---|---|
| AnimeGANv2_Hayao | 8.7MB | Image(256 x 256) | TachibanaYoshino/AnimeGANv2 | AnimeGANv2-iOS |
| Google Drive Link | Size | Output | Original Project |
|---|---|---|---|
| AnimeGANv2_Paprika | 8.7MB | Image(256 x 256) | TachibanaYoshino/AnimeGANv2 |
| Google Drive Link | Size | Output | Original Project |
|---|---|---|---|
| WarpGAN Caricature | 35.5MB | Image(256 x 256) | seasonSH/WarpGAN |
| Google Drive Link | Size | Output | Original Project |
|---|---|---|---|
| UGATIT_selfie2anime | 266.2MB(quantized) | Image(256x256) | taki0112/UGATIT |
| Google Drive Link | Size | Output | Original Project |
|---|---|---|---|
| CartoonGAN_Shinkai | 44.6MB | MultiArray | mnicnc404/CartoonGan-tensorflow |
| CartoonGAN_Hayao | 44.6MB | MultiArray | mnicnc404/CartoonGan-tensorflow |
| CartoonGAN_Hosoda | 44.6MB | MultiArray | mnicnc404/CartoonGan-tensorflow |
| CartoonGAN_Paprika | 44.6MB | MultiArray | mnicnc404/CartoonGan-tensorflow |
| Google Drive Link | Size | Output | Original Project | License | Year |
|---|---|---|---|---|---|
| fast-neural-style-transfer-cuphead | 6.4MB | Image(RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019 |
| fast-neural-style-transfer-starry-night | 6.4MB | Image(RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019 |
| fast-neural-style-transfer-mosaic | 6.4MB | Image(RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019 |
Learning to Cartoonize Using White-box Cartoon Representations
| Google Drive Link | Size | Output | Original Project | License | Year |
|---|---|---|---|---|---|
| White_box_Cartoonization | 5.9MB | Image(1536x1536) | SystemErrorWang/White-box-Cartoonization | creativecommons | CVPR2020 |
White-box facial image cartoonization
| Google Drive Link | Size | Output | Original Project | License | Year |
|---|---|---|---|---|---|
| FacialCartoonization | 8.4MB | Image(256x256) | SystemErrorWang/FacialCartoonization | creativecommons | 2020 |
| Google Drive Link | Size | Output | Original Project | License | Note | Sample Project |
|---|---|---|---|---|---|---|
| AOT-GAN-for-Inpainting | 60.8MB | MLMultiArray(3,512,512) | researchmm/AOT-GAN-for-Inpainting | Apache2.0 | To use see sample. | john-rocky/Inpainting-CoreML |
| Google Drive Link | Size | Input | Output | Original Project | License | Note | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| Lama | 216.6MB | Image (Color 800 × 800), Image (GrayScale 800 × 800) | Image (Color 800 × 800) | advimman/lama | Apache2.0 | To use see sample. | john-rocky/lama-cleaner-iOS | mallman/CoreMLaMa |
ByteDance-Seed/Depth-Anything-3 (ICLR 2026 Oral) — relative monocular depth from a single image. This Core ML port exposes only the monocular depth + confidence subgraph (camera / multi-view / sky / 3DGS branches stripped). First public Core ML conversion of DA3.
| Module | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| DA3 Small 504×504 | ~44 MB FP16 | Image (RGB 504 × 504) | depth + confidence | ByteDance-Seed/Depth-Anything-3 | Apache 2.0 | 2025 | Hub App | convert_depth_anything_v3.py |
| DA3 Base 504×504 | ~173 MB FP16 | Image (RGB 504 × 504) | depth + confidence | ByteDance-Seed/Depth-Anything-3 | Apache 2.0 | 2025 | Hub App | convert_depth_anything_v3.py |
microsoft/MoGe (CVPR 2025 Oral) — open-domain monocular 3D geometry from a single image. Predicts a metric depth map, surface normals, and a confidence mask in one forward pass — unlike MiDaS-style relative depth, the depth comes out in real meters.
Left: original photo, center: metric depth (turbo colormap), right: surface normals.
| Module | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| MoGe-2 ViT-B + normal | ~200 MB FP16 | Image (RGB 504 × 504) | depth + normal + mask + metric_scale | microsoft/MoGe | MIT | 2025 | MoGe2Demo | convert_moge2.py |
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
| Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script |
|---|---|---|---|---|---|---|
| MiDaS_Small | 66.3MB | MultiArray(1x256x256) | isl-org/MiDaS | MIT | 2022 |
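
A relative depth map has no fixed unit, so it is usually rescaled before display. A minimal sketch, assuming the `1x256x256` output has been copied into a nested list with the leading dimension dropped; `normalize_depth` is a hypothetical helper.

```python
def normalize_depth(depth):
    """Min-max scale a 2D depth map to integers in 0..255 for display."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0 for _ in row] for row in depth]  # flat map: nothing to show
    return [[round(255 * (v - lo) / span) for v in row] for row in depth]


depth = [[0.0, 1.0],
         [2.0, 5.0]]
gray = normalize_depth(depth)
```

Whether larger values mean nearer or farther depends on the model's convention, so invert the result if the visualization looks reversed.
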
amd/Nitro-E — AMD's 304M-parameter E-MMDiT text-to-image model. The 4-step distilled variant generates 512×512 images in ~2–3 seconds on iPhone 15+, and the full pipeline fits in ~1.04 GB after INT4 / INT8 palettization.
4-step generation on iPhone, 512×512. Prompt: "a hot air balloon in the shape of a heart, grand canyon".
3 CoreML models total:
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| NitroE_TextEncoder.mlpackage | 590 MB (INT4) / 2.3 GB (FP16) | input_ids [1,128], attention_mask [1,128] | last_hidden_state [1,128,2048] | meta-llama/Llama-3.2-1B | Llama 3.2 (gated) | 2024 | NitroEDemo | convert_nitro_e_text_encoder.py |
| NitroE_EMMDiT.mlpackage | 295 MB (INT8) / 578 MB (FP16) | latent [1,32,16,16], encoder_hs [1,128,2048], timestep [1] | noise_pred [1,32,16,16] | amd/Nitro-E | MIT | 2025 | | convert_nitro_e_emmdit.py |
| NitroE_VAEDecoder.mlpackage | 159 MB (INT8) / 608 MB (FP32) | latent [1,32,16,16] | image [1,3,512,512] | mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers | MIT | 2024 | | convert_nitro_e_vae_decoder.py |
See sample_apps/NitroEDemo/README.md for the Swift FlowMatchEulerScheduler port, tokenizer details, and iOS 18 palettization notes.
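
For intuition, the core of a flow-matching Euler sampler is a short loop. The Python sketch below assumes evenly spaced sigmas from 1 to 0 and a model that predicts a velocity; the actual Swift port may use a shifted schedule and a different timestep scaling, so treat the demo README as the reference. `flow_match_euler` and `toy_model` are hypothetical names.

```python
def flow_match_euler(latent, predict_velocity, num_steps=4):
    """Integrate from sigma=1 (pure noise) to sigma=0 with Euler steps.

    latent:           flat list of floats (the starting noise)
    predict_velocity: callable (x, sigma) -> list of floats, same length as x
    """
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 ... 0.0
    x = list(latent)
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        velocity = predict_velocity(x, sigma)
        # Euler update: x <- x + (sigma_next - sigma) * velocity
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]
    return x


# Toy check: on a straight path from data to noise the velocity is the
# constant (noise - data), so Euler integration recovers the data.
data = [1.0, 2.0]
noise = [0.5, -0.5]


def toy_model(x, sigma):
    return [n - d for n, d in zip(noise, data)]


result = flow_match_euler(noise, toy_model, num_steps=4)
```

In the real pipeline the velocity would come from the E-MMDiT model given the latent, the text encoder output, and the timestep, and the final latent would go to the VAE decoder.
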
ByteDance/Hyper-SD — single-step text-to-image distilled from SD1.5 via Trajectory Segmented Consistency Distillation. ByteDance reports 2× user preference over SD-Turbo at 1 step; runs at acceptable speed and quality on iPhone 15+ via Apple's ml-stable-diffusion.
1-step generations on iPhone, 512×512. Prompts: cat with sunglasses, cyberpunk city, japanese garden, astronaut on horse.
4 CoreML models (~947 MB total): CLIP text encoder + Swin-style chunked UNet (6-bit palettized) + VAE decoder. Uses TCD scheduler for single-step inference.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| HyperSDTextEncoder.mlpackage.zip | 235 MB | input_ids [1,77] | encoder_hidden_states [1,77,768] | ByteDance/Hyper-SD | OpenRAIL++ | 2024 | HyperSDDemo | convert_hypersd.py |
| HyperSDUnetChunk1.mlpackage.zip | 318 MB | latent + encoder_hs + timestep | first half intermediates | | | | | |
| HyperSDUnetChunk2.mlpackage.zip | 299 MB | first half outputs + skip connections | noise_pred [2,4,64,64] | | | | | |
| HyperSDVAEDecoder.mlpackage.zip | 95 MB | latent [1,4,64,64] | image [1,3,512,512] | | | | | |
See sample_apps/HyperSDDemo/README.md for the LoRA fusion, chunked-UNet palettization, and TCD scheduler details.
| Google Drive Link | Original Model | Original Project | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|---|
| stable-diffusion-v1-5 | runwayml/stable-diffusion-v1-5 | runwayml/stable-diffusion | Open RAIL M license | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2022 |
Pastel Mix - a stylized latent diffusion model. This model is intended to produce high-quality, highly detailed anime style with just a few prompts.
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| pastelMixStylizedAnime_pastelMixPrunedFP16 | andite/pastel-mix | Fantasy.ai | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| AOM3_orangemixs | WarriorMama777/OrangeMixs | CreativeML OpenRAIL-M | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| Counterfeit-V2.5 | gsdf/Counterfeit-V2.5 | - | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| anything-v4.5 | andite/anything-v4.0 | Fantasy.ai | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| Openjourney | prompthero/openjourney | - | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
| Google Drive Link | Original Model | License | Run on mac | Conversion Script | Year |
|---|---|---|---|---|---|
| dreamlike-photoreal-2.0 | dreamlike-art/dreamlike-photoreal-2.0 | CreativeML OpenRAIL-M | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023 |
DDColor — AI image colorization for grayscale/B&W photos using dual decoders (ICCV 2023).
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| DDColor_Tiny.mlpackage.zip | 242 MB | 512×512 RGB | AB channels (LAB) | piddnad/DDColor | Apache-2.0 | 2023 | DDColorDemo | convert_ddcolor.py |
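Since the model returns only the AB channels, a colorized pixel is usually rebuilt by pairing the predicted colour with the lightness of the original grayscale image. The sketch below shows that recombination for a single pixel using the standard CIELAB to sRGB formulas (D65 white point). It assumes the AB values are in ordinary LAB units; `lab_to_srgb` and `colorize_pixel` are hypothetical helper names, and the sample project may handle scaling differently.

```python
def lab_to_srgb(L, a, b):
    """Convert one CIELAB pixel (D65 white point) to sRGB components in [0, 1]."""
    def f_inv(t):
        delta = 6.0 / 29.0
        return t ** 3 if t > delta else 3 * delta * delta * (t - 4.0 / 29.0)

    # LAB -> XYZ
    fy = (L + 16.0) / 116.0
    x = 0.95047 * f_inv(fy + a / 500.0)
    y = 1.00000 * f_inv(fy)
    z = 1.08883 * f_inv(fy - b / 200.0)

    # XYZ -> linear sRGB
    linear = (
        3.2406 * x - 1.5372 * y - 0.4986 * z,
        -0.9689 * x + 1.8758 * y + 0.0415 * z,
        0.0557 * x - 0.2040 * y + 1.0570 * z,
    )

    def gamma(c):
        c = min(max(c, 0.0), 1.0)  # clamp out-of-gamut values
        return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

    return tuple(gamma(c) for c in linear)


def colorize_pixel(gray_L, predicted_ab):
    """Keep the original lightness and take only the colour from the model (assumed usage)."""
    a, b = predicted_ab
    return lab_to_srgb(gray_L, a, b)
```

Keeping the original L channel means the full-resolution detail of the input survives even though the colour is predicted at 512×512.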
AdaFace — Quality-adaptive face recognition. Outputs 512-dim embedding for face verification and identification.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| AdaFace_IR18.mlpackage.zip | 48 MB | Image (112×112 face) | 512-dim L2-normalized embedding | mk-minchul/AdaFace | MIT | 2022 | AdaFaceDemo | convert_adaface.py |
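Because the embedding is L2-normalized, comparing two faces comes down to a dot product (cosine similarity). A minimal sketch, assuming the 512 values have already been copied out of the model output into plain float lists; the `0.3` threshold is a hypothetical placeholder to calibrate on your own data, not a value taken from AdaFace.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (effectively a no-op for normalized embeddings)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity; equals the dot product once both vectors are unit length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def is_same_person(a, b, threshold=0.3):
    """Hypothetical decision rule: the threshold is an assumption, tune it on your data."""
    return cosine_similarity(a, b) >= threshold
```

For identification, the same similarity can be computed against every enrolled embedding and the highest score kept.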
3DDFA_V2 — 3D face reconstruction and head pose estimation (yaw, pitch, roll) from a single face image.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|---|
| 3DDFA_V2.mlpackage.zip | 6.3 MB | Image (120×120 RGB) | 62 params (12 pose + 40 shape + 10 expression) | cleardusk/3DDFA_V2 | MIT | 2020 | Face3DDemo |
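The 62 output values can be split into the three groups listed in the table. The sketch below assumes the order pose, shape, expression and a row-major 3×4 pose matrix, which matches how the original project parses its parameters. It does not apply the mean/std rescaling that the original project performs on raw network output, since whether the converted model already includes it is not stated here; `split_3dmm_params` is a hypothetical helper name.

```python
def split_3dmm_params(params):
    """Split a flat 62-value vector into pose (3x4), shape (40) and expression (10)."""
    if len(params) != 62:
        raise ValueError("expected 62 parameters, got %d" % len(params))
    # First 12 values: 3x4 matrix, read row by row (assumed layout).
    pose = [list(params[r * 4:(r + 1) * 4]) for r in range(3)]
    shape = list(params[12:52])        # 40 shape coefficients
    expression = list(params[52:62])   # 10 expression coefficients
    return pose, shape, expression
```

The left 3×3 part of the pose matrix carries scaled rotation, from which yaw, pitch, and roll are derived, and the last column carries the offset.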
pyannote segmentation — Speaker diarization with up to 3 simultaneous speakers. Identifies who speaks when, with overlap detection and per-speaker transcription.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| SpeakerSegmentation.mlpackage.zip | 5.8 MB | 10s mono 16kHz [1,1,160000] | [1, 589, 7] speaker logits | pyannote/segmentation-3.0 | MIT | 2023 | DiarizationDemo | convert_diarization.py |
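The 7 values per frame correspond to a "powerset" encoding: silence, each of the 3 speakers alone, and each pair of overlapping speakers. The sketch below turns per-frame logits into per-speaker on/off flags. The class order used here (silence, singles, then pairs) is an assumption based on how pyannote builds its powerset; verify it against the conversion script before relying on it.

```python
from itertools import combinations

NUM_SPEAKERS = 3
MAX_OVERLAP = 2

# Assumed class order: silence, each single speaker, then each speaker pair.
POWERSET = [set(c) for k in range(MAX_OVERLAP + 1)
            for c in combinations(range(NUM_SPEAKERS), k)]


def decode_frames(logits):
    """Map per-frame class logits (frames x 7) to speaker activity (frames x 3)."""
    activity = []
    for frame in logits:
        best = max(range(len(frame)), key=lambda i: frame[i])  # argmax over classes
        active = POWERSET[best]
        activity.append([1 if s in active else 0 for s in range(NUM_SPEAKERS)])
    return activity
```

With 589 frames covering a 10 s window, each frame spans roughly 17 ms on average, so frame indices can be converted to timestamps by scaling.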
OpenVoice — Zero-shot voice conversion. Record source and target voice, convert on-device.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| OpenVoice_SpeakerEncoder.mlpackage.zip | 1.7 MB | Spectrogram [1, T, 513] | 256-dim speaker embedding | myshell-ai/OpenVoice | MIT | 2024 | OpenVoiceDemo | convert_openvoice.py |
| OpenVoice_VoiceConverter.mlpackage.zip | 64 MB | Spectrogram + speaker embeddings | Waveform audio (22050 Hz) | | | | | |
Hybrid Transformer Demucs — separates music into 4 stems: drums, bass, vocals, and other instruments.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| HTDemucs_SourceSeparation_F32.mlpackage.zip | 80 MB | Audio Waveform [1, 2, 343980] at 44.1kHz | 4 stems (drums, bass, other, vocals) stereo | facebookresearch/demucs | MIT | 2022 | DemucsDemo | convert_htdemucs.py |
Microsoft Florence-2 — a unified vision-language model supporting image captioning, OCR, and object detection from a single model. Converted as 3 CoreML models (INT8): Vision Encoder (DaViT), Text Encoder (BART), and Decoder with autoregressive generation.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| Florence2VisionEncoder / TextEncoder / Decoder | 260 MB (INT8, 3 models total) | 768x768 RGB image + task prompt | Generated text (caption, OCR, etc.) | microsoft/Florence-2-base | MIT | 2024 | Florence2Demo | convert_florence2.py |
john-rocky/CoreML-LLM — Companion repository for running LLMs on the Apple Neural Engine. Unlike MLX Swift (GPU-only), CoreML-LLM targets ANE for ~10x lower power draw, making always-on on-device LLMs practical on iPhone. All models below load via the same CoreMLLLM.load(...) Swift API and are available in-app through the Models Zoo hub.
| Model | Size | Modalities | iPhone 17 Pro decode | HuggingFace |
|---|---|---|---|---|
| Gemma 4 E2B | 3.1 GB | Text + image + audio + video | 31–34 tok/s | mlboydaisuke/gemma-4-E2B-coreml |
| Gemma 4 E4B | 5.5 GB | Text | ~14 tok/s | mlboydaisuke/gemma-4-E4B-coreml |
| Qwen3.5 2B | 2.4 GB | Text | ~17 tok/s (~200 MB RSS) | mlboydaisuke/qwen3.5-2B-CoreML |
| Qwen3.5 0.8B | 754 MB | Text | ~20 tok/s | mlboydaisuke/qwen3.5-0.8B-CoreML |
| Qwen3-VL 2B | 4.7 GB | Text + image | ~7.5 tok/s | mlboydaisuke/qwen3-vl-2b-coreml |
Google Gemma 4 E2B (2.3B effective parameters) running fully on ANE. Multimodal — text, image, audio, and video input, 2048-token context. Decodes at 31–34 tok/s on iPhone 17 Pro.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package |
|---|---|---|---|---|---|---|---|---|
| mlboydaisuke/gemma-4-E2B-coreml | 3.1 GB (INT4, 4 chunks + vision + audio + video encoders) | Text + image + audio + video (≤2048 tokens) | Generated text (streaming) | google/gemma-3n-E2B-it | Gemma ToU | 2025 | CoreMLLLMChat | CoreML-LLM |
Larger text-only Gemma 4 variant — 42-layer decoder, ~4B effective parameters, 100% ANE-resident. Use when you want maximum text quality and have the storage budget. No vision / audio / video encoders.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package |
|---|---|---|---|---|---|---|---|---|
| mlboydaisuke/gemma-4-E4B-coreml | 5.5 GB (INT4, 4 chunks) | Text prompt (≤2048 tokens) | Generated text (streaming) | google/gemma-3n-E4B-it | Gemma ToU | 2025 | CoreMLLLMChat | CoreML-LLM |
Alibaba Qwen3.5 2B — hybrid Gated-DeltaNet SSM + attention, INT8. Shipped as 4 chunks + an mmap fp16 embed sidecar so a 2B-param model fits in ~200 MB physical footprint and stays ANE-resident.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package |
|---|---|---|---|---|---|---|---|---|
| mlboydaisuke/qwen3.5-2B-CoreML | 2.4 GB (INT8, 4 chunks + embed) | Text prompt | Generated text (streaming) | Qwen/Qwen3.5-2B | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM |
Compact hybrid SSM+attention model, INT8 palettized — same semantic precision as fp16 (top-3 = 100% parity vs fp32 oracle), half the bundle size. Smallest and fastest option in the lineup at 754 MB / ~20 tok/s decode.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package |
|---|---|---|---|---|---|---|---|---|
| mlboydaisuke/qwen3.5-0.8B-CoreML | 754 MB (INT8 palettized) | Text prompt | Generated text (streaming) | Qwen/Qwen3.5-0.8B | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM |
Qwen3-VL multimodal — text + image input, re-using Qwen3-VL's native ViT vision tower. 28-layer GQA text backbone shipped as 6 INT8 chunks + an mmap fp16 embed sidecar.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package |
|---|---|---|---|---|---|---|---|---|
| mlboydaisuke/qwen3-vl-2b-coreml | 4.7 GB (INT8, 6 body chunks + head + embed) | Text + image | Generated text (streaming) | Qwen/Qwen3-VL-2B-Instruct | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM |
See CoreML-LLM for the full conversion pipeline, ANE optimization techniques (cat-trick RMSNorm, Conv2d Linear, pre-computed RoPE, stateless KV with explicit I/O), and the Swift sample app.
Google SigLIP — sigmoid-based contrastive image-text model for zero-shot classification. Type any labels (e.g. "cat, dog, car") and get per-label probabilities. Converted as 2 CoreML models (INT8): Image Encoder and Text Encoder.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| SigLIP_ImageEncoder / TextEncoder | 386 MB (FP16, 2 models total) | 224x224 RGB image + text labels | Per-label similarity scores (softmax) | google/siglip-base-patch16-224 | Apache-2.0 | 2024 | SigLIPDemo | convert_siglip.py |
hexgrad/Kokoro-82M — open-weight 82M-parameter StyleTTS2 TTS producing 24kHz speech in 9 languages. The first CoreML port with on-device bilingual (English + Japanese) free-text input — no MLX, no MeCab, no Python G2P at runtime.
2 CoreML models: a flexible-length Predictor (BERT + LSTM duration head + text encoder) and 3 fixed-shape Decoder buckets (128 / 256 / 512 frames). The Swift pipeline picks the smallest bucket that fits the predicted total duration, pads input features with zeros, and trims the output audio.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| Kokoro_Predictor.mlpackage.zip | 75 MB | input_ids [1, T≤256] (int32) + ref_s_style [1, 128] | duration [1, T] + d_for_align [1, 640, T] + t_en [1, 512, T] | hexgrad/Kokoro-82M | Apache-2.0 | 2025 | KokoroDemo | convert_kokoro.py |
| Kokoro_Decoder_128.mlpackage.zip | 238 MB | en_aligned [1, 640, 128] + asr_aligned [1, 512, 128] + ref_s [1, 256] | audio [1, 76800] @ 24kHz | | | | | |
| Kokoro_Decoder_256.mlpackage.zip | 241 MB | en_aligned [1, 640, 256] + asr_aligned [1, 512, 256] + ref_s [1, 256] | audio [1, 153600] @ 24kHz | | | | | |
| Kokoro_Decoder_512.mlpackage.zip | 246 MB | en_aligned [1, 640, 512] + asr_aligned [1, 512, 512] + ref_s [1, 256] | audio [1, 307200] @ 24kHz | | | | | |
See sample_apps/KokoroDemo/README.md for the on-device G2P (English + Japanese), bucketed decoder strategy, and conversion details.
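The bucketed decoder strategy can be sketched in a few lines. This is an illustration of the logic only, written in Python rather than the Swift used by the sample app; the function names are hypothetical, and raising an error for utterances longer than 512 frames is an assumption about how overflow might be handled. The 600 samples per frame follows from the table (76800 / 128 = 153600 / 256 = 307200 / 512).

```python
BUCKETS = (128, 256, 512)   # decoder frame counts from the table
SAMPLES_PER_FRAME = 600     # 76800 / 128, identical for every bucket


def pick_bucket(total_frames):
    """Return the smallest decoder bucket that can hold total_frames."""
    for size in BUCKETS:
        if total_frames <= size:
            return size
    # Assumed policy: the caller should split longer text into shorter pieces.
    raise ValueError("utterance too long for the largest bucket")


def pad_features(features, bucket):
    """Zero-pad each channel (a list of per-frame values) up to the bucket length."""
    return [row + [0.0] * (bucket - len(row)) for row in features]


def trim_audio(audio, total_frames):
    """Drop the samples that correspond to padded frames."""
    return audio[: total_frames * SAMPLES_PER_FRAME]
```

For example, a predicted duration of 100 frames selects the 128-frame decoder, and the 76800-sample output is trimmed to 60000 samples (2.5 s at 24 kHz).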
EfficientAD (PDN-Small) — lightweight unsupervised anomaly detection for industrial inspection. Wraps teacher, student, and autoencoder networks into a single model that outputs a per-pixel anomaly heatmap and image-level anomaly score. Pretrained on MVTec AD bottle category.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| EfficientAD_Bottle.mlpackage.zip | 15 MB (FP16) | 256x256 RGB image | anomaly_map [1,1,256,256] + anomaly_score [0-1] | nelson1425/EfficientAD | MIT | 2023 | EfficientADDemo | convert_efficientad.py |
spotify/basic-pitch — polyphonic Automatic Music Transcription. Converts any audio (any instrument, any voice) into MIDI notes with pitch bend detection. Just 17K parameters / 272 KB — runs in real time on iPhone with full ANE acceleration.
The first open-source iOS implementation: detected notes are shown as a piano roll, exported as a Standard MIDI File, and played back through a built-in synth for A/B comparison with the original audio.
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project |
|---|---|---|---|---|---|---|---|
| BasicPitch_nmp.mlpackage.zip | 272 KB | audio waveform [1, 43844, 1] @ 22050 Hz mono | note [1,172,88] + onset [1,172,88] + contour [1,172,264] | spotify/basic-pitch | Apache-2.0 | 2022 | BasicPitchDemo |
See sample_apps/BasicPitchDemo/README.md for the sliding-window inference, post-processing port, and iOS-specific gotchas.
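Since the model accepts exactly 43844 samples per call (about 2 s at 22050 Hz), longer recordings have to be cut into overlapping windows. A minimal sketch of that windowing follows, written in Python for illustration. The default overlap of 7680 samples is an assumed value, and `sliding_windows` is a hypothetical helper; check the sample app README for the settings it actually uses.

```python
WINDOW = 43844  # samples per model call, from the input shape in the table


def sliding_windows(num_samples, window=WINDOW, overlap=7680):
    """Yield (start, end) sample ranges covering the audio.

    The final range may extend past num_samples; the caller is expected
    to zero-pad that window before running the model.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    hop = window - overlap
    start = 0
    while True:
        yield start, start + window
        if start + window >= num_samples:
            break
        start += hop
```

Each window yields 172 output frames; frames in the overlapping regions need to be dropped or merged when the per-window outputs are stitched back together.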
stabilityai/stable-audio-open-small — text-to-music generation (497M params). Generates up to 11.9 seconds of stereo 44.1kHz audio from text prompts using rectified flow diffusion.
4 CoreML models: T5 text encoder, NumberEmbedder (seconds conditioning), DiT (diffusion transformer), and VAE decoder (Oobleck).
| Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script |
|---|---|---|---|---|---|---|---|---|
| StableAudioT5Encoder.mlpackage.zip | 105 MB | input_ids [1, 64] | text_embeddings [1, 64, 768] | stabilityai/stable-audio-open-small | Stability AI Community | 2024 | StableAudioDemo | convert_stable_audio.py |
| StableAudioNumberEmbedder.mlpackage.zip | 396 KB | normalized_seconds [1] | seconds_embedding [1, 768] | | | | | |
| StableAudioDiT.mlpackage.zip | 326 MB | latent [1,64,256] + timestep + conditioning | velocity [1,64,256] | | | | | |
| StableAudioDiT_FP32.mlpackage.zip | 1.3 GB | latent [1,64,256] + timestep + conditioning | velocity [1,64,256] | | | | | |
| StableAudioVAEDecoder.mlpackage.zip | 149 MB | latent [1, 64, 256] | stereo audio [1, 2, 524288] at 44.1kHz | | | | | |
See sample_apps/StableAudioDemo/README.md for INT8 vs FP32 DiT selection and conversion details.
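In rectified flow sampling, the DiT predicts a velocity and the latent is moved along it in a handful of steps before the VAE decoder turns it into audio. The sketch below shows a plain fixed-step Euler loop on a flat list of floats. The time direction (t = 1 is noise, t = 0 is data), the sign of the update, and the step count are assumptions for illustration; `velocity_fn` is a hypothetical stand-in for one DiT forward pass, and the sample app's sampler may differ.

```python
def euler_sample(velocity_fn, noise, steps=8):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)  # stands in for one DiT forward pass
        # Step backwards in time along the predicted velocity.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

With a perfectly straight flow, where the velocity is constant and equal to noise minus data, this loop recovers the data exactly regardless of the step count, which is why rectified flow models can sample in few steps.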
import Vision

// `modelname` stands for the class Xcode generates from your bundled model file.
lazy var coreMLRequest: VNCoreMLRequest = {
    let model = try! VNCoreMLModel(for: modelname().model)
    let request = VNCoreMLRequest(model: model, completionHandler: self.coreMLCompletionHandler)
    return request
}()

// Run the request off the main thread; `ciimage` is the input CIImage.
let handler = VNImageRequestHandler(ciImage: ciimage, options: [:])
DispatchQueue.global(qos: .userInitiated).async {
    try? handler.perform([self.coreMLRequest])
}
If the model has Image type output:
let result = request?.results?.first as! VNPixelBufferObservation
let uiimage = UIImage(ciImage: CIImage(cvPixelBuffer: result.pixelBuffer))
Otherwise, if the model has MultiArray type output:
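For visualizing a MultiArray as an image, Mr. Hollance’s “CoreML Helpers” are very convenient. CoreML Helpers
Converting from MultiArray to Image with CoreML Helpers:
func coreMLCompletionHandler(request: VNRequest?, error: Error?) {
    // Take the first observation and stop if the model returned no MultiArray.
    guard let result = request?.results?.first as? VNCoreMLFeatureValueObservation,
          let multiArray = result.featureValue.multiArrayValue else { return }
    // cgImage(min:max:channel:) comes from CoreML Helpers; min/max give the value range of the output.
    let cgimage = multiArray.cgImage(min: -1, max: 1, channel: nil)
}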
Option 2: Use CoreGANContainer. You can use models by dragging and dropping them into the container project.
You can reduce the model size with quantization if you want. https://coremltools.readme.io/docs/quantization
The lower the number of bits, the greater the chance of degrading model accuracy. The loss in accuracy varies with the model.
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# Load the full-precision model
model_fp32 = ct.models.MLModel('model.mlmodel')

# Quantize the weights to 16 bits
model_fp16 = quantization_utils.quantize_weights(model_fp32, nbits=16)
# nbits can be 16 (half-size model), 8 (1/4), 4 (1/8), 2, or 1

Cover image was taken from Ghibli free images.
For the YOLOv5 conversion, dbsystel/yolov5-coreml-tools gave me a very clever conversion script.
And thanks to all of the original projects.
Daisuke Majima. Freelance engineer. iOS/MachineLearning/AR. I can work on mobile ML projects and AR projects. Feel free to contact: rockyshikoku@gmail.com