
My Journey: Fine-Tuning & Deploying Smaller AI Models

📖 10 min read · 1,978 words · Updated Mar 26, 2026

Hey everyone, Nina here, back at agntbox.com! Today, I want to talk about something that’s been buzzing in my Slack channels and haunting my late-night coding sessions: AI frameworks. Specifically, I want to explore a particular corner that’s often overlooked when everyone’s busy chasing the latest LLM: the tooling around fine-tuning and deployment for smaller, more specialized models.

We all know the big names – TensorFlow, PyTorch. They’re the giants, the established players. And for good reason! They’re incredibly powerful and versatile. But let’s be real, sometimes you don’t need a battleship to cross a pond. Sometimes, you need a nimble speedboat, especially when you’re working on a tight deadline, with a specific dataset, and a clear goal in mind. That’s where I want to talk about something I’ve been getting quite cozy with lately: Hugging Face Optimum for ONNX Runtime.

Now, before your eyes glaze over with acronyms, let me break it down. Hugging Face, of course, is the darling of the NLP world, making pre-trained models accessible to everyone. ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models, which basically means you can convert models trained in one framework (like PyTorch) and run them in another (like TensorFlow, or in our case, ONNX Runtime). And ONNX Runtime? That’s Microsoft’s high-performance inference engine.

So, what does Hugging Face Optimum do for ONNX Runtime? It’s basically a bridge, an optimizer, and a deployment assistant all rolled into one. It helps you take your Hugging Face models, optimize them for ONNX Runtime inference, and often, get a significant speed boost without sacrificing much (or any) accuracy. And that, my friends, is gold.

Why I’m Obsessed with Optimum ONNX Runtime Right Now

My journey into Optimum ONNX Runtime started, as most of my tech obsessions do, with a problem. I was working on a project for a client that involved deploying a relatively small BERT-based model for text classification on customer support tickets. The model was trained on a custom dataset, and while it performed great, inference time was just a touch too slow for real-time customer interaction. We were talking about maybe 150-200ms per inference on a beefy GPU, which isn’t terrible, but for high-volume, real-time applications, every millisecond counts.

I tried all the usual suspects: batching, optimizing input pipelines, even some basic quantization. We got some improvements, but nothing dramatic. Then, a colleague mentioned Optimum, and specifically, its ONNX integration. I was skeptical at first. Another layer of abstraction? More dependencies? But I was desperate, so I dove in.

What I found was a surprisingly straightforward workflow that delivered results. We managed to cut our inference time down to around 50-70ms per inference on the same GPU, and even saw decent performance on a CPU for less critical tasks. That’s a 2-3x speedup, which in the world of real-time AI, is a massive win. It meant we could scale our service much more efficiently and provide quicker responses to customers, directly impacting their experience.

The Problem it Solves: Performance and Portability

Let’s be honest, deploying AI models can be a headache. You train a beautiful model in PyTorch, and then you have to figure out how to get it to run efficiently in production. Sometimes, you’re stuck with a specific hardware setup, or you need to deploy to an edge device with limited resources. This is where ONNX comes in. It provides a common format, decoupling your model from the training framework.

Optimum takes this a step further. It’s not just about converting to ONNX; it’s about optimizing that ONNX model. It can apply techniques like graph optimizations, operator fusion, and even quantize your model to reduce its size and speed up inference, often with minimal impact on accuracy. This is especially useful for smaller models or when you’re constrained by memory or compute on your deployment target.

My client’s use case was a perfect example. We had a PyTorch model, but we wanted to deploy it on a cloud instance with NVIDIA GPUs, and we needed maximum throughput. Optimum ONNX Runtime allowed us to export the model, apply specific optimizations for our target hardware, and get it running like a dream.

Getting Started with Optimum ONNX Runtime: A Practical Example

Let’s walk through a basic example of how you might use Optimum to export and optimize a Hugging Face model for ONNX Runtime. For this, we’ll use a simple sentiment analysis model.

First, you’ll need to install the necessary libraries:


pip install transformers "optimum[onnxruntime]" onnx

Now, let’s write some Python code to export a pre-trained sentiment analysis model.

Example 1: Exporting a Sentiment Analysis Model

Here, we’re taking a standard `distilbert-base-uncased-finetuned-sst-2-english` model and exporting it to ONNX format using Optimum.


from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export the model to ONNX.
# `export=True` tells Optimum to convert the PyTorch weights to ONNX on load
# (older Optimum releases used `from_transformers=True` for the same thing).
# `save_pretrained` then writes the ONNX graph and config to `onnx_path`.
onnx_path = "./onnx_sentiment_model/"
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

print(f"Model and tokenizer exported to {onnx_path}")

After running this, you’ll have a directory named `onnx_sentiment_model` containing your `model.onnx` file and the tokenizer files. This `model.onnx` file is the optimized version ready for ONNX Runtime.

Example 2: Running Inference with the ONNX Model

Now, let’s load that exported model and run some inference with it. Notice how we load `ORTModelForSequenceClassification` directly from the `onnx_path` we saved to.


from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import pipeline
from transformers import AutoTokenizer
import time

onnx_path = "./onnx_sentiment_model/"

# Load the ONNX model and tokenizer
ort_model = ORTModelForSequenceClassification.from_pretrained(onnx_path)
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

# Create a pipeline for easy inference.
# Note: this is Optimum's `pipeline`, not the one from `transformers`;
# `accelerator="ort"` tells it to run inference with ONNX Runtime.
onnx_pipeline = pipeline(
    "sentiment-analysis",
    model=ort_model,
    tokenizer=tokenizer,
    accelerator="ort",
)

text_samples = [
    "I love this product, it's amazing!",
    "This movie was just okay, a bit boring.",
    "I absolutely hate waiting in long lines.",
    "The service was incredibly fast and efficient.",
]

print("\n--- Running Inference with ONNX Model ---")
start_time = time.time()
results = onnx_pipeline(text_samples)
end_time = time.time()

for i, res in enumerate(results):
    print(f"Text: '{text_samples[i]}' -> Label: {res['label']}, Score: {res['score']:.4f}")

print(f"Inference time for {len(text_samples)} samples: {(end_time - start_time):.4f} seconds")

When you run this, you’ll see the sentiment predictions. More importantly, if you compare the inference time with the original PyTorch model on the same hardware, you will likely observe a noticeable speedup, especially for larger batches or more complex models. The `accelerator="ort"` parameter is a small but mighty flag that tells the Optimum pipeline to run inference through ONNX Runtime, which is where the magic happens.

Example 3: Controlling Optimization Features (Optional but Powerful)

Optimum allows fine-grained control over the optimization process. For instance, you can specify the optimization level or even choose specific graph optimizations. This can be crucial when you’re trying to squeeze every last drop of performance out of your model or when you need to make trade-offs between speed and accuracy (e.g., with quantization).


from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export the model to ONNX first, as in Example 1
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Define the optimization configuration.
# O1 applies basic graph optimizations, O2 adds more aggressive fusions,
# and O3 layers further approximations (such as fast GELU) on top of O2.
optimization_config = AutoOptimizationConfig.O2()

# Run the optimizer and save the optimized model alongside the tokenizer
onnx_path_optimized = "./onnx_sentiment_model_optimized/"
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(save_dir=onnx_path_optimized, optimization_config=optimization_config)
tokenizer.save_pretrained(onnx_path_optimized)

print(f"Optimized model and tokenizer exported to {onnx_path_optimized}")

The `AutoOptimizationConfig` is your friend here. `O1` provides basic graph optimizations, `O2` adds more aggressive fusion and node eliminations, and `O3` layers further approximations (such as fast GELU) on top of `O2`. Quantization is a separate step in Optimum, handled through `ORTQuantizer` rather than these levels. Choosing the right level depends on your specific needs and the hardware you’re targeting. For my client, we experimented between `O1` and `O2` to find the sweet spot, leaning towards `O2` for the best balance of speed and accuracy.

My Takeaways and What’s Next

My experience with Hugging Face Optimum for ONNX Runtime has been overwhelmingly positive. It’s not a magic bullet for every single AI deployment, but it addresses a very common and critical need: getting your models to run faster and more efficiently in production, especially when you’re working with Hugging Face models.

  • Performance Boost: The primary benefit is the significant reduction in inference time. For real-time applications, this alone can justify the switch, improving user experience and reducing infrastructure costs.
  • Portability: By converting to ONNX, your models become more portable, runnable across different hardware and operating systems without being tied to a specific deep learning framework.
  • Ease of Use: The integration with Hugging Face `transformers` library is remarkably smooth. If you’re already familiar with Hugging Face, the learning curve for Optimum is quite gentle.
  • Resource Efficiency: Optimized models often require less memory and CPU/GPU cycles, which is crucial for edge deployments or cost-sensitive cloud environments.

One thing I’ve learned is that it’s worth experimenting with different optimization levels and configurations. Don’t just settle for the default. Try `O1`, `O2`, and even `O3`, plus quantization via `ORTQuantizer` if your use case allows for it, and benchmark the results on your actual target hardware. The gains can be surprising!

Looking ahead, I believe tools like Hugging Face Optimum are going to become even more essential. As AI models proliferate and move into more diverse deployment environments, the ability to optimize and streamline their inference will be paramount. I’m particularly excited to see how Optimum evolves with new hardware accelerators and more advanced quantization techniques.

Actionable Takeaways for Your Next AI Project:

  1. Evaluate Your Inference Needs: Before exploring optimization, clearly define your performance requirements. What’s an acceptable latency? What’s your throughput target?
  2. Consider ONNX Early: If you’re using Hugging Face models and performance is a concern, start thinking about ONNX export and optimization during your development cycle, not just at deployment.
  3. Benchmark, Benchmark, Benchmark: Always measure the actual performance improvement (or degradation) after applying optimizations. Don’t rely on theoretical gains. Use real data and real hardware.
  4. Experiment with Optimization Levels: Don’t just use the default settings. Play around with `AutoOptimizationConfig.O1()`, `O2()`, and `O3()` to find the best balance for your model and use case.
  5. Stay Updated: The Hugging Face Optimum library is actively developed. Keep an eye on their releases and documentation for new features and performance improvements.
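For point 3, a small timing harness goes a long way. This is a plain-Python sketch (the `benchmark` helper and the stand-in callable are my own names, not part of Optimum) that you can wrap around both your original and your ONNX pipelines to compare them like for like:

```python
import statistics
import time

def benchmark(fn, inputs, warmup=3, runs=20):
    """Time a callable over repeated runs and report latency in milliseconds.

    `fn` is any inference callable (e.g. a Hugging Face pipeline) and
    `inputs` is the batch to pass to it. Warmup runs are discarded so
    one-time costs (lazy initialization, caches) don't skew the numbers.
    """
    for _ in range(warmup):
        fn(inputs)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(inputs)
        timings.append(time.perf_counter() - start)
    return {
        "median_ms": statistics.median(timings) * 1000,
        "p95_ms": sorted(timings)[int(0.95 * len(timings)) - 1] * 1000,
    }

# Stand-in callable for illustration; swap in your PyTorch and ONNX pipelines
stats = benchmark(lambda batch: [len(s) for s in batch], ["a sample", "another"])
print(f"median: {stats['median_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

Report the median rather than the mean, since a single slow outlier (a garbage-collection pause, a cold cache) can badly distort an average over a handful of runs.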

That’s it for me this week! If you’ve been struggling with model deployment performance, give Hugging Face Optimum for ONNX Runtime a try. It might just be the speedboat you need. Let me know in the comments if you’ve used it or if you have other go-to tools for model optimization. Happy inferencing!

🕒 Originally published: March 12, 2026 · Last updated: March 26, 2026

Written by Jake Chen

Software reviewer and AI tool expert. Independently tests and benchmarks AI products. No sponsored reviews — ever.
