
Here's the thing. Generative AI made image creation trivial. Sora, Midjourney, DALL-E — they all produce hundreds of variations in minutes.
But evaluation? Still a human bottleneck.
A creative director reviews maybe 5-6 images per hour if they're being thorough — checking brand alignment, emotional tone, composition, segment fit, whether you can even add a headline without covering the product. At that rate, evaluating 100 images takes an entire workday. And if you're running segment-targeted campaigns (Gen Z vs. Parents vs. Professionals), multiply that by three.
Most teams solve this by either accepting mediocre AI outputs because manual review is too slow, or burning out their creative teams with repetitive pixel-peeping.
This article is about a third option: building an automated system that evaluates marketing effectiveness, not just technical quality — by combining research-backed metrics with LLM-powered judgment.
The result? A three-layer pipeline that processes images in seconds, catches issues metrics alone miss, and generates actionable feedback creators can actually use.
The architecture is deliberately simple: cascade from cheap filters to expensive judgment.
Layer 1: Technical Metrics (instant, free)
Basic computer vision checks — sharpness, composition, color energy. Filters ~20% of obvious failures immediately.
Layer 2: HEIM-Inspired AI Metrics (2-3 seconds, free)
Safety and alignment checks using Stanford's HEIM benchmark framework — CLIP alignment, aesthetics scoring, watermark detection, NSFW filtering.
Layer 3: LLM Marketing Judgment (5 seconds, $0.01/image)
GPT-4o evaluates through five marketing lenses to simulate creative director review.
Why this cascade matters: You only spend money on images that pass the cheap filters. At 1000 images/day, this saves roughly $390/month versus running LLM evaluation on everything.
But the real innovation isn't the layers — it's what happens in Layer 3.
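Before digging into each layer, here's a minimal sketch of the cascade itself. The helper names (`technical_metrics`, `heim_metrics`, `llm_review`) and the dictionary keys are placeholders for the layers described below; the thresholds are the ones quoted later in this article.

```python
# Orchestration sketch: bail out as soon as a cheap layer rejects.
# The three helpers are placeholders for Layers 1-3 described below.
def evaluate(image_path: str, segment: str) -> dict:
    technical = technical_metrics(image_path)              # Layer 1: free, <100 ms
    if technical["technical_quality"] < 3.0:
        return {"decision": "REJECT", "reason": "technical_quality"}

    heim = heim_metrics(image_path, segment)                # Layer 2: free, ~2-3 s
    if heim["nsfw_risk"] >= 0.20 or heim["clip_score"] < 0.18:
        return {"decision": "REJECT", "reason": "safety_or_alignment"}

    return llm_review(image_path, segment, technical, heim)  # Layer 3: ~$0.01/image
```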
Technical metrics are straightforward computer vision: sharpness via Laplacian variance, composition via edge density, color energy via saturation. Nothing fancy.
The threshold is simple: technical quality below 3.0/5 gets rejected immediately. No API calls. No analysis. Just filtered out.
This catches maybe 20% of images — the obviously blurry, flat, washed-out ones that no amount of marketing genius can save.
The code is pure OpenCV and NumPy. Runs at ~1000 images/second on a MacBook M2. The entire layer costs $0 and takes <100ms per image.
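Here's a minimal sketch of what those checks can look like; the scaling onto a 0-5 scale is illustrative, not the production calibration.

```python
import cv2
import numpy as np

def technical_metrics(image_path: str) -> dict:
    """Layer 1: cheap OpenCV checks. Mapping to a 0-5 scale is illustrative."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Sharpness: variance of the Laplacian (higher = sharper).
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Composition proxy: fraction of pixels that are Canny edges.
    edge_density = cv2.Canny(gray, 100, 200).mean() / 255.0

    # Color energy: mean saturation in HSV space.
    saturation = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[:, :, 1].mean() / 255.0

    scores = {
        "sharpness": min(5.0, sharpness / 100.0),
        "composition": min(5.0, edge_density * 25.0),
        "color_energy": min(5.0, saturation * 5.0),
    }
    scores["technical_quality"] = float(np.mean(list(scores.values())))
    return scores
```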
What I learned: Don't overthink this part. The goal is speed, not sophistication. Save the intelligence budget for later.
Stanford's Holistic Evaluation of Text-to-Image Models (HEIM) is a benchmark that evaluates AI image generation across 12 aspects: alignment, quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
It's comprehensive. Maybe too comprehensive for production.
So I cherry-picked four critical metrics: CLIP alignment, aesthetics scoring, NSFW detection, and watermark detection.
CLIP measures semantic similarity between images and text. Standard stuff. But here's where it gets interesting.
The Problem: Generic CLIP measures "does this image match the prompt." For marketing, you need "does this image match the segment aesthetic?"
The Solution: Build segment-specific descriptors.
Instead of scoring against a generic phrase like "a marketing image," I score against a descriptor written for each segment's aesthetic (sketched in the code below).
This single change transformed CLIP from a generic alignment check into a segment fitness predictor.
And the thresholds matter: Gen Z needs 3.3+ segment fit. Parents need 3.4+. Professionals prioritize composition (3.4+) over color energy. Different audiences = different standards.
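Here's a sketch of segment-specific CLIP scoring. The descriptor strings are hypothetical stand-ins for the real tuned descriptors, and the final mapping of raw similarity onto the 0-5 segment-fit scale is illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical descriptors; the real ones are tuned per brand and segment.
SEGMENT_DESCRIPTORS = {
    "gen_z": "a bold, authentic, social-first marketing image for Gen Z",
    "parents": "a warm, trustworthy, family-oriented marketing image for parents",
    "professionals": "a clean, minimal, premium marketing image for professionals",
}

_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def segment_fit(image_path: str, segment: str) -> float:
    """Cosine similarity between image and segment descriptor, mapped to 0-5."""
    image = Image.open(image_path).convert("RGB")
    inputs = _processor(text=[SEGMENT_DESCRIPTORS[segment]], images=image,
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        out = _model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sim = float((img_emb @ txt_emb.T).item())
    # Illustrative mapping of raw similarity (~0.1-0.4 in practice) onto the
    # 0-5 segment-fit scale the thresholds above refer to.
    return max(0.0, min(5.0, (sim - 0.1) * 12.5))
```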
Aesthetics and safety are handled by pre-trained models: an aesthetics predictor scores visual appeal (0-5), and an NSFW detector flags safety risks (0-1, lower = safer).
The safety gate is non-negotiable: NSFW risk ≥ 0.20 → automatic rejection. No debate. No LLM review. Just rejected.
Here's a lesson nobody tells you: production systems need degraded modes.
If CLIP downloads fail, or models time out, the system doesn't crash. It falls back to heuristic scoring.
Is it perfect? No. But it's better than returning nothing because an API timed out.
What broke: The first time I ran this at scale, CLIP downloads timed out on 30% of runs. The system appeared dead even though upstream scrapers were fine.
The fix: Graceful fallback + better error logging. Now when CLIP fails, I get a warning, not a silent crash.
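The fallback itself can be as simple as a try/except around the CLIP call. The heuristic below (reusing the Layer 1 scores) is an assumption to illustrate the pattern, not the exact production logic.

```python
import logging

logger = logging.getLogger("evaluation")

def segment_fit_with_fallback(image_path: str, segment: str, technical: dict) -> float:
    """Score segment fit; degrade to a heuristic if CLIP is unavailable."""
    try:
        return segment_fit(image_path, segment)   # CLIP-based score from the sketch above
    except Exception as exc:                       # download timeout, OOM, etc.
        logger.warning("CLIP unavailable: %s -- using fallback heuristic", exc)
        # Hypothetical fallback: lean on the free Layer 1 metrics so the
        # pipeline keeps moving instead of crashing.
        return 0.5 * technical["composition"] + 0.5 * technical["color_energy"]
```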
This is where metrics stop and marketing starts.
Sharpness, CLIP scores, aesthetics — they tell you if an image is technically acceptable. They don't tell you if it will convert.
That requires understanding where the eye lands, what emotion the image evokes, whether there's room for a headline, whether it feels on-brand, and whether it speaks to the target segment.
Enter: The 5-Lens Marketing Framework.
I force GPT-4o to evaluate every image through five mandatory lenses: visual saliency, emotional tone, compositional balance, brand consistency, and demographic relevance.
Each lens gets a status: PASS, WEAK, or FAIL.
The decision rule: If 2+ lenses FAIL → automatic rejection. No exceptions.
This isn't arbitrary. It's the minimum bar for a marketing image to work in a real campaign.
Iteration 1: "Is this a good marketing image for Gen Z?"
Result: Too generic. LLM approved everything. No actionable feedback.
Iteration 2: Added a scoring rubric and requested JSON output.
Result: Better structure, but still too lenient. Everything scored 3.5-4.5. No variance.
Iteration 3 (Final): The 5-lens framework with hard fails and segment-specific minimums.
Here's the structure (simplified):
You are a senior performance marketer reviewing AI-generated images.
INPUT DATA:
- Segment: {segment}
- Technical quality: {score}/5
- Segment fit: {score}/5
- CLIP score: {score}
- NSFW risk: {score}
HARD FAILS (immediate reject):
- NSFW >= 0.20
- CLIP < 0.18
- Technical < 3.0
SEGMENT REQUIREMENTS:
[Gen Z / Parents / Professionals specific thresholds]
EVALUATE USING 5 LENSES (mandatory):
1. Visual Saliency & Focal Point - PASS/WEAK/FAIL
2. Emotional & Psychological Tone - PASS/WEAK/FAIL
3. Compositional Balance & Negative Space - PASS/WEAK/FAIL
4. Brand Consistency & Recognizability - PASS/WEAK/FAIL
5. Contextual & Demographic Relevance - PASS/WEAK/FAIL
If 2+ lenses FAIL → cannot approve.
OUTPUT (JSON only):
{
  "decision": "APPROVE | NEEDS_REVIEW | REJECT",
  "lens_evaluation": [...],
  "issues": [...],
  "regen_prompt": {...}
}
What changed: Adding the "think like a marketer" persona and explicit hard fails made all the difference. The LLM stopped being polite and started being useful.
What broke: Early versions returned inconsistent JSON. Sometimes valid, sometimes broken with extra commentary.
The fix: Strict JSON schema enforcement via response_format={"type": "json_object"} in the API call. Problem solved.
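Here's roughly what the Layer 3 call looks like with strict JSON mode. The message payload and the post-check on the lens verdicts follow the simplified prompt above; the function name and exact field handling are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

def call_gpt4o_judge(prompt: str, image_b64: str) -> dict:
    """Send the 5-lens prompt plus the image; force a JSON object back.
    Invoked from the llm_review step of the cascade."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},   # no extra commentary, just JSON
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    verdict = json.loads(response.choices[0].message.content)

    # Enforce the hard rule outside the model too: 2+ failing lenses -> REJECT.
    fails = sum(1 for lens in verdict.get("lens_evaluation", [])
                if lens.get("status") == "FAIL")
    if fails >= 2:
        verdict["decision"] = "REJECT"
    return verdict
```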
I tested 15 images (5 per segment). Here's what happened:
| Evaluation Method | Approved | Rejected |
| --- | --- | --- |
| Metrics Only (Layers 1+2) | 14/15 (93%) | 1/15 (7%) |
| With LLM (Layer 3) | 1/15 (7%) | 9/15 (60%) |
Metrics approved nearly everything because they only measure technical quality. The LLM caught real marketing flaws:
Example: the Gen Z #1 image cleared every metric, yet the LLM flagged it. Why? Because a technically perfect image can still be marketing-ineffective.
Across the 9 rejected images, the same problems kept recurring: unclear focal points, no room for headline text, generic brand presence. These are issues only an LLM catches. Metrics don't know what "on-brand" means.
Here's the honest truth: building this isn't hard technically. It's hard architecturally.
Don't build all three layers at once. Start with technical metrics only. Run it on every image in your library. Get a feel for what the thresholds actually mean.
Is 3.0 too strict? Too loose? Does composition_clarity correlate with what your team calls "good"?
This exploration phase costs $0 and teaches you more than any tutorial.
CLIP first. Get segment-specific descriptors working. Validate that Gen Z images actually score differently than Professional images.
Then aesthetics. Then NSFW. Then watermark detection.
Each metric should make the system noticeably better. If adding a metric doesn't change decisions, remove it.
Only add LLM evaluation once you've filtered down to candidates that might be good.
Why? Because the LLM is expensive and slow. If you're running it on blurry images that should've been rejected in Layer 1, you're burning money.
Get the cascade working first. LLM judgment is the cherry on top, not the foundation.
You need human labels to calibrate. But how you collect them determines everything.
I built a dead-simple web UI: show image, show metrics, two buttons (APPROVE / REJECT). That's it.
No fancy features. No "maybe" option. No multi-step wizard.
Binary decisions force clarity. And clarity makes calibration possible.
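For reference, a two-button UI really can be this small. Here's a hypothetical Streamlit sketch, assuming a folder of candidate PNGs with a JSON metrics sidecar next to each image:

```python
import json
from pathlib import Path

import streamlit as st

IMAGES = sorted(Path("candidates").glob("*.png"))   # assumed folder layout
LABELS = Path("labels.jsonl")

if "idx" not in st.session_state:
    st.session_state["idx"] = 0

def record(label: str, name: str) -> None:
    """Append one binary label and advance to the next image."""
    with LABELS.open("a") as f:
        f.write(json.dumps({"image": name, "label": label}) + "\n")
    st.session_state["idx"] += 1

idx = st.session_state["idx"]
if idx < len(IMAGES):
    img = IMAGES[idx]
    st.image(str(img))
    st.json(json.loads(img.with_suffix(".json").read_text()))  # show metrics sidecar

    approve, reject = st.columns(2)
    approve.button("APPROVE", on_click=record, args=("APPROVE", img.name))
    reject.button("REJECT", on_click=record, args=("REJECT", img.name))
else:
    st.write("All images labeled.")
```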
After 50 labeled images per segment, patterns emerge. You'll see which metrics actually predict human approval. Those insights feed back into thresholds.
The naive approach: Run GPT-4o on every image.
Cost: 1000 images/day × $0.015 = $15/day = $450/month
The smart approach: Cascade through cheaper checks first.
Cost: ~$1.90/day = $57/month
Savings: ~87%
This isn't theoretical. I ran both approaches. The tiered version saved $393/month while maintaining the same accuracy.
Initial thresholds were arbitrary (total ≥ 3.0, aesthetics ≥ 2.5). They worked okay, but not great.
The solution: Collect human labels. Optimize thresholds to maximize F1 score.
I built a simple labeling UI where reviewers mark images APPROVE or REJECT, then used scipy.optimize to find thresholds that match human judgment (a sketch follows below).
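Here's a minimal sketch of that calibration step, assuming two thresholds (total score and aesthetics) and binary human labels; the real system presumably calibrates per segment.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import f1_score

def calibrate_thresholds(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """scores: (n, 2) array of [total_score, aesthetics]; labels: 1=APPROVE, 0=REJECT."""
    def neg_f1(thresholds):
        total_min, aesthetics_min = thresholds
        preds = ((scores[:, 0] >= total_min) &
                 (scores[:, 1] >= aesthetics_min)).astype(int)
        return -f1_score(labels, preds)

    # Search the threshold space for the pair that best matches human labels.
    result = differential_evolution(neg_f1, bounds=[(2.0, 4.5), (2.0, 4.5)], seed=42)
    return float(result.x[0]), float(result.x[1])
```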
Result: Accuracy improved from ~65% to ~85% within a month.
But here's what surprised me: the optimal thresholds weren't what I expected. Gen Z required lower aesthetics (2.8) than Professionals (3.4), because authenticity beats polish for that segment.
You can't guess your way to those insights. You have to measure.
One of the scariest failures: Everything upstream worked perfectly. Scrapers ran. Data flowed. But zero images appeared in the output.
The culprit? The deduplication node was checking the wrong field. Every image looked "already processed."
The debugging trick: Temporarily disable dedupe. If images suddenly appear, you know the logic is the problem, not the data.
This failure mode is particularly insidious because there's no error. The system just... quietly rejects everything.
Renamed a column from "Viral_Playbook" to "Playbook"? Cool.
Except the API still writes to "Viral_Playbook." Silently fails. No error. Just... nothing.
The lesson: Field renames in Airtable break workflows. Update both sides or suffer.
This happened twice before I learned. Now I treat field names like database schema changes — versioned, documented, and never touched casually.
Initially, I tried generating HTML reports. Standard stuff. But when stakeholders wanted PDFs, things got messy.
Attempt 1: WeasyPrint
Failed with dependency hell on macOS (cairo, pango, libffi conflicts).
Attempt 2: Cloud PDF converters
Worked sometimes. Failed when images were external paths. Unreliable.
Attempt 3 (Final): Playwright + base64-embedded images
Generate HTML with images as data URIs. Use headless Chromium to render. Export as PDF.
Result: Self-contained 8.5MB PDFs that work everywhere.
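A sketch of that final approach; the `{{IMAGES}}` placeholder and the report template are assumptions, but base64 embedding plus Chromium's `page.pdf()` is the core of the pattern.

```python
import base64
from pathlib import Path
from playwright.sync_api import sync_playwright

def render_report_pdf(html_template: str, image_paths: list[str], out_path: str) -> None:
    """Embed images as data URIs, render with headless Chromium, export a PDF."""
    tags = []
    for p in image_paths:
        b64 = base64.b64encode(Path(p).read_bytes()).decode()
        tags.append(f'<img src="data:image/png;base64,{b64}" width="400"/>')
    html = html_template.replace("{{IMAGES}}", "\n".join(tags))  # hypothetical placeholder

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="load")
        page.pdf(path=out_path, format="A4", print_background=True)
        browser.close()
```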
Sometimes the "hacky" solution (base64 + headless browser) is more reliable than the "proper" solution (dedicated PDF libraries).
First-run downloads of CLIP (>500MB) would time out on 30% of runs. The system appeared dead even though everything upstream worked.
The fix: Graceful fallback + retry logic + better error logging.
Now when CLIP fails, I get:
⚠️ CLIP unavailable: Connection timeout
→ Using fallback heuristic scoring for segment_fit
→ Continuing evaluation...
Not ideal, but the system keeps running. Production systems need degraded modes.
Don't run expensive models on garbage. Filter cheap first.
This seems obvious in retrospect, but my first version ran GPT-4o on everything. Including images that were obviously blurry, obviously unsafe, obviously misaligned.
The bill was painful. The learnings were valuable.
Metrics catch technical issues fast and cheap. LLMs catch marketing issues that metrics can't see.
The 93% vs. 7% gap proved this. Metrics approved nearly everything. LLM caught the issues that actually matter: unclear focal points, no room for text, generic brand presence.
You need both. Not one or the other.
The 5-lens framework works because it forces comprehensive evaluation. Each lens addresses a specific marketing concern.
Vague "is this good?" prompts produce vague "it looks nice but..." responses.
Structured frameworks produce actionable decisions with specific fixes.
If your system crashes when a model download times out, it's not production-ready.
Real systems degrade gracefully. Missing CLIP? Use fallback scoring. API timeout? Retry with exponential backoff. Field renamed? Log warning and continue.
Perfect isn't the goal. Reliable is.
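The retry part can be a ten-line helper. A minimal sketch (illustrative, not the exact production code):

```python
import random
import time

def with_retry(fn, attempts: int = 4, base_delay: float = 1.0):
    """Call fn(); on failure, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())

# e.g. with_retry(lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"))
```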
Don't guess. Measure. Optimize. Repeat.
Every week, I review 20 new images. Label them. Feed labels back to calibration. Thresholds drift as platforms change, audience preferences evolve, and campaign objectives shift.
The system that worked in January might be wrong in June. Continuous calibration keeps it accurate.
Before this system: 4 hours to review 100 images.
After: 20 minutes to review the 7 that passed all filters.
That's not a 12x speedup. It's a complete workflow change.
Creative teams stop being QA bottlenecks and start being strategic decision-makers.
At scale (1000 images/day with tiered evaluation), the whole pipeline runs for roughly $57/month.
The real question isn't cost — it's value.
One strong campaign created from these insights can pay for the entire system for a year. The return comes from repeatability, not precision accounting.
The architecture was designed to expand. Planned additions slot in as new branches; none of them require rethinking the core system.
The biggest shift here isn't automation. It's treating creative evaluation like a data problem.
Instead of "does this look good?" it's "does this match our segment? can we add text? will it grab attention?"
Instead of manually reviewing 100 images, you automatically filter to the top 10 that actually work.
Instead of starting from a blank page, creators start from proven structures with specific fixes.
That's the real win.
This project originated from a question posed by Prof. Rohit Aggarwal: "How to create an AI evaluation pipeline for generated images to determine which images are better for a certain audience?"
The foundational HEIM (Holistic Evaluation of Text-to-Image Models) benchmark metrics were developed by Stanford researchers; I adapted and extended them for marketing-specific use cases. During a milestone review, Prof. Aggarwal provided critical architectural feedback emphasizing that the evaluation pipeline should not depend on human intervention but should instead align its thresholds and decision logic with human preferences. This guidance led me to develop the LLM-based judgment system combined with a ground-truth dataset for fully automated decision-making.
MentorStudents.org provided access to OpenAI API credits for experimentation with GPT-4o vision and offered bi-weekly milestone check-ins to track project progress. They also provided a Claude prompt template for this write-up.
All system architecture decisions (three-layer cascade, decision pipeline, calibration system), technical implementation (code, debugging, prompt engineering, 5-lens framework), dataset creation (segment definitions, test images, human labeling), and interface design were executed independently. My contributions beyond the inherited HEIM framework include: (1) segment-specific CLIP alignment scoring, (2) the 5-lens marketing evaluation framework for LLM judgment, (3) the automated decision pipeline with human-in-the-loop calibration, (4) integration of quantitative metrics with qualitative LLM assessment, and (5) production-ready tooling for report generation and threshold optimization.
I’m Yash Kothari, a graduate student at Purdue studying Business Analytics and Information Management. Before Purdue, I spent a few years at Amazon leading ML-driven catalog programs that freed up $20M in working capital, and more recently built GenAI automation pipelines at Prediction Guard using LangChain and RAG. I enjoy taking complex systems, whether it’s an AI model or a finance workflow, and turning them into simple, repeatable automations that actually work in the real world.
Dr. Rohit Aggarwal is a professor, AI researcher, and practitioner. His research focuses on two complementary themes: how AI can augment human decision-making by improving learning, skill development, and productivity, and how humans can augment AI by embedding tacit knowledge and contextual insight to make systems more transparent, explainable, and aligned with human preferences. He has done AI consulting for many startups, SMEs, and publicly listed companies. He has helped many companies integrate AI-based workflow automations across functional units and developed conversational AI interfaces that enable users to interact with systems through natural dialogue.