Table of Contents


TL;DR

The Creative Evaluation Problem Nobody Talks About

What I Built (And Why It's Not Just "Run CLIP and Call It a Day")

Layer 1: The Fast Filter (Or: Don't Pay for Blurry Images)

Layer 2: HEIM Metrics (Or: Why Stanford's Research Matters for Production)

    What is HEIM?
    1. CLIP Score (Image-Text Alignment)
    2. Aesthetics + Safety (The Gates)
    3. The Fallback Strategy (Boring But Critical)

Layer 3: LLM Marketing Judgment (Or: The Part That Actually Works)

    The Framework
    The Prompt (And Why It Took Three Iterations)

The Results: The 93% vs 7% Gap

    The 86-point gap.
    Common Rejection Patterns

How to Actually Build This (Strategic Guidance, Not Code Dumps)

    Start with Layer 1, Run It Everywhere
    Add HEIM Metrics One at a Time
    LLM Judgment Comes Last
    The Labeling Interface Matters More Than You Think

Production Lessons (The Stuff That Actually Broke)

    1. Tiered Evaluation Saves 91% on Costs
    2. Human Calibration Is Non-Negotiable
    3. Dedupe Logic Is a Silent Killer
    4. Airtable Field Names Are API Contracts
    5. The PDF Generation Saga
    6. CLIP Model Downloads Timeout

What This Taught Me About AI Systems

    1. Cascade Your Intelligence
    2. Metrics + LLMs Are Better Than Either Alone
    3. Structured Prompts Beat Generic Ones By 10x
    4. Production Means Fallbacks, Not Perfection
    5. Human Calibration > Arbitrary Thresholds
    6. The Real Bottleneck Is Always Human Review (Until It Isn't)

The Cost Reality

What's Next

Final Thoughts

Acknowledgments

About the Author

Building an AI-Powered Creative QA System: Combining HEIM Metrics with LLM-Based Marketing Judgment

How I built a system that evaluates AI-generated images like a marketing director — using Stanford's HEIM benchmark and structured LLM prompts

Artificial Intelligence
Yash Kothari
Rohit Aggarwal

TL;DR

  • Built a 3-layer evaluation cascade: technical metrics → HEIM safety checks → LLM marketing judgment
  • Metrics approved 93% of images. LLM approved 7%. The gap = everything metrics miss.
  • Segment-specific CLIP alignment + 5-lens LLM framework = marketing effectiveness scoring
  • Tiered evaluation saves 91% on costs vs. LLM-only
  • Production requires fallbacks, human calibration, and graceful degradation
  • Total cost: ~$80/month for 1000 images/day

The Creative Evaluation Problem Nobody Talks About

Here's the thing. Generative AI made image creation trivial. Sora, Midjourney, DALL-E — they all produce hundreds of variations in minutes.

But evaluation? Still a human bottleneck.

A creative director reviews maybe 5-6 images per hour if they're being thorough — checking brand alignment, emotional tone, composition, segment fit, whether you can even add a headline without covering the product. At that rate, getting through 100 images eats up two full workdays. And if you're running segment-targeted campaigns (Gen Z vs. Parents vs. Professionals), multiply that by three.

Most teams solve this by either accepting mediocre AI outputs because manual review is too slow, or burning out their creative teams with repetitive pixel-peeping.

This article is about a third option: building an automated system that evaluates marketing effectiveness, not just technical quality — by combining research-backed metrics with LLM-powered judgment.

The result? A three-layer pipeline that processes images in seconds, catches issues metrics alone miss, and generates actionable feedback creators can actually use.

What I Built (And Why It's Not Just "Run CLIP and Call It a Day")

The architecture is deliberately simple: cascade from cheap filters to expensive judgment.

Layer 1: Technical Metrics (instant, free)
Basic computer vision checks — sharpness, composition, color energy. Filters ~20% of obvious failures immediately.

Layer 2: HEIM-Inspired AI Metrics (2-3 seconds, free)
Safety and alignment checks using Stanford's HEIM benchmark framework — CLIP alignment, aesthetics scoring, watermark detection, NSFW filtering.

Layer 3: LLM Marketing Judgment (5 seconds, $0.01/image)
GPT-4o evaluates through five marketing lenses to simulate creative director review.

Why this cascade matters: You only spend money on images that pass the cheap filters. At 1000 images/day, that cuts a roughly $450/month LLM-only bill down to under $60.
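In code, the cascade is just a chain of early returns. A minimal sketch, assuming hypothetical wrappers (technical_score, heim_scores, llm_review) around the three layers:

def evaluate_image(image_path: str, segment: str) -> dict:
    # Layer 1: free, instant technical checks
    tech = technical_score(image_path)              # hypothetical: sharpness, composition, color energy
    if tech < 3.0:
        return {"decision": "REJECT", "stage": "technical", "technical": tech}

    # Layer 2: HEIM-inspired safety and alignment gates
    heim = heim_scores(image_path, segment)         # hypothetical: returns clip, aesthetics, nsfw_risk
    if heim["nsfw_risk"] >= 0.20 or heim["clip"] < 0.18:
        return {"decision": "REJECT", "stage": "heim", **heim}

    # Layer 3: paid LLM judgment, only for images that survived the cheap filters
    return llm_review(image_path, segment, tech, heim)   # hypothetical GPT-4o call (~$0.01/image)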

But the real innovation isn't the layers — it's what happens in Layer 3.
 

Layer 1: The Fast Filter (Or: Don't Pay for Blurry Images)

Technical metrics are straightforward computer vision: sharpness via Laplacian variance, composition via edge density, color energy via saturation. Nothing fancy.

The threshold is simple: technical quality below 3.0/5 gets rejected immediately. No API calls. No analysis. Just filtered out.

This catches maybe 20% of images — the obviously blurry, flat, washed-out ones that no amount of marketing genius can save.

The code is pure OpenCV and NumPy. Runs at ~1000 images/second on a MacBook M2. The entire layer costs $0 and takes <100ms per image.
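For a sense of what that looks like, here is a rough sketch of the checks (any scaling of these raw values onto the 0-5 quality score is omitted; names are illustrative):

import cv2
import numpy as np

def technical_metrics(image_path: str) -> dict:
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Sharpness: variance of the Laplacian (low variance = blurry)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Composition proxy: fraction of pixels that register as edges
    edges = cv2.Canny(gray, 100, 200)
    edge_density = float(np.count_nonzero(edges)) / edges.size

    # Color energy: mean saturation in HSV space, normalized to 0-1
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    color_energy = float(hsv[:, :, 1].mean()) / 255.0

    return {"sharpness": sharpness, "edge_density": edge_density, "color_energy": color_energy}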

What I learned: Don't overthink this part. The goal is speed, not sophistication. Save the intelligence budget for later.

Layer 2: HEIM Metrics (Or: Why Stanford's Research Matters for Production)

What is HEIM?

Stanford's Holistic Evaluation of Text-to-Image Models is a benchmark that evaluates AI image generation across 12 aspects: alignment, quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.

It's comprehensive. Maybe too comprehensive for production.

So I cherry-picked four critical metrics:

1. CLIP Score (Image-Text Alignment)

CLIP measures semantic similarity between images and text. Standard stuff. But here's where it gets interesting.

The Problem: Generic CLIP measures "does this image match the prompt." For marketing, you need "does this image match the segment aesthetic?"

The Solution: Build segment-specific descriptors.

Instead of scoring against "a marketing image," I score against:

  • "An ad image that appeals to Gen Z. Keywords: vibrant, trendy, social, authentic. Energetic, colorful images with clear focal points."
  • "An ad image that appeals to Parents. Keywords: warm, trustworthy, family-oriented, safe. Warm, authentic images showing trust signals."
  • "An ad image that appeals to Professionals. Keywords: clean, modern, minimal, premium. Polished, minimalist images with clear messaging."

This single change transformed CLIP from a generic alignment check into a segment fitness predictor.

And the thresholds matter: Gen Z needs 3.3+ segment fit. Parents need 3.4+. Professionals prioritize composition (3.4+) over color energy. Different audiences = different standards.
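A minimal sketch of that scoring, assuming the Hugging Face CLIP checkpoint openai/clip-vit-base-patch32 (the production model may differ) and the descriptor strings above:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

SEGMENT_DESCRIPTORS = {
    "gen_z": "An ad image that appeals to Gen Z. Keywords: vibrant, trendy, social, authentic. Energetic, colorful images with clear focal points.",
    "parents": "An ad image that appeals to Parents. Keywords: warm, trustworthy, family-oriented, safe. Warm, authentic images showing trust signals.",
    "professionals": "An ad image that appeals to Professionals. Keywords: clean, modern, minimal, premium. Polished, minimalist images with clear messaging.",
}

_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def segment_clip_score(image_path: str, segment: str) -> float:
    """Cosine similarity between the image and its segment descriptor."""
    inputs = _processor(text=[SEGMENT_DESCRIPTORS[segment]],
                        images=Image.open(image_path),
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        out = _model(**inputs)
    # image_embeds and text_embeds come back L2-normalized, so the dot product is cosine similarity
    return (out.image_embeds @ out.text_embeds.T).item()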

2. Aesthetics + Safety (The Gates)

Pre-trained models handle this: an aesthetics predictor scores visual appeal (0-5), and an NSFW detector flags safety risks (0-1, lower = safer).

The safety gate is non-negotiable: NSFW risk ≥ 0.20 → automatic rejection. No debate. No LLM review. Just rejected.

3. The Fallback Strategy (Boring But Critical)

Here's a lesson nobody tells you: production systems need degraded modes.

If CLIP downloads fail, or models timeout, the system doesn't crash. It falls back to heuristic scoring:

  • Gen Z: 60% color energy + 40% composition
  • Parents: 50% composition + 30% technical + 20% color
  • Professionals: 60% composition + 40% aesthetics

Is it perfect? No. But it's better than returning nothing because an API timed out.
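In code, the fallback is just a weighted blend using the weights above (the input scores are assumed to already share a 0-5 scale):

FALLBACK_WEIGHTS = {
    "gen_z":         {"color_energy": 0.6, "composition": 0.4},
    "parents":       {"composition": 0.5, "technical": 0.3, "color_energy": 0.2},
    "professionals": {"composition": 0.6, "aesthetics": 0.4},
}

def fallback_segment_fit(scores: dict, segment: str) -> float:
    """Heuristic segment-fit score used when the CLIP path is unavailable."""
    return sum(scores[name] * weight for name, weight in FALLBACK_WEIGHTS[segment].items())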

What broke: The first time I ran this at scale, CLIP downloads timed out on 30% of runs. The system appeared dead even though upstream scrapers were fine.

The fix: Graceful fallback + better error logging. Now when CLIP fails, I get a warning, not a silent crash.

Layer 3: LLM Marketing Judgment (Or: The Part That Actually Works)

This is where metrics stop and marketing starts.

Sharpness, CLIP scores, aesthetics — they tell you if an image is technically acceptable. They don't tell you if it will convert.

That requires understanding:

  • Does this grab attention at scroll speed?
  • Does the emotional tone match campaign goals?
  • Can you add a headline without covering the product?
  • Is this distinctly on-brand or generic stock photo vibes?
  • Will the target segment actually relate to this?

Enter: The 5-Lens Marketing Framework.

The Framework

I force GPT-4o to evaluate every image through five mandatory lenses:

  1. Visual Saliency & Focal Point — What grabs attention first? Is the product the hero?
  2. Emotional & Psychological Tone — Does the vibe match the objective and audience?
  3. Compositional Balance & Negative Space — Can you add text without a mess?
  4. Brand Consistency & Recognizability — Is this distinctly on-brand?
  5. Contextual & Demographic Relevance — Will the segment relate to this?

Each lens gets a status: PASS, WEAK, or FAIL.

The decision rule: If 2+ lenses FAIL → automatic rejection. No exceptions.

This isn't arbitrary. It's the minimum bar for a marketing image to work in a real campaign.

The Prompt (And Why It Took Three Iterations)

Iteration 1: "Is this a good marketing image for Gen Z?"

Result: Too generic. LLM approved everything. No actionable feedback.

Iteration 2: Added a scoring rubric and requested JSON output.

Result: Better structure, but still too lenient. Everything scored 3.5-4.5. No variance.

Iteration 3 (Final): The 5-lens framework with hard fails and segment-specific minimums.

Here's the structure (simplified):

You are a senior performance marketer reviewing AI-generated images.

INPUT DATA:
- Segment: {segment}
- Technical quality: {score}/5
- Segment fit: {score}/5
- CLIP score: {score}
- NSFW risk: {score}

HARD FAILS (immediate reject):
- NSFW >= 0.20
- CLIP < 0.18
- Technical < 3.0

SEGMENT REQUIREMENTS:
[Gen Z / Parents / Professionals specific thresholds]

EVALUATE USING 5 LENSES (mandatory):
1. Visual Saliency & Focal Point - PASS/WEAK/FAIL
2. Emotional & Psychological Tone - PASS/WEAK/FAIL
3. Compositional Balance & Negative Space - PASS/WEAK/FAIL
4. Brand Consistency & Recognizability - PASS/WEAK/FAIL
5. Contextual & Demographic Relevance - PASS/WEAK/FAIL

If 2+ lenses FAIL → cannot approve.

OUTPUT (JSON only):
{
  "decision": "APPROVE | NEEDS_REVIEW | REJECT",
  "lens_evaluation": [...],
  "issues": [...],
  "regen_prompt": {...}
}

What changed: Adding the "think like a marketer" persona and explicit hard fails made all the difference. The LLM stopped being polite and started being useful.

What broke: Early versions returned inconsistent JSON. Sometimes valid, sometimes broken with extra commentary.

The fix: Strict JSON schema enforcement via response_format={"type": "json_object"} in the API call. Problem solved.
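For context, the call looks roughly like this with the official openai Python client (prompt assembly simplified; the assumption that each lens_evaluation item carries a "status" field is mine):

import base64
import json
from openai import OpenAI

client = OpenAI()

def llm_review(image_path: str, prompt: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},   # forces valid JSON, no extra commentary
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    verdict = json.loads(response.choices[0].message.content)

    # Enforce the decision rule locally as well: 2+ failed lenses can never be approved
    fails = sum(1 for lens in verdict.get("lens_evaluation", []) if lens.get("status") == "FAIL")
    if fails >= 2:
        verdict["decision"] = "REJECT"
    return verdict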

The Results: The 93% vs 7% Gap

I tested 15 images (5 per segment). Here's what happened:

Evaluation Method          | Approved    | Rejected
Metrics Only (Layers 1+2)  | 14/15 (93%) | 1/15 (7%)
With LLM (Layer 3)         | 1/15 (7%)   | 9/15 (60%)

(The remaining 5 LLM-evaluated images were flagged NEEDS_REVIEW.)

The 86-point gap.

Metrics approved nearly everything because they only measure technical quality. The LLM caught real marketing flaws:

Example: Gen Z #1 Image

  • Metrics: Total 3.35/5, Aesthetics 3.37, CLIP 0.24 → ✅ APPROVE
  • LLM: ❌ REJECT

Why?

  • Compositional Balance: FAIL — "Too busy with people and graffiti. No room to add headlines or CTA."
  • Brand Consistency: WEAK — "Product not prominent enough. Could belong to multiple competitors."
  • Fix (Priority 1): "Increase color vibrancy, simplify background, make product the hero."

A technically perfect image. Marketing-ineffective.

Common Rejection Patterns

Across the 9 rejected images:

  1. No negative space (67%) — Can't add text without covering key elements
  2. Generic brand presence (56%) — Looks like stock photography, not distinctive
  3. Unclear focal point (44%) — Eye doesn't know where to land at scroll speed
  4. Segment mismatch (33%) — Wrong emotional tone or demographic vibe

These are issues only an LLM catches. Metrics don't know what "on-brand" means.

 

How to Actually Build This (Strategic Guidance, Not Code Dumps)

Here's the honest truth: building this isn't hard technically. It's hard architecturally.

Start with Layer 1, Run It Everywhere

Don't build all three layers at once. Start with technical metrics only. Run it on every image in your library. Get a feel for what the thresholds actually mean.

Is 3.0 too strict? Too loose? Does composition_clarity correlate with what your team calls "good"?

This exploration phase costs $0 and teaches you more than any tutorial.

Add HEIM Metrics One at a Time

CLIP first. Get segment-specific descriptors working. Validate that Gen Z images actually score differently than Professional images.

Then aesthetics. Then NSFW. Then watermark detection.

Each metric should make the system noticeably better. If adding a metric doesn't change decisions, remove it.

LLM Judgment Comes Last

Only add LLM evaluation once you've filtered down to candidates that might be good.

Why? Because the LLM is expensive and slow. If you're running it on blurry images that should've been rejected in Layer 1, you're burning money.

Get the cascade working first. LLM judgment is the cherry on top, not the foundation.

The Labeling Interface Matters More Than You Think

You need human labels to calibrate. But how you collect them determines everything.

I built a dead-simple web UI: show image, show metrics, two buttons (APPROVE / REJECT). That's it.

No fancy features. No "maybe" option. No multi-step wizard.

Binary decisions force clarity. And clarity makes calibration possible.

After 50 labeled images per segment, patterns emerge. You'll see which metrics actually predict human approval. Those insights feed back into thresholds.

 

Production Lessons (The Stuff That Actually Broke)

1. Tiered Evaluation Saves 91% on Costs

The naive approach: Run GPT-4o on every image.

Cost: 1000 images/day × $0.015 = $15/day = $450/month

The smart approach: Cascade through cheaper checks first.

  • Tier 1: Technical metrics → Filter 20%
  • Tier 2: HEIM metrics → Filter 50%
  • Tier 3: GPT-4o-mini → Filter 20% more
  • Tier 4: GPT-4o for final 10%

Cost: ~$1.90/day = $57/month

Savings: 91%

This isn't theoretical. I ran both approaches. The tiered version saved $393/month while maintaining the same accuracy.

2. Human Calibration Is Non-Negotiable

Initial thresholds were arbitrary (total ≥ 3.0, aesthetics ≥ 2.5). They worked okay, but not great.

The solution: Collect human labels. Optimize thresholds to maximize F1 score.

I built a simple labeling UI where reviewers mark images APPROVE/REJECT. Then used scipy.optimize to find thresholds that match human judgment.

Result: Accuracy improved from ~65% to ~85% within a month.

But here's what surprised me: the optimal thresholds weren't what I expected. Gen Z required lower aesthetics (2.8) than Professionals (3.4), because authenticity beats polish for that segment.

You can't guess your way to those insights. You have to measure.
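For reference, a minimal sketch of the calibration step for a single metric. It assumes the labels arrive as parallel arrays of metric scores and boolean human approve/reject decisions, and it brute-forces the threshold that maximizes F1 (the production version tuned several thresholds jointly with scipy.optimize):

import numpy as np
from scipy import optimize
from sklearn.metrics import f1_score

def calibrate_threshold(metric_scores: np.ndarray, human_approved: np.ndarray) -> float:
    """Find the cutoff on one metric that best reproduces human APPROVE/REJECT labels."""
    def neg_f1(threshold):
        predicted = metric_scores >= threshold
        return -f1_score(human_approved, predicted)

    # F1 vs. threshold is a step function, so a dense grid beats gradient-based optimizers here
    best = optimize.brute(neg_f1, ranges=[(0.0, 5.0)], Ns=200, finish=None)
    return float(best)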

3. Dedupe Logic Is a Silent Killer

One of the scariest failures: Everything upstream worked perfectly. Scrapers ran. Data flowed. But zero images appeared in the output.

The culprit? The deduplication node was checking the wrong field. Every image looked "already processed."

The debugging trick: Temporarily disable dedupe. If images suddenly appear, you know the logic is the problem, not the data.

This failure mode is particularly insidious because there's no error. The system just... quietly rejects everything.

4. Airtable Field Names Are API Contracts

Renamed a column from "Viral_Playbook" to "Playbook"? Cool.

Except the API still writes to "Viral_Playbook." Silently fails. No error. Just... nothing.

The lesson: Field renames in Airtable break workflows. Update both sides or suffer.

This happened twice before I learned. Now I treat field names like database schema changes — versioned, documented, and never touched casually.

5. The PDF Generation Saga

Initially, I tried generating HTML reports. Standard stuff. But when stakeholders wanted PDFs, things got messy.

Attempt 1: WeasyPrint
Failed with dependency hell on macOS (cairo, pango, libffi conflicts).

Attempt 2: Cloud PDF converters
Worked sometimes. Failed when images were external paths. Unreliable.

Attempt 3 (Final): Playwright + base64-embedded images
Generate HTML with images as data URIs. Use headless Chromium to render. Export as PDF.

Result: Self-contained 8.5MB PDFs that work everywhere.
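The core of that approach, sketched with Playwright's sync API (report HTML assembly omitted; only the embed-and-render part is shown):

import base64
from pathlib import Path
from playwright.sync_api import sync_playwright

def image_data_uri(path: str) -> str:
    """Inline an image as a base64 data URI so the final PDF has no external dependencies."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:image/png;base64,{encoded}"

def html_to_pdf(html: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="networkidle")   # everything is inline, nothing external to fetch
        page.pdf(path=out_path, format="A4", print_background=True)
        browser.close()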

Sometimes the "hacky" solution (base64 + headless browser) is more reliable than the "proper" solution (dedicated PDF libraries).

6. CLIP Model Downloads Timeout

First-run downloads of CLIP (>500MB) would time out on 30% of runs. The system appeared dead even though everything upstream worked.

The fix: Graceful fallback + retry logic + better error logging.
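The wrapper is small: retry the CLIP path a couple of times, then drop to the heuristic blend instead of raising (a sketch reusing the hypothetical scorers from earlier; backoff values are illustrative):

import logging
import time

log = logging.getLogger(__name__)

def segment_fit_with_fallback(image_path: str, segment: str, scores: dict, retries: int = 2) -> float:
    for attempt in range(retries):
        try:
            return segment_clip_score(image_path, segment)    # CLIP-based scorer from Layer 2
        except Exception as exc:                              # download timeouts, missing weights, etc.
            log.warning("CLIP unavailable (%s), attempt %d/%d", exc, attempt + 1, retries)
            time.sleep(2 ** attempt)                          # simple exponential backoff
    log.warning("Using fallback heuristic scoring for segment_fit")
    return fallback_segment_fit(scores, segment)              # heuristic blend from the fallback section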

Now when CLIP fails, I get:

⚠️  CLIP unavailable: Connection timeout

→ Using fallback heuristic scoring for segment_fit

→ Continuing evaluation...

Not ideal, but the system keeps running. Production systems need degraded modes.

 

What This Taught Me About AI Systems

1. Cascade Your Intelligence

Don't run expensive models on garbage. Filter cheap first.

This seems obvious in retrospect, but my first version ran GPT-4o on everything. Including images that were obviously blurry, obviously unsafe, obviously misaligned.

The bill was painful. The learnings were valuable.

2. Metrics + LLMs Are Better Than Either Alone

Metrics catch technical issues fast and cheap. LLMs catch marketing issues that metrics can't see.

The 93% vs. 7% gap proved this. Metrics approved nearly everything. LLM caught the issues that actually matter: unclear focal points, no room for text, generic brand presence.

You need both. Not one or the other.

3. Structured Prompts Beat Generic Ones By 10x

The 5-lens framework works because it forces comprehensive evaluation. Each lens addresses a specific marketing concern.

Vague "is this good?" prompts produce vague "it looks nice but..." responses.

Structured frameworks produce actionable decisions with specific fixes.

4. Production Means Fallbacks, Not Perfection

If your system crashes when a model download times out, it's not production-ready.

Real systems degrade gracefully. Missing CLIP? Use fallback scoring. API timeout? Retry with exponential backoff. Field renamed? Log warning and continue.

Perfect isn't the goal. Reliable is.

5. Human Calibration > Arbitrary Thresholds

Don't guess. Measure. Optimize. Repeat.

Every week, I review 20 new images. Label them. Feed labels back to calibration. Thresholds drift as platforms change, audience preferences evolve, and campaign objectives shift.

The system that worked in January might be wrong in June. Continuous calibration keeps it accurate.

6. The Real Bottleneck Is Always Human Review (Until It Isn't)

Before this system: 4 hours to review 100 images.

After: 20 minutes to review the 7 that passed all filters.

That's not a 12x speedup. It's a complete workflow change.

Creative teams stop being QA bottlenecks and start being strategic decision-makers.

 

The Cost Reality

At scale (1000 images/day with tiered evaluation):

  • Infrastructure: ~$60/month (API calls + cloud storage)
  • OpenAI credits: ~$20/month (experimentation)
  • Total: ~$80/month

The real question isn't cost — it's value.

One strong campaign created from these insights can pay for the entire system for a year. The return comes from repeatability, not precision accounting.

 

What's Next

The architecture was designed to expand. Planned additions:

  1. Auto-regeneration loop — When LLM rejects, automatically regenerate with its suggested improvements
  2. A/B test integration — Feed campaign performance back to calibrate thresholds
  3. Multi-modal context — Consider landing pages, competitor creatives, historical winners
  4. Fusion model — Train a classifier combining metrics + LLM assessments for 10x speed

None of these require rethinking the core system. They slot in as new branches.
 

Final Thoughts

The biggest shift here isn't automation. It's treating creative evaluation like a data problem.

Instead of "does this look good?" it's "does this match our segment? can we add text? will it grab attention?"

Instead of manually reviewing 100 images, you automatically filter to the top 10 that actually work.

Instead of starting from a blank page, creators start from proven structures with specific fixes.

That's the real win.

 

Acknowledgments

This project originated from a question posed by Prof. Rohit Aggarwal: "How to create an AI evaluation pipeline for generated images to determine which images are better for a certain audience?"

The foundational HEIM (Holistic Evaluation of Text-to-Image Models) benchmark metrics were developed by Stanford researchers; I adapted and extended them for marketing-specific use cases. During a milestone review, Prof. Aggarwal provided critical architectural feedback emphasizing that the evaluation pipeline should not depend on human intervention but should instead align its thresholds and decision logic with human preferences. This guidance led me to develop the LLM-based judgment system, combined with a ground-truth dataset, for fully automated decision-making.

MentorStudents.org provided access to OpenAI API credits for experimentation with GPT-4o vision, offered bi-weekly milestone check-ins to track project progress, and provided a Claude prompt template for this write-up.

All system architecture decisions (three-layer cascade, decision pipeline, calibration system), technical implementation (code, debugging, prompt engineering, 5-lens framework), dataset creation (segment definitions, test images, human labeling), and interface design were executed independently. My contributions beyond the inherited HEIM framework include: (1) segment-specific CLIP alignment scoring, (2) the 5-lens marketing evaluation framework for LLM judgment, (3) the automated decision pipeline with human-in-the-loop calibration, (4) integration of quantitative metrics with qualitative LLM assessment, and (5) production-ready tooling for report generation and threshold optimization.

About the Author

I’m Yash Kothari, a graduate student at Purdue studying Business Analytics and Information Management. Before Purdue, I spent a few years at Amazon leading ML-driven catalog programs that freed up $20M in working capital, and more recently built GenAI automation pipelines at Prediction Guard using LangChain and RAG. I enjoy taking complex systems, whether it’s an AI model or a finance workflow, and turning them into simple, repeatable automations that actually work in the real world.

Dr. Rohit Aggarwal is a professor, AI researcher and practitioner. His research focuses on two complementary themes: how AI can augment human decision-making by improving learning, skill development, and productivity, and how humans can augment AI by embedding tacit knowledge and contextual insight to make systems more transparent, explainable, and aligned with human preferences. He has done AI consulting for many startups, SMEs and publicly listed companies. He has helped many companies integrate AI-based workflow automations across functional units, and developed conversational AI interfaces that enable users to interact with systems through natural dialogue.