Designing the AI Skincare Assistant: Building the Proof of Concept


TL;DR:

Built a logic-based AI skincare assistant using structured ingredient data and strict prompts, then tested it with 50 real-world queries to see whether it could stay accurate and auditable without guessing.



This post builds on the original project overview. For full context, see the original case study. → Original case study


How I Simulated RAG Training Using ChatGPT

This post breaks down how I set up the AI system to act less like a chatbot and more like a rules-based ingredient analyst.

The goal was to simulate a RAG-style setup using ChatGPT, where the model could only respond if structured data was present. No hallucinating, no paraphrasing, no “clean beauty” filler language. Just logic.
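To make that constraint concrete, here is the gist of the standing instruction the model worked under. This is a paraphrase of my rules, reconstructed for illustration, not the verbatim prompt:

```python
# Paraphrased gist of the system prompt used for every query -- the wording
# here is my reconstruction, not the exact prompt text.

SYSTEM_PROMPT = """\
You are a rules-based skincare ingredient analyst.
Answer ONLY from the structured ingredient data supplied with the query.
Quote tag labels exactly (e.g., "Oil-Compatible"); never paraphrase them.
If a field is missing, output 🟡 and state that the data is not tagged.
Never speculate, soften claims, or use marketing language.
"""
```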

Each ingredient was manually tagged across multiple fields (a sample record follows the list):

  • Function

  • Mechanism

  • Skin compatibility

  • Comedogenicity

  • Pairings

  • Risks
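Here is what one tagged record looked like in spirit. The exact schema and values below are illustrative assumptions, not entries pulled from my dataset, but each bullet above maps to a field:

```python
# An illustrative tagged record -- field names mirror the list above, but
# the schema and values are assumptions, not my production data.

niacinamide = {
    "name": "Niacinamide",
    "function": "Barrier support; sebum regulation",
    "mechanism": "NAD+ precursor; supports ceramide synthesis",
    "skin_compatibility": ["acne-prone", "oily", "sensitive"],
    "comedogenicity": 0,                      # rated on the usual 0-5 scale
    "pairings": ["Hyaluronic Acid", "Zinc PCA"],
    "risks": ["Concentrations above ~10% may irritate sensitive skin"],
}
```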

I wrote 50 prompts simulating the kinds of questions someone might actually ask, with edge cases, dual-condition filters, and phrasing designed to break the logic. Then I audited every output against the original constraints (the checks are sketched after this list):

  • Only use data that’s tagged — no guessing

  • Stick to exact labels (e.g., “Oil-Compatible”)

  • Use fallback icons (🟡, 🚫, ⚠) for gaps or flags

  • Never paraphrase or reword

  • If it doesn’t know, it has to say that
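The audit itself was done by hand, but each constraint reduces to a mechanical check. A minimal sketch of that pass, assuming the illustrative record schema above:

```python
# A minimal sketch of the audit checks -- I audited by hand, but each rule
# is mechanical. Field names follow the illustrative record above.

FALLBACK_ICONS = {"🟡", "🚫", "⚠"}

def audit_answer(answer: str, record: dict | None, exact_labels: set[str]) -> list[str]:
    """Return the list of constraint violations for one AI answer."""
    violations: list[str] = []

    # No guessing: a query with no tagged record must fall back to an icon.
    if record is None:
        if not any(icon in answer for icon in FALLBACK_ICONS):
            violations.append("missing-data query answered without a fallback icon")
        return violations

    # Exact labels: any tagged label the answer touches must appear verbatim --
    # a case-shifted form like "oil-compatible" counts as a reworded label.
    for label in exact_labels:
        if label.lower() in answer.lower() and label not in answer:
            violations.append(f"label paraphrased instead of quoted: {label}")

    # Surfacing flags: tagged risks must appear under the ⚠ marker.
    if record.get("risks") and "⚠" not in answer:
        violations.append("tagged risk omitted or shown without ⚠")

    return violations
```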

Every output had to follow the same format (a formatter sketch follows the list):

  • 🧬 What the ingredient does

  • ✅ What it pairs well with

  • 📌 Any best practices

  • ⚠ Risks or controversial use

  • 🟡 Or 🚫 if it was missing data or broke a rule
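Under the hood this is just a fixed template over the record fields, with 🟡 standing in wherever a field is empty. A sketch, assuming the same schema:

```python
# The fixed response template as a formatter -- section order mirrors the
# list above; any missing field degrades to 🟡 instead of a guess.

def format_response(record: dict) -> str:
    def field(key: str) -> str:
        value = record.get(key)
        if not value:
            return "🟡 No tagged data"
        return ", ".join(value) if isinstance(value, list) else str(value)

    return "\n".join([
        f"🧬 Function: {field('function')}",
        f"✅ Pairs well with: {field('pairings')}",
        f"📌 Best practices: {field('best_practices')}",
        f"⚠ Risks: {field('risks')}",
    ])
```

Running this on the niacinamide record above would print 🟡 for best practices, since that field was never tagged, which is exactly the fallback behavior the audit checks for.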

I also added hard lines around tone. No soft guesses. No reworded labels. No pretending data exists if it doesn’t.

Testing the AI

I simulated 50 real-world queries of the kind users might actually ask. They were designed to test factual recall, multi-field logic, compatibility evaluation, and the system’s ability to gracefully handle ambiguity or unsupported data, and they were crafted to pressure-test the AI’s logical limits (a filter sketch follows the list):

  • Dual-condition filters (“Soothing but non-comedogenic for acne-prone skin”)

  • Compatibility queries with asymmetrical tags

  • Debated claims like parabens, alcohols, silicones
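Dual-condition queries are the interesting case: an ingredient only passes if every condition resolves against tagged data, and a missing tag counts as a miss, not a maybe. A sketch of that filter logic over the assumed schema:

```python
# How a dual-condition query ("soothing but non-comedogenic for acne-prone
# skin") reduces to tag intersection. Missing fields exclude an ingredient
# rather than letting the model guess -- hence the pessimistic defaults.

def soothing_non_comedogenic(records: list[dict]) -> list[str]:
    return [
        r["name"]
        for r in records
        if "soothing" in r.get("function", "").lower()
        and r.get("comedogenicity", 5) <= 1        # default 5 = excluded
        and "acne-prone" in r.get("skin_compatibility", [])
    ]
```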

 

Results:

47 of 50 AI answers (94%) adhered to the prompt policy.

Below is the audit of the AI’s responses:

Overall Data

Evaluation results from 50 AI queries, graded for accuracy, tone compliance, and proper fallback behavior.

Behaviors Noticed in Responses

Breakdown of evaluation criteria used to audit AI responses, grouped by logic behavior, formatting, and compliance with structured rules.

Responses That Had Errors or Need Potential Fixes

Fields Used and Frequency of Use

Frequency of field usage across 50 AI responses, highlighting which data tags the model relied on most during evaluation.

 

What’s Next:

I’m expanding the dataset to 300+ ingredients, layering in safety and regulatory tags, and refining how the AI handles tone when the science gets more complex. Eventually, I want to build a lightweight version for people to actually use.

 

Phases:

Phase 1: Testing + Iteration (Current Phase)

  • Integrate with LLM or retrieval-augmented model

  • Conduct usability testing on language clarity and risk phrasing

  • Align outputs with accessibility and enterprise design systems

Phase 2: User Testing + Iteration

  • Begin user testing with real ingredient lists

  • Test upload and paste workflows

  • Build a trust-feedback loop to guide future improvements

Future Plans

  • Layer in user profiles (e.g., acne-prone, sensitive skin, rosacea)

  • Enable “smart” questions (e.g., “Is this pregnancy safe?”)

  • Develop a Chrome extension for ingredient popovers while browsing
