Designing the AI Skincare Assistant: Building the Proof of Concept
TL;DR:
I built a logic-based AI skincare assistant using structured ingredient data and strict prompts, then tested it with 50 real-world queries to see whether it could stay accurate and auditable without guessing.
This post builds on the original project overview. You can read the full context here. → Original case study
How I Simulated RAG Training Using ChatGPT
This post breaks down how I set up the AI system to act less like a chatbot and more like a rules-based ingredient analyst.
The goal was to simulate a RAG-style setup using ChatGPT, where the model could only respond if structured data was present. No hallucinating, no paraphrasing, no “clean beauty” filler language. Just logic.
Each ingredient was manually tagged across multiple fields:
Function
Mechanism
Skin compatibility
Comedogenicity
Pairings
Risks
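As a sketch of the tagging scheme, here is what one structured entry might look like in Python. The field names mirror the list above; the ingredient and its values are illustrative examples, not a dump of the real dataset.

```python
# Hypothetical example of one manually tagged ingredient record.
# Field names follow the tagging scheme; the values are illustrative.
niacinamide = {
    "name": "Niacinamide",
    "function": "Barrier support, brightening",
    "mechanism": "Supports ceramide synthesis; reduces melanosome transfer",
    "skin_compatibility": ["Oil-Compatible", "Sensitive-Safe"],
    "comedogenicity": 0,  # rated on a 0-5 scale
    "pairings": ["Hyaluronic Acid", "Zinc"],
    "risks": ["Possible flushing at high concentrations"],
}
```

Keeping every ingredient in the same rigid shape is what makes the "only use data that's tagged" rule enforceable: if a key is empty, the model has nothing to cite.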
I wrote 50 prompts simulating the kinds of questions someone might actually ask, with edge cases, dual-condition filters, and phrasing designed to break the logic. Then I audited every output against the original constraints:
Only use data that’s tagged — no guessing
Stick to exact labels (e.g., “Oil-Compatible”)
Use fallback icons (🟡, 🚫, ⚠) for gaps or flags
Never paraphrase or reword
If it doesn’t know, it has to say that
Every output had to follow the same format:
🧬 What the ingredient does
✅ What it pairs well with
📌 Any best practices
⚠ Risks or controversial use
🟡 or 🚫 if data was missing or a rule was broken
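In code terms, that fixed template could be rendered like this minimal sketch (field names are hypothetical), with 🟡 standing in for any gap so the model never has to improvise:

```python
def format_response(entry):
    """Render one entry in the fixed output template; 🟡 fills any data gap."""
    def field(icon, label, key):
        value = entry.get(key)
        return f"{icon} {label}: {value}" if value else f"🟡 {label}: no data"
    return "\n".join([
        field("🧬", "Function", "function"),
        field("✅", "Pairs well with", "pairings"),
        field("📌", "Best practices", "best_practices"),
        field("⚠", "Risks", "risks"),
    ])
```

Because the template is positional and exhaustive, a missing section is immediately visible in the audit rather than silently omitted.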
I also drew hard lines around tone: no soft guesses, no reworded labels, no pretending data exists when it doesn’t.
Testing the AI
I simulated 50 real-world queries users might actually ask. They were designed to test factual recall, probe multi-field logic, evaluate ingredient compatibility, and check how gracefully the system handles ambiguity or unsupported data. The queries were crafted to pressure-test the AI's logical limits:
Dual-condition filters (“Soothing but non-comedogenic for acne-prone skin”)
Compatibility queries with asymmetrical tags
Debated claims like parabens, alcohols, silicones
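A dual-condition query like the first one can only be answered by intersecting tags. Here is a hedged Python sketch of that filter logic (field names and sample data are invented), where a missing comedogenicity rating excludes the ingredient rather than letting the model guess:

```python
def dual_filter(dataset, required_function, max_comedogenicity):
    """Answer a dual-condition query strictly from tagged fields.

    Entries missing either tag are excluded: no rating means no match,
    never a guess.
    """
    hits = []
    for entry in dataset:
        if required_function not in entry.get("function_tags", ()):
            continue
        rating = entry.get("comedogenicity")
        if rating is None or rating > max_comedogenicity:
            continue  # untagged or too comedogenic: excluded, not guessed
        hits.append(entry["name"])
    return hits
```

Run against a toy dataset, an ingredient tagged "Soothing" but lacking a comedogenicity rating falls out of the results, which is exactly the fallback behavior the prompt policy demands.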
Results:
47 of the 50 AI answers adhered to the prompt policy.
Below is the audit of the AI’s responses:
Overall Data
Evaluation results from 50 AI queries, graded for accuracy, tone compliance, and proper fallback behavior.
Behaviors Noticed in Responses
Breakdown of evaluation criteria used to audit AI responses, grouped by logic behavior, formatting, and compliance with structured rules.
Responses with Errors or Potential Fixes Needed
Fields Used and Usage Frequency
Frequency of field usage across 50 AI responses, highlighting which data tags the model relied on most during evaluation.
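A tally like this is straightforward to reproduce: reduce each audited response to the list of data fields it drew on and count them with `collections.Counter`. The lists below are invented examples, not the real audit data.

```python
from collections import Counter

# Each audited response is reduced to the list of fields it drew on
# (invented examples for illustration).
fields_per_response = [
    ["function", "pairings"],
    ["function", "risks"],
    ["function"],
]
usage = Counter(field for fields in fields_per_response for field in fields)
# usage["function"] == 3
```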
What’s Next:
I’m expanding the dataset to 300+ ingredients, layering in safety and regulatory tags, and refining how the AI handles tone when the science gets more complex. Eventually, I want to build a lightweight version for people to actually use.
Phases:
Phase 1: Testing + Iteration (Current Phase)
Integrate with LLM or retrieval-augmented model
Conduct usability testing on language clarity and risk phrasing
Align outputs with accessibility and enterprise design systems
Phase 2: User Testing + Iteration
Begin user testing with real ingredient lists
Test upload and paste workflows
Build a trust-feedback loop to guide future improvements
Future Plans
Layer in user profiles (e.g., acne-prone, sensitive skin, rosacea)
Enable “smart” questions (e.g., “Is this pregnancy safe?”)
Develop a Chrome extension for ingredient popovers while browsing