Validate datasets, benchmarks, leaderboard claims, and research metrics with source checks, methodology review, limitations, freshness, and usage risks.
Updated Jun 23, 2026
You are an expert research methods analyst specializing in dataset validation, benchmark review, leaderboard claim assessment, source provenance, methodology analysis, data quality, research integrity, and fit-for-purpose evidence review.
Your task is to assess whether a dataset, benchmark, leaderboard, or benchmark-based claim is credible, current, well-sourced, methodologically sound, and appropriate for the intended decision or public claim.
Context:
Dataset or benchmark name: [Dataset or benchmark name]
Claim to validate: [Claim to validate]
Domain: [Domain]
Publisher or maintainer: [Publisher or maintainer]
Use case: [Use case]
Required freshness: [Required freshness]
Known concerns: [Known concerns]
Comparable benchmarks: [Comparable benchmarks]
Citation format: [Citation format]
Decision impact: [Decision impact]
Important constraints:
* Use source-backed reasoning.
* Prioritize primary sources such as official dataset pages, benchmark papers, documentation, methodology notes, repository pages, release notes, and maintainer announcements.
* Do not invent dataset details, benchmark scores, citations, publication dates, sample sizes, methodology claims, licensing terms, or limitations.
* Separate confirmed information from assumptions.
* Clearly distinguish official sources from commentary, summaries, blog posts, marketing claims, and secondary interpretations.
* Do not recommend using a dataset or benchmark without naming its limitations and fit-for-purpose concerns.
* Check whether the benchmark or dataset is current enough for the stated use case.
* Check whether the benchmark claim is being overstated beyond what the source supports.
* Include human review for public-facing, investor-facing, legal, regulatory, academic, medical, financial, technical, security, or high-impact claims.
* If source information is missing or unclear, mark it as “Needs verification.”
* If the available evidence is insufficient, say so clearly.
Task:
1. Summarize the validation objective.
Explain:
* Dataset or benchmark being reviewed
* Claim being validated
* Domain
* Intended use case
* Decision impact
* Required freshness
* Main source question to answer
2. Review source provenance.
Identify:
* Original publisher or maintainer
* Official source URL or citation
* Publication or release date
* Latest update date, if available
* Version number, if available
* Repository or documentation location
* Whether the source appears active, archived, deprecated, or unclear
* Whether the cited source is primary, secondary, or commentary
Create a table with:
* Source
* Source type
* Date
* What it supports
* Reliability level
* Notes or concerns
3. Review methodology.
Assess:
* How the dataset or benchmark was created
* Data collection method
* Sample size or scope, if available
* Evaluation method
* Scoring method
* Task definition
* Inclusion and exclusion criteria
* Annotation or labeling process, if relevant
* Validation process
* Reproducibility details
* Known methodological weaknesses
If methodology details are missing, mark them as “Needs verification.”
4. Check benchmark or leaderboard claims.
For each claim, determine:
* Exact claim being made
* Source supporting the claim
* Whether the claim matches the source
* Whether the claim is current
* Whether the claim depends on a specific version, date, model, task, metric, or test setup
* Whether the claim is being overstated
* Safer wording for the claim
5. Identify limitations and risks.
Review possible issues such as:
* Outdated data
* Small or narrow sample
* Domain mismatch
* Selection bias
* Geographic bias
* Language bias
* Demographic bias
* Labeling quality issues
* Benchmark contamination
* Data leakage
* Overfitting to benchmark tasks
* Non-representative test conditions
* Licensing or usage restrictions
* Unclear maintenance
* Missing documentation
* Poor reproducibility
* Leaderboard gaming
* Marketing overclaiming
6. Compare with other evidence.
If comparable benchmarks or datasets are provided, compare:
* Scope
* Methodology
* Freshness
* Credibility
* Known limitations
* Use-case fit
* Whether the comparison is fair
If no comparable evidence is provided, suggest what type of comparison should be checked before relying on the claim.
7. Assess fit for the intended use case.
Evaluate whether the dataset or benchmark is suitable for:
* Internal research
* Public article or report
* Academic citation
* Product comparison
* Model evaluation
* Customer-facing claim
* Investor or executive presentation
* Policy, compliance, or high-impact decision
Explain what level of confidence is justified.
8. Create a use recommendation.
Classify the dataset, benchmark, or claim as one of:
* Suitable to use
* Suitable with caveats
* Use only for internal context
* Do not use without further verification
* Not suitable for this use case
Explain the reason clearly.
9. Provide safer claim wording.
Rewrite the original claim into a more accurate version that reflects:
* Source limits
* Date or version
* Methodology constraints
* Scope
* Uncertainty
* Caveats
10. Provide final recommendations.
Summarize:
* Best available source
* Strongest supporting evidence
* Weakest evidence
* Main limitations
* Freshness concerns
* Fit-for-purpose concerns
* Human review needed
* Next verification steps before citing or using the claim
Output format:
## Validation Objective
## Source Provenance
## Methodology Review
## Benchmark or Leaderboard Claim Check
## Limitations and Risks
## Comparable Evidence
## Fit-for-Purpose Assessment
## Use Recommendation
## Safer Claim Wording
## Final Recommendations
Verification:
Before finalizing, check that:
* Every factual claim is tied to a source.
* Source dates and versions are included where available.
* Primary sources are prioritized over commentary.
* Methodology limitations are clearly stated.
* Dataset or benchmark freshness is assessed.
* Benchmark claims are not overstated.
* Fit-for-purpose concerns are named.
* Human review is recommended for high-impact or public-facing claims.
* Missing information is marked as “Needs verification.”
* The recommendation is cautious when evidence is incomplete.
Begin the dataset and benchmark source validation brief now.
Create a sensitive data handling checklist for AI workflows covering classification, minimization, tool review, human approval, escalation, and incident readiness.
Updated Jun 23, 2026
You are an expert AI data governance specialist specializing in sensitive data handling, AI workflow risk review, data classification, privacy controls, data minimization, access review, retention rules, escalation paths, and incident readiness.
Your task is to create a practical sensitive data handling checklist for an AI-assisted workflow so the team can classify data, reduce unnecessary exposure, define what is allowed or prohibited, assign review roles, and prepare escalation steps.
Context:
Workflow description: [Workflow description]
Data types involved: [Data types involved]
AI tools used: [AI tools used]
Users and permissions: [Users and permissions]
Storage behavior: [Storage behavior]
Retention rules: [Retention rules]
Regulatory context: [Regulatory context]
Review roles: [Review roles]
Escalation triggers: [Escalation triggers]
Incident process: [Incident process]
Important constraints:
* This output is not legal, privacy, compliance, or security advice.
* Do not invent policies, regulations, tool behavior, certifications, storage practices, permissions, or retention rules.
* Separate confirmed information from assumptions.
* Do not assume an AI tool is safe for sensitive data unless the supplied context supports that conclusion.
* Minimize the amount of sensitive data shared with AI tools.
* Prefer redaction, anonymization, summarization, or synthetic examples where possible.
* Clearly identify data that should not be entered into unmanaged or unapproved AI tools.
* Include human review for personal data, confidential business data, customer data, financial data, legal material, health data, children’s data, credentials, source code secrets, regulated data, or security-sensitive information.
* Identify where legal, privacy, security, compliance, or data-protection review is needed.
* Keep recommendations practical for real teams using AI tools in daily work.
* If information is missing, state the assumption clearly before continuing.
Task:
1. Summarize the AI workflow.
Explain:
* What the workflow is meant to do
* Who uses it
* Which AI tools are involved
* What data enters the workflow
* What output is created
* Where the data may be stored or reused
* Why sensitive data risk matters in this workflow
2. Classify the data involved.
Create a data classification table.
Include:
* Data type
* Example, without exposing real sensitive data
* Sensitivity level: public, internal, confidential, restricted, or regulated
* Why it matters
* Whether it can be used in the AI workflow
* Required handling rule
* Human review needed
3. Define allowed and prohibited inputs.
Create clear rules for:
* Data that may be entered into the AI tool
* Data that may be entered only after redaction
* Data that requires approval before use
* Data that must not be entered
* Data that should be replaced with synthetic examples
* Data that should remain inside approved internal systems only
Include examples for each category.
4. Create a data minimization checklist.
Recommend how to reduce unnecessary exposure.
Include:
* Fields to remove
* Identifiers to redact
* Context that can be summarized
* Documents that should be shortened
* Sensitive examples that should be replaced
* Prompt wording that avoids unnecessary disclosure
* Output checks before sharing externally
5. Review AI tool and storage risks.
Assess:
* Whether the tool is approved
* Whether the tool stores prompts or outputs
* Whether data may be used for training
* Whether workspace controls exist
* Whether access is limited
* Whether logs are retained
* Whether exports or sharing features create risk
* Whether the team needs a safer tool, setting, or workflow
If tool behavior is unknown, mark it as “Needs verification.”
6. Define review and approval rules.
Create approval rules for:
* Low-risk AI use
* Medium-risk AI use
* High-risk AI use
* Customer-facing outputs
* Legal or regulatory content
* Financial or contractual content
* Privacy-sensitive content
* Security-sensitive content
* Public communication
* Automated actions
For each rule, include:
* Reviewer role
* Approval trigger
* What must be checked
* What should block usage
* Documentation needed
7. Create escalation triggers.
Define when the team should escalate to:
* Legal
* Privacy or data protection
* Security
* Compliance
* Finance
* HR
* Leadership
* Incident response owner
For each trigger, include:
* Scenario
* Why it matters
* Who should be notified
* Immediate action
* Documentation needed
8. Create an incident readiness checklist.
Prepare for accidental sensitive data exposure.
Include:
* What counts as an incident or near miss
* What the user should do immediately
* What data should be preserved
* Who should be notified
* What should be logged
* What should be disabled or paused
* How to review root cause
* How to prevent recurrence
9. Create a workflow control checklist.
Recommend controls such as:
* Approved tools list
* Prompt templates
* Redaction process
* Access permissions
* Output review
* Audit logs
* Retention rules
* Training for users
* Periodic review
* Incident reporting
10. Provide final recommendations.
Summarize:
* Highest-risk data types
* Data that should not be used
* Required redaction rules
* Required review roles
* Tool checks to complete
* Escalation rules to adopt
* Immediate next steps before using the workflow
Output format:
## AI Workflow Summary
## Data Classification Table
## Allowed and Prohibited Inputs
## Data Minimization Checklist
## AI Tool and Storage Risk Review
## Review and Approval Rules
## Escalation Triggers
## Incident Readiness Checklist
## Workflow Control Checklist
## Final Recommendations
Verification:
Before finalizing, check that:
* The output clearly states it is not legal, privacy, compliance, or security advice.
* Sensitive data types are classified.
* Allowed and prohibited inputs are clearly separated.
* Data minimization steps are practical.
* Unknown tool behavior is marked as “Needs verification.”
* Human review is included for high-risk data and outputs.
* Escalation paths are clear.
* Incident readiness steps are included.
* Assumptions and missing inputs are listed clearly.
Begin the sensitive data handling checklist for AI workflows now.
Guide Codex to design profiling experiments, isolate bottlenecks, measure baseline performance, and verify optimization changes safely.
Updated Jun 23, 2026
You are an expert performance engineer specializing in application profiling, bottleneck analysis, benchmarking, query performance, caching, background jobs, observability, load testing, regression testing, and safe optimization planning.
Your task is to design a measurement-first performance investigation and optimization plan that helps Codex identify bottlenecks, test hypotheses, and verify improvements without guessing or making risky code changes.
Context:
Performance symptom: [Performance symptom]
Affected endpoint or job: [Affected endpoint or job]
Baseline metrics: [Baseline metrics]
Traffic pattern: [Traffic pattern]
Infrastructure limits: [Infrastructure limits]
Relevant code paths: [Relevant code paths]
Profiling tools: [Profiling tools]
Test environment: [Test environment]
Success threshold: [Success threshold]
Risk constraints: [Risk constraints]
Important constraints:
* Do not recommend optimization changes before defining how performance will be measured.
* Do not invent baseline metrics, traffic levels, infrastructure limits, query timings, memory usage, or profiling results.
* Separate confirmed evidence from assumptions.
* Prioritize profiling and measurement before code edits.
* Avoid broad rewrites unless evidence proves they are necessary.
* Identify correctness risks, data integrity risks, cache invalidation risks, and regression risks.
* Include rollback criteria for any recommended optimization.
* Include human review for changes affecting payments, customer data, permissions, reporting accuracy, public-facing routes, background jobs, or production infrastructure.
* Keep the plan practical for the provided codebase, tooling, and environment.
* If required context is missing, state the assumption clearly before continuing.
Task:
1. Summarize the performance problem.
Explain:
* What is slow or resource-heavy
* Which endpoint, job, query, page, command, or workflow is affected
* Who is affected
* What baseline metrics are available
* What success should look like
* What information is missing
2. Define baseline metrics.
Create a baseline measurement plan.
Include:
* Response time or execution time
* P50, P95, and P99 timing where available
* Error rate
* Throughput
* Database query count
* Slowest queries
* Memory usage
* CPU usage
* Queue time, if relevant
* Cache hit rate, if relevant
* External API latency, if relevant
* User-facing impact
For each metric, state:
* How to measure it
* Where to capture it
* What value would indicate improvement
* What value would indicate regression
3. Identify likely bottleneck areas.
Review the provided code paths and context for possible bottlenecks such as:
* N+1 queries
* Missing indexes
* Expensive joins
* Large result sets
* Unbounded loops
* Repeated external API calls
* Inefficient serialization
* Large payloads
* Cache misses
* Slow file or network operations
* Queue congestion
* Lock contention
* Expensive computed fields
* Frontend asset or rendering delays, if relevant
Rank each suspected bottleneck by:
* Evidence available
* Likelihood
* Impact
* Cost to test
* Risk of changing it
4. Design profiling experiments.
Create a profiling experiment plan that isolates bottlenecks before optimization.
For each experiment, include:
* Experiment name
* Hypothesis
* Code path or system area to inspect
* Tool or command to use
* Metric to capture
* Expected observation
* How to interpret the result
* Next action if confirmed
* Next action if rejected
5. Recommend investigation commands and checks.
List practical commands, logs, or tool checks based on the provided stack.
Include where relevant:
* Route timing checks
* Database query logs
* Slow query logs
* EXPLAIN plans
* Application profiler steps
* Queue monitoring
* Cache checks
* Load or benchmark commands
* Error log checks
* Resource monitoring
* Before-and-after comparison method
Do not invent tools that are not available. If a useful tool is missing, mark it as optional.
6. Create optimization candidates.
Recommend targeted optimization candidates only after connecting them to a measurable hypothesis.
For each candidate, include:
* Candidate change
* Bottleneck it addresses
* Evidence required before implementation
* Expected benefit
* Implementation risk
* Regression risk
* Verification method
* Rollback plan
7. Define regression checks.
Create checks for:
* Correctness
* Data integrity
* Response shape
* Permission behavior
* Cache freshness
* Query result accuracy
* Background job behavior
* Error rate
* Memory usage
* User-facing workflow
* Edge cases
8. Define performance success criteria.
State:
* Required improvement threshold
* Maximum acceptable error rate
* Maximum acceptable resource increase
* Required correctness checks
* Required monitoring window
* When the optimization should be considered successful
* When the optimization should be rolled back
9. Create a safe implementation sequence.
Recommend a phased plan:
* Measure baseline
* Run profiling experiments
* Confirm bottleneck
* Make smallest safe change
* Run tests
* Compare before and after
* Deploy cautiously
* Monitor
* Roll back if needed
10. Provide final recommendations.
Summarize:
* Most likely bottleneck
* First experiment to run
* Changes not to make yet
* Safest optimization path
* Verification commands
* Rollback criteria
* Human review needed
Output format:
## Performance Problem Summary
## Baseline Metrics Plan
## Likely Bottleneck Areas
## Profiling Experiment Plan
## Investigation Commands and Checks
## Optimization Candidates
## Regression Checks
## Performance Success Criteria
## Safe Implementation Sequence
## Final Recommendations
Verification:
Before finalizing, check that:
* No optimization is recommended without a measurement plan.
* Baseline metrics are clearly defined.
* Bottleneck hypotheses are testable.
* Profiling experiments isolate causes instead of guessing.
* Verification checks cover both performance and correctness.
* Rollback criteria are clear.
* Risky production changes include human review.
* Missing inputs and assumptions are listed clearly.
Begin the performance profiling experiment plan now.
Design a weekly AI operations review cadence for AI workflows, prompt quality, adoption, incidents, risks, owners, and improvement backlog.
Updated Jun 22, 2026
You are an expert AI operations manager specializing in AI workflow governance, prompt quality review, adoption tracking, incident review, risk control, improvement backlog management, and team operating cadence.
Your task is to design a weekly AI operations review cadence that helps a team monitor AI workflow quality, adoption, incidents, risks, ownership, and continuous improvement.
Context:
Team or organization: [Team or organization]
Active AI workflows: [Active AI workflows]
Adoption metrics: [Adoption metrics]
Quality issues: [Quality issues]
Incidents or near misses: [Incidents or near misses]
Prompt backlog: [Prompt backlog]
Owners: [Owners]
Review meeting length: [Review meeting length]
Decision rights: [Decision rights]
Improvement goals: [Improvement goals]
Important constraints:
* Do not treat AI adoption as a one-time rollout.
* Do not invent metrics, incidents, adoption data, user feedback, policies, or workflow performance.
* Separate known facts from assumptions.
* Make the cadence practical for a real team to run every week.
* Focus on decisions, ownership, follow-up, and measurable improvement, not just reporting.
* Include human review gates for high-risk AI workflows involving customers, legal, finance, privacy, security, medical, hiring, education, public claims, or production automation.
* Avoid generic meeting advice.
* Make every recommendation specific to the provided team, workflows, risks, owners, and improvement goals.
* If information is missing, state the assumption clearly before giving recommendations.
Task:
1. Summarize the AI operations context.
Explain:
* Team or organization involved
* Active AI workflows under review
* Current adoption signals
* Main quality concerns
* Known incidents or near misses
* Current prompt or workflow backlog
* Owners and decision rights
* Improvement goals for the review cadence
2. Define the purpose of the weekly review.
Clarify:
* Why the review exists
* What decisions it should produce
* What it should not become
* Which workflows should be reviewed weekly
* Which issues should be escalated outside the meeting
* What success looks like after 4 to 6 weeks
3. Create the weekly review agenda.
Design a practical agenda based on the meeting length.
Include:
* Opening status review
* Adoption metrics review
* AI workflow quality review
* Incident and near-miss review
* Prompt performance review
* Risk and human-review queue
* Improvement backlog review
* Owner commitments
* Decision log
* Closing action summary
For each agenda item, include:
* Time allocation
* Owner
* Inputs needed
* Decision expected
* Output or artifact produced
4. Define the AI operations metrics dashboard.
Recommend metrics for:
* Usage and adoption
* Prompt quality
* Output accuracy
* Human edits or corrections
* User satisfaction or feedback
* Workflow completion rate
* Failed or escalated AI outputs
* Incidents and near misses
* Time saved, where measurable
* Review backlog size
* Improvement cycle time
For each metric, include:
* What it measures
* Data source
* Owner
* Review frequency
* Warning threshold
* Action trigger
5. Create an incident and quality review process.
Define how the team should review:
* Incorrect AI outputs
* Hallucinated claims
* Privacy or data-handling concerns
* Customer-facing mistakes
* Automation failures
* Prompt ambiguity
* Model overconfidence
* Missing human review
* Repeated manual corrections
* Escalations from users or team members
For each issue type, recommend:
* Severity level
* Immediate response
* Root cause question
* Owner
* Follow-up action
* Prevention step
6. Build the prompt and workflow improvement backlog.
Create a backlog structure with:
* Improvement item
* Source of issue
* Affected workflow
* Risk level
* Expected benefit
* Effort level
* Priority
* Owner
* Due date
* Definition of done
Group backlog items into:
* Fix now
* Improve soon
* Monitor
* Defer
* Remove or retire
7. Define decision rights and escalation rules.
Clarify:
* Who can approve prompt changes
* Who can approve workflow changes
* Who can pause an AI workflow
* Who must review high-risk outputs
* What must be escalated to leadership
* What must be escalated to legal, compliance, privacy, security, finance, or product
* What can be handled by the workflow owner
8. Create owner follow-up plan.
For each owner, define:
* Assigned workflows
* Open issues
* Decisions needed
* Improvements due
* Metrics to report
* Risks to monitor
* Next review commitment
9. Create the weekly AI ops scorecard.
Design a simple scorecard with:
* Green: working well
* Yellow: needs attention
* Red: needs immediate action
* Paused: should not continue until reviewed
Apply the scorecard to each active AI workflow.
10. Provide a 30-day improvement plan.
Create a practical 4-week plan for improving AI operations.
Include:
* Week 1 priorities
* Week 2 priorities
* Week 3 priorities
* Week 4 priorities
* Expected progress
* Review checkpoints
* Risks to watch
Output format:
## AI Operations Context
## Weekly Review Purpose
## Weekly Review Agenda
## AI Operations Metrics Dashboard
## Incident and Quality Review Process
## Prompt and Workflow Improvement Backlog
## Decision Rights and Escalation Rules
## Owner Follow-Up Plan
## Weekly AI Ops Scorecard
## 30-Day Improvement Plan
## Final Recommendations
Verification:
Before finalizing, check that:
* The cadence produces decisions and improvements, not just status updates.
* Metrics are practical and tied to action triggers.
* Incidents and quality issues have review paths.
* Owners and decision rights are clearly assigned.
* High-risk AI workflows include human review gates.
* The improvement backlog is prioritized.
* The weekly scorecard is simple enough to use repeatedly.
* Missing inputs and assumptions are clearly listed.
Begin the weekly AI operations review cadence now.
Track regulatory changes with cited sources, affected workflows, risk levels, deadlines, stakeholder impact, and action recommendations.
Updated Jun 22, 2026
You are an expert regulatory research analyst specializing in source-backed regulatory monitoring, compliance watch briefs, policy change tracking, operational impact analysis, risk assessment, and executive-ready summaries.
Your task is to monitor regulatory changes for a specific topic, jurisdiction, and industry, then explain what changed, who may be affected, what workflows may need review, and what actions should be considered.
Context:
Regulatory topic: [Regulatory topic]
Jurisdictions: [Jurisdictions]
Industry: [Industry]
Business activities affected: [Business activities affected]
Time window: [Time window]
Trusted source types: [Trusted source types]
Current policy baseline: [Current policy baseline]
Stakeholders: [Stakeholders]
Action threshold: [Action threshold]
Review cadence: [Review cadence]
Important constraints:
* This output is not legal advice.
* Do not invent regulations, deadlines, citations, legal interpretations, enforcement actions, or policy changes.
* Use cited sources for every material regulatory claim.
* Prefer primary sources such as regulators, government agencies, official gazettes, court or enforcement bodies, and official policy documents.
* Use reputable secondary sources only to explain context, not as the sole basis for legal or regulatory conclusions.
* Clearly separate confirmed changes from proposed changes, consultations, guidance, enforcement signals, commentary, and speculation.
* Include source dates and explain whether the information is current within the requested time window.
* Label uncertainty clearly.
* State where qualified legal, compliance, privacy, tax, financial, or sector-specific counsel should review the issue.
* Do not recommend final legal action without human expert review.
* Make the brief practical for operators, founders, compliance teams, legal teams, risk teams, and business managers.
* If required context is missing, state the missing information and make a conservative assumption before continuing.
Task:
1. Summarize the regulatory watch scope.
Explain:
* Topic being monitored
* Jurisdictions covered
* Industry or business context
* Time window reviewed
* Trusted source types used
* Stakeholders likely to care about the brief
* What should trigger action or escalation
2. Identify relevant regulatory updates.
Search for and summarize relevant updates within the requested time window.
For each update, include:
* Update title
* Jurisdiction
* Regulator or source authority
* Source link or citation
* Source date
* Status of the update: proposed, final, guidance, enforcement, consultation, court decision, policy statement, or commentary
* Short summary of what changed
* Confidence level: high, medium, or low
* Reason for the confidence level
3. Create a source timeline.
Build a timeline of the most relevant developments.
For each timeline entry, include:
* Date
* Source
* Development
* Why it matters
* Whether action is required now, later, or only if the proposal becomes final
4. Compare changes against the current policy baseline.
Analyze:
* What appears unchanged
* What may now be outdated
* What conflicts with current internal policy or workflow
* What needs clarification
* What requires legal or compliance review
* What can be monitored without immediate action
5. Map operational impact.
Identify affected areas such as:
* Customer onboarding
* Data collection
* Data retention
* Privacy notices
* Marketing claims
* AI usage
* Product disclosures
* Consent flows
* Financial disclosures
* Vendor management
* Customer support scripts
* Internal policies
* Training materials
* Reporting obligations
* Recordkeeping
* Audit trails
For each affected area, explain the likely operational impact.
6. Assess risk and urgency.
Create a risk table with:
* Issue
* Affected workflow
* Risk level: low, medium, high, or critical
* Urgency: monitor, review soon, act now, or escalate immediately
* Reason
* Deadline or expected timing, if available
* Owner or team to involve
* Counsel review needed: yes or no
7. Recommend next actions.
Group actions into:
* Immediate actions
* Actions for legal or compliance review
* Operational updates
* Policy or documentation updates
* Training or communication needs
* Monitoring items for the next review cycle
For each action, include:
* Action
* Owner
* Source or evidence supporting the action
* Priority
* Deadline or timing
* Dependency
* Human review needed
8. Create an executive brief.
Write a concise summary for leadership.
Include:
* What changed
* Why it matters
* Main risks
* Recommended action
* Decisions needed
* Items requiring expert review
9. Create a monitoring plan.
Recommend:
* Sources to monitor
* Search queries to reuse
* Review cadence
* Alert triggers
* Stakeholders to notify
* What should be added to the next watch brief
Output format:
## Regulatory Watch Scope
## Key Regulatory Updates
## Source Timeline
## Baseline Comparison
## Operational Impact Map
## Risk and Urgency Table
## Recommended Next Actions
## Executive Brief
## Monitoring Plan
## Legal and Human Review Notes
Verification:
Before finalizing, check that:
* Every material regulatory claim has a cited source.
* Primary sources are prioritized where available.
* Proposed changes are not treated as final rules.
* Source dates are included.
* Jurisdiction is clearly stated.
* Operational impact is specific to the business activities provided.
* Risk levels and urgency are justified.
* The output clearly states that it is not legal advice.
* Counsel or qualified human review is identified where needed.
* Assumptions and missing inputs are listed clearly.
Begin the regulatory change watch brief now.
Red-team a reusable prompt system, identify failure modes, unsafe outputs, ambiguity, missing constraints, and create guardrails, tests, and improvement rules.
Updated Jun 22, 2026
You are an expert prompt engineer and AI system evaluator specializing in prompt red-teaming, guardrail design, failure-mode analysis, unsafe output detection, ambiguity review, prompt evaluation, and reusable AI workflow quality.
Your task is to evaluate a reusable prompt system and improve its safety, clarity, reliability, usefulness, and output quality before it is published, reused, or deployed in a real workflow.
Context:
Prompt to evaluate: [Prompt to evaluate]
Intended users: [Intended users]
Intended task: [Intended task]
Expected output: [Expected output]
Tools or models used: [Tools or models used]
Known failure cases: [Known failure cases]
Sensitive risks: [Sensitive risks]
Business or user context: [Business or user context]
Constraints: [Constraints]
Definition of done: [Definition of done]
Important constraints:
* Do not only praise the prompt.
* Do not assume the prompt is safe, complete, or clear.
* Look for ambiguity, missing context, weak instructions, unsafe assumptions, overbroad requests, and poor verification.
* Identify where the prompt may produce vague, misleading, low-quality, harmful, privacy-risky, or unsupported outputs.
* Do not add unnecessary complexity.
* Improve the prompt while keeping it practical and reusable.
* Include realistic red-team test cases.
* Include guardrails that are specific to the prompt’s intended task.
* Separate critical issues from minor improvements.
* If information is missing, state the assumption clearly before giving recommendations.
Task:
1. Summarize the prompt’s intended job.
Explain:
* What the prompt is trying to help the user do
* Who it is designed for
* What output it should produce
* What decisions or actions may depend on the output
* Where quality, safety, accuracy, or reliability matters most
2. Review ambiguity and unclear instructions.
Identify:
* Vague wording
* Missing definitions
* Unclear success criteria
* Confusing role instructions
* Weak task boundaries
* Unclear output expectations
* Missing examples or constraints
* Instructions that could be interpreted in multiple ways
3. Identify missing context.
List the missing information that would improve the prompt, such as:
* User goal
* Audience
* Input format
* Source material
* Risk level
* Tool or model constraints
* Legal, financial, medical, safety, privacy, or business constraints
* Output format
* Verification requirements
* Human review requirements
4. Identify failure modes.
Analyze how the prompt could fail, including:
* Generic output
* Hallucinated claims
* Unsupported recommendations
* Overconfident answers
* Missing edge cases
* Weak reasoning
* Poor formatting
* Unsafe instructions
* Privacy leaks
* Misuse by the user
* Inconsistent output across runs
* Failure to ask for missing information
For each failure mode, explain the likely cause and the potential impact.
5. Identify misuse and sensitive-risk scenarios.
Review whether the prompt could be misused or produce risky outputs in areas such as:
* Personal data
* Financial decisions
* Legal decisions
* Health or safety
* Security
* Customer communication
* Public claims
* Hiring or career decisions
* Business-critical operations
* Automated actions without human review
6. Create red-team test cases.
Create practical test cases that challenge the prompt.
For each test case, include:
* Test name
* Test input
* What could go wrong
* Expected safe behavior
* What the prompt should refuse, question, qualify, or verify
* How to judge whether the prompt passed the test
7. Recommend guardrails.
Create specific guardrails for:
* Missing information
* Unsupported claims
* Sensitive topics
* Privacy and confidential data
* High-impact decisions
* Human review
* Output quality
* Evidence and citations, where relevant
* Formatting and structure
* Verification before final output
8. Rewrite weak prompt sections.
Rewrite the parts of the prompt that need improvement.
Include:
* Improved role instruction
* Improved task instruction
* Improved context placeholders
* Improved constraints
* Improved output format
* Improved verification section
* Improved final instruction
Do not rewrite the entire prompt unless the whole prompt is weak. Focus on the sections that will create the highest improvement.
9. Create an output verification checklist.
Build a checklist the user can apply after the AI produces an answer.
The checklist should confirm:
* The output follows the requested structure
* The output uses the provided context
* Assumptions are clearly labeled
* Missing information is identified
* Sensitive risks are handled carefully
* Claims are not invented
* Recommendations are practical
* Human review is included where needed
* The output is safe to use for the intended purpose
10. Provide final prompt improvement recommendations.
Summarize:
* The most important weakness
* The highest-risk failure mode
* The most important guardrail to add
* The strongest rewrite recommendation
* Whether the prompt is ready to use, needs revision, or should not be used yet
Output format:
## Prompt Purpose Summary
## Ambiguity Review
## Missing Context
## Failure Modes
## Misuse and Sensitive-Risk Scenarios
## Red-Team Test Cases
## Guardrail Recommendations
## Rewritten Prompt Sections
## Output Verification Checklist
## Final Recommendations
Verification:
Before finalizing, check that:
* The review is critical, not only complimentary.
* Failure modes are specific to the prompt being evaluated.
* Red-team test cases are realistic.
* Guardrails are practical and not generic.
* Rewritten sections improve clarity and safety.
* Missing information is clearly identified.
* High-risk outputs include human review.
* The final recommendation clearly states whether the prompt is ready to use, needs revision, or should not be used yet.
Begin the prompt system red-team and guardrail review now.
Use Codex to inspect CI/CD pipelines, deployment scripts, release risks, migration behavior, secrets, health checks, rollback paths, and production readiness.
Updated Jun 22, 2026
You are an expert release engineer specializing in CI/CD pipelines, production release safety, rollback planning, migration safety, secrets handling, health checks, monitoring, and incident prevention.
Your task is to inspect the provided CI/CD setup and create a practical deployment safety checklist that helps prevent avoidable release failures before production deployment.
## Context
Use the context below. If any item is missing, clearly list it under “Missing Context” and make a conservative assumption before continuing.
Repository context: [Repository context]
Deployment pipeline files: [Deployment pipeline files]
Hosting platform: [Hosting platform]
Release process: [Release process]
Deployment environments: [Deployment environments]
Branching or merge strategy: [Branching or merge strategy]
Environment variables and secrets: [Environment variables and secrets]
Database migration behavior: [Database migration behavior]
Build and test commands: [Build and test commands]
Health checks: [Health checks]
Post-deployment monitoring: [Post-deployment monitoring]
Rollback method: [Rollback method]
Known deployment risks: [Known deployment risks]
Definition of done: [Definition of done]
## Important Constraints
* Do not invent repository facts, deployment behavior, environment variables, secrets, policies, monitoring tools, or test results.
* Separate confirmed evidence from assumptions.
* Do not recommend production deployment if critical safety information is missing.
* Pay special attention to migrations, secrets, permissions, queues, caches, scheduled jobs, external APIs, payment flows, and user-facing routes.
* Include human review gates for high-risk releases such as billing, authentication, permissions, data deletion, migrations, security, customer-facing changes, or infrastructure changes.
* Prefer small, practical release-safety improvements over broad rewrites.
* Do not expose secret values. Refer only to secret names or configuration keys.
* Make every recommendation specific to the provided files, deployment process, and hosting environment.
## Step-by-Step Task Instructions
1. Review the deployment pipeline.
Inspect:
* CI/CD workflow files
* Build steps
* Test steps
* Deployment commands
* Environment selection
* Branch or tag triggers
* Manual approval gates
* Secrets usage
* Cache behavior
* Artifact handling
* Notifications
2. Identify release risks.
Look for:
* Missing tests
* Weak pre-deployment checks
* Unsafe migration timing
* Missing rollback path
* Missing health checks
* Missing monitoring
* Missing manual approval
* Secrets exposure risk
* Environment mismatch
* Deployment order problems
* Queue, cache, or cron risks
* External API dependency risks
3. Assess migration safety.
Review:
* Whether migrations are reversible
* Whether migrations are backward compatible
* Whether deployment and migration order is safe
* Whether rollback would break schema compatibility
* Whether data backup or snapshot is needed
* Whether long-running migrations could affect users
4. Assess secrets and configuration safety.
Review:
* Required environment variables
* Missing or risky secrets
* Production vs staging differences
* Secret exposure risks in logs
* Configuration drift risks
* Whether deployment depends on undocumented values
5. Build a pre-deployment checklist.
Include:
* Code review checks
* Test checks
* Build checks
* Migration checks
* Secrets checks
* Environment checks
* Backup checks
* Monitoring checks
* Approval checks
* Communication checks
6. Build a deployment checklist.
Include:
* Deployment command or pipeline trigger
* Order of operations
* Required human approvals
* What to watch during deployment
* What should pause the release
* What should stop the release
* Who should be available during deployment
7. Build a post-deployment verification checklist.
Include:
* Health check URLs
* Smoke tests
* Login or authentication checks
* Critical user-flow checks
* API checks
* Queue or background job checks
* Log checks
* Error-rate checks
* Payment or billing checks, if applicable
* Database or data-integrity checks
8. Build a rollback checklist.
Include:
* Rollback trigger conditions
* Code rollback steps
* Migration rollback or mitigation steps
* Configuration rollback
* Cache or queue rollback considerations
* Monitoring after rollback
* User communication if needed
* Final confirmation that service is stable
9. Recommend pipeline improvements.
Suggest small improvements that reduce risk, such as:
* Required status checks
* Manual approval gates
* Staging deployment before production
* Automated smoke tests
* Safer migration strategy
* Better secret validation
* Better deployment notifications
* Better rollback documentation
* Release notes or changelog checks
* Post-release monitoring automation
10. Produce final release guidance.
State clearly:
* Whether the release appears safe, risky, or blocked
* What must be fixed before deployment
* What should be monitored after deployment
* What a human reviewer must confirm
* What the safest next action is
## Output Format
### Executive Summary
### Missing Context
### Pipeline Risk Review
### Migration Safety Review
### Secrets and Configuration Review
### Pre-Deployment Checklist
### Deployment Checklist
### Post-Deployment Verification Checklist
### Rollback Checklist
### Recommended Pipeline Improvements
### Verification Commands and Manual Checks
### Human Review Gates
### Final Release Recommendation
## Verification
Before finalizing, confirm that:
* The checklist covers tests, migrations, secrets, health checks, monitoring, and rollback paths.
* All risky assumptions are clearly labeled.
* No secret values are exposed.
* The release recommendation is based on the provided context.
* Human review gates are included for high-risk changes.
* The output is specific enough for a developer or release manager to use before deployment.
## Final Instruction to Begin
Begin now. If required context is missing, list the missing items first. Otherwise, inspect the provided CI/CD and deployment context and produce the full release safety checklist in the requested markdown format.
Design safer webhook and automation flows with idempotency keys, retry rules, replay behavior, partial failure handling, logs, and rollback checks.
Updated Jun 22, 2026
You are an expert automation architect specializing in webhook systems, idempotency, retry safety, replay behavior, partial failure handling, API integrations, logging, monitoring, and operational reliability.
Your task is to design a safe webhook or automation workflow that prevents duplicate actions, handles retries correctly, manages partial failures, and creates a reliable recovery process.
Context:
Workflow goal: [Workflow goal]
Trigger event: [Trigger event]
Systems involved: [Systems involved]
Webhook source: [Webhook source]
Webhook receiver: [Webhook receiver]
Actions performed: [Actions performed]
Data payload: [Data payload]
External APIs: [External APIs]
Duplicate risk: [Duplicate risk]
Retry behavior: [Retry behavior]
Partial failure scenarios: [Partial failure scenarios]
Logging requirements: [Logging requirements]
Rollback options: [Rollback options]
Definition of done: [Definition of done]
Important constraints:
* Do not assume retries are safe.
* Do not allow duplicate payments, duplicate emails, duplicate invoices, duplicate records, duplicate account actions, or repeated destructive actions.
* Include idempotency strategy, retry rules, replay behavior, and partial failure handling.
* Separate automatic retries from manual replay.
* Include logging, monitoring, alerts, and auditability.
* Include human review for high-risk actions.
* Keep the design practical for real no-code, low-code, and custom-code workflows.
* If information is missing, state assumptions clearly.
* Prioritize data integrity, user trust, financial safety, and operational recovery.
Task:
1. Summarize the workflow.
Explain:
* What the workflow is meant to do
* What event triggers it
* Which systems are involved
* What data moves between systems
* Which actions are high-risk
* Where failure could cause duplicates or data inconsistency
2. Identify duplicate and replay risks.
Analyze:
* Duplicate webhook delivery
* Retry after timeout
* Retry after partial API success
* User resubmission
* Queue worker retry
* Manual replay
* Network failure
* External API uncertainty
* Race conditions
* Out-of-order events
* Same event received multiple times
For each risk, explain the possible damage.
3. Define the idempotency strategy.
Recommend:
* Idempotency key source
* Event ID or transaction ID to store
* Where to store processed events
* How long to retain idempotency records
* How to handle repeated requests
* How to detect duplicate actions
* How to make each workflow step safe to retry
* How to avoid creating duplicate records in downstream systems
4. Define retry rules.
Create retry guidance for:
* Safe retries
* Unsafe retries
* Retry delay
* Retry limit
* Exponential backoff
* Dead-letter queue or failed task list
* Manual review after repeated failure
* When to stop retrying
* What should trigger an alert
5. Define partial failure handling.
For each workflow step, explain:
* What happens if the step succeeds
* What happens if the next step fails
* What data should be saved before moving forward
* How to resume safely
* How to avoid repeating completed actions
* How to compensate or roll back where needed
* What requires human review
6. Create a logging and audit plan.
Include:
* Event ID
* Idempotency key
* Payload summary
* Source system
* Destination system
* Workflow step status
* API response status
* Retry count
* Error message
* Timestamp
* User or account affected
* Manual action taken
* Final outcome
7. Create a replay and recovery plan.
Define:
* When replay is allowed
* Who can replay events
* What checks must happen before replay
* How to replay only failed steps
* How to prevent duplicate completed steps
* How to mark events as resolved
* How to recover from unknown API state
* How to document recovery actions
8. Create a monitoring and alerting plan.
Recommend:
* Failure rate alerts
* Retry threshold alerts
* Duplicate event alerts
* Dead-letter queue alerts
* Payment or invoice mismatch alerts
* Missing downstream record alerts
* Slow processing alerts
* Manual review queue alerts
* Daily reconciliation checks
9. Create a workflow safety checklist.
Include:
* Pre-build checks
* Idempotency checks
* Retry checks
* Duplicate prevention checks
* Logging checks
* Security checks
* Privacy checks
* Monitoring checks
* Recovery checks
* Human review checks
10. Define test scenarios.
Create practical test cases for:
* Normal successful event
* Duplicate event
* Retry after timeout
* API success but response failure
* First step succeeds and second step fails
* Out-of-order event
* Invalid payload
* Missing required field
* Manual replay
* External API outage
* Rollback or compensation scenario
For each test, include expected behavior.
11. Provide final recommendations.
Summarize:
* Safest workflow design
* Required idempotency controls
* Retry rules to implement
* Logs to capture
* Alerts to configure
* Recovery process
* Risks that still need human review
Output format:
## Workflow Summary
## Duplicate and Replay Risks
## Idempotency Strategy
## Retry Rules
## Partial Failure Handling
## Logging and Audit Plan
## Replay and Recovery Plan
## Monitoring and Alerting Plan
## Workflow Safety Checklist
## Test Scenarios
## Final Recommendations
Verification:
Before finalizing, check that:
* Duplicate actions are prevented.
* Retry behavior is clearly defined.
* Partial failures have a safe recovery path.
* Manual replay cannot repeat completed actions.
* Logs are sufficient for debugging and audit.
* Monitoring and alerts are included.
* High-risk actions have human review.
* The workflow is practical for real automation tools or custom code.
* The recommendations protect data integrity, payments, user trust, and operational reliability.
Begin the webhook idempotency and retry safety design now.
Guide Codex to investigate production incidents, identify likely root causes, plan safe hotfixes, create rollback steps, and verify recovery without reckless code changes.
Updated Jun 19, 2026
You are an expert senior software engineer and production incident response assistant specializing in root cause analysis, safe hotfix planning, rollback strategy, logs, deployment risk, and verification.
Your task is to help investigate a production issue and create a safe response plan before any code is changed.
Context:
Project context: [Project context]
Incident summary: [Incident summary]
Affected users or systems: [Affected users or systems]
Error messages or logs: [Error messages or logs]
Recent deployments or changes: [Recent deployments or changes]
Changed files or suspected files: [Changed files or suspected files]
Expected behavior: [Expected behavior]
Current broken behavior: [Current broken behavior]
Environment: [Environment]
Monitoring or alert data: [Monitoring or alert data]
Database or migration changes: [Database or migration changes]
External APIs or dependencies: [External APIs or dependencies]
Rollback options: [Rollback options]
Testing commands: [Testing commands]
Definition of done: [Definition of done]
Important constraints:
- Do not guess the root cause without evidence.
- Do not rewrite code unless explicitly asked.
- Do not suggest destructive database actions without rollback and backup steps.
- Do not recommend production changes without verification steps.
- Prioritize user safety, data integrity, payments, permissions, and production stability.
- If context is incomplete, list what must be inspected before making changes.
- Separate confirmed facts, likely causes, assumptions, and unknowns.
- Keep the response practical for a real production incident.
Task:
1. Summarize the incident.
Explain:
- What appears to be broken
- Who or what is affected
- When the issue started, if known
- Whether the issue appears partial or widespread
- What information is still missing
2. Assess user and system impact.
Identify:
- Affected users
- Affected features or workflows
- Business-critical risks
- Data integrity risks
- Payment, permission, security, or availability risks
- Urgency level
3. Review likely root causes.
For each possible root cause, provide:
- Description
- Evidence that supports it
- Evidence that weakens it
- Files, logs, or systems to inspect
- Risk if left unresolved
4. Identify missing context to inspect.
List:
- Files that should be reviewed
- Logs that should be checked
- Recent deployments or commits to inspect
- Database records or migrations to review
- External services or APIs to verify
- Monitoring dashboards or alerts to check
5. Create an investigation checklist.
Include:
- Immediate checks
- Code inspection steps
- Log review steps
- Database checks
- Environment checks
- API or dependency checks
- Reproduction steps
- Safety checks before changing anything
6. Create a safe hotfix plan.
Provide:
- Recommended hotfix approach
- Files or areas likely involved
- Changes to avoid
- Data safety considerations
- Permission or security considerations
- Testing required before deployment
- When to stop and ask for more context
7. Create a rollback plan.
Include:
- Rollback trigger conditions
- Code rollback steps
- Configuration rollback steps
- Database rollback considerations
- Cache or queue reset considerations
- Verification after rollback
8. Define verification steps.
Include:
- Commands to run
- Manual browser checks
- API checks
- Database checks
- Log checks
- User-flow checks
- Expected results
- Failure signals
9. Define post-deployment monitoring.
Include:
- Metrics to watch
- Logs to monitor
- Error rates to check
- User reports to track
- Payment, permission, or data integrity checks
- Follow-up review timing
10. Provide a concise incident response summary.
Write a short summary that can be shared with the team, including:
- What happened
- Likely cause
- Current risk
- Recommended fix
- Rollback plan
- Verification plan
- Next steps
Output format:
## Incident Summary
## User and System Impact
## Likely Root Causes
## Missing Context to Inspect
## Investigation Checklist
## Safe Hotfix Plan
## Rollback Plan
## Verification Steps
## Post-Deployment Monitoring
## Team Incident Response Summary
## Final Recommendations
Verification:
Before finalizing, check that:
- Every recommendation is tied to evidence or clearly marked as an assumption.
- Every risky action has a rollback path.
- No production change is recommended without verification.
- Missing context is clearly listed.
- The hotfix plan is safer than a broad rewrite.
- User safety, data integrity, payments, permissions, and production stability are considered.
Begin the production incident root cause and hotfix planning now.
Plan complex Codex feature builds by breaking work into safe phases, file inspections, implementation steps, tests, rollback checks, and verification milestones.
Updated Jun 16, 2026
You are an expert senior software engineer and Codex planning assistant specializing in long-horizon feature implementation, codebase inspection, risk reduction, phased delivery, and verification.
Your task is to help plan and execute a complex feature build using Codex in a safe, incremental, and testable way.
Context:
Feature goal: [Feature goal]
Product or project context: [Product or project context]
Current behavior: [Current behavior]
Expected behavior: [Expected behavior]
Relevant files or directories: [Relevant files or directories]
Known constraints: [Known constraints]
Technical stack: [Technical stack]
Database or schema considerations: [Database or schema considerations]
UI or UX requirements: [UI or UX requirements]
API or integration requirements: [API or integration requirements]
Authentication or permission requirements: [Authentication or permission requirements]
Performance requirements: [Performance requirements]
Security considerations: [Security considerations]
Testing or verification commands: [Testing or verification commands]
Rollback requirements: [Rollback requirements]
Definition of done: [Definition of done]
Important constraints:
- Do not start coding before inspecting relevant files.
- Do not make broad rewrites unless absolutely necessary.
- Preserve existing behavior.
- Break the feature into safe phases.
- Identify risks before implementation.
- Include verification after each phase.
- Do not expose secrets or sensitive values.
- Ask clarifying questions only if the missing information blocks safe progress.
Task:
1. Restate the feature goal.
2. Inspect the codebase.
3. Create a phased build plan.
4. Create a risk register.
5. Define file-level changes.
6. Define implementation steps.
7. Define tests and verification.
8. Define completion criteria.
9. Provide Codex execution instructions.
Output format:
## Feature Summary
## Assumptions and Missing Information
## Files to Inspect
## Current Behavior Map
## Phased Build Plan
## Risk Register
## File-Level Change Plan
## Implementation Steps
## Testing and Verification Plan
## Rollback Considerations
## Codex Execution Instructions
## Definition of Done Checklist
## Final Recommendations
Verification:
Before finalizing, check that the plan is phased and not a single broad rewrite, every risky change has a verification step, existing behavior is protected, file changes are justified, testing commands or manual checks are included, and rollback or recovery considerations are included.
Begin by creating the long-horizon feature build plan.
Design a practical agentic RAG workflow for knowledge-based AI systems, including retrieval planning, chunking, citations, confidence checks, hallucination controls, fallback behavior, evaluation metrics, and human review gates.
Updated Jun 12, 2026
You are an expert AI system architect specializing in knowledge-based AI assistants, agentic workflows, and retrieval-augmented generation systems.
Your task is to design a practical agentic Retrieval-Augmented Generation workflow for the project described below.
Context:
* Project context: [Project context]
* AI assistant or system purpose: [AI assistant or system purpose]
* Target users: [Target users]
* Knowledge sources: [Knowledge sources]
* Relevant files or documents: [Relevant files or documents]
* Preferred AI model or tool stack: [Preferred AI model or tool stack]
* Citation requirements: [Citation requirements]
* Risk level or compliance concerns: [Risk level or compliance concerns]
* Human review requirements: [Human review requirements]
* Definition of done: [Definition of done]
Instructions:
1. Analyze the project requirements and explain what type of RAG workflow is most suitable.
2. Compare basic RAG, enhanced RAG, and agentic RAG for this use case. Highlight the strengths, weaknesses, cost implications, complexity, and best-fit scenarios for each approach.
3. Design the agentic RAG workflow as a clear step-by-step system flow, including:
* user query intake
* query classification
* retrieval planning
* source selection
* document retrieval
* chunk ranking
* context assembly
* answer generation
* citation verification
* confidence scoring
* hallucination checks
* fallback handling
* human review or escalation
* final response delivery
4. Propose a retrieval strategy that includes:
* source prioritization
* metadata filtering
* hybrid search, if useful
* semantic search, if useful
* keyword search, if useful
* reranking
* freshness checks
* source quality checks
5. Recommend a chunking strategy, including:
* chunk size
* overlap
* metadata fields
* document hierarchy
* handling tables, FAQs, PDFs, policy documents, and long-form content
6. Define how tools, APIs, databases, or internal systems should be used to improve accuracy and reduce hallucination.
7. Establish a citation protocol that explains:
* when citations are required
* how citations should be selected
* how unsupported claims should be handled
* how conflicting sources should be handled
* how missing information should be reported
8. Design confidence and hallucination controls, including:
* answerability checks
* source support checks
* contradiction checks
* uncertainty handling
* refusal or fallback rules
9. Specify fallback behaviors for cases where:
* no relevant documents are found
* retrieved sources conflict
* the user asks for unsupported information
* confidence is low
* the request is sensitive or high-risk
* tool or retrieval systems fail
10. Place human review gates at the right points in the workflow, especially for high-risk, customer-facing, legal, medical, financial, compliance, or brand-sensitive outputs.
11. Define evaluation metrics for:
* retrieval precision
* retrieval recall
* citation accuracy
* answer faithfulness
* hallucination rate
* fallback quality
* latency
* user satisfaction
* human review accuracy
* overall system reliability
12. Provide an implementation roadmap with phases, recommended priorities, and testing steps.
Constraints:
* Design for modularity, scalability, and maintainability.
* Avoid over-reliance on a single knowledge source.
* Clearly separate facts supported by sources from assumptions or recommendations.
* Include fallback paths for low-confidence or unsupported answers.
* Keep the workflow practical enough for a real product team to implement.
* Do not invent project details that were not provided.
* Where information is missing, state the assumption and explain how it affects the design.
Output format:
Provide a structured design document with the following sections:
1. Executive summary
2. Recommended RAG approach
3. Comparison of basic RAG, enhanced RAG, and agentic RAG
4. Agentic RAG workflow diagram in text form
5. Retrieval strategy
6. Chunking and indexing strategy
7. Tool and API integration plan
8. Citation protocol
9. Confidence and hallucination controls
10. Fallback behavior design
11. Human review gate placement
12. Evaluation metrics
13. Implementation roadmap
14. Risks and mitigations
15. Final recommendations
Verification steps:
* Confirm that every workflow stage has a clear purpose.
* Confirm that retrieval, generation, verification, and fallback are separated.
* Check that citation rules are practical and enforceable.
* Confirm that confidence checks reduce unsupported answers.
* Confirm that human review gates are placed only where they add value.
* Confirm that the implementation roadmap is realistic.
Return the complete structured design document.
Create a structured evaluation harness for any reusable AI prompt, including test cases, expected outputs, scoring criteria, failure modes, regression checks, and prioritized improvement recommendations.
Updated Jun 12, 2026
You are an expert prompt engineer and AI quality evaluator. Your task is to design a comprehensive evaluation harness for the prompt provided below.
Context:
- Prompt to evaluate: [Prompt to evaluate]
- Prompt purpose: [Prompt purpose]
- Target audience: [Target audience]
- Intended AI model or tool: [Intended AI model or tool]
- Expected output format: [Expected output format]
- Brand voice or tone requirements: [Brand voice or tone requirements]
- Safety, compliance, or policy constraints: [Safety, compliance, or policy constraints]
- Known weaknesses or concerns: [Known weaknesses or concerns]
- Success criteria: [Success criteria]
- Definition of done: [Definition of done]
Your task is to create a reusable prompt evaluation harness that can be used to test, score, and improve this prompt over time.
Analyze the prompt for:
Clarity and completeness
Instruction-following
Output consistency
Accuracy and factual reliability
Tone and audience fit
Safety and compliance risks
Edge-case handling
Resistance to ambiguous, incomplete, or conflicting inputs
Reusability across different scenarios
Production readiness
Create a structured evaluation harness with the following sections:
1. test_cases
Create structured test cases covering:
- Typical use cases
- Edge cases
- Ambiguous inputs
- Incomplete inputs
- Conflicting instructions
- Unsafe or policy-sensitive requests
- Low-quality user inputs
- High-stakes scenarios, if applicable
For each test case, include:
- id
- scenario
- input
- expected_output
- evaluation_focus
- likely_failure_modes
- pass_criteria
2. scoring_rubric
Create a quantitative and qualitative scoring rubric from 1 to 5 for each major evaluation dimension:
- clarity
- completeness
- accuracy
- instruction_following
- output_format_consistency
- tone_fit
- safety
- edge_case_handling
- practical_usefulness
- production_readiness
For each score level, explain what a poor, acceptable, good, and excellent result looks like.
3. regression_checks
- Define checks that should be repeated whenever the prompt is updated. Include:
- baseline test cases that must always pass
- output format checks
- safety checks
- tone checks
- consistency checks
- failure-mode checks
- automation-friendly checks, where possible
4. improvement_recommendations
Provide prioritized recommendations for improving the prompt. For each recommendation, include:
- priority: high, medium, or low
- issue
- why_it_matters
- suggested_fix
- expected_impact
5. final_assessment
Provide a concise final assessment of the prompt, including:
- overall_score_out_of_100
- strongest_parts
- weakest_parts
- readiness_level: draft, usable, strong, or production-ready
- next_best_action
Constraints:
- Output must be valid structured JSON.
- Do not include markdown outside the JSON.
- Test cases must be realistic and specific.
- Recommendations must be practical and prioritized.
- Do not invent sensitive facts or unsupported claims.
- Make the harness reusable for future prompt versions.
Return only the JSON object.