Starter prompts
4 ways to start with Prompt.
Eval harness
→ Score this prompt
▸ Preview prompt
Build a proper evaluation for this production prompt and give me a concrete reliability report.
Prompt under test:
```
You are a support triage agent at Acme SaaS. Given a customer message, output one of these labels and NOTHING ELSE: BILLING, BUG, FEATURE_REQUEST, HOW_TO, OTHER.
```
Do all of this in your reply:
1. Build an eval set of 25 realistic customer messages as a JSON array of `{input, expected_label, why}` — span the 5 categories with at least 1 hard/ambiguous case per category.
2. Predict where the prompt fails. Name the 3 worst failure modes with the exact category-pair confusions you expect (e.g. 'BILLING vs HOW_TO when user asks about pricing tiers').
3. Rewrite the prompt to address the 3 failure modes. Show the new prompt verbatim in a code block.
4. Score the rewrite vs. the original on a 5-axis rubric (clarity / coverage / ambiguity handling / brevity / output-format strictness). Mark each axis with the score and one-line rationale.
5. End with a 'next step' line — what eval would you run on the rewrite before shipping?
JSON output
→ Reliable schema
▸ Preview prompt
Tighten this prompt so it returns valid JSON matching a given Zod schema, with a fallback when the model refuses.
Cost cut
→ Same quality
▸ Preview prompt
Cut tokens on this prompt by 40% without losing accuracy on my eval set. Show the diff.
Jailbreak test
→ Find the holes
▸ Preview prompt
Adversarially test this customer-support prompt for prompt-injection and refusal-bypass attacks. List concrete fixes.
What it does
Tasks Prompt ships every week.
Prompt craft
- Spec + eval pairs
- Few-shot selection
- Structured output schema
- Failure-mode hardening
Operations
- Eval harness setup
- A/B prompt experiments
- Cost vs quality tuning
- Versioning + rollback
Worked sample
A real Prompt chat.
Pairs well with