Prompt Engineering: What Actually Works
Independent study: how 10 prompt engineering techniques perform across 4 models available in Russia
A structured prompt improves any model by 20-30%. But it won't replace a strong one.
Key Findings
Works
Structured output, role framing, self-critique, few-shot. Win rate 74–82% against naive prompts.
win rate: 74-82%
Doesn't Work
ALL CAPS, aggressive tone, multi-step decomposition. Win rate below 55%.
win rate: <55%
Can't Fix with Prompting
Factual accuracy, depth of knowledge, premium-model quality. Maximum gain just +0.08–0.52 points.
win rate: +0.08-0.52
Technique Ranking
1Structured Output
Strong77%
Describe the desired output structure: headings, tables, sections. The model fills in the template and doesn't skip important parts. Best ROI of all techniques.
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. What's going on and what should we do? Respond strictly in this format: ## Diagnosis (2–3 sentences: what exactly is wrong) ## Root Causes For each cause: - What is happening - Why it is happening (cause-effect relationship) - What data supports this ## Recommendations (from most urgent to least) For each recommendation: 1. What to do (specific action) 2. Expected result (in numbers, if possible) 3. Timeline 4. Who is responsible ## What I Don't Know What information is missing for a more accurate analysis?
2Few-shot (Reference)
StrongRequires a reference answer from a premium model78%
Provide an example of a high-quality answer. Highest win rate (89% for GigaChat-2-Max), but requires a reference answer from a premium model.
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
Here's an example of a quality business analysis on a similar task: --- Question: "A bakery-cafe, 8 employees. Quarterly revenue -25%, while number of orders grew 15%." Answer: Diagnosis: classic "more customers, cheaper" pattern. Growth in orders with falling revenue means a ~35% drop in average order value. Three possible causes: (1) demand shift to cheaper items, (2) aggressive discounts, (3) audience change. Recommendations: 1. Immediately: analyze order structure – what % is "coffee only" vs combos. 2. Short-term: replace "$1 coffee" promo with "$3 coffee + croissant" – keeps traffic, raises AOV. 3. Medium-term: introduce a loyalty program with a $5 threshold. --- Now answer with the same level of specificity and depth: We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. What's going on and what should we do?
3Role + Context
Strong71%
Assign the model an expert role and task context. One line at the start of your prompt. Works across all models by activating relevant knowledge.
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
You are an experienced business analyst with 15 years in e-commerce consulting. You've worked with Amazon, Shopify merchants, and dozens of mid-sized online stores. Your task is to prepare a McKinsey-level analytical memo for the CEO. The memo must include specific numbers, cause-effect relationships, and prioritized recommendations. Avoid generic phrases like "needs improvement" – write what specifically to do. Client situation: electronics e-commerce store, 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. Ad budget was raised by 30%. What's going on and what should we do?
4Self-Critique
Strong71%
Ask the model to find weaknesses and improve its answer. The only multi-turn technique that reliably helps all models.
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst. [Message 1]: We run an electronics e-commerce store, 45 employees. Revenue dropped 18%, traffic grew 12%. AOV fell from $120 to $85. Returns grew from 4% to 11%. Ad budget raised 30%. What's going on and what should we do? [Message 2]: Re-read your answer critically. Find: 1) Where were you vague? 2) Where could there be logic errors? 3) What did you miss? Then give an improved version fixing all issues found.
5Chain-of-Thought
Moderate71%
Ask the model to reason step by step. Helps strong models (Qwen3: 78%), but may add noise for weaker ones (Alice: 60%).
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. Before giving recommendations, analyze step by step: Step 1. List all facts from the brief and what each means individually. Step 2. Find connections between facts. Which are causes, which are effects? Step 3. Formulate 2–3 hypotheses about what's happening. Step 4. For each hypothesis, evaluate: what data supports it, what contradicts it? Step 5. Choose the most likely explanation. Step 6. Give recommendations based on that explanation. Step 7. Note where you're unsure and what information is missing. Show your entire reasoning process.
6XML + Markdown
Moderate73%
XML tags for structure, Markdown for output. Excellent for Alice (85%) and Qwen3 (74%), weak for GigaChat-Ultra (53%).
* q < 0.05 (FDR)
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
<task> Analyze the business situation and provide recommendations. </task> <context> Company: electronics e-commerce store, 45 employees. Period: last quarter. </context> <data> - Revenue: -18% - Website traffic: +12% - Average order value: dropped from $120 to $85 - Returns: grew from 4% to 11% - Ad budget: +30% </data> <question>What's going on and what should we do?</question> <output_format> # Diagnosis (2-3 sentences) # Root Causes (with evidence from the data) # Recommendations (table: action, expected result, timeline, priority) # Missing Data </output_format> <constraints> - No generic phrases without specifics - Tie every conclusion to numbers from the <data> section - If unsure – state it explicitly </constraints>
7Iterative Refinement
Moderate65%
Iterative answer refinement over multiple steps. Results vary heavily by model, and judges disagree on whether improvements are substantive.
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst. [Message 1]: We run an electronics e-commerce store, 45 employees. Revenue -18%, traffic +12%, AOV dropped from $120 to $85, returns from 4% to 11%, ad budget +30%. What's going on and what should we do? [Message 2]: Too generic. I need: 1) Specific numbers – calculate dollar losses and ad ROI. 2) Tie each recommendation to a specific metric. 3) State what you don't know for sure. [Message 3]: Format this as a 1-page executive memo for the CEO. Format: problem -> causes -> action plan for 30/60/90 days with KPIs.
8Decomposition
Ineffective58%
Break the task into subtasks across separate messages. Loses context between steps. Only works for GigaChat-Ultra; the other three models actually do worse than with a naive prompt.
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst. Be specific, use numbers. [Message 1]: Here's last quarter data: revenue -18%, traffic +12%, AOV dropped from $120 to $85, returns grew from 4% to 11%, ad budget +30%. Analyze each metric individually – what does it mean on its own? [Message 2]: Now find connections between these metrics. Which are causes, which are effects? Formulate 2–3 hypotheses. [Message 3]: Based on the analysis, give 5 specific recommendations. For each: what to do, expected result, timeline, who's responsible.
9ALL CAPS
Ineffective50%
Highlighting key words with ALL CAPS. Zero effect on quality. Doesn't even hurt.
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
WE RUN AN ELECTRONICS E-COMMERCE STORE WITH 45 EMPLOYEES. LAST QUARTER REVENUE DROPPED 18%, WHILE WEBSITE TRAFFIC GREW 12%. AVERAGE ORDER VALUE FELL FROM $120 TO $85. RETURNS INCREASED FROM 4% TO 11%. WE RAISED THE AD BUDGET BY 30%. WHAT'S GOING ON AND WHAT SHOULD WE DO?
10Aggressive Tone
Ineffective44%
Aggressive, demanding tone. The model becomes more confident but not more accurate. Suppresses hedging (GigaChat-Ultra: 0% hedging rate).
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
Look, I need a real answer. We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. What's going on and what should we do? No fluff.

Win rate by technique and model – numbers are approximate, see ranking above for exact data
Results by Model
Techniques improve the formal metrics: answers become more complete, better structured, and the model hedges less. But in a head-to-head comparison with GPT-5.4 or Claude 4.6, the gap in analytical depth and factual accuracy remains clear.
GigaChat-Ultra
Best technique: Structured OutputBaseline score: 3.44
Best score: 4.27Best (Kimi K2.5): 4.57
Usable but not competitive with premium models
Ceiling vs premium
GigaChat-2-Max
Best technique: Few-shot (Reference)Baseline score: 3.02
Best score: 3.63Best (Kimi K2.5): 4.57
Not competitive. Prompt engineering can't compensate for a weak model
Ceiling vs premium
Alice AI LLM (Yandex)
Best technique: Structured OutputBaseline score: 3.85
Best score: 4.46Best (Kimi K2.5): 4.57
Viable alternative. Wins 1 in 3 comparisons against premium
Ceiling vs premium
Qwen3 Max
Best technique: Structured OutputBaseline score: 3.93
Best score: 4.49Best (Kimi K2.5): 4.57
Near-competitive. 33-41% win rate against premium models
Ceiling vs premium
What To Do: Practical Playbook
Using GigaChat
Apply Structured output
win rate
Effort
Using Alice AI
Apply XML + Markdown
win rate
Effort
Using Qwen3
Apply CoT + Structure
win rate
Effort
Starting from scratch
Apply Role
win rate
Effort
Willing to iterate
Apply Self-critique
win rate
Effort
Start applying now
Role, structure, and self-critique from this study are covered in the first chapter
Try for freeTry It Yourself
Same task, two prompting approaches. Hit "Run" and compare.
Scenario 1: Business Data Analysis
An electronics e-commerce store with declining revenue and rising returns. Same situation, two approaches.
Naive prompt – how a typical manager writes:
Structured prompt – same task, but with an output template:
Scenario 2: Difficult Team Communication
Writing a team email about layoffs. An emotionally complex task where structure is critical.
Naive prompt:
Structured prompt:
Show methodology
scenarios
repetitions per combination
evaluation passes
Each of the 10 techniques was tested on 6 real management tasks: data analysis, communication, planning. Every combination was repeated 3 times for result stability.
Two LLM judges (Claude Opus 4.6 + Gemini 3.1 Pro) weighted 70/30. Inter-judge agreement: 66–73%.
Methodology:
- Pass 1: pairwise comparison – does the technique beat a naive prompt? (2,880 evaluations)
- Pass 2: absolute scoring across 5 dimensions: accuracy, actionability, completeness, honesty, clarity (564 evaluations)
- Pass 3: ceiling – can the best technique compete with GPT-5.4, Claude 4.6, Kimi K2.5? (864 evaluations)
Statistics: Fisher exact test with FDR correction (q < 0.05). Significant results marked in the table.
All prompts, data, and evaluation scripts are available on request for experiment reproduction.
Data: 2,880 pairwise + 564 absolute + 864 ceiling evaluations | Analysis date: 2026-05-04
Frequently Asked Questions
Do ALL CAPS help in prompts?
No. The CAPS technique showed a win rate of about 50% across all models, statistically indistinguishable from random. Capitalizing keywords has no effect on output quality.
Do responses improve if you write aggressively?
No, they get worse. Aggressive tone showed a 43–46% win rate. The only measurable effect: the model stops admitting uncertainty (GigaChat-Ultra: 0% hedging rate). Responses become more confident, but not more accurate.
Is 'think step by step' worth it?
Depends on the model. For Qwen3 Max (78% win rate) and GigaChat-2-Max (79%), it's one of the best techniques. For Alice (60%), the effect is weak. CoT primarily helps models avoid skipping data, not improve reasoning.
Is it better to split the task across multiple messages?
Usually no. Multi-turn techniques (decomposition, self-critique, iteration) on average perform no better than single-turn. Exception: self-critique reliably helps. Decomposition loses context between steps and worsens output for 3 of 4 models.
What matters more: prompting technique or model choice?
Model choice. Even the best technique (structured output) on GigaChat-Ultra (4.27 score) loses 78% of comparisons against GPT-5.4, which scored 4.38 with a naive prompt. Prompt engineering raises the floor, but the model determines the ceiling.
Ready to apply these techniques to real work?
The first chapter covers role, structure, and self-critique on real management tasks – with AI feedback and measurable results
Try for free