Prompt Engineering: What Actually Works

Independent study: how 10 prompt engineering techniques perform across 4 models available in Russia

A structured prompt improves any model by 20-30%. But it won't replace a strong one.

Key Findings

Works

Structured output, role framing, self-critique, few-shot. Win rate 74–82% against naive prompts.

StructureRoleFew-shotSelf-critique

win rate: 74-82%

Doesn't Work

ALL CAPS, aggressive tone, multi-step decomposition. Win rate below 55%.

CAPSAggressionDecomposition

win rate: <55%

Can't Fix with Prompting

Factual accuracy, depth of knowledge, premium-model quality. Maximum gain just +0.08–0.52 points.

AccuracyKnowledgeParity

win rate: +0.08-0.52

Technique Ranking

Low
High
1
Structured Output Strong
77%

Describe the desired output structure: headings, tables, sections. The model fills in the template and doesn't skip important parts. Best ROI of all techniques.

76% GigaChat-Ultra
82% GigaChat-2-Max
76% Alice AI LLM (Yandex)
74% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%.

What's going on and what should we do?

Respond strictly in this format:

## Diagnosis (2–3 sentences: what exactly is wrong)

## Root Causes
For each cause:
- What is happening
- Why it is happening (cause-effect relationship)
- What data supports this

## Recommendations (from most urgent to least)
For each recommendation:
1. What to do (specific action)
2. Expected result (in numbers, if possible)
3. Timeline
4. Who is responsible

## What I Don't Know
What information is missing for a more accurate analysis?
Read more in blog
2
Few-shot (Reference) StrongRequires a reference answer from a premium model
78%

Provide an example of a high-quality answer. Highest win rate (89% for GigaChat-2-Max), but requires a reference answer from a premium model.

82% GigaChat-Ultra
89% GigaChat-2-Max
75% Alice AI LLM (Yandex)
65% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
Here's an example of a quality business analysis on a similar task:

---
Question: "A bakery-cafe, 8 employees. Quarterly revenue -25%, while number of orders grew 15%."

Answer:
Diagnosis: classic "more customers, cheaper" pattern. Growth in orders with falling revenue means a ~35% drop in average order value. Three possible causes: (1) demand shift to cheaper items, (2) aggressive discounts, (3) audience change.

Recommendations:
1. Immediately: analyze order structure – what % is "coffee only" vs combos.
2. Short-term: replace "$1 coffee" promo with "$3 coffee + croissant" – keeps traffic, raises AOV.
3. Medium-term: introduce a loyalty program with a $5 threshold.
---

Now answer with the same level of specificity and depth:

We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%.

What's going on and what should we do?
Read more in blog
3
Role + Context Strong
71%

Assign the model an expert role and task context. One line at the start of your prompt. Works across all models by activating relevant knowledge.

75% GigaChat-Ultra
75% GigaChat-2-Max
75% Alice AI LLM (Yandex)
60% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
You are an experienced business analyst with 15 years in e-commerce consulting. You've worked with Amazon, Shopify merchants, and dozens of mid-sized online stores. Your task is to prepare a McKinsey-level analytical memo for the CEO. The memo must include specific numbers, cause-effect relationships, and prioritized recommendations. Avoid generic phrases like "needs improvement" – write what specifically to do.

Client situation: electronics e-commerce store, 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. Ad budget was raised by 30%.

What's going on and what should we do?
Read more in blog
4
Self-Critique Strong
71%

Ask the model to find weaknesses and improve its answer. The only multi-turn technique that reliably helps all models.

71% GigaChat-Ultra
74% GigaChat-2-Max
65% Alice AI LLM (Yandex)
75% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst.

[Message 1]: We run an electronics e-commerce store, 45 employees. Revenue dropped 18%, traffic grew 12%. AOV fell from $120 to $85. Returns grew from 4% to 11%. Ad budget raised 30%. What's going on and what should we do?

[Message 2]: Re-read your answer critically. Find: 1) Where were you vague? 2) Where could there be logic errors? 3) What did you miss? Then give an improved version fixing all issues found.
Read more in blog
5
Chain-of-Thought Moderate
71%

Ask the model to reason step by step. Helps strong models (Qwen3: 78%), but may add noise for weaker ones (Alice: 60%).

68% GigaChat-Ultra
79% GigaChat-2-Max
60% Alice AI LLM (Yandex)
78% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%.

Before giving recommendations, analyze step by step:

Step 1. List all facts from the brief and what each means individually.
Step 2. Find connections between facts. Which are causes, which are effects?
Step 3. Formulate 2–3 hypotheses about what's happening.
Step 4. For each hypothesis, evaluate: what data supports it, what contradicts it?
Step 5. Choose the most likely explanation.
Step 6. Give recommendations based on that explanation.
Step 7. Note where you're unsure and what information is missing.

Show your entire reasoning process.
Read more in blog
6
XML + Markdown Moderate
73%

XML tags for structure, Markdown for output. Excellent for Alice (85%) and Qwen3 (74%), weak for GigaChat-Ultra (53%).

53% GigaChat-Ultra
78% GigaChat-2-Max
85% Alice AI LLM (Yandex)
74% Qwen3 Max

* q < 0.05 (FDR)

Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
<task>
Analyze the business situation and provide recommendations.
</task>

<context>
Company: electronics e-commerce store, 45 employees.
Period: last quarter.
</context>

<data>
- Revenue: -18%
- Website traffic: +12%
- Average order value: dropped from $120 to $85
- Returns: grew from 4% to 11%
- Ad budget: +30%
</data>

<question>What's going on and what should we do?</question>

<output_format>
# Diagnosis (2-3 sentences)
# Root Causes (with evidence from the data)
# Recommendations (table: action, expected result, timeline, priority)
# Missing Data
</output_format>

<constraints>
- No generic phrases without specifics
- Tie every conclusion to numbers from the <data> section
- If unsure – state it explicitly
</constraints>
Read more in blog
7
Iterative Refinement Moderate
65%

Iterative answer refinement over multiple steps. Results vary heavily by model, and judges disagree on whether improvements are substantive.

61% GigaChat-Ultra
72% GigaChat-2-Max
61% Alice AI LLM (Yandex)
68% Qwen3 Max
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst.

[Message 1]: We run an electronics e-commerce store, 45 employees. Revenue -18%, traffic +12%, AOV dropped from $120 to $85, returns from 4% to 11%, ad budget +30%. What's going on and what should we do?

[Message 2]: Too generic. I need: 1) Specific numbers – calculate dollar losses and ad ROI. 2) Tie each recommendation to a specific metric. 3) State what you don't know for sure.

[Message 3]: Format this as a 1-page executive memo for the CEO. Format: problem -> causes -> action plan for 30/60/90 days with KPIs.
Read more in blog
8
Decomposition Ineffective
58%

Break the task into subtasks across separate messages. Loses context between steps. Only works for GigaChat-Ultra; the other three models actually do worse than with a naive prompt.

79% GigaChat-Ultra
58% GigaChat-2-Max
44% Alice AI LLM (Yandex)
49% Qwen3 Max
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
System prompt: You are a business analyst. Be specific, use numbers.

[Message 1]: Here's last quarter data: revenue -18%, traffic +12%, AOV dropped from $120 to $85, returns grew from 4% to 11%, ad budget +30%. Analyze each metric individually – what does it mean on its own?

[Message 2]: Now find connections between these metrics. Which are causes, which are effects? Formulate 2–3 hypotheses.

[Message 3]: Based on the analysis, give 5 specific recommendations. For each: what to do, expected result, timeline, who's responsible.
Read more in blog
9
ALL CAPS Ineffective
50%

Highlighting key words with ALL CAPS. Zero effect on quality. Doesn't even hurt.

47% GigaChat-Ultra
54% GigaChat-2-Max
56% Alice AI LLM (Yandex)
44% Qwen3 Max
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
WE RUN AN ELECTRONICS E-COMMERCE STORE WITH 45 EMPLOYEES. LAST QUARTER REVENUE DROPPED 18%, WHILE WEBSITE TRAFFIC GREW 12%. AVERAGE ORDER VALUE FELL FROM $120 TO $85. RETURNS INCREASED FROM 4% TO 11%. WE RAISED THE AD BUDGET BY 30%.

WHAT'S GOING ON AND WHAT SHOULD WE DO?
Read more in blog
10
Aggressive Tone Ineffective
44%

Aggressive, demanding tone. The model becomes more confident but not more accurate. Suppresses hedging (GigaChat-Ultra: 0% hedging rate).

43% GigaChat-Ultra
46% GigaChat-2-Max
43% Alice AI LLM (Yandex)
44% Qwen3 Max
Example prompt Scenario: electronics e-commerce store, revenue dropped 18%
Look, I need a real answer.

We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%.

What's going on and what should we do?

No fluff.
Read more in blog
Heatmap: prompt engineering technique effectiveness across models

Win rate by technique and model – numbers are approximate, see ranking above for exact data

Results by Model

Techniques improve the formal metrics: answers become more complete, better structured, and the model hedges less. But in a head-to-head comparison with GPT-5.4 or Claude 4.6, the gap in analytical depth and factual accuracy remains clear.

GigaChat-Ultra Best technique: Structured Output
Baseline score: 3.44 Best score: 4.27
Best (Kimi K2.5): 4.57

Usable but not competitive with premium models

Answer quality improvement (technique vs naive prompt) 79%
Win rate against premium models~18%
AccuracyActionabilityCompletenessHonestyClarity
Baseline (naive prompt)
Best technique
Accuracy 3.6 4.0+0.4
Actionability 3.2 4.1+0.9
Completeness 3.3 4.4+1.1
Honesty 3.0 4.8+1.7
Clarity 4.4 4.7+0.2

Ceiling vs premium

vs Claude 4.6: 6%
vs GPT-5.4: 22%
vs Kimi K2.5: 26%
GigaChat-2-Max Best technique: Few-shot (Reference)
Baseline score: 3.02 Best score: 3.63
Best (Kimi K2.5): 4.57

Not competitive. Prompt engineering can't compensate for a weak model

Answer quality improvement (technique vs naive prompt) 41%
Win rate against premium models~6%
AccuracyActionabilityCompletenessHonestyClarity
Baseline (naive prompt)
Best technique
Accuracy 3.1 3.5+0.4
Actionability 2.8 3.3+0.5
Completeness 2.9 3.5+0.6
Honesty 2.9 4.0+1.1
Clarity 4.1 4.3+0.2

Ceiling vs premium

vs Claude 4.6: 0%
vs GPT-5.4: 4%
vs Kimi K2.5: 14%
Alice AI LLM (Yandex) Best technique: Structured Output
Baseline score: 3.85 Best score: 4.46
Best (Kimi K2.5): 4.57

Viable alternative. Wins 1 in 3 comparisons against premium

Answer quality improvement (technique vs naive prompt) 95%
Win rate against premium models~31%
AccuracyActionabilityCompletenessHonestyClarity
Baseline (naive prompt)
Best technique
Accuracy 3.8 4.2+0.4
Actionability 3.8 4.3+0.6
Completeness 3.7 4.6+0.9
Honesty 3.7 4.8+1.1
Clarity 4.9 5.0+0.1

Ceiling vs premium

vs Claude 4.6: 28%
vs GPT-5.4: 36%
vs Kimi K2.5: 31%
Qwen3 Max Best technique: Structured Output
Baseline score: 3.93 Best score: 4.49
Best (Kimi K2.5): 4.57

Near-competitive. 33-41% win rate against premium models

Answer quality improvement (technique vs naive prompt) 100%
Win rate against premium models~37%
AccuracyActionabilityCompletenessHonestyClarity
Baseline (naive prompt)
Best technique
Accuracy 4.0 4.0+0.1
Actionability 3.6 4.5+0.9
Completeness 3.8 4.6+0.8
Honesty 4.0 4.9+0.9
Clarity 4.9 5.0+0.1

Ceiling vs premium

vs Claude 4.6: 39%
vs GPT-5.4: 33%
vs Kimi K2.5: 41%

What To Do: Practical Playbook

Using GigaChat

Apply Structured output

76-82%

win rate

Low

Effort

Using Alice AI

Apply XML + Markdown

85%

win rate

High

Effort

Using Qwen3

Apply CoT + Structure

74-78%

win rate

Low

Effort

Starting from scratch

Apply Role

60-75%

win rate

Minimal

Effort

Willing to iterate

Apply Self-critique

65-75%

win rate

Medium

Effort

Start applying now

Role, structure, and self-critique from this study are covered in the first chapter

Try for free

Try It Yourself

Same task, two prompting approaches. Hit "Run" and compare.

Scenario 1: Business Data Analysis

An electronics e-commerce store with declining revenue and rising returns. Same situation, two approaches.

Naive prompt – how a typical manager writes:

Try it yourself
Naive prompt – no structure
You
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. What's going on and what should we do?
Comparing:
aliceai-llm · gemini-3.1-flash-lite-preview · qwen3.6-plus · gpt-5.4-nano

Structured prompt – same task, but with an output template:

Try it yourself
Structured prompt – with output template
You
We run an electronics e-commerce store with 45 employees. Last quarter revenue dropped 18%, while website traffic grew 12%. Average order value fell from $120 to $85. Returns increased from 4% to 11%. We raised the ad budget by 30%. What's going on and what should we do? Respond strictly in this format: ## Diagnosis (2-3 sentences: what exactly is wrong) ## Root Causes For each cause: - What is happening - Why it is happening (cause-effect relationship) - What data supports this ## Recommendations (from most urgent to least) For each recommendation: 1. What to do (specific action) 2. Expected result (in numbers, if possible) 3. Timeline 4. Who is responsible ## What I Don't Know What information is missing for a more accurate analysis?
Comparing:
aliceai-llm · gemini-3.1-flash-lite-preview · qwen3.6-plus · gpt-5.4-nano

Scenario 2: Difficult Team Communication

Writing a team email about layoffs. An emotionally complex task where structure is critical.

Naive prompt:

Try it yourself
Naive prompt – communication
You
I need to write an email to my team. Situation: management decided to eliminate 3 of 15 positions in our department. The decision is final. Specific people haven't been identified yet – that will happen in 2 weeks. I need to inform the team about the upcoming layoffs without causing panic, but honestly. Write the email text.
Comparing:
aliceai-llm · gemini-3.1-flash-lite-preview · qwen3.6-plus · gpt-5.4-nano

Structured prompt:

Try it yourself
Structured prompt – communication
You
I need to write an email to my team. Situation: management decided to eliminate 3 of 15 positions in our department. The decision is final. Specific people haven't been identified yet – that will happen in 2 weeks. I need to inform the team about the upcoming layoffs without causing panic, but honestly. Write the email text in this structure: ## Subject Line (brief, neutral) ## Email Body Must include in this order: 1. Direct statement of the decision (no long lead-ins) 2. Reasons (brief, honest) 3. What has been decided and what hasn't 4. Specific timelines and next steps 5. What the company will do for support 6. Invitation for a personal conversation ## What Should NOT Be in the Email - Which phrases to avoid and why
Comparing:
aliceai-llm · gemini-3.1-flash-lite-preview · qwen3.6-plus · gpt-5.4-nano
Show methodology
6

scenarios

3

repetitions per combination

3

evaluation passes

Each of the 10 techniques was tested on 6 real management tasks: data analysis, communication, planning. Every combination was repeated 3 times for result stability.

Two LLM judges (Claude Opus 4.6 + Gemini 3.1 Pro) weighted 70/30. Inter-judge agreement: 66–73%.

Methodology:

  • Pass 1: pairwise comparison – does the technique beat a naive prompt? (2,880 evaluations)
  • Pass 2: absolute scoring across 5 dimensions: accuracy, actionability, completeness, honesty, clarity (564 evaluations)
  • Pass 3: ceiling – can the best technique compete with GPT-5.4, Claude 4.6, Kimi K2.5? (864 evaluations)

Statistics: Fisher exact test with FDR correction (q < 0.05). Significant results marked in the table.

All prompts, data, and evaluation scripts are available on request for experiment reproduction.

Data: 2,880 pairwise + 564 absolute + 864 ceiling evaluations | Analysis date: 2026-05-04

Frequently Asked Questions

Do ALL CAPS help in prompts?

No. The CAPS technique showed a win rate of about 50% across all models, statistically indistinguishable from random. Capitalizing keywords has no effect on output quality.

Do responses improve if you write aggressively?

No, they get worse. Aggressive tone showed a 43–46% win rate. The only measurable effect: the model stops admitting uncertainty (GigaChat-Ultra: 0% hedging rate). Responses become more confident, but not more accurate.

Is 'think step by step' worth it?

Depends on the model. For Qwen3 Max (78% win rate) and GigaChat-2-Max (79%), it's one of the best techniques. For Alice (60%), the effect is weak. CoT primarily helps models avoid skipping data, not improve reasoning.

Is it better to split the task across multiple messages?

Usually no. Multi-turn techniques (decomposition, self-critique, iteration) on average perform no better than single-turn. Exception: self-critique reliably helps. Decomposition loses context between steps and worsens output for 3 of 4 models.

What matters more: prompting technique or model choice?

Model choice. Even the best technique (structured output) on GigaChat-Ultra (4.27 score) loses 78% of comparisons against GPT-5.4, which scored 4.38 with a naive prompt. Prompt engineering raises the floor, but the model determines the ceiling.

Ready to apply these techniques to real work?

The first chapter covers role, structure, and self-critique on real management tasks – with AI feedback and measurable results

Try for free