33 AI Models for Managers: Why We Need Your Ratings

Over the past year, 33 new AI models have appeared on the market, each claiming the title of “best manager’s assistant.” ChatGPT updated to GPT-5.2, Claude released Opus 4.5, Gemini added a new Pro version, Yandex and Sber announced further improvements, and Chinese models went open source. How do you choose a tool when every one of them promises a productivity revolution? We decided to run a large-scale comparative study – but ran into a problem that may seem paradoxical.
The Problem of Objectivity in AI Evaluation
Imagine: you ask three AI models to prepare a plan for a 1-on-1 meeting with an employee whose performance is declining. ChatGPT gives a detailed list of 12 questions with psychological technique explanations. Claude offers a concise 5-point structure focused on empathy. YandexGPT drafts a plan that accounts for Russian HR norms and corporate ethics.
Which answer is better?
This is not a case where you can verify correctness through calculation, the way you can in math. There is no single correct plan for a 1-on-1 meeting. Quality depends on context: the manager’s experience, the employee’s personality, the company culture, and the urgency of the issue. One manager will appreciate ChatGPT’s detail, another will prefer Claude’s brevity, and a third will choose YandexGPT for its local specificity.
Surprisingly, even when testing 33 models across 32 real scenarios (33 × 32 = 1,056 responses), a fundamental question remains: who determines what counts as a “good” answer?
Why We Are Testing These Specific 33 Models
The list is not random. We selected models based on three criteria: availability in Russia, relevance to management, and representation of different price tiers.
Global leaders (8 models):
- OpenAI: GPT-5.2-Pro, GPT-4o, GPT-4o-mini (paid and API)
- Anthropic: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5 (three performance tiers)
- Google: Gemini 2.5 Pro, Gemini 3.0 Flash (latest versions)
Available in Russia without VPN (6 models):
- Yandex: AliceLLM, YandexGPT 5 Pro, YandexGPT 5 Lite
- Sber: GigaChat Pro, GigaChat
- DeepSeek, Qwen, Xiaomi (Chinese models gaining popularity)
Specialized and niche (19 models):
- Meta Llama 3.3 70B, Mistral Large, Qwen 2.5, and other open-source solutions
- Models optimized for reasoning (DeepSeek R1, OpenAI o1-mini)
- Lightweight models for basic tasks (Phi-4 Mini, Gemma 3)
Why so many? Because in our course we teach how to choose the right tool for a specific task. One scenario requires deep analysis (an expensive model like GPT-5.2 Pro works well), another needs quick text generation (the free Gemini 3.0 Flash can handle it). A third requires working without a VPN (Russian models only). A fourth involves processing large volumes of data (tokens and cost are critical).
In the open module, you can already explore this material – it contains 12 lessons with practical scenarios. The study will give us concrete data: which model is best for team analysis, which for preparing presentations, and which for writing feedback. Students will receive not abstract advice like “use AI,” but a table with test results.
The “Naive Manager” Methodology
This brings us to an important methodological decision. We deliberately do not optimize prompts. We do not use chain-of-thought, few-shot examples, or break the task into subtasks. Prompts are written the way an ordinary manager with no prompt engineering experience would write them.
Why? Because that is reality. Most AI users write queries in natural language:
“Help me prepare for a meeting with the director about the project budget”
Not like this:
“You are an experienced corporate finance consultant. Use step-by-step reasoning. Analyze the following context: [project details]. Propose three argumentation options for defending the budget, each with quantitative ROI justification…”
The first prompt is what 90% of users type. The second is the result of prompt engineering training. We test models with the first option because we want to understand which tool works best for a “naive” user.
This reflects the real problem of AI adoption in companies. You can teach employees to write perfect prompts, but it takes time and discipline. In practice, people want to ask a question as they would a colleague – and get a useful answer. Which model handles this best?
Dual LLM-as-Judge: When AI Evaluates AI
With more than a thousand responses on hand (33 models × 32 scenarios = 1,056), a scale problem emerges. A human cannot objectively evaluate that many texts in a reasonable time: even at 5 minutes per answer, that is 88 hours of work – more than two working weeks. Over that time, evaluation standards will inevitably drift: what seemed like a good answer at the start of the week may look mediocre against new examples.
The solution is LLM-as-Judge: one AI model evaluates the answers of other models. This is a popular approach in AI research, but it has a bias problem. A judge model may rate answers written in a style similar to its own more highly, or systematically inflate scores for certain approaches.
We use Dual Judge – two different judge models:
Judge A: Claude Opus 4.5 – evaluates nuances, tone, and regional context awareness. Claude understands empathy, cultural differences, and ethical aspects well. It will notice if a model gave advice that is inapplicable in Russian corporate culture.
Judge B: Gemini 3 Pro – evaluates reasoning structure, data accuracy, and answer format. Gemini is stronger in analytics, verifying logical chains, and identifying factual errors.
Each answer receives two independent scores on a 0–5 scale. The final score is the arithmetic mean. If judges differ by more than 0.75 points (for example, one gives 2.0, the other 3.0), the answer is flagged for human review.
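For clarity, here is a minimal sketch of this aggregation rule in Python. Only the 0–5 scale, the arithmetic mean, and the 0.75 threshold come from the methodology above; the function name and structure are illustrative, not our actual pipeline code.

```python
# Illustrative sketch of the dual-judge aggregation rule. Only the 0-5
# scale, the mean, and the 0.75 threshold are from the methodology;
# everything else (names, structure) is hypothetical.

DISAGREEMENT_THRESHOLD = 0.75  # above this gap, send to a human

def aggregate(score_a: float, score_b: float) -> tuple[float, bool]:
    """Combine two 0-5 judge scores into a final score and a review flag.

    score_a -- Judge A (Claude Opus 4.5)
    score_b -- Judge B (Gemini 3 Pro)
    """
    final = (score_a + score_b) / 2                        # arithmetic mean
    flagged = abs(score_a - score_b) > DISAGREEMENT_THRESHOLD
    return final, flagged

# The example from the text: scores of 2.0 and 3.0 differ by 1.0 > 0.75,
# so the final score is 2.5 and the answer goes to human review.
print(aggregate(2.0, 3.0))  # (2.5, True)
```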
Why these specific judge models? Claude Opus 4.5 and Gemini 3 Pro are the best in their classes, but they have different “philosophies.” Claude tends toward detailed, empathetic answers; Gemini toward structured, fact-based ones. By using both, we balance the evaluation between the “human” and the “analytical” qualities of an answer.
Calibration with Human Opinion: Why We Need Your Help
Here a critical question arises: how do we know the judges are evaluating correctly?
A judge model can be consistent – always giving similar scores to similar answers. But consistency does not guarantee alignment with human preferences. If Claude Opus 4.5 systematically underrates brief answers (because it is inclined toward detail), it will be unfair to models with a concise style.
The solution is Human Audit: a human evaluates random answers on the same 0–5 scale. This is called a “Gold Set” – reference ratings against which we compare the judges’ work.
Statistically, the correlation between LLM-Judge scores and human scores must be > 0.60 for the automated evaluation to be valid. If the correlation is lower, the judges are unreliable, and their scores cannot be used for ranking models.
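As a sketch, the validity check could look like this in Python, assuming paired judge and human scores for the same answers. Spearman rank correlation is our assumption here as one reasonable choice; the methodology above fixes the 0.60 threshold but not the specific coefficient.

```python
# Sketch of the calibration check: does the judge's correlation with
# human Gold Set ratings exceed 0.60? Spearman rank correlation is an
# assumption; the article does not specify which coefficient is used.
from scipy.stats import spearmanr

def judge_is_valid(judge_scores, human_scores, threshold=0.60):
    """True if the judge agrees with humans strongly enough to be used."""
    rho, _p = spearmanr(judge_scores, human_scores)
    return rho > threshold

# Hypothetical Gold Set fragment: 0-5 ratings of the same six answers.
judge = [4.5, 3.0, 2.0, 5.0, 1.5, 3.5]
human = [4.0, 3.5, 2.5, 4.5, 1.0, 3.0]
print(judge_is_valid(judge, human))  # True: rho ≈ 0.94 > 0.60
```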
Why are additional human ratings needed?
First, for independent verification of judge reliability. A 5% sample is enough for statistical validation, but the more human ratings there are, the more accurate the calibration. If 10 different people rate the same answer, we will see the spread of opinions and understand how subjective “quality” assessment is for a specific scenario.
Second, for detecting systematic errors. If a judge consistently gives high scores to answers with many bullet points, but people prefer concise answers – that is a signal to recalibrate the judge’s prompt (a minimal sketch of such a check appears after the next paragraph).
Third, for understanding what matters to managers. Perhaps professionals will rate an answer higher if it contains specific metrics. Or the opposite – prefer an empathetic tone over numbers. These are qualitative insights that cannot be obtained from automated evaluations.
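To make the second point concrete, here is a hypothetical sketch of one such bias check: does the gap between judge and human scores grow with the number of bullet points in an answer? All names and heuristics here are illustrative, not part of the study’s actual tooling.

```python
# Hypothetical check for the "bullet point" bias described above:
# correlate bullet density with the judge-minus-human score gap.
from scipy.stats import pearsonr

def bullet_count(answer: str) -> int:
    """Count lines that look like bullet points (crude heuristic)."""
    return sum(1 for line in answer.splitlines()
               if line.lstrip().startswith(("-", "*", "•")))

def bullet_bias(answers, judge_scores, human_scores) -> float:
    """Pearson correlation between bullet count and score gap.

    A strongly positive value means the judge rewards bullet-heavy
    answers more than humans do -- a signal to recalibrate its prompt.
    """
    bullets = [bullet_count(a) for a in answers]
    gaps = [j - h for j, h in zip(judge_scores, human_scores)]
    r, _p = pearsonr(bullets, gaps)
    return r
```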
What We Will Publish Next Month
February 2026 – publication of the full study results.
What you will see in the report:
- Global ranking – top 33 models by average score across all scenarios
- Ranking of models available in Russia – which tools are best for those working without a VPN
- Category winners – the best model for data analysis, communication, decision-making, and text work
- Russia Availability Gap – a quantitative assessment of the gap between the best global model and the best model available in Russia
- Price/quality ratio – which models offer the best ROI
- Model reliability – refusal rate for legitimate business tasks
- Human preference analysis – how human ratings correlate with AI judge scores
Why does this matter for the course?
The “Choosing AI Tools” module of our open course will receive concrete data instead of general recommendations. Students will see not “ChatGPT is good for analysis,” but “ChatGPT GPT-4o scored an average of 4.2/5 in the ‘Analytical Depth’ category, YandexGPT 5 Pro – 3.8/5, but is available without a VPN.”
This will change the approach to learning. Instead of abstract advice – a comparison table with specific scenarios. Instead of “try different models” – data: which model is statistically better for which task.
Want to be the first to see the study results?
The open course module contains 12 practical lessons on choosing AI tools. After the results are published in February, you will receive updated materials with real testing data.
How to Participate in the Calibration
The process is simple and takes 15–20 minutes:
- Go to the /evaluate page
- Read a description of a real management scenario (for example, “Preparing feedback for an employee”)
- See a response from one of the AI models (anonymized – you do not know which model)
- Rate the answer on a 0–5 scale with brief explanations (optional)
- Repeat for 5–10 different scenarios
What participation gives you:
- Influence on methodology – your ratings will help calibrate the AI judges
- Early access to results – participants will receive the report 2 weeks before publication
- Understanding your own preferences – you will see which answer styles you value (detailed vs. concise, empathetic vs. analytical)
Important: ratings themselves are anonymous – we record only the score and an optional comment. Contact details, if you choose to leave them, are used solely to send you the report and, at your request, to credit you in it.
Why This Matters for the Industry
Most AI model comparisons focus on benchmark tasks: solving math problems, writing code, answering academic questions. This is measurable and objective, but far from the reality of management.
A manager does not solve mathematical equations. They write feedback, prepare for difficult conversations, analyze team performance, and make decisions under uncertainty. For these tasks, there is no “correct answer” – there are answers that work better in a specific context.
Studies that test AI on real management tasks while accounting for Russian specifics are virtually nonexistent. Most research is conducted in English, in the context of American corporate culture, with a focus on technical tasks. We are filling this gap.
Methodological contribution: using Dual LLM-as-Judge with human calibration on “naive” prompts is an approach that can be scaled. If it proves valid (correlation with humans > 0.60), other researchers can apply it to test new models or other domains.
Practical contribution: concrete recommendations for managers who want to adopt AI but do not know where to start. Not “use ChatGPT,” but “for team analysis, try Claude Opus 4.5 (if you have a VPN) or Yandex Alice (if you work without a VPN) – they showed the best results in this category.”
Conclusions
Choosing an AI tool for a manager is not a technical question but a question of matching task and context. Thirty-three models on the market are not excess but necessary diversity: different budgets, confidentiality requirements, regional availability, and work styles call for different tools.
The problem is that objectively comparing models on “soft” tasks is difficult. An answer to “how to prepare for a meeting with the director” may be good for one manager and useless for another. Automated evaluation through LLM-Judge speeds up the process, but requires calibration with human opinion.
That is why your participation matters. The more people evaluate AI responses, the more accurate the judge calibration, and the more reliable the study results. This is not abstract science – it is data that will change the course content for hundreds of students.
In February, you will see the results. For now, go to /evaluate, rate a few answers, and help make the study more objective.
Have you faced the problem of choosing an AI tool? What criteria matter most to you – price, availability without a VPN, answer quality? You can discuss this in our Telegram channel.



