Tencent improves testing primitive AI models with advanced benchmark - Printable Version

CITYFRENCH (https://cityfrench.fr/forum)
Forum: My Category (https://cityfrench.fr/forum/forumdisplay.php?fid=1)
Forum: Utiles (Règlement, Tuto, autres...) (https://cityfrench.fr/forum/forumdisplay.php?fid=2)
Thread: Tencent improves testing primitive AI models with advanced benchmark (/showthread.php?tid=81)
Tencent improves testing primitive AI models with advanced benchmark - Elmerkem - 08-04-2025

So, how does Tencent's AI benchmark work? First, the AI is given a creative task from a catalogue of roughly 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment. To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands all this evidence – the original request, the AI's code, and the screenshots – to a Multimodal LLM (MLLM) acting as a judge. The MLLM judge doesn't just give a vague overall opinion; it uses a detailed, per-task checklist to score the result across ten different metrics, including functionality, user experience, and even aesthetic quality. This is meant to keep the scoring fair, consistent, and thorough.

The big question is: does this automated judge actually have good taste? The results suggest it does. When the rankings from ArtifactsBench were compared to those from WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched with 94.4% consistency. That is a large jump from older automated benchmarks, which only managed roughly 69.4% consistency. On top of this, the framework's judgments showed over 90% agreement with professional human developers.

https://www.artificialintelligence-news.com/
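To make the pipeline concrete, here is a minimal sketch of how a checklist-based judge and a ranking-consistency check could look. This is purely illustrative: the names (`Submission`, `judge`, `pairwise_consistency`) and the three example metrics are my own assumptions, not Tencent's actual API, and the MLLM call is stubbed out as a plain function argument.

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    """One AI-generated artifact plus the evidence handed to the judge."""
    task_id: str
    code: str
    screenshots: list = field(default_factory=list)  # frames from the sandboxed run

# Three of the ten metrics mentioned in the article (illustrative subset).
CHECKLIST = ["functionality", "user_experience", "aesthetic_quality"]

def judge(submission: Submission, mllm_score) -> float:
    """Score each checklist item via the (stubbed) MLLM judge, then average.

    `mllm_score(submission, metric)` stands in for a real multimodal-model
    call that looks at the request, the code, and the screenshots."""
    scores = {metric: mllm_score(submission, metric) for metric in CHECKLIST}
    return sum(scores.values()) / len(scores)

def pairwise_consistency(rank_a: dict, rank_b: dict) -> float:
    """Fraction of item pairs ordered the same way by both rankings —
    the kind of agreement statistic behind figures like 94.4%."""
    items = list(rank_a)
    agree = total = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            total += 1
            if (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y]):
                agree += 1
    return agree / total

# Demo: compare an automated ranking against a human-vote ranking.
auto_rank = {"model_a": 1, "model_b": 2, "model_c": 3}
human_rank = {"model_a": 1, "model_b": 3, "model_c": 2}
print(f"consistency: {pairwise_consistency(auto_rank, human_rank):.3f}")
```

In this toy demo, two of the three model pairs are ordered the same way, so the consistency comes out at 0.667; a real benchmark would compute this over hundreds of model pairs.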