Evaluating LLMs as Agents

Chinou Gea
2 min read · Aug 11, 2023

-Assesses LLM-as-Agent’s reasoning & decision-making abilities

-Multi-turn open-ended generation settings

-8 distinct environments

-Eval of 25 LLMs

-GPT-4 capable across a wide array of real-world tasks

-API-based models outperform open-sourced ones (see the scoring sketch below)

arxiv.org/abs/2308.03688

THUDM/AgentBench (public repository): code for "AgentBench: Evaluating LLMs as Agents", a comprehensive benchmark to evaluate LLMs as agents; https://github.com/THUDM/AgentBench
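To make the notes above concrete, here is a minimal, hypothetical sketch of how per-environment scores could be aggregated into an overall score per model and compared between the API-based and open-sourced groups. This is not AgentBench's actual scoring code: the function names are placeholders, and the benchmark's real aggregation may weight or normalize environments differently.

```python
# Hypothetical aggregation sketch (not AgentBench's actual scoring code):
# average each model's per-environment scores, then compare the API-based
# group against the open-sourced group. A simple unweighted average is
# assumed here; the benchmark may aggregate differently.
from statistics import mean

def overall_score(per_env_scores: dict[str, float]) -> float:
    """Unweighted average of one model's scores across the 8 environments."""
    return mean(per_env_scores.values())

def group_average(results: dict[str, dict[str, float]], group: set[str]) -> float:
    """Average overall score over the models belonging to one group."""
    return mean(overall_score(results[name]) for name in group)

# Usage sketch: `results` would map model name -> {environment: score}, e.g.
# results = {"api-model-a": {...}, "open-model-b": {...}, ...}
# api_avg  = group_average(results, {"api-model-a"})
# open_avg = group_average(results, {"open-model-b"})
```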

Credit to the authors. Share & Translate: Chinou Gea (秦陇纪); 2023 @ DSS-SDC, IFS-AHSC. #DataScience #DataSimp #algorithms #ArtificialIntelligence #AI #MachineLearning #DeepLearning #NaturalLanguageProcessing

arXiv:2308.03688 (cs)

[Submitted on 7 Aug 2023]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors. It also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
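To illustrate the "multi-turn open-ended generation setting" described in the abstract, here is a minimal Python sketch of the agent-environment loop such an evaluation implies. It is not the actual AgentBench API; TextEnvironment, StepResult, call_llm, and run_episode are hypothetical names. The key point is that the model sees the whole dialogue history each turn, and its textual output is executed as an action in the environment until the episode ends or the turn budget runs out.

```python
# A minimal sketch of the multi-turn agent-environment loop the abstract
# describes. This is NOT the actual AgentBench interface; the class and
# function names below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: str       # textual feedback returned by the environment
    done: bool             # whether the episode has terminated
    success: bool = False  # whether the task was solved

class TextEnvironment:
    """Stand-in for one of the benchmark's interactive environments."""
    def reset(self) -> str:
        """Return the initial task instruction."""
        raise NotImplementedError
    def step(self, action: str) -> StepResult:
        """Apply the agent's textual action and return the outcome."""
        raise NotImplementedError

def call_llm(messages: list[dict[str, str]]) -> str:
    """Placeholder for a call to an API-based or open-sourced chat model."""
    raise NotImplementedError

def run_episode(env: TextEnvironment, max_turns: int = 20) -> bool:
    """Run one multi-turn, open-ended episode and report task success."""
    messages = [{"role": "user", "content": env.reset()}]
    for _ in range(max_turns):
        action = call_llm(messages)  # open-ended generation: the next action
        messages.append({"role": "assistant", "content": action})
        result = env.step(action)
        if result.done:
            return result.success
        messages.append({"role": "user", "content": result.observation})
    return False  # turn budget exhausted without solving the task
```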

Submission history

From: Xiao Liu

[v1] Mon, 7 Aug 2023 16:08:11 UTC (20,331 KB)

Comments: 38 pages

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2308.03688 [cs.AI] (or arXiv:2308.03688v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2308.03688

--

Chinou Gea

Chinou Gea Studio -- open academic research and sharing in information and data specialties by Chinou Gea; also follow me at www.facebook.com/aaron.gecai