ToolLLM = LLM + Tool Use: An Advanced Way to Use Large Models

By AiBard123
August 14, 2023 - 2 min read

Author: AINLP. Source: AINLP

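Outline

1 Introduction

2 Background

3 Building ToolBench

3.1 Collecting APIs and tools

3.2 Instruction generation

3.3 Solution path generation

4 Experiments

4.1 ToolEval

4.2 Experimental setup

4.3 Experimental findings

5 Summary

References

Figure 1: ToolLLM paper information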

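#### 1 Introduction

Open-source large models are iterating at a remarkable pace, and their performance on basic language tasks has improved markedly. Their ability to carry out higher-level tasks, however, still falls clearly short, for example following human instructions to call external tools. The reason is that instruction tuning for open-source models has focused largely on basic language tasks rather than on tool use. By contrast, the models currently regarded as leading (such as ChatGPT and GPT-4) are closed-source but have excellent tool-use capabilities. To help open-source models build this capability, the researchers propose ToolLLM, a general tool-use framework that comprises constructing the dataset ToolBench, designing the automatic evaluation scheme ToolEval, and training a language model, ToolLLaMA, whose tool-use performance is comparable to ChatGPT's.

Figure 2: How ToolBench is built, how the two models are trained, and how inference works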

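#### 2 Background

Tool learning aims to unlock the power of large language models by having them interact effectively with many APIs to complete complex tasks. There is already some work in this direction, but it still cannot fully elicit the tool-use capabilities of LLMs, owing to the following limitations.

a) Limited APIs. Existing work does not cover real-world APIs, involves only a small number of APIs, or lacks diversity.

b) Limited scenarios. Existing work restricts each instruction to a single tool, whereas a complex instruction in a real scenario needs several tools working together. Many setups also specify the relevant APIs for an instruction in advance, but in a real scenario deciding which API to call is part of the problem.

c) Limited planning and reasoning. Most existing work relies on simple prompt engineering for inference, which struggles to elicit the LLM's own capabilities and tends to fail on complex instructions.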

#### 3 Building ToolBench

ToolLLM collects a high-quality instruction-tuning dataset, ToolBench, built as follows.

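#### 3.1 Collecting APIs and tools

RapidAPI Hub. RapidAPI is an API marketplace that hosts thousands of real-world APIs covering a wide variety of services and data sources. Every API is classified in two ways: coarse-grained categories, 49 in total (for example sports, finance, and weather), and finer-grained collections (for example Chinese APIs and dataset APIs). Above individual APIs, RapidAPI also has higher-level tools, each of which may contain several APIs. For each tool, the researchers crawl its name, description, URL, and the information for all of its APIs. For each API, they record its name, description, access method, required parameters, optional parameters, request format, execution status code, and an example response from a successful call.

The researchers collected information on 10,853 tools and 53,190 APIs from RapidAPI. Because the quality and reliability of these APIs vary widely, they need to be filtered. Each API first goes through a basic test and the unusable ones are removed; the remaining APIs are then called to obtain actual responses, which are used to judge response time and result quality. In the end only 3,451 high-quality tools and 16,464 APIs are kept.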
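The filtering step can be pictured with a small sketch. This is only an illustration under assumptions: the `ProbeResult` record, the 5-second latency threshold, and the "non-empty body" quality check are hypothetical stand-ins, since the article does not give the exact criteria.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one test call to an API (hypothetical record layout)."""
    api_name: str
    status_code: int   # status code returned by the test call
    latency_s: float   # how long the call took, in seconds
    body: str          # raw response text


def keep_api(probe: ProbeResult, max_latency_s: float = 5.0) -> bool:
    """Return True if the API looks usable. The thresholds are illustrative."""
    if probe.status_code != 200:         # basic test: drop APIs that do not work
        return False
    if probe.latency_s > max_latency_s:  # drop APIs that respond too slowly
        return False
    return bool(probe.body.strip())      # crude quality check: drop empty responses


def filter_apis(probes: list[ProbeResult]) -> list[str]:
    """Keep only the names of APIs that pass every check."""
    return [p.api_name for p in probes if keep_api(p)]


probes = [
    ProbeResult("get_weather", 200, 0.4, '{"temp": 21}'),
    ProbeResult("dead_endpoint", 404, 0.1, "not found"),
    ProbeResult("slow_quotes", 200, 12.0, '{"quote": "..."}'),
    ProbeResult("empty_reply", 200, 0.2, "   "),
]
print(filter_apis(probes))  # ['get_weather']
```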

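In addition, an API response can be so verbose that it exceeds the LLM's input length limit, so it needs to be compressed. Since each API has a fixed output format, when a response is longer than 2048 tokens the researchers put the tool's information and 3 compression examples into a prompt and have ChatGPT compress the response. If the compressed result still exceeds the limit, only the first 2048 tokens are kept.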
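A minimal sketch of this compress-then-truncate logic follows. The whitespace tokenizer and the `fake_compress` function are assumptions made for illustration; in the paper's pipeline the compression is done by ChatGPT with a real tokenizer. The sketch also assumes that truncation is applied to the compressed text, which is one reading of the description.

```python
from typing import Callable

MAX_TOKENS = 2048  # length limit stated in the article


def count_tokens(text: str) -> int:
    """Whitespace tokenization, a stand-in for a real tokenizer."""
    return len(text.split())


def truncate(text: str, limit: int = MAX_TOKENS) -> str:
    """Keep only the first `limit` tokens."""
    return " ".join(text.split()[:limit])


def shrink_response(response: str,
                    compress: Callable[[str], str],
                    limit: int = MAX_TOKENS) -> str:
    """Return the response unchanged if short enough, else compress, else truncate."""
    if count_tokens(response) <= limit:
        return response
    compressed = compress(response)  # in the paper this is a ChatGPT call
    if count_tokens(compressed) <= limit:
        return compressed
    # Assumption: the fallback truncates the compressed text, not the original.
    return truncate(compressed, limit)


def fake_compress(text: str) -> str:
    """Hypothetical compressor that simply drops every other token."""
    return " ".join(text.split()[::2])


long_response = " ".join(["w"] * 6000)
print(count_tokens(shrink_response(long_response, fake_compress)))  # 2048
```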

Figure 3: API information collected in ToolBench compared with other datasets

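#### 3.2 Instruction generation

High-quality instructions have to meet two requirements. The first is diversity, so that the LLM can handle all kinds of API usage scenarios, which improves generalization and robustness. The second is multi-tool usage, which mimics real-world scenarios and improves practicality and flexibility.

The steps above yield an API set S_API. Each time, a subset of APIs is sampled from it and ChatGPT is asked to generate new instructions together with the APIs each instruction relates to. (An instruction and its APIs correspond to each other: completing the instruction requires calling these APIs in turn, and all of them must belong to the sampled subset.) This produces instruction-API pairs. The input given to ChatGPT has 3 parts, listed as a) to c) below.

Figure 4: API description and the instruction generation process

a) The instruction generation prompt, which asks the language model to generate several new instructions and their corresponding APIs as specified. There are three types depending on the scenario: single-tool instructions (I1), where each instruction involves a single tool; intra-category multi-tool instructions (I2), where each instruction involves APIs from several tools in the same category; and intra-collection multi-tool instructions (I3), where each instruction involves APIs from several tools in the same collection. (The prompts are reproduced below.)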

#I1 instruction prompt
You will be provided with a tool, its description, all of the tool’s available API functions, the descriptions of these API functions, and the parameters required for each API function. Your task involves creating 10 varied, innovative, and detailed user queries that employ multiple API functions of a tool. For instance, if the tool ‘climate news’ has three API calls - ‘get all climate change news’,
‘look up climate today’, and ‘historical climate’, your query should articulate something akin to: first, determine today’s weather, then verify how often it rains in Ohio in September, and finally, find news about climate change to help me understand whether the climate will change anytime soon. This query exemplifies how to utilize all API calls of ‘climate news’. A query that only uses one API call will not be accepted. Additionally, you must incorporate the input parameters required for each API call. To achieve this, generate random information for required parameters such as IP address, location, coordinates, etc. For instance, don’t merely say ‘an address’, provide the exact road and district names. Don’t just mention ‘a product’, specify wearables, milk, a blue blanket, a pan, etc. Don’t refer to ‘my company’, invent a company name instead. The first seven of the ten queries should be very specific. Each single query should combine all API call usages in different ways and include the necessary parameters. Note that you shouldn’t ask ‘which API to use’, rather, simply state your needs that can be addressed by these APIs. You should also avoid asking for the input parameters required by the API call, but instead directly provide the parameter in your query. The final three queries should be complex and lengthy, describing a complicated scenario where all the API calls can be utilized to provide assistance within a single query. You should first think about possible related API combinations, then give your query. Related apis are apis that can be used for a give query; those related apis have to strictly come from the provided api names. For each query, there should be multiple related apis; for different queries, overlap of related apis should be as little as possible. 
Deliver your response in this format: [Query1: ......, ‘related apis’:[api1, api2, api3...],Query2: ......, ‘related apis’:[api4, api5, api6...],Query3: ......, ‘related apis’:[api1, api7, api9...], ...]
#I2 and I3 instruction prompt
You will be provided with several tools, tool descriptions, all of each tool’s available API functions, the descriptions of these API functions, and the parameters required for each API function. Your task involves creating 10 varied, innovative, and detailed user queries that employ API functions of multiple tools. For instance, given three tools ‘nba news’, ‘cat-facts’, and ‘hotels’: ‘nba news’ has API functions ‘Get individual NBA source news’ and ‘Get all NBA news’, ‘cat-facts’ has API functions ‘Get all facts about cats’ and ‘Get a random fact about cats’, ‘hotels’ has API functions
‘properties/get-details (Deprecated)’, ‘properties/list (Deprecated)’ and ‘locations/v3/search’. Your query should articulate something akin to: ‘I want to name my newborn cat after Kobe and host a party to celebrate its birth. Get me some cat facts and NBA news to gather inspirations for the cat name. Also, find a proper hotel around my house in Houston Downtown for the party.’ This query exemplifies how to utilize API calls of all the given tools. A query that uses API calls of only one tool will not be accepted. Additionally, you must incorporate the input parameters required for each API call. To achieve this, generate random information for required parameters such as IP address, location, coordinates, etc. For instance, don’t merely say ‘an address’, provide the exact road and district names. Don’t just mention ‘a product’, specify wearables, milk, a blue blanket, a pan, etc. Don’t refer to ‘my company’, invent a company name instead. The first seven of the ten queries should be very specific. Each single query should combine API calls of different tools in various ways and include the necessary parameters. Note that you shouldn’t ask ‘which API to use’, rather, simply state your needs that can be addressed by these APIs. You should also avoid asking for the input parameters required by the API call, but instead directly provide the parameters in your query. The final three queries should be complex and lengthy, describing a complicated scenario where all the provided API calls can be utilized to provide assistance within a single query. You should first think about possible related API combinations, then give your query. Related APIs are APIs that can be used for a given query; those related APIs have to strictly come from the provided API names. For each query, there should be multiple related APIs; for different queries, overlap of related APIs should be as little as possible. 
Deliver your response in this format: [Query1: ......, ‘related apis’:[[tool name, api name], [tool name, api name], [tool name, api name]...],Query2: ......, ‘related apis’:[[tool name, api name], [tool name, api name], [tool name, api name]...],Query3: ......, ‘related apis’:[[tool name, api name], [tool name, api name], [tool name, api name]...], ...]

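b) Information about the sampled tools and their APIs, which helps the language model understand what these tools and APIs do and how they relate. The APIs that ChatGPT outputs are taken from here, and the generated results must stay within this set.

c) 3 in-context examples, all written by human experts. There are 12 examples for the single-tool scenario and 36 for the multi-tool scenario, and 3 are sampled at random each time.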

For example, with tool ASCII Art, the given api names are ‘figlet’, ‘list figlet styles’, ‘cowsay’, ‘list cowsay styles’, ‘matheq’.
Some sample queries and related apis would be:
“Query”: “Need to create an ASCII art representation of a mathematical equation. The equation
is ‘y = mx + c’, where m and c are constants. Help me generate the ASCII art for this equation. Also please generate an ASCII art representation of the text ‘Newton’s Second Law of Motion’.”, “related apis”: [’figlet’, ‘list figlet styles’, ‘matheq’]
“Query”: “Working on a research paper on cows and need to include ASCII art representations of various cows. Can you first retrieve available ASCII art styles for cows? Then, can you generate ASCII art for cows like the Jersey, Holstein, and Guernsey? Finally, I want the cow to say ‘Moo!’ in
the ASCII art.”, “related apis”: [’figlet’, ‘list figlet styles’, ‘cowsay’, ‘list cowsay styles’]
“Query”: “I’m writing a blog post on ASCII art and need to include some examples. Can you generate ASCII art for the following strings: ‘ASCII’, ‘art’, and ‘gallery’? You can first retrieve available
figlet styles and then generate ASCII art for the strings using the styles.”, “related apis”: [’figlet’, ‘list figlet styles’]
“Query”: “Greetings! I’m putting together a quirky slideshow about our furry friends and need your
help to sprinkle some ASCII art goodness. Could you kindly fetch me the catalog of ASCII art styles available for animals? Also, I’m particularly keen on featuring ASCII art for creatures like pandas, cows, elephants, and penguins. And if they could say something cute like ‘Hello!’ or ‘Hugs!’ in the ASCII art, that would be purr-fect!”, “related apis”: [’figlet’, ‘list figlet styles’, ‘cowsay’, ‘list cowsay styles’]
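The three inputs a) to c) can be assembled roughly as follows. This is a sketch under assumptions: the function name, the prompt layout, and the API descriptions are hypothetical, and only the API names and the "sample 3 examples" rule come from the text above.

```python
import random


def build_generation_input(task_prompt, api_pool, seed_examples,
                           n_apis=3, n_examples=3, rng=None):
    """Assemble the text sent to the generator model (layout is illustrative)."""
    rng = rng or random.Random()
    sampled_apis = rng.sample(api_pool, n_apis)       # sample APIs from the pool
    examples = rng.sample(seed_examples, n_examples)  # c) 3 in-context examples
    api_docs = "\n".join(                             # b) documentation of sampled APIs
        f"- {api['tool']} / {api['name']}: {api['description']}"
        for api in sampled_apis
    )
    prompt = (
        f"{task_prompt}\n\n"                          # a) the generation prompt
        f"Available APIs:\n{api_docs}\n\n"
        "Examples:\n" + "\n".join(examples)
    )
    return prompt, sampled_apis


# API names come from the ASCII Art example; descriptions are made up.
api_pool = [
    {"tool": "ASCII Art", "name": "figlet", "description": "render text as ASCII art"},
    {"tool": "ASCII Art", "name": "list figlet styles", "description": "list figlet styles"},
    {"tool": "ASCII Art", "name": "cowsay", "description": "draw a talking cow"},
    {"tool": "ASCII Art", "name": "list cowsay styles", "description": "list cowsay styles"},
    {"tool": "ASCII Art", "name": "matheq", "description": "render an equation"},
]
seed_examples = [f"Query {i}: ..." for i in range(12)]  # placeholders for expert-written examples

prompt, sampled = build_generation_input(
    "Create 10 varied user queries.", api_pool, seed_examples, rng=random.Random(0)
)
print(len(sampled))  # 3
```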

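After ChatGPT generates the instructions, further filtering leaves nearly 200,000 high-quality (Instruction, APIs) pairs (197,479 in total), of which 87,413, 84,815, and 25,251 belong to I1, I2, and I3 respectively. This data can be used to train an API retrieval module that takes an instruction as input and returns the relevant APIs.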
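The article does not spell out the filtering rule. One check that is consistent with input b) is to discard any generated pair whose related APIs fall outside the sampled set; the sketch below assumes exactly that, and also adds up the three reported counts.

```python
def is_valid_pair(related_apis, sampled_apis):
    """Keep a pair only if it names at least one API and all of them were sampled."""
    return len(related_apis) > 0 and set(related_apis) <= set(sampled_apis)


sampled_apis = ["figlet", "list figlet styles", "matheq"]
generated = [
    {"query": "Render 'y = mx + c' ...", "related_apis": ["figlet", "matheq"]},
    {"query": "Make a cow say moo ...", "related_apis": ["cowsay"]},  # not in the sampled set
    {"query": "Just chat with me", "related_apis": []},               # names no API
]
kept = [g for g in generated if is_valid_pair(g["related_apis"], sampled_apis)]
print(len(kept))  # 1

# Reported sizes of the filtered I1, I2, and I3 subsets.
counts = {"I1": 87413, "I2": 84815, "I3": 25251}
print(sum(counts.values()))  # 197479
```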

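#### 3.3 Solution path generation

Given a task instruction, ChatGPT is used to search for a valid action sequence `{a_1, …, a_N}`. This multi-step decision process can be cast as a multi-turn conversation with ChatGPT. At each step t, the model generates a new action based on the previous actions a and observations r, that is, `ChatGPT(a_t | {a_1, r_1, …, a_{t-1}, r_{t-1}}, Instruction)`. An action has a strict format with three parts: the Thought (the model's reasoning), the name of the API to call, and the parameters for that API. The generated action can therefore be used to call the corresponding API and obtain the response `r_t`.

For each instruction, a set of APIs is obtained by sampling, and the information about these APIs is fed to ChatGPT, which selectively calls them as the instruction requires and thereby searches for a path that solves the problem. Besides the collected APIs, two extra API functions are defined: Finish with Final Answer and Finish by Giving up. The former indicates that the instruction has been completed and carries a parameter holding the final answer. The latter indicates that, on the current search path, the instruction can no longer be completed with the provided APIs.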
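The action format and the multi-turn loop can be sketched as below. Everything here is illustrative: `scripted_policy` stands in for ChatGPT, `fake_call_api` stands in for real API calls, and the `final_answer` parameter name is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    thought: str     # the model's reasoning
    api_name: str    # which API to call
    arguments: dict  # parameters for that API


History = list[tuple[Action, str]]  # (action a_t, observation r_t) pairs


def rollout(instruction: str,
            policy: Callable[[str, History], Action],
            call_api: Callable[[str, dict], str],
            max_steps: int = 5) -> History:
    """Generate actions one step at a time, each conditioned on the history so far."""
    history: History = []
    for _ in range(max_steps):
        action = policy(instruction, history)  # a_t given earlier (a, r) pairs
        observation = call_api(action.api_name, action.arguments)  # r_t
        history.append((action, observation))
        if action.api_name.startswith("Finish"):  # either of the two Finish functions
            break
    return history


def scripted_policy(instruction: str, history: History) -> Action:
    """Hypothetical stand-in for the language model."""
    if not history:
        return Action("I need the weather first.", "get_weather", {"city": "Houston"})
    last_observation = history[-1][1]
    return Action("I have what I need.", "Finish with Final Answer",
                  {"final_answer": last_observation})


def fake_call_api(name: str, arguments: dict) -> str:
    """Hypothetical stand-in for real API calls."""
    if name == "get_weather":
        return f"Sunny in {arguments['city']}"
    return "done"


trace = rollout("What is the weather in Houston?", scripted_policy, fake_call_api)
print([action.api_name for action, _ in trace])
# ['get_weather', 'Finish with Final Answer']
```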

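In preliminary studies, the researchers found that conventional CoT or ReACT rarely finds a valid solution path for a given instruction (because of error propagation and limited exploration), which makes building the dataset difficult. They therefore propose a decision algorithm, DFSDT, which lets the language model evaluate several different reasoning paths and either continue along the more promising one or abandon nodes that cannot succeed (the idea is close to Tree of Thoughts). This approach produced a dataset of 12,657 (Instruction, solution) pairs.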
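To make the contrast with a single-path method concrete, here is a generic depth-first search with backtracking over a toy decision tree. It is not the authors' DFSDT implementation: in the real algorithm the language model proposes the next actions and decides when to give up, whereas here a hand-written dictionary (`TREE`, hypothetical) plays that role.

```python
FINISH = "Finish with Final Answer"

# Toy decision tree: maps the path taken so far to the possible next actions.
TREE = {
    (): ["list cowsay styles", "list figlet styles"],
    ("list cowsay styles",): ["matheq"],
    ("list cowsay styles", "matheq"): [],  # dead end
    ("list figlet styles",): ["figlet"],
    ("list figlet styles", "figlet"): [FINISH],
}


def expand(path):
    """Candidate next actions for a path (an LLM would propose these)."""
    return TREE.get(tuple(path), [])


def is_solved(path):
    return bool(path) and path[-1] == FINISH


def single_path(max_depth=4):
    """Follow only the first candidate at each step, with no backtracking."""
    path = []
    while not is_solved(path) and len(path) < max_depth:
        candidates = expand(path)
        if not candidates:
            return None  # stuck, and no way to recover
        path.append(candidates[0])
    return path if is_solved(path) else None


def dfs_search(path, max_depth=4):
    """Depth-first search: abandon failed branches and try the next sibling."""
    if is_solved(path):
        return path
    if len(path) >= max_depth:
        return None
    for action in expand(path):
        result = dfs_search(path + [action], max_depth)
        if result is not None:
            return result
    return None  # every child failed, so backtrack


print(single_path())   # None
print(dfs_search([]))  # ['list figlet styles', 'figlet', 'Finish with Final Answer']
```

The single-path walk commits to its first choice and gets stuck, while the depth-first search backs out of the dead end and finds the other branch.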

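Figure 5: DFSDT compared with other methods

#### 4 Experiments

#### 4.1 ToolEval

With the dataset in place, one can evaluate how well different language models follow instructions and call external tools. Because human evaluation is time-consuming and labor-intensive, the researchers propose an efficient automatic evaluator, ToolEval, which mainly consists of the following 2 metrics.

a) Pass rate. The proportion of instructions completed successfully within a limited number of actions. It measures how well a language model executes instructions and is regarded as a basic requirement for tool use. It can be determined automatically by checking whether the final node of the solution path calls the Finish with Final Answer API.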
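As described, pass rate reduces to checking the last call of each solution path. A minimal sketch, assuming each path is represented as a list of API names (a simplification chosen for illustration):

```python
FINISH_OK = "Finish with Final Answer"


def passed(solution_path: list[str]) -> bool:
    """A path passes if its final node calls Finish with Final Answer."""
    return bool(solution_path) and solution_path[-1] == FINISH_OK


def pass_rate(solution_paths: list[list[str]]) -> float:
    """Fraction of instructions whose solution path passes."""
    if not solution_paths:
        return 0.0
    return sum(passed(p) for p in solution_paths) / len(solution_paths)


paths = [
    ["get_weather", FINISH_OK],
    ["figlet", "matheq", FINISH_OK],
    ["cowsay", "Finish by Giving up"],  # gave up
    [FINISH_OK],
]
print(pass_rate(paths))  # 0.75
```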

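b) Win rate. Given the same instruction, compares the quality of two different solution paths. This metric is computed by calling ChatGPT. The paper also compares ChatGPT's judgments with human annotations and reports a high correlation of 75.8% between the two.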
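Win rate can be sketched as the fraction of instructions on which a judge prefers the candidate path over a reference path. The judge below is a hypothetical placeholder that prefers shorter paths; in ToolEval the judgment comes from ChatGPT.

```python
from typing import Callable

Judge = Callable[[str, list[str], list[str]], bool]  # True if the candidate wins


def win_rate(instructions: list[str],
             candidates: list[list[str]],
             references: list[list[str]],
             judge: Judge) -> float:
    """Fraction of instructions where the judge prefers the candidate path."""
    if not instructions:
        return 0.0
    wins = sum(
        1
        for inst, cand, ref in zip(instructions, candidates, references)
        if judge(inst, cand, ref)
    )
    return wins / len(instructions)


def shorter_path_judge(instruction, candidate, reference) -> bool:
    """Hypothetical judge: the shorter path wins. A stand-in for ChatGPT."""
    return len(candidate) < len(reference)


instructions = ["q1", "q2", "q3", "q4"]
candidates = [["a", "f"], ["a", "b", "f"], ["a", "f"], ["a", "b", "c", "f"]]
references = [["a", "b", "f"], ["a", "b", "f"], ["a", "b", "c", "f"], ["a", "f"]]
print(win_rate(instructions, candidates, references, shorter_path_judge))  # 0.5
```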

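#### 4.2 Experimental setup

With the dataset built and the evaluation defined, the researchers use the instruction generation dataset to train an API retriever with a dual-tower architecture on top of Bert-Base, which retrieves suitable APIs for a given instruction. They also fine-tune LLaMA 7B on the solution path dataset to obtain the language model ToolLLaMA, extending the context length limit from 2048 to 8192 in the process.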
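The dual-tower idea is that the instruction and each API document are embedded separately and then ranked by similarity. The sketch below keeps that structure but swaps in a bag-of-words vector for the Bert-Base encoders, purely so that it runs without any model; the API descriptions are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Bag-of-words vector, a stand-in for a BERT encoder tower."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def retrieve(instruction: str, api_docs: dict[str, str], k: int = 2) -> list[str]:
    """Rank APIs by similarity between the instruction and each API document."""
    query_vec = embed(instruction)  # instruction tower
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(query_vec, embed(api_docs[name])),  # API tower
        reverse=True,
    )
    return ranked[:k]


api_docs = {
    "figlet": "generate ascii art text from a string",
    "cowsay": "generate ascii art of a cow saying a message",
    "get_weather": "current weather forecast for a city",
}
print(retrieve("what is the weather forecast for Houston", api_docs, k=1))
# ['get_weather']
```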

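ToolLLaMA should generalize well enough to perform well on new instructions and new APIs. The researchers therefore evaluate generalization in three scenarios.

a) Inst. New instructions for tools that are already in the training set: the tools have been seen during training, but the instruction has not.

b) Tool. New tools that belong to a category already present in the training set: other tools from the category have been seen during training, but these tools have not.

c) Cat. New tools that belong to new categories absent from the training set: neither these tools nor any other tool from their categories has been seen during training.

Since the instruction generation dataset also comes in three types, there are 6 experimental settings in total, and the model's tool-use ability is evaluated under each of them. (The combinations would seem to number 3*3=9, but some of them do not make sense, which leaves 6.)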

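#### 4.3 Experimental findings

a) The trained API Retriever clearly outperforms baselines based on sparse retrieval and dense retrieval. The results also show that single-tool instructions (I1) are easier than multi-tool instructions (I2, I3).

Figure 6: API Retriever performance

b) ToolLLaMA's tool-use ability is clearly better than that of the other conventional methods. The instruction-following ability of models fine-tuned on instruction datasets (Vicuna, Alpaca) does not transfer to tool-use scenarios. This also suggests that ToolBench effectively elicits a language model's tool-use ability, enabling it to handle all kinds of APIs (even new ones) to complete different instructions.

Figure 7: ToolLLaMA performance

c) In practice, using the API Retriever to supply suitable API candidates for the model to choose from also works well. In addition, the DFSDT decision strategy is clearly better than ReACT.

Figure 8: Further analysis of ToolLLaMA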

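#### 5 Summary

The paper covers a lot of ground and is easy to get lost in without a few readings, so here is a brief recap. The main contributions are:

a) A high-quality instruction-tuning dataset for tool use, ToolBench, which includes a collection of APIs and tools, an (Instruction, APIs) dataset, and an (Instruction, solution) dataset.

b) An automatic evaluation tool, ToolEval.

c) A trained API Retriever and ToolLLaMA, along with several analysis findings.

Tool use is a more advanced capability of language models. At present only the models released by OpenAI do it reasonably well, while open-source large models are mostly optimized as basic language models and their tool-use performance is hard to speak well of. The dataset released here is therefore valuable for improving open-source models in this direction. Once the basic abilities of open-source models reach a certain level, they will surely push into more advanced tasks as well.

On a personal note, quite a few institutions are involved in this work, and the readability of the released code leaves something to be desired. I found the source fairly hard going, and given limited time I only skimmed it to better understand the methods described in the paper. If I have misread anything, corrections are welcome.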

References

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

https://arxiv.org/pdf/2307.16789.pdf

https://github.com/OpenBMB/ToolBench
