How to Configure GraphRAG to Process CSV Files

By AiBard123
July 30, 2024 - 2 min read

Author: 深入LLM Agent应用开发. Source: 深入LLM Agent应用开发.

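Readers in the group often ask how GraphRAG handles CSV files. If you only fill in the generated settings.yaml template, indexing will not succeed. For example, a configuration like this is not enough: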

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv"

Why not? Let's take a closer look.

1. Configuring CSV file input

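GraphRAG's index input code lives in graphrag/index/config/input.py. It currently supports loading CSV files and plain-text (txt) files, so if you want something like PDF loading, that is where the corresponding code would have to be added. Back to the topic: let's look at the code in csv.py.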

async def load_file(path: str, group: dict | None) -> pd.DataFrame:
    ...
    # Generate an "id" column from an MD5 hash of each row if the file has none
    if "id" not in data.columns:
        data["id"] = data.apply(lambda x: gen_md5_hash(x, x.keys()), axis=1)
    # Copy the configured source_column into a "source" column
    if csv_config.source_column is not None and "source" not in data.columns:
        ...
        else:
            data["source"] = data.apply(
                lambda x: x[csv_config.source_column], axis=1
            )
    # Copy the configured text_column into a "text" column
    if csv_config.text_column is not None and "text" not in data.columns:
        ...
        else:
            data["text"] = data.apply(lambda x: x[csv_config.text_column], axis=1)
    # Copy the configured title_column into a "title" column
    if csv_config.title_column is not None and "title" not in data.columns:
        ...
            data["title"] = data.apply(lambda x: x[csv_config.title_column], axis=1)
    # Parse the configured timestamp_column into a "timestamp" column
    if csv_config.timestamp_column is not None:
        ...
        else:
            data["timestamp"] = pd.to_datetime(
                data[csv_config.timestamp_column], format=fmt
            )
    return data
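To make the column mapping concrete, here is a minimal standard-library sketch that mimics the logic on plain dicts instead of a pandas DataFrame. It is not GraphRAG's code: `md5_of_row` is a hypothetical stand-in for `gen_md5_hash`, the sample data is invented, and silently skipping a configured column that is missing from the file is my assumption, since that branch is elided in the excerpt.

```python
import csv
import hashlib
import io

# Hypothetical sample data; the column names "Source" and "Text" are assumptions.
SAMPLE_CSV = """Source,Text
wiki,GraphRAG builds a knowledge graph from text.
blog,CSV input needs a text column.
"""


def md5_of_row(row: dict) -> str:
    """Hypothetical stand-in for gen_md5_hash: hash all column values of a row."""
    joined = "".join(str(row[key]) for key in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def map_columns(rows, source_column=None, text_column=None, title_column=None):
    """Copy the configured columns into the standard id/source/text/title fields."""
    mapped = []
    for row in rows:
        row = dict(row)
        if "id" not in row:
            row["id"] = md5_of_row(row)
        for target, configured in (
            ("source", source_column),
            ("text", text_column),
            ("title", title_column),
        ):
            # Fill the target only when an option is configured, the target is
            # absent, and the configured column exists (assumption: skip otherwise).
            if configured is not None and target not in row and configured in row:
                row[target] = row[configured]
        mapped.append(row)
    return mapped


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
with_mapping = map_columns(
    rows, source_column="Source", text_column="Text", title_column="Source"
)
without_mapping = map_columns(rows)

print(with_mapping[0]["text"])       # the text copied from the "Text" column
print("text" in without_mapping[0])  # False: nothing was mapped to "text"
```

With no column options set, the rows never get a `text` field, which matches the behavior of the template-only configuration shown at the start.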

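So to process a CSV file, we need to tell GraphRAG through configuration which columns hold the text, title, source, and timestamp. Alternatively, you can edit the CSV file itself so that it already contains columns with those names. If we go the configuration route, which options are available?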

type: The type of input to use. Options are file or blob.  
file_type: The file type field discriminates between the different input types. Options are csv and text.  
base_dir: The base directory to read the input files from. This is relative to the config file.  
file_pattern: A regex to match the input files. The regex must have named groups for each of the fields in the file_filter.  
post_process: A DataShaper workflow definition to apply to the input before executing the primary workflow.  
source_column (type: csv only): The column containing the source/author of the data  
text_column (type: csv only): The column containing the text of the data  
timestamp_column (type: csv only): The column containing the timestamp of the data  
timestamp_format (type: csv only): The format of the timestamp
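As a quick check of the file_pattern option, the sketch below tests a pattern against a few file names with Python's `re` module. The file names and the named groups `source` and `year` are purely illustrative, and matching with `re.match` is my assumption about how such a pattern would be applied. Note that the YAML double-quoted string ".*\\.csv" denotes the regex `.*\.csv`.

```python
import re

# The YAML value ".*\\.csv" unescapes to the regex .*\.csv
plain_pattern = re.compile(r".*\.csv")

# Hypothetical pattern with named groups; "source" and "year" are illustrative
# field names, not fields that GraphRAG requires.
grouped_pattern = re.compile(r"(?P<source>[^/]+)/(?P<year>\d{4})-.*\.csv")

candidates = ["notes.csv", "notes.txt", "news/2024-articles.csv"]

# Keep only the names the plain pattern accepts.
matched = [name for name in candidates if plain_pattern.match(name)]
print(matched)  # ['notes.csv', 'news/2024-articles.csv']

# Named groups become key/value pairs that a file filter could use.
m = grouped_pattern.match("news/2024-articles.csv")
fields = m.groupdict() if m else {}
print(fields)  # {'source': 'news', 'year': '2024'}
```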

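If you need a timestamp column, you must also set the timestamp_format option so that GraphRAG knows how to parse it; the parsing code is shown above. So for a CSV file that has Source and Text columns, the following configuration is all we need: the text column is Text, the source is the Source column, and the title column is also Source.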

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv"
  source_column: Source
  text_column: Text
  title_column: Source
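Before kicking off a long indexing run, it can help to sanity-check the CSV against the input settings. The sketch below is a hypothetical pre-flight check, not part of GraphRAG: `check_csv_input`, the sample data, and the `Date` column are all invented for illustration. It uses `datetime.strptime` as a rough stand-in for the `pd.to_datetime(..., format=fmt)` call in the loader, and it assumes that a file with neither a text_column setting nor a literal `text` column cannot be indexed.

```python
import csv
import io
from datetime import datetime


def check_csv_input(header, first_row, config):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for option in ("source_column", "text_column", "title_column", "timestamp_column"):
        column = config.get(option)
        if column is not None and column not in header:
            problems.append(f"{option}={column!r} is not a column in the CSV header")
    # Assumption: without a text_column mapping, the file needs its own "text" column.
    if config.get("text_column") is None and "text" not in header:
        problems.append("no text_column configured and no 'text' column in the CSV")
    ts_column = config.get("timestamp_column")
    if ts_column is not None:
        fmt = config.get("timestamp_format")
        if fmt is None:
            problems.append("timestamp_column is set but timestamp_format is missing")
        elif ts_column in first_row:
            try:
                datetime.strptime(first_row[ts_column], fmt)
            except ValueError:
                problems.append(
                    f"timestamp_format {fmt!r} does not parse {first_row[ts_column]!r}"
                )
    return problems


# Hypothetical sample file with an extra "Date" column.
SAMPLE_CSV = "Source,Text,Date\nwiki,GraphRAG builds a graph.,2024-07-30\n"
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
first_row = next(reader)
header = reader.fieldnames

good = {"source_column": "Source", "text_column": "Text", "title_column": "Source"}
template_only = {}
missing_format = dict(good, timestamp_column="Date")
with_format = dict(missing_format, timestamp_format="%Y-%m-%d")

print(check_csv_input(header, first_row, good))            # []
print(check_csv_input(header, first_row, template_only))   # one problem: no text column
print(check_csv_input(header, first_row, missing_format))  # one problem: format missing
print(check_csv_input(header, first_row, with_format))     # []
```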

2. Start indexing

poetry run poe index --root .

Indexing then runs to completion.

3. Testing

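Now for testing. I recently built a streaming server for GraphRAG and modified parts of the GraphRAG code so that output starts almost immediately. Compared with querying from the command line, where waits of ten seconds or more were common, the improvement is very noticeable.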

Start the web service, then download cherry-studio and configure the API endpoint and model.

python -m uvicorn webserver.main:app --reload --port 20213

4. Summary

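This article showed how to configure CSV file input for GraphRAG and then ran query tests through a self-written web service, which gave a smooth experience. In the next article, I will cover how to implement fast, streaming query responses and how to configure the UI.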

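If you found this useful, please follow, like, and comment. And if you are serious about understanding GraphRAG, I recommend studying the basics of knowledge graphs.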

Reference links:

cherry-studio: https://cherry-ai.com/
