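KCA: Using a Knowledge-Consistency Strategy to Clean Hallucination-Inducing Fine-Tuning Data, Plus Implementation Details of RAGxplorer, an Auxiliary Tool for LLM Question Answering
Author: 老刘说NLP Source: 老刘说NLP
Today is Tuesday, January 23, 2024. Beijing, sunny.
Let's continue with RAG and hallucination. The first topic is RAGxplorer, a visual exploration tool for the RAG retrieval process, and we will look at its implementation details.
As noted in earlier posts, data quality matters a great deal in the fine-tuning stage: if the knowledge in the fine-tuning data is not knowledge the LLM already has, fine-tuning teaches the model to lie.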
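This raises the classic question: how do we find which knowledge in the fine-tuning data the LLM never learned? Some very interesting work has appeared on this, for example reorganizing alignment training data that lies beyond the model's capability boundary. So we will also look at another interesting piece of work, KCA, which mitigates hallucination through knowledge consistency checking and offers some useful ideas.
I. A visual exploration tool for the RAG retrieval process
RAGxplorer is an interactive tool for visualizing document chunks in embedding space. By plotting document chunks and queries in the same embedding space, it supports building retrieval-augmented generation (RAG) applications. It drew a lot of attention yesterday.
Its workflow is simple. The core idea is query expansion (generating sub-queries), followed by embedding visualization to observe how the pieces relate to one another.
1. Document upload: users can upload a PDF document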
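The query-expansion idea can be sketched in a few lines. This is a minimal illustration, not RAGxplorer's own code: the function names `build_messages`, `parse_sub_questions`, and `expand_query` are hypothetical, and the sketch assumes the model replies with valid JSON whose keys are numbered strings. The LLM call itself is left out, so the reply is passed in as a plain string.

```python
import json
from typing import Dict, List


def build_messages(system_prompt: str, question: str) -> List[Dict[str, str]]:
    """Pair a system prompt with the user's question in chat-message form."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def parse_sub_questions(reply: str) -> List[str]:
    """Turn a JSON object with numbered keys into an ordered list of sub-questions."""
    data = json.loads(reply)
    # Sort numerically so that key "10" comes after key "2".
    return [data[key] for key in sorted(data, key=int)]


def expand_query(question: str, reply: str) -> List[str]:
    """Return the original question followed by its sub-questions."""
    return [question] + parse_sub_questions(reply)


if __name__ == "__main__":
    # Hypothetical model reply, used only to exercise the parser.
    fake_reply = '{"2": "What was the revenue in 2023?", "1": "What was the revenue in 2022?"}'
    for query in expand_query("Why did revenue decline?", fake_reply):
        print(query)
```

Each string in the expanded list would then be embedded and plotted alongside the document chunks, which is what makes the drift between the original question and its sub-questions visible.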
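from typing import Any, List

from pypdf import PdfReader  # assumption: the PyPDF2 package exposes the same class


def _load_pdf(file: Any) -> List[str]:
    """
    Loads and extracts text from a PDF file.

    Args:
        file: The PDF file to load.
    Returns:
        A list of strings, each representing the text of a page.
    """
    pdf = PdfReader(file)
    # Extract each page once and skip pages that yield no text.
    page_texts = (p.extract_text() for p in pdf.pages)
    return [text.strip() for text in page_texts if text]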
2. Chunk configuration: options for setting the chunk size and overlap
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def _split_text_into_chunks(pdf_texts: List[str], chunk_size: int, chunk_overlap: int) -> List[str]:
    """
    Splits the text from a PDF into chunks based on character count.

    Args:
        pdf_texts: List of text extracted from PDF pages.
        chunk_size: The maximum number of characters in one chunk.
        chunk_overlap: The maximum number of characters shared between consecutive chunks.
    Returns:
        A list of text chunks.
    """
    # Separators are tried in order: paragraphs, lines, sentences, words, characters.
    character_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return character_splitter.split_text('\n\n'.join(pdf_texts))
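To make the roles of chunk size and overlap concrete, here is a simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at the listed separators); it is a plain fixed-width sliding window, and the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
    # prints ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighbouring chunks, at the cost of storing and embedding more chunks overall.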
3. Choose an embedding model: all-MiniLM-L6-v2 or text-embedding-ada-002
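from typing import List

import chromadb
import streamlit as st
from chromadb.utils import embedding_functions
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# _generate_random_string and AnyScaleEmbeddings are not shown in this excerpt;
# they are assumed to be defined elsewhere in the RAGxplorer code base.


def _create_and_populate_chroma_collection(token_split_texts: List[str], embedding_model: str) -> chromadb.Collection:
    """
    Creates a Chroma collection and populates it with the given text chunks.

    Args:
        token_split_texts: List of text chunks split by token count.
        embedding_model: Name of the embedding model to use.
    Returns:
        A Chroma collection object populated with the text chunks.
    """
    chroma_client = chromadb.Client()
    document_name = _generate_random_string(10)
    if embedding_model == "all-MiniLM-L6-v2":
        # SentenceTransformerEmbeddingFunction defaults to all-MiniLM-L6-v2.
        chroma_collection = chroma_client.create_collection(document_name, embedding_function=SentenceTransformerEmbeddingFunction())
    elif embedding_model == "text-embedding-ada-002":
        openai_ef = embedding_functions.OpenAIEmbeddingFunction(
            api_key=st.secrets['OPENAI_API_KEY'],
            model_name="text-embedding-ada-002"
        )
        chroma_collection = chroma_client.create_collection(document_name, embedding_function=openai_ef)
    elif embedding_model == "gte-large":
        chroma_collection = chroma_client.create_collection(document_name, embedding_function=AnyScaleEmbeddings)
    else:
        # Without this branch an unknown name would leave chroma_collection unbound.
        raise ValueError(f"Unsupported embedding model: {embedding_model}")
    ids = [str(i) for i in range(len(token_split_texts))]
    chroma_collection.add(ids=ids, documents=token_split_texts)
    return chroma_collection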
4. Create the vector store: build the vector database with Chroma
from typing import Any

import chromadb

# _split_chunks_into_tokens is not shown in this excerpt; it is assumed to be
# defined elsewhere in the RAGxplorer code base.


def build_vector_database(file: Any, chunk_size: int, chunk_overlap: int, embedding_model: str) -> chromadb.Collection:
    """
    Builds a vector database from a PDF file by splitting the text into chunks and embedding them.

    Args:
        file: The PDF file to process.
        chunk_size: The maximum number of characters in one chunk.
        chunk_overlap: The maximum number of characters shared between consecutive chunks.
        embedding_model: Name of the embedding model to use.
    Returns:
        A Chroma collection object containing the embedded chunks.
    """
    pdf_texts = _load_pdf(file)
    character_split_texts = _split_text_into_chunks(pdf_texts, chunk_size, chunk_overlap)
    token_split_texts = _split_chunks_into_tokens(character_split_texts)
    chroma_collection = _create_and_populate_chroma_collection(token_split_texts, embedding_model)
    return chroma_collection
5. Query expansion: generate sub-questions and a hypothetical answer to strengthen retrieval. The part to look at here is the prompts.
MULTIPLE_QNS_SYS_MSG = "Given a question, your task is to generate 3 to 5 simple sub-questions related to the original question. "\
    "These sub-questions are to be short. Format your reply in json with numbered keys. "\
    "EXAMPLE: "\
    "INPUT: What are the top 3 reasons for the decline in Microsoft's revenue from 2022 to 2023? "\
    "OUTPUT: {'1': 'What was microsoft's revenue in 2022?', '2': 'What was microsoft's revenue in 2023?', '3': 'What are the drivers for revenue?'}"
HYDE_SYS_MSG = "Given a question, your task is to craft a template for the hypothetical answer. Do not include any facts, and instead label them as <PLACEHOLDERS>. "\
    "FOR EXAMPLE: INPUT: 'What is the revenue of microsoft in 2021 and 2022?' "\
    "OUTPUT: 'Microsoft's 2021 and 2022 revenue is <MONETARY SUM> and <MONETARY SUM> respectively.'"
6. Interactive visualization: use Plotly to visualize the chunks
import plotly.graph_objects as go
import streamlit as st

# VISUALISATION_SETTINGS is not shown in this excerpt; it is assumed to be a dict
# that maps a category name to its marker settings.


def plot_embeddings(df):
    # Create a figure
    fig = go.Figure()
    # Draw one scatter trace per category so each gets its own legend entry.
    for category in df['category'].unique():
        category_df = df[df['category'] == category]
        settings = VISUALISATION_SETTINGS.get(category, {'color': 'grey', 'opacity': 1, 'symbol': 'circle', 'size': 10})
        fig.add_trace(go.Scatter(
            x=category_df['x'],
            y=category_df['y'],
            mode='markers',
            name=category,
            marker=dict(
                color=settings['color'],
                opacity=settings['opacity'],
                symbol=settings['symbol'],
                size=settings['size'],
                line_width=0
            ),
            hoverinfo='text',
            text=category_df['document_cleaned']
        ))
    # Set the layout, including moving the legend to the top
    fig.update_layout(
        height=500,
        legend=dict(
            y=100,
            x=0.5,
            xanchor='center',
            yanchor='top',
            orientation='h'
        )
    )
    return st.plotly_chart(fig, use_container_width=True)
The final result looks like this:
image
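The plot shows how close the original query and the sub-questions are to each other. After query expansion the semantics do shift, which lets retrieval reach other, differently phrased content.
Repository: https://kkgithub.com/gabrielchua/RAGxplorer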
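A numeric companion to the plot is to score each sub-question against the original query. The sketch below uses cosine similarity on toy two-dimensional vectors; real embeddings would come from the chosen embedding model, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of RAGxplorer.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors standing in for embeddings of the query and two sub-questions.
    query = [1.0, 0.0]
    subs = {"sub_question_1": [1.0, 1.0], "sub_question_2": [0.0, 1.0]}
    for name, score in rank_by_similarity(query, subs):
        print(f"{name}: {score:.3f}")
```

A sub-question with a lower score sits further from the original query, which is the semantic shift the visualization makes visible.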
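II. KCA: mitigating hallucination through knowledge consistency checking
The paper's starting point is the inconsistency between the pre-training corpus and the alignment training data. This inconsistency makes LLMs prone to producing persuasive but hallucinated responses after alignment. As shown in Figure 2, a rising percentage of knowledge inconsistency correlates directly with a rising hallucination rate.
Back to the classic question of how to find the knowledge in fine-tuning data that the LLM never learned: interesting work has appeared, such as reorganizing alignment training data that lies beyond the model's capability boundary, but these methods have two main limitations.
First, the base LLM and the aligned LLM differ in capability, so such methods cannot accurately reflect the inconsistency between the pre-training corpus and the alignment training data. Second, given the diversity of real-world tasks, precisely assessing an LLM's capabilities across scenarios remains difficult.
"Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment" (https://arxiv.org/abs/2401.10768.pdf) introduces Knowledge Consistent Alignment (KCA). It reduces the inconsistency between the external knowledge in the alignment data and the intrinsic knowledge the LLM memorized from the pre-training corpus, and thereby reduces hallucination in aligned LLMs.
Project repository: https://github.com/fanqiwan/KCA
The overall system has two stages: Knowledge Inconsistency Detection and Knowledge Inconsistency Processing.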
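First, we are given a training dataset for instruction fine-tuning, $D = \{(I_i, R_i)\}_{1 \le i \le N}$, consisting of $N$ instruction-response pairs. It is split into two disjoint subsets: $D_n$, whose instances require external knowledge, and $D_d$, whose instances do not. For each instruction in $D_n$, supplementary reference knowledge $K_i$ is generated.
Second, a multiple-choice exam $E_i = (Q_i, A_i)$ is formulated to assess how well the base LLM $\mathcal{M}$ understands $K_i$, where $Q_i$ and $A_i$ are the questions and answers in $E_i$. Based on the exam scores, the instances are divided into an inconsistent subset $D_i$ and a consistent subset $D_c$ (the subscripts in $D_i$ and $D_c$ name the subsets; they are not instance indices).
Finally, various strategies are applied to handle $D_i$, and $\mathcal{M}$ is fine-tuned on the processed dataset, reducing the inconsistency through knowledge consistent alignment. Figure 3 gives an overview of the proposed KCA method.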
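The exam-and-threshold step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the score is taken to be the fraction of questions answered correctly, `answer_fn` is a hypothetical stand-in for querying the base LLM, and the names `Exam`, `exam_score`, and `split_by_consistency` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Exam:
    questions: List[str]  # multiple-choice questions generated from the reference knowledge
    answers: List[str]    # gold option letters, one per question


def exam_score(exam: Exam, answer_fn: Callable[[str], str]) -> float:
    """Fraction of exam questions the model answers correctly (assumed scoring rule)."""
    if not exam.questions:
        return 0.0
    correct = sum(1 for q, a in zip(exam.questions, exam.answers) if answer_fn(q) == a)
    return correct / len(exam.questions)


def split_by_consistency(
    instances: List[Dict[str, str]],
    exams: List[Exam],
    answer_fn: Callable[[str], str],
    threshold: float,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (inconsistent, consistent): instances scoring below threshold are inconsistent."""
    inconsistent, consistent = [], []
    for instance, exam in zip(instances, exams):
        if exam_score(exam, answer_fn) < threshold:
            inconsistent.append(instance)
        else:
            consistent.append(instance)
    return inconsistent, consistent


if __name__ == "__main__":
    # A stub "model" that always picks option A, used only to exercise the logic.
    always_a = lambda question: "A"
    data = [{"instruction": "q1"}, {"instruction": "q2"}]
    exams = [Exam(["x", "y", "z"], ["A", "A", "B"]), Exam(["x", "y", "z"], ["B", "B", "B"])]
    d_inconsistent, d_consistent = split_by_consistency(data, exams, always_a, threshold=0.5)
    print(len(d_inconsistent), len(d_consistent))
```

The threshold controls how strict the filter is: raising it moves more instances into the inconsistent subset.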
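1. Knowledge inconsistency detection
To detect inconsistency between the external knowledge in the instruction fine-tuning data and the intrinsic knowledge that $\mathcal{M}$ memorized from the pre-training corpus, the paper proposes a three-stage framework:
knowledge requirements classification, reference knowledge generation, and examination formulation.
For knowledge requirements classification: different tasks have different requirements. Tasks such as analysis and legal consulting need the support of external knowledge, whereas tasks such as creative writing or table parsing do not. The instances $(I_i, R_i)$ in $D$ are therefore split into the two subsets $D_n$ and $D_d$ according to whether the task requires external knowledge. To do this, an aligned LLM $G$ is prompted to perform few-shot classification via in-context learning.
The corresponding prompt is as follows: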
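For reference knowledge generation: each instance $(I_i, R_i)$ in the subset $D_n$, which requires external knowledge, needs its corresponding reference knowledge $K_i$. Human annotation is the intuitive solution, but it is costly and raises potential issues of quality, reliability, diversity, self-consistency, and undesirable bias. An LLM is therefore used instead: $G$ is prompted to generate $K_i$ for each $(I_i, R_i)$. The corresponding prompt is as follows:
For examination formulation: to detect inconsistency between each piece of external knowledge $K_i$ in $D_n$ and the intrinsic knowledge in $\mathcal{M}$, $G$ is first prompted to produce $E_i = (Q_i, A_i)$, which contains $M$ multiple-choice questions (the count $M$ is distinct from the base model $\mathcal{M}$). A test score $S_i$ is then computed for $\mathcal{M}$ on each $E_i$. If $S_i$ falls below a given threshold $T$, the instance $(I_i, R_i)$ is classified as inconsistent and placed in $D_i$; otherwise it is consistent and placed in $D_c$. The corresponding prompt is as follows: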
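2. Knowledge inconsistency processing
Once knowledge inconsistency has been detected, standard instruction fine-tuning takes no action on the inconsistent subset $D_i$, so the aligned LLM ends up producing persuasive but hallucinated responses. To mitigate hallucination caused by knowledge inconsistency, several simple yet effective techniques for handling $D_i$ are used: (i) Open-Book Tuning, (ii) Discarding Tuning, and (iii) Refusal Tuning.
Specifically:
Open-book tuning: to keep the LLM from learning inconsistent external knowledge during instruction tuning, the generated reference knowledge $K_i$ is introduced alongside the instruction $I_i$ for instances in $D_i$. $\mathcal{M}$ is then fine-tuned on a mixture of $D_d$, $D_c$, and the knowledge-augmented $D_i$. By supplying supplementary context for completing the instruction, the introduced $K_i$ keeps the parameters of $\mathcal{M}$ from being updated toward inconsistent information.
Discarding tuning: dropping the data that contains knowledge inconsistency reduces the hallucination it causes. Concretely, $D_i$ is discarded and $\mathcal{M}$ is fine-tuned only on a mixture of $D_d$ and $D_c$.
Refusal tuning: an aligned LLM should answer honestly the queries it understands and humbly acknowledge the ones it does not. The responses $R_i$ in $D_i$ are therefore rewritten into a refusal format, and the processed $D_i$ is used together with $D_c$ and $D_d$ to fine-tune $\mathcal{M}$. Like open-book tuning, which supplements the instruction with knowledge, refusal tuning also keeps the LLM from acquiring knowledge beyond its capabilities.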
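The three strategies differ only in what they do to the inconsistent subset before mixing it back in, which a short sketch makes clear. Everything here is an assumption for illustration: the record keys, the choice to place the knowledge before the instruction, the wording of `REFUSAL_TEXT`, and the function names are hypothetical and not taken from the KCA code.

```python
from typing import Callable, Dict, List

Instance = Dict[str, str]  # keys: "instruction", "response", and "knowledge" where available

# Hypothetical refusal wording; the paper's actual refusal format is not reproduced here.
REFUSAL_TEXT = "Sorry, I do not know the facts needed to answer this question."


def open_book(d_i: List[Instance]) -> List[Instance]:
    """Prepend the reference knowledge to each instruction (ordering is an assumption)."""
    return [{**x, "instruction": f"{x['knowledge']}\n\n{x['instruction']}"} for x in d_i]


def discard(d_i: List[Instance]) -> List[Instance]:
    """Drop the inconsistent instances entirely."""
    return []


def refuse(d_i: List[Instance]) -> List[Instance]:
    """Replace each response with a refusal."""
    return [{**x, "response": REFUSAL_TEXT} for x in d_i]


def build_training_set(
    d_d: List[Instance], d_c: List[Instance], d_i: List[Instance], strategy: str
) -> List[Instance]:
    """Mix the untouched subsets with the processed inconsistent subset."""
    processors: Dict[str, Callable[[List[Instance]], List[Instance]]] = {
        "open_book": open_book,
        "discard": discard,
        "refusal": refuse,
    }
    if strategy not in processors:
        raise ValueError(f"Unknown strategy: {strategy}")
    return d_d + d_c + processors[strategy](d_i)
```

Calling `build_training_set(d_d, d_c, d_i, "discard")` yields only the `d_d` and `d_c` records, while the other two strategies keep every instance but change either its instruction or its response.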
Now look at the accompanying dataset, hosted at https://huggingface.co/datasets/Wanfq/KCA_data
1) need_and_have_fact: fine-tuning data that requires external knowledge
{
  "id": "need_and_llama-2-13b_have_fact_1",
  "conversations": [
    {
      "from": "human",
      "value": "How do I find out which process is using my webcam on macOS?"
    },
    {
      "from": "gpt",
      "value": ""
    }
  ],
  "class": "need_and_have_fact",
  "analysis": "This command is asking for instructions on how to find out which process is using the webcam on macOS. It requires knowledge of the macOS system and the steps to identify and monitor processes using the webcam.",
  "knowledge": "To find out which process is using your webcam on macOS, you can follow these steps:\n\n1. Launch the Activity Monitor: Activity Monitor is a built-in utility on macOS that displays detailed information about processes and system resources. You can open it by searching for \"Activity Monitor\" in Spotlight or by navigating to \"Applications\" > \"Utilities\" > \"Activity Monitor.\"\n\n2. Identify the process: Once the Activity Monitor is open, click on the \"Process Name\" column header to sort the processes alphabetically. Look for processes that might be related to webcam usage. Some common processes to check for include:\n\n a. 'VDCAssistant': This process is responsible for managing video devices, including the webcam. If an application is actively using the webcam, you should see 'VDCAssistant' listed in the process list.\n\n b. 'AppleCameraAssistant': This process is associated with the FaceTime HD camera (built-in webcam on Macs). If a process is currently using the FaceTime camera, you may find 'AppleCameraAssistant' in the process list.\n\n Note: Other applications/processes might access the webcam as well, and they may have different names depending on the software you have installed.\n\n3. Check process activity: To see if a process is actively using the webcam, look at the CPU and Energy tabs of the Activity Monitor. If a process is using a significant amount of CPU or energy, it's likely utilizing the webcam. You can click on the process to highlight it and see more details at the bottom of the Activity Monitor window.\n\n4. Quit or force quit unwanted processes: If you identify a process that you want to stop from accessing the webcam, you can quit it by selecting it in the Activity Monitor and clicking the \"X\" button in the toolbar or by right-clicking and choosing \"Quit Process.\" If a process is unresponsive or causing issues, you can force quit it using the \"Force Quit\" option.\n\nBy following these steps, you should be able to identify which process is currently using your webcam on macOS using the Activity Monitor utility."
}
2) no_need_fact: fine-tuning data that does not require external knowledge
{
  "id": "no_need_fact_168",
  "conversations": [
    {
      "from": "human",
      "value": "Could an organism on Earth perceive time in both directions?"
    },
    {
      "from": "gpt",
      "value": ""
    }
  ],
  "class": "no_need_fact",
  "analysis": "This command is asking about the ability of an organism on Earth to perceive time in both directions. While it involves scientific knowledge, it also touches upon speculative or hypothetical scenarios. As such, it does not necessarily require factual information retrieval and may involve more speculative or theoretical discussions.",
  "knowledge": ""
}
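Records in this shape are easy to inspect with the standard library. The sketch below assumes the data is available as a JSON list of such records (the sample string is abbreviated from the two examples above, and the actual file layout on the hub is not verified here); `group_by_class` and `first_human_turn` are hypothetical helper names.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def group_by_class(records: List[Record]) -> Dict[str, List[Record]]:
    """Bucket records by their "class" field."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["class"]].append(record)
    return dict(groups)


def first_human_turn(record: Record) -> str:
    """Return the first human utterance of a record, or an empty string if none exists."""
    for turn in record["conversations"]:
        if turn["from"] == "human":
            return turn["value"]
    return ""


# Abbreviated sample in the same shape as the records shown above.
SAMPLE = """
[
  {"id": "need_and_llama-2-13b_have_fact_1",
   "conversations": [{"from": "human", "value": "How do I find out which process is using my webcam on macOS?"},
                     {"from": "gpt", "value": ""}],
   "class": "need_and_have_fact", "knowledge": "To find out which process is using your webcam on macOS, ..."},
  {"id": "no_need_fact_168",
   "conversations": [{"from": "human", "value": "Could an organism on Earth perceive time in both directions?"},
                     {"from": "gpt", "value": ""}],
   "class": "no_need_fact", "knowledge": ""}
]
"""

if __name__ == "__main__":
    for name, items in group_by_class(json.loads(SAMPLE)).items():
        print(name, len(items), "|", first_human_turn(items[0]))
```

Only the records whose class requires external knowledge carry a non-empty "knowledge" field, so grouping by class is a quick way to see which instances went through reference knowledge generation.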
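Summary
This post covered implementation details of two pieces of work: RAGxplorer, a visual exploration tool for the RAG retrieval process, and KCA, which mitigates hallucination through knowledge consistency checking. Both are technical points worth attention when deploying RAG-based industry question answering today, and are worth a closer look if you are interested.
References
1. https://arxiv.org/abs/2401.10768.pdf
2. https://kkgithub.com/gabrielchua/RAGxplorer
About us
老刘 (刘焕勇), an NLP open-source enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.
老刘说NLP regularly publishes language resources, engineering practice, technical summaries, and similar content. You are welcome to follow.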