One Only

[Paper Review] Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

oneonlee — Tue, 19 Nov 2024 15:16:46 +0900

arXiv: https://arxiv.org/abs/2410.07176

OpenReview: https://openreview.net/forum?id=xy6B5Fh2v7

Code: x

Keywords: Retrieval Augmented Generation, Knowledge Conflicts

1. Motivation

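Because RAG relies on retrieved results, imperfect retrieval (irrelevant or misleading passages) can lead to inaccurate LLM responses
When retrieved results differ from what the LLM already knows, a knowledge conflict can arise, but most prior work does not account for this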

2. Preliminary Experiment: Imperfect retrieval is common

$$Retrieval\ Precision=\frac{\#\ retrieved\ passages\ containing\ correct\ answer}{\#\ total\ retrieved\ passages}$$
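
As a rough illustration of the formula above, the sketch below counts a passage as a hit when it contains any gold answer string. The case-insensitive substring match and the function name `retrieval_precision` are assumptions made for this example, not details confirmed by the paper.

```python
from typing import List


def retrieval_precision(passages: List[str], answers: List[str]) -> float:
    """Fraction of retrieved passages that contain at least one correct answer.

    Assumption: "contains" means a case-insensitive substring match.
    """
    if not passages:
        return 0.0
    answers_lower = [a.lower() for a in answers]
    hits = sum(
        any(a in passage.lower() for a in answers_lower) for passage in passages
    )
    return hits / len(passages)


passages = [
    "Paris is the capital of France.",
    "France is known for its cuisine.",
    "Lyon is a large city in France.",
    "The Eiffel Tower attracts many visitors.",
]
print(retrieval_precision(passages, ["Paris"]))
```

A precision near 0 for a question means almost none of its retrieved passages support the correct answer, which is the regime the preliminary experiment highlights.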

3. Problems & Previous Works

Problems

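Imperfect retrieval and knowledge conflicts are widespread, and both cause RAG errors
- According to prior work, under a knowledge conflict LLMs tend to answer from the incorrect information rather than reasoning jointly over internal and external knowledge [1-3]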

Previous Works

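- Unlike prior work that focuses on the retrieval results [1,4], this paper focuses on the post-retrieval stage, where the retrieved passages are already given, and uses the LLM's internal knowledge to make RAG more robust
- It also proposes a training-free method for the black-box setting that resolves knowledge conflicts directly, combining the useful information from both sides to obtain a more reliable answer
So, is there a way to resolve conflicts between an LLM's internal and external knowledge to make RAG reliable?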

4. Methods

Overview

Step 1/3: Passages Generation of Internal Knowledge

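Explicitly elicit the LLM's internal knowledge
- Prompt the LLM to generate multiple passages based on the question $q$
- The purpose is to allow cross-checking between the LLM's internal knowledge and external knowledge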

Step 2/3: Iterative Source-aware Knowledge Consolidation

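Prompt the LLM to compare all internal and external passages together and explicitly consolidate the context
- Consistent information → cluster & summarize
- Conflicting information → separate
- Irrelevant information → exclude
When the LLM consolidates knowledge, the source of each piece of knowledge is provided along with it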
- Internal (memory) or External
Repeat this process $t$ times to refine the contexts into more useful ones

Step 3/3: Answer Finalization

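Prompt the LLM to generate one answer per group and then select a single final answer by weighing reliability
- The reliability assessment considers the knowledge source, cross-source agreement, and the level of detail of the information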
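
The three steps can be sketched as a small prompt pipeline. This is a minimal sketch under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a completion, `num_internal` is an illustrative parameter, and the prompt wording is a paraphrase of the steps described in this post, not the prompts used in the paper.

```python
from typing import Callable, List


def astute_rag(
    question: str,
    retrieved: List[str],
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    num_internal: int = 1,      # illustrative: how many internal passages to elicit
    t: int = 1,                 # number of consolidation iterations
) -> str:
    # Step 1: elicit the LLM's internal knowledge as explicit passages.
    internal = [
        llm(f"Write a short factual passage that helps answer: {question}")
        for _ in range(num_internal)
    ]

    # Tag every passage with its source so consolidation is source-aware.
    context = "\n".join(
        [f"[External] {p}" for p in retrieved]
        + [f"[Internal] {p}" for p in internal]
    )

    # Step 2: consolidate the context t times.
    for _ in range(t):
        context = llm(
            "Group consistent information and summarize it, keep conflicting "
            "information in separate groups, drop irrelevant information, and "
            "keep the source tags.\n"
            f"Question: {question}\nPassages:\n{context}"
        )

    # Step 3: one candidate answer per group, then pick the most reliable one.
    return llm(
        "Propose one answer per group, then choose the most reliable one "
        "(consider source, cross-source agreement, and level of detail).\n"
        f"Question: {question}\nConsolidated context:\n{context}"
    )
```

With these defaults a single question costs `num_internal + t + 1` LLM calls, which is why the iteration count $t$ matters for cost as well as accuracy.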

5. Experimental Settings

Dataset

NQ, TriviaQA, BioASQ, PopQA → short-form QA datasets

Passage Collection

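For each question, retrieve the top 30 results through the Google Search API
Select the top 10 accessible websites
From each website, extract the paragraph corresponding to the search-result snippet and use it as a passage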

Metric

Accuracy: a response is counted as correct if it contains the ground-truth answer
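
A containment-based accuracy of this kind can be sketched as follows. The case-insensitive substring match and the support for several gold aliases per question are assumptions made for illustration; the exact normalization used in the paper is not stated in this post.

```python
from typing import List


def contains_answer(response: str, gold_answers: List[str]) -> bool:
    """True if the response contains any gold answer (case-insensitive)."""
    response_norm = response.lower()
    return any(gold.lower() in response_norm for gold in gold_answers)


def accuracy(responses: List[str], gold_answer_lists: List[List[str]]) -> float:
    """Share of responses that contain at least one of their gold answers."""
    if not responses:
        return 0.0
    correct = sum(
        contains_answer(r, golds) for r, golds in zip(responses, gold_answer_lists)
    )
    return correct / len(responses)


print(accuracy(["The capital is Paris.", "I think it is Rome."], [["Paris"], ["Madrid"]]))
```

Note that a containment metric rewards long responses that mention many candidates, which is one reason it suits short-form QA better than open-ended generation.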

LLM Parameters

LLM: gemini-1.5-pro-002, claude-3-5-sonnet@20240620
temperature: 0
max_token: 1024
# shot: zero-shot

Baselines

USC (universal self-consistency) [5]
- Samples multiple LLM responses and has the LLM pick the most consistent one (a simple improvement that only needs basic API calls)
GenRead [6]
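- Answers from documents generated out of the LLM's internal knowledge (no external knowledge is used)
RobustRAG [7]
- Generates an answer from each document independently and aggregates the final answer by keywords
InstructRAG [8]
- A RAG method that generates a rationale when answering
Self-Route [9]
- Adaptively switches between RAG and the LLM alone when answering (switching between internal and external knowledge)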

6. Main Results

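No RAG vs. RAG: on datasets such as NQ and TriviaQA, not using RAG sometimes performs better
- This appears to be caused by knowledge conflicts between the retrieved results and the LLM
- In contrast, on BioASQ and PopQA, which are domain-specific and long-tail QA respectively, RAG improves LLM performance
None of the baselines performs consistently well
- This suggests that the baselines are fitted to specific settings and are hard to apply universally
In contrast, Astute RAG consistently outperforms the baselines on all datasets
Increasing t, the number of knowledge consolidation iterations, yields diminishing gains, because less information remains to be consolidated with each iteration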

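With Gemini, increasing t improves performance on BioASQ and PopQA
- These two datasets rely more on external knowledge, and repeating knowledge consolidation helps mitigate noise in the external information
Once t reaches 3, performance on NQ and TriviaQA stops improving
- External knowledge plays a less important role on these two datasets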

7. Analyses

Performance by Retrieval Precision

When retrieval quality is very low (retrieval precision close to 0), the other baselines fall below No RAG, whereas Astute RAG is the only method that performs above this reference level

Performance by Knowledge Conflicts

8. Conclusions

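Proposes Astute RAG, which mitigates knowledge conflicts without any additional training
- Uses the LLM's internal knowledge to iteratively refine the generated response
- Consolidates internal and external knowledge in a source-aware manner to finalize the answer
Limitations
- Depends on the LLM's instruction-following and reasoning abilities
- The LLM's inherent biases and hallucinations can surface during knowledge consolidation
- Comparing the number of API calls in the Main Results does not seem meaningful
- The number of tokens used across the API calls should be compared instead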

References

[1] (LREC-COLING 2024) Tug-of-War between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models

[2] (ACL 2024) Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?

[3] (ICLR 2024) Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

[4] (ACL 2024) When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

[5] (ICML Workshop 2024) Universal Self-Consistency for Large Language Models
[6] (ICLR 2023) Generate rather than Retrieve: Large Language Models are Strong Context Generators
[7] (arXiv 2024) Certifiably Robust RAG against Retrieval Corruption
[8] (arXiv 2024) InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
[9] (EMNLP Industry 2024) Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

Goals of Continual Learning, Forward Transfer, and Backward Transfer

oneonlee — Sun, 27 Oct 2024 21:59:08 +0900

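Goals of Continual Learning

Avoid Catastrophic Forgetting
- Knowledge of previous tasks must be preserved
Positive Forward Transfer
- Knowledge learned on previous tasks should help with subsequent tasks
Positive Backward Transfer
- Knowledge learned on subsequent tasks should also improve performance on previous tasks
Task-Order Free Learning
- All tasks should be performed well regardless of the order in which they are learned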

Forward Transfer

Forward transfer is a model's ability to use knowledge from previously learned tasks to improve learning efficiency and performance on a new task.

In continual learning, forward transfer is measured by how easily the model can learn a new task using representations learned from past tasks.

Ideally, a model with positive forward transfer can exploit prior knowledge effectively, so it learns new tasks faster or with fewer resources.

Backward Transfer

Backward transfer, on the other hand, is the effect that learning a new task has on the performance of previously learned tasks. In the positive case, performance on earlier tasks improves as the new task is learned.

Positive backward transfer occurs when learning a new task improves performance on previous tasks. Its distinguishing feature is that performance can improve without revisiting the data of the previous tasks.

In most neural networks, however, catastrophic forgetting makes backward transfer hard to achieve.

Because continual learning focuses on preventing catastrophic forgetting, it can unintentionally limit backward transfer, and positive backward transfer is rare.

Some methods nevertheless aim for positive backward transfer by selectively updating the model in ways that benefit both new and previous tasks.
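
Both quantities are commonly measured from an accuracy matrix. The sketch below follows the definitions popularized by Lopez-Paz and Ranzato (GEM, NeurIPS 2017), which are not part of the sources of this post: with $R_{i,j}$ the accuracy on task $j$ after training through task $i$ and $b_j$ the accuracy of a randomly initialized model on task $j$, $BWT = \frac{1}{T-1}\sum_{i=1}^{T-1}(R_{T,i} - R_{i,i})$ and $FWT = \frac{1}{T-1}\sum_{i=2}^{T}(R_{i-1,i} - b_i)$. The numbers in the example are made up for illustration.

```python
from typing import List


def backward_transfer(R: List[List[float]]) -> float:
    """Average change on earlier tasks after training on all T tasks.

    R[i][j] = accuracy on task j after training through task i (0-indexed).
    Negative values indicate forgetting; positive values indicate
    positive backward transfer.
    """
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R: List[List[float]], b: List[float]) -> float:
    """Average gain on a task before training on it, relative to baseline b.

    b[j] = accuracy of a randomly initialized model on task j.
    """
    T = len(R)
    return sum(R[i - 1][i] - b[i] for i in range(1, T)) / (T - 1)


# Illustrative (made-up) accuracies for a sequence of three tasks.
R = [
    [0.90, 0.20, 0.10],
    [0.85, 0.80, 0.30],
    [0.88, 0.75, 0.70],
]
b = [0.10, 0.10, 0.10]
print(backward_transfer(R), forward_transfer(R, b))
```

In this toy matrix the backward transfer is slightly negative (mild forgetting) while the forward transfer is positive, which matches the observation above that positive backward transfer is the rarer of the two.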

References

ContinualAI Wiki - Introduction to Continual Learning, https://wiki.continualai.org/the-continualai-wiki/introduction-to-continual-learning
Prof. Chanyoung Park (KAIST, Department of Industrial and Systems Engineering) - Universal User Representation Learning based on Continual Learning, https://dsail.kaist.ac.kr/files/NAVER_Techtalk2023.pdf
Lin et al., Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer, NeurIPS 2022, https://arxiv.org/abs/2211.00789

[Brief Paper Summary] Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation

oneonlee — Fri, 25 Oct 2024 20:19:12 +0900

Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation

Publication Info: Information Sciences 2025
URL: https://www.sciencedirect.com/science/article/pii/S0020025524012829

Contribution

Clearly defines the continual learning paradigm in the context of IR tasks
Proposes the Topic-MS-MARCO dataset for evaluating continual IR
- Includes topic-specific IR tasks and predefined task similarity
Proposes the CLNIR (Continual Learning Framework for Neural Information Retrieval) framework
- Seven strategies based on regularization and replay mechanisms

Limitations

Optimization-based and architecture-based strategies are not explored
Only a single dataset is used
- Hard to regard it as a cross-domain setting

[Brief Paper Summary] Dense Retrieval Adaptation using Target Domain Description

oneonlee — Fri, 25 Oct 2024 20:17:08 +0900

Dense Retrieval Adaptation using Target Domain Description

Cited by 3 (as of 2024-10-22)
Publication Info: ACM ICTIR 2023
URL: https://arxiv.org/abs/2307.02740

Summary

Follow-up to (NAACL 2022) GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
- Related post: [Brief Paper Summary] GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
Presents a method that performs unsupervised domain adaptation using only a description of the target domain

Problem

Related Work
- Most existing methods for domain adaptation of IR models assume access to target-domain data
In practice, however, the actual target-domain data may be inaccessible
- e.g., medical records or legally restricted data cannot be shared
Similar to a zero-shot setting, this paper improves a dense retrieval model using only a description of the target domain, without any target data
- description: a high-level text description that outlines the task and characteristics of the data

Methods

Exp - Dataset

Target Retrieval Task 1: Bio-Medical IR
- TREC Covid Track in 2020 (TREC-COVID)
Target Retrieval Task 2: Financial Question Answering
- FiQA-2018 Task 2 (FiQA)
Target Retrieval Task 3: Argument Retrieval
- ArguAna
Target Retrieval Task 4: Duplicate Question Retrieval
- Quora
  - The aim of duplicate question retrieval is to detect repeated questions asked on community question-answering (CQA) forums
Target Retrieval Task 5: Fact Checking
- SciFact

[Brief Paper Summary] GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

oneonlee — Fri, 25 Oct 2024 20:10:58 +0900

GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Cited by 142 (as of 2024-10-22)
Publication Info: NAACL 2022
URL: https://aclanthology.org/2022.naacl-main.168

Kexin Wang, Nandan Thakur, Nils Reimers, Iryna Gurevych. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

Summary

Proposes GPL, an unsupervised domain adaptation method, and extensively compares it against existing domain adaptation methods (nDCG@10)

Problem

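Dense retrieval requires large amounts of training data, which are unavailable for most domains
Dense retrieval is highly sensitive to domain shifts
- A model trained on MS MARCO performs rather poorly on questions about COVID-19 scientific literature
- MS MARCO was created before COVID-19, so it contains no COVID-19 topics, and the model never learned to represent them well in the vector space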

Related Work

Prior work has shown that dense retrieval performance drops under domain shift

Contribution

Proposes Generative Pseudo Labeling (GPL), an unsupervised domain adaptation method that combines a query generator with pseudo labeling from a cross-encoder
1. Generate synthetic queries for passages of the target domain with a pre-trained T5
2. Treat the passage used to generate a query as the positive passage for that query
3. Negative mining: use an existing dense retrieval model to retrieve the passages most similar to the generated query and treat them as negative passages (hard negatives)
4. Score each (query, passage) pair with a cross-encoder, then train the dense retrieval model on the generated pseudo-labeled queries with MarginMSE loss
Compares GPL with previous domain adaptation models
- Previous Domain Adaptation Methods
  - UDALM, MoDIR
- Pre-Training based Domain Adaptation
  - CD, SimCSE, CT, MLM, ICT, TSDAE
- Generation-based Domain Adaptation
  - QGen, QGen (w/ Hard Negatives)
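
The training objective in step 4 can be illustrated with a dependency-free sketch of a MarginMSE-style loss: the dense retriever's dot-product margin between the positive and the negative passage is regressed onto the cross-encoder's margin. The function names, the plain-list vectors, and the toy numbers are assumptions for illustration; a real setup would use learned embeddings and an autograd framework.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def margin_mse_loss(
    triples: List[Tuple[Vector, Vector, Vector]],  # (query, positive, negative) embeddings
    teacher_margins: List[float],  # cross-encoder score(q, pos) - score(q, neg)
) -> float:
    """Mean squared error between student and teacher score margins."""
    errors = []
    for (q, pos, neg), target in zip(triples, teacher_margins):
        student_margin = dot(q, pos) - dot(q, neg)
        errors.append((student_margin - target) ** 2)
    return sum(errors) / len(errors)


triples = [
    ([1.0, 0.0], [2.0, 0.0], [0.5, 1.0]),  # student margin 1.5
    ([0.0, 1.0], [0.0, 1.0], [0.0, 3.0]),  # student margin -2.0
]
teacher_margins = [1.5, 2.0]
print(margin_mse_loss(triples, teacher_margins))
```

Because the target is a soft margin rather than a hard label, a mined "negative" that the cross-encoder actually scores highly yields a small or negative target margin, so false negatives are penalized less than they would be under a hard-label loss.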

Exp - Dataset

six representative domain-specific datasets from the BeIR benchmark
- FiQA (financial domain)
- SciFact (scientific papers)
- BioASQ (biomedical Q&A)
- TREC-COVID (scientific papers on COVID-19)
- CQADupStack (12 StackExchange subforums)
- Robust04 (news articles)

Limitation

No analysis of catastrophic forgetting

[Brief Paper Summary] Continual Learning of Long Topic Sequences in Neural Information Retrieval

oneonlee — Fri, 25 Oct 2024 20:05:56 +0900

Continual Learning of Long Topic Sequences in Neural Information Retrieval

Cited by 6 (as of 2024-10-22)
Publication Info: ECIR 2022
URL: https://arxiv.org/abs/2201.03356

Summary

Follow-up to (ECIR 2021) Studying Catastrophic Forgetting in Neural Ranking Models, which measured the extent of catastrophic forgetting in deep neural ranking models and proposed a remedy
- Related post: [Brief Paper Summary] Studying Catastrophic Forgetting in Neural Ranking Models

Problem

Content and user needs can change over time
The goal is to determine whether IR models can adapt their ranking ability to new topics/trends, and whether models kept up to date in this way still perform well on earlier topics/trends

Contribution

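Designs a corpus derived from MSMarco to handle long topic sequences and IR-driven controlled topic sequences for continual learning
Compares the performance of different neural ranking models in a long-term continual learning IR setting and in controlled settings
- RQ1: Modeling the long topic sequence; how should a sequence of tasks be designed for continual learning in IR?
  - Finds that task similarity, a task's position in the learning sequence, and the type of distribution to be transferred (short vs. long text) are important to consider when designing lifelong learning strategies
- RQ2: Performances on the MSMarco long topic sequence; how do neural ranking models perform while learning a long topic sequence? Can signs of catastrophic forgetting be detected?
  - Finds that catastrophic forgetting exists in IR but is lower than in other domains
Investigates how the level of task similarity affects the learning behavior of neural ranking models in continual learning
- RQ3: Behavior on IR-driven controlled settings; does the level of similarity between tasks in a sequence affect model effectiveness and robustness to catastrophic forgetting?
- RQ4: How do neural ranking models adapt to shifts in the query or document distribution?

Exp - Dataset

A self-modified version of the MS MARCO dataset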

Limitation

In this study, different domains refer to different datasets characterized by different data distributions

[Brief Paper Summary] Studying Catastrophic Forgetting in Neural Ranking Models

oneonlee — Fri, 25 Oct 2024 20:01:33 +0900

Cited by 23 (as of 2024-10-22)
Publication Info: ECIR 2021
URL: https://arxiv.org/abs/2101.06984

Summary

Measures the extent of catastrophic forgetting in deep neural ranking models and proposes a remedy
- Highlights a small weakness of IR models: they forget some knowledge over time
(Apparently) the first paper to address continual learning in IR

Problem

Previous ranking model research applied domain adaptation strategies to a single target domain only
However, whether catastrophic forgetting occurs in cross-domain transfer had not been examined

Related Work

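Prior work showed that the level of catastrophic forgetting is strongly affected by the dataset and the architecture
However, no prior work had examined catastrophic forgetting from the perspective of cross-domain transferability or, if it exists, shown how to overcome it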

Contribution

Measures catastrophic forgetting, the performance drop on old knowledge after a ranking model acquires new knowledge, in a cross-domain setting
Identifies dataset characteristics that predict catastrophic forgetting
Verifies experimentally that a cross-domain regularizer can mitigate catastrophic forgetting

Exp - Dataset

MS MARCO
TREC CORD19
TREC Microblog

Limitation

In this study, different domains refer to different datasets characterized by different data distributions
The stream consists of two or three consecutive datasets
- Realistic long-term topic sequences are not considered
Language shift and information updates are not considered

[Brief Paper Summary] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

oneonlee — Mon, 30 Sep 2024 17:34:15 +0900

(EMNLP 2023) SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

arXiv: https://arxiv.org/abs/2303.08896

code: https://github.com/potsawee/selfcheckgpt

1. Problem

Hallucination Detection
- Existing fact verification methods may not work for black-box models such as ChatGPT, so a new approach is needed that can detect hallucinations without external resources

2. Related Works

intrinsic uncertainty metrics (e.g., token probability or entropy)
- information may not be available to users when systems are accessed through limited external APIs
fact-verification approaches
- facts can only be assessed relative to the knowledge present in the database
- hallucinations are observed over a wide range of tasks beyond pure fact verification

3. Proposed Key Ideas: SelfCheckGPT (sampling-based approach)

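Introduces 'SelfCheckGPT', a sampling-based approach for detecting hallucinations in black-box LLMs without relying on external resources
- Measures consistency across multiple sampled responses using several variants: BERTScore, question answering, n-gram analysis, NLI, and LLM prompting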
The motivating idea
- When an LLM has been trained on a given concept, the sampled responses are likely to be similar and contain consistent facts.
- However, for hallucinated facts, stochastically sampled responses are likely to diverge and may contradict one another.
zero-resource hallucination detection solution that can be applied to black-box systems
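
The motivating idea can be illustrated with a simplified unigram scorer in the spirit of the n-gram variant: a sentence whose tokens rarely appear in the other sampled responses receives a higher score. This is a sketch under stated assumptions, not the scoring code of the selfcheckgpt package; the tokenizer, the add-one smoothing, and the function name `unigram_inconsistency` are choices made for this example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_inconsistency(sentence: str, samples: List[str]) -> float:
    """Average negative log-probability of the sentence's tokens under an
    add-one-smoothed unigram model fit on the sampled responses.

    Higher scores mean the sentence is less supported by the samples.
    """
    counts = Counter(tok for s in samples for tok in tokenize(s))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    nll = [-math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(nll) / len(nll)


samples = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "In 1903 Marie Curie received the Nobel Prize in Physics.",
    "Marie Curie was awarded the Nobel Prize in Physics in 1903.",
]
supported = "Marie Curie won the Nobel Prize in Physics."
unsupported = "Marie Curie invented the telephone in Boston."
print(unigram_inconsistency(supported, samples) < unigram_inconsistency(unsupported, samples))
```

The scorer only needs text samples from the model, which is what makes this family of checks usable with black-box APIs that expose no token probabilities.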

[Brief Paper Summary] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

oneonlee — Mon, 30 Sep 2024 15:47:40 +0900

(ICLR 2023 notable-top-25%) Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

arXiv: https://arxiv.org/abs/2302.09664

code: https://github.com/lorenzkuhn/semantic_uncertainty

1. Motivation

Estimating the uncertainty of LLM-generated answers is an important problem for trustworthy LLMs
However, existing token-likelihood-based methods for estimating answer uncertainty do not account for semantic equivalence
- semantic equivalence: sentences that differ lexically but carry the same meaning
This paper addresses uncertainty estimation in LLMs, specifically for QA tasks, where semantic equivalence causes existing methods to struggle

2. Related Work on Uncertainty Estimation

predictive entropy of the output distribution
- $PE(x) = H(Y \vert x) = -\int p(y \vert x) \log{p(y \vert x)} dy$

3. Proposed Key Ideas

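semantic entropy
- An unsupervised uncertainty measure: to reduce the uncertainty that comes from semantically similar samples, it marginalizes over those samples (their semantic likelihood) and merges their information into one
- The method uses bidirectional entailment clustering to group semantically equivalent outputs and computes uncertainty from the distribution over these clusters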
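
The clustering-then-entropy idea can be sketched as follows: sampled answers are grouped by an equivalence test, the probabilities inside each group are summed, and the entropy is taken over groups. The `equivalent` callable is a hypothetical stand-in for a bidirectional entailment check with an NLI model, and normalizing over the sampled set is a simplification of the paper's estimator; the example probabilities are made up.

```python
import math
from typing import Callable, List


def semantic_entropy(
    samples: List[str],
    probs: List[float],  # sequence likelihoods of the samples (need not sum to 1)
    equivalent: Callable[[str, str], bool],  # stand-in for bidirectional entailment
) -> float:
    """Entropy over meaning clusters instead of over individual strings."""
    clusters: List[List[int]] = []  # each cluster holds sample indices
    for i, s in enumerate(samples):
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if equivalent(samples[cluster[0]], s):
                cluster.append(i)
                break
        else:
            clusters.append([i])

    total = sum(probs)
    cluster_probs = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * math.log(p) for p in cluster_probs if p > 0)


samples = ["Paris", "It's Paris.", "The capital is Paris", "Rome"]
probs = [0.4, 0.3, 0.2, 0.1]


def same_city(a: str, b: str) -> bool:
    # Toy equivalence test for this example only.
    return ("paris" in a.lower()) == ("paris" in b.lower())


print(semantic_entropy(samples, probs, same_city))
print(semantic_entropy(samples, probs, lambda a, b: a == b))
```

With the toy equivalence test, the three paraphrases of "Paris" collapse into one cluster, so the entropy is much lower than the string-level entropy computed in the second call, which treats each phrasing as a different answer.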

4. Summary of Experimental Results

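For larger models and more challenging datasets, semantic entropy achieves better AUROC than existing uncertainty measures
It predicts model accuracy on QA tasks better than comparable baselines, and its performance improves as model size grows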

Brief Paper Summary Post Template

oneonlee — Mon, 30 Sep 2024 14:06:03 +0900

(venue year) Title

arXiv:

code:

1. Problem

.
- .

2. Importance of the Problem

.
- .

3. Related Works

.
- .

4. Proposed Key Ideas

.
- .

5. Summary of Experimental Results

.
- .