Paper Review

[Paper Review] Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models

oneonlee 2024. 6. 12. 17:13

arXiv : https://arxiv.org/abs/2310.14696

code : https://github.com/gankim/tree-of-clarifications


1. Introduction & Related Work

In the open-domain question answering (ODQA) task, users often ask ambiguous questions (AQs), which can be interpreted in multiple ways.

 

  • Three related lines of work on handling AQs
    1. Min et al., AmbigQA: Answering Ambiguous Open-domain Questions, EMNLP 2020
      • providing individual answers to disambiguated questions (DQs) for all plausible interpretations of the given AQ
    2. Guo et al., Abg-CoQA: Clarifying Ambiguity in Conversational Question Answering, AKBC 2021
      • asking a clarification question
    3. Stelmakh et al., ASQA: Factoid Questions Meet Long-Form Answers, EMNLP 2022
      • provides a comprehensive response without bothering the user for clarification
      • The task is to identify all DQs of the given AQ and generate a long-form answer addressing all the DQs
      • e.g., "What is the name of the vietnamese currency?"라는 질문은 아래와 같이 여러개의 ambiguity를 내포함
        • "question": "What is the name of the Vietnamese currency since May 3, 1978?",
          "answer": "đồng"
        • "question": "What was the name of North Vietnamese currency from 1946-78?",
          "answer": "North Vietnamese đồng"
        • "question": "What was the name of South Vietnamese currency from 1954 to September 22, 1975?",
          "answer": "South Vietnamese đồng"
        • "question": "What was the name of South Vietnamese currency from September 22, 1975 to May 3, 1978?",
          "answer": "liberation đồng'"
      • The goal of the ASQA task is to answer such a question with a response like the following:
        • "Between 1946 and 1978, Vietnamese currency has had a number of name changes due to the divide between North and South Vietnam. From 1946 to 1954, North Vietnam used the North Vietnamese đồng while South Vietnam used both the piastre and South Vietnamese đồng. On 22 September 1975, after the fall of Saigon, the currency in South Vietnam was renamed to the "liberation đồng". After North and South Vietnam was reunified, the đồng was also unified, on May 3, 1978."
      • This paper (ToC) adopts the ASQA task formulation.

 

  • Two main challenges of the ASQA task
    1. An AQ may need to be clarified by considering multiple dimensions of ambiguity
      • e.g., the question "what country has the most medals in Olympic history" requires clarifying dimensions such as:
        • the type of medal (gold, silver, or bronze)
        • the type of Olympics (summer or winter)
    2. Substantial knowledge is required to identify DQs and their respective answers
      • e.g., one must know that the medal types differ and know the exact medal counts for each country

 

  • Contributions of the proposed Tree of Clarifications (ToC) framework
    1. ToC explores diverse paths for clarifying an AQ and can prune unhelpful DQs
      (addresses the first challenge of the ASQA task)
    2. First attempt to combine LLMs with retrieval systems to generate long-form answers to AQs
      (retrieving passages relevant to the AQ addresses the second challenge of the ASQA task)
    3. Outperforms all baselines on the ASQA benchmark across all metrics in the few-shot setting

2. Proposed Method

Figure 1: Overview of TREE OF CLARIFICATIONS. (1) relevant passages for the ambiguous question (AQ) are retrieved. (2) leveraging the passages, disambiguated questions (DQs) for the AQ are recursively generated via few-shot prompting and pruned as necessary. (3) a long-form answer addressing all DQs is generated.

Main Idea

  1. Retrieval-Augmented Clarification (RAC)
    1. Collect a set of 200 passages in total by combining Wikipedia passages retrieved by ColBERT (a dense retriever pretrained on MS MARCO) with passages retrieved via the Bing search engine
    2. Rerank the collected passage set with a SentenceBERT model pretrained on MS MARCO
    3. After reranking, select the top-k passages via nearest-neighbor search and augment the prompt with them
  2. Tree Structure (TS)
    • Starting from the AQ at the root node, recursively perform RAC to expand child nodes
      (child nodes: disambiguated question-answer pairs)
    • At each expansion step, the passages are reranked again according to the current query
    • Expansion repeats until the maximum number of nodes or the maximum depth is reached (10 valid nodes in the experiments)
    • Expansion uses breadth-first search (BFS)
  3. Pruning with Self-Verification
    • For the AQ "Who will host the next world cup 2022?", a generated disambiguated question-answer pair like the one below falls outside the scope of the AQ
      • "DQ: Who hosted the world cup 2018? A: Russia"
    • Self-verification is therefore applied to remove such unhelpful nodes
      • It checks the factual coherency between the AQ and the target node's answer
      • Implemented via LLM prompting
  4. Answer Generation
    • Aggregate all valid nodes and generate a long-form answer (steps 1-4 are sketched in code after this list)
  5. Etc.
    1. Ambiguity Detection
      • When ToC fails to disambiguate the given question, or all generated disambiguations are pruned, the question can be regarded as unambiguous.
    2. Computational Complexity
      • ToC requires multiple LLM calls, but their maximum number is fewer than 20 per question.
      • As an additional implementation detail, the number of LLM calls is kept in check by specifying when ToC expansion terminates.
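
Below is a minimal end-to-end sketch of steps 1-4, meant as an illustration rather than the authors' implementation (see the linked repository for that). The `llm`, `retrieve`, and `rerank` callables, the prompt wording, the "DQ: ... A: ..." output format, and the depth limit of 3 are all assumptions of this sketch.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str                      # the AQ at the root, a DQ elsewhere
    answer: str = ""
    children: list["Node"] = field(default_factory=list)

def parse_dq_pairs(completion: str) -> list[tuple[str, str]]:
    """Parse 'DQ: <question> A: <answer>' lines out of an LLM completion.
    The output format is an assumption of this sketch."""
    pairs = []
    for line in completion.splitlines():
        if line.startswith("DQ:") and " A: " in line:
            dq, ans = line[len("DQ:"):].split(" A: ", 1)
            pairs.append((dq.strip(), ans.strip()))
    return pairs

def is_valid(aq: str, ans: str, llm) -> bool:
    """Pruning with self-verification: keep a node only if the LLM judges
    its answer factually coherent with the original AQ."""
    verdict = llm(
        f"Question: {aq}\nCandidate answer: {ans}\n"
        "Is this a valid answer to the question? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def tree_of_clarifications(aq, llm, retrieve, rerank,
                           max_nodes=10, max_depth=3, top_k=5):
    """BFS expansion of the clarification tree with RAC and pruning."""
    passages = retrieve(aq)            # e.g., ColBERT + Bing, ~200 passages
    root = Node(aq)
    queue = deque([(root, 0)])
    n_valid = 0
    while queue and n_valid < max_nodes:
        node, depth = queue.popleft()  # BFS: expand shallower nodes first
        if depth >= max_depth:
            continue
        # RAC: rerank passages against the *current* query, then prompt
        context = "\n".join(rerank(node.question, passages)[:top_k])
        prompt = (f"{context}\n\nAmbiguous question: {node.question}\n"
                  "List disambiguations as 'DQ: ... A: ...':")
        for dq, ans in parse_dq_pairs(llm(prompt)):
            if not is_valid(aq, ans, llm):          # prune off-scope nodes
                continue
            child = Node(dq, ans)
            node.children.append(child)
            queue.append((child, depth + 1))
            n_valid += 1
            if n_valid >= max_nodes:
                break

    # Answer generation: aggregate all surviving DQ-answer pairs
    def collect(n: Node) -> list[tuple[str, str]]:
        pairs = [(n.question, n.answer)] if n.answer else []
        for c in n.children:
            pairs += collect(c)
        return pairs

    qa_block = "\n".join(f"DQ: {q} A: {a}" for q, a in collect(root))
    return llm(f"Ambiguous question: {aq}\n{qa_block}\n"
               "Write a long-form answer addressing all of the above:")
```

Note that the pruning check compares each candidate answer against the original AQ rather than its parent DQ, following the paper's description of verifying factual coherency with the AQ.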

3. Experiments & Results

Dataset

  • ASQA benchmark
    • A subset of the AmbigNQ dataset, which is derived from Natural Questions
    • Consists of 6,316 pairs of ambiguous questions and long-form answers with disambiguations

 

Evaluation Metrics

The evaluation metrics proposed in ASQA are used:

  1. Disambig-F1 ($F1_{D}$)
    • The idea is to recast evaluation as a reading comprehension task using a RoBERTa model pretrained on SQuADv2
    • The long-form answer generated by the method serves as the reading-comprehension context and each short disambiguated question as the input question; the token-level F1-score is then measured between the predicted short answer and the ground-truth short answer (a minimal sketch of this computation follows the metrics list)


  2. ROUGE-L ($R_{L}$)
    • Measured using the longest common subsequence (LCS) between reference long-form answers and system-generated predictions
  3. DR (Disambiguation-ROUGE) score $= \sqrt{ F1_{D} \times R_{L} }$
    • The geometric mean of Disambig-F1 and ROUGE-L
  4. Answer-F1
    • Accuracy between the answer predicted for each $DQ_i^{(k)}$ and the ground-truth $\widehat{DA}_i^{(k)}$
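
A minimal sketch of the Disambig-F1 and DR computations described above, assuming a RoBERTa reader fine-tuned on SQuAD 2.0 from the Hugging Face hub (`deepset/roberta-base-squad2` is a stand-in, not necessarily the checkpoint ASQA used) and plain whitespace tokenization; the official ASQA evaluation script should be preferred for reproducible numbers.

```python
from transformers import pipeline

# Stand-in RoBERTa reader fine-tuned on SQuAD 2.0 (assumed checkpoint)
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold short answers
    (whitespace tokenization; the official script normalizes further)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def disambig_f1(long_answer: str, dq_pairs: list[tuple[str, str]]) -> float:
    """Use the generated long-form answer as the reading-comprehension
    context and extract a short answer for every disambiguated question."""
    scores = [
        token_f1(qa(question=dq, context=long_answer)["answer"], gold)
        for dq, gold in dq_pairs
    ]
    return sum(scores) / len(scores)

def dr_score(f1_d: float, rouge_l: float) -> float:
    """DR = sqrt(Disambig-F1 * ROUGE-L), the geometric mean of the two."""
    return (f1_d * rouge_l) ** 0.5
```

For example, dr_score(0.40, 0.36) ≈ 0.38; because DR is a geometric mean, a system that does well on only one of the two axes is penalized.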

 

Baseline

  1. Fine-tuned T5-Large (770M) (Stelmakh et al., 2022)
    • Models are evaluated in the closed-book setup or combined with JPR
      (*JPR: a task-specific dense retriever for ambiguous QA that enhances DPR)
  2. Prompt-engineering methods on PaLM & GPT-3 text-davinci-002 (175B)
    • closed-book setup

 

Results

  • TOC outperforms fully-supervised and few-shot prompting baselines.

Table 1: Evaluation results for long-form QA on ambiguous questions from the development set of ASQA (Stelmakh et al., 2022). Baselines are either fully-supervised or 5-shot prompted. Note that the ToC framework consists of retrieval-augmented clarification (RAC) and tree structure (TS).

 

  • Integrating retrieval systems largely contributes to accurate and diverse disambiguations. 

Table 2: Ablation study on all components of retrieval-augmented clarification (RAC).

 

  • Our pruning method precisely identifies helpful disambiguations from the tree.

Table 3: Ablated results with and without pruning methods. The number of retained DQs after pruning and Answer-F1 are reported.

 

Limitations

  • The experiments are conducted only on the ASQA benchmark.
  • The cost of multiple rounds of LLM prompting is not negligible.
  • CoT prompting was tried in pilot experiments, but it failed to improve performance.
  • There remains room to improve performance by refining the re-ranker and the pruning method.

Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. 2023. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 996-1009, Singapore. Association for Computational Linguistics.
