Sample code

Commit d9f12113 authored Apr 22, 2024 by 창호 허
--- a/HCX_RAG.ipynb
+++ b/HCX_RAG.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4684a49d-08be-4192-98d8-e6e378f8d674",
+   "metadata": {},
+   "source": [
+    "# LINER PDF Chat Tutorial (Simple)\n",
+    "\n",
+    "![](https://raw.githubusercontent.com/liner-engineering/liner-pdf-chat-tutorial/main/images/liner-pdf-chat.gif)\n",
+    "\n",
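+    "This tutorial covers the code for a question-answering chatbot that answers questions based on a **PDF** file. The text was written for **ChatGPT**; the code in this notebook calls the CLOVA Studio `HCX-002` chat-completions endpoint instead. <br>\n",
+    "By the end of the tutorial, you will know how to build a product like the one shown above. <br><br>\n",
+    "\n",
+    "The tutorial proceeds in **three stages**.\n",
+    "- **PDF-to-Text**\n",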
+    "- **Text Preprocessing**\n",
+    "- **Vector Search**\n",
+    "\n",
+    "\n",
+    "# 1. PDF-to-Text\n",
+    "\n",
+    "This stage extracts plain text that a language model can understand from the PDF file. <br>\n",
+    "It consists of `PDF-to-Image`, which converts the PDF into page images, and `Image-to-Text`, which extracts text from those images."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4c3fd898-d8d9-4044-a920-6a48ae264f5b",
+   "metadata": {},
+   "source": [
+    "## 1.1. PDF-to-Image\n",
+    "\n",
+    "`PDF-to-Image` converts a PDF file into a collection of image files. <br>\n",
+    "Many tools can do this; this tutorial uses [`pdf2image`](https://github.com/Belval/pdf2image). <br><br>\n",
+    "\\* `pdf2image` requires `poppler` to be [installed](https://pdf2image.readthedocs.io/en/latest/installation.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1e728210-712b-451d-bdc7-7e283ab4f223",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!apt-get install poppler-utils\n",
+    "#!brew install poppler\n",
+    "#!pip install pdf2image"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3db689ae-7b23-40d6-b1f1-fd98a9e8a549",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the pdf2image library\n",
+    "from pdf2image import convert_from_path\n",
+    "\n",
+    "# Path to the local PDF file\n",
+    "FILE_NAME = \"sample_data/transformer.pdf\"\n",
+    "\n",
+    "# Read the PDF with `convert_from_path` and convert it into a list of images (one per page)\n",
+    "images = convert_from_path(FILE_NAME)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ce06793-4099-4427-9329-dc106e80dc99",
+   "metadata": {},
+   "source": [
+    "This tutorial uses the 2017 paper [**Attention Is All You Need**](https://arxiv.org/abs/1706.03762) as the example document."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "668c7af7-9feb-4499-ab8d-ff6a4d5e6504",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(images)"
+   ]
+  },
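+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-pdf2image-options-md",
+   "metadata": {},
+   "source": [
+    "Rendering resolution and page range affect both OCR quality and runtime. The cell below is a minimal sketch, not part of the original tutorial: `dpi`, `first_page`, and `last_page` are `convert_from_path` parameters, while the value `200` and the page range 1-3 are arbitrary choices for illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-pdf2image-options",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pdf2image import convert_from_path\n",
+    "\n",
+    "# Assumed example values: 200 DPI and pages 1-3 are illustrative, not tuned.\n",
+    "preview_images = convert_from_path(\n",
+    "    \"sample_data/transformer.pdf\",\n",
+    "    dpi=200,        # rendering resolution of each page image\n",
+    "    first_page=1,   # first page to convert (1-based)\n",
+    "    last_page=3,    # last page to convert (inclusive)\n",
+    ")\n",
+    "\n",
+    "# One image is returned per converted page\n",
+    "print(len(preview_images))"
+   ]
+  },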
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8f46f4da-92f1-443c-8d75-cb52540469ea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save the page images locally for the next step\n",
+    "for i, image in enumerate(images):\n",
+    "    image.save(f\"page_{i}.jpg\", \"JPEG\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9599fe85-85fe-449a-9b45-b082088d5d21",
+   "metadata": {},
+   "source": [
+    "## 1.2. Image-to-Text\n",
+    "\n",
+    "`Image-to-Text` extracts text from the image files saved above. <br>\n",
+    "This tutorial uses [`Google OCR`](https://cloud.google.com/vision/docs/ocr); you can substitute another OCR technology (e.g. [HuggingFace](https://huggingface.co/), [Tesseract](https://github.com/tesseract-ocr/tesseract), ...) if you prefer.\n",
+    "<br><br>\n",
+    "To issue an API key, follow https://yunwoong.tistory.com/148."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe7603e3-3775-4cc6-8194-741d9a598b44",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install opencv-contrib-python\n",
+    "#!pip install --upgrade google-cloud-vision\n",
+    "#!pip install --upgrade google-cloud-speech\n",
+    "#!pip install --upgrade google-cloud-language\n",
+    "#!pip install --upgrade google-cloud-texttospeech"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "974b5c30-0fe9-4c7e-bc2e-4f35415dd998",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"]=\"/Users/changho/workspace/study_rag/sample/sample_data/triple-baton-420805-d94cd9668436.json\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "341ab084-d8b9-43b3-ac38-5847a1affaed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install tqdm\n",
+    "#!echo $GOOGLE_APPLICATION_CREDENTIALS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0d5d6d77-f8eb-41a0-97d5-a0915ce9a9b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the Google OCR library\n",
+    "import io\n",
+    "from tqdm import tqdm\n",
+    "from google.cloud import vision\n",
+    "\n",
+    "client = vision.ImageAnnotatorClient()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db5d89fb-d284-4d55-82c5-901e365d4df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Extract text from an image file using Google OCR\n",
+    "def detect_text(path: str):\n",
+    "    with io.open(path, \"rb\") as image_file:\n",
+    "        content = image_file.read()\n",
+    "\n",
+    "    image = vision.Image(content=content)\n",
+    "\n",
+    "    response = client.text_detection(image=image)\n",
+    "    return response.full_text_annotation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa3b372d-cf88-4774-808c-d63542979cb7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "detect_text(\"page_1.jpg\").text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d5852cd5-a954-428b-b749-7ea7df7e226a",
+   "metadata": {},
+   "source": [
+    "If you use the raw Google OCR output as-is, the *Break* information at the end of each line, such as **spaces** and **line breaks**, is not applied correctly, as in the example above (e.g. an unnecessary newline character follows `Numerous`).<br>\n",
+    "To account for this, Google provides [**Break Detection**](https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Vision.V1/latest/Google.Cloud.Vision.V1.TextAnnotation.Types.DetectedBreak.Types.BreakType). <br>\n",
+    "We therefore post-process the text, placing **spaces** and **line breaks** according to the *Break Detection* results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5652928d-40b5-4e72-879d-a248008c5d1b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "breaks = vision.TextAnnotation.DetectedBreak.BreakType"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48ca77f6-6499-4c52-880f-0835f8d230a8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dir(breaks)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c96ae0db-2b96-40a0-9b66-093cd10d1c88",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Post-processing that applies the Break Detection results\n",
+    "def postprocess_ocr(annotation) -> str:\n",
+    "    text = \"\"\n",
+    "\n",
+    "    for page in annotation.pages:\n",
+    "        for block in page.blocks:\n",
+    "            for paragraph in block.paragraphs:\n",
+    "                for word in paragraph.words:\n",
+    "                    for symbol in word.symbols:\n",
+    "                        detected_break = symbol.property.detected_break\n",
+    "                        detected_break_type = detected_break.type_\n",
+    "\n",
+    "                        if detected_break_type == breaks.UNKNOWN:\n",
+    "                            text += symbol.text\n",
+    "                        elif detected_break_type == breaks.SPACE:\n",
+    "                            text += f\"{symbol.text} \"\n",
+    "                        elif detected_break_type == breaks.SURE_SPACE:\n",
+    "                            text += f\"{symbol.text} \"\n",
+    "                        elif detected_break_type == breaks.EOL_SURE_SPACE:\n",
+    "                            text += f\"{symbol.text} \"\n",
+    "                        elif detected_break_type == breaks.HYPHEN:\n",
+    "                            text += f\"{symbol.text}-\"\n",
+    "                        elif detected_break_type == breaks.LINE_BREAK:\n",
+    "                            text += f\"{symbol.text}\\n\"\n",
+    "\n",
+    "    return text.strip()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d5b78799-96aa-4a09-9998-72e9357038e7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "postprocess_ocr(detect_text(\"page_1.jpg\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0a16400d-fc0b-438a-b7c9-68dccfe40ba4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = []"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "714b634d-8457-4a20-8463-302293f79d32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i in tqdm(range(len(images))):\n",
+    "    documents.append(\n",
+    "        {\n",
+    "            \"page\": int(i+1),\n",
+    "            \"text\": postprocess_ocr(detect_text(f\"page_{i}.jpg\")),\n",
+    "        }\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88886558-aa5b-42f2-bd75-1cf48df3d313",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5fabd06a-c494-4a7f-b697-e09d1253bed4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1d960f74-11a1-480f-9e1e-5fa5cd06cbee",
+   "metadata": {},
+   "source": [
+    "# 2. Text Pre-processing\n",
+    "\n",
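+    "This stage **cleans** the text into units that a language model can understand more easily. <br>\n",
+    "It consists of `Text Cleansing`, which removes unnecessary text, and `Text Chunking`, which splits the text into smaller semantic units. <br>\n",
+    "Because document preprocessing **can strongly affect service quality**, it is worth investing considerably more effort here than the tutorial code does."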
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2778077-b4d3-46a6-b9f7-552de2cda1bf",
+   "metadata": {},
+   "source": [
+    "## 2.1. Text Cleansing\n",
+    "\n",
+    "`Text Cleansing` removes strings that are not needed for using or vectorizing the document. <br>\n",
+    "This logic depends on the domain. This tutorial performs only **minimal cleansing**. <br>\n",
+    "Add further logic to the method below to suit your use case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0f4bf006-f5f6-4481-8477-c458c1cc4ade",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "from typing import List, Optional\n",
+    "\n",
+    "\n",
+    "citation_pattern = r\"\\[\\d+\\]\"\n",
+    "\n",
+    "def cleanse_text(text: str) -> Optional[str]:\n",
+    "    # Filter out very short lines\n",
+    "    if len(text) <= 5:\n",
+    "        return None\n",
+    "\n",
+    "    # Remove citation markers such as [13]\n",
+    "    text = re.sub(citation_pattern, \"\", text)\n",
+    "\n",
+    "    # Collapse runs of multiple spaces into one\n",
+    "    text = re.sub(\" +\", \" \", text)\n",
+    "    return text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "534634fd-d272-4b38-bbfa-d67fc59a3c97",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dirty_text = \"\"\"We show that the  Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large    and limited   training data.\\n1 Introduction\\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and\\n*Equal contribution. Listing order is random.   Jakob proposed replacing RNNs with self-attention and started   the effort to evaluate this idea.\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ac6e172-1a44-464c-9dfe-5c28f95f2561",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dirty_text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d446ed6-6d7e-4acc-84ee-a69527b544af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cleanse_text(dirty_text)"
+   ]
+  },
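+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-cleansing-md",
+   "metadata": {},
+   "source": [
+    "As one example of the kind of extra logic mentioned above, the cell below is a minimal sketch, not part of the original tutorial. The helper name `cleanse_text_extended` and its two extra rules (grouped citations such as `[7, 13]`, and the space left before punctuation once a citation is removed) are assumptions chosen for illustration; real documents will likely need different rules."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-cleansing",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "from typing import Optional\n",
+    "\n",
+    "# Hypothetical pattern: matches [13] as well as grouped citations such as [7, 13]\n",
+    "grouped_citation_pattern = re.compile(r\"\\[\\d+(?:,\\s*\\d+)*\\]\")\n",
+    "\n",
+    "\n",
+    "def cleanse_text_extended(text: str) -> Optional[str]:\n",
+    "    # Same length filter as `cleanse_text`\n",
+    "    if len(text) <= 5:\n",
+    "        return None\n",
+    "\n",
+    "    # Remove single and grouped citation markers\n",
+    "    text = grouped_citation_pattern.sub(\"\", text)\n",
+    "\n",
+    "    # Remove the whitespace left in front of punctuation\n",
+    "    text = re.sub(r\"\\s+([,.;:])\", r\"\\1\", text)\n",
+    "\n",
+    "    # Collapse runs of multiple spaces and trim both ends\n",
+    "    text = re.sub(\" +\", \" \", text)\n",
+    "    return text.strip()\n",
+    "\n",
+    "\n",
+    "cleanse_text_extended(\"long short-term memory [13] and gated recurrent [7, 13] networks .\")"
+   ]
+  },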
+  {
+   "cell_type": "markdown",
+   "id": "d1cd4efa-9f2a-475a-b520-dfa0fa6e43a0",
+   "metadata": {},
+   "source": [
+    "## 2.2. Text Chunking\n",
+    "\n",
+    "`Text Chunking` **splits the text into semantic units** so that each vector carries clear, focused information. <br>\n",
+    "Common approaches split by paragraph or by token count; for convenience, this tutorial splits by **token count**. <br>\n",
+    "As with `Text Cleansing`, it is best to choose a splitting strategy that fits your purpose."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a76baec9-b4d7-427a-bf98-16bd1edf2a5c",
+   "metadata": {},
+   "source": [
+    "**OpenAI** provides the [`tiktoken`](https://github.com/openai/tiktoken) library, which can be used to count the tokens in a piece of text. <br>\n",
+    "This tutorial uses `tiktoken` for token-count-based chunking."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7f7e87f-b114-420b-8795-eb29c9f2902e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install tiktoken"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "26043673-90a0-4eda-9430-b49c7db0466b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the tiktoken library\n",
+    "import tiktoken"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a241f65-e5f0-4afc-be24-760443fa64f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use `cl100k_base`, the encoding used by ChatGPT models, as the default encoding\n",
+    "tokenizer = tiktoken.get_encoding(\"cl100k_base\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d6dbc1d-5a4a-4415-9f3f-34abbe02e290",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Count the number of tokens in the input text\n",
+    "def num_tokens_from_text(text: str) -> int:\n",
+    "    num_tokens = len(tokenizer.encode(text))\n",
+    "    return num_tokens"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "01dd7987-f6bd-40e6-ae53-11394970f0b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Token threshold at which a chunk is closed (a chunk can slightly exceed it)\n",
+    "CHUNK_SIZE = 180"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca9d141d-dbd6-41cd-8703-31bc26bd3e05",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split a document into chunks by token count\n",
+    "def chunkify(text: str) -> List[str]:\n",
+    "    lines = text.split(\"\\n\")\n",
+    "\n",
+    "    chunks = []\n",
+    "\n",
+    "    chunk = \"\"\n",
+    "    for line in lines:\n",
+    "        line = cleanse_text(line)\n",
+    "        if line is None:\n",
+    "            continue\n",
+    "\n",
+    "        chunk += f\" {line}\"\n",
+    "\n",
+    "        if num_tokens_from_text(chunk) >= CHUNK_SIZE:\n",
+    "            chunks.append(chunk.strip())\n",
+    "            chunk = \"\"\n",
+    "\n",
+    "    # Append the final chunk if anything remains (stripped, like the other chunks)\n",
+    "    if chunk:\n",
+    "        chunks.append(chunk.strip())\n",
+    "\n",
+    "    return chunks"
+   ]
+  },
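+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-overlap-chunking-md",
+   "metadata": {},
+   "source": [
+    "A common variation on token-count chunking is a sliding window in which consecutive chunks share some tokens, so that a sentence cut at a boundary still appears whole in one chunk. The cell below is a minimal sketch of that idea, not the tutorial's method: the function name `chunkify_with_overlap` and the `overlap` value of 30 are assumptions. Because it slices raw token IDs, a window boundary can fall in the middle of a word."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-overlap-chunking",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import List\n",
+    "\n",
+    "import tiktoken\n",
+    "\n",
+    "\n",
+    "def chunkify_with_overlap(text: str, chunk_size: int = 180, overlap: int = 30) -> List[str]:\n",
+    "    # The window advances by `chunk_size - overlap`, so the overlap must be smaller than the chunk\n",
+    "    if not 0 <= overlap < chunk_size:\n",
+    "        raise ValueError(\"overlap must satisfy 0 <= overlap < chunk_size\")\n",
+    "\n",
+    "    encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
+    "    tokens = encoding.encode(text)\n",
+    "    step = chunk_size - overlap\n",
+    "\n",
+    "    chunks = []\n",
+    "    for start in range(0, len(tokens), step):\n",
+    "        window = tokens[start:start + chunk_size]\n",
+    "        chunks.append(encoding.decode(window).strip())\n",
+    "        # Stop once the current window already reaches the end of the text\n",
+    "        if start + chunk_size >= len(tokens):\n",
+    "            break\n",
+    "\n",
+    "    return chunks\n",
+    "\n",
+    "\n",
+    "# Number of overlapping chunks for a short repeated sentence\n",
+    "len(chunkify_with_overlap(\"The Transformer relies entirely on attention. \" * 40))"
+   ]
+  },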
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2bbe2c89-8169-49f9-829b-487c4075d092",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunks = chunkify(documents[0][\"text\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "502730a8-4567-40bc-a15b-c0d9d68f31fd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunks[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ffda1491-f050-483b-8ef3-e57d3dd60a7b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunked_documents = []"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bf741189-10b9-450f-870f-8e19ff5fba5c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for document in documents:\n",
+    "    chunks = chunkify(document[\"text\"])\n",
+    "    for chunk in chunks:\n",
+    "        chunked_documents.append(\n",
+    "            {\n",
+    "                \"page\": document[\"page\"],\n",
+    "                \"text\": chunk,\n",
+    "            }\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "717bbf98-d11e-4d01-acb2-a03dea5043ee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunked_documents[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ae7a8b1-f585-4aab-bbeb-6cfc2e04ffe7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Drop the first chunk: odd characters extracted from the first page image caused an error\n",
+    "chunked_documents.pop(0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "16f0e3c0-96ab-49ab-8142-e4dc57c3157c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(chunked_documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5fc13ce3-9435-4d50-85b7-54d3eee394af",
+   "metadata": {},
+   "source": [
+    "# 3. Vector Search\n",
+    "\n",
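+    "This stage adds the documents to a vector search engine and queries it to retrieve documents that match the user's query. <br>\n",
+    "It consists of `Embedding`, which vectorizes the documents, and `Hybrid Search`, which retrieves the embedded documents."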
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff1e8c57-6660-4d93-ad21-1847889abd8a",
+   "metadata": {},
+   "source": [
+    "## 3.1. Embedding\n",
+    "\n",
+    "![](https://raw.githubusercontent.com/liner-engineering/liner-pdf-chat-tutorial/main/images/openai-vectors.svg)<br><br>\n",
+    "`Embedding` converts the text of each document to be registered in the search engine into a vector. <br>\n",
+    "Various techniques are available for text embedding; this tutorial uses the clir-emb-dolphin embedding. <br><br>\n",
+    "\\* For a deeper understanding of embeddings, see this [link](http://jalammar.github.io/illustrated-word2vec/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d19cfee-db54-4e3b-a56d-885537899aeb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# -*- coding: utf-8 -*-\n",
+    "\n",
+    "import base64\n",
+    "import json\n",
+    "# import http.client\n",
+    "import requests\n",
+    "#import ssl\n",
+    "#ssl._create_default_https_context = ssl._create_unverified_context\n",
+    "\n",
+    "\n",
+    "class CompletionExecutor:\n",
+    "    def __init__(self, host, api_key, api_key_primary_val, request_id):\n",
+    "        self._host = host\n",
+    "        self._api_key = api_key\n",
+    "        self._api_key_primary_val = api_key_primary_val\n",
+    "        self._request_id = request_id\n",
+    "\n",
+    "    def _send_request(self, completion_request):\n",
+    "        headers = {\n",
+    "            'X-NCP-CLOVASTUDIO-API-KEY': self._api_key,\n",
+    "            'X-NCP-APIGW-API-KEY': self._api_key_primary_val,\n",
+    "            'X-NCP-CLOVASTUDIO-REQUEST-ID': self._request_id,\n",
+    "            'Content-Type': 'application/json; charset=utf-8'\n",
+    "        }\n",
+    "\n",
+    "        # Initialize result variable\n",
+    "        result = None\n",
+    "\n",
+    "        try:\n",
+    "            # Use requests.post for making an HTTP POST request\n",
+    "            # Embedding endpoint: /testapp/v1/api-tools/embedding/clir-emb-dolphin/<app id>\n",
+    "            response = requests.post(\n",
+    "                f\"{self._host}/testapp/v1/api-tools/embedding/clir-emb-dolphin/623982be24c547eea38aebff2fb6b7d8\",\n",
+    "                headers=headers, json=completion_request, stream=False\n",
+    "            )\n",
+    "\n",
+    "            # Check if the request was successful (status code 200)\n",
+    "            if response.status_code == 200:\n",
+    "                result = response.json()\n",
+    "            else:\n",
+    "                print(f\"Request failed with status code: {response.status_code}\")\n",
+    "        except requests.RequestException as e:\n",
+    "            # Handle exceptions, log, or raise accordingly\n",
+    "            print(f\"Request failed: {e}\")\n",
+    "\n",
+    "        #print(result)\n",
+    "        \n",
+    "        return result\n",
+    "\n",
+    "    def execute(self, completion_request):\n",
+    "        res = self._send_request(completion_request)\n",
+    "        # `_send_request` returns None when the HTTP request fails\n",
+    "        if res is not None and res['status']['code'] == '20000':\n",
+    "            return res['result']['embedding']\n",
+    "        else:\n",
+    "            return 'Error'\n",
+    "\n",
+    "\n",
+    "if __name__ == '__main__':\n",
+    "    # Credentials are read from environment variables rather than hard-coded.\n",
+    "    # The variable names are this notebook's own choice; `os` was imported in section 1.2.\n",
+    "    completion_executor = CompletionExecutor(\n",
+    "        host='https://clovastudio.apigw.ntruss.com',\n",
+    "        api_key=os.environ['CLOVASTUDIO_API_KEY'],\n",
+    "        api_key_primary_val=os.environ['CLOVASTUDIO_APIGW_API_KEY'],\n",
+    "        request_id='ff649bcc-4632-4fd2-9837-8695ff8f0475'\n",
+    "    )\n",
+    "\n",
+    "    # Iterate over the chunks prepared above, request an embedding for each, and attach it to the chunk\n",
+    "    for chunked_document in tqdm(chunked_documents):\n",
+    "        # Build the payload as a dict; `requests` serializes it to valid JSON,\n",
+    "        # so quotes and backslashes in the text need no manual handling\n",
+    "        request_data = {\"text\": chunked_document[\"text\"]}\n",
+    "\n",
+    "        chunked_document[\"embedding\"] = completion_executor.execute(request_data)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b87ecb8-9e65-45bf-b084-df03d28489e3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunked_documents[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05915905-341c-4ab9-8271-b7de49fc5aa8",
+   "metadata": {},
+   "source": [
+    "## 3.2. Hybrid Search\n",
+    "\n",
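+    "`Hybrid Search` retrieves the documents that can serve as references for the user's query. <br>\n",
+    "As of 2023, many vector search engines were under active development; this tutorial uses [**Pinecone**](https://www.pinecone.io/), which the authors considered the most convenient to develop with at the time."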
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9f373f1-5850-4b24-ba78-fa246aa12571",
+   "metadata": {},
+   "source": [
+    "First, create the index to use in **Pinecone**. <br>\n",
+    "The clir-emb-dolphin embedding returns **1024-dimensional** vectors, so set `Dimensions` to that value and set `Metric` to the similarity metric you want to use for search."
+   ]
+  },
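+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-cosine-md",
+   "metadata": {},
+   "source": [
+    "To make the `Metric` choice concrete, the cell below is a minimal sketch of cosine similarity, one commonly used metric, in plain Python. It is only an illustration: the function name is hypothetical, the vectors are two-dimensional toy values rather than real 1024-dimensional embeddings, and both vectors are assumed to be non-zero."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-cosine",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "from typing import Sequence\n",
+    "\n",
+    "\n",
+    "def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:\n",
+    "    # Dot product divided by the product of the two vector lengths\n",
+    "    dot = sum(x * y for x, y in zip(a, b))\n",
+    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
+    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
+    "    return dot / (norm_a * norm_b)\n",
+    "\n",
+    "\n",
+    "# Vectors 45 degrees apart: the result is about 0.7071\n",
+    "cosine_similarity([1.0, 0.0], [1.0, 1.0])"
+   ]
+  },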
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b2d7770-410e-4ad6-afce-0be94fb303f3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the pinecone library\n",
+    "import pinecone\n",
+    "from pinecone import Pinecone, ServerlessSpec"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2207b92-11ad-4cb3-a6c6-2b260316aaa1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
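+    "# Set the Pinecone API key, read from an environment variable rather than hard-coded\n",
+    "# (the variable name is this notebook's own choice; `os` was imported in section 1.2)\n",
+    "pc = Pinecone(\n",
+    "        api_key=os.environ[\"PINECONE_API_KEY\"]\n",
+    "    )"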
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "916ba29e-0a39-45ab-a8f6-bc1c7aeba216",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check the indexes registered in Pinecone\n",
+    "active_indexes = pc.list_indexes()"
+   ]
+  },
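+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-create-index-md",
+   "metadata": {},
+   "source": [
+    "If you would rather create the index from code, the cell below is a minimal sketch under stated assumptions. The index name `newclass` and the dimension 1024 come from this notebook; the `cosine` metric, the serverless `aws` / `us-east-1` spec, and the `PINECONE_API_KEY` environment variable name are assumptions made for illustration and should be adapted to your account."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-create-index",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from pinecone import Pinecone, ServerlessSpec\n",
+    "\n",
+    "# Assumed environment variable name\n",
+    "pc = Pinecone(api_key=os.environ[\"PINECONE_API_KEY\"])\n",
+    "\n",
+    "INDEX_NAME = \"newclass\"  # index name used later in this notebook\n",
+    "\n",
+    "# Create the index only if it does not exist yet\n",
+    "if INDEX_NAME not in pc.list_indexes().names():\n",
+    "    pc.create_index(\n",
+    "        name=INDEX_NAME,\n",
+    "        dimension=1024,   # clir-emb-dolphin returns 1024-dimensional vectors\n",
+    "        metric=\"cosine\",  # assumed choice; the tutorial leaves the metric to the reader\n",
+    "        spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),  # assumed cloud and region\n",
+    "    )"
+   ]
+  },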
+  {
+   "cell_type": "markdown",
+   "id": "755852f4-ed53-4589-9722-d4535bc98bd2",
+   "metadata": {},
+   "source": [
+    "The variable below should contain the **Pinecone Index** created earlier."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0845c672-a1b5-468f-af47-1fdbb5a13083",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "active_indexes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1cbf8be-7cc8-40ec-a04d-562e00cdef4e",
+   "metadata": {},
+   "source": [
+    "Now build the vector data as tuples in order to register the documents in **Pinecone**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f1ac3252-2f57-4762-8da3-64eb73e182f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vectors = [\n",
+    "    (\n",
+    "        f\"vec{str(i)}\",                  # document ID\n",
+    "        chunked_document[\"embedding\"],   # vector\n",
+    "        {                                # document metadata dictionary\n",
+    "            \"text\": chunked_document[\"text\"],\n",
+    "            \"page\": chunked_document[\"page\"],\n",
+    "            \"file\": FILE_NAME,\n",
+    "        },\n",
+    "    )\n",
+    "    for i, chunked_document in enumerate(chunked_documents)\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70d0c694-a237-4dd4-b51b-629085a23211",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vectors[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b003c6e-6d68-4a59-a926-007e3ca20665",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(vectors)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ad664fa5-9ccf-4bf2-a634-b90765afc1a6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Select the index\n",
+    "index = pc.Index(\"newclass\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7ced2c5-4d58-4088-b559-bd627cdcd66d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Upsert the vector data built above into the selected index\n",
+    "index.upsert(\n",
+    "    vectors=vectors,\n",
+    "    namespace=\"pdf_vectors\",\n",
+    ")"
+   ]
+  },
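+  {
+   "cell_type": "markdown",
+   "id": "added-sketch-batched-upsert-md",
+   "metadata": {},
+   "source": [
+    "For larger documents, sending every vector in a single request can become too large, so upserts are often sent in batches. The cell below is a minimal sketch: the helper `batched` and the batch size of 100 are assumptions, while `index` and `vectors` are the objects defined in the cells above. Re-running an upsert with the same IDs overwrites the existing entries rather than duplicating them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-sketch-batched-upsert",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import Iterator, Sequence, TypeVar\n",
+    "\n",
+    "T = TypeVar(\"T\")\n",
+    "\n",
+    "\n",
+    "def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:\n",
+    "    # Yield consecutive slices of at most `batch_size` items\n",
+    "    for start in range(0, len(items), batch_size):\n",
+    "        yield items[start:start + batch_size]\n",
+    "\n",
+    "\n",
+    "# `index` and `vectors` come from the cells above\n",
+    "for batch in batched(vectors, batch_size=100):\n",
+    "    index.upsert(vectors=batch, namespace=\"pdf_vectors\")"
+   ]
+  },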
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec16a1c5-810a-4f38-b029-9dd58642367b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Embed the user query\n",
+    "def query_embed(text: str) -> List[float]:\n",
+    "    # Build the payload as a dict; `requests` serializes it to valid JSON\n",
+    "    request_data = {\"text\": str(text)}\n",
+    "\n",
+    "    return completion_executor.execute(request_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "083c4121-1189-4ad0-8ad9-e7a9b27e848d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Embed the user query\n",
+    "query_vector = query_embed(\"What advantages do transformers have over RNNs?\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce985435-74df-450d-8f90-ce6eb0b2bb38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hybrid Search using the user query vector and the `filter` logic\n",
+    "query_response = index.query(\n",
+    "    namespace=\"pdf_vectors\",\n",
+    "    top_k=10,\n",
+    "    include_values=True,\n",
+    "    include_metadata=True,\n",
+    "    vector=query_vector,\n",
+    "    filter={\n",
+    "        \"file\": {\"$in\": [FILE_NAME]},\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7483cb93-ee61-4933-b0d0-19c399928d67",
+   "metadata": {},
+   "source": [
+    "The code above passes a `filter` argument in addition to `query_vector`. <br>\n",
+    "The `filter` argument enables **Hybrid Search**: the query returns the most similar documents among only those whose metadata has the given **file** name."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a4489ff1-2b8f-4c4f-8f00-3955dee08403",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Inspect the top-1 match\n",
+    "query_response[\"matches\"][0][\"metadata\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "49f52a7e-1a9c-43c4-b6bf-c829b52d29ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show the PDF page on which the supporting passage was found\n",
+    "images[int(query_response[\"matches\"][0][\"metadata\"][\"page\"])-1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b550690-0633-4d5b-a7b5-f9cb348201c6",
+   "metadata": {},
+   "source": [
+    "# Simple PDF Chat\n",
+    "\n",
+    "The logic needed for the basic functionality is now complete. <br>\n",
+    "Finally, all of the steps above can be combined into a single function, as follows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b1932ef7-fe5b-455c-a5e2-c6f55284a844",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "\n",
+    "class Child_CompletionExecutor(CompletionExecutor):\n",
+    "    def _send_request(self, completion_request):\n",
+    "        headers = {\n",
+    "            'X-NCP-CLOVASTUDIO-API-KEY': self._api_key,\n",
+    "            'X-NCP-APIGW-API-KEY': self._api_key_primary_val,\n",
+    "            'X-NCP-CLOVASTUDIO-REQUEST-ID': self._request_id,\n",
+    "            'Content-Type': 'application/json; charset=utf-8'\n",
+    "        }\n",
+    "\n",
+    "        # Initialize result variable\n",
+    "        result = None\n",
+    "\n",
+    "        try:\n",
+    "            # Use requests.post for making an HTTP POST request\n",
+    "            # Base model endpoint: /testapp/v1/chat-completions/HCX-002\n",
+    "            response = requests.post(\n",
+    "                f\"{self._host}/testapp/v1/chat-completions/HCX-002\",\n",
+    "                headers=headers, json=completion_request, stream=False\n",
+    "            )\n",
+    "\n",
+    "            # Check if the request was successful (status code 200)\n",
+    "            if response.status_code == 200:\n",
+    "                result = response.json()\n",
+    "            else:\n",
+    "                print(f\"Request failed with status code: {response.status_code}\")\n",
+    "        except requests.RequestException as e:\n",
+    "            # Handle exceptions, log, or raise accordingly\n",
+    "            print(f\"Request failed: {e}\")\n",
+    "\n",
+    "        return result\n",
+    "\n",
+    "\n",
+    "    def execute(self, completion_request):\n",
+    "        res = self._send_request(completion_request)\n",
+    "\n",
+    "        # `_send_request` returns None when the HTTP request fails.\n",
+    "        # Status '40103' is treated as an error too: this class holds no token to reissue,\n",
+    "        # so retrying recursively would never terminate.\n",
+    "        if res is not None and res['status']['code'] == '20000':\n",
+    "            return res['result']['message']['content']\n",
+    "        else:\n",
+    "            return 'Error'\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de614c65-bcc0-435b-8ab0-e0ac8200b144",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def pdf_chat(query: str) -> str:\n",
+    "    # 1. Embed the user query\n",
+    "    query_vector = query_embed(query)\n",
+    "\n",
+    "    # 2. Retrieve reference documents through Hybrid Search\n",
+    "    query_response = index.query(\n",
+    "        namespace=\"pdf_vectors\",\n",
+    "        top_k=10,\n",
+    "        include_values=True,\n",
+    "        include_metadata=True,\n",
+    "        vector=query_vector,\n",
+    "        filter={\n",
+    "            \"file\": {\"$in\": [FILE_NAME]},\n",
+    "        },\n",
+    "    )\n",
+    "\n",
+    "    reference = query_response[\"matches\"][0][\"metadata\"]\n",
+    "\n",
+    "    # 3. Put the reference document and the user query into the prompt\n",
+    "    # Credentials are read from environment variables rather than hard-coded\n",
+    "    # (the variable names are this notebook's own choice; `os` was imported in section 1.2)\n",
+    "    child_completion_executor = Child_CompletionExecutor(\n",
+    "        host='https://clovastudio.stream.ntruss.com',\n",
+    "        api_key=os.environ['CLOVASTUDIO_API_KEY'],\n",
+    "        api_key_primary_val=os.environ['CLOVASTUDIO_APIGW_API_KEY'],\n",
+    "        request_id='ff649bcc-4632-4fd2-9837-8695ff8f0475'\n",
+    "    )\n",
+    "\n",
+    "    preset_text = [{\"role\": \"system\", \"content\": \"\\n\".join([\n",
+    "                \"Your role is to answer the user's query based on the references provided.\",\n",
+    "                \"You must base your answer solely on the references, regardless of your own knowledge, and you must include the page information in your answer.\",\n",
+    "            ])},\n",
+    "            {\"role\": \"system\", \"content\": f\"reference: {reference['text']}, page: ({int(reference['page'])})\"},\n",
+    "            {\"role\": \"user\", \"content\": query}]\n",
+    "\n",
+    "    request_data = {\n",
+    "        'messages': preset_text,\n",
+    "        'topP': 0.8,\n",
+    "        'topK': 0,\n",
+    "        'maxTokens': 2048,\n",
+    "        'temperature': 0.5,\n",
+    "        'repeatPenalty': 5.0,\n",
+    "        'stopBefore': [],\n",
+    "        'includeAiFilters': True\n",
+    "    }\n",
+    "\n",
+    "    # 4. Return the answer generated by the LLM\n",
+    "    response_text = child_completion_executor.execute(request_data)\n",
+    "\n",
+    "    return response_text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5593d7d-59ed-467e-81e3-b8d0fbcae341",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response = pdf_chat(\"What advantages do transformers have over RNNs? Please answer in Korean.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "16f30cc6-17eb-4841-b18f-1431a17ab480",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "41eef846-3b4e-4784-9286-44677f360e5a",
+   "metadata": {},
+   "source": [
+    "# Future Work\n",
+    "\n",
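+    "In addition to the features covered in this tutorial, you can try implementing the following.\n",
+    "\n",
+    "- **Multi-turn Chat**: Manage and use the conversation history to carry on a multi-turn conversation rather than a single-turn one.\n",
+    "- **Query Refinement**: Add logic that refines and enriches the query to improve retrieval.\n",
+    "- **Term-based Search**: Add keyword-based search alongside filter- and vector-based search."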
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b35e47ed-d850-4780-aaa6-9b88f53b2fd6",
+   "metadata": {},
+   "source": [
+    "# Future Work 2\n",
+    "\n",
+    "- **google ocr -> NCP OCR**: Replace Google OCR with the Naver Cloud Platform OCR (an EST test account is available).\n",
+    "- **apache tika**: Check whether OCR is really necessary when extracting text from documents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a36130dc-b672-4061-8ed2-f7116521fadb",
+   "metadata": {},
+   "source": [
+    "# Caution KEYS\n",
+    "\n",
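+    "- **google ocr, pinecone**: These are my personal keys. Please do not expose them.\n",
+    "- **NCP HCX**: This is EST's test account. We were told to use it freely, but it should not be overused. (If you would like to explore NCP, ask me for the account details.)"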
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+%% Cell type:markdown id:4684a49d-08be-4192-98d8-e6e378f8d674 tags:
+
+# LINER PDF Chat Tutorial (Simple)
+
+![](https://raw.githubusercontent.com/liner-engineering/liner-pdf-chat-tutorial/main/images/liner-pdf-chat.gif)
+
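+This tutorial walks through the code for a question-answering chatbot that uses **ChatGPT** to answer based on a **PDF** file. <br>
+After completing the tutorial, you will know how to build a product like the one shown above. <br><br>
+
+The tutorial proceeds in **three stages**.
+- **PDF-to-Text**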
+- **Text Preprocessing**
+- **Vector Search**
+
+
+# 1. PDF-to-Text
+
+This stage extracts plain text that a language model can understand from the PDF file. <br>
+It consists of `PDF-to-Image`, which converts the PDF into page images, and `Image-to-Text`, which extracts text from those images.
+
+%% Cell type:markdown id:4c3fd898-d8d9-4044-a920-6a48ae264f5b tags:
+
+## 1.1. PDF-to-Image
+
+`PDF-to-Image` is the step that converts a PDF file into a collection of image files. <br>
+Many tools can do this; this tutorial uses [`pdf2image`](https://github.com/Belval/pdf2image). <br><br>
+\* Using `pdf2image` requires [installing](https://pdf2image.readthedocs.io/en/latest/installation.html) `poppler`.
+
+%% Cell type:code id:1e728210-712b-451d-bdc7-7e283ab4f223 tags:
+
+``` python
+#!apt-get install poppler-utils
+#!brew install poppler
+#!pip install pdf2image
+```
+
+%% Cell type:code id:3db689ae-7b23-40d6-b1f1-fd98a9e8a549 tags:
+
+``` python
+# Import the pdf2image library
+from pdf2image import convert_from_path
+
+# Path to the local PDF file
+FILE_NAME = "sample_data/transformer.pdf"
+
+# Read the PDF with `convert_from_path` and convert it into a list of images
+images = convert_from_path(FILE_NAME)
+```
+
+%% Cell type:markdown id:9ce06793-4099-4427-9329-dc106e80dc99 tags:
+
+This tutorial uses the 2017 paper [**Attention Is All You Need**](https://arxiv.org/abs/1706.03762) as the example document.
+
+%% Cell type:code id:668c7af7-9feb-4499-ab8d-ff6a4d5e6504 tags:
+
+``` python
+len(images)
+```
+
+%% Cell type:code id:8f46f4da-92f1-443c-8d75-cb52540469ea tags:
+
+``` python
+# Save the images locally for the next step
+for i, image in enumerate(images):
+    image.save(f"page_{str(i)}.jpg", "JPEG")
+```
+
+%% Cell type:markdown id:9599fe85-85fe-449a-9b45-b082088d5d21 tags:
+
+## 1.2. Image-to-Text
+
+`Image-to-Text` is the step that extracts text from the image files saved above. <br>
+This tutorial uses [`Google OCR`](https://cloud.google.com/vision/docs/ocr); you may substitute another OCR technology (e.g. [HuggingFace](https://huggingface.co/), [Tesseract](https://github.com/tesseract-ocr/tesseract), ...) if you prefer.
+<br><br>
+To issue an API key, see https://yunwoong.tistory.com/148.
+
+%% Cell type:code id:fe7603e3-3775-4cc6-8194-741d9a598b44 tags:
+
+``` python
+#!pip install opencv-contrib-python
+#!pip install --upgrade google-cloud-vision
+#!pip install --upgrade google-cloud-speech
+#!pip install --upgrade google-cloud-language
+#!pip install --upgrade google-cloud-texttospeech
+```
+
+%% Cell type:code id:974b5c30-0fe9-4c7e-bc2e-4f35415dd998 tags:
+
+``` python
+import os
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/Users/changho/workspace/study_rag/sample/sample_data/triple-baton-420805-d94cd9668436.json"
+```
+
+%% Cell type:code id:341ab084-d8b9-43b3-ac38-5847a1affaed tags:
+
+``` python
+#!pip install tqdm
+#!echo $GOOGLE_APPLICATION_CREDENTIALS
+```
+
+%% Cell type:code id:0d5d6d77-f8eb-41a0-97d5-a0915ce9a9b0 tags:
+
+``` python
+# Import the Google OCR library
+import io
+from tqdm import tqdm
+from google.cloud import vision
+
+client = vision.ImageAnnotatorClient()
+```
+
+%% Cell type:code id:db5d89fb-d284-4d55-82c5-901e365d4df5 tags:
+
+``` python
+# Extract text from an image file using Google OCR
+def detect_text(path: str):
+    with io.open(path, "rb") as image_file:
+        content = image_file.read()
+
+    image = vision.Image(content=content)
+
+    response = client.text_detection(image=image)
+    return response.full_text_annotation
+```
+
+%% Cell type:code id:fa3b372d-cf88-4774-808c-d63542979cb7 tags:
+
+``` python
+detect_text("page_1.jpg").text
+```
+
+%% Cell type:markdown id:d5852cd5-a954-428b-b749-7ea7df7e226a tags:
+
+If the Google OCR result is used as-is, the *Break* information at the end of each line, such as **spaces** and **line breaks**, is not properly reflected in the text, as in the example above (e.g. an unnecessary newline character after `Numerous`).<br>
+Google accounts for this by providing [**Break Detection**](https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Vision.V1/latest/Google.Cloud.Vision.V1.TextAnnotation.Types.DetectedBreak.Types.BreakType). <br>
+We therefore post-process the text, placing **spaces** and **line breaks** according to the result inferred by *Break Detection*.
+
+%% Cell type:code id:5652928d-40b5-4e72-879d-a248008c5d1b tags:
+
+``` python
+breaks = vision.TextAnnotation.DetectedBreak.BreakType
+```
+
+%% Cell type:code id:48ca77f6-6499-4c52-880f-0835f8d230a8 tags:
+
+``` python
+dir(breaks)
+```
+
+%% Cell type:code id:c96ae0db-2b96-40a0-9b66-093cd10d1c88 tags:
+
+``` python
+# Post-processing method that applies the Break Detection result
+def postprocess_ocr(annotation) -> str:
+    text = ""
+
+    for page in annotation.pages:
+        for block in page.blocks:
+            for paragraph in block.paragraphs:
+                for word in paragraph.words:
+                    for symbol in word.symbols:
+                        detected_break = symbol.property.detected_break
+                        detected_break_type = detected_break.type_
+
+                        if detected_break_type == breaks.UNKNOWN:
+                            text += symbol.text
+                        elif detected_break_type == breaks.SPACE:
+                            text += f"{symbol.text} "
+                        elif detected_break_type == breaks.SURE_SPACE:
+                            text += f"{symbol.text} "
+                        elif detected_break_type == breaks.EOL_SURE_SPACE:
+                            text += f"{symbol.text} "
+                        elif detected_break_type == breaks.HYPHEN:
+                            text += f"{symbol.text}-"
+                        elif detected_break_type == breaks.LINE_BREAK:
+                            text += f"{symbol.text}\n"
+
+    return text.strip()
+```
+
+%% Cell type:code id:d5b78799-96aa-4a09-9998-72e9357038e7 tags:
+
+``` python
+postprocess_ocr(detect_text("page_1.jpg"))
+```
+
+%% Cell type:code id:0a16400d-fc0b-438a-b7c9-68dccfe40ba4 tags:
+
+``` python
+documents = []
+```
+
+%% Cell type:code id:714b634d-8457-4a20-8463-302293f79d32 tags:
+
+``` python
+for i in tqdm(range(len(images))):
+    documents.append(
+        {
+            "page": int(i+1),
+            "text": postprocess_ocr(detect_text(f"page_{i}.jpg")),
+        }
+    )
+```
+
+%% Cell type:code id:88886558-aa5b-42f2-bd75-1cf48df3d313 tags:
+
+``` python
+documents[0]
+```
+
+%% Cell type:code id:5fabd06a-c494-4a7f-b697-e09d1253bed4 tags:
+
+``` python
+len(documents)
+```
+
+%% Cell type:markdown id:1d960f74-11a1-480f-9e1e-5fa5cd06cbee tags:
+
+# 2. Text Pre-processing
+
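+This stage **refines** the text data into units that a language model can understand more easily. <br>
+It consists of `Text Cleansing`, which removes unnecessary text, and `Text Chunking`, which splits the text into smaller semantic units. <br>
+Because **service quality can vary greatly** with document pre-processing, it is worth investing considerably more effort in this stage than the tutorial code does.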
+
+%% Cell type:markdown id:e2778077-b4d3-46a6-b9f7-552de2cda1bf tags:
+
+## 2.1. Text Cleansing
+
+`Text Cleansing` is the step that removes strings that are unnecessary for using and vectorizing the document. <br>
+This logic may differ by domain. This tutorial performs only **minimal cleansing**. <br>
+Add further logic to the method below to suit your purpose.
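
As one illustration of the kind of additional logic that could be added, the sketch below rejoins words that were hyphenated across a line break. The helper name `join_hyphenated_lines` is hypothetical and not part of the tutorial. Because the tutorial later cleanses text line by line, a rule like this would have to run on the page text before it is split into lines.

```python
import re

# Hypothetical helper (not part of the original tutorial).
# Matches a letter, a hyphen, a newline, then another letter, e.g. "implemen-\ntation".
_HYPHEN_BREAK = re.compile(r"([A-Za-z])-\n([A-Za-z])")


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


print(join_hyphenated_lines("a simple implemen-\ntation"))
```

Note that this rule also merges genuine compounds that happen to break at a line end (for example `self-` followed by `attention`), so whether it helps depends on the documents being processed.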
+
+%% Cell type:code id:0f4bf006-f5f6-4481-8477-c458c1cc4ade tags:
+
+``` python
+import re
+from typing import List, Optional
+
+
+citation_pattern = r"\[\d+\]"
+
+def cleanse_text(text: str) -> Optional[str]:
+    # Filter out very short lines
+    if len(text) <= 5:
+        return None
+
+    # Remove citation markers such as [13]
+    text = re.sub(citation_pattern, "", text)
+
+    # Collapse runs of multiple spaces into one
+    text = re.sub(" +", " ", text)
+    return text
+```
+
+%% Cell type:code id:534634fd-d272-4b38-bbfa-d67fc59a3c97 tags:
+
+``` python
+dirty_text = """We show that the  Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large    and limited   training data.\n1 Introduction\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and\n*Equal contribution. Listing order is random.   Jakob proposed replacing RNNs with self-attention and started   the effort to evaluate this idea."""
+```
+
+%% Cell type:code id:0ac6e172-1a44-464c-9dfe-5c28f95f2561 tags:
+
+``` python
+dirty_text
+```
+
+%% Cell type:code id:8d446ed6-6d7e-4acc-84ee-a69527b544af tags:
+
+``` python
+cleanse_text(dirty_text)
+```
+
+%% Cell type:markdown id:d1cd4efa-9f2a-475a-b520-dfa0fa6e43a0 tags:
+
+## 2.2. Text Chunking
+
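+`Text Chunking` is the step that **splits text into semantic units** so that each vector carries clear, focused information. <br>
+Common approaches split by paragraph or by token count; for convenience, this tutorial splits by **token count**. <br>
+As with `Text Cleansing`, it is advisable to use a splitting logic suited to your purpose.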
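
For comparison with the token-count approach, a paragraph-based splitter might look like the sketch below. It assumes paragraphs are separated by blank lines, which OCR output does not guarantee. `chunk_by_paragraph` and `max_chars` are illustrative names, not part of the tutorial.

```python
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 500) -> List[str]:
    """Split on blank lines, merging short paragraphs until max_chars is reached."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit, so close the current chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate

    # Append the last remaining chunk, if any
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 300 + "\n\n" + "b" * 300 + "\n\n" + "c" * 10
print(len(chunk_by_paragraph(sample)))
```

A single paragraph longer than `max_chars` is kept whole in this sketch; combining it with a token-count split would be one way to handle that case.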
+
+%% Cell type:markdown id:a76baec9-b4d7-427a-bf98-16bd1edf2a5c tags:
+
+**OpenAI** provides the [`tiktoken`](https://github.com/openai/tiktoken) library, which returns the number of tokens in a text, to support token-based business logic. <br>
+This tutorial uses `tiktoken` to apply token-count-based chunking.
+
+%% Cell type:code id:e7f7e87f-b114-420b-8795-eb29c9f2902e tags:
+
+``` python
+#!pip install tiktoken
+```
+
+%% Cell type:code id:26043673-90a0-4eda-9430-b49c7db0466b tags:
+
+``` python
+# Import the tiktoken library
+import tiktoken
+```
+
+%% Cell type:code id:1a241f65-e5f0-4afc-be24-760443fa64f6 tags:
+
+``` python
+# Use `cl100k_base`, the encoding used by ChatGPT, as the default encoding
+tokenizer = tiktoken.get_encoding("cl100k_base")
+```
+
+%% Cell type:code id:7d6dbc1d-5a4a-4415-9f3f-34abbe02e290 tags:
+
+``` python
+# Count the number of tokens in the input text
+def num_tokens_from_text(text: str) -> int:
+    num_tokens = len(tokenizer.encode(text))
+    return num_tokens
+```
+
+%% Cell type:code id:01dd7987-f6bd-40e6-ae53-11394970f0b6 tags:
+
+``` python
+# Token count at which a chunk is closed (a chunk may slightly exceed this value)
+CHUNK_SIZE = 180
+```
+
+%% Cell type:code id:ca9d141d-dbd6-41cd-8703-31bc26bd3e05 tags:
+
+``` python
+# Split a document into chunks by token count
+def chunkify(text: str) -> List[str]:
+    lines = text.split("\n")
+
+    chunks = []
+
+    chunk = ""
+    for line in lines:
+        line = cleanse_text(line)
+        if line is None:
+            continue
+
+        chunk += f" {line}"
+
+        if num_tokens_from_text(chunk) >= CHUNK_SIZE:
+            chunks.append(chunk.strip())
+            chunk = ""
+
+    # Append the last remaining chunk, if any
+    if chunk:
+        chunks.append(chunk.strip())
+
+    return chunks
+```
+
+%% Cell type:code id:2bbe2c89-8169-49f9-829b-487c4075d092 tags:
+
+``` python
+chunks = chunkify(documents[0]["text"])
+```
+
+%% Cell type:code id:502730a8-4567-40bc-a15b-c0d9d68f31fd tags:
+
+``` python
+chunks[0]
+```
+
+%% Cell type:code id:ffda1491-f050-483b-8ef3-e57d3dd60a7b tags:
+
+``` python
+chunked_documents = []
+```
+
+%% Cell type:code id:bf741189-10b9-450f-870f-8e19ff5fba5c tags:
+
+``` python
+for document in documents:
+    chunks = chunkify(document["text"])
+    for chunk in chunks:
+        chunked_documents.append(
+            {
+                "page": document["page"],
+                "text": chunk,
+            }
+        )
+```
+
+%% Cell type:code id:717bbf98-d11e-4d01-acb2-a03dea5043ee tags:
+
+``` python
+chunked_documents[0]
+```
+
+%% Cell type:code id:7ae7a8b1-f585-4aab-bbeb-6cfc2e04ffe7 tags:
+
+``` python
+# Drop the first chunk: odd characters from the first page image caused an error
+chunked_documents.pop(0)
+```
+
+%% Cell type:code id:16f0e3c0-96ab-49ab-8142-e4dc57c3157c tags:
+
+``` python
+len(chunked_documents)
+```
+
+%% Cell type:markdown id:5fc13ce3-9435-4d50-85b7-54d3eee394af tags:
+
+# 3. Vector Search
+
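+This stage adds the documents to a vector search engine and queries it to retrieve documents that match the user's query. <br>
+It consists of `Embedding`, which vectorizes the documents, and `Hybrid Search`, which retrieves the embedded documents.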
+
+%% Cell type:markdown id:ff1e8c57-6660-4d93-ad21-1847889abd8a tags:
+
+## 3.1. Embedding
+
+![](https://raw.githubusercontent.com/liner-engineering/liner-pdf-chat-tutorial/main/images/openai-vectors.svg)<br><br>
+`Embedding` is the step that converts the text of the documents to be registered in the search engine into vectors. <br>
+Various technologies can be used for text embedding; this tutorial uses the clir-emb-dolphin embedding. <br><br>
+\* For a deeper understanding of embeddings, see this [link](http://jalammar.github.io/illustrated-word2vec/).
+
+%% Cell type:code id:6d19cfee-db54-4e3b-a56d-885537899aeb tags:
+
+``` python
+# -*- coding: utf-8 -*-
+
+import base64
+import json
+# import http.client
+import requests
+#import ssl
+#ssl._create_default_https_context = ssl._create_unverified_context
+
+
+class CompletionExecutor:
+    def __init__(self, host, api_key, api_key_primary_val, request_id):
+        self._host = host
+        self._api_key = api_key
+        self._api_key_primary_val = api_key_primary_val
+        self._request_id = request_id
+
+    def _send_request(self, completion_request):
+        headers = {
+            'X-NCP-CLOVASTUDIO-API-KEY': self._api_key,
+            'X-NCP-APIGW-API-KEY': self._api_key_primary_val,
+            'X-NCP-CLOVASTUDIO-REQUEST-ID': self._request_id,
+            'Content-Type': 'application/json; charset=utf-8'
+        }
+
+        # Initialize result variable
+        result = None
+
+        try:
+            # Use requests.post for making an HTTP POST request
+            # Base model /testapp/v1/chat-completions/HCX-002
+            # Tuning model
+            response = requests.post(
+                f"{self._host}/testapp/v1/api-tools/embedding/clir-emb-dolphin/623982be24c547eea38aebff2fb6b7d8",
+                headers=headers, json=completion_request, stream=False
+            )
+
+            # Check if the request was successful (status code 200)
+            if response.status_code == 200:
+                result = response.json()
+            else:
+                print(f"Request failed with status code: {response.status_code}")
+        except requests.RequestException as e:
+            # Handle exceptions, log, or raise accordingly
+            print(f"Request failed: {e}")
+
+        #print(result)
+
+        return result
+
+    def execute(self, completion_request):
+        res = self._send_request(completion_request)
+        # `_send_request` returns None when the HTTP request fails
+        if res is not None and res['status']['code'] == '20000':
+            return res['result']['embedding']
+        else:
+            return 'Error'
+
+
+if __name__ == '__main__':
+    completion_executor = CompletionExecutor(
+        host='https://clovastudio.apigw.ntruss.com',
+        # Keys come from environment variables; the variable names are placeholders.
+        api_key=os.environ['CLOVASTUDIO_API_KEY'],
+        api_key_primary_val=os.environ['NCP_APIGW_API_KEY'],
+        request_id='ff649bcc-4632-4fd2-9837-8695ff8f0475'
+    )
+
+    # Iterate over the prepared documents, embed each one, and attach the vector to the document
+    for chunked_document in tqdm(chunked_documents):
+        # Build the request body as a dict; no manual JSON string assembly is needed
+        request_data = {"text": str(chunked_document["text"])}
+
+        #print(request_data)
+
+        response_text = completion_executor.execute(request_data)
+
+        chunked_document["embedding"] = response_text
+```
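
Because `execute` returns the string `'Error'` when a request fails, a failed call would end up stored in the `embedding` field. A small check before the vectors are used can catch this. The sketch below is one possible way to do it: `is_valid_embedding` and `split_by_validity` are hypothetical helpers, and 1024 is the dimension this tutorial states for clir-emb-dolphin.

```python
from typing import Any, Dict, List, Tuple

EXPECTED_DIM = 1024  # dimension the tutorial states for clir-emb-dolphin


def is_valid_embedding(value: Any, dim: int = EXPECTED_DIM) -> bool:
    """Return True only for a list of `dim` numbers."""
    return (
        isinstance(value, list)
        and len(value) == dim
        and all(isinstance(x, (int, float)) for x in value)
    )


def split_by_validity(
    docs: List[Dict[str, Any]], dim: int = EXPECTED_DIM
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate documents with usable embeddings from those that need a retry."""
    ok: List[Dict[str, Any]] = []
    failed: List[Dict[str, Any]] = []
    for doc in docs:
        target = ok if is_valid_embedding(doc.get("embedding"), dim) else failed
        target.append(doc)
    return ok, failed


# Toy data with 3-dimensional embeddings for brevity
docs = [
    {"page": 1, "text": "a", "embedding": [0.1, 0.2, 0.3]},
    {"page": 1, "text": "b", "embedding": "Error"},
]
ok, failed = split_by_validity(docs, dim=3)
print(len(ok), len(failed))
```

The documents in `failed` could then be retried or skipped before anything is sent to the vector store.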
+
+%% Cell type:code id:4b87ecb8-9e65-45bf-b084-df03d28489e3 tags:
+
+``` python
+chunked_documents[0]
+```
+
+%% Cell type:markdown id:05915905-341c-4ab9-8271-b7de49fc5aa8 tags:
+
+## 3.2. Hybrid Search
+
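+`Hybrid Search` is the step that retrieves documents that can serve as references for the user's query. <br>
+As of 2023, many vector search engines are under development; this tutorial uses [**Pinecone**](https://www.pinecone.io/), which we found the most convenient to develop with at the time of writing.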
+
+%% Cell type:markdown id:b9f373f1-5850-4b24-ba78-fa246aa12571 tags:
+
+First, create the index to use in **Pinecone**. <br>
+The clir-emb-dolphin embedding returns **1024-dimensional** vectors, so enter that value under `Dimensions` and choose the metric you want to use for similarity search under `Metric`.
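
Pinecone offers cosine, dot product, and Euclidean metrics. To make the choice more concrete, the sketch below computes all three in plain Python for small vectors. It only illustrates what each metric measures and is not a description of how Pinecone computes them internally; the function names are illustrative.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors after normalizing their lengths."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 2.0]
doc = [2.0, 4.0]  # same direction as the query, twice the length
print(dot_product(query, doc))
print(cosine_similarity(query, doc))
print(euclidean_distance(query, doc))
```

Cosine similarity ignores vector length, which is why it is a common default for text embeddings; dot product and Euclidean distance are sensitive to it.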
+
+%% Cell type:code id:4b2d7770-410e-4ad6-afce-0be94fb303f3 tags:
+
+``` python
+# Import the pinecone library
+import pinecone
+from pinecone import Pinecone, ServerlessSpec
+```
+
+%% Cell type:code id:b2207b92-11ad-4cb3-a6c6-2b260316aaa1 tags:
+
+``` python
+# Set the pinecone API key
+pc = Pinecone(
+        api_key=os.environ["PINECONE_API_KEY"]  # read from an environment variable instead of hardcoding
+    )
+```
+
+%% Cell type:code id:916ba29e-0a39-45ab-a8f6-bc1c7aeba216 tags:
+
+``` python
+# List the indexes registered in pinecone
+active_indexes = pc.list_indexes()
+```
+
+%% Cell type:markdown id:755852f4-ed53-4589-9722-d4535bc98bd2 tags:
+
+The variable below should contain the **Pinecone Index** created earlier.
+
+%% Cell type:code id:0845c672-a1b5-468f-af47-1fdbb5a13083 tags:
+
+``` python
+active_indexes
+```
+
+%% Cell type:markdown id:c1cbf8be-7cc8-40ec-a04d-562e00cdef4e tags:
+
+Now, to register the document data in **Pinecone**, build the vector data as tuples.
+
+%% Cell type:code id:f1ac3252-2f57-4762-8da3-64eb73e182f8 tags:
+
+``` python
+vectors = [
+    (
+        f"vec{str(i)}",                  # document ID
+        chunked_document["embedding"],   # vector
+        {                                # document metadata dictionary
+            "text": chunked_document["text"],
+            "page": chunked_document["page"],
+            "file": FILE_NAME,
+        },
+    )
+    for i, chunked_document in enumerate(chunked_documents)
+]
+```
+
+%% Cell type:code id:70d0c694-a237-4dd4-b51b-629085a23211 tags:
+
+``` python
+vectors[0]
+```
+
+%% Cell type:code id:8b003c6e-6d68-4a59-a926-007e3ca20665 tags:
+
+``` python
+len(vectors)
+```
+
+%% Cell type:code id:ad664fa5-9ccf-4bf2-a634-b90765afc1a6 tags:
+
+``` python
+# Select the index
+index = pc.Index("newclass")
+```
+
+%% Cell type:code id:e7ced2c5-4d58-4088-b559-bd627cdcd66d tags:
+
+``` python
+# Upsert the vector data created above into the selected index
+index.upsert(
+    vectors=vectors,
+    namespace="pdf_vectors",
+)
+```
+
+%% Cell type:code id:ec16a1c5-810a-4f38-b029-9dd58642367b tags:
+
+``` python
+# Embed a user query
+def query_embed(text: str) -> List[float]:
+    # Build the request body as a dict; no manual JSON string assembly is needed
+    request_data = {"text": str(text)}
+    response_text = completion_executor.execute(request_data)
+
+    return response_text
+```
+
+%% Cell type:code id:083c4121-1189-4ad0-8ad9-e7a9b27e848d tags:
+
+``` python
+# Embed the user query
+query_vector = query_embed("What advantages do transformers have over RNNs?")
+```
+
+%% Cell type:code id:ce985435-74df-450d-8f90-ce6eb0b2bb38 tags:
+
+``` python
+# Hybrid Search using the user query vector and the `filter` logic
+query_response = index.query(
+    namespace="pdf_vectors",
+    top_k=10,
+    include_values=True,
+    include_metadata=True,
+    vector=query_vector,
+    filter={
+        "file": {"$in": [FILE_NAME]},
+    }
+)
+```
+
+%% Cell type:markdown id:7483cb93-ee61-4933-b0d0-19c399928d67 tags:
+
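+Note that the code above passes a `filter` argument in addition to `query_vector`. <br>
+The `filter` argument restricts the search to document data with a specific **file** name and returns the most similar documents among them, which is what enables **Hybrid Search**.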
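
The effect of combining a metadata filter with vector similarity can be sketched in memory as follows. This is only an illustration of the semantics, assuming a simple filter-then-rank order; it says nothing about how Pinecone executes the query. `filtered_search` and the toy records are hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    records: List[Dict[str, Any]],
    query_vector: Sequence[float],
    file_names: List[str],
    top_k: int = 10,
) -> List[Dict[str, Any]]:
    # 1) Metadata filter, similar in spirit to {"file": {"$in": file_names}}
    candidates = [r for r in records if r["metadata"]["file"] in file_names]
    # 2) Rank the remaining records by similarity to the query vector
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["values"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


records = [
    {"id": "vec0", "values": [1.0, 0.0], "metadata": {"file": "a.pdf", "page": 1}},
    {"id": "vec1", "values": [0.0, 1.0], "metadata": {"file": "a.pdf", "page": 2}},
    {"id": "vec2", "values": [1.0, 0.0], "metadata": {"file": "b.pdf", "page": 1}},
]
matches = filtered_search(records, [0.9, 0.1], ["a.pdf"], top_k=1)
print(matches[0]["id"])
```

Here `vec2` is just as close to the query as `vec0`, but it is excluded because its `file` value does not pass the filter.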
+
+%% Cell type:code id:a4489ff1-2b8f-4c4f-8f00-3955dee08403 tags:
+
+``` python
+# Inspect the top-1 match
+query_response["matches"][0]["metadata"]
+```
+
+%% Cell type:code id:49f52a7e-1a9c-43c4-b6bf-c829b52d29ed tags:
+
+``` python
+# Check which page of the PDF the evidence was found on
+images[int(query_response["matches"][0]["metadata"]["page"])-1]
+```
+
+%% Cell type:markdown id:3b550690-0633-4d5b-a7b5-f9cb348201c6 tags:
+
+# Simple PDF Chat
+
+The logic needed for the basic functionality is now in place. <br>
+Finally, implementing the function below combines all of the steps above into a single routine.
+
+%% Cell type:code id:b1932ef7-fe5b-455c-a5e2-c6f55284a844 tags:
+
+``` python
+import requests
+
+class Child_CompletionExecutor(CompletionExecutor):
+    def _send_request(self, completion_request):
+        headers = {
+            'X-NCP-CLOVASTUDIO-API-KEY': self._api_key,
+            'X-NCP-APIGW-API-KEY': self._api_key_primary_val,
+            'X-NCP-CLOVASTUDIO-REQUEST-ID': self._request_id,
+            'Content-Type': 'application/json; charset=utf-8'
+        }
+
+        # Initialize result variable
+        result = None
+
+        try:
+            # Use requests.post for making an HTTP POST request
+            # Base model /testapp/v1/chat-completions/HCX-002
+            # Tuning model
+            response = requests.post(
+                f"{self._host}/testapp/v1/chat-completions/HCX-002",
+                headers=headers, json=completion_request, stream=False
+            )
+
+            # Check if the request was successful (status code 200)
+            if response.status_code == 200:
+                result = response.json()
+            else:
+                print(f"Request failed with status code: {response.status_code}")
+        except requests.RequestException as e:
+            # Handle exceptions, log, or raise accordingly
+            print(f"Request failed: {e}")
+
+        return result
+
+
+    def execute(self, completion_request):
+        res = self._send_request(completion_request)
+
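+        if res is None:
+            # The request failed, so there is no response body to inspect.
+            return 'Error'
+        if res['status']['code'] == '40103':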
+            # Check whether the token has expired and reissue the token.
+            self._access_token = None
+            return self.execute(completion_request)
+        elif res['status']['code'] == '20000':
+            return res['result']['message']['content']
+        else:
+            return 'Error'
+```
+
+%% Cell type:code id:de614c65-bcc0-435b-8ab0-e0ac8200b144 tags:
+
+``` python
+def pdf_chat(query: str) -> str:
+    # 1. Embed the user query
+    query_vector = query_embed(query)
+
+    # 2. Retrieve reference documents via Hybrid Search
+    query_response = index.query(
+        namespace="pdf_vectors",
+        top_k=10,
+        include_values=True,
+        include_metadata=True,
+        vector=query_vector,
+        filter={
+            "file": {"$in": [FILE_NAME]},
+        },
+    )
+
+    reference = query_response["matches"][0]["metadata"]
+
+    # 3. Put the reference document and the user query into the prompt
+    child_completion_executor = Child_CompletionExecutor(
+        host='https://clovastudio.stream.ntruss.com',
+        # Keys come from environment variables; the variable names are placeholders.
+        api_key=os.environ['CLOVASTUDIO_API_KEY'],
+        api_key_primary_val=os.environ['NCP_APIGW_API_KEY'],
+        request_id='ff649bcc-4632-4fd2-9837-8695ff8f0475'
+    )
+
+    preset_text = [{"role": "system", "content": "\n".join([
+                "Your role is to answer the user's query based on the references provided.",
+                "You must base your answer solely on the references, regardless of your own knowledge, and you must include the page information in your answer.",
+            ])},
+            {"role": "system", "content": f"reference: {reference['text']}, page: ({int(reference['page'])})"},
+            {"role": "user", "content": query}]
+
+    request_data = {
+        'messages': preset_text,
+        'topP': 0.8,
+        'topK': 0,
+        'maxTokens': 2048,
+        'temperature': 0.5,
+        'repeatPenalty': 5.0,
+        'stopBefore': [],
+        'includeAiFilters': True
+    }
+
+    # 4. Return the answer generated by the LLM
+    response_text = child_completion_executor.execute(request_data)
+
+    return response_text
+```
+
+%% Cell type:code id:f5593d7d-59ed-467e-81e3-b8d0fbcae341 tags:
+
+``` python
+response = pdf_chat("What advantages do transformers have over RNNs? Please answer in Korean.")
+```
+
+%% Cell type:code id:16f30cc6-17eb-4841-b18f-1431a17ab480 tags:
+
+``` python
+response
+```
+
+%% Cell type:markdown id:41eef846-3b4e-4784-9286-44677f360e5a tags:
+
+# Future Work
+
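+In addition to the features covered in this tutorial, you can try implementing the following.
+
+- **Multi-turn Chat**: Manage and use the conversation history to carry on a multi-turn conversation rather than a single-turn one.
+- **Query Refinement**: Add logic that refines and enriches the query to improve retrieval.
+- **Term-based Search**: Add keyword-based search alongside filter- and vector-based search.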
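
As a starting point for the multi-turn idea, the sketch below keeps a bounded conversation history and builds a `messages` list in the same role/content shape that `pdf_chat` uses. `ChatHistory` is a hypothetical helper, and the use of an `assistant` role for previous answers is an assumption about the chat API rather than something this tutorial shows.

```python
from typing import Dict, List

Message = Dict[str, str]


class ChatHistory:
    """Keeps the most recent turns and builds the `messages` list for a request."""

    def __init__(self, system_prompt: str, max_turns: int = 3) -> None:
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self._system_prompt = system_prompt
        self._max_turns = max_turns
        self._turns: List[Message] = []

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self._turns.append({"role": "user", "content": user_text})
        self._turns.append({"role": "assistant", "content": assistant_text})
        # Keep only the last `max_turns` user/assistant pairs
        self._turns = self._turns[-2 * self._max_turns:]

    def build_messages(self, reference: str, query: str) -> List[Message]:
        return [
            {"role": "system", "content": self._system_prompt},
            {"role": "system", "content": f"reference: {reference}"},
            *self._turns,
            {"role": "user", "content": query},
        ]


history = ChatHistory("Answer based on the references provided.", max_turns=1)
history.add_turn("What is a Transformer?", "A model based on attention.")
history.add_turn("Who proposed it?", "Vaswani et al.")
messages = history.build_messages("(retrieved text)", "When was it published?")
print([m["role"] for m in messages])
```

Bounding the history keeps the prompt within the model's token limit; summarizing older turns would be another option.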
+
+%% Cell type:markdown id:b35e47ed-d850-4780-aaa6-9b88f53b2fd6 tags:
+
+# Future Work 2
+
+- **google ocr -> NCP OCR**: Replace Google OCR with the Naver Cloud Platform OCR (an EST test account is available).
+- **apache tika**: Check whether OCR is really necessary when extracting text from documents.
+
+%% Cell type:markdown id:a36130dc-b672-4061-8ed2-f7116521fadb tags:
+
+# Caution KEYS
+
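+- **google ocr, pinecone**: These are my personal keys. Please do not expose them.
+- **NCP HCX**: This is EST's test account. We were told to use it freely, but it should not be overused. (If you would like to explore NCP, ask me for the account details.)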
--- a/sample_data/transformer.pdf
+++ b/sample_data/transformer.pdf
--- a/sample_data/triple-baton-420805-d94cd9668436.json
+++ b/sample_data/triple-baton-420805-d94cd9668436.json
+{
+  "type": "service_account",
+  "project_id": "triple-baton-420805",
+  "private_key_id": "d94cd96684367b7db48fabd0e51c709aa1f6d99d",
+  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQCMPAe7wty8M2Cg\nOolyfCv3wy9dn0diQ0D9gNzMwHh7h46zhfJ1z8G7MZRUtc57gr34lhYlSYnTrXE2\nUM/JKmf7z3j6QRiDzHDi2DihZgblQVDp8hNyAqkBOq7ZSfscWJmjHaz8fFjA3YAm\nEIvb7tO0Or9s8SX3BypanPXVkvquTv1WoQAuwBKQLU6C3SNY07RbedDu4uyVvGZO\noEo+hQt+sfvQneMCk3vXb2vE3QaYa5F5rVu2VjWpd9t4Xh7jzndI2BhCqUqxFtFH\nPmxmmN54WC47jSX0FCWf8C5+8dgj6gCEAiyFp7yVk/kPHe0UUB5DvMvreEiAynq/\nDWOEjL0HAgMBAAECggEADL3FjBs72AOpA1XeOB8tFYFP8+ctYrGunXnQVfAk4kBi\nSFBiw66BMSNjkUDFhnZOEWB9mZyxX6CyGRfFkUb/lKL4oHA6rHruRMYVeyCcfsbs\n9ZyPhLvWJCzzRv3QSXaJWwcuuPAJVliptIurUWvFI2p1Cw5r/yJRCFObiHCmwyB6\ns6KFctHZrVh6NB7x9GYih1P9fz42FDlQNE4N/10uHxddz3Ghk4vha5wibWFbekH8\nqQT2csRV+61Nohqla8xrb3GDWb9vUw0lUF8/k8lGGgLC0jOCukFwUCidxlaENIVa\necumHJD45pviDiw/EUeZx8xE2vMsCMmY2ej3jhioVQKBgQDEJXR3cBB8h2ghcrAL\nBXoWEb3cxddltmiIv1l1YqShXDPfckGPi6WvgcsLz4e1a7ZbmY/00S92GtNpCs5j\nwG2lxXsbg0AO5mjatVVZ5jFSNUARioWuz3/bBNk5Mi1PRa0FSvo8qUWy3TCtIE0I\n8q0vLDbf9acW5IRwfV1T063hCwKBgQC3BtwMWjL4zTJ7m0tLTNU0NIQ4oz4ojJfB\nB2icWTbUnbBPG652c/bZfZ942bg/1/iE388Bkq3dEjWONIqHHpHib8hDTxoUtkJh\nNmtXoeF1YLNYLGzo4948ZH3zN3lLqpn5hKZY+ACNrs9PBqR4J3RpzCIhC/qNrlir\n4yeVHbKJdQKBgCjYH9OLO3OjArUMW8o/vrd/xEiHzh25CTWImwlNnDiZqZebBDnu\n+3Z7kZuJAJpvro6OgKKbOMXgOivCe03cUTjW0ZbeEuXHZwg8AGTzAUw8GHZOoR3Q\nybAC53T4lOTP/oJ+pXMiUIg5dRxoAIKffh63l0m1rrCer1F5WYjOKIQXAoGAMla/\npOIWDNobHWYL4m0CYrZi+1TinrJ0dpG8EuxyqS2ptUhOxqOEbDMh7lIrW9vhrWIF\nBFC8YwZEFpWa2CjvRNEryl9yM+og/a3C/jo20VrEWOb3GWK61+9nuMI0KTyF1tvG\nCMhFFrLSr9CK4cUwPnz3khFCWz9tgfEbDOc7GJUCgYAsehB2yqUkC1BadUfY3/Q2\n1X36njaZ0GQRGVLxYOsJe2Uk42kC/5hEoGhAXjBGZaVZ2n+uChY33KZzHjBGzQZw\ni8rHuJFUVebISaKS7SfDJtVHr0mqn7bpVGk5t912QfipqH9RkPtMXjz+HlbgoIky\nnoGGDq72q8BT2a7eBU7P3Q==\n-----END PRIVATE KEY-----\n",
+  "client_email": "id-939@triple-baton-420805.iam.gserviceaccount.com",
+  "client_id": "110366945840597444534",
+  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
+  "token_uri": "https://oauth2.googleapis.com/token",
+  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
+  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/id-939%40triple-baton-420805.iam.gserviceaccount.com",
+  "universe_domain": "googleapis.com"
+}