Corporate PDF Document Navigator
Corporate Document Navigator: AI‑Powered Policy Q&A
## Introduction & Problem Statement
I’ve spent countless hours digging through lengthy PDF manuals just to answer simple policy questions such as “How many sick days do I have?” or “What’s our remote work allowance?” In a fast‑moving corporate environment, that’s time I can’t afford to lose. My goal was to build an AI assistant that:
- **Understands** our policy documents
- **Retrieves** only the most relevant passages
- **Generates** concise, structured answers with clear citations
In this post, I’ll walk you through how I built, evaluated, and fine‑tuned my **Corporate Document Navigator**—a Kaggle notebook that anyone in our organisation can run to get instant, accurate answers.
# Corporate PDF Document Navigator: A Step‑by‑Step Guide
Follow these steps to build your own AI‑powered policy Q&A assistant in a Kaggle Notebook, using Google’s Generative AI API, FAISS for semantic search, and an LLM‑based evaluator.
---
## Prerequisites
1. **Kaggle account** with access to create and run Notebooks.
2. **Google Cloud project** with the Generative AI API enabled and an API key.
   - Store your API key in Kaggle as a Secret named `GOOGLE_API_KEY`.
3. **Dataset**
   - A PDF of your policy manual uploaded as a Kaggle Dataset (e.g. `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`).
---
## 1. Create Your Kaggle Notebook
1. Go to **Kaggle → Notebooks → New Notebook**.
2. Under **Data**, click **Add data**, search for `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`, and attach it.
3. Confirm that `/kaggle/input/MY_SAMPLE_DOC.PDF/` now exists by running:
```python
import os
print(os.listdir("/kaggle/input/MY_SAMPLE_DOC.PDF"))
```
## 2. Environment Setup
Add a cell with the following to install libraries, fetch your API key, and configure GenAI:
```python
# Install libraries
!pip install PyPDF2 faiss-cpu sentence-transformers google-generativeai ipywidgets --quiet

# Imports & API key
import os, json, pickle, faiss, numpy as np
import PyPDF2, re
from sentence_transformers import SentenceTransformer
from kaggle_secrets import UserSecretsClient
import google.generativeai as genai

# Silence tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load and configure API key
api_key = UserSecretsClient().get_secret("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("Set your GOOGLE_API_KEY in Kaggle Secrets.")
genai.configure(api_key=api_key)

# Instantiate two model handles: one for answering, one for evaluation
gen_model = genai.GenerativeModel(model_name="models/gemini-2.0-flash")
judge_model = genai.GenerativeModel(model_name="models/gemini-2.0-flash")
print("✅ Environment ready.")
```
## 3. Extract and Chunk Your PDF
### 3.1 Text Extraction
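A minimal sketch of what this step might look like, assuming PyPDF2's `PdfReader` is used for extraction and a simple word-window chunker for splitting. The helper names (`extract_pages`, `clean_text`, `chunk_words`) and the default chunk size and overlap are hypothetical, chosen for illustration rather than taken from the original notebook.

```python
import re


def extract_pages(pdf_path):
    """Return one text string per page of the PDF at pdf_path."""
    # Imported lazily so the pure-Python helpers below work without PyPDF2.
    import PyPDF2

    reader = PyPDF2.PdfReader(pdf_path)
    # extract_text() can come back empty for scanned or image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def clean_text(raw):
    """Re-join words hyphenated across line breaks and collapse whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of chunk_size words that share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

In a notebook, the pieces could be combined as `chunks = chunk_words(clean_text(" ".join(extract_pages(path))))`, where `path` points at the attached PDF. The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.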
## 5. Retrieve Relevant Context
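To illustrate the idea behind retrieval, here is a small pure-Python sketch that ranks chunks by cosine similarity to a query vector. It is a stand-in rather than the notebook's FAISS code: a flat inner-product FAISS index over L2-normalised embeddings performs the same exhaustive ranking, only much faster. The function names and the toy three-dimensional vectors are made up for illustration; in practice the vectors would come from a sentence-embedding model.

```python
import math


def normalise(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = normalise(query_vec)
    scored = []
    for vec, chunk in zip(chunk_vecs, chunks):
        # The dot product of two unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalise(vec)))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data, invented purely to show the call pattern.
chunks = ["chunk about sick leave", "chunk about remote work", "chunk about both"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]

results = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

Only the top-scoring chunks would then be passed to the model as context, which keeps the prompt short and makes it possible to cite the passages an answer was drawn from.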