# Corporate Document Navigator: AI‑Powered Policy Q&A


## Introduction & Problem Statement

I’ve spent countless hours digging through lengthy PDF manuals just to answer simple policy questions—“How many sick days do I have?” or “What’s our remote work allowance?” In a fast‑moving corporate environment, that’s time I can't afford to lose. My goal was to build an AI assistant that:


- **Understands** our policy documents  

- **Retrieves** only the most relevant passages  

- **Generates** concise, structured answers with clear citations  


In this post, I’ll walk you through how I built, evaluated, and fine‑tuned my **Corporate Document Navigator**—a Kaggle notebook that anyone in our organisation can run to get instant, accurate answers.


# Corporate PDF Document Navigator: A Step‑by‑Step Guide


Follow these steps to build your own AI‑powered policy Q&A assistant in a Kaggle Notebook, using Google’s Generative AI API, FAISS for semantic search, and an LLM‑based evaluator.


---


## Prerequisites


1. **Kaggle account** with access to create and run Notebooks.  

2. **Google Cloud project** with the Generative AI API enabled and an API key.  


   - Store your API key in Kaggle as a Secret named `GOOGLE_API_KEY`.  

3. **Dataset**  

   - A PDF of your policy manual uploaded as a Kaggle Dataset (e.g. `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`).


---


## 1. Create Your Kaggle Notebook


1. Go to **Kaggle → Notebooks → New Notebook**.  

2. Under **Data**, click **Add data**, search for `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`, and attach it.  

3. Confirm that `/kaggle/input/MY_SAMPLE_DOC.PDF/` now exists by running:

   ```python
   import os
   print(os.listdir("/kaggle/input/MY_SAMPLE_DOC.PDF"))
   ```

## Solution Architecture

The notebook pipeline has four stages: extract text from the PDF with PyPDF2, split it into overlapping chunks, embed the chunks and index them with FAISS for semantic search, then pass retrieved context to Gemini for answer generation and to a second Gemini call for LLM‑based evaluation.
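Before wiring up the real libraries, the flow is easy to see end to end. The sketch below is a toy stand‑in: plain‑Python stubs replace PyPDF2, FAISS, and Gemini, and all names here are illustrative, not part of the notebook.

```python
# Toy end-to-end sketch of the pipeline: extract -> chunk -> retrieve -> answer.
# Each stub stands in for the real component named in the comment.
def extract(pdf_text):              # PyPDF2 text extraction (section 3.1)
    return pdf_text

def chunk(text):                    # smart_chunk (section 3.2)
    return text.split(". ")

def retrieve(chunks, question):     # FAISS semantic search (section 5)
    words = question.lower().split()
    return [c for c in chunks if any(w in c.lower() for w in words)]

def answer(context, question):      # Gemini answer generation (section 6)
    return {"answer": context[0] if context else "Not specified", "source": "Manual"}

doc = "Sick leave is 10 days per year. Remote work is allowed 2 days per week."
chunks = chunk(extract(doc))
print(answer(retrieve(chunks, "sick leave days"), "How many sick days?"))
```

The real notebook swaps each stub for the library call, but the data flow stays exactly this shape.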
## 2. Environment Setup

Add a cell with the following to install libraries, fetch your API key, and configure GenAI:

```python
# Install libraries
!pip install PyPDF2 faiss-cpu sentence-transformers google-generativeai ipywidgets --quiet

# Imports & API key
import os, json, pickle, faiss, numpy as np
import PyPDF2, re
from sentence_transformers import SentenceTransformer
from kaggle_secrets import UserSecretsClient
import google.generativeai as genai

# Silence tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load and configure API key
api_key = UserSecretsClient().get_secret("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("Set your GOOGLE_API_KEY in Kaggle Secrets.")
genai.configure(api_key=api_key)

# Instantiate the GenAI models (one for answering, one for evaluation)
gen_model   = genai.GenerativeModel(model_name="models/gemini-2.0-flash")
judge_model = genai.GenerativeModel(model_name="models/gemini-2.0-flash")

print("✅ Environment ready.")
```


## 3. Extract and Chunk Your PDF

### 3.1 Text Extraction

```python
def extract_text(path):
    reader = PyPDF2.PdfReader(path)
    # Some pages return None from extract_text(); treat those as empty
    return "".join(p.extract_text() or "" for p in reader.pages)

dataset_folder = "/kaggle/input/MY_SAMPLE_DOC.PDF"
pdf_file = os.path.join(dataset_folder, "MY_SAMPLE_DOC.PDF")
full_text = extract_text(pdf_file)
print(f"Extracted {len(full_text)} characters from PDF.")
```


### 3.2 Smart Chunking

```python
import nltk
nltk.download("punkt", quiet=True)  # sentence tokenizer model for sent_tokenize
from nltk.tokenize import sent_tokenize

def smart_chunk(text, max_tokens=300, overlap=50):
    sentences = sent_tokenize(text)
    chunks, cur, length = [], [], 0
    for sent in sentences:
        tok = len(sent.split())
        if length + tok > max_tokens:
            chunks.append(" ".join(cur))
            # Carry the last `overlap` sentences into the next chunk for context
            cur = cur[-overlap:]
            length = sum(len(s.split()) for s in cur)
        cur.append(sent)
        length += tok
    if cur:
        chunks.append(" ".join(cur))
    return chunks

chunks = smart_chunk(full_text)
print(f"Created {len(chunks)} chunks.")
```
## 4. Build & Cache Embeddings + FAISS Index

```python
CACHE_PATH = "/kaggle/working/embed_cache.pkl"

def load_or_build_index(chunks, folder):
    # Key the cache on the newest modification time in the dataset folder
    mtime = max(os.path.getmtime(os.path.join(folder, f)) for f in os.listdir(folder))
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            cache = pickle.load(f)
        if cache["mtime"] == mtime:
            return cache["embeddings"], faiss.deserialize_index(cache["index"])
    # Compute embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embs = np.array(model.encode(chunks, show_progress_bar=True), dtype="float32")
    idx = faiss.IndexFlatL2(embs.shape[1])
    idx.add(embs)
    # FAISS index objects aren't directly picklable; serialize to bytes first
    with open(CACHE_PATH, "wb") as f:
        pickle.dump({"mtime": mtime, "embeddings": embs,
                     "index": faiss.serialize_index(idx)}, f)
    return embs, idx

chunk_embeddings, index = load_or_build_index(chunks, dataset_folder)
print(f"FAISS index built with {index.ntotal} vectors.")
```
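The caching pattern here (key the cache on the source file's mtime, rebuild on mismatch) can be checked on its own with stdlib pieces; `cached_compute` is an illustrative name, not part of the notebook:

```python
import os, pickle, tempfile

# Recompute only when the source file's modification time changes;
# otherwise return the pickled value from a previous run.
def cached_compute(src_path, cache_path, compute):
    mtime = os.path.getmtime(src_path)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
        if cache["mtime"] == mtime:
            return cache["value"], True   # cache hit
    value = compute()
    with open(cache_path, "wb") as f:
        pickle.dump({"mtime": mtime, "value": value}, f)
    return value, False                   # cache miss: rebuilt and stored

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "doc.txt")
    cache = os.path.join(d, "cache.pkl")
    with open(src, "w") as f:
        f.write("policy text")
    v1, hit1 = cached_compute(src, cache, lambda: "embeddings")
    v2, hit2 = cached_compute(src, cache, lambda: "embeddings")
    print(hit1, hit2)  # False True
```

The same logic drives `load_or_build_index`: the first run pays the embedding cost, later runs load instantly, and replacing the PDF (new mtime) forces a rebuild.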

## 5. Retrieve Relevant Context

```python
# Queries must be embedded with the same model used for the chunks
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_chunks(question, k=3):
    q_emb = np.array(embed_model.encode([question]), dtype="float32")
    _, ids = index.search(q_emb, k)
    return [chunks[i] for i in ids[0]]

# Test retrieval
ctx = retrieve_chunks("How many sick days do I have?")
print("\n".join(ctx[:2]))
```
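`faiss.IndexFlatL2` performs exhaustive Euclidean‑distance search; conceptually it reduces to the pure‑Python version below (toy 2‑D vectors and all names are illustrative):

```python
import math

# Exhaustive L2 nearest-neighbor search: compute the distance from the
# query to every stored vector and return the indices of the k closest.
def l2_search(stored, query, k=2):
    dists = [(math.dist(query, v), i) for i, v in enumerate(stored)]
    return [i for _, i in sorted(dists)[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
print(l2_search(vectors, [0.0, 0.1]))  # nearest first: [0, 2]
```

FAISS does the same thing over 384‑dimensional MiniLM embeddings, just vectorized in C++, which is why `retrieve_chunks` returns the chunks whose meaning sits closest to the question.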


## 6. Generate a JSON‑Formatted Answer

```python
# Literal braces in the examples are doubled ({{ }}) so that str.format
# leaves them intact and only fills the {context} and {question} slots.
FEW_SHOT = """
You are a helpful AI assistant for corporate policy navigation.

Examples:
Q: What is our remote work policy?
A: {{"answer":"Remote up to 2 days/week","source":"Manual, Section 4"}}
Q: How many sick leave days?
A: {{"answer":"10 days/year","source":"Manual, Section 3.2"}}

Context:
{context}

If a value is missing, respond with {{"answer":"Not specified","source":"Manual"}}.

Q: {question}
"""

def answer_question(question):
    context = "\n\n".join(retrieve_chunks(question))
    prompt = FEW_SHOT.format(context=context, question=question)
    res = gen_model.generate_content(prompt).text.strip()
    # Pull the first {...} span out of the raw reply before parsing
    m = re.search(r'\{[\s\S]*\}', res)
    data = m.group(0) if m else res
    return json.loads(data)

# Demo
print(answer_question("What is our annual leave allowance?"))
```
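Models often wrap JSON in prose or code fences even when asked not to, which is why `answer_question` extracts the first `{...}` span before parsing. That step in isolation, on a made‑up reply:

```python
import json, re

# A typical model reply: chatty preamble plus a fenced JSON payload
raw = 'Sure! Here is the answer:\n```json\n{"answer": "10 days/year", "source": "Manual, Section 3.2"}\n```'

# Grab everything from the first "{" to the last "}" and parse it
m = re.search(r'\{[\s\S]*\}', raw)
data = json.loads(m.group(0))
print(data["answer"])  # 10 days/year
```

If no brace span is found, `answer_question` falls back to parsing the raw text, so a bare‑JSON reply still works.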


## 7. Evaluate Answer Quality

```python
# As above, literal braces are doubled so str.format only fills the
# {question}, {context}, and {answer} placeholders.
EVAL_PROMPT = """
You are an expert evaluator using this rubric (1–5): 1: Very poor … 5: Very good.

Steps:
1. Instruction Following
2. Groundedness
3. Completeness
4. Fluency

Output JSON: {{"score":<1–5>,"explanation":<string>}}

Question: {question}
Context: {context}
Answer: {answer}
"""

def evaluate(question):
    context = "\n\n".join(retrieve_chunks(question))
    ans = answer_question(question)["answer"]
    prompt = EVAL_PROMPT.format(question=question, context=context, answer=ans)
    raw = judge_model.generate_content(prompt).text
    # Strip code fences, then slice from the first "{" to the last "}"
    clean = raw.replace("```", "")
    js = clean[clean.find("{"):clean.rfind("}") + 1]
    return json.loads(js)

# Demo
print(evaluate("What is our remote work policy?"))
```
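The evaluator uses a slightly different cleanup than section 6: strip backtick fences first, then slice between the outermost braces. Checked on a made‑up judge reply:

```python
import json

# A typical judge reply, fenced as a json code block
raw = '```json\n{"score": 4, "explanation": "Grounded and complete."}\n```'

# Remove the fences, then slice from the first "{" to the last "}"
clean = raw.replace("```", "")
js = clean[clean.find("{"):clean.rfind("}") + 1]
print(json.loads(js)["score"])  # 4
```

Either approach works here; the slice version also tolerates a stray `json` language tag left behind after the fences are removed.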


## 8. Interactive Q&A

```python
import ipywidgets as widgets
from IPython.display import display

question_box = widgets.Text(placeholder="Ask a question…", continuous_update=False)
output_box = widgets.Output()

def on_change(change):
    if change["name"] == "value":
        with output_box:
            output_box.clear_output()
            print(answer_question(change["new"]))

question_box.observe(on_change, names="value")
display(question_box, output_box)
```

## 9. Limitations & Next Steps

- **Chunk Boundaries:** Tables and multi‑column layouts can split awkwardly across chunks.
- **Cache Invalidation:** Edits that leave the file's timestamp unchanged won't trigger a rebuild until you clear the cache.


## 10. Try It Yourself

1. Copy the code and run it in your own Kaggle environment.
2. Attach your own policy PDF as a Kaggle Dataset.
3. Set your `GOOGLE_API_KEY` in Kaggle Secrets.
4. Run all cells and start asking policy questions in seconds!




