# Corporate Document Navigator: AI‑Powered Policy Q&A


## Introduction & Problem Statement

I’ve spent countless hours digging through lengthy PDF manuals just to answer simple policy questions—“How many sick days do I have?” or “What’s our remote work allowance?” In a fast‑moving corporate environment, that’s time I can't afford to lose. My goal was to build an AI assistant that:


- **Understands** our policy documents  

- **Retrieves** only the most relevant passages  

- **Generates** concise, structured answers with clear citations  


In this post, I’ll walk you through how I built, evaluated, and fine‑tuned my **Corporate Document Navigator**—a Kaggle notebook that anyone in our organisation can run to get instant, accurate answers.


# Corporate PDF Document Navigator: A Step‑by‑Step Guide


Follow these steps to build your own AI‑powered policy Q&A assistant in a Kaggle Notebook, using Google’s Generative AI API, FAISS for semantic search, and an LLM‑based evaluator.


---


## Prerequisites


1. **Kaggle account** with access to create and run Notebooks.  

2. **Google Cloud project** with the Generative AI API enabled and an API key.  


   - Store your API key in Kaggle as a Secret named `GOOGLE_API_KEY`.  

3. **Dataset**  

   - A PDF of your policy manual uploaded as a Kaggle Dataset (e.g. `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`).


---


## 1. Create Your Kaggle Notebook


1. Go to **Kaggle → Notebooks → New Notebook**.  

2. Under **Data**, click **Add data**, search for `MY_KAGGLE_ACCOUNT/MY_SAMPLE_DOC.PDF`, and attach it.  

3. Confirm that `/kaggle/input/MY_SAMPLE_DOC.PDF/` now exists by running:

   ```python
   import os
   print(os.listdir("/kaggle/input/MY_SAMPLE_DOC.PDF"))
   ```

## Solution Architecture

The notebook pipeline has four stages: extract text from the PDF with PyPDF2, split it into overlapping chunks, embed the chunks and index them with FAISS for semantic search, then pass retrieved context to Gemini for answer generation and to a second Gemini call for LLM‑based evaluation.
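Before wiring up the real libraries, the flow is easy to see end to end. The sketch below is a toy stand‑in: plain‑Python stubs replace PyPDF2, FAISS, and Gemini, and all names here are illustrative, not part of the notebook.

```python
# Toy end-to-end sketch of the pipeline: extract -> chunk -> retrieve -> answer.
# Each stub stands in for the real component named in the comment.
def extract(pdf_text):              # PyPDF2 text extraction (section 3.1)
    return pdf_text

def chunk(text):                    # smart_chunk (section 3.2)
    return text.split(". ")

def retrieve(chunks, question):     # FAISS semantic search (section 5)
    words = question.lower().split()
    return [c for c in chunks if any(w in c.lower() for w in words)]

def answer(context, question):      # Gemini answer generation (section 6)
    return {"answer": context[0] if context else "Not specified", "source": "Manual"}

doc = "Sick leave is 10 days per year. Remote work is allowed 2 days per week."
chunks = chunk(extract(doc))
print(answer(retrieve(chunks, "sick leave days"), "How many sick days?"))
```

The real notebook swaps each stub for the library call, but the data flow stays exactly this shape.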
## 2. Environment Setup

Add a cell with the following to install libraries, fetch your API key, and configure GenAI:

```python
# Install libraries
!pip install PyPDF2 faiss-cpu sentence-transformers google-generativeai ipywidgets --quiet

# Imports & API key
import os, json, pickle, faiss, numpy as np
import PyPDF2, re
from sentence_transformers import SentenceTransformer
from kaggle_secrets import UserSecretsClient
import google.generativeai as genai

# Silence tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load and configure API key
api_key = UserSecretsClient().get_secret("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("Set your GOOGLE_API_KEY in Kaggle Secrets.")
genai.configure(api_key=api_key)

# Instantiate the GenAI models (one for answering, one for evaluation)
gen_model   = genai.GenerativeModel(model_name="models/gemini-2.0-flash")
judge_model = genai.GenerativeModel(model_name="models/gemini-2.0-flash")

print("✅ Environment ready.")
```


## 3. Extract and Chunk Your PDF

### 3.1 Text Extraction

```python
def extract_text(path):
    reader = PyPDF2.PdfReader(path)
    # Some pages return None from extract_text(); treat those as empty
    return "".join(p.extract_text() or "" for p in reader.pages)

dataset_folder = "/kaggle/input/MY_SAMPLE_DOC.PDF"
pdf_file = os.path.join(dataset_folder, "MY_SAMPLE_DOC.PDF")
full_text = extract_text(pdf_file)
print(f"Extracted {len(full_text)} characters from PDF.")
```


### 3.2 Smart Chunking

```python
import nltk
nltk.download("punkt", quiet=True)  # sentence tokenizer model for sent_tokenize
from nltk.tokenize import sent_tokenize

def smart_chunk(text, max_tokens=300, overlap=50):
    sentences = sent_tokenize(text)
    chunks, cur, length = [], [], 0
    for sent in sentences:
        tok = len(sent.split())
        if length + tok > max_tokens:
            chunks.append(" ".join(cur))
            # Carry the last `overlap` sentences into the next chunk for context
            cur = cur[-overlap:]
            length = sum(len(s.split()) for s in cur)
        cur.append(sent)
        length += tok
    if cur:
        chunks.append(" ".join(cur))
    return chunks

chunks = smart_chunk(full_text)
print(f"Created {len(chunks)} chunks.")
```
## 4. Build & Cache Embeddings + FAISS Index

```python
CACHE_PATH = "/kaggle/working/embed_cache.pkl"

def load_or_build_index(chunks, folder):
    # Key the cache on the newest modification time in the dataset folder
    mtime = max(os.path.getmtime(os.path.join(folder, f)) for f in os.listdir(folder))
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            cache = pickle.load(f)
        if cache["mtime"] == mtime:
            return cache["embeddings"], faiss.deserialize_index(cache["index"])
    # Compute embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embs = np.array(model.encode(chunks, show_progress_bar=True), dtype="float32")
    idx = faiss.IndexFlatL2(embs.shape[1])
    idx.add(embs)
    # FAISS index objects aren't directly picklable; serialize to bytes first
    with open(CACHE_PATH, "wb") as f:
        pickle.dump({"mtime": mtime, "embeddings": embs,
                     "index": faiss.serialize_index(idx)}, f)
    return embs, idx

chunk_embeddings, index = load_or_build_index(chunks, dataset_folder)
print(f"FAISS index built with {index.ntotal} vectors.")
```
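The caching pattern here (key the cache on the source file's mtime, rebuild on mismatch) can be checked on its own with stdlib pieces; `cached_compute` is an illustrative name, not part of the notebook:

```python
import os, pickle, tempfile

# Recompute only when the source file's modification time changes;
# otherwise return the pickled value from a previous run.
def cached_compute(src_path, cache_path, compute):
    mtime = os.path.getmtime(src_path)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
        if cache["mtime"] == mtime:
            return cache["value"], True   # cache hit
    value = compute()
    with open(cache_path, "wb") as f:
        pickle.dump({"mtime": mtime, "value": value}, f)
    return value, False                   # cache miss: rebuilt and stored

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "doc.txt")
    cache = os.path.join(d, "cache.pkl")
    with open(src, "w") as f:
        f.write("policy text")
    v1, hit1 = cached_compute(src, cache, lambda: "embeddings")
    v2, hit2 = cached_compute(src, cache, lambda: "embeddings")
    print(hit1, hit2)  # False True
```

The same logic drives `load_or_build_index`: the first run pays the embedding cost, later runs load instantly, and replacing the PDF (new mtime) forces a rebuild.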

## 5. Retrieve Relevant Context

```python
# Queries must be embedded with the same model used for the chunks
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_chunks(question, k=3):
    q_emb = np.array(embed_model.encode([question]), dtype="float32")
    _, ids = index.search(q_emb, k)
    return [chunks[i] for i in ids[0]]

# Test retrieval
ctx = retrieve_chunks("How many sick days do I have?")
print("\n".join(ctx[:2]))
```
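`faiss.IndexFlatL2` performs exhaustive Euclidean‑distance search; conceptually it reduces to the pure‑Python version below (toy 2‑D vectors and all names are illustrative):

```python
import math

# Exhaustive L2 nearest-neighbor search: compute the distance from the
# query to every stored vector and return the indices of the k closest.
def l2_search(stored, query, k=2):
    dists = [(math.dist(query, v), i) for i, v in enumerate(stored)]
    return [i for _, i in sorted(dists)[:k]]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
print(l2_search(vectors, [0.0, 0.1]))  # nearest first: [0, 2]
```

FAISS does the same thing over 384‑dimensional MiniLM embeddings, just vectorized in C++, which is why `retrieve_chunks` returns the chunks whose meaning sits closest to the question.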


## 6. Generate a JSON‑Formatted Answer

```python
# Literal braces in the examples are doubled ({{ }}) so that str.format
# leaves them intact and only fills the {context} and {question} slots.
FEW_SHOT = """
You are a helpful AI assistant for corporate policy navigation.

Examples:
Q: What is our remote work policy?
A: {{"answer":"Remote up to 2 days/week","source":"Manual, Section 4"}}
Q: How many sick leave days?
A: {{"answer":"10 days/year","source":"Manual, Section 3.2"}}

Context:
{context}

If a value is missing, respond with {{"answer":"Not specified","source":"Manual"}}.

Q: {question}
"""

def answer_question(question):
    context = "\n\n".join(retrieve_chunks(question))
    prompt = FEW_SHOT.format(context=context, question=question)
    res = gen_model.generate_content(prompt).text.strip()
    # Pull the first {...} span out of the raw reply before parsing
    m = re.search(r'\{[\s\S]*\}', res)
    data = m.group(0) if m else res
    return json.loads(data)

# Demo
print(answer_question("What is our annual leave allowance?"))
```
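Models often wrap JSON in prose or code fences even when asked not to, which is why `answer_question` extracts the first `{...}` span before parsing. That step in isolation, on a made‑up reply:

```python
import json, re

# A typical model reply: chatty preamble plus a fenced JSON payload
raw = 'Sure! Here is the answer:\n```json\n{"answer": "10 days/year", "source": "Manual, Section 3.2"}\n```'

# Grab everything from the first "{" to the last "}" and parse it
m = re.search(r'\{[\s\S]*\}', raw)
data = json.loads(m.group(0))
print(data["answer"])  # 10 days/year
```

If no brace span is found, `answer_question` falls back to parsing the raw text, so a bare‑JSON reply still works.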


## 7. Evaluate Answer Quality

```python
# As above, literal braces are doubled so str.format only fills the
# {question}, {context}, and {answer} placeholders.
EVAL_PROMPT = """
You are an expert evaluator using this rubric (1–5): 1: Very poor … 5: Very good.

Steps:
1. Instruction Following
2. Groundedness
3. Completeness
4. Fluency

Output JSON: {{"score":<1–5>,"explanation":<string>}}

Question: {question}
Context: {context}
Answer: {answer}
"""

def evaluate(question):
    context = "\n\n".join(retrieve_chunks(question))
    ans = answer_question(question)["answer"]
    prompt = EVAL_PROMPT.format(question=question, context=context, answer=ans)
    raw = judge_model.generate_content(prompt).text
    # Strip code fences, then slice from the first "{" to the last "}"
    clean = raw.replace("```", "")
    js = clean[clean.find("{"):clean.rfind("}") + 1]
    return json.loads(js)

# Demo
print(evaluate("What is our remote work policy?"))
```
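The evaluator uses a slightly different cleanup than section 6: strip backtick fences first, then slice between the outermost braces. Checked on a made‑up judge reply:

```python
import json

# A typical judge reply, fenced as a json code block
raw = '```json\n{"score": 4, "explanation": "Grounded and complete."}\n```'

# Remove the fences, then slice from the first "{" to the last "}"
clean = raw.replace("```", "")
js = clean[clean.find("{"):clean.rfind("}") + 1]
print(json.loads(js)["score"])  # 4
```

Either approach works here; the slice version also tolerates a stray `json` language tag left behind after the fences are removed.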


## 8. Interactive Q&A

```python
import ipywidgets as widgets
from IPython.display import display

question_box = widgets.Text(placeholder="Ask a question…", continuous_update=False)
output_box = widgets.Output()

def on_change(change):
    if change["name"] == "value":
        with output_box:
            output_box.clear_output()
            print(answer_question(change["new"]))

question_box.observe(on_change, names="value")
display(question_box, output_box)
```

## 9. Limitations & Next Steps

- **Chunk Boundaries:** Tables and multi‑column layouts can split awkwardly across chunks.
- **Cache Invalidation:** Edits that leave the file's timestamp unchanged won't trigger a rebuild until you clear the cache.


## 10. Try It Yourself

1. Copy the code and run it in your own Kaggle environment.
2. Attach your own policy PDF as a Kaggle Dataset.
3. Set your `GOOGLE_API_KEY` in Kaggle Secrets.
4. Run all cells and start asking policy questions in seconds!




