Why Hosting AI Models In-House Matters
In an era dominated by cloud computing, there are still compelling reasons to host AI models on-premises. While cloud-based solutions offer scalability and convenience, certain environments demand more control, reliability, and privacy. Hosting models locally ensures greater data governance, allows compliance with industry or regulatory standards, and enhances security by keeping sensitive information within a closed network. It also becomes essential in situations where internet connectivity is unreliable or unavailable, such as in remote facilities, secure government operations, or offline field deployments. Additionally, on-prem hosting can offer reduced latency, cost predictability, and full control over model execution and updates—making it a critical choice for organizations with strict operational or compliance requirements.
This tutorial will show you how to run a basic document Q&A system offline using:
- Ollama + a local LLM (Gemma 3, Mistral, Llama 3.3, etc.)
- LangChain
- FAISS (vector DB)
- SentenceTransformers (embeddings)
- PyPDF (PDF loading)
Install Project Dependencies
For Google Colaboratory, install the pciutils package, which provides utilities for managing PCI devices. It will help determine whether a GPU is available to use (hint: change your runtime type to a T4 GPU to accelerate your model's processing).
!apt install pciutils
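To confirm that a GPU is actually visible to the runtime, you can list the PCI devices (an optional check; on a T4 runtime, nvidia-smi gives more detail):
!lspci | grep -i nvidia
!nvidia-smi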
Now run Ollama's installation script to install it on your instance.
Downloads and manual install instructions for different platforms can be found here.
!curl -fsSL https://ollama.com/install.sh | sh
You will need to install the necessary Python packages required for this project. Use the pip package manager to install the following libraries:
- langchain-huggingface: integrates LangChain with Hugging Face's libraries, enabling the use of Hugging Face models, particularly for generating text embeddings.
- langchain-ollama: facilitates interaction with Ollama, a tool for running large language models locally, such as Mistral, Gemma 3, or other models.
- langchain-community: provides additional tools and extensions for LangChain, enhancing its functionality for various natural language processing tasks.
- faiss-cpu: installs FAISS, a library designed for efficient similarity search. It's used here as the vector database for storing and retrieving document chunks based on their embeddings.
- sentence-transformers: installs the SentenceTransformers library, which is crucial for generating text embeddings. These embeddings represent the semantic meaning of text and are used to calculate similarity between different pieces of text.
- pypdf: provides tools for reading and processing PDF files, allowing the notebook to load and analyze the content of PDF documents.
- hf_xet: enables Hugging Face's Xet storage backend, which speeds up downloading model files from the Hugging Face Hub when the embedding model is fetched.
!pip install langchain-huggingface==0.1.2 \
langchain-ollama==0.3.1 \
langchain-community==0.3.21 \
faiss-cpu==1.10.0 \
sentence-transformers==3.4.1 \
pypdf==5.4.0 \
hf_xet==1.0.3
Run the Ollama server and select a model to use
Ollama offers a diverse library of open-source models that you can run locally. These models cater to various tasks, including general-purpose language understanding, coding assistance, multilingual support, and multimodal processing. You can explore the full list of available models at the Ollama Model Library.
Here are some notable models you can use:
🔤 General-Purpose Language Models
- Llama 3.3: Meta's latest model, known for its versatility in natural language processing tasks.
- Mistral: A compact and efficient model suitable for various applications.
- Gemma 3: Google DeepMind's lightweight, state-of-the-art open model.
- Phi-4: Microsoft's model demonstrating strong reasoning and language understanding capabilities.
- Mixtral: A mixture-of-experts model designed for enhanced performance.
💻 Coding and Developer Models
- CodeQwen1.5: A large language model pretrained on extensive code data.
- DeepSeek Coder: An open-source model optimized for code-specific tasks.
- StarCoder2: A next-generation model for code generation across multiple programming languages.
- CodeGemma: A model capable of code generation, natural language understanding, and mathematical reasoning.
🌐 Multilingual and Multimodal Models
- Aya 23: A multilingual model supporting 23 languages, suitable for diverse language tasks.
- Llama 3.2 Vision: An instruction-tuned model optimized for visual recognition, image reasoning, and captioning.
🧠 Lightweight and Specialized Models
- TinyLlama: A compact model designed for efficiency on resource-constrained devices.
- SmolLM2: A family of compact language models available in sizes of 135M, 360M, and 1.7B parameters.
🧰 Tools and Embedding Models
- mxbai-embed-large: A state-of-the-art large embedding model from mixedbread.ai.
- bge-m3: A versatile model distinguished for its multi-functionality, multi-linguality, and multi-granularity.
Now that all requirements are installed, start the Ollama server in the background so that you can use it to run your language model. Running it with nohup allows the server to keep running even if you close your current session.
!nohup ollama serve &
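Give the server a moment to start. As an optional check (not required for the rest of the flow), you can hit Ollama's local endpoint, which simply replies "Ollama is running" when the server is up:
!sleep 5 && curl -s http://localhost:11434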
Select the desired model, then download it with the ollama pull command.
LanguageModel = "mistral"  # or "gemma3", "llama3.3", etc.
!ollama pull {LanguageModel}
Confirm your server is running.
!ollama --version
!ollama list
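Optionally, you can send a one-off prompt from the CLI to verify the model generates text (the first call may be slow while the model loads into memory):
!ollama run {LanguageModel} "Reply with the single word: ready"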
Load PDF Document
In Google Colaboratory, this is how you upload a file from your local system: use the google.colab.files module to handle file uploads. Click to select a file, and once the upload is complete, extract the filename and create a Path object representing the file's location in the Colab environment.
from google.colab import files
from pathlib import Path

print("Select a PDF to analyze:")
uploaded = files.upload()

# Get the first uploaded file (assuming only one file is uploaded)
pdf_name = next(iter(uploaded))
pdf_path = Path(pdf_name)
print(f"Loaded: {pdf_path}")
Preprocessing
Next you will prepare the document for vectorization by:
- Importing necessary libraries
- Loading the PDF
- Splitting the document into manageable chunks with overlap to ensure smoother transitions and context preservation
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader(str(pdf_path))
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(documents)
print(f"Document split into {len(split_docs)} chunks.")
Then you will be creating embeddings (numerical representations of text) for each chunk of the PDF document and storing them in a vector database for efficient searching. Import the HuggingFaceEmbeddings class which will be used to generate the embeddings using a model from Hugging Face, and the FAISS class. FAISS is a library for efficient similarity search and will act as your vector database.
Create an instance of the HuggingFaceEmbeddings class, using the "all-MiniLM-L6-v2" model to generate embeddings. This model converts text into numerical vectors, capturing the semantic meaning of the text. Then create a FAISS vector store from the split_docs (the chunks of the PDF document) and the embedding_model.
The FAISS.from_documents call essentially takes each document chunk, generates its embedding using the embedding_model, and stores it in the FAISS vector store. This organized storage allows for efficient retrieval of relevant chunks later during the question-answering process.
Finally, save the created FAISS vector store to a local file named "vector_index". This allows you to reuse it without recomputing the embeddings.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(split_docs, embedding=embedding_model)

# Save for reuse
vectorstore.save_local("vector_index")
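In a later session you can reload the saved index instead of recomputing embeddings. A minimal sketch, assuming the "vector_index" folder from above is present (recent LangChain versions require allow_dangerous_deserialization=True because the docstore is pickled, so only load indexes you created yourself):
# Reload the saved FAISS index with the same embedding model
vectorstore = FAISS.load_local(
    "vector_index",
    embedding_model,
    allow_dangerous_deserialization=True,
)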
Initialize the Language Model
Now initialize the large language model (LLM) that will be used to answer questions based on your document. Here's how:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model=LanguageModel)
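As a quick, optional sanity check, you can call the model directly before wiring it into the retrieval chain:
# One-off generation to confirm the model responds
print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))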
Setup the Question-Answering System
This is where the magic of question answering happens. It connects the language model (llm) with the vector database (vectorstore) to create a system that can answer your questions based on the PDF document.
In essence, this code sets up a workflow:
- You ask a question.
- The retriever finds the most relevant parts of the PDF using the vectorstore.
- The llm uses these relevant parts and your question to craft an answer.
- The answer is presented to you.
from langchain.chains import RetrievalQA

retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
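If you also want to see which chunks an answer was based on, one variation (not used in the rest of this tutorial) is to ask the chain to return its source documents:
# Optional: also return the retrieved chunks alongside the answer
qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)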
Query the Document
You are now ready to interactively query the document you've uploaded. Ask away.
Generally, your prompts should be based on context, such as:
- What kind of documents are you querying? (e.g., legal texts, manuals, research papers, business reports)
- What kind of information are you trying to extract? (e.g., summaries, steps, key insights, named entities, risks)
- How should the output be structured? (e.g., list, bullet points, paragraph, etc.)
Typically, craft your prompts to extract relevant, accurate, and structured information. Here are some common types of prompts, along with examples:
🔍 1. Direct Question Prompts
Ask a factual or specific question to get a precise answer:
- "What are the key takeaways from this document?"
- "What does the document say about [topic]?"
- "Who is the intended audience of the text?"
- "What is the author's main argument?"
🧠 2. Summarization Prompts
Get a high-level or section-specific summary:
- "Summarize the main points of this document."
- "Give a brief overview of the section about [topic]."
- "What is the conclusion in this document?"
✅ 3. Verification/Fact-Checking Prompts
Verify if a fact or claim appears in the document:
- "Does this document mention any risks associated with [X]?"
- "Is it true that [claim]? Justify with text from the document."
- "What evidence does the document provide for [claim]?"
🔄 4. Comparison Prompts
Compare two sections or concepts within the document:
- "How does the author compare [A] and [B]?"
- "What are the differences between the introduction and conclusion?"
🛠️ 5. Instructional/How-To Prompts
Extract process steps or actionable guidance:
- "List the steps described in the document for [task]."
- "How does the document recommend handling [situation]?"
🧾 6. Extraction Prompts
Pull out entities, dates, figures, policies, etc.:
- "List all the statistics mentioned in the document."
- "What products/services are referenced in the text?"
- "Extract all named people or organizations."
while True:
    query = input("Ask a question about the document (type 'exit' to quit): ")
    if query.lower() == "exit":
        break
    response = qa_chain.invoke(query)
    # The chain returns a dict; the answer lives under the "result" key
    print("\nAnswer:", response["result"], "\n")
Wrapping It Up: Congratulations, You Just Built Your Very Own AI!
By combining a local LLM with vector-based retrieval, you've built a powerful, private document Q&A system—all running offline. Here’s what you accomplished:
- 🧠 Chose an on-prem AI model using tools like Ollama to host language models such as Gemma 3, Mistral, or Llama 3.3 locally.
- 📄 Embedded your documents using FAISS and HuggingFace to convert text into searchable vector space.
- 🔍 Implemented RAG (Retrieval-Augmented Generation) to pull relevant context from your documents before generating answers.
- ✅ Created a full offline pipeline for querying documents in a secure, controlled, and latency-free environment.
With this setup, you now have an AI assistant tailored to your own data—ready to answer questions, summarize insights, and support decisions without ever needing the cloud. 🎉