At AnaVation, learning and professional development aren’t just values; they’re part of our culture. Our Education in the Evening (EiTE) series is one of the many ways we invest in our people, offering AnaVators the opportunity to expand their knowledge, explore emerging technologies, and sharpen their skills. Whether it’s a technical deep dive or a soft skills workshop, these sessions offer meaningful insights that can be applied both at work and beyond.
In the latest EiTE session, Collin D., Software Engineer, led an in-depth exploration of On-Premises Large Language Models (LLMs), a topic that continues to grow more relevant as organizations strive to balance capability, performance, and security.
Collin began by outlining the motivating use case and clearly defining the problem: organizations often work with massive amounts of sensitive data that cannot be sent to external cloud providers. Processing this volume of information through cloud-based APIs can quickly become costly and inefficient, especially if every request requires a full network round-trip. He explained how these limitations create a need for alternative approaches that maintain performance and security without sacrificing scalability.
To address these challenges, Collin outlined a solution centered on deploying AI models within an organization’s own environment, on its own hardware. He demonstrated how on-premises systems allow teams to scale horizontally and run batch processing workflows. A simple example is sending a prompt to a large language model, receiving a response through an API, and repeating that process for all the data that needs to be analyzed. While cloud providers offer APIs, SDKs, and batch interfaces, these options still involve high costs, potential trust concerns, and network delays. By comparison, on-premises solutions avoid sending data to a third party and reduce communication overhead. Collin also walked through how on-premises deployment differs from personal local AI setups: the discussion focused on high-performance hardware and software in a datacenter used for inference over existing large-scale data, as opposed to use cases such as interactive chat, generating new content, or training models.
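As a concrete illustration of that prompt-and-repeat pattern, the minimal sketch below (not taken from the session) assumes a hypothetical on-premises server exposing an OpenAI-compatible chat completions API, as inference engines such as vLLM commonly do; the endpoint URL, model name, and documents are placeholders.

```python
# Minimal sketch of the batch pattern described above: send each record to an
# on-premises LLM endpoint and collect the responses. Assumes a hypothetical
# server exposing an OpenAI-compatible chat completions API; the URL, model
# name, and documents are placeholders, not details from the session.
import requests

ENDPOINT = "http://llm.internal.example:8000/v1/chat/completions"  # on-prem server (placeholder)
MODEL = "example-8b-instruct"  # whatever model the cluster is serving (placeholder)

documents = [
    "First sensitive record to analyze...",
    "Second sensitive record to analyze...",
]

results = []
for doc in documents:
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "Summarize the key facts in the document."},
            {"role": "user", "content": doc},
        ],
        "temperature": 0.0,  # keep outputs stable for batch analysis
    }
    response = requests.post(ENDPOINT, json=payload, timeout=120)
    response.raise_for_status()
    results.append(response.json()["choices"][0]["message"]["content"])

for summary in results:
    print(summary)
```

A production workflow would typically submit requests concurrently or use the engine’s batch interface rather than looping serially, but the key point stands: the data never leaves the organization’s network.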
As part of the foundational overview, Collin walked through what a Large Language Model is. He described an LLM as a machine learning model, specifically a neural network, whose parameters are updated during training based on how well its predictions match the expected outputs. It works with inputs that can be turned into embeddings, usually text, though some models also accept audio, images, or video. Everything is converted into numerical embeddings that pass through multiple layers containing weights learned during training, and the model outputs a probability distribution that predicts the next token in a sequence.

Next, he reviewed the role of GPUs in running these models. A GPU is a device that performs many simple operations quickly and in parallel; it differs from a CPU in that it has many more cores, which operate on different parts of the same data at once. Modern GPUs contain specialized cores for matrix multiplication, which is essential for neural networks, and some recent GPU series are designed entirely for AI-related workloads.
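For readers who want to see that next-token prediction in action, the short sketch below (an illustration, not material from the session) tokenizes a prompt, runs a forward pass on a GPU when one is available, and prints the model’s probability distribution over the most likely next tokens; the small GPT-2 model is used only because it runs almost anywhere.

```python
# Illustrative sketch of next-token prediction (not from the session): text is
# tokenized, embedded, passed through the network's layers, and the model
# returns a probability distribution over the next token. GPT-2 is used only
# because it is small; the same flow applies to larger models on datacenter GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU when present

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model.eval()

inputs = tokenizer(
    "On-premises large language models keep sensitive data", return_tensors="pt"
).to(device)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()):>12}  p={prob.item():.3f}")
```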
From there, the session continued with a summary of hardware selection. Collin walked through key metrics such as total GPU memory, memory bandwidth, inter-node and intra-node connection speeds, supported data types, and inference engine support. He explained that organizations often have to work with the hardware available to them, which means hardware frequently determines which models can be deployed. Collin then turned to model selection, where important considerations include modality, tool use, model size, required context length, architecture, performance benchmarks, and inference engine compatibility. He also reviewed prominent open-weight model series, ranging from 1 billion to more than 200 billion parameters, and compared dense models with sparse Mixture of Experts models, as well as multimodal models, tool-use models, and models with reasoning capabilities.
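To illustrate why total GPU memory so often dictates model choice, a rough back-of-envelope check like the one below can be helpful; the hardware figures and model sizes here are illustrative assumptions, not numbers from the session.

```python
# Back-of-envelope check (illustrative, not from the session) of whether a
# model's weights fit in available GPU memory at a given precision. Real
# deployments also need headroom for KV cache, activations, and the engine.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    # billions of parameters x bytes per parameter = gigabytes of weights
    return num_params_billion * BYTES_PER_PARAM[dtype]

gpu_memory_gb = 4 * 80  # e.g., four 80 GB GPUs in one node (assumed hardware)

for params_b in (8, 70, 200):
    for dtype in ("fp16", "int8", "int4"):
        needed = weight_memory_gb(params_b, dtype)
        fits = "fits" if needed < 0.9 * gpu_memory_gb else "does not fit"  # ~10% headroom
        print(f"{params_b:>4}B params @ {dtype}: ~{needed:6.0f} GB of weights -> {fits}")
```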
To wrap up, Collin elaborated on the quantization process, covering common problems, solutions, and implementation examples. He then provided a breakdown of inference engines and approaches to horizontal scaling, and concluded with examples related to object detection and other batch data processing workflows, tying the technical concepts back to real-world applications.
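To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit weight quantization (an illustration under simplified assumptions, not Collin’s implementation): weights are mapped to integers with a single per-tensor scale, cutting memory roughly four-fold relative to 32-bit floats at the cost of a small reconstruction error.

```python
# Minimal illustration of symmetric 8-bit weight quantization (not the
# session's implementation): map float weights to int8 with one per-tensor
# scale, then dequantize and measure the error introduced.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # stand-in layer weights

scale = np.abs(weights).max() / 127.0                                   # one scale for the whole tensor
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

error = np.abs(weights - dequantized).mean()
print(f"memory: {weights.nbytes / 1e6:.1f} MB -> {quantized.nbytes / 1e6:.1f} MB")
print(f"mean absolute error after dequantization: {error:.6f}")
```

Production inference engines typically quantize per channel or per group and may keep sensitive layers at higher precision, but the trade-off is the same: less memory and bandwidth in exchange for a small loss in fidelity.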
Thank you to our AnaVator, Collin, for delivering such an enriching EiTE session! This presentation highlighted the complexity and importance of running large language models on-premises and offered valuable insight into workflows and decisions that influence successful implementation.