How to run an LLM locally with a ChatGPT-like interface

Hanif

Introduction

A few days ago Meta released a new version of their open-source Large Language Model (LLM), Llama 3.1, in three variants: 8B, 70B, and 405B. The "B" stands for billions of parameters, so the smallest model has 8 billion parameters. According to their release article:

Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.

The paper, The Llama 3 Herd of Models, is also available here.

I've always been fascinated by the open-source approach to developing large language models. Historically, open source has driven innovation and provided strong competition to the proprietary methods of major tech companies.

Another reason I wanted to try running a model locally is due to the nature of my workflow. I frequently ask ChatGPT to analyze numerous files, but the free plan has limitations for this use case. I figured having an open-source LLM running on my computer would reduce my reliance on ChatGPT's file upload feature. So, I decided to install Llama 3.1 using Ollama and OpenWebUI.

Initial Attempt with Llama 3.1

Well, that failed woefully. I downloaded the smallest variant, Llama 3.1 8B (about ~5GB), using Ollama, and ran it on the command line. When I asked it a simple question, "What is a GPU?", it took about 10 minutes to fully write out the answer 😔. In hindsight, that was not such a great idea: my MacBook Air was not powerful enough to run the model.

I knew that was not going to work. Just as I was about to give up, I came across a Twitter (X) post from @Prince_Canuma about a lightweight LLM from Google DeepMind called Gemma 2. I read up on it and decided to use it as an alternative to the Llama 3.1 model.

About Gemma 2 2B

Gemma 2 2B is a newly released artificial intelligence model developed by Google DeepMind. It is the smallest variant in the Gemma 2 family, which also includes larger models of 9 billion and 27 billion parameters. This 2-billion-parameter model is optimized for a variety of hardware, making it suitable for deployment on edge devices and laptops.

This lightweight model produces outsized results by learning from larger models through distillation.

Despite its compact size, according to DeepMind the model outperformed larger models such as GPT-3.5 and Mixtral on the LMSYS Chatbot Arena benchmark, achieving an Elo score of 1126 and demonstrating exceptional conversational ability.

Ollama

Ollama is a lightweight, open-source tool designed to make it easy to run and manage large language models (LLMs) locally on personal computers. It's gained popularity for its simplicity and focus on local AI deployment.
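
For orientation, here are the handful of Ollama commands the rest of this post relies on (the model names are just examples):

bash
ollama pull gemma2:2b   # download a model without starting a chat
ollama run gemma2:2b    # run a model (downloads it first if needed)
ollama list             # list the models installed locally
ollama rm gemma2:2b     # remove a model you no longer need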

OpenWebUI

OpenWebUI is an open-source project aimed at creating a web-based user interface for large language models (LLMs). It's designed to provide a customizable and extendable front-end for interacting with various AI models, similar to how ChatGPT's interface works for OpenAI's models.

How to install the Gemma 2 2B model locally using Ollama

The first step is to download Ollama on your computer; it supports Windows, macOS, and Linux. The download is only about 177MB.

When that is done, extract the application from the zip file and install it. Next, navigate to the terminal and run the command:

bash
ollama run gemma2:2b

This downloads the Gemma 2 2B model to your computer. It might take a while, since the whole model is about 1.6GB.
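
Once the download finishes, Ollama drops you into an interactive chat session. You can also pass a one-shot prompt directly on the command line, which is a quick way to confirm everything works (the prompt here is just an example):

bash
ollama run gemma2:2b "What is a GPU?"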

Quantized Models

Ollama also supports installing custom models and quantized models. Quantized models are machine learning models that have had their precision reduced to make them faster and more memory efficient.

It involves converting a model's high-precision floating-point values to lower-precision fixed-point or integer representations. For example, a model's 32-bit floating-point numbers (float32) can be converted to 8-bit integers (int8), shrinking the memory footprint to roughly a quarter.
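
To see why that matters on a laptop, here is some back-of-the-envelope arithmetic (illustrative figures only, ignoring activations and other overhead):

bash
# A 2B-parameter model stored at float32 uses 4 bytes per weight,
# while int8 uses 1 byte per weight:
echo "float32: $(( 2 * 4 )) GB"   # 2e9 params x 4 bytes ≈ 8 GB
echo "int8:    $(( 2 * 1 )) GB"   # 2e9 params x 1 byte  ≈ 2 GB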

Hugging Face offers a variety of quantized models that you can download. For example, there is a quantized version of the Gemma2 model posted by Bartowski. You can find it at this model link
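
If you prefer the command line to the browser, the GGUF file can also be fetched with Hugging Face's CLI. A minimal sketch, assuming the repo id is bartowski/gemma-2-9b-it-GGUF and you want the Q6_K_L quantization used later in this post:

bash
pip install -U "huggingface_hub[cli]"
# repo id and filename are assumptions based on the quantization referenced below
huggingface-cli download bartowski/gemma-2-9b-it-GGUF gemma-2-9b-it-Q6_K_L.gguf --local-dir ./downloads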

Much like installing Ollama itself, you will first have to download the model as a GGUF file. Once that is done, navigate to your terminal and type in the following command:

bash
vim Modelfile

This opens a new file called Modelfile, which you can think of as a Dockerfile for LLMs. Once the editor opens in the terminal, type FROM followed by the path to your downloaded model file with a .gguf extension. You can also set other instructions, as detailed in the GitHub docs:

bash
# general form
FROM ./path_to_model/filename.gguf

# for example
FROM ./downloads/gemma-2-9b-it-Q6_K_L.gguf
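
A Modelfile can carry more than just the FROM line. Here is a minimal sketch with two optional instructions, PARAMETER and SYSTEM, both documented in Ollama's Modelfile reference (the values are just examples):

bash
FROM ./downloads/gemma-2-9b-it-Q6_K_L.gguf

# optional: lower temperature makes output more deterministic
PARAMETER temperature 0.7

# optional: a default system prompt for every chat
SYSTEM "You are a concise, helpful assistant."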

You can then create the model from the Modelfile using the command below

bash
ollama create choose-a-model-name -f Modelfile

To see all of the models installed on Ollama, use the ollama list command:

bash
ollama list

There it is: the list should contain the model you just created. You can then use ollama run:

bash
ollama run choose-a-model-name "Any Prompt you want to write"

Running OpenWebUI

As explained above, OpenWebUI provides a ChatGPT-like interface where file upload is also supported. Running OpenWebUI with your Ollama models takes just one command in the terminal. You can install it either with Docker or manually; both approaches are explained in their installation guide.

bash
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
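
Once the container is up, the interface is served on port 3000 of your machine (that's the -p 3000:8080 mapping in the command above). A couple of quick checks, assuming the container name used above:

bash
# confirm the container is running
docker ps --filter name=open-webui

# then open the UI in your browser
open http://localhost:3000   # on macOS; use xdg-open on Linux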

Demo of testing Gemma 2 locally

Thank you for reading. Cheers! 🥂