

How to run Llama 3.2 locally on Mac and serve it to a local Linux laptop to use with Zed

UPDATE: I wrote this post for Llama 3.1, but just after I published it, Llama 3.2 was released. It offers similar quality with faster inference, as it is a distilled model (about 2.6 times faster than Llama 3.1 in a quick test I did). I have updated this post to use 3.2, but it should work with any other version as well.

Zed is a great editor that supports AI assistants. In this post I will explain how you can share one Llama model running on a Mac with other computers on your local network, for privacy and cost efficiency. Another reason to offload the model: fans can get loud if you run Llama directly on the same laptop where you are using Zed.

Since I’ve found that Apple silicon (M1, M2, etc.) is quite good at running these models, I will assume the model runs on that computer. The default llama3.2:3b model works fine on a Mac Mini M1 with 16GB of RAM. If you have 8GB, you might need to use smaller models.

The first step is to install Ollama on your Mac. Just follow the instructions on the website. You will end up with a little llama icon in the menu bar (top right). We now need to install a model, Llama 3.2 in particular. In the terminal, run this:

ollama run llama3.2

This will try to run llama3.2 and, since it is not yet installed, it will fetch the latest version of the model for you. Once it has downloaded and started the model, simply type something to check that everything works. It should look like this:

>>> Tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.

>>> Send a message (/? for help)

Now that it works locally, we want to make it available to other computers on the local network. By default Ollama only listens on localhost, so we need to tell it to bind to all interfaces. Open the terminal and run this command:

launchctl setenv OLLAMA_HOST 0.0.0.0:11434

Now click on the menu bar icon and quit Ollama, then start it again so it picks up the new setting. It should now be ready to accept connections from other computers on your network. To check connectivity, go to a Linux computer on your network, open a terminal, and run the following (replace your_mac.local with your Mac's hostname or IP address):

curl http://your_mac.local:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a joke",
  "options": {
    "num_ctx": 4096
  }
}'
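
If you prefer to script this check instead of using curl, here is a minimal Python sketch of the same request using the requests library. It is just an illustration: the hostname is a placeholder, and "stream": False asks Ollama for a single JSON response instead of a streamed one.

import requests

# Placeholder hostname; replace with your Mac's hostname or IP address.
OLLAMA_URL = "http://your_mac.local:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "Tell me a joke",
    "stream": False,  # one JSON response instead of a streamed reply
    "options": {"num_ctx": 4096},
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["response"])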

Make sure you see a correct response before continuing. Now let's configure Zed to use it. Open the settings (Ctrl-Shift-P and type "open settings"). Add these settings there (alongside any others you already have; it is a JSON file):

{
  "language_models": {
    "ollama": {
      "api_url": "http://your_mac.local:11434",
      "low_speed_timeout_in_seconds": 120,
      "keep_alive": "120s",
      "available_models": [
        {
          "provider": "ollama",
          "name": "llama3.2:latest",
          "max_tokens": 16384
        }
      ]
    }
  },
  "assistant": {
    "default_model": {
      "provider": "ollama",
      "model": "llama3.2:latest"
    },
    "version": "2"
  }
}

It should now be configured. You can go to the Assistant Panel (Ctrl+?) and ask whatever you want there. You can also add context with slash commands such as /tab.

That's it: you now have a shared Llama 3.2 model running on one computer and used from another computer on your network, privately and for free, integrated into a great text editor.

Posted in AI, Open Source, Programming, Zed.



OpenVINO performance for state of the art real-time monocular depth estimation

Recently, an interesting paper was accepted to CVPR 2024: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. The authors made their pre-trained models publicly available in three sizes: Depth-Anything-Small, Depth-Anything-Base, and Depth-Anything-Large.

I wanted to get an idea of how fast these models can run on a CPU these days. Since I am interested in real-time operation, I started with the smallest model to see how well it performs and how fast it runs inference on my laptop:

You can see that the smallest model still performs relatively well. In terms of inference time, it took on average 1.02 seconds per image on my CPU, and the size of this PyTorch model, depth_anything_vits14.pth, is 95MB. It’s fast and small, but not really ideal for real-time applications. Let’s see if we can do better.

OpenVINO uses its own Intermediate Representation (IR) format, which is designed to be optimised for inference. Furthermore, you can then generate a quantised model from the already optimised OpenVINO model, getting even better inference times and an even smaller file.
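
To give an idea of what that pipeline looks like, here is a minimal sketch of converting the PyTorch model to IR and quantising it with NNCF. It assumes you have already loaded Depth-Anything-Small into a variable called depth_model following the authors' instructions; the 518x518 input size and the random calibration data are placeholders for your own preprocessing.

import numpy as np
import torch
import openvino as ov
import nncf

# Assumption: depth_model is Depth-Anything-Small already loaded in PyTorch
# (e.g. from the authors' repository) and switched to inference mode.
depth_model.eval()

# Convert the PyTorch model to OpenVINO IR. The 518x518 input resolution is
# an assumption; use whatever shape your preprocessing produces.
example_input = torch.randn(1, 3, 518, 518)
ov_model = ov.convert_model(depth_model, example_input=example_input)
ov.save_model(ov_model, "depth_anything_vits14.xml")  # writes .xml + .bin

# Quantise to INT8 with NNCF. Real calibration data should be a few hundred
# preprocessed images; random arrays are used here only as a placeholder.
calibration_images = [np.random.rand(1, 3, 518, 518).astype(np.float32)
                      for _ in range(100)]
calibration_dataset = nncf.Dataset(calibration_images)
quantized_model = nncf.quantize(ov_model, calibration_dataset)
ov.save_model(quantized_model, "depth_anything_vits14_int8.xml")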

Here is a table with the different results on my machine:

Model Name                                           | Inference speed (FPS) | Model Size (MB)
depth_anything_vits14.pth (Original PyTorch model)   | ~1                    | 95
depth_anything_vits14.(bin+xml) (OpenVINO IR)        | 7.65                  | 47.11
depth_anything_vits14_int8.(bin+xml) (IR Quantised)  | 9.74                  | 24.27

You can clearly see that there is a massive increase in performance when you use the quantised OpenVINO IR model compared to the original PyTorch model. This allows real-time operation on a laptop. And for reference, here is the output of the quantised model:

The output is still reasonably good, with the nice bonus that it can be used in real time, plus the file size is about one quarter of the original. All thanks to OpenVINO’s great ability to optimise the inference pipeline!
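
If you want to reproduce the timing on your own machine, here is a minimal sketch of loading the quantised IR with the OpenVINO runtime and estimating FPS. The input shape is the same assumption as above, and the random frame is a stand-in for a real preprocessed image.

import time
import numpy as np
import openvino as ov

core = ov.Core()
# Assumption: the quantised IR produced earlier sits next to this script.
compiled = core.compile_model("depth_anything_vits14_int8.xml", "CPU")

# Dummy input with the same (assumed) shape used during conversion.
frame = np.random.rand(1, 3, 518, 518).astype(np.float32)

compiled(frame)  # warm-up run
n = 50
start = time.perf_counter()
for _ in range(n):
    depth = compiled(frame)[0]
print(f"{n / (time.perf_counter() - start):.2f} FPS")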

Posted in Computer Vision, Open Source, OpenVINO.
