{"id":723,"date":"2025-02-05T00:01:07","date_gmt":"2025-02-04T13:01:07","guid":{"rendered":"https:\/\/www.samontab.com\/web\/?p=723"},"modified":"2025-02-05T03:04:41","modified_gmt":"2025-02-04T16:04:41","slug":"from-cpu-to-npu-the-secret-to-15x-faster-ai-on-intels-latest-chips","status":"publish","type":"post","link":"https:\/\/www.samontab.com\/web\/2025\/02\/from-cpu-to-npu-the-secret-to-15x-faster-ai-on-intels-latest-chips\/","title":{"rendered":"From CPU to NPU: The Secret to ~15x Faster AI on Intel&#8217;s Latest Chips"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"da8089\" data-has-transparency=\"false\" style=\"--dominant-color: #da8089;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"422\" src=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-1024x422.avif\" alt=\"\" class=\"wp-image-725 not-transparent\" srcset=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-1024x422.avif 1024w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-300x124.avif 300w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-768x316.avif 768w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-1536x633.avif 1536w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/cpu_npu-2048x844.avif 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Intel&#8217;s newest chips now come with a <strong>Neural Processing Unit (NPU)<\/strong>, built to handle AI and machine learning tasks more efficiently than a regular CPU. Instead of struggling with AI workloads on the CPU, the NPU is designed to run them faster and with less power. This is great because you can free up the CPU to do other general tasks, but I wanted to know how much better the NPU can run a model, compared to the CPU. 
Based on my test, it&#8217;s roughly a <strong>15x performance boost<\/strong>, which is great.<\/p>\n\n\n\n<p>If you&#8217;re looking to buy an edge device with an NPU, I can recommend the <a href=\"https:\/\/amzn.to\/4hjUAeT\">Khadas Mind 2 Mini PC<\/a>, as it&#8217;s really small and packs a lot of power, plus it has a small battery that serves as a UPS, so you can just move it around from one USB power supply to another without it losing power. It&#8217;s quite nice. OK, now let&#8217;s see how I got to that number from the title.<\/p>\n\n\n\n<p>In real-time computer vision, <strong>throughput<\/strong> and <strong>latency<\/strong> are two fundamental performance metrics that describe the efficiency and responsiveness of a system. <strong>Throughput<\/strong> refers to the number of frames processed per second (FPS), determining how much data the system can handle over time. This is basically what you&#8217;re referring to when you ask &#8220;how long does it take to process this video&#8221;. <strong>Latency<\/strong>, on the other hand, is the time it takes to process a single frame from input to output, affecting how quickly the system responds to new data. Low latency is crucial for real-time applications like augmented reality and autonomous driving. When you play around with a system and it feels &#8220;laggy&#8221;, it&#8217;s because it has high latency. You want to keep your latency low and your throughput high.<\/p>\n\n\n\n<p>I&#8217;m going to assume you have already installed OpenVINO on your system, and that you have an Intel chip with an NPU in it. You can quickly check if those two things are true by running this in a Python interpreter:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import openvino as ov\n\ncore = ov.Core()\nprint(core.available_devices)<\/code><\/pre>\n\n\n\n<p>You should see something like <strong>[&#8216;CPU&#8217;, &#8216;GPU&#8217;, &#8216;NPU&#8217;]<\/strong> as the output. Those are the available devices in OpenVINO. 
If you don&#8217;t see your device, make sure you installed the drivers correctly and troubleshoot it before continuing.<\/p>\n\n\n\n<p>Next, we need a model. I\u2019ll be using <strong>ResNet-50<\/strong>, one of the most well-known Convolutional Neural Network architectures, introduced in Microsoft&#8217;s 2015 paper <em><a href=\"https:\/\/arxiv.org\/abs\/1512.03385\">&#8220;Deep Residual Learning for Image Recognition&#8221;<\/a><\/em>. It was trained on <strong>ImageNet-1K<\/strong> at a resolution of <strong>224&#215;224<\/strong>, meaning you can feed it an image of that size, and it will predict the probabilities for <strong>1,000 different object categories<\/strong>. You can get the object names <a href=\"https:\/\/raw.githubusercontent.com\/pytorch\/hub\/master\/imagenet_classes.txt\">here<\/a>; save that file as <strong>imagenet_classes.txt<\/strong> in your working folder.<\/p>\n\n\n\n<p>Luckily, ResNet-50, optimised for OpenVINO, is available for download <a href=\"https:\/\/huggingface.co\/katuni4ka\/resnet50_fp16\/tree\/main\">here<\/a>. Just grab the two files, <strong>resnet50_fp16.xml<\/strong> and <strong>resnet50_fp16.bin<\/strong>, and place them in your working folder. If you want to try another model, you can do that too; just make sure to run the OpenVINO optimiser on it to get the best performance. I&#8217;m also going to use OpenCV for image loading and resizing, so let&#8217;s install it first, and make sure numpy is there as well:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install opencv-python numpy<\/code><\/pre>\n\n\n\n<p>Now let&#8217;s classify an image with this model. 
Write this into a file and save it as <strong>classify.py<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import openvino as ov\nimport numpy as np\nimport cv2\n\ndef classify_image():\n    # Step 1: Load OpenVINO model\n    core = ov.Core()\n    model = core.read_model(\"resnet50_fp16.xml\")\n    compiled_model = core.compile_model(model, \"CPU\")  # Use \"NPU\" if available\n\n    # Step 2: Get input tensor details\n    input_layer = compiled_model.input(0)\n    input_shape = input_layer.shape  # Should be (1, 3, 224, 224)\n\n    # Step 3: Load and preprocess image\n    image = cv2.imread(\"input.jpg\")\n    if image is None:  # cv2.imread returns None when the file cannot be read\n        raise FileNotFoundError(\"Could not read input.jpg\")\n    image = cv2.resize(image, (224, 224))  # Resize to match model input\n    image = image[:, :, ::-1]  # Convert BGR to RGB (OpenCV loads as BGR)\n    image = image.astype(np.float32) \/ 255.0  # Normalise to [0,1]\n    image = np.transpose(image, (2, 0, 1))  # HWC to CHW\n    image = np.expand_dims(image, axis=0)  # Add batch dimension\n\n    # Step 4: Run the inference\n    output = compiled_model(image)[compiled_model.output(0)]\n\n    # Step 5: Process the results\n    top_class = int(np.argmax(output))  # Get class index\n\n    # Load ImageNet labels (remember to download the file)\n    with open(\"imagenet_classes.txt\") as f:\n        imagenet_labels = [line.strip() for line in f]\n\n    # Display result\n    print(f\"Predicted Class: {imagenet_labels[top_class]}\")\n\nif __name__ == \"__main__\":\n    classify_image()\n<\/code><\/pre>\n\n\n\n<p>Make sure that in the same folder you have these files: <strong>classify.py<\/strong>, <strong>imagenet_classes.txt<\/strong>, <strong>resnet50_fp16.xml<\/strong>, and <strong>resnet50_fp16.bin<\/strong>. Now add any image in there and rename it to <strong>input.jpg<\/strong>. 
Now simply call the script:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">python classify.py<\/code><\/pre>\n\n\n\n<p>You should get the correct predicted class, like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img data-dominant-color=\"498081\" data-has-transparency=\"false\" style=\"--dominant-color: #498081;\" loading=\"lazy\" decoding=\"async\" width=\"939\" height=\"664\" src=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/llama.avif\" alt=\"\" class=\"wp-image-724 not-transparent\" srcset=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/llama.avif 939w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/llama-300x212.avif 300w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2025\/02\/llama-768x543.avif 768w\" sizes=\"auto, (max-width: 939px) 100vw, 939px\" \/><\/figure>\n\n\n\n<p>Now that we know that the model actually works correctly with OpenVINO, we can benchmark it with a convenient tool that comes with it. It&#8217;s called <strong>benchmark_app<\/strong>, and it allows you to quickly check the performance of your devices with different models. 
You can call it like this:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>benchmark_app -m <strong>MODEL<\/strong> -d <strong>DEVICE<\/strong> -hint <strong>HINT<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>For this benchmark, I ran these four commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">benchmark_app -m \"resnet50_fp16.xml\" -d CPU -hint latency\nbenchmark_app -m \"resnet50_fp16.xml\" -d CPU -hint throughput\nbenchmark_app -m \"resnet50_fp16.xml\" -d NPU -hint latency\nbenchmark_app -m \"resnet50_fp16.xml\" -d NPU -hint throughput<\/code><\/pre>\n\n\n\n<p>These are the results:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td>Device<\/td><td>Hint<\/td><td>Median Latency (ms)<\/td><td>Average Latency (ms)<\/td><td>Min Latency (ms)<\/td><td>Max Latency (ms)<\/td><td>Throughput (FPS)<\/td><\/tr><tr><td>CPU<\/td><td>Latency<\/td><td>25.31<\/td><td>24.73<\/td><td>18.38<\/td><td>47.16<\/td><td>40.32<\/td><\/tr><tr><td>CPU<\/td><td>Throughput<\/td><td>47.38<\/td><td>63.65<\/td><td>38.52<\/td><td>135.35<\/td><td>62.69<\/td><\/tr><tr><td>NPU<\/td><td>Latency<\/td><td>1.68<\/td><td>1.7<\/td><td>1.52<\/td><td>8.3<\/td><td>569.71<\/td><\/tr><tr><td>NPU<\/td><td>Throughput<\/td><td>9.15<\/td><td>8.4<\/td><td>3.21<\/td><td>86.6<\/td><td>936.05<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Key Takeaways:<\/h2>\n\n\n\n<p>NPU (Latency mode) achieves <strong>1.70ms<\/strong> average latency compared to <strong>24.73ms<\/strong> on CPU (<strong>~15x improvement<\/strong>)<br>NPU (Throughput mode) reaches <strong>936.05 FPS<\/strong>, which is <strong>~15x higher<\/strong> than CPU throughput mode (<strong>62.69 FPS<\/strong>)<\/p>\n\n\n\n<p>This confirms that <strong>Intel&#8217;s NPU significantly outperforms the CPU<\/strong> in both latency and throughput, with a roughly 15x performance boost for this particular 
model.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Intel&#8217;s newest chips now come with a Neural Processing Unit (NPU), built to handle AI and machine learning tasks more efficiently than a regular CPU. Instead of struggling with AI workloads on the CPU, the NPU is designed to run them faster and with less power. This is great because you can free up the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[97,29,21,24],"tags":[110,109,25,92,33],"class_list":["post-723","post","type-post","status-publish","format-standard","hentry","category-ai","category-computer-vision","category-open-source","category-openvino","tag-intel","tag-npu","tag-openvino","tag-performance","tag-python"],"_links":{"self":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts\/723","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/comments?post=723"}],"version-history":[{"count":0,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts\/723\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/media?parent=723"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/categories?post=723"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/tags?post=723"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}