Shrinking LLMs for AI on the PC

Published on 22/10/2024 | Written by Heather Wright



Lightweight and running on your CPU…

Microsoft has open sourced BitNet.cpp, a lightweight inference framework for 1-bit large language models (LLMs) that runs directly on CPUs and is designed to reduce memory footprint and energy consumption, making LLMs more accessible to run.

It’s being hailed not only for its potential to lower the cost and energy consumption of LLMs, but also as a way of improving accessibility, making LLMs viable for local use cases and democratising AI.


Much has been made of AI’s heavy power and resource requirements. While the large language models on which generative AI tools are built are getting better, they’re also getting bigger, demanding ever more energy and compute power and posing challenges for deployment and the environment. Putting them through their paces requires some serious hardware, with all the associated costs.

The International Energy Agency says a ChatGPT request consumes 10 times the electricity of a Google search, and estimates suggest data centres could consume up to nine percent of US electricity generation by 2030. That has seen providers including Amazon and Google looking to nuclear power for their AI data centres.

Last year a Microsoft Research Asia team created BitNet, a 1-bit quantisation-aware training (QAT) method for LLMs.
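For the technically minded, quantisation-aware training keeps full-precision ‘latent’ weights for the optimiser but quantises them on the forward pass, letting gradients flow straight through the rounding step. The sketch below is a simplified illustration of that idea in PyTorch, using the absmean ternary quantiser of the later BitNet b1.58 variant; it is not Microsoft’s actual code, and activation quantisation is omitted.

```python
import torch
import torch.nn as nn

def absmean_quantise(w: torch.Tensor):
    """Map weights to {-1, 0, +1} with one per-tensor scale (BitNet b1.58-style absmean)."""
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

class BitLinear(nn.Linear):
    """Linear layer that trains against ternary weights.

    The straight-through estimator (w + (w_q*scale - w).detach()) makes the forward
    pass use the quantised weights while gradients update the full-precision copy.
    Activation quantisation, used in the real method, is omitted for brevity.
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, scale = absmean_quantise(self.weight)
        w_ste = self.weight + (w_q * scale - self.weight).detach()
        return nn.functional.linear(x, w_ste, self.bias)
```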

Microsoft says results on language modelling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.

The approach enables LLMs to be run on single computers, rather than having to be hosted in large data centres.

The Institute of Electrical and Electronics Engineers says 1-bit LLMs could solve AI’s energy demands, with ‘imprecise’ language models being smaller, speedier and nearly as accurate.

Microsoft claims the first release of BitNet.cpp, which supports inference on CPUs – NPU and GPU support will follow – can achieve speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains.

On ARM, energy consumption is reduced by between 55 and 70 percent.

On x86 CPUs, speedups range from 2.37x to 6.17x, with energy consumption reduced by between 72 and 82 percent.

A 100 billion parameter b1.58 model can be run on a single CPU, achieving speeds comparable to human reading (five to seven tokens per second), ‘significantly enhancing the potential for running LLMs on local devices’, says BitNet.cpp’s README on GitHub.
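The memory arithmetic behind that claim is simple enough to check on the back of an envelope: at roughly 1.58 bits per weight, a 100-billion-parameter model’s weights shrink by an order of magnitude compared with a 16-bit version. The figures below cover weights only, ignoring activations and runtime buffers.

```python
params = 100e9  # 100 billion parameters

fp16_gb = params * 16 / 8 / 1e9        # 16 bits per weight  -> ~200 GB
ternary_gb = params * 1.58 / 8 / 1e9   # ~1.58 bits per weight (values in {-1, 0, +1}) -> ~20 GB
packed_gb = params * 2 / 8 / 1e9       # a practical 2-bit packing -> ~25 GB

print(f"FP16 weights:     {fp16_gb:.0f} GB")
print(f"1.58-bit weights: {ternary_gb:.1f} GB")
print(f"2-bit packed:     {packed_gb:.0f} GB")
```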

That means rather than relying on data centres for processing, we could see powerful AI deployed locally on personal devices, embedded systems or in areas with limited internet, broadening who can access the tools.

With the ability to do sensitive data analysis locally, rather than having to send anything to the cloud, data privacy concerns could be eased.

While traditional LLMs use high-precision numbers, which take up a lot of memory, BitNet.cpp represents the model’s weights in an extremely compact form, streamlining its arithmetic with little loss of accuracy.

AI engineer Rohan Paul described it as “simplifying a detailed colour image into just black, white and grey.

“You lose some detail, but the main features are still there, and it becomes much easier to process.”
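In weight terms, the analogy amounts to collapsing every high-precision value onto one of three levels plus a single scale factor. A small, self-contained illustration (using the same absmean-style quantiser sketched earlier, not actual BitNet.cpp code):

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, 4)                    # the 'detailed colour image': full-precision weights

scale = w.abs().mean()                   # one scale factor for the whole tensor
w_q = (w / scale).round().clamp(-1, 1)   # 'black, white and grey': values in {-1, 0, +1}
w_approx = w_q * scale                   # what a 1.58-bit model effectively computes with

rel_error = (w - w_approx).abs().mean() / w.abs().mean()
print(f"mean relative error: {rel_error:.2f}")  # detail is lost, but the broad structure survives
```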

Microsoft isn’t the only company pursuing 1-bit LLMs. Early this year a combined team from Swiss public research university ETH Zurich, China’s Beihang University and the University of Hong Kong introduced BiLLM, a method taking a different approach from Microsoft’s, while another team, led by a Harbin Institute of Technology researcher, released a preprint on another method, called OneBit.
