Summary of WHAT IS AI INFERENCE AT THE EDGE?
The article explains the distinction between training and inference in machine learning and argues for running AI inference at the edge to reduce latency and network dependence. It highlights Google Edge TPU as a custom ASIC optimized for low-power, high-speed edge inference, capable of running mobile vision models (e.g., MobileNet V2) at high frame rates and offering up to 4 TOPS at about 2 watts. Edge inference suits time-sensitive IoT and embedded applications better than cloud-only inference.
Parts used in the Google Edge TPU project:
- Google Edge TPU (ASIC)
- Embedded device or electronic hardware for data capture
- Mobile vision models such as MobileNet V2
- Low-power power supply (supporting roughly 2 W of operation)
- Coral prototyping or production hardware integrating Edge TPU
The conventional approach of relying on network connectivity to deliver artificial intelligence models needs some modification to meet the demands of applications ranging from embedded systems to the automotive industry. Before turning to the role of AI inference at the edge, it helps to understand the difference between training and inference. Machine learning training is the process of building an algorithm (the model) from datasets using a framework, while inference takes the trained model and uses it to make predictions on new data.
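As a minimal sketch of that distinction, the following fits a one-variable linear model by least squares (training) and then applies it to a new input (inference). The function names `train` and `infer` and the toy dataset are illustrative assumptions, not something the article specifies; real edge models are far larger, but the split between the two phases is the same.

```python
def train(xs, ys):
    """Training: fit y = w * x + b to a dataset by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def infer(model, x):
    """Inference: apply the already-trained parameters to a new input."""
    w, b = model
    return w * x + b


# Training happens once, typically on a powerful machine (toy data here).
model = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Inference happens many times, possibly on a small device at the edge.
prediction = infer(model, 10.0)
print(prediction)
```

Only `infer` and the fitted parameters need to be present on the edge device; the dataset and the training code can stay elsewhere.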

Running AI inference at the edge improves performance: it shortens the inference time and reduces the dependence on network connectivity.
Machine learning or artificial intelligence inference can run in the cloud as well as on a device (hardware). However, when an application needs fast data processing and quick predictions, running inference in the cloud adds the time spent sending data over the network to the inference time, which creates delays in the system. For non-time-critical applications, cloud inference can always do the job, but in a world full of IoT devices and applications that require fast processing, AI inference at the edge solves the problem. With AI inference at the edge, specialized models run at the point of data capture, which in this case is an electronic embedded device.
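The trade-off can be sketched as a simple latency budget. All numbers below are illustrative assumptions chosen for the example, not measurements from the article, and the function names are hypothetical.

```python
def cloud_latency_ms(network_round_trip_ms, server_inference_ms):
    """Cloud inference: the data must travel to the server and back."""
    return network_round_trip_ms + server_inference_ms


def edge_latency_ms(device_inference_ms):
    """Edge inference: the model runs where the data is captured."""
    return device_inference_ms


def meets_deadline(latency_ms, deadline_ms):
    """True if a prediction arrives within the application's time budget."""
    return latency_ms <= deadline_ms


# Illustrative assumptions only, not measured values.
DEADLINE_MS = 33.0  # roughly one frame period at 30 FPS
cloud = cloud_latency_ms(network_round_trip_ms=80.0, server_inference_ms=5.0)
edge = edge_latency_ms(device_inference_ms=15.0)

print(f"cloud: {cloud} ms, meets deadline: {meets_deadline(cloud, DEADLINE_MS)}")
print(f"edge:  {edge} ms, meets deadline: {meets_deadline(edge, DEADLINE_MS)}")
```

Under these assumed numbers the server itself is faster than the device, yet the network round trip alone exceeds the deadline, which is the situation the paragraph above describes.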
Google Edge TPU is Google’s custom-built ASIC, designed to run AI at the edge and targeted at a specific kind of application. When comparing TPUs, CPUs and GPUs, it is important to note that only the TPU is an ASIC, while the other two are not. In addition, the ALUs in a TPU are directly connected to each other, so intermediate results pass from one ALU to the next without going through memory. This lowers the latency of transferring information between them.
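One way to picture directly connected ALUs is a systolic array, the arrangement Google has described for the matrix unit of its TPUs. The simulation below is only a conceptual sketch under that assumption: it is not the Edge TPU's actual design, and the function name `systolic_matmul` and the output-stationary layout are choices made for illustration. Each cell multiplies the two values it receives from its left and upper neighbours, adds the product to a local accumulator, and passes the values on, so operands are read from outside the grid only at its edges.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    Cell (i, j) accumulates C[i][j]. Values of A flow left to right,
    values of B flow top to bottom, one cell per time step.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # per-cell accumulators
    a_reg = [[0] * m for _ in range(n)]  # A value currently held by each cell
    b_reg = [[0] * m for _ in range(n)]  # B value currently held by each cell

    for t in range(n + m + k - 2):
        # Visit cells from the far corner so each cell still sees the
        # value its neighbour held during the previous time step.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i  # row i enters the left edge delayed by i steps
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j  # column j enters the top edge delayed by j steps
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Inside the loop, a cell never indexes `A` or `B` unless it sits on the left or top edge; every other operand comes from a neighbouring cell's register, which is the property the paragraph above attributes to the TPU.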
To meet the growing need to deploy high-quality AI inference at the edge, Coral offers several prototyping and production products with an integrated Google Edge TPU. This small ASIC is built for low-power devices and can execute state-of-the-art mobile vision models such as MobileNet V2 at almost 400 FPS in a power-efficient manner. According to the manufacturer, an individual Edge TPU can perform 4 trillion operations per second (4 TOPS) while using only 2 watts of power. More information on the ASIC and the production products can be found on the manufacturer’s page.
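The quoted figures imply a few derived numbers, worked out below. This is plain arithmetic on the manufacturer's stated values; the energy-per-frame figure additionally assumes the chip draws its full 2 W while running the model, which the article does not state, so treat it as a rough upper bound.

```python
TOPS = 4.0     # trillion operations per second, per the manufacturer
POWER_W = 2.0  # watts, per the manufacturer
FPS = 400.0    # approximate MobileNet V2 frame rate quoted above

tops_per_watt = TOPS / POWER_W                   # efficiency
ms_per_frame = 1000.0 / FPS                      # time budget per frame
mj_per_frame_at_full_power = POWER_W * 1000.0 / FPS  # assumes a constant 2 W draw

print(f"{tops_per_watt} TOPS per watt")
print(f"{ms_per_frame} ms per frame")
print(f"at most about {mj_per_frame_at_full_power} mJ per frame")
```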
- What is the difference between training and inference?
Training builds an algorithm using frameworks and datasets, while inference uses the trained algorithm to make predictions.
- Why run AI inference at the edge?
Running AI inference at the edge reduces inference time, improves performance, and lowers dependence on network connectivity.
- Can AI inference run on the cloud and on device?
Yes, AI inference can run on the cloud as well as on a device (hardware).
- When is cloud inference acceptable?
Cloud inference is acceptable for non-time-critical applications.
- What is Google Edge TPU?
Google Edge TPU is a custom-built ASIC designed to run AI at the edge for specific applications.
- How does a TPU differ from CPUs and GPUs?
TPU is an ASIC while CPUs and GPUs are not; in TPUs ALUs are directly connected without using memory, lowering latency.
- What performance does an Edge TPU offer?
An individual Edge TPU can perform 4 trillion operations per second (4 TOPS) while using about 2 watts of power, per the manufacturer.
- What models can Edge TPU run effectively?
Edge TPU can execute state-of-the-art mobile vision models such as MobileNet V2 at high frame rates.
- Where can I find more information on ASIC and production products?
More information can be found on the manufacturer’s page and Coral prototyping and production product documentation.