Building a Real-Time Object Detection App for Android — From Architecture to Working Code
In the previous blog, we explored the foundational concepts behind Edge AI on Android — the architecture patterns, model selection philosophy, runtime formats, and the high-level pipeline that makes on-device intelligence possible.
We intentionally stopped before writing a single line of code. Understanding why things work the way they do was the priority.
Now, it’s time to build.
In this blog, we’ll take every concept from Part 1 and transform it into a working Android application that detects objects in real-time using your phone’s camera. No cloud APIs. No network requests. Everything runs locally, on-device, in real-time.
By the end of this post, you’ll have an application that:
- Captures live camera frames using CameraX
- Runs YOLOv8 inference using TensorFlow Lite
- Draws bounding boxes with class labels and confidence scores
- Handles all the subtle engineering challenges — image rotation, coordinate mapping, letterboxing, non-maximum suppression — that separate a demo from a real application
Let’s get into it.
The Architecture We’re Building
Before touching code, let’s visualize the complete data flow. This is the pipeline we discussed conceptually in Part 1, now made concrete:
Camera Sensor
|
v
CameraX ImageAnalysis (YUV frame)
|
v
ImageProxy → Bitmap (with rotation correction)
|
v
Letterbox Resize (640x640, gray padding)
|
v
RGB Float Buffer (normalized 0-1)
|
v
TFLite Interpreter (YOLOv8s model)
|
v
Raw Output Tensor [1, 84, 8400]
|
v
Decode: confidence filtering + NMS
|
v
List<Detection> (box coords + class + score)
|
v
OverlayView (canvas drawing on camera preview)
Every stage in this pipeline exists for a reason. Skip one, and the app breaks in subtle ways — boxes appear in wrong positions, objects outside the frame get detected, the UI freezes, or nothing gets detected at all.
Let’s build each stage.
Project Setup
Create the Android Project
Create a new Android project in Android Studio with these settings:
- Language: Kotlin
- Minimum SDK: API 24 (Android 7.0)
- Template: Empty Activity
Dependencies
This is where most tutorials go wrong. They list dependencies without explaining why each one exists. Let’s fix that.
Open app/build.gradle.kts and add:
plugins {
alias(libs.plugins.android.application)
alias(libs.plugins.kotlin.android)
}
android {
namespace = "com.aman.real_timeobjectdetectionapp"
compileSdk = 36
defaultConfig {
applicationId = "com.aman.real_timeobjectdetectionapp"
minSdk = 24
targetSdk = 36
versionCode = 1
versionName = "1.0"
testInstrumentationRunner = "androidx.test.runner.AndroidJUnitRunner"
}
buildTypes {
release {
isMinifyEnabled = false
proguardFiles(
getDefaultProguardFile("proguard-android-optimize.txt"),
"proguard-rules.pro"
)
}
}
compileOptions {
sourceCompatibility = JavaVersion.VERSION_11
targetCompatibility = JavaVersion.VERSION_11
}
kotlinOptions {
jvmTarget = "11"
}
aaptOptions {
noCompress("tflite")
}
}
val cameraxVersion = "1.3.4"
dependencies {
implementation(libs.androidx.core.ktx)
implementation(libs.androidx.appcompat)
implementation(libs.material)
implementation(libs.androidx.activity)
implementation(libs.androidx.constraintlayout)
implementation(libs.androidx.camera.view)
// CameraX — the modern camera API
implementation("androidx.camera:camera-core:${cameraxVersion}")
implementation("androidx.camera:camera-camera2:${cameraxVersion}")
implementation("androidx.camera:camera-lifecycle:${cameraxVersion}")
implementation("androidx.camera:camera-view:${cameraxVersion}")
// TensorFlow Lite — the inference engine
implementation("org.tensorflow:tensorflow-lite:2.14.0")
implementation("org.tensorflow:tensorflow-lite-support:0.4.4")
// GPU acceleration (optional but recommended)
implementation("org.tensorflow:tensorflow-lite-gpu-delegate-plugin:0.4.4")
implementation("org.tensorflow:tensorflow-lite-gpu:2.14.0")
}
A few things worth calling out:
CameraX gives us a lifecycle-aware camera API. It handles the camera hardware, frame delivery, and rotation metadata — things that would take hundreds of lines with the raw Camera2 API.
TensorFlow Lite is the inference runtime. It loads the .tflite model file and runs forward passes on input tensors.
The GPU delegate is optional but impactful. It offloads convolution operations to the phone’s GPU, which can make inference 2-3x faster. This matters when you upgrade from a nano model to a larger one.
One of the most important lines in this entire file is noCompress("tflite"). Without it, Android’s build system compresses the .tflite file inside the APK. When your code tries to memory-map the model at runtime, it reads compressed garbage instead of actual weights. The model will load, the interpreter will initialize, but every inference will produce meaningless output. This is the kind of bug that can cost you hours — the app doesn’t crash, it just silently fails.
The Model
We’re using YOLOv8s (small) — the sweet spot between accuracy and speed for mobile devices.
You might think the nano variant would be the obvious choice for mobile. It’s the smallest, after all. But in practice, YOLOv8s gives you +3.1 mAP over YOLOv8n with only ~30% slower inference. On a modern phone with GPU delegation, that speed difference is barely noticeable. The accuracy difference, however, is very noticeable — especially for smaller objects or crowded scenes.
Exporting the Model
You’ll need Python with Ultralytics installed:
pip install ultralytics onnx onnxslim onnx2tf tensorflow
The export pipeline goes: PyTorch → ONNX → Simplified ONNX → TFLite
from ultralytics import YOLO
model = YOLO('yolov8s.pt')
model.export(format='tflite', imgsz=640)
This produces a yolov8s_float32.tflite file. Rename it to yolov8s.tflite (the name our loading code expects) and place it at:
app/src/main/assets/yolov8s.tflite
Understanding the Model’s Contract
The model’s input/output contract might seem like a detail, but it has massive implications for every line of inference code we write.
Input: [1, 640, 640, 3] — A single 640x640 RGB image, pixel values normalized to 0-1.
Output: [1, 84, 8400] — 8400 predictions, each with 84 values.
Let’s break down those 84 values:
- Indices 0-3: Bounding box coordinates (cx, cy, w, h) — normalized, representing the center and dimensions of the detected box
- Indices 4-83: Class confidence scores for 80 COCO classes — already post-sigmoid, so they’re probabilities between 0 and 1
The model does NOT produce human-readable detections. It produces 8400 raw predictions, most of which are noise. Our job is to filter, decode, and render only the meaningful ones.
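Don’t take this contract on faith: different export settings can silently change the shapes. A quick sanity check (a sketch using the standard TFLite Interpreter API, run once right after the interpreter is created) is to log what the model actually expects:

// Verify the model contract at runtime.
// Expected here: input [1, 640, 640, 3], output [1, 84, 8400].
val inputShape = interpreter?.getInputTensor(0)?.shape()
val outputShape = interpreter?.getOutputTensor(0)?.shape()
Log.d("ModelContract",
    "input=${inputShape?.contentToString()} output=${outputShape?.contentToString()}")

If the logged shapes differ from what your decoding code assumes, every downstream index calculation will be wrong.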
Project Structure
Before writing code, let’s understand how the files are organized:
app/src/main/
├── assets/
│ └── yolov8s.tflite # The ML model
├── java/
│ ├── com/aman/.../
│ │ └── MainActivity.kt # Entry point, permissions, initialization
│ ├── camera/
│ │ ├── CameraController.kt # CameraX setup and lifecycle
│ │ └── FrameAnalyzer.kt # Bridges camera frames to detector
│ ├── core/
│ │ └── ImageUtils.kt # Image conversion utilities
│ ├── detection/
│ │ ├── Detection.kt # Data class for a single detection
│ │ └── YoloDetector.kt # The inference engine
│ └── ui/
│ └── OverlayView.kt # Custom view for drawing boxes
└── res/
└── layout/
└── activity_main.xml # Camera preview + overlay layout
Each file has a single responsibility. The camera doesn’t know about detection. The detector doesn’t know about the UI. The overlay doesn’t know about the camera. This separation isn’t academic — it’s what lets you swap a YOLOv8 model for a MobileNet SSD without touching your camera code.
Implementation — Stage by Stage
Stage 1: The Layout
<?xml version="1.0" encoding="utf-8"?>
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
xmlns:tools="http://schemas.android.com/tools"
android:id="@+id/rootContainer"
android:layout_width="match_parent"
android:layout_height="match_parent"
android:background="#000000">
<FrameLayout
android:layout_width="match_parent"
android:layout_height="match_parent"
android:layout_margin="16dp">
<androidx.camera.view.PreviewView
android:id="@+id/previewView"
android:layout_width="match_parent"
android:layout_height="match_parent" />
<com.aman.real_timeobjectdetectionapp.ui.OverlayView
android:id="@+id/overlayView"
android:layout_width="match_parent"
android:layout_height="match_parent"
tools:ignore="MissingClass" />
</FrameLayout>
</FrameLayout>
Two views stacked in a FrameLayout. The PreviewView shows the camera feed. The OverlayView sits on top, transparent, drawing bounding boxes on a canvas. They share the same dimensions — this is critical for coordinate alignment.
The outer FrameLayout with a black background and the inner one with 16dp margin gives us a clean border around the camera view.
Stage 2: The Detection Data Class
package detection
data class Detection(
val left: Float,
val top: Float,
val right: Float,
val bottom: Float,
val score: Float,
val classId: Int,
val className: String
)
Simple, but worth calling out: the coordinates here are in original image pixel space — not model space, not view space. The detector maps from model coordinates back to image coordinates. The overlay maps from image coordinates to screen coordinates. This two-step mapping is what keeps boxes accurate regardless of screen size or aspect ratio.
Stage 3: The Image Utility
package core
import android.graphics.Bitmap
import android.graphics.Matrix
import androidx.camera.core.ImageProxy
object ImageUtils {
fun imageProxyToBitmap(image: ImageProxy): Bitmap {
val rawBitmap = image.toBitmap()
val rotation = image.imageInfo.rotationDegrees
return if (rotation != 0) {
val matrix = Matrix()
matrix.postRotate(rotation.toFloat())
Bitmap.createBitmap(
rawBitmap, 0, 0,
rawBitmap.width, rawBitmap.height,
matrix, true
)
} else {
rawBitmap
}
}
}
This looks deceptively simple, but there’s a lot happening under the hood.
Why image.toBitmap() instead of manual YUV conversion? CameraX delivers frames in YUV_420_888 format. The classic approach is to manually extract Y, U, V planes and construct an NV21 byte array. But this breaks on many devices because the UV plane interleaving varies — some phones use pixel stride 1, others use stride 2 with overlapping buffers. ImageProxy.toBitmap() (available since CameraX 1.3.0) handles all of this internally.
Why rotation? On most Android phones, the back camera sensor is physically mounted at 90 degrees. Without rotation correction, you’d feed the model a sideways image. The model would still run, but it would be trying to detect objects in an orientation it was never trained on.
Stage 4: The YoloDetector — The Heart of the App
This is the largest and most critical file. Let’s build it piece by piece.
package detection
import android.content.Context
import android.graphics.Bitmap
import android.graphics.Canvas
import android.graphics.Color
import android.util.Log
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
class YoloDetector(private val context: Context) {
companion object {
private const val TAG = "YoloDetector"
private const val INPUT_SIZE = 640
private const val CONF_THRESHOLD = 0.45f
private const val IOU_THRESHOLD = 0.45f
private val LABELS = arrayOf(
"person", "bicycle", "car", "motorcycle", "airplane",
"bus", "train", "truck", "boat", "traffic light",
"fire hydrant", "stop sign", "parking meter", "bench",
"bird", "cat", "dog", "horse", "sheep", "cow",
"elephant", "bear", "zebra", "giraffe", "backpack",
"umbrella", "handbag", "tie", "suitcase", "frisbee",
"skis", "snowboard", "sports ball", "kite",
"baseball bat", "baseball glove", "skateboard",
"surfboard", "tennis racket", "bottle", "wine glass",
"cup", "fork", "knife", "spoon", "bowl", "banana",
"apple", "sandwich", "orange", "broccoli", "carrot",
"hot dog", "pizza", "donut", "cake", "chair", "couch",
"potted plant", "bed", "dining table", "toilet", "tv",
"laptop", "mouse", "remote", "keyboard", "cell phone",
"microwave", "oven", "toaster", "sink", "refrigerator",
"book", "clock", "vase", "scissors", "teddy bear",
"hair drier", "toothbrush"
)
}
private var interpreter: Interpreter? = null
private var gpuDelegate: GpuDelegate? = null
The LABELS array maps class indices (0-79) to human-readable names. These are the 80 COCO dataset classes that YOLOv8 was trained on.
The confidence threshold (0.45) and IoU threshold (0.45) were chosen through experimentation. Lower confidence thresholds catch more objects but also more false positives. Lower IoU thresholds are more aggressive at removing overlapping boxes.
Loading the Model
fun loadModel(modelName: String) {
val model = loadModelFile(modelName)
val options = Interpreter.Options()
try {
gpuDelegate = GpuDelegate()
options.addDelegate(gpuDelegate!!)
Log.d(TAG, "GPU delegate enabled")
} catch (e: Throwable) {
Log.w(TAG, "GPU delegate not available, using CPU: ${e.message}")
gpuDelegate = null
}
interpreter = Interpreter(model, options)
Log.d(TAG, "TFLite Interpreter initialized")
}
private fun loadModelFile(modelName: String): MappedByteBuffer {
val fileDescriptor = context.assets.openFd(modelName)
val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
val fileChannel = inputStream.channel
return fileChannel.map(
FileChannel.MapMode.READ_ONLY,
fileDescriptor.startOffset,
fileDescriptor.declaredLength
)
}
Notice we catch Throwable, not Exception. The GPU delegate can throw NoClassDefFoundError (which is an Error, not an Exception) on devices where the GPU libraries aren’t available. Catching only Exception would crash the app. This way, we gracefully fall back to CPU.
The model is loaded via memory-mapping (MappedByteBuffer), which is significantly more efficient than reading the entire file into a byte array. The OS can page in only the parts of the model that are actually needed.
The Inference Pipeline
This is where the magic happens:
fun runInference(bitmap: Bitmap): List<Detection> {
val results = mutableListOf<Detection>()
if (interpreter == null) return results
// Letterbox: preserve aspect ratio + gray padding
val imgW = bitmap.width
val imgH = bitmap.height
val scale = minOf(INPUT_SIZE.toFloat() / imgW,
INPUT_SIZE.toFloat() / imgH)
val newW = (imgW * scale).toInt()
val newH = (imgH * scale).toInt()
val padLeft = (INPUT_SIZE - newW) / 2
val padTop = (INPUT_SIZE - newH) / 2
val letterboxed = Bitmap.createBitmap(
INPUT_SIZE, INPUT_SIZE, Bitmap.Config.ARGB_8888
)
val canvas = Canvas(letterboxed)
canvas.drawColor(Color.rgb(114, 114, 114))
val scaled = Bitmap.createScaledBitmap(bitmap, newW, newH, true)
canvas.drawBitmap(scaled, padLeft.toFloat(), padTop.toFloat(), null)
Why letterboxing instead of stretching?
You might think: “The model expects 640x640, so just resize the image to 640x640.” And that would work — the model would run, produce output, give you detections. But those detections would be less accurate than they should be.
The reason is subtle: YOLOv8 was trained with letterboxing. During training, images were resized to fit within 640x640 while preserving their aspect ratio, with gray padding filling the remaining space. If you stretch instead, you’re distorting the image in a way the model never saw during training. Faces become wider. Cars become taller. The model’s learned features don’t match what it sees.
The gray color (114, 114, 114) isn’t arbitrary — it’s the standard YOLO padding color used during training.
// Convert to float RGB buffer
val inputBuffer = ByteBuffer.allocateDirect(
1 * INPUT_SIZE * INPUT_SIZE * 3 * 4
)
inputBuffer.order(ByteOrder.nativeOrder())
for (y in 0 until INPUT_SIZE) {
for (x in 0 until INPUT_SIZE) {
val px = letterboxed.getPixel(x, y)
inputBuffer.putFloat(((px shr 16) and 0xFF) / 255f)
inputBuffer.putFloat(((px shr 8) and 0xFF) / 255f)
inputBuffer.putFloat((px and 0xFF) / 255f)
}
}
// Run inference
val output = Array(1) { Array(84) { FloatArray(8400) } }
interpreter?.run(inputBuffer, output)
Each pixel is extracted in RGB order (not BGR — this matters) and normalized to 0-1 by dividing by 255. The model expects [1, 640, 640, 3] in NHWC format, and that’s exactly what we write into the ByteBuffer: for each row, for each column, three channel values.
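One optional optimization worth knowing (a sketch, not part of the code above): Bitmap.getPixel() makes a native call per pixel, which is 409,600 calls per frame. Bitmap.getPixels() copies the whole frame into an IntArray in one call, and because it returns pixels in row-major order, the NHWC layout is preserved:

// Bulk-read all pixels once, then convert in a tight loop.
val pixels = IntArray(INPUT_SIZE * INPUT_SIZE)
letterboxed.getPixels(pixels, 0, INPUT_SIZE, 0, 0, INPUT_SIZE, INPUT_SIZE)
for (px in pixels) {
    inputBuffer.putFloat(((px shr 16) and 0xFF) / 255f) // R
    inputBuffer.putFloat(((px shr 8) and 0xFF) / 255f)  // G
    inputBuffer.putFloat((px and 0xFF) / 255f)          // B
}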
Decoding the Output
Now comes the part that trips up most developers:
for (i in 0 until 8400) {
var bestClass = -1
var bestScore = 0f
for (c in 4 until 84) {
val score = output[0][c][i]
if (score > bestScore) {
bestScore = score
bestClass = c - 4
}
}
if (bestScore > CONF_THRESHOLD) {
val cx = output[0][0][i]
val cy = output[0][1][i]
val w = output[0][2][i]
val h = output[0][3][i]
// Convert from model space back to original image space
val left = ((cx - w / 2f) * INPUT_SIZE - padLeft) / scale
val top = ((cy - h / 2f) * INPUT_SIZE - padTop) / scale
val right = ((cx + w / 2f) * INPUT_SIZE - padLeft) / scale
val bottom = ((cy + h / 2f) * INPUT_SIZE - padTop) / scale
val className = if (bestClass in LABELS.indices)
LABELS[bestClass] else "class_$bestClass"
results.add(
Detection(left, top, right, bottom,
bestScore, bestClass, className)
)
}
}
return applyNms(results)
}
For each of the 8400 predictions, we:
- Find the best class — scan indices 4-83 for the highest score. No sigmoid needed — the model already applies it internally.
- Filter by confidence — only keep predictions above 0.45.
- Decode the bounding box — convert from normalized center-width-height to left-top-right-bottom in original image coordinates.
The coordinate transformation involving padLeft, padTop, and scale reverses the letterbox operation. The model predicted coordinates in 640x640 letterboxed space. We need them in original image pixel space. The math:
original_coord = (model_coord * 640 - pad_offset) / letterbox_scale
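To make that concrete, here are hypothetical numbers for a 1080x1440 portrait frame:

// Hypothetical 1080x1440 frame (width x height) after rotation correction
val scale = minOf(640f / 1080, 640f / 1440)  // 640/1440 ≈ 0.444
val newW = (1080 * scale).toInt()            // 480
val padLeft = (640 - newW) / 2               // 80
// A model-space cx of 0.5 maps back to (0.5 * 640 - 80) / 0.444 ≈ 540,
// exactly the horizontal center of the 1080-pixel-wide original frame.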
Non-Maximum Suppression
Without NMS, you’d get dozens of overlapping boxes for every object:
private fun applyNms(
detections: List<Detection>
): List<Detection> {
val sorted = detections.sortedByDescending { it.score }
val selected = mutableListOf<Detection>()
val active = BooleanArray(sorted.size) { true }
for (i in sorted.indices) {
if (!active[i]) continue
selected.add(sorted[i])
for (j in i + 1 until sorted.size) {
if (!active[j]) continue
if (iou(sorted[i], sorted[j]) > IOU_THRESHOLD) {
active[j] = false
}
}
}
return selected
}
private fun iou(a: Detection, b: Detection): Float {
val x1 = maxOf(a.left, b.left)
val y1 = maxOf(a.top, b.top)
val x2 = minOf(a.right, b.right)
val y2 = minOf(a.bottom, b.bottom)
val intersection = maxOf(0f, x2 - x1) * maxOf(0f, y2 - y1)
val areaA = (a.right - a.left) * (a.bottom - a.top)
val areaB = (b.right - b.left) * (b.bottom - b.top)
val union = areaA + areaB - intersection
return if (union > 0f) intersection / union else 0f
}
fun close() {
interpreter?.close()
gpuDelegate?.close()
}
}
NMS works by sorting detections by confidence (highest first), then for each detection, suppressing all lower-confidence detections that overlap significantly (IoU > 0.45). This is O(n²) in the worst case, but since we’ve already filtered by confidence threshold, typically fewer than 100 detections reach NMS — negligible compared to the 50-200ms inference time.
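One design note: the routine above is class-agnostic, meaning a high-confidence box can suppress an overlapping box of a different class (a person suppressing the backpack they’re wearing, say). If that matters for your use case, a class-aware variant is a short sketch reusing applyNms:

// Class-aware NMS sketch: boxes only suppress boxes of the same class.
private fun applyNmsPerClass(detections: List<Detection>): List<Detection> =
    detections.groupBy { it.classId }
        .values
        .flatMap { applyNms(it) }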
Stage 5: The Camera Controller
package com.aman.real_timeobjectdetectionapp.camera
import android.content.Context
import androidx.camera.core.*
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.camera.view.PreviewView
import androidx.core.content.ContextCompat
import androidx.lifecycle.LifecycleOwner
import com.aman.real_timeobjectdetectionapp.ui.OverlayView
import detection.YoloDetector
import java.util.concurrent.Executors
class CameraController(
private val context: Context,
private val previewView: PreviewView,
private val lifecycleOwner: LifecycleOwner,
private val yoloDetector: YoloDetector,
private val overlayView: OverlayView
) {
private val analysisExecutor =
Executors.newSingleThreadExecutor()
fun startCamera() {
previewView.scaleType =
PreviewView.ScaleType.FIT_CENTER
val cameraProviderFuture =
ProcessCameraProvider.getInstance(context)
cameraProviderFuture.addListener({
val cameraProvider = cameraProviderFuture.get()
val preview = Preview.Builder()
.setTargetAspectRatio(AspectRatio.RATIO_4_3)
.build()
preview.setSurfaceProvider(
previewView.surfaceProvider
)
val imageAnalysis = ImageAnalysis.Builder()
.setTargetAspectRatio(AspectRatio.RATIO_4_3)
.setBackpressureStrategy(
ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST
)
.build()
imageAnalysis.setAnalyzer(
analysisExecutor,
FrameAnalyzer(yoloDetector, overlayView)
)
val cameraSelector =
CameraSelector.DEFAULT_BACK_CAMERA
cameraProvider.unbindAll()
cameraProvider.bindToLifecycle(
lifecycleOwner,
cameraSelector,
preview,
imageAnalysis
)
}, ContextCompat.getMainExecutor(context))
}
}
There are three critical engineering decisions here:
1. FIT_CENTER scale type. The default is FILL_CENTER, which crops the camera image to fill the screen. This means the preview shows a different field of view than what ImageAnalysis processes. You’d detect objects that aren’t visible on screen, and bounding boxes would be misaligned. FIT_CENTER ensures the preview shows the full image — what you see is exactly what the model processes.
2. Matching aspect ratios. Both Preview and ImageAnalysis are locked to RATIO_4_3. Without this, CameraX might choose different resolutions for each, causing the same field-of-view mismatch problem.
3. Background executor for analysis. YOLOv8 inference takes 50-200ms per frame. If you run this on the main thread (using getMainExecutor), the UI freezes. Executors.newSingleThreadExecutor() runs inference on a background thread. The STRATEGY_KEEP_ONLY_LATEST backpressure strategy means if inference is still running when a new frame arrives, the old frame is dropped. This prevents frame queuing and keeps the app responsive.
Stage 6: The Frame Analyzer
package com.aman.real_timeobjectdetectionapp.camera
import android.annotation.SuppressLint
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import com.aman.real_timeobjectdetectionapp.ui.OverlayView
import core.ImageUtils
import detection.YoloDetector
class FrameAnalyzer(
private val yoloDetector: YoloDetector,
private val overlayView: OverlayView
) : ImageAnalysis.Analyzer {
@SuppressLint("UnsafeOptInUsageError")
override fun analyze(image: ImageProxy) {
val bitmap = ImageUtils.imageProxyToBitmap(image)
val detections = yoloDetector.runInference(bitmap)
overlayView.post {
overlayView.setDetections(
detections, bitmap.width, bitmap.height
)
}
image.close()
}
}
This is the bridge between camera and detection. The post {} call is important — analyze() runs on our background executor, but setDetections triggers invalidate() which must happen on the main thread. View.post() queues the work on the main thread’s handler.
Also note: image.close() must be called, otherwise CameraX stops delivering new frames. It’s a common source of “camera freezes after first frame” bugs.
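A more defensive version of analyze() (a sketch) wraps the work in try/finally, so an exception during inference can’t skip the close() call and freeze the camera:

override fun analyze(image: ImageProxy) {
    try {
        val bitmap = ImageUtils.imageProxyToBitmap(image)
        val detections = yoloDetector.runInference(bitmap)
        overlayView.post {
            overlayView.setDetections(detections, bitmap.width, bitmap.height)
        }
    } finally {
        image.close() // always release the frame, even if inference throws
    }
}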
Stage 7: The Overlay View
package com.aman.real_timeobjectdetectionapp.ui
import android.content.Context
import android.graphics.Canvas
import android.graphics.Color
import android.graphics.Paint
import android.util.AttributeSet
import android.view.View
import detection.Detection
class OverlayView @JvmOverloads constructor(
context: Context,
attrs: AttributeSet? = null
) : View(context, attrs) {
private val boxPaint = Paint().apply {
color = Color.RED
strokeWidth = 6f
style = Paint.Style.STROKE
}
private val textPaint = Paint().apply {
color = Color.WHITE
textSize = 40f
style = Paint.Style.FILL
isAntiAlias = true
}
private val textBgPaint = Paint().apply {
color = Color.RED
style = Paint.Style.FILL
}
private var detections: List<Detection> = emptyList()
private var imageWidth: Int = 1
private var imageHeight: Int = 1
fun setDetections(
list: List<Detection>,
imgWidth: Int,
imgHeight: Int
) {
detections = list
imageWidth = imgWidth
imageHeight = imgHeight
invalidate()
}
override fun onDraw(canvas: Canvas) {
super.onDraw(canvas)
val viewW = width.toFloat()
val viewH = height.toFloat()
val scale = minOf(
viewW / imageWidth,
viewH / imageHeight
)
val offsetX = (viewW - imageWidth * scale) / 2f
val offsetY = (viewH - imageHeight * scale) / 2f
for (det in detections) {
val left = det.left * scale + offsetX
val top = det.top * scale + offsetY
val right = det.right * scale + offsetX
val bottom = det.bottom * scale + offsetY
canvas.drawRect(left, top, right, bottom, boxPaint)
val label = "${det.className} " +
"${(det.score * 100).toInt()}%"
val textWidth = textPaint.measureText(label)
val textHeight = textPaint.textSize
canvas.drawRect(
left, top - textHeight - 4f,
left + textWidth + 8f, top, textBgPaint
)
canvas.drawText(
label, left + 4f, top - 4f, textPaint
)
}
}
}
The coordinate mapping here mirrors the FIT_CENTER behavior of the PreviewView. We compute the same scale and offset that CameraX uses to fit the image inside the view, then apply it to our detection coordinates. This is why boxes align perfectly with the objects on screen.
Each detection gets a red bounding box and a label tag showing the class name and confidence percentage.
Stage 8: The MainActivity
package com.aman.real_timeobjectdetectionapp
import android.Manifest
import android.content.pm.PackageManager
import android.os.Bundle
import androidx.activity.result.contract.ActivityResultContracts
import androidx.appcompat.app.AppCompatActivity
import androidx.camera.view.PreviewView
import androidx.core.content.ContextCompat
import com.aman.real_timeobjectdetectionapp.camera.CameraController
import com.aman.real_timeobjectdetectionapp.ui.OverlayView
import detection.YoloDetector
class MainActivity : AppCompatActivity() {
private lateinit var previewView: PreviewView
private lateinit var overlayView: OverlayView
private var cameraController: CameraController? = null
private lateinit var yoloDetector: YoloDetector
private val requestPermissionLauncher =
registerForActivityResult(
ActivityResultContracts.RequestPermission()
) { isGranted ->
if (isGranted) {
onCameraPermissionGranted()
}
}
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_main)
previewView = findViewById(R.id.previewView)
overlayView = findViewById(R.id.overlayView)
checkCameraPermission()
}
private fun checkCameraPermission() {
if (ContextCompat.checkSelfPermission(
this, Manifest.permission.CAMERA
) == PackageManager.PERMISSION_GRANTED
) {
onCameraPermissionGranted()
} else {
requestPermissionLauncher.launch(
Manifest.permission.CAMERA
)
}
}
private fun onCameraPermissionGranted() {
yoloDetector = YoloDetector(this)
yoloDetector.loadModel("yolov8s.tflite")
cameraController = CameraController(
context = this,
previewView = previewView,
lifecycleOwner = this,
yoloDetector = yoloDetector,
overlayView = overlayView
)
cameraController?.startCamera()
}
}
Don’t forget the camera permission in AndroidManifest.xml:
<uses-permission android:name="android.permission.CAMERA" />
The flow is straightforward: check permission → load model → start camera. The registerForActivityResult API is the modern way to handle permission requests — no more onRequestPermissionsResult overrides.
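One thing the Activity above doesn’t do yet is release native resources. Since YoloDetector exposes close(), it’s worth wiring into the lifecycle. A minimal addition (assuming the fields declared above):

override fun onDestroy() {
    super.onDestroy()
    // Free the TFLite interpreter and GPU delegate along with the Activity.
    if (::yoloDetector.isInitialized) yoloDetector.close()
}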
The Pitfalls — What Took Us Hours So You Don’t Have To
Building this application wasn’t a straight line. Here are the real engineering challenges we encountered and solved:
1. The Silent Model Failure
Without noCompress("tflite") in build.gradle, the model file gets compressed in the APK. Memory-mapping reads garbage data. The interpreter initializes successfully, runs inference without crashing, but produces meaningless output — max confidence scores of 0.01 instead of 0.8+. No error. No exception. Just silence.
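A quick way to catch this in the field: log the highest raw class score per frame. Here’s a hypothetical debug snippet you could drop into runInference right after interpreter?.run(...):

// Debug aid (hypothetical): a healthy model logs ~0.7-0.9 with an obvious
// object in view; a compressed model stays stuck near 0.01 on every frame.
var maxScore = 0f
for (c in 4 until 84) for (i in 0 until 8400) {
    maxScore = maxOf(maxScore, output[0][c][i])
}
Log.d(TAG, "max raw class score = $maxScore")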
2. The Double-Sigmoid Trap
Some YOLOv8 TFLite exports include sigmoid in the model graph. If you apply sigmoid again in your post-processing code, you compress already-valid probabilities: a score of 0.001 becomes sigmoid(0.001) = 0.5, making everything look like a detection. The symptom: 8400 out of 8400 predictions pass the confidence threshold.
3. The YUV Conversion Minefield
The classic approach of manually constructing NV21 from YUV planes breaks on devices with pixel stride 2 (most modern phones). The U and V buffers overlap in memory, and naive concatenation produces corrupted color data. Use ImageProxy.toBitmap() instead.
4. The Invisible Camera Mismatch
PreviewView’s default FILL_CENTER mode crops the camera image to fill the screen. But ImageAnalysis gets the full uncropped sensor image. The model detects objects that aren’t visible in the preview. Bounding boxes shift left or right. The fix: FIT_CENTER + matched aspect ratios.
5. The Main Thread Freeze
Running inference on the main executor makes the app unresponsive. YOLOv8 takes 50-200ms per frame. On the main thread, that’s 50-200ms of UI freeze per frame. Move analysis to a background executor.
What You’ve Built
Let’s take a step back and appreciate what just happened. You built an application that:
- Captures 4:3 camera frames at sensor resolution
- Corrects for physical sensor rotation
- Letterbox-resizes to 640x640 preserving aspect ratio
- Normalizes pixel data to float32 RGB tensors
- Runs a 43MB neural network with 11.2 million parameters
- Decodes 8400 predictions per frame
- Filters by confidence and applies non-maximum suppression
- Maps coordinates from model space → image space → screen space
- Renders labeled bounding boxes in real-time
And all of this happens on-device, with zero network latency, processing every frame the camera captures.
This is Edge AI. Not as a buzzword. As working code.
What’s Next
This implementation covers the core pipeline, but there’s much more to explore:
- Model quantization — INT8 or Float16 models that are 2-4x smaller and faster, with minimal accuracy loss
- NNAPI delegation — leveraging dedicated neural processing units (NPUs) on supported devices
- Custom model training — fine-tuning YOLOv8 on your own dataset for domain-specific detection
- Multi-model pipelines — combining detection with classification, segmentation, or pose estimation
- Performance profiling — measuring and optimizing latency across the entire pipeline
The foundation is solid. Every improvement from here builds on the architecture we’ve established.
The complete source code for this project is available on GitHub. If you found this useful, consider sharing it with someone who’s trying to build their first Edge AI application.
Until next time — keep building.
— Aman