From Image Upload to Multilingual Insights: Creating an Object Detection and Translation System

I’m a Software Developer passionate about writing on software development.

Introduction

Language barriers don't just exist between people; they also exist between people and the world around them. I realized this when thinking about how someone who encounters an unfamiliar object in a photo might have no way to even name it, let alone look it up in their native language.

That problem stuck with me, and I built something to solve it: an end-to-end system that detects objects in any uploaded image and instantly translates them into five languages using YOLOv8 and Google Translate.

Experiences That Led to Working on the Detection System

I have always believed that anyone, regardless of their field, should have the ability to discover new concepts or understand how familiar concepts are referred to in other languages. This conviction inspired me to develop an object detection and translation system.

I noticed how language barriers could create misunderstandings. This inspired me to find ways to bridge these gaps. With a passion for technology, I discovered the potential of machine learning and artificial intelligence to break down these barriers.

Prerequisites

This system will be built using Golang and Python. While knowing these languages can be advantageous, it’s not required. I’ll guide you through every step of the process.

Tools and technologies

I built the project using the following tools and technologies:

  • Golang

  • Python

  • YOLOv8: for object detection

  • Google Translate API: for language translation

  • Cloudinary: for uploading images to the cloud

Workflow

The system follows a simple three-step flow:

Upload: Select any image from your device. The frontend sends it to the Golang backend, which encodes it as a base64 string and forwards it to the Python Flask server.

Detect: YOLOv8 processes the image, identifies every object it recognizes, and returns an annotated image with bounding boxes alongside a list of detected objects.

Translate: Choose one of five languages (Arabic, Chinese, French, Russian, or Spanish), and the detected object names are instantly translated and displayed.

The Frontend

The user interface is built using HTML for structure and Tailwind CSS for styling. Users can upload images and select languages for translation through this interface.

JavaScript handles user interactions and sends the image data to the backend server.

The Backend

At the backend, the Golang server receives the image, reads it into a buffer, encodes it as a base64 string, and sends that string in a POST request to the Flask server.

The Flask server receives the base64 string, converts it to binary image data, opens it as an image object, and then sends this image object to the model for detection.

When detection is complete, the annotated image is stored in a buffer and then uploaded to Cloudinary. The final step is to send the list of annotations and the Cloudinary URL of the annotated image as a JSON response.
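As a rough illustration of this hand-off, the sketch below mimics both sides of the base64 transport in Python. The `image` key matches the DetectionRequest type used later in this post; the function names and the sample bytes are made up for the example, and the real project performs the encoding step in Go.

```python
import base64
import json


def build_detection_request(image_bytes: bytes) -> str:
    """Encode raw image bytes as base64 and wrap them in a JSON body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded})


def read_detection_request(body: str) -> bytes:
    """Recover the original image bytes from the JSON body."""
    payload = json.loads(body)
    return base64.b64decode(payload["image"])


# Stand-in bytes; a real request would carry the uploaded file's contents.
sample = b"\x89PNG\r\n\x1a\n fake image data"
body = build_detection_request(sample)
restored = read_detection_request(body)
print(restored == sample)  # True
```

Base64 makes binary data safe to embed in JSON, at the cost of roughly a one-third increase in payload size.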

Setting up the environment

You need Golang and Python installed on your computer. You can download Golang from here and Python by following this link. If both are installed properly, running python --version and go version in your terminal prints the installed versions. My Python version is 3.12.3 and my Golang version is 1.22.1.

Setting up my Python server

In my new, empty directory, I created a Python virtual environment and activated it. I then created my Flask application in that directory.

Create virtual environment and Flask application

Check here to see how to create a virtual environment and a Flask application.

Setting up my Golang server

I created a new folder for the Golang server, initialized the Go project, and then created a server.

How to initialize a Golang project

Check here to learn how to initialize a Golang project and create a server.

Configuring YOLOv8

YOLOv8 (You Only Look Once, version 8) is a state-of-the-art object detection model known for its speed and accuracy. Unlike traditional detection approaches that scan an image multiple times, YOLO processes the entire image in a single pass, making it ideal for a responsive, real-time system like this one.

Getting it configured in Flask is surprisingly simple. Add this to your Flask application:

from ultralytics import YOLO
model = YOLO("yolov8n.pt")

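These two lines import the library and load the pretrained yolov8n model, downloading the weights if they are not already present locally. yolov8n is the "nano" variant, the smallest and fastest model in the YOLOv8 family; it trades some accuracy for speed, which suits general object detection in a responsive web app. Once loaded into memory, the model is ready to process images.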

Setting Up Google Translate API

You need to set up a Google Cloud account, enable the Cloud Translation API, and create a service account.

How to set up the Cloud Translation API

Watch this video to learn how to set up the Cloud Translation API.

Building the Object detection system

The Flask app receives the image, sends it to the model for detection, and produces the annotated image together with JSON containing the annotations and their confidence scores.

Dependencies we need:

from ultralytics import YOLO
from flask import Flask, request, jsonify
import base64
from PIL import Image
from io import BytesIO
from dotenv import load_dotenv
import cloudinary
from cloudinary import CloudinaryImage
import cloudinary.uploader
import cloudinary.api

Load the environment variables with dotenv, load the pretrained YOLOv8 model, and configure the Cloudinary client to use secure URLs (HTTPS):

load_dotenv()
model = YOLO('./yolov8n.pt')
config = cloudinary.config(secure=True)

The POST route receives the base64 string, decodes it, opens the result as an image, and sends it to the model for detection. It then initializes a buffer, saves the annotated image to that buffer, and uploads the image from the buffer to Cloudinary.

We then iterate over the detection results and extract the bounding boxes, confidence scores, and class IDs for each detected object. These are converted to NumPy arrays, and zip() pairs each box with its score and class ID.

Finally, we return a JSON response containing the detections and the URL of the annotated image stored on Cloudinary.
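The pairing step can be sketched without the model itself. The example below is a minimal illustration, not the project's actual endpoint: plain Python lists stand in for the NumPy arrays, `class_names` is a hypothetical id-to-label mapping, the box coordinate order is assumed, and the URL is a placeholder. The keys `class`, `confidence`, `box`, `detections`, and `annotated_image` mirror the Go types defined in the next section.

```python
def pair_detections(boxes, scores, class_ids, class_names):
    """Zip parallel per-object lists into one dict per detected object."""
    detections = []
    for box, score, class_id in zip(boxes, scores, class_ids):
        detections.append({
            "class": class_names[int(class_id)],
            "confidence": float(score),
            # Rounded to integers to fit the Go side's [4]int box field;
            # the x1, y1, x2, y2 order is an assumption.
            "box": [int(round(v)) for v in box],
        })
    return detections


# Hypothetical values standing in for one image's detection results.
boxes = [[10.2, 20.7, 110.0, 220.4], [50.0, 60.0, 150.0, 160.0]]
scores = [0.91, 0.58]
class_ids = [0.0, 16.0]
class_names = {0: "person", 16: "dog"}

response = {
    "detections": pair_detections(boxes, scores, class_ids, class_names),
    "annotated_image": "https://example.com/annotated.jpg",  # placeholder URL
}
print(response["detections"][0])
```

Because zip() stops at the shortest input, the three lists must stay the same length for every object to appear in the response.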

Working with the Image and Translation System In Golang

If you remember, our Golang server sends a base64 string to our Python server and expects the annotated image and its annotations in return.

The packages we need:

import (
	"bytes"
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"

	"cloud.google.com/go/translate"
	"github.com/gin-gonic/gin"
	"github.com/go-resty/resty/v2"
	"github.com/joho/godotenv"
	"golang.org/x/text/language"
	"google.golang.org/api/option"
)

Defined types:

type Detection struct {
	Class      string  `json:"class"`
	Confidence float64 `json:"confidence"`
	Box        [4]int  `json:"box"`
}

type PythonResponse struct {
	Detections     []Detection `json:"detections"`
	AnnotatedImage string      `json:"annotated_image"`
}

type DetectionRequest struct {
	Image string `json:"image"`
}

type TranslateRequest struct {
	Detections []string `json:"detections"`
	Lang       string   `json:"lang"`
}

var translateClient *translate.Client

Sending base64 String to Flask server

Once the image file is received, it is read into a buffer and converted to base64 with this code:

r.POST("/upload", func(c *gin.Context) {
		file, err := c.FormFile("image")
		if err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": "Image upload failed"})
			return
		}

		fileData, err := file.Open()
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to open image file"})
			return
		}
		defer fileData.Close()

		buffer := bytes.NewBuffer(nil)
		if _, err := io.Copy(buffer, fileData); err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read image file"})
			return
		}

		encodedImage := base64.StdEncoding.EncodeToString(buffer.Bytes())
		detections, err := detectObjects(encodedImage)
		if err != nil {
			fmt.Println(err)
			c.JSON(http.StatusInternalServerError, gin.H{"error": "Object detection failed"})
			return
		}
		c.JSON(http.StatusOK, detections)
	})

The detectObjects function then sends the base64 string to the Flask server and parses the annotated image URL and the annotations from the response:

func detectObjects(imageData string) ([]PythonResponse, error) {
	client := resty.New()
	resp, err := client.R().
		SetHeader("Content-Type", "application/json").
		SetBody(DetectionRequest{Image: imageData}).
		Post("http://localhost:5000/detect")

	if err != nil {
		return nil, err
	}

	var pyResp PythonResponse
	if err := json.Unmarshal(resp.Body(), &pyResp); err != nil {
		return nil, err
	}

	return []PythonResponse{pyResp}, nil
}

Setting up Google translate

Setting up Google Translate in Golang requires a bit more ceremony than the Python side, but it's worth understanding each step because it reflects how production-grade Go applications handle external service authentication.

There are three things happening here:

Loading environment variables: Using godotenv, the service account credentials stored in the .env file are loaded into the application. Keeping credentials out of your codebase is a non-negotiable security practice.

Creating a Context: In Go, a context is how you manage the lifecycle of operations. It allows you to pass deadlines, cancellation signals, and request-scoped values across API boundaries and between goroutines. Think of it as a way of telling your application "here are the rules for how long this operation should run and when to stop."

Building the client: The Google Cloud credentials JSON is retrieved from the environment, unmarshalled into a Go map, converted back to a byte slice, and passed into the translate.NewClient() function. This gives you an authenticated client ready to make translation requests.

func setupGoogleTranslate() {
	if err := godotenv.Load(); err != nil {
		fmt.Printf("Error loading .env file: %v\n", err)
		os.Exit(1)
	}

	ctx := context.Background()
	var err error

	credsJSON := os.Getenv("GOOGLE_APPLICATION_CREDENTIALS_JSON")
	var credsMap map[string]interface{}

	if err := json.Unmarshal([]byte(credsJSON), &credsMap); err != nil {
		fmt.Printf("Error parsing service account JSON: %v\n", err)
		os.Exit(1)
	}
	credsBytes, err := json.Marshal(credsMap)

	if err != nil {
		fmt.Printf("Error converting service account to bytes: %v\n", err)
		os.Exit(1)
	}

	translateClient, err = translate.NewClient(ctx, option.WithCredentialsJSON(credsBytes))
	if err != nil {
		fmt.Printf("Failed to create client: %v\n", err)
		os.Exit(1)
	}
}

Translating Text using Google Translate

Since we are translating into just five languages: Arabic, Chinese, French, Russian, and Spanish, we pass the detections and corresponding language tags ("ar", "zh", "fr", "ru", "es") to our translation function based on the user's choice in the frontend.
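One way to make the five-language restriction explicit is to validate the tag before translating. The sketch below is written in Python for brevity and is only an illustration: the Go handler in this post does not perform this check, and `SUPPORTED_LANGUAGES` and `validate_lang` are names invented here.

```python
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese",
    "fr": "French",
    "ru": "Russian",
    "es": "Spanish",
}


def validate_lang(tag: str) -> str:
    """Return the normalized tag, or raise ValueError if it is unsupported."""
    normalized = tag.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language tag: {tag!r}")
    return normalized


print(validate_lang("fr"))  # fr
```

Rejecting unknown tags early would let the server answer with a clear 400-style error instead of forwarding a bad request to the translation service.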

The translation runs in a loop, once for each object detected in the picture:

func translateText(detections []string, lang string) ([]string, error) {
	ctx := context.Background()
	translations := make([]string, len(detections))
	langTag := language.Make(lang)
	for i, detection := range detections {
		resp, err := translateClient.Translate(ctx, []string{detection}, langTag, nil)
		if err != nil {
			return nil, err
		}
		if len(resp) > 0 {
			translations[i] = resp[0].Text
		} else {
			translations[i] = detection
		}
	}
	return translations, nil
}

Next, I created a POST route that receives the detections and the language tag and passes them to the translateText function:

r.POST("/translate", func(c *gin.Context) {
		var translateReq TranslateRequest
		if err := c.ShouldBindJSON(&translateReq); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid request"})
			return
		}

		translations, err := translateText(translateReq.Detections, translateReq.Lang)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": "Translation failed"})
			return
		}

		c.JSON(http.StatusOK, gin.H{"translations": translations})
	})

Displaying Detection Results and Translations

I built the interface with HTML and JavaScript, styled with Tailwind CSS.

My HTML:

<div class="bg-white shadow-md rounded-lg p-6 max-w-lg w-full">
    <h1 class="text-2xl font-bold text-center mb-6">Object Detection</h1>
    <form id="uploadForm" class="mb-4">
      <div class="mb-4">
        <label class="block text-gray-700 text-sm font-bold mb-2" for="image">
          Upload Image
        </label>
        <input class="block w-full text-sm text-gray-500 file:mr-4 file:py-2 file:px-4 file:rounded-full file:border-0 file:text-sm file:font-semibold file:bg-violet-50 file:text-violet-700 hover:file:bg-violet-100" type="file" id="image" name="image" accept="image/*" required>
      </div>
      <button class="w-full bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded" type="submit">Detect Objects</button>
    </form>
    <div id="results">
      <h2 class="text-xl font-bold mb-4">Detected Objects</h2>
      <div class="mb-4">
        <img id="annotatedImage" class="rounded shadow-md" src="" alt="Annotated Image">
      </div>
      <ul id="detectedObjects" class="list-disc pl-5"></ul>
    </div>
    <div class="translate-buttons hidden flex space-x-2 mt-2">
      <p>Translate</p>
      <button class="translate-button bg-green-500 text-white px-4 py-2 rounded" data-lang="ar">Arabic</button>
      <button class="translate-button bg-yellow-500 text-white px-4 py-2 rounded" data-lang="zh">Chinese</button>
      <button class="translate-button bg-purple-500 text-white px-4 py-2 rounded" data-lang="fr">French</button>
      <button class="translate-button bg-red-500 text-white px-4 py-2 rounded" data-lang="ru">Russian</button>
      <button class="translate-button bg-orange-500 text-white px-4 py-2 rounded" data-lang="es">Spanish</button>
  </div>
  <div id="translated-results" class="hidden mt-4">
    <h3 class="text-lg font-semibold mb-2">Translated Object Names</h3>
    <ul id="translated-list"></ul>
  </div>
  </div>

Send the image to the /upload endpoint when the form is submitted. This code runs inside the form's async submit event handler:

const formData = new FormData();
const fileInput = document.getElementById('image');
formData.append('image', fileInput.files[0]);

const response = await fetch('/upload', {
          method: 'POST',
          body: formData
        });

const data = await response.json();

Display annotated image and list of detections

// Display the annotated image
document.getElementById('annotatedImage').src = data[0].annotated_image;

// Display the detected objects
const detectedObjectsList = document.getElementById('detectedObjects');
detectedObjectsList.innerHTML = '';
data[0].detections.forEach(detection => {
  const listItem = document.createElement('li');
  listItem.textContent = detection.class;
  detectedObjectsList.appendChild(listItem);
});

// Reveal the translate buttons only when at least one object was detected
if (data[0].detections.length > 0) {
  const translateButtons = document.querySelector('.translate-buttons');
  translateButtons.classList.remove('hidden');
}

Send annotations and language tags to the /translate endpoint

document.querySelectorAll('.translate-button').forEach(button => {
      button.addEventListener('click', async function () {
        const lang = this.getAttribute('data-lang');
        // Use the full label so multi-word classes such as "traffic light" stay intact
        const detections = [...document.getElementById('detectedObjects').children].map(li => li.textContent.trim());

        try {
          const response = await fetch('/translate', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ detections, lang })
          });

          if (!response.ok) {
            throw new Error('Failed to translate text.');
          }

          const data = await response.json();
          displayTranslatedResults(data.translations);
        } catch (error) {
          console.error(error);
          alert('An error occurred. Please try again.');
        }
      });
    });

Display Translated Text:

function displayTranslatedResults (translations){
      const translatedResultsDiv = document.getElementById('translated-results');
      const translatedList = document.getElementById('translated-list');

      translatedList.innerHTML = '';
      translations.forEach(translation => {
        const li = document.createElement('li');
        li.textContent = translation;
        translatedList.appendChild(li);
      });

      translatedResultsDiv.classList.remove('hidden');
    }

Image Upload to Translation Testing

Now that everything is wired up, let's walk through the system end to end with a real example.

Click the Choose file button and choose any image from your device. The system works best with clear, well-lit photos where objects are distinct and not heavily overlapping. Once selected, you should see a preview of your image on the page:

Click the "Detect Objects" button. Within a few seconds, YOLOv8 will process the image and return an annotated version with bounding boxes drawn around every detected object, along with a list of detection labels beneath it. The confidence score behind each detection determines what gets shown:
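To make the role of the confidence score concrete, here is a minimal sketch of filtering detections by a threshold. The 0.5 cutoff and the sample detections are assumptions for illustration only; this post does not state which threshold the project uses.

```python
def filter_by_confidence(detections, threshold=0.5):
    """Keep detections at or above the threshold, most confident first."""
    kept = [d for d in detections if d["confidence"] >= threshold]
    return sorted(kept, key=lambda d: d["confidence"], reverse=True)


# Hypothetical detections in the same shape as the JSON response.
sample = [
    {"class": "dog", "confidence": 0.58},
    {"class": "person", "confidence": 0.91},
    {"class": "kite", "confidence": 0.22},
]
for d in filter_by_confidence(sample):
    print(f'{d["class"]}: {d["confidence"]:.2f}')
```

Raising the threshold hides uncertain labels at the risk of missing real objects; lowering it does the opposite.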

Optionally, you can then translate the results into your language. I'll use French as an example; the translations appeared immediately when I clicked the French button:

The entire flow from upload to translated results takes just a few seconds, a small demonstration of how accessible these technologies have become.

Future Enhancements

While the system achieves its primary aim today, these are the improvements I am considering:

  • Implement real-time object detection: imagine pointing your phone camera at a market stall abroad and getting instant translations of every item in view.

  • Broader language support: The current five languages are just a starting point. Expanding to other widely spoken languages would make the system accessible to a significantly larger global audience.

Conclusion

What started as curiosity about combining two powerful technologies turned into a fully functional system that bridges vision and language in a way that feels genuinely useful.

We walked through the entire pipeline, from encoding an uploaded image in Golang, to running it through YOLOv8 in Flask, to translating the results with Google Translate and displaying everything in a clean UI. Each piece is modular, meaning you can swap out components, try a different detection model, add more languages, or rebuild the frontend entirely.

You can find the complete code for this project on GitHub.
