From Image Upload to Multilingual Insights: Creating an Object Detection and Translation System

Introduction

Machine learning and artificial intelligence have always fascinated me. These technologies have revolutionized how we interact with the digital world.

Combining them can lead to powerful systems, and inspired by their potential, I decided to work on a project that detects objects in an image and then translates the labels into other languages in just a few clicks.

In this article, I will guide you through the development of an intelligent object detection and translation system using the YOLOv8 detection model and Google Translate, a widely used language translation service.

Evolution of Machine Learning and Artificial Intelligence

AI and ML began in the 1950s with Turing's ideas and the development of the perceptron. After setbacks in the 1970s and 1980s, the field shifted to data-driven approaches in the 2000s. Modern advances in deep learning have revolutionized AI, impacting a wide range of industries.

Experiences That Led to Working on the Detection System

I have always believed that anyone, regardless of their field, should have the ability to discover new concepts or understand how familiar concepts are referred to in other languages. This conviction inspired me to develop an object detection and translation system.

I noticed how language barriers could create misunderstandings. This inspired me to find ways to bridge these gaps. With a passion for technology, I discovered the potential of machine learning and artificial intelligence to break down these barriers.

Prerequisites

This system will be built using Golang and Python. While knowing these languages can be advantageous, it’s not required. I’ll guide you through every step of the process.

Tools and technologies

I built the project using the following tools and technologies:

  • Golang

  • Python

  • YOLOv8: for object detection

  • Google Translate API: for language translation

  • Cloudinary: for uploading images to the cloud

Workflow

Wondering how it works? It's pretty straightforward. First, you upload a picture you took or have saved on your device. Click the 'Detect Objects' button, and you will get back an annotated image. You can then select one of the five available languages to translate the detections.

The Frontend

The user interface is built using HTML for structure and Tailwind CSS for styling. Users can upload images and select languages for translation through this interface.

JavaScript handles user interactions and sends the image data to the backend server.

The Backend

At the backend, the Golang server receives the image, reads it into a buffer, encodes it as a base64 string, and sends that string in a POST request to the Flask server.

The Flask server receives the base64 string, converts it to binary image data, opens it as an image object, and then sends this image object to the model for detection.

When the detection is complete, the annotated image is stored in a buffer and uploaded to Cloudinary. The final step is to send the list of annotations and the Cloudinary URL of the annotated image back as a JSON response.

Setting up the environment

You need to have Golang and Python installed on your computer. You can download Golang from here and Python from this link. If they are installed properly, you should see your Python and Golang versions in your terminal when you type python --version and go version respectively. My Python version is 3.12.3 and my Golang version is 1.22.1.

Setting up my Python server

In a new, empty directory, I created a Python virtual environment and activated it. I then created my Flask application in the same directory.

Create a virtual environment and Flask application
Check here to see how to create a virtual environment and a Flask application.
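
If you would rather see it inline, here is a minimal sketch of the Flask application file. The only assumption is the host and port: 5000 matches the address the Golang server will post to later in this article.

from flask import Flask

app = Flask(__name__)

if __name__ == '__main__':
    # Port 5000 matches the URL the Golang server posts to (http://localhost:5000)
    app.run(port=5000, debug=True)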

Setting up my Golang server

I created a new folder for the Golang server, initialized the Go project, and then created a server.

How to initialize a Golang project
Check here to learn how to initialize a Golang project and create a server.

Configuring YOLOv8

As I mentioned above, YOLOv8 is the model we'll be using to detect objects in our image. Configuring it is pretty straightforward.

Paste these lines of code in your Flask application to download and load a pretrained model.

from ultralytics import YOLO
model = YOLO("yolov8n.pt")
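
To confirm the model loaded correctly, you can run a quick sanity check on any local image and print what it finds. This is just an illustrative sketch; sample.jpg is a placeholder for an image of your own.

# Run inference on a test image (sample.jpg is a placeholder)
results = model("sample.jpg")
for box in results[0].boxes:
    # model.names maps numeric class IDs to human-readable labels
    print(model.names[int(box.cls)], float(box.conf))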

Setting Up Google Translate API

You need to set up a Google Cloud account, enable the Cloud Translation API, and create a service account.

How to set up the Cloud Translation API
Watch this video to learn how to set up the Cloud Translation API.

Building the Object detection system

The Flask app handles receiving the image, sending it to the model for detection, and returning the annotated image along with JSON containing the annotations and confidence scores.

Dependencies we need:

from ultralytics import YOLO
from flask import Flask, request, jsonify
import base64
from PIL import Image
from io import BytesIO
from dotenv import load_dotenv
import cloudinary
from cloudinary import CloudinaryImage
import cloudinary.uploader
import cloudinary.api

Load the .env file and the YOLOv8 pretrained model, and configure the Cloudinary client to use secure URLs (HTTPS):

load_dotenv()
model = YOLO('./yolov8n.pt')
config = cloudinary.config(secure=True)

The POST method receives the base64 string, opens it as an image, and sends it to the model for detection. It then initializes a buffer, saves the annotated image into it, and uploads the image to Cloudinary straight from the buffer.

We then iterate over the detection results, extracting the bounding boxes, confidence scores, and class IDs for each detected object. These are converted to NumPy arrays, and zip() pairs each box with its score and class ID.

Finally, we return a JSON response containing the detections and the URL of the annotated image stored on Cloudinary.
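
Putting those steps together, here is a minimal sketch of what the /detect endpoint could look like, reusing the dependencies imported above. The route name and the detections/annotated_image JSON keys match what the Golang server expects; treating results[0].plot() as a BGR array that needs flipping to RGB, and reading secure_url from Cloudinary's upload response, are my own assumptions rather than the exact code from the repository.

app = Flask(__name__)

@app.route('/detect', methods=['POST'])
def detect():
    data = request.get_json()

    # Decode the base64 string back into binary image data and open it with PIL
    image_bytes = base64.b64decode(data['image'])
    image = Image.open(BytesIO(image_bytes))

    # Run YOLOv8 detection
    results = model(image)

    # plot() draws the annotations and returns a BGR NumPy array; flip to RGB for PIL
    annotated = Image.fromarray(results[0].plot()[..., ::-1])

    # Save the annotated image into an in-memory buffer and upload it from there
    buffer = BytesIO()
    annotated.save(buffer, format='JPEG')
    buffer.seek(0)
    upload_result = cloudinary.uploader.upload(buffer)

    # Pair each bounding box with its confidence score and class ID
    boxes = results[0].boxes
    detections = []
    for box, score, cls_id in zip(boxes.xyxy.cpu().numpy(),
                                  boxes.conf.cpu().numpy(),
                                  boxes.cls.cpu().numpy()):
        detections.append({
            'class': model.names[int(cls_id)],
            'confidence': float(score),
            'box': [int(v) for v in box],
        })

    return jsonify({
        'detections': detections,
        'annotated_image': upload_result['secure_url'],
    })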

Working with the Image and Translation System in Golang

If you remember, our Golang server sends a base64 string to our Python server and expects the annotated image and its annotations in return.

The packages we need:

import (
    "bytes"
    "context"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "os"

    "cloud.google.com/go/translate"
    "github.com/gin-gonic/gin"
    "github.com/go-resty/resty/v2"
    "github.com/joho/godotenv"
    "golang.org/x/text/language"
    "google.golang.org/api/option"
)

Defined types:

type Detection struct {
    Class      string  `json:"class"`
    Confidence float64 `json:"confidence"`
    Box        [4]int  `json:"box"`
}

type PythonResponse struct {
    Detections     []Detection `json:"detections"`
    AnnotatedImage string      `json:"annotated_image"`
}

type DetectionRequest struct {
    Image string `json:"image"`
}

type TranslateRequest struct {
    Detections []string `json:"detections"`
    Lang       string   `json:"lang"`
}

var translateClient *translate.Client

Sending the base64 string to the Flask server

First, when the image file is received, it is converted to base64 with this code:

r.POST("/upload", func(c *gin.Context) {
        file, err := c.FormFile("image")
        if err != nil {
            c.JSON(http.StatusBadRequest, gin.H{"error": "Image upload failed"})
            return
        }

        fileData, err := file.Open()
        if err != nil {
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to open image file"})
            return
        }
        defer fileData.Close()

        buffer := bytes.NewBuffer(nil)
        if _, err := io.Copy(buffer, fileData); err != nil {
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read image file"})
            return
        }

        encodedImage := base64.StdEncoding.EncodeToString(buffer.Bytes())
        detections, err := detectObjects(encodedImage)
        if err != nil {
            fmt.Println(err)
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Object detection failed"})
            return
        }
        c.JSON(http.StatusOK, detections)
    })

The base64 string is then sent to the Flask server, which returns the annotated image and its annotations, using this function:

func detectObjects(imageData string) ([]PythonResponse, error) {
    client := resty.New()
    resp, err := client.R().
        SetHeader("Content-Type", "application/json").
        SetBody(DetectionRequest{Image: imageData}).
        Post("http://localhost:5000/detect")

    if err != nil {
        return nil, err
    }

    var pyResp PythonResponse
    if err := json.Unmarshal(resp.Body(), &pyResp); err != nil {
        return nil, err
    }

    return []PythonResponse{pyResp}, nil
}

Setting up Google Translate

First, I loaded the .env file to access environment variables. Then, I created a context to manage the lifecycle of the client.

In Golang, a context lets you pass request-scoped values, deadlines, and cancellation signals across API boundaries and between goroutines.

I then took the credentials Google Cloud provided, converted them to a byte slice, and created a client with them:

func setupGoogleTranslate() {
    if err := godotenv.Load(); err != nil {
        fmt.Printf("Error loading .env file: %v\n", err)
        os.Exit(1)
    }

    ctx := context.Background()
    var err error

    credsJSON := os.Getenv("GOOGLE_APPLICATION_CREDENTIALS_JSON")
    var credsMap map[string]interface{}

    if err := json.Unmarshal([]byte(credsJSON), &credsMap); err != nil {
        fmt.Printf("Error parsing service account JSON: %v\n", err)
        os.Exit(1)
    }
    credsBytes, err := json.Marshal(credsMap)

    if err != nil {
        fmt.Printf("Error converting service account to bytes: %v\n", err)
        os.Exit(1)
    }

    translateClient, err = translate.NewClient(ctx, option.WithCredentialsJSON(credsBytes))
    if err != nil {
        fmt.Printf("Failed to create client: %v\n", err)
        os.Exit(1)
    }
}

Translating Text using Google Translate

Since we are translating into just five languages—Arabic, Chinese, French, Russian, and Spanish—we pass the detections and corresponding language tags ("ar", "zh", "fr", "ru", "es") to our translation function based on the user's choice in the frontend.

The translation runs in a loop over each detected object in the picture:

func translateText(detections []string, lang string) ([]string, error) {
    ctx := context.Background()
    translations := make([]string, len(detections))
    langTag := language.Make(lang)
    for i, detection := range detections {
        resp, err := translateClient.Translate(ctx, []string{detection}, langTag, nil)
        if err != nil {
            return nil, err
        }
        if len(resp) > 0 {
            translations[i] = resp[0].Text
        } else {
            translations[i] = detection
        }
    }
    return translations, nil
}

Next, I created a POST endpoint that receives the detections and the language tag and passes them to the translateText function:

r.POST("/translate", func(c *gin.Context) {
        var translateReq TranslateRequest
        if err := c.ShouldBindJSON(&translateReq); err != nil {
            c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid request"})
            return
        }

        translations, err := translateText(translateReq.Detections, translateReq.Lang)
        if err != nil {
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Translation failed"})
            return
        }

        c.JSON(http.StatusOK, gin.H{"translations": translations})
    })

Displaying Detection Results and Translations

I did this using HTML and JavaScript, styled with Tailwind CSS.

My HTML:

<div class="bg-white shadow-md rounded-lg p-6 max-w-lg w-full">
    <h1 class="text-2xl font-bold text-center mb-6">Object Detection</h1>
    <form id="uploadForm" class="mb-4">
      <div class="mb-4">
        <label class="block text-gray-700 text-sm font-bold mb-2" for="image">
          Upload Image
        </label>
        <input class="block w-full text-sm text-gray-500 file:mr-4 file:py-2 file:px-4 file:rounded-full file:border-0 file:text-sm file:font-semibold file:bg-violet-50 file:text-violet-700 hover:file:bg-violet-100" type="file" id="image" name="image" accept="image/*" required>
      </div>
      <button class="w-full bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded" type="submit">Detect Objects</button>
    </form>
    <div id="results">
      <h2 class="text-xl font-bold mb-4">Detected Objects</h2>
      <div class="mb-4">
        <img id="annotatedImage" class="rounded shadow-md" src="" alt="Annotated Image">
      </div>
      <ul id="detectedObjects" class="list-disc pl-5"></ul>
    </div>
    <div class="translate-buttons hidden flex space-x-2 mt-2">
      <p>Translate</p>
      <button class="translate-button bg-green-500 text-white px-4 py-2 rounded" data-lang="ar">Arabic</button>
      <button class="translate-button bg-yellow-500 text-white px-4 py-2 rounded" data-lang="zh">Chinese</button>
      <button class="translate-button bg-purple-500 text-white px-4 py-2 rounded" data-lang="fr">French</button>
      <button class="translate-button bg-red-500 text-white px-4 py-2 rounded" data-lang="ru">Russian</button>
      <button class="translate-button bg-orange-500 text-white px-4 py-2 rounded" data-lang="es">Spanish</button>
  </div>
  <div id="translated-results" class="hidden mt-4">
    <h3 class="text-lg font-semibold mb-2">Translated Object Names</h3>
    <ul id="translated-list"></ul>
  </div>
  </div>

Send the image to the /upload endpoint when the 'Detect Objects' button is clicked; the whole action runs inside the form's submit event listener, shown here with its wrapper:

document.getElementById('uploadForm').addEventListener('submit', async (event) => {
  // Stop the default form submission so we can send the image with fetch instead
  event.preventDefault();

  const formData = new FormData();
  const fileInput = document.getElementById('image');
  formData.append('image', fileInput.files[0]);

  const response = await fetch('/upload', {
    method: 'POST',
    body: formData
  });

  const data = await response.json();

Display the annotated image and the list of detections; this continues inside the same submit handler, and the final }); closes the listener:

  // Display the annotated image
  document.getElementById('annotatedImage').src = data[0].annotated_image;

  // Display detected objects
  const detectedObjectsList = document.getElementById('detectedObjects');
  detectedObjectsList.innerHTML = '';
  data[0].detections.forEach(detection => {
    const listItem = document.createElement('li');
    listItem.textContent = detection.class;
    detectedObjectsList.appendChild(listItem);
  });

  // Reveal the translate buttons once objects have been listed
  if (detectedObjectsList.children.length > 0) {
    const translateButtons = document.querySelector('.translate-buttons');
    translateButtons.classList.remove('hidden');
  }
});

Send the annotations and the selected language tag to the /translate endpoint:

document.querySelectorAll('.translate-button').forEach(button => {
      button.addEventListener('click', async function () {
        const lang = this.getAttribute('data-lang');
        // Use the full text of each list item; split(' ')[0] would break
        // multi-word class names like "traffic light"
        const detections = [...document.getElementById('detectedObjects').children].map(li => li.textContent.trim());

        try {
          const response = await fetch('/translate', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ detections, lang })
          });

          if (!response.ok) {
            throw new Error('Failed to translate text.');
          }

          const data = await response.json();
          displayTranslatedResults(data.translations);
        } catch (error) {
          console.error(error);
          alert('An error occurred. Please try again.');
        }
      });
    });

Display Translated Text:

function displayTranslatedResults(translations) {
  const translatedResultsDiv = document.getElementById('translated-results');
  const translatedList = document.getElementById('translated-list');

  translatedList.innerHTML = '';
  translations.forEach(translation => {
    const li = document.createElement('li');
    li.textContent = translation;
    translatedList.appendChild(li);
  });

  translatedResultsDiv.classList.remove('hidden');
}

Testing: From Image Upload to Translation

The first step is to select your image:

Then proceed by clicking the 'Detect Objects' button. The detection runs, and you should see the annotated image on the page:

Optionally, you can then translate the detections into your language. I'll use French as an example; I got the translations immediately when I clicked on French:

And that's it.

You can find the complete code for this project on GitHub.

Future Enhancements

While the system achieves its primary aim at the moment, some improvements that could make it a better solution are to:

  • Expand the system to support a broader range of languages to make it accessible to a more diverse global audience.

  • Implement real-time object detection and translation capabilities to significantly enhance the user experience.

  • Upgrade to more advanced versions of YOLO or incorporate other state-of-the-art object detection models to improve accuracy.

Conclusion

We learned how to create an intelligent object detection and translation system using YOLOv8, a machine learning model, and the Google Translate API.

You now better understand how to use these technologies to build powerful applications.
