Forget waiting for Stable Diffusion or DALL-E to magically churn out readable dialogue within a comic panel. For years, users have grappled with AI art that generates nonsensical text: dialogue like "WHAT ARE YOU DONIG" or "HEILP" popping up in an otherwise perfectly rendered scene. This wasn’t just an aesthetic annoyance. For some projects, 70% of generations needed a re-roll, each one burning GPU time and money.
But here’s the thing: the solution wasn’t more complex AI. It was a humble 200 lines of Python code, a bit of classic computer vision, and the ever-reliable Pillow library.
The Core Problem: AI’s Textual Blind Spot
The fundamental issue lies in how current generative models approach text. They treat letters and words as just another visual pattern to render, often with disastrous results. The AI doesn’t understand language the way humans do; it mimics shapes and styles. The resulting artifacts, while sometimes amusing, break the suspension of disbelief and can render a comic unreadable. For Comicory, a project aimed at generating AI-powered comics, this garbled text was a showstopper. So the developer sidestepped the problem entirely: instead of forcing the AI to render text, they instructed it to draw empty speech bubbles, and typography became a deterministic post-processing step. The result? A 0% retry rate for text-related issues.
Finding the Empty Canvas
The first hurdle in this new pipeline was reliably locating those empty speech bubbles. This wasn’t a job for more sophisticated machine learning. Instead, the developer used a classic computer vision approach with OpenCV (cv2). The process converts the panel to grayscale, applies a binary threshold so that only near-white pixels (brighter than 245) survive, which isolates the white bubbles against the darker background, and then runs contour detection to find the shapes, sorted from largest to smallest.
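```python
from PIL import Image
import numpy as np
import cv2


def find_bubble(panel: Image.Image) -> tuple[int, int, int, int] | None:
    # Grayscale, then keep only near-white pixels (> 245) as the bubble mask.
    arr = np.array(panel.convert("L"))
    _, mask = cv2.threshold(arr, 245, 255, cv2.THRESH_BINARY)
    # Outermost contours of the white regions, largest first.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blobs = sorted(contours, key=cv2.contourArea, reverse=True)
    # Skip the largest blob and test the next four candidates.
    for blob in blobs[1:5]:
        x, y, w, h = cv2.boundingRect(blob)
        aspect = w / h
        # Accept a box with a bubble-like aspect ratio and an area over 5000 px.
        if 0.6 < aspect < 3.0 and w * h > 5000:
            return (x, y, w, h)
    return None
```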
Applied to thousands of panels, this method pinpoints the correct bubble 96% of the time. The aspect-ratio constraint (width-to-height between 0.6 and 3.0) is a smart move, filtering out elongated cloud shapes or background elements that might otherwise be misidentified, while the minimum bounding-box area of 5,000 pixels discards small white regions.
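The filter is easy to reason about in isolation. The sketch below pulls the two checks out of `find_bubble` into a hypothetical helper, `passes_bubble_filter` (the name is an assumption for illustration; the thresholds are the ones in the code above), and runs it on a few made-up bounding boxes.

```python
def passes_bubble_filter(w: int, h: int) -> bool:
    """Apply the same two checks find_bubble uses to a w-by-h bounding box."""
    aspect = w / h
    return 0.6 < aspect < 3.0 and w * h > 5000


# Made-up (width, height) boxes for illustration.
print(passes_bubble_filter(300, 120))  # True: aspect 2.5, area 36000
print(passes_bubble_filter(600, 100))  # False: aspect 6.0 is too elongated
print(passes_bubble_filter(60, 60))    # False: area 3600 is too small
print(passes_bubble_filter(50, 200))   # False: aspect 0.25 is too narrow
```

Keeping the thresholds in one small function like this would also make them simple to tune against a labelled set of panels, though nothing in the article says the original code is organised that way.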
Matching Mood to Font
Once a bubble is found, the next step is to fill it with the right text in the right style. Each character in Comicory has a mood field, which is mapped to a specific font choice and weight. This isn’t just about picking a generic font; it’s about crafting the visual voice of the character.
```python
FONT_MAP = {
    "calm": ("AnimeAce2.ttf"