Processing Multiple Images per Prompt
Problem
You need to process multiple images in a single prompt, where each image provides different context and the LLM needs to understand relationships between images.
Single-image prompts work for independent analysis, but many real-world tasks require the model to compare, correlate, or synthesize information across multiple images: comparing product designs, verifying document consistency, analyzing before/after states, or processing multi-page documents. Sending images in separate requests loses the relational context. The model needs to see all images together in one prompt to make relative judgments.
Solution
Implement multi-image processing that loads images from both file paths and URLs, validates the total count against model limits, constructs a multimodal message with all images, and includes annotations that help the LLM understand image relationships. Validating inputs before any API call prevents errors from exceeding model capabilities (e.g., sending 20 images when the model supports 5).
Code Example
package main

import (
    "context"
    "fmt"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"

    "github.com/lookatitude/beluga-ai/schema"
)

var tracer = otel.Tracer("beluga.multimodal.multi_image")

// MultiImageProcessor processes multiple images in a prompt.
type MultiImageProcessor struct {
    maxImages    int
    imageFormats []string
}

// NewMultiImageProcessor creates a new processor.
func NewMultiImageProcessor(maxImages int) *MultiImageProcessor {
    return &MultiImageProcessor{
        maxImages:    maxImages,
        imageFormats: []string{"image/jpeg", "image/png", "image/webp"},
    }
}

// ProcessMultipleImages processes multiple images with a prompt.
func (mip *MultiImageProcessor) ProcessMultipleImages(ctx context.Context, prompt string, imagePaths []string, imageURLs []string) ([]schema.Message, error) {
    ctx, span := tracer.Start(ctx, "multi_image_processor.process")
    defer span.End()

    span.SetAttributes(
        attribute.Int("image_path_count", len(imagePaths)),
        attribute.Int("image_url_count", len(imageURLs)),
        attribute.String("prompt", prompt),
    )

    // Validate the image count before loading anything.
    totalImages := len(imagePaths) + len(imageURLs)
    if totalImages > mip.maxImages {
        err := fmt.Errorf("too many images: %d (max: %d)", totalImages, mip.maxImages)
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    // Build the message list.
    messages := []schema.Message{}

    // Add a system message carrying the prompt, if one was provided.
    if prompt != "" {
        messages = append(messages, schema.NewSystemMessage(prompt))
    }

    // Collect image contents for the user message.
    imageMessages := []schema.ImageContent{}

    // Add images from file paths.
    for i, path := range imagePaths {
        imgContent, err := mip.loadImageFromPath(ctx, path, i)
        if err != nil {
            span.RecordError(err)
            continue
        }
        imageMessages = append(imageMessages, imgContent)
    }

    // Add images from URLs.
    for i, url := range imageURLs {
        imgContent, err := mip.loadImageFromURL(ctx, url, len(imagePaths)+i)
        if err != nil {
            span.RecordError(err)
            continue
        }
        imageMessages = append(imageMessages, imgContent)
    }

    // Create the multimodal user message. Attaching imageMessages to the
    // message would use the actual multimodal message API.
    userMsg := schema.NewHumanMessage("")
    messages = append(messages, userMsg)

    span.SetAttributes(attribute.Int("processed_image_count", len(imageMessages)))
    span.SetStatus(codes.Ok, "multiple images processed")

    return messages, nil
}

// loadImageFromPath loads an image from a file path.
func (mip *MultiImageProcessor) loadImageFromPath(ctx context.Context, path string, index int) (schema.ImageContent, error) {
    // In practice, this would read the file and encode it to base64 or raw bytes.
    return schema.ImageContent{}, nil
}

// loadImageFromURL loads an image from a URL.
func (mip *MultiImageProcessor) loadImageFromURL(ctx context.Context, url string, index int) (schema.ImageContent, error) {
    // In practice, this would download, validate, and encode the image.
    return schema.ImageContent{}, nil
}

// AnnotateImages adds annotations to help the LLM understand image relationships.
func (mip *MultiImageProcessor) AnnotateImages(ctx context.Context, prompt string, imageDescriptions []string) string {
    annotated := prompt
    for i, desc := range imageDescriptions {
        annotated += fmt.Sprintf("\n\nImage %d: %s", i+1, desc)
    }
    return annotated
}

func main() {
    ctx := context.Background()

    // Create a processor that accepts at most five images.
    processor := NewMultiImageProcessor(5)

    // Process multiple images with a comparison prompt.
    prompt := "Compare these images and describe the differences"
    imagePaths := []string{"image1.jpg", "image2.jpg"}

    messages, err := processor.ProcessMultipleImages(ctx, prompt, imagePaths, nil)
    if err != nil {
        log.Fatalf("Failed: %v", err)
    }
    fmt.Printf("Processed %d images\n", len(messages))
}
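The two load helpers above are deliberately left as stubs, since the exact fields of schema.ImageContent are not shown in this recipe. As a rough sketch, loading and encoding with only the standard library (os, net/http, io, encoding/base64) might look like the following; localImage, loadLocalImage, and loadRemoteImage are hypothetical names used for illustration, not part of the beluga-ai API.

// localImage is a hypothetical intermediate type; in the real processor the
// data would be mapped into schema.ImageContent.
type localImage struct {
    MIMEType   string
    Base64Data string
}

// loadLocalImage reads an image file and base64-encodes it.
func loadLocalImage(path string) (localImage, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return localImage{}, fmt.Errorf("read image %s: %w", path, err)
    }
    return localImage{
        MIMEType:   http.DetectContentType(data), // e.g. image/jpeg, image/png
        Base64Data: base64.StdEncoding.EncodeToString(data),
    }, nil
}

// loadRemoteImage downloads an image and base64-encodes it.
func loadRemoteImage(ctx context.Context, url string) (localImage, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return localImage{}, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return localImage{}, err
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        return localImage{}, err
    }
    return localImage{
        MIMEType:   http.DetectContentType(data),
        Base64Data: base64.StdEncoding.EncodeToString(data),
    }, nil
}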
Explanation
- Image aggregation — Images from both file paths and URLs are combined into a single prompt, providing flexibility in how images are sourced. File paths work for local images, while URLs support images hosted on CDNs or messaging platforms. Both sources are validated and loaded uniformly.
- Validation — The number of images is validated against model limits before any loading or API calls. This prevents wasted work: loading 20 large images only to have the API reject the request is expensive. Early validation also produces a clear error message ("too many images: 20 (max: 5)") rather than a cryptic API error.
- Annotation support — The AnnotateImages method adds structured text annotations to help the LLM understand image relationships. Annotations like "Image 1: Product design v1" and "Image 2: Product design v2" give the model explicit context about what each image represents and what kind of comparison is expected. Without annotations, the model must infer image relationships purely from visual content.
Key insight: Structure multi-image prompts clearly with annotations. Use numbered labels or descriptions to help the LLM understand the role of each image and the expected analysis. The text prompt should explicitly state whether you want comparison, synthesis, or independent analysis of each image.
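As a quick usage sketch (reusing the processor and ctx from the example above, with made-up descriptions), AnnotateImages turns plain descriptions into numbered labels appended to the prompt:

annotated := processor.AnnotateImages(ctx,
    "Compare these designs and list the differences",
    []string{"Product design v1 (current release)", "Product design v2 (proposed redesign)"},
)
// annotated now reads:
//   Compare these designs and list the differences
//
//   Image 1: Product design v1 (current release)
//
//   Image 2: Product design v2 (proposed redesign)
fmt.Println(annotated)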
Testing
func TestMultiImageProcessor_ProcessesMultipleImages(t *testing.T) {
    processor := NewMultiImageProcessor(5)

    messages, err := processor.ProcessMultipleImages(context.Background(), "prompt", []string{"img1.jpg", "img2.jpg"}, nil)
    require.NoError(t, err)
    require.Greater(t, len(messages), 0)
}
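A companion test, sketched below, covers the validation path by passing more images than the configured limit; the file names are placeholders and never need to exist, because validation happens before any loading.

func TestMultiImageProcessor_RejectsTooManyImages(t *testing.T) {
    processor := NewMultiImageProcessor(2)

    // Three images against a limit of two should fail before any loading.
    _, err := processor.ProcessMultipleImages(context.Background(), "prompt",
        []string{"img1.jpg", "img2.jpg", "img3.jpg"}, nil)
    require.Error(t, err)
    require.Contains(t, err.Error(), "too many images")
}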
Variations
Image Sequencing
Specify image order explicitly for sequential analysis:
type ImageWithContext struct {
    Image   schema.ImageContent
    Context string
    Order   int
}
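A minimal sketch of how this might be used (sortImagesByOrder and sequenceAnnotations are hypothetical helpers, relying on the standard sort and fmt packages): order the images explicitly, then emit a positional annotation for each one.

// sortImagesByOrder returns the images sorted by their explicit Order field,
// so the model always sees a stable sequence.
func sortImagesByOrder(images []ImageWithContext) []ImageWithContext {
    sorted := make([]ImageWithContext, len(images))
    copy(sorted, images)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].Order < sorted[j].Order })
    return sorted
}

// sequenceAnnotations builds labels such as "Image 1 (step 1): before refactor"
// that can be passed to AnnotateImages.
func sequenceAnnotations(images []ImageWithContext) []string {
    annotations := make([]string, 0, len(images))
    for i, img := range images {
        annotations = append(annotations, fmt.Sprintf("Image %d (step %d): %s", i+1, img.Order, img.Context))
    }
    return annotations
}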
Image Relationships
Model relationships between images for structured comparison:
type ImageRelationship struct {
    Type     string // "comparison", "sequence", "overlay"
    ImageIDs []int
}
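One way to surface such a relationship to the model, sketched below with a hypothetical relationshipAnnotation helper (using strconv and strings from the standard library), is to render it as one more line appended to the prompt:

// relationshipAnnotation turns a structured relationship into prompt text,
// e.g. "Relationship (comparison): images 1, 2".
func relationshipAnnotation(rel ImageRelationship) string {
    ids := make([]string, len(rel.ImageIDs))
    for i, id := range rel.ImageIDs {
        ids[i] = strconv.Itoa(id)
    }
    return fmt.Sprintf("Relationship (%s): images %s", rel.Type, strings.Join(ids, ", "))
}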
Related Recipes
- Capability-based Fallbacks — Handle capability limitations across providers
- Voice Providers Guide — For a deeper understanding of multimodal pipelines