Multimodal Content

Text

Text is GUMem's primary input. Each Message should include:

Field      Description
role       Message role, such as user, assistant, or system.
content    Raw text content. GUMem extracts Facts from this field.
metadata   Optional business metadata, such as source, page, attachment reference, or business object ID.
timestamp  Optional event time. Use accurate timestamps when relative time appears in the content.

Callers usually do not need to write Summary directly. Write Message input with provenance, time, and business context, and let GUMem process it.
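
As a rough illustration of the fields above, the following Python sketch assembles a Message-shaped dictionary. The helper name `build_message` and its validation rules are assumptions made for this example and are not part of GUMem. The timestamp is rendered in the UTC `Z` form used in this page's JSON examples.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_message(
    role: str,
    content: str,
    metadata: Optional[dict[str, Any]] = None,
    timestamp: Optional[datetime] = None,
) -> dict[str, Any]:
    """Assemble a Message-shaped dict (hypothetical helper, not a GUMem API)."""
    if not content.strip():
        raise ValueError("content must contain text")
    message: dict[str, Any] = {"role": role, "content": content}
    if metadata:
        # Copy so later changes by the caller do not alter the message.
        message["metadata"] = dict(metadata)
    if timestamp is not None:
        if timestamp.tzinfo is None:
            # Assumption: reject naive datetimes rather than guess a zone.
            raise ValueError("timestamp must be timezone-aware")
        utc = timestamp.astimezone(timezone.utc)
        message["timestamp"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return message


example = build_message(
    role="user",
    content="I moved to Singapore last week.",
    metadata={"source": "chat"},
    timestamp=datetime(2026, 4, 24, 12, 30, tzinfo=timezone.utc),
)
```

The content here says "last week", which is the kind of relative time that only makes sense alongside an accurate event timestamp.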

Images

Images can be represented in GUMem, but the Memory pipeline should not be treated as an image decoding service. Convert image content upstream, then write the useful result as Message input.

Common patterns:

  • Write OCR text, captions, model recognition output, or human notes into content.
  • Store attachment ID, external URL, MIME type, business object ID, or upload source in metadata.
  • If an image belongs to a user message, write the image description and user text in the same message batch.
json
{
  "role": "user",
  "content": "User uploaded a receipt image. OCR text: dinner at Bistro A, total 86.40 SGD, paid on 2026-04-24.",
  "metadata": {
    "source": "image_upload",
    "mime_type": "image/png",
    "attachment_id": "att_xxx"
  },
  "timestamp": "2026-04-24T12:30:00Z"
}

If your deployment exposes image upload, confirm that it also converts image content into text or metadata that Memory can use. Binary image content alone is not enough for the current Memory pipeline to generate retrievable Facts.
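
The upstream conversion step might look like the sketch below. It assumes OCR has already run elsewhere and that a "message batch" can be modeled as a list of Message dictionaries. The function name `image_upload_batch` and the wording of the generated content are hypothetical, not GUMem behavior.

```python
from typing import Any


def image_upload_batch(
    ocr_text: str,
    attachment_id: str,
    mime_type: str,
    user_text: str = "",
) -> list[dict[str, Any]]:
    """Turn upstream OCR output into text Messages (hypothetical helper)."""
    # Collapse the line breaks and repeated spaces that OCR output often has.
    ocr_text = " ".join(ocr_text.split())
    if not ocr_text:
        # Binary content alone gives Memory nothing to extract Facts from.
        raise ValueError("no text recovered from the image")

    batch: list[dict[str, Any]] = []
    if user_text.strip():
        # Keep the user's own words in the same batch as the image text.
        batch.append({"role": "user", "content": user_text.strip()})
    batch.append(
        {
            "role": "user",
            "content": f"User uploaded an image. OCR text: {ocr_text}",
            "metadata": {
                "source": "image_upload",
                "mime_type": mime_type,
                "attachment_id": attachment_id,
            },
        }
    )
    return batch


batch = image_upload_batch(
    ocr_text="dinner at Bistro A\n total 86.40 SGD",
    attachment_id="att_xxx",
    mime_type="image/png",
    user_text="Please log this receipt.",
)
```

Raising an error on empty OCR output is one possible design choice. A deployment could instead fall back to a caption or a human note.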

Video

Video follows the same pattern as images: convert it upstream and write the useful result as Message input. The current Memory pipeline does not promise video parsing, transcoding, key-frame extraction, or speech recognition.

Recommended patterns:

  • Write transcripts as Message content.
  • Write key-frame descriptions, scene summaries, or human notes as Message content.
  • Store video URL, file ID, duration, segment timecodes, and source system in metadata.
  • Split long videos into segments so that a single Message does not mix unrelated content.
json
{
  "role": "user",
  "content": "Video transcript from 00:02:10 to 00:03:05: user says the preferred onboarding flow should avoid long forms and support SSO first.",
  "metadata": {
    "source": "video_transcript",
    "video_id": "video_xxx",
    "segment_start": "00:02:10",
    "segment_end": "00:03:05"
  }
}
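
One way to split a long transcript is sketched below. It assumes the transcript arrives as `(start_seconds, end_seconds, text)` cues from an upstream speech-to-text step, and it uses a fixed time window as a crude stand-in for topic boundaries. The function names and the 60-second default are illustrative assumptions. The metadata keys follow the example above.

```python
from typing import Any

Cue = tuple[int, int, str]  # (start_seconds, end_seconds, text)


def to_timecode(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segment_message(group: list[Cue], video_id: str) -> dict[str, Any]:
    """Build one Message from a run of consecutive cues."""
    start = to_timecode(group[0][0])
    end = to_timecode(group[-1][1])
    text = " ".join(cue[2].strip() for cue in group)
    return {
        "role": "user",
        "content": f"Video transcript from {start} to {end}: {text}",
        "metadata": {
            "source": "video_transcript",
            "video_id": video_id,
            "segment_start": start,
            "segment_end": end,
        },
    }


def transcript_to_messages(
    cues: list[Cue], video_id: str, max_seconds: int = 60
) -> list[dict[str, Any]]:
    """Group cues into segments that span at most max_seconds each."""
    messages: list[dict[str, Any]] = []
    group: list[Cue] = []
    for cue in cues:
        # Close the current segment if adding this cue would exceed the window.
        if group and cue[1] - group[0][0] > max_seconds:
            messages.append(segment_message(group, video_id))
            group = []
        group.append(cue)
    if group:
        messages.append(segment_message(group, video_id))
    return messages


cues = [
    (130, 150, "Avoid long forms."),
    (150, 185, "Support SSO first."),
    (400, 420, "Pricing question."),
]
messages = transcript_to_messages(cues, video_id="video_xxx")
```

A real pipeline would likely split on speaker turns or topic changes instead of a fixed window. The time-based rule only keeps the sketch short.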

Write Guidance

  • Convert multimodal input into text facts useful for future tasks.
  • Keep attachment, resource, and business object references for audit and deletion.
  • Do not treat full media files as the primary Memory content.
  • Do not write uncertain model recognition as certain Facts.
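
The last point can be handled at write time by wording low-confidence recognition as a possibility. The sketch below is only an illustration: the function name, the 0.9 threshold, and the phrasing are assumptions, and how GUMem treats hedged wording during Fact extraction is not specified here.

```python
def describe_recognition(
    label: str, confidence: float, threshold: float = 0.9
) -> str:
    """Phrase a model recognition result according to its confidence.

    Hypothetical helper: the threshold and wording are illustrative choices.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        return f"Image shows {label}."
    # Below the threshold, keep the uncertainty visible in the text itself.
    return (
        f"Image may show {label} "
        f"(unverified model recognition, confidence {confidence:.2f})."
    )


certain = describe_recognition("a boarding pass", 0.95)
uncertain = describe_recognition("a boarding pass", 0.5)
```

Another option is to skip low-confidence results entirely and keep only the attachment reference in metadata until a human confirms the content.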

Next Step

Read Add Memory to write text-shaped multimodal content into GUMem.