Multimodal Content

Text

Text is GUMem's primary input. Each Message should include:

Field	Description
`role`	Message role, such as `user`, `assistant`, or `system`.
`content`	Raw text content. GUMem extracts Facts from this field.
`metadata`	Optional business metadata, such as source, page, attachment reference, or business object ID.
`timestamp`	Optional event time. Provide an accurate timestamp whenever the content contains relative time expressions such as "yesterday" or "last week".

Callers usually do not need to write Summary directly. Write Message input with provenance, time, and business context, and let GUMem process it.
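As a minimal sketch of the fields in the table above, the following Python helper assembles a Message payload. `build_message` is a hypothetical name used only for illustration; this page does not show GUMem's client API, so the code builds the payload and stops there. The metadata keys `source` and `business_object_id` are likewise assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_message(
    role: str,
    content: str,
    metadata: Optional[dict[str, Any]] = None,
    timestamp: Optional[datetime] = None,
) -> dict[str, Any]:
    """Assemble a Message dict with role, content, metadata, and timestamp.

    Hypothetical helper: it only builds the payload and does not call GUMem.
    """
    if not content.strip():
        raise ValueError("content must contain text")
    message: dict[str, Any] = {"role": role, "content": content}
    if metadata:
        message["metadata"] = dict(metadata)
    if timestamp is not None:
        if timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        # Normalize to UTC and use the 'Z' suffix seen in this page's examples.
        utc = timestamp.astimezone(timezone.utc)
        message["timestamp"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return message


# "last week" is relative, so the event time is supplied explicitly.
message = build_message(
    role="user",
    content="I moved to the Singapore office last week.",
    metadata={"source": "chat", "business_object_id": "ticket_xxx"},
    timestamp=datetime(2026, 4, 24, 12, 30, tzinfo=timezone.utc),
)
```

Rejecting naive datetimes is a design choice in this sketch: an event time without a time zone is ambiguous, and the content here depends on it.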

Images

Images can be represented in GUMem, but the Memory pipeline should not be treated as an image decoding service. Convert image content upstream, then write the useful result as Message input.

Common patterns:

Write OCR text, captions, model recognition output, or human notes into content.
Store attachment ID, external URL, MIME type, business object ID, or upload source in metadata.
If an image belongs to a user message, write the image description and user text in the same message batch.

```json
{
  "role": "user",
  "content": "User uploaded a receipt image. OCR text: dinner at Bistro A, total 86.40 SGD, paid on 2026-04-24.",
  "metadata": {
    "source": "image_upload",
    "mime_type": "image/png",
    "attachment_id": "att_xxx"
  },
  "timestamp": "2026-04-24T12:30:00Z"
}
```

If your deployment exposes image upload, confirm that it also converts image content into text or metadata that Memory can use. Binary image content alone is not enough for the current Memory pipeline to generate retrievable Facts.
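The pattern of keeping an image description and the user's own text in the same message batch could be sketched as follows. `image_message_batch` is a hypothetical helper, and the OCR text is assumed to come from an upstream step that is not part of GUMem. The metadata keys mirror the JSON example above.

```python
from typing import Any, Optional


def image_message_batch(
    user_text: str,
    ocr_text: str,
    attachment_id: str,
    mime_type: str,
    timestamp: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Return the user's text and the image description as one batch.

    Hypothetical helper: the OCR text is produced upstream, not by GUMem.
    """
    batch: list[dict[str, Any]] = []
    if user_text.strip():
        batch.append({"role": "user", "content": user_text})
    batch.append(
        {
            "role": "user",
            "content": f"User uploaded an image. OCR text: {ocr_text}",
            "metadata": {
                "source": "image_upload",
                "mime_type": mime_type,
                "attachment_id": attachment_id,
            },
        }
    )
    if timestamp is not None:
        # Both messages describe the same event, so they share one event time.
        for message in batch:
            message["timestamp"] = timestamp
    return batch


batch = image_message_batch(
    user_text="Please file this under team dinners.",
    ocr_text="dinner at Bistro A, total 86.40 SGD, paid on 2026-04-24.",
    attachment_id="att_xxx",
    mime_type="image/png",
    timestamp="2026-04-24T12:30:00Z",
)
```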

Video

Video follows the same pattern: convert video content upstream. The current Memory pipeline does not guarantee video parsing, transcoding, key-frame extraction, or speech recognition.

Recommended patterns:

Write transcripts as Message content.
Write key-frame descriptions, scene summaries, or human notes as Message content.
Store video URL, file ID, duration, segment timecodes, and source system in metadata.
Split long videos into segments so that a single Message does not contain unrelated content.

```json
{
  "role": "user",
  "content": "Video transcript from 00:02:10 to 00:03:05: user says the preferred onboarding flow should avoid long forms and support SSO first.",
  "metadata": {
    "source": "video_transcript",
    "video_id": "video_xxx",
    "segment_start": "00:02:10",
    "segment_end": "00:03:05"
  }
}
```
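One way to split a long transcript into segment-sized Messages is sketched below. It assumes the transcript already exists as timed cues from an upstream speech-recognition step. The function names and the 60-second limit are illustrative assumptions. Splitting by duration is a simple stand-in here; splitting on topic boundaries would separate unrelated content more reliably.

```python
from typing import Any

# One transcript cue: (start_seconds, end_seconds, spoken_text).
Cue = tuple[int, int, str]


def format_timecode(seconds: int) -> str:
    """Render whole seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segment_message(video_id: str, cues: list[Cue]) -> dict[str, Any]:
    """Build one Message covering a contiguous run of cues."""
    start = format_timecode(cues[0][0])
    end = format_timecode(cues[-1][1])
    text = " ".join(cue[2] for cue in cues)
    return {
        "role": "user",
        "content": f"Video transcript from {start} to {end}: {text}",
        "metadata": {
            "source": "video_transcript",
            "video_id": video_id,
            "segment_start": start,
            "segment_end": end,
        },
    }


def transcript_to_messages(
    video_id: str, cues: list[Cue], max_segment_seconds: int = 60
) -> list[dict[str, Any]]:
    """Group cues into segments no longer than max_segment_seconds."""
    messages: list[dict[str, Any]] = []
    current: list[Cue] = []
    for cue in cues:
        # Close the current segment if adding this cue would exceed the limit.
        if current and cue[1] - current[0][0] > max_segment_seconds:
            messages.append(segment_message(video_id, current))
            current = []
        current.append(cue)
    if current:
        messages.append(segment_message(video_id, current))
    return messages


cues = [
    (130, 150, "The onboarding flow should avoid long forms."),
    (150, 185, "It should support SSO first."),
    (185, 240, "Next topic: billing exports."),
]
messages = transcript_to_messages("video_xxx", cues)
```

With these cues, the first two fall into one segment and the billing cue starts a second, so each Message stays on a single topic.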

Write Guidance

Convert multimodal input into text facts useful for future tasks.
Keep attachment, resource, and business object references for audit and deletion.
Do not treat full media files as the primary Memory content.
Do not write uncertain model recognition as certain Facts.
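The last guideline could be applied upstream along these lines. This is a sketch under stated assumptions: `recognition_message`, the `recognition_confidence` metadata key, and the 0.9 threshold are all hypothetical. This page does not describe how GUMem treats hedged wording, so the sketch only keeps the uncertainty visible in the text and the metadata.

```python
from typing import Any


def recognition_message(
    label: str,
    confidence: float,
    attachment_id: str,
    threshold: float = 0.9,  # illustrative cut-off, not a GUMem setting
) -> dict[str, Any]:
    """Word a model recognition result according to its confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        content = f"Image recognition result: {label}."
    else:
        # Keep the uncertainty visible in the text itself.
        content = (
            f"Unverified image recognition result, possibly {label} "
            f"(model confidence {confidence:.2f})."
        )
    return {
        "role": "user",
        "content": content,
        "metadata": {
            "source": "image_recognition",
            "attachment_id": attachment_id,
            "recognition_confidence": confidence,
        },
    }


certain = recognition_message("a restaurant receipt", 0.97, "att_xxx")
uncertain = recognition_message("a restaurant receipt", 0.55, "att_xxx")
```

Keeping the attachment reference in metadata also serves the audit and deletion guideline above.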

Next Step

See Add Memory for how to write text-shaped multimodal content into GUMem.
