Using AI to Generate Text, Images, and Videos in a Single Workflow
From Module 6: Prompt Engineering: Techniques and Approaches
Introduction
AI is evolving beyond just text-based interactions. Multimodal AI allows users to generate text, images, audio, and videos within a single workflow. This lesson will cover:
✅ What multimodal AI is
✅ Techniques for combining different media types
✅ Real-world applications
✅ Hands-on exercises
What is Multimodal AI?
Definition: Multimodal AI can process and generate content in multiple formats (text, images, video, speech, etc.).
Example:
- You provide a text prompt, and AI generates an image.
- AI then uses the image to generate a descriptive caption or video.
🤖 Popular Multimodal AI Models:
Several advanced AI models can process and generate multiple formats (text, images, video, and speech). Here are some of the top multimodal AI models:
- GPT-4V (Vision) – OpenAI’s multimodal version of GPT-4 that understands images and text together.
- DALL·E 3 – Generates high-quality AI images from text prompts and can now refine images using natural language.
- Gemini 1.5 (Google DeepMind) – Can process text, images, audio, and code in a single model.
- Grok-1.5V (xAI by Elon Musk) – A multimodal version of Grok that can interpret images and text-based inputs.
- Claude 3 (Anthropic) – Capable of handling text and some multimodal tasks (but not as visual-focused as GPT-4V or Gemini).
- Runway Gen-2 – A powerful AI video generator that transforms text prompts into short video clips.
- Pika Labs – Another AI tool for generating animated videos from text descriptions.
- Whisper (OpenAI) – An AI speech-to-text model that accurately transcribes and translates audio.
These models enable seamless multimodal workflows, making it possible to generate, edit, and enhance content across text, images, and video.
Multimodal Prompting Techniques
1. Text-to-Image Generation (Prompting for Images)
AI converts a detailed text prompt into an image.
Example Prompt:
“A futuristic city skyline at sunset, with flying cars and neon holograms reflecting off the glass buildings, in cyberpunk style.”
Best Practices:
- Be descriptive (e.g., “A cozy library with warm lighting and wooden bookshelves.”)
- Specify styles (e.g., “A Van Gogh-style painting of a sunflower field.”)
- Define composition (e.g., “A close-up portrait of a smiling astronaut on Mars.”)
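For readers who want to script this step rather than use a web UI, here is a minimal sketch of sending a text-to-image prompt with the OpenAI Python SDK. It assumes the `openai` package is installed and an `OPENAI_API_KEY` environment variable is set; the model name and image size are illustrative assumptions and may differ for your account.

```python
# Minimal text-to-image sketch (assumes the `openai` package and an API key).
# The model name and image size are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "A futuristic city skyline at sunset, with flying cars and neon holograms "
    "reflecting off the glass buildings, in cyberpunk style."
)

result = client.images.generate(
    model="dall-e-3",      # assumed model name
    prompt=prompt,
    size="1024x1024",
    n=1,
)

print(result.data[0].url)  # link to the generated image
```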
2. Text-to-Video Generation (Prompting for Videos)
AI creates short videos from text descriptions or enhances images into animations.
- Example Prompt for Video AI (Runway ML):
“A golden retriever running on a beach at sunrise, slow motion, cinematic lighting.”
Best Practices:
- Use clear scene descriptions (e.g., “A waterfall in a dense jungle, viewed from a drone.”)
- Define camera movements (e.g., “A slow zoom into a spaceship cockpit.”)
- Add mood settings (e.g., “Dramatic lighting, 4K quality, cinematic tone.”)
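Text-to-video tools such as Runway and Pika are mostly used through their web apps. Where a provider exposes an HTTP API, a request tends to look roughly like the sketch below; the endpoint URL, JSON fields, and environment variable here are hypothetical placeholders, not a documented vendor API.

```python
# Hypothetical text-to-video request. The endpoint, fields, and env var are
# placeholders only -- check your provider's documentation for the real API.
import os
import requests

API_URL = "https://api.example-video.ai/v1/generations"  # placeholder endpoint
API_KEY = os.environ["VIDEO_API_KEY"]                     # placeholder variable

payload = {
    "prompt": (
        "A golden retriever running on a beach at sunrise, "
        "slow motion, cinematic lighting."
    ),
    "duration_seconds": 4,     # illustrative parameter
    "resolution": "1280x720",  # illustrative parameter
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
response.raise_for_status()
print(response.json())  # most services return a job id you then poll for the clip
```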
3. Image-to-Text (Descriptive AI Captions & Summaries)
AI analyzes an image and generates text descriptions.
Example Use Case:
- Input: Upload a photo of the Eiffel Tower.
- AI Output: “A stunning view of the Eiffel Tower at night, illuminated against a deep blue sky.”
Best Practices:
- Request detailed descriptions (e.g., “Describe this image in 50 words.”)
- Use contextual instructions (e.g., “Generate a social media caption for this image.”)
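Captioning can also be done programmatically by passing an image to a vision-capable chat model. The sketch below uses the OpenAI Python SDK; the model name and image URL are assumptions, so substitute whichever vision model and image you actually have.

```python
# Minimal image-to-text (captioning) sketch with the OpenAI Python SDK.
# The model name and image URL are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

image_url = "https://example.com/eiffel-tower-night.jpg"  # placeholder image

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in 50 words as a social media caption.",
                },
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```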
4. Text-to-Speech (AI Voice Generation)
AI converts text into realistic voice narration.
Example Prompt for AI Voice:
“Read this article in a warm, friendly voice with natural pauses.”
Best Practices:
- Choose a tone (e.g., “Excited, formal, or calm.”)
- Set a pacing style (e.g., “Slow narration for storytelling.”)
- Specify emotion (e.g., “Sound enthusiastic while describing the product.”)
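Narration can likewise be generated from a script with a text-to-speech API. This is a minimal sketch using the OpenAI Python SDK; the model and voice names are assumptions, and tone or pacing controls vary by provider (often they are separate parameters rather than part of the input text).

```python
# Minimal text-to-speech sketch (assumes the `openai` package and an API key).
# Model and voice names are assumptions; tone/pacing options vary by provider.
from openai import OpenAI

client = OpenAI()

script = (
    "Meet the new AI-powered smartwatch: seven-day battery life, "
    "fitness tracking, and a display you can read in direct sunlight."
)

speech = client.audio.speech.create(
    model="tts-1",   # assumed TTS model name
    voice="alloy",   # assumed voice name
    input=script,
)

# Save the returned audio bytes to a file.
with open("narration.mp3", "wb") as f:
    f.write(speech.read())
```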
5. Combining Modalities in a Single Workflow
🔹 Example: AI-Powered Marketing Workflow
1. Generate a product description (Text)
- “A sleek, lightweight smartwatch with 7-day battery life and AI fitness tracking.”
2. Convert it into an ad image (Text-to-Image) - AI generates a high-quality product image.
3. Create a short promo video (Image-to-Video) - AI animates the product with smooth transitions.
4. Add AI voice narration (Text-to-Speech) - A professional AI voice reads the product features.
Best Practices:
- Define the end goal before prompting.
- Use consistent prompts across all media types.
- Fine-tune details to make outputs more realistic.
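As a rough end-to-end sketch of the marketing workflow above, the steps can be chained in one script. It reuses the assumed OpenAI calls from the earlier sketches; the video step is left as a placeholder because image-to-video APIs differ from vendor to vendor.

```python
# Rough end-to-end marketing workflow: copy -> ad image -> (video) -> narration.
# Model names are assumptions; the video step is a placeholder.
from openai import OpenAI

client = OpenAI()
concept = ("A sleek, lightweight smartwatch with 7-day battery life "
           "and AI fitness tracking.")

# 1. Product description (text)
copy = client.chat.completions.create(
    model="gpt-4o",  # assumed model
    messages=[{"role": "user",
               "content": f"Write a 50-word ad description for: {concept}"}],
).choices[0].message.content

# 2. Ad image (text-to-image)
image_url = client.images.generate(
    model="dall-e-3",  # assumed model
    prompt=f"Clean studio product photo for an online ad: {concept}",
    size="1024x1024",
).data[0].url

# 3. Promo video (image-to-video) -- placeholder: submit image_url to the
#    video-generation service of your choice here.

# 4. Voice narration (text-to-speech)
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=copy)
with open("promo_narration.mp3", "wb") as f:
    f.write(speech.read())

print("Copy:", copy)
print("Image:", image_url)
```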
Real-World Applications of Multimodal AI
1. Content Creation & Marketing
- AI writes blog posts, generates matching images, and creates promotional videos.
- Example: An AI-generated travel blog that includes AI-created images and narrated videos.
2. Virtual Assistants & AI Chatbots
- AI chatbots can answer questions with text and images.
- Example: A virtual home designer suggests furniture and generates room mockups.
3. Art & Design
- AI helps concept artists generate quick sketches before turning them into 3D models.
- Example: Game designers use AI-generated landscapes for virtual worlds.
4. AI-Powered Video Editing
- AI can animate still images into short films.
- Example: Runway AI helps filmmakers create visual effects without green screens.
5. Journalism & Fact-Checking
- AI generates news summaries, verifies images, and detects deepfakes.
- Example: AI scans images to confirm their authenticity in breaking news.
Hands-On Exercise: Create a Multimodal AI Workflow
🔹 Goal: Use different AI tools to generate text, images, and video from a single concept.
Step 1: Generate a Concept
Pick a theme for your multimodal AI project.
- Example: “A futuristic eco-friendly city with AI-powered transportation.”
Step 2: Generate Text Content
🔹 Prompt:
“Write a 100-word description of a futuristic green city powered by AI and renewable energy.”
Step 3: Generate an Image Based on the Text
🔹 Prompt for an AI Image Generator:
“Create a detailed digital artwork of a futuristic eco-city with solar panels, flying cars, and green skyscrapers.”
Step 4: Generate a Short Video from the Image
🔹 Prompt for a Video Generator:
“Animate this futuristic city scene with moving traffic, flying drones, and changing weather effects.”
Step 5: Add AI Voice Narration
🔹 Prompt for AI Voice Generator:
“Narrate this description in an inspiring documentary-style voice.”
✅ End Result: A cohesive AI-generated project combining text, images, video, and speech!
Reflection Questions
- What was the most challenging part of using multimodal AI?
- How did changing the prompts affect AI’s output?
- How could you use multimodal AI in your field (marketing, education, design, etc.)?

The biggest challenge, I would say, is getting the right prompt to produce the desired output. Customizing the prompt is a bit more difficult than it looks, but overall it was an interesting exercise.
1. The most challenging part of using multimodal AI was learning how to balance text prompts with visual inputs. Small mismatches between the written prompt and the reference image or video often led to outputs that were inconsistent with my intention.
2. Changing the prompt had a direct impact on the accuracy, style, and emotional tone of the output. More detailed prompts—such as specifying lighting, mood, cultural context, or realism level—produced outputs that were more refined and aligned with expectations.
3. In education, multimodal AI can help create engaging instructional content such as animated lessons, visual explanations, and storytelling videos that make complex topics easier to understand.
The only real challenge for me, honestly, is hallucinations.
1. What was the most challenging part of using multimodal AI?
The most challenging part was making sure that the text, images, and other media matched each other. Sometimes the outputs were not fully aligned, so extra effort was needed to make them consistent.
2. How did changing the prompts affect AI’s output?
Changing the prompts made a big difference in the results. When the prompts were clearer and more detailed, the AI produced better and more accurate outputs.
3. How could you use multimodal AI in your field?
I could use multimodal AI to create content, explain ideas better, and improve communication by combining text, images, and other media.
What was the most challenging part of using multimodal AI?
Answer: The prompting aspect is the most challenging; sometimes, despite giving a detailed prompt, you still might not be satisfied with the outcome. It is not that the image is out of line with your request, but that you have a picture in your mind and the image the AI generates does not meet that expectation.
How did changing the prompts affect AI’s output?
Answer: Most of the time, when you continue to iterate, you finally get exactly what you requested. Iterating on prompts really helped, and the final output was always satisfactory.
How could you use multimodal AI in your field (marketing, education, design, etc.)?
I mostly use it on my blog; I describe an emotion I want in an image, and I always get the desired result.
Using multimodal prompts is interesting, but I find it a bit difficult to navigate the AI apps.
Using multimodal prompts is interesting but challenging, especially when it comes to getting access to the right kind of AI app. Many of the apps out there are cluttered with distracting adverts.
It is also challenging on a phone that has not been upgraded to support the features of the newer AI tools.
1. The most challenging part of using multimodal AI for me is describing the scene you want to create in a way that the AI will understand and create exactly what you have in mind.
2. The more detailed my prompts are, the better the output. What the AI gives you is proportional to your prompting skill.
3. I am a physiotherapist. I can use it to create short social media posts educating people on basic physiotherapy tips for their day-to-day activities and on ways to improve their physical and functional performance, as well as to grow my social media presence and promote myself and my profession.
1. The Biggest Challenge
The hardest part is “modality conflict,” where the AI might prioritize the text instructions over the actual visual data in an image (or vice versa). It’s often difficult to get the AI to “look” at the specific detail you care about rather than the most obvious thing in the frame.
2. How Prompts Change the Output
Changing a prompt from a general “What is this?” to a spatially-aware instruction (like “Focus on the graph in the bottom right”) completely shifts the AI’s logic. Specificity prevents the AI from “hallucinating” or guessing based on text context alone, resulting in much higher technical accuracy.
3. Use in My Field (Education)
It’s a lifesaver for me in quickly comparing data across multiple PDF charts without having to manually extract the numbers.