Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts and accessibility
Explore how you can use the new Gemini Pro Vision model with the Gemini API to handle multimodal input, combining text and image prompts to receive a text result. In this solution, you will learn how to access the Gemini API with image and text data, explore a variety of example prompts that use images with Gemini Pro Vision, and complete a codelab that applies the API to a real-world problem involving accessibility and basic web development.
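For example, a minimal Node.js sketch of such a multimodal call might look like the following. It assumes the @google/generative-ai package is installed, an API key is available in the API_KEY environment variable, and a local file named image.png exists; adjust these names for your own setup.

// Minimal sketch: send a text prompt plus an image to Gemini Pro Vision and
// print the text response. Assumes `npm install @google/generative-ai` and an
// API key in process.env.API_KEY; image.png is a placeholder file name.
const fs = require("fs");
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.API_KEY);

// Wrap a local file in the inline-data format the API expects.
function fileToGenerativePart(filePath, mimeType) {
  return {
    inlineData: {
      data: fs.readFileSync(filePath).toString("base64"),
      mimeType,
    },
  };
}

async function run() {
  const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });
  const prompt = "Describe what is happening in this image.";
  const imagePart = fileToGenerativePart("image.png", "image/png");

  const result = await model.generateContent([prompt, imagePart]);
  console.log(result.response.text());
}

run();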
Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts and accessibility
Video
Learn how to use the multimodal features of the Gemini model in a Node.js script to analyze HTML documents and image files and add accessible descriptions to a webpage.
Quickstart: Get started with the Gemini API in Node.js applications
Article
Learn how to generate text from multimodal text-and-image input data using the Gemini Pro Vision model in Node.js.
Explore examples of how Gemini's multimodal image and text inputs can be combined to produce text output about images across a range of use cases.
Prompting with images and text using the Gemini API for accessibility
Codelab
In this codelab, you will write a Node.js script that uses the Gemini Pro Vision model to analyze a local HTML document and, where needed, generate accessible descriptions for the images on the page. With Gemini, the script can verify whether an existing description accurately reflects a given image and, if not, generate an entirely new one.
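As a rough sketch of the kind of script the codelab builds (not the codelab's actual code), the following scans a local HTML file for images that lack alt text and asks Gemini Pro Vision to propose a description. It assumes the same @google/generative-ai setup as above plus the jsdom package; index.html is a placeholder file name.

// Rough sketch: add Gemini-generated alt text to images that are missing it.
// Assumes `npm install @google/generative-ai jsdom` and an API key in
// process.env.API_KEY; index.html is a placeholder file name.
const fs = require("fs");
const path = require("path");
const { JSDOM } = require("jsdom");
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

// Ask Gemini Pro Vision for a short accessible description of one image file.
async function describeImage(imagePath) {
  const imagePart = {
    inlineData: {
      data: fs.readFileSync(imagePath).toString("base64"),
      mimeType: "image/png", // adjust to match the actual image type
    },
  };
  const prompt = "Write a short, accessible alt text for this image.";
  const result = await model.generateContent([prompt, imagePart]);
  return result.response.text().trim();
}

async function run() {
  const htmlPath = "index.html"; // placeholder local page
  const dom = new JSDOM(fs.readFileSync(htmlPath, "utf8"));
  const images = dom.window.document.querySelectorAll("img");

  for (const img of images) {
    if (!img.getAttribute("alt")) {
      // Resolve the image path relative to the HTML file, then generate alt text.
      const imagePath = path.join(path.dirname(htmlPath), img.getAttribute("src"));
      img.setAttribute("alt", await describeImage(imagePath));
    }
  }

  fs.writeFileSync(htmlPath, dom.serialize());
}

run();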
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],[],[[["\u003cp\u003eGemini supports multimodal prompts, accepting text and image inputs while providing text-only responses.\u003c/p\u003e\n"],["\u003cp\u003eMultimodal prompts allow for a variety of use cases like image classification, object recognition, and creative text generation based on images.\u003c/p\u003e\n"],["\u003cp\u003eGemini can interpret and reason about images, enabling tasks like counting objects, understanding handwriting, and inferring temporal information from scenes.\u003c/p\u003e\n"],["\u003cp\u003eAdvanced multimodal prompts can combine multiple skills like handwriting recognition, logical reasoning, and world knowledge for creative and practical applications.\u003c/p\u003e\n"],["\u003cp\u003eExperimentation with different multimodal prompts is encouraged to explore the full potential of Gemini's capabilities.\u003c/p\u003e\n"]]],["Multimodal prompts, combining text and images, enable LLMs like Gemini to perform diverse tasks. Key actions include entity recognition, classification, and counting objects in images. More advanced applications demonstrate text recognition from handwriting, reasoning, calculation, interpreting scene lighting for time inference, and creative tasks like haiku generation. Additionally, LLMs can identify logical progressions, understand object attributes, and infer real-world practicality. These capabilities highlight the power of multimodal prompts for understanding and extracting information from combined input.\n"],null,[]]