Recognize Text in Images with ML Kit on iOS

You can use ML Kit to recognize text in images. ML Kit has both a general-purpose API suitable for recognizing text in images, such as the text of a street sign, and an API optimized for recognizing the text of documents. The general-purpose API has both on-device and cloud-based models. Document text recognition is available only as a cloud-based model. See the overview for a comparison of the cloud and on-device models.

Before you begin

  1. If you have not already added Firebase to your app, do so by following the steps in the getting started guide.
  2. Include the ML Kit libraries in your Podfile:
    pod 'Firebase/MLVision', '6.25.0'
    # If using an on-device API:
    pod 'Firebase/MLVisionTextModel', '6.25.0'
    After you install or update your project's Pods, be sure to open your Xcode project using its .xcworkspace.
  3. In your app, import Firebase:

    Swift

     import Firebase

    Objective-C

     @import Firebase;

  4. If you want to use the Cloud-based model, and you have not already enabled the Cloud-based APIs for your project, do so now:

    1. Open the ML Kit APIs page of the Firebase console.
    2. If you have not already upgraded your project to a Blaze pricing plan, click Upgrade to do so. (You will be prompted to upgrade only if your project isn't on the Blaze plan.)

      Only Blaze-level projects can use Cloud-based APIs.

    3. If Cloud-based APIs aren't already enabled, click Enable Cloud-based APIs.

    If you want to use only the on-device model, you can skip this step.

Now you are ready to start recognizing text in images.

Input image guidelines

  • For ML Kit to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, for Latin text, each character should be at least 16x16 pixels. For Chinese, Japanese, and Korean text (only supported by the cloud-based APIs), each character should be 24x24 pixels. For all languages, there is generally no accuracy benefit for characters to be larger than 24x24 pixels.

    So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required.

  • Poor image focus can hurt text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.

  • If you are recognizing text in a real-time application, you might also want to consider the overall dimensions of the input images. Smaller images can be processed faster, so to reduce latency, capture images at lower resolutions (keeping in mind the above accuracy requirements) and ensure that the text occupies as much of the image as possible; see the downscaling sketch after this list. Also see Tips to improve real-time performance.

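For example, one way to keep latency down is to downscale captured images before processing. Here is a minimal sketch, assuming a UIImage input; the scaledImage helper name and the 720-point target are illustrative choices, not part of the ML Kit API:

    import UIKit

    // Hypothetical helper: downscale an image so its longest side is at most
    // maxDimension points, preserving aspect ratio. Around 640-720 on the long
    // side is often enough when the text fills most of the frame (see above).
    func scaledImage(_ image: UIImage, maxDimension: CGFloat = 720) -> UIImage {
        let longestSide = max(image.size.width, image.size.height)
        guard longestSide > maxDimension else { return image }
        let scale = maxDimension / longestSide
        let newSize = CGSize(width: image.size.width * scale,
                             height: image.size.height * scale)
        return UIGraphicsImageRenderer(size: newSize).image { _ in
            image.draw(in: CGRect(origin: .zero, size: newSize))
        }
    }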

Recognize text in images

To recognize text in an image using either an on-device or cloud-based model, run the text recognizer as described below.

1. Run the text recognizer

Pass the image as a `UIImage` or a `CMSampleBufferRef` to the `VisionTextRecognizer`'s `process(_:completion:)` method:
  1. Get an instance of VisionTextRecognizer by calling either onDeviceTextRecognizer or cloudTextRecognizer:

    Swift

    To use the on-device model:

     let vision = Vision.vision()
     let textRecognizer = vision.onDeviceTextRecognizer()

    To use the cloud model:

     let vision = Vision.vision()
     let textRecognizer = vision.cloudTextRecognizer()

     // Or, to provide language hints to assist with language detection:
     // See https://cloud.google.com/vision/docs/languages for supported languages
     let options = VisionCloudTextRecognizerOptions()
     options.languageHints = ["en", "hi"]
     let textRecognizer = vision.cloudTextRecognizer(options: options)

    Objective-C

    To use the on-device model:

     FIRVision *vision = [FIRVision vision];
     FIRVisionTextRecognizer *textRecognizer = [vision onDeviceTextRecognizer];

    To use the cloud model:

     FIRVision *vision = [FIRVision vision];
     FIRVisionTextRecognizer *textRecognizer = [vision cloudTextRecognizer];

     // Or, to provide language hints to assist with language detection:
     // See https://cloud.google.com/vision/docs/languages for supported languages
     FIRVisionCloudTextRecognizerOptions *options =
         [[FIRVisionCloudTextRecognizerOptions alloc] init];
     options.languageHints = @[@"en", @"hi"];
     FIRVisionTextRecognizer *textRecognizer =
         [vision cloudTextRecognizerWithOptions:options];

  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up (a normalization sketch follows these steps).
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata; the default value, .topLeft, must be used.

      Swift

       let image = VisionImage(image: uiImage)

      Objective-C

       FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];
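
     Step 1 above asks for a correctly-rotated image. A minimal sketch of one way to normalize a UIImage's orientation by redrawing it (the normalizedImage helper is illustrative, not part of ML Kit):

       import UIKit

       // Hypothetical helper: returns a copy of the image rendered with .up
       // orientation, so no rotation metadata has to be supplied to ML Kit.
       func normalizedImage(_ image: UIImage) -> UIImage {
           guard image.imageOrientation != .up else { return image }
           return UIGraphicsImageRenderer(size: image.size).image { _ in
               image.draw(in: CGRect(origin: .zero, size: image.size))
           }
       }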

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer.

      To get the image orientation:

      Swift

       func imageOrientation(
           deviceOrientation: UIDeviceOrientation,
           cameraPosition: AVCaptureDevice.Position
           ) -> VisionDetectorImageOrientation {
           switch deviceOrientation {
           case .portrait:
               return cameraPosition == .front ? .leftTop : .rightTop
           case .landscapeLeft:
               return cameraPosition == .front ? .bottomLeft : .topLeft
           case .portraitUpsideDown:
               return cameraPosition == .front ? .rightBottom : .leftBottom
           case .landscapeRight:
               return cameraPosition == .front ? .topRight : .bottomRight
           case .faceDown, .faceUp, .unknown:
               return .leftTop
           }
       }

      Objective-C

       - (FIRVisionDetectorImageOrientation)imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                                                                   cameraPosition:(AVCaptureDevicePosition)cameraPosition {
         switch (deviceOrientation) {
           case UIDeviceOrientationPortrait:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationLeftTop;
             } else {
               return FIRVisionDetectorImageOrientationRightTop;
             }
           case UIDeviceOrientationLandscapeLeft:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationBottomLeft;
             } else {
               return FIRVisionDetectorImageOrientationTopLeft;
             }
           case UIDeviceOrientationPortraitUpsideDown:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationRightBottom;
             } else {
               return FIRVisionDetectorImageOrientationLeftBottom;
             }
           case UIDeviceOrientationLandscapeRight:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationTopRight;
             } else {
               return FIRVisionDetectorImageOrientationBottomRight;
             }
           default:
             return FIRVisionDetectorImageOrientationTopLeft;
         }
       }

      Then, create the metadata object:

      Swift

       let cameraPosition = AVCaptureDevice.Position.back  // Set to the capture device you used.
       let metadata = VisionImageMetadata()
       metadata.orientation = imageOrientation(
           deviceOrientation: UIDevice.current.orientation,
           cameraPosition: cameraPosition
       )

      Objective-C

       FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
       AVCaptureDevicePosition cameraPosition =
           AVCaptureDevicePositionBack;  // Set to the capture device you used.
       metadata.orientation =
           [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                        cameraPosition:cameraPosition];

    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

       let image = VisionImage(buffer: sampleBuffer)
       image.metadata = metadata

      Objective-C

       FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:sampleBuffer];
       image.metadata = metadata;

  3. Then, pass the image to the process(_:completion:) method:

    Swift

     textRecognizer.process(visionImage) { result, error in
         guard error == nil, let result = result else {
             // ...
             return
         }

         // Recognized text
     }

    Objective-C

     [textRecognizer processImage:image
                       completion:^(FIRVisionText *_Nullable result,
                                    NSError *_Nullable error) {
       if (error != nil || result == nil) {
         // ...
         return;
       }

       // Recognized text
     }];
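
Putting the pieces together, here is a minimal end-to-end sketch that recognizes text in a UIImage with the on-device model (uiImage is assumed to be a correctly-rotated image from your app):

    import Firebase

    let vision = Vision.vision()
    let textRecognizer = vision.onDeviceTextRecognizer()
    let visionImage = VisionImage(image: uiImage)

    textRecognizer.process(visionImage) { result, error in
        guard error == nil, let result = result else {
            print("Text recognition failed: \(error?.localizedDescription ?? "no result")")
            return
        }
        print(result.text)  // Full recognized text
    }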

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it returns a `VisionText` object. A `VisionText` object contains the full text recognized in the image and zero or more `VisionTextBlock` objects. Each `VisionTextBlock` represents a rectangular block of text and contains zero or more `VisionTextLine` objects. Each `VisionTextLine` object contains zero or more `VisionTextElement` objects, which represent words and word-like entities (dates, numbers, and so on). For each `VisionTextBlock`, `VisionTextLine`, and `VisionTextElement` object, you can get the text recognized in the region and the bounding coordinates of the region. For example:

Swift

    let resultText = result.text
    for block in result.blocks {
        let blockText = block.text
        let blockConfidence = block.confidence
        let blockLanguages = block.recognizedLanguages
        let blockCornerPoints = block.cornerPoints
        let blockFrame = block.frame
        for line in block.lines {
            let lineText = line.text
            let lineConfidence = line.confidence
            let lineLanguages = line.recognizedLanguages
            let lineCornerPoints = line.cornerPoints
            let lineFrame = line.frame
            for element in line.elements {
                let elementText = element.text
                let elementConfidence = element.confidence
                let elementLanguages = element.recognizedLanguages
                let elementCornerPoints = element.cornerPoints
                let elementFrame = element.frame
            }
        }
    }

Objective-C

    NSString *resultText = result.text;
    for (FIRVisionTextBlock *block in result.blocks) {
      NSString *blockText = block.text;
      NSNumber *blockConfidence = block.confidence;
      NSArray<FIRVisionTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
      NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
      CGRect blockFrame = block.frame;
      for (FIRVisionTextLine *line in block.lines) {
        NSString *lineText = line.text;
        NSNumber *lineConfidence = line.confidence;
        NSArray<FIRVisionTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
        NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
        CGRect lineFrame = line.frame;
        for (FIRVisionTextElement *element in line.elements) {
          NSString *elementText = element.text;
          NSNumber *elementConfidence = element.confidence;
          NSArray<FIRVisionTextRecognizedLanguage *> *elementLanguages = element.recognizedLanguages;
          NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
          CGRect elementFrame = element.frame;
        }
      }
    }
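
For example, to visualize the results you could outline each block's frame on top of the displayed image. A minimal sketch, assuming for simplicity that the image view's coordinate system matches the processed image (a real app must convert between image and view coordinates):

    import UIKit

    // Hypothetical visualization helper: outlines each recognized block.
    func drawBlockFrames(from result: VisionText, on imageView: UIImageView) {
        for block in result.blocks {
            let outline = CAShapeLayer()
            outline.path = UIBezierPath(rect: block.frame).cgPath
            outline.strokeColor = UIColor.red.cgColor
            outline.fillColor = UIColor.clear.cgColor
            outline.lineWidth = 2
            imageView.layer.addSublayer(outline)
        }
    }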

Tips to improve real-time performance

If you want to use the on-device model to recognize text in a real-time application, follow these guidelines to achieve the best framerates:

  • Throttle calls to the text recognizer. If a new video frame becomes available while the text recognizer is running, drop the frame (see the sketch after this list).
  • If you are using the output of the text recognizer to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each input frame. See the previewOverlayView and FIRDetectionOverlayView classes in the showcase sample app for an example.
  • Consider capturing images at a lower resolution. However, also keep in mind this API's image dimension requirements.

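For instance, the first tip can be implemented with a busy flag in your capture delegate. A minimal sketch (the FrameThrottler class is illustrative; a real implementation should also synchronize access to isProcessing across threads):

    import AVFoundation
    import Firebase

    // Hypothetical capture delegate that drops frames while a recognition
    // request is still in flight, so at most one frame is processed at a time.
    class FrameThrottler: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
        private let textRecognizer = Vision.vision().onDeviceTextRecognizer()
        private var isProcessing = false

        func captureOutput(_ output: AVCaptureOutput,
                           didOutput sampleBuffer: CMSampleBuffer,
                           from connection: AVCaptureConnection) {
            // Drop this frame if the recognizer is still busy with the last one.
            guard !isProcessing else { return }
            isProcessing = true

            let image = VisionImage(buffer: sampleBuffer)
            // Set image.metadata as shown earlier in this guide.
            textRecognizer.process(image) { [weak self] _, _ in
                self?.isProcessing = false
                // Handle the result or error, then render any overlay.
            }
        }
    }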

Recognize text in images of documents

To recognize the text of a document, configure and run the cloud-based document text recognizer as described below.

The document text recognition API, described below, provides an interface that is intended to be more convenient for working with images of documents. However, if you prefer the interface provided by the sparse text API, you can use it instead to scan documents by configuring the cloud text recognizer to use the dense text model, as in the sketch below.

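A sketch of that configuration, assuming the modelType option on VisionCloudTextRecognizerOptions selects the dense text model (check the SDK reference for the exact enum):

    let vision = Vision.vision()
    let options = VisionCloudTextRecognizerOptions()
    options.modelType = .dense  // Assumed: selects the dense (document) text model.
    let textRecognizer = vision.cloudTextRecognizer(options: options)
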
To use the document text recognition API:

1. Run the text recognizer

Pass the image as a UIImage or a CMSampleBufferRef to the VisionDocumentTextRecognizer's process(_:completion:) method:
  1. Get an instance of VisionDocumentTextRecognizer by calling cloudDocumentTextRecognizer:

    Swift

     let vision = Vision.vision()
     let textRecognizer = vision.cloudDocumentTextRecognizer()

     // Or, to provide language hints to assist with language detection:
     // See https://cloud.google.com/vision/docs/languages for supported languages
     let options = VisionCloudDocumentTextRecognizerOptions()
     options.languageHints = ["en", "hi"]
     let textRecognizer = vision.cloudDocumentTextRecognizer(options: options)

    Objective-C

     FIRVision *vision = [FIRVision vision];
     FIRVisionDocumentTextRecognizer *textRecognizer = [vision cloudDocumentTextRecognizer];

     // Or, to provide language hints to assist with language detection:
     // See https://cloud.google.com/vision/docs/languages for supported languages
     FIRVisionCloudDocumentTextRecognizerOptions *options =
         [[FIRVisionCloudDocumentTextRecognizerOptions alloc] init];
     options.languageHints = @[@"en", @"hi"];
     FIRVisionDocumentTextRecognizer *textRecognizer =
         [vision cloudDocumentTextRecognizerWithOptions:options];

  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up.
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata; the default value, .topLeft, must be used.

      Swift

       let image = VisionImage(image: uiImage)

      Objective-C

       FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer.

      To get the image orientation:

      Swift

       func imageOrientation(
           deviceOrientation: UIDeviceOrientation,
           cameraPosition: AVCaptureDevice.Position
           ) -> VisionDetectorImageOrientation {
           switch deviceOrientation {
           case .portrait:
               return cameraPosition == .front ? .leftTop : .rightTop
           case .landscapeLeft:
               return cameraPosition == .front ? .bottomLeft : .topLeft
           case .portraitUpsideDown:
               return cameraPosition == .front ? .rightBottom : .leftBottom
           case .landscapeRight:
               return cameraPosition == .front ? .topRight : .bottomRight
           case .faceDown, .faceUp, .unknown:
               return .leftTop
           }
       }

      Objective-C

       - (FIRVisionDetectorImageOrientation)imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                                                                   cameraPosition:(AVCaptureDevicePosition)cameraPosition {
         switch (deviceOrientation) {
           case UIDeviceOrientationPortrait:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationLeftTop;
             } else {
               return FIRVisionDetectorImageOrientationRightTop;
             }
           case UIDeviceOrientationLandscapeLeft:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationBottomLeft;
             } else {
               return FIRVisionDetectorImageOrientationTopLeft;
             }
           case UIDeviceOrientationPortraitUpsideDown:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationRightBottom;
             } else {
               return FIRVisionDetectorImageOrientationLeftBottom;
             }
           case UIDeviceOrientationLandscapeRight:
             if (cameraPosition == AVCaptureDevicePositionFront) {
               return FIRVisionDetectorImageOrientationTopRight;
             } else {
               return FIRVisionDetectorImageOrientationBottomRight;
             }
           default:
             return FIRVisionDetectorImageOrientationTopLeft;
         }
       }

      Then, create the metadata object:

      Swift

       let cameraPosition = AVCaptureDevice.Position.back  // Set to the capture device you used.
       let metadata = VisionImageMetadata()
       metadata.orientation = imageOrientation(
           deviceOrientation: UIDevice.current.orientation,
           cameraPosition: cameraPosition
       )

      Objective-C

       FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
       AVCaptureDevicePosition cameraPosition =
           AVCaptureDevicePositionBack;  // Set to the capture device you used.
       metadata.orientation =
           [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                        cameraPosition:cameraPosition];

    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

       let image = VisionImage(buffer: sampleBuffer)
       image.metadata = metadata

      Objective-C

       FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:sampleBuffer];
       image.metadata = metadata;

  3. Then, pass the image to the process(_:completion:) method:

    Swift

     textRecognizer.process(visionImage) { result, error in
         guard error == nil, let result = result else {
             // ...
             return
         }

         // Recognized text
     }

    Objective-C

     [textRecognizer processImage:image
                       completion:^(FIRVisionDocumentText *_Nullable result,
                                    NSError *_Nullable error) {
       if (error != nil || result == nil) {
         // ...
         return;
       }

       // Recognized text
     }];

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it returns a VisionDocumentText object. A VisionDocumentText object contains the full text recognized in the image and a hierarchy of objects that reflect the structure of the recognized document:

For each VisionDocumentTextBlock, VisionDocumentTextParagraph, VisionDocumentTextWord, and VisionDocumentTextSymbol object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

    let resultText = result.text
    for block in result.blocks {
        let blockText = block.text
        let blockConfidence = block.confidence
        let blockRecognizedLanguages = block.recognizedLanguages
        let blockBreak = block.recognizedBreak
        let blockCornerPoints = block.cornerPoints
        let blockFrame = block.frame
        for paragraph in block.paragraphs {
            let paragraphText = paragraph.text
            let paragraphConfidence = paragraph.confidence
            let paragraphRecognizedLanguages = paragraph.recognizedLanguages
            let paragraphBreak = paragraph.recognizedBreak
            let paragraphCornerPoints = paragraph.cornerPoints
            let paragraphFrame = paragraph.frame
            for word in paragraph.words {
                let wordText = word.text
                let wordConfidence = word.confidence
                let wordRecognizedLanguages = word.recognizedLanguages
                let wordBreak = word.recognizedBreak
                let wordCornerPoints = word.cornerPoints
                let wordFrame = word.frame
                for symbol in word.symbols {
                    let symbolText = symbol.text
                    let symbolConfidence = symbol.confidence
                    let symbolRecognizedLanguages = symbol.recognizedLanguages
                    let symbolBreak = symbol.recognizedBreak
                    let symbolCornerPoints = symbol.cornerPoints
                    let symbolFrame = symbol.frame
                }
            }
        }
    }

Objective-C

    NSString *resultText = result.text;
    for (FIRVisionDocumentTextBlock *block in result.blocks) {
      NSString *blockText = block.text;
      NSNumber *blockConfidence = block.confidence;
      NSArray<FIRVisionTextRecognizedLanguage *> *blockRecognizedLanguages = block.recognizedLanguages;
      FIRVisionTextRecognizedBreak *blockBreak = block.recognizedBreak;
      CGRect blockFrame = block.frame;
      for (FIRVisionDocumentTextParagraph *paragraph in block.paragraphs) {
        NSString *paragraphText = paragraph.text;
        NSNumber *paragraphConfidence = paragraph.confidence;
        NSArray<FIRVisionTextRecognizedLanguage *> *paragraphRecognizedLanguages = paragraph.recognizedLanguages;
        FIRVisionTextRecognizedBreak *paragraphBreak = paragraph.recognizedBreak;
        CGRect paragraphFrame = paragraph.frame;
        for (FIRVisionDocumentTextWord *word in paragraph.words) {
          NSString *wordText = word.text;
          NSNumber *wordConfidence = word.confidence;
          NSArray<FIRVisionTextRecognizedLanguage *> *wordRecognizedLanguages = word.recognizedLanguages;
          FIRVisionTextRecognizedBreak *wordBreak = word.recognizedBreak;
          CGRect wordFrame = word.frame;
          for (FIRVisionDocumentTextSymbol *symbol in word.symbols) {
            NSString *symbolText = symbol.text;
            NSNumber *symbolConfidence = symbol.confidence;
            NSArray<FIRVisionTextRecognizedLanguage *> *symbolRecognizedLanguages = symbol.recognizedLanguages;
            FIRVisionTextRecognizedBreak *symbolBreak = symbol.recognizedBreak;
            CGRect symbolFrame = symbol.frame;
          }
        }
      }
    }
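
As an example of working with this hierarchy, you could flatten the document into one string per paragraph. A minimal sketch; joining words with a single space is a simplification (each word's recognizedBreak carries the actual separator that followed it):

    // Sketch: collect a plain string for every paragraph in the document.
    func paragraphStrings(from result: VisionDocumentText) -> [String] {
        var paragraphs: [String] = []
        for block in result.blocks {
            for paragraph in block.paragraphs {
                let words = paragraph.words.map { $0.text }
                paragraphs.append(words.joined(separator: " "))
            }
        }
        return paragraphs
    }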
