Saturday, July 15, 2017
I want to restrict anonymous links to specified do...
from Google Alert - anonymous http://ift.tt/2tVSA7n
via IFTTT
Porcelain Blues
from Google Alert - anonymous http://ift.tt/2trvBgB
via IFTTT
I have a new follower on Twitter
Pro Wrestling Chaos
Bristol. 3 mates put on a show hoping that enough folks would turn up, that was in June 2013.... Tickets/info: https://t.co/TAlkHZAkNJ Call: 07957198511
Bristol, UK
https://t.co/QjrMcLHT4I
Following: 11627 - Followers: 14913
July 15, 2017 at 10:37AM via Twitter http://twitter.com/chaos_wrestling
Two New Platforms Found Offering Cybercrime-as-a-Service to 'Wannabe Hackers'
from The Hacker News http://ift.tt/2t1EgqV
via IFTTT
Alcoholics Anonymous (AA) Meetings // Events // Rev. James E. McDonald, CSC, Center for ...
from Google Alert - anonymous http://ift.tt/2t0QOP4
via IFTTT
Friday, July 14, 2017
network-anonymous-tor
from Google Alert - anonymous http://ift.tt/2tUWb3D
via IFTTT
Commercial PV Designer III
from Google Alert - anonymous http://ift.tt/2tVkDlt
via IFTTT
ISS Daily Summary Report – 7/13/2017
from ISS On-Orbit Status Report http://ift.tt/2us9rQq
via IFTTT
Awesome! WhatsApp Now Lets You Send Files of Any Format
from The Hacker News http://ift.tt/2uid0Z9
via IFTTT
Ravens: Brian Billick named to preseason broadcast team; next step should be Ring of Honor - Jamison Hensley (ESPN)
via IFTTT
Earth's Magnetosphere
from NASA's Scientific Visualization Studio: Most Recent Items http://ift.tt/2vkGEK5
via IFTTT
Jupiter's Magnetosphere
from NASA's Scientific Visualization Studio: Most Recent Items http://ift.tt/2umAoUI
via IFTTT
Saturn's Magnetosphere
from NASA's Scientific Visualization Studio: Most Recent Items http://ift.tt/2vkCJwU
via IFTTT
Uranus' Magnetosphere
from NASA's Scientific Visualization Studio: Most Recent Items http://ift.tt/2umwJGF
via IFTTT
Neptune's Magnetosphere
from NASA's Scientific Visualization Studio: Most Recent Items http://ift.tt/2vkyGR5
via IFTTT
Orioles 1B Chris Davis (oblique) expected to return from DL Friday vs. Cubs (ESPN)
via IFTTT
AlphaBay Shut Down After Police Raid; Alleged Founder Commits Suicide in Jail
from The Hacker News http://ift.tt/2tm7xvN
via IFTTT
Ubuntu Linux for Windows 10 Released — Yes, You Read it Right
from The Hacker News http://ift.tt/2sWFe7I
via IFTTT
[FD] [CVE-2017-7728] - Authentication Bypass allows alarm's commands execution in iSmartAlarm
Source: Gmail -> IFTTT-> Blogger
[FD] CVE request: Multiple vulnerabilities in Cisco DDR2200 Series
Source: Gmail -> IFTTT-> Blogger
Appendix B: Collection of Anonymous Data
from Google Alert - anonymous http://ift.tt/2tRZ59r
via IFTTT
NGC 4449: Close up of a Small Galaxy
Thursday, July 13, 2017
Anonymous Diner Pays $405 Meal Tab For IE Wildfire Fighters
from Google Alert - anonymous http://ift.tt/2t8fzx1
via IFTTT
Tumblr releases anonymous user accounts to 'revenge porn' victim
from Google Alert - anonymous http://ift.tt/2uWmeYj
via IFTTT
How CIA Agents Covertly Steal Data From Hacked Smartphones (Without Internet)
from The Hacker News http://ift.tt/2udxl1N
via IFTTT
I have a new follower on Twitter
Hamish Bayston
PTSD Coach | Live Beyond PTS-Anxiety-Depression
Melbourne, Victoria
https://t.co/Vf2BErCr3q
Following: 7812 - Followers: 8632
July 13, 2017 at 11:46AM via Twitter http://twitter.com/hamishbayston
MP Demands Ban on Anonymous Twitter Accounts
from Google Alert - anonymous http://ift.tt/2umYZK7
via IFTTT
ISS Daily Summary Report – 7/12/2017
from ISS On-Orbit Status Report http://ift.tt/2uiUZt0
via IFTTT
(L.i.v.e-F.r.e.e)!@!//~Grand Slams Wimbledon Live Tannis 2017 - Anonymous' suggestion - Amplify NI
from Google Alert - anonymous http://ift.tt/2ucPdcY
via IFTTT
Muguruza vs Rybarikova Li.ve Free
from Google Alert - anonymous http://ift.tt/2sTr5Iq
via IFTTT
[LIVE.TV]..Muguruza vs Rybarikova Live Free Semi Final Game
Anonymous Woman Covers $405 Dinner Bill For Firefighters
from Google Alert - anonymous http://ift.tt/2thMCKh
via IFTTT
New Ransomware Threatens to Send Your Internet History & Private Pics to All Your Friends
from The Hacker News http://ift.tt/2vfQjla
via IFTTT
[InsideNothing] amaranto es liked your post "[FD] DefenseCode Security Advisory: IBM DB2 Command Line Processor Buffer Overflow"
Source: Gmail -> IFTTT-> Blogger
Researcher Claims Samsung's Tizen OS is Poorly Programmed; Contains 27,000 Bugs!
from The Hacker News http://ift.tt/2tPiJTF
via IFTTT
Do not track anonymous submissions if convertion is off
from Google Alert - anonymous http://ift.tt/2vfrplB
via IFTTT
Full Moon and Boston Light
Wednesday, July 12, 2017
Anonymous class missing references / missing in implementations
from Google Alert - anonymous http://ift.tt/2ubjEQF
via IFTTT
[FD] CVE-2017-11173 Missing anchor in generated regex for rack-cors before 0.4.1 allows a malicious third-party site to perform CORS requests
Source: Gmail -> IFTTT-> Blogger
[FD] [CVE-2017-7727] - SSRF vulnerability in iSmartAlarm
Source: Gmail -> IFTTT-> Blogger
[FD] [CVE-2017-7726] - Missing SSL Certificate Validation in iSmartAlarm
Source: Gmail -> IFTTT-> Blogger
[FD] ekoparty: Call for Papers 2017! Open!
Source: Gmail -> IFTTT-> Blogger
Anonymous Work Talk for Silicon Valley
from Google Alert - anonymous http://ift.tt/2sROiKW
via IFTTT
Over 14 Million Verizon Customers' Data Exposed On Unprotected AWS Server
from The Hacker News http://ift.tt/2u9AXBR
via IFTTT
Orioles Poll: Machado, Mancini or Schoop? Who's your pick for MVP of the first half? Vote now! (ESPN)
via IFTTT
Ravens Video: Brandon Williams breaks out spirited rendition of the "Carlton" at fan forum in London (ESPN)
via IFTTT
ISS Daily Summary Report – 7/11/2017
from ISS On-Orbit Status Report http://ift.tt/2t3CWHS
via IFTTT
I have a new follower on Twitter
Slim Palmer
Blowin on that O | #Birdland
Baltimore, MD
Following: 1470 - Followers: 1027
July 12, 2017 at 08:15AM via Twitter http://twitter.com/SlimPalmer22
Katyusha Scanner — Telegram-based Fully Automated SQL Injection Tool
from The Hacker News http://ift.tt/2u7ZEhR
via IFTTT
Critical Flaws Found in Windows NTLM Security Protocol – Patch Now
from The Hacker News http://ift.tt/2sOvOLG
via IFTTT
Enabling a flag adds session cookies to anonymous users
from Google Alert - anonymous http://ift.tt/2uczPwF
via IFTTT
[FD] SEC Consult SA-20170712-0 :: Multiple critical vulnerabilities in AGFEO smart home ES 5xx/6xx products
Source: Gmail -> IFTTT-> Blogger
Messier 63: The Sunflower Galaxy
Tuesday, July 11, 2017
Nicotine Anonymous Metting
from Google Alert - anonymous http://ift.tt/2vaKZzt
via IFTTT
OpenLDAP access control without "Service Account" (anonymous bind)
from Google Alert - anonymous http://ift.tt/2tFRvjz
via IFTTT
Style: cleanest way to pipe to an Enum.map with an anonymous function
from Google Alert - anonymous http://ift.tt/2tbLdok
via IFTTT
Salesforce Tab - Filtering Anonymous Web Activity
from Google Alert - anonymous http://ift.tt/2tbGEdJ
via IFTTT
Anonymous woman picks up $400 dinner tab for crew who battled wildfire
from Google Alert - anonymous http://ift.tt/2tbBeja
via IFTTT
Craftaholics Anonymous Diy Mason Jar Pendant Light Tutorial With Regard To Attractive House ...
from Google Alert - anonymous http://ift.tt/2sNeQgl
via IFTTT
Anonymous's Activity
from Google Alert - anonymous http://ift.tt/2tEM5Wk
via IFTTT
Russian Financial Cybercriminal Gets Over 9 Years In U.S. Prison
from The Hacker News http://ift.tt/2tKi6L2
via IFTTT
ISS Daily Summary Report – 7/10/2017
from ISS On-Orbit Status Report http://ift.tt/2ue1Gh3
via IFTTT
Adwind RAT Returns! Cross-Platform Malware Targeting Aerospace Industries
from The Hacker News http://ift.tt/2sZYO79
via IFTTT
Anonymous access only to Questions in Confluence
from Google Alert - anonymous http://ift.tt/2sL1hhJ
via IFTTT
Love Is Everywhere, episode #69 of Beautiful Stories From Anonymous People on Earwolf
from Google Alert - anonymous http://ift.tt/2t93ZwY
via IFTTT
[FD] DefenseCode Security Advisory: IBM Informix DB-Access Buffer Overflow
Source: Gmail -> IFTTT-> Blogger
[FD] CVE-2017-4918: Code Injection in VMware Horizon’s macOS Client
Source: Gmail -> IFTTT-> Blogger
[FD] [CVE-2017-10798] ObjectPlanet Opinio 7.6.3 Cross-Site Scripting (XSS)
Source: Gmail -> IFTTT-> Blogger
Anonymous image bouard
from Google Alert - anonymous http://ift.tt/2udiXaa
via IFTTT
Google Silently Adds 'Panic Detection Mode" to Android 7.1 – How It's Useful
from The Hacker News http://ift.tt/2v6FnWJ
via IFTTT
Samsung SDS
from Google Alert - anonymous http://ift.tt/2uM3vP5
via IFTTT
Star Cluster Omega Centauri in HDR
Monday, July 10, 2017
Instrumentälischer Bettlermantl (Anonymous)
from Google Alert - anonymous http://ift.tt/2tHzFLA
via IFTTT
PHOTO: Anonymous Art Show
from Google Alert - anonymous http://ift.tt/2uKbxrU
via IFTTT
Member Inner class and Anonymous Inner Class
from Google Alert - anonymous http://ift.tt/2tGBng5
via IFTTT
Anonymous donors to match up to $200K for Ann Arbor theater renovations
from Google Alert - anonymous http://ift.tt/2sXHxvn
via IFTTT
"Lol" - Ed Reed tweets response to being ranked No. 4 safety of all time (ESPN)
via IFTTT
Using Tesseract OCR with Python
In last week’s blog post we learned how to install the Tesseract binary for Optical Character Recognition (OCR).
We then applied the Tesseract program to test and evaluate the performance of the OCR engine on a very small set of example images.
As our results demonstrated, Tesseract works best when there is a (very) clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee these types of segmentations. Hence, we tend to train domain-specific image classifiers and detectors.
Nevertheless, it’s important that we understand how to access Tesseract OCR via the Python programming language in the case that we need to apply OCR to our own projects (provided we can obtain the nice, clean segmentations required by Tesseract).
Example projects involving OCR may include building a mobile document scanner that you wish to extract textual information from, or perhaps you’re running a service that scans paper medical records and you’re looking to put the information into a HIPAA-compliant database.
In the remainder of this blog post, we’ll learn how to install the Tesseract OCR + Python “bindings” followed by writing a simple Python script to call these bindings. By the end of the tutorial, you’ll be able to convert text in an image to a Python string data type.
To learn more about using Tesseract and Python together with OCR, just keep reading.
Using Tesseract OCR with Python
This blog post is divided into three parts.
First, we’ll learn how to install the pytesseract package so that we can access Tesseract via the Python programming language.
Next, we’ll develop a simple Python script to load an image, binarize it, and pass it through the Tesseract OCR system.
Finally, we’ll test our OCR pipeline on some example images and review the results.
To download the source code + example images to this blog post, be sure to use the “Downloads” section below.
Installing the Tesseract + Python “bindings”
Let’s begin by getting pytesseract installed. To install pytesseract we’ll take advantage of pip.
If you’re using a virtual environment (which I highly recommend so that you can separate different projects), use the workon command followed by the appropriate virtual environment name. In this case, our virtualenv is named cv.
$ workon cv
Next let’s install Pillow, a more Python-friendly fork of PIL (a dependency), followed by pytesseract.
$ pip install pillow
$ pip install pytesseract
Note: pytesseract does not provide true Python bindings. Rather, it simply provides an interface to the tesseract binary. If you take a look at the project on GitHub you’ll see that the library writes the image to a temporary file on disk, then calls the tesseract binary on the file and captures the resulting output. This is definitely a bit hackish, but it gets the job done for us.
Let’s move forward by reviewing some code that segments the foreground text from the background and then makes use of our freshly installed pytesseract.
Applying OCR with Tesseract and Python
Let’s begin by creating a new file named ocr.py:
# import the necessary packages
from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())
Lines 2-6 handle our imports. The Image class is required so that we can load our input image from disk in PIL format, a requirement when using pytesseract.
Our command line arguments are parsed on Lines 9-14. We have two command line arguments:
- --image: The path to the image we’re sending through the OCR system.
- --preprocess: The preprocessing method. This switch is optional and, for this tutorial, accepts two values: thresh (threshold) or blur.
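If you want to see how these two switches behave without touching the shell, the sketch below parses an explicit argument list instead of sys.argv. The build_parser helper is hypothetical, and the choices restriction is my own assumption added for illustration; the ocr.py script in this post accepts any string for --preprocess.

```python
import argparse


def build_parser():
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True,
        help="path to input image to be OCR'd")
    # assumption: "choices" is added here to reject unknown methods;
    # the script in this post does not restrict the value
    ap.add_argument("-p", "--preprocess", type=str, default="thresh",
        choices=("thresh", "blur"),
        help="type of preprocessing to be done")
    return ap


# parse an explicit argument list rather than the real command line
args = vars(build_parser().parse_args(["--image", "images/example_01.png"]))
print(args["image"], args["preprocess"])
```

Because --preprocess was not supplied, it falls back to its default of thresh, which is why the first example later in this post can omit the switch.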
Next we’ll load the image, binarize it, and write it to disk.
# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
First, we load --image from disk into memory (Line 17) followed by converting it to grayscale (Line 18).
Next, depending on the pre-processing method specified by our command line argument, we will either threshold or blur the image. This is where you would want to add more advanced pre-processing methods (depending on your specific application of OCR) which are beyond the scope of this blog post.
The if statement and body on Lines 22-24 perform a threshold in order to segment the foreground from the background. We do this using both the cv2.THRESH_BINARY and cv2.THRESH_OTSU flags. For details on Otsu’s method, see “Otsu’s Binarization” in the official OpenCV documentation.
We will see later in the results section that this thresholding method can be useful to read dark text that is overlaid upon gray shapes.
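For intuition on what the Otsu flag is doing for us, here is a pure-Python sketch of the idea: try every possible threshold and keep the one that best separates the dark pixels from the light pixels (maximum between-class variance). This is an illustration only, not OpenCV’s implementation, and the otsu_threshold name is hypothetical.

```python
def otsu_threshold(pixels):
    # pixels: flat iterable of integer intensities in [0, 255]
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0

    for t in range(256):
        # background class: intensities <= t
        w_bg += hist[t]
        if w_bg == 0:
            continue

        # foreground class: intensities > t
        w_fg = total - w_bg
        if w_fg == 0:
            break

        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg

        # between-class variance, up to a constant factor
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t

    return best_t


# toy "image": 60 dark pixels (text) and 40 light pixels (background)
pixels = [40] * 60 + [200] * 40
t = otsu_threshold(pixels)

# same rule as cv2.THRESH_BINARY: above the threshold becomes 255, else 0
binary = [255 if p > t else 0 for p in pixels]
```

The nice part of Otsu’s method is that we never have to hand-pick a threshold value, which is why ocr.py can pass 0 as the threshold argument and let the flag work it out.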
Alternatively, a blurring method may be applied. Lines 28-29 perform a median blur when the --preprocess flag is set to blur. Applying a median blur can help reduce salt and pepper noise, again making it easier for Tesseract to correctly OCR the image.
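To see why a median blur is a good match for salt and pepper noise, consider this small pure-Python sketch of a 3x3 median filter. It is an illustration only: the median_blur_3x3 name is hypothetical and, to keep things short, the sketch leaves border pixels untouched, which is not how cv2.medianBlur treats them.

```python
def median_blur_3x3(img):
    # img: list of rows of integer intensities
    # assumption: border pixels are copied through unchanged
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # gather the 3x3 neighborhood and take its middle value
            window = sorted(img[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]

    return out


# a white 5x5 patch with a single black "pepper" pixel in the center
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = median_blur_3x3(noisy)
```

A lone outlier can never be the median of its neighborhood, so the black speck is replaced by the surrounding white while larger structures such as character strokes are mostly preserved.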
After pre-processing the image, we use os.getpid to derive a temporary image filename based on the process ID of our Python script (Line 33).
The final step before using pytesseract for OCR is to write the pre-processed image, gray, to disk, saving it with the filename from above (Line 34).
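The process ID trick works fine for a one-off script, but the name could clash if a single process handles several images or a file with that name already exists in the working directory. As an alternative (not what ocr.py does), here is a sketch using Python’s standard tempfile module; the make_temp_png_path helper is hypothetical.

```python
import os
import tempfile


def make_temp_png_path():
    # create an empty, uniquely named .png file and return its path
    fd, path = tempfile.mkstemp(suffix=".png")
    os.close(fd)
    return path


path = make_temp_png_path()
try:
    # in ocr.py, this is where cv2.imwrite(path, gray) and the
    # OCR call on the saved file would go
    print(os.path.exists(path))
finally:
    # always clean up, even if the OCR step raises an exception
    os.remove(path)
```

The try/finally block is the other small design choice here: it ensures the temporary file is deleted even when something goes wrong in between.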
We can finally apply OCR to our image using the Tesseract Python “bindings”:
# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the output images
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)
Using pytesseract.image_to_string on Line 38 we convert the contents of the image into our desired string, text. Notice that we passed a reference to the temporary image file residing on disk.
This is followed by some cleanup on Line 39 where we delete the temporary file.
Line 40 is where we print text to the terminal. In your own applications, you may wish to do some additional processing here such as spellchecking for OCR errors or Natural Language Processing rather than simply printing it to the console as we’ve done in this tutorial.
Finally, Lines 43 and 44 handle displaying the original image and pre-processed image on the screen in separate windows. The cv2.waitKey(0) on Line 45 indicates that we should wait until a key on the keyboard is pressed before exiting the script.
Let’s see our handiwork in action.
Tesseract OCR and Python results
Now that ocr.py has been created, it’s time to apply Python + Tesseract to perform OCR on some example input images.
In this section we will try OCR’ing three sample images using the following process:
- First, we will run each image through the Tesseract binary as-is.
- Then we will run each image through ocr.py (which performs pre-processing before sending through Tesseract).
- Finally, we will compare the results of both of these methods and note any errors.
Our first example is a “noisy” image. This image contains our desired foreground black text on a background that is partly white and partly scattered with artificially generated circular blobs. The blobs act as “distractors” to our simple algorithm.
Using the Tesseract binary, as we learned last week, we can apply OCR to the raw, unprocessed image:
$ tesseract images/example_01.png stdout
Noisy image to test Tesseract OCR
Tesseract performed well with no errors in this case.
Now let’s confirm that our newly made script, ocr.py, also works:
$ python ocr.py --image images/example_01.png
Noisy image to test Tesseract OCR
As you can see in this screenshot, the thresholded image is very clear and the background has been removed. Our script correctly prints the contents of the image to the console.
Next, let’s test Tesseract and our pre-processing script on an image with “salt and pepper” noise in the background:
We can see the output of the tesseract binary below:
$ tesseract images/example_02.png stdout
Detected 32 diacritics
" Tesséra‘c't Will Fail With Noisy Backgrounds
Unfortunately, Tesseract did not successfully OCR the text in the image.
However, by using the blur pre-processing method in ocr.py we can obtain better results:
$ python ocr.py --image images/example_02.png --preprocess blur
Tesseract Will Fail With Noisy Backgrounds
Success! Our blur pre-processing step enabled Tesseract to correctly OCR and output our desired text.
Finally, let’s try another image, this one with more text:
The above image is a screenshot from the “Prerequisites” section of my book, Practical Python and OpenCV — let’s see how the Tesseract binary handles this image:
$ tesseract images/example_03.png stdout
PREREQUISITES In order In make the rnosi of this, you will need (a have a little bit of pregrarrmung experience. All examples in this book are in the Python programming language. Familiarity with Pyihon or other scriphng languages is suggesied, but mm required. You'll also need (a know some basic mathematics. This book is handson and example driven: leis of examples and lots of code, so even if your math skills are noi up to par. do noi worry! The examples are very damned and heavily documented (a help yuu follaw along.
Followed by testing the image with ocr.py:
$ python ocr.py --image images/example_03.png
PREREQUISITES Lu order to make the most ol this, you will need to have a little bit ol programming experience. All examples in this book are in the Python programming language. Familiarity with Python or other scripting languages is suggested, but not requixed. You’ll also need to know some basic mathematics. This book is handson and example driven: lots of examples and lots ol code, so even ii your math skills are not up to par, do not worry! The examples are very detailed and heavily documented to help you tollow along.
Notice misspellings in both outputs including, but not limited to, “In”, “of”, “required”, “programming”, and “follow”.
The two outputs do not match; interestingly, however, the pre-processed version has only 8 word errors whereas the non-pre-processed image has 17 word errors (over twice as many). Our pre-processing helps even on a clean background!
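If you have the ground-truth text on hand, one way to make this kind of comparison repeatable is a word-level edit distance, sketched below. The word_errors helper is hypothetical, and since the counts above were not necessarily tallied this way, don’t expect it to reproduce them exactly.

```python
def word_errors(reference, hypothesis):
    # minimum number of word substitutions, insertions, and deletions
    # needed to turn the OCR output (hypothesis) into the reference
    ref, hyp = reference.split(), hypothesis.split()

    # prev[j] = distance between the reference words seen so far
    # and the first j words of the hypothesis
    prev = list(range(len(hyp) + 1))

    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur

    return prev[-1]


truth = "to help you follow along"
raw = "(a help yuu follaw along"
print(word_errors(truth, raw))
```

Running the same function over the raw Tesseract output and the ocr.py output gives you two numbers you can compare directly, rather than eyeballing the misspellings.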
Python + Tesseract did a reasonable job here, but once again we have demonstrated the limitations of the library as an off-the-shelf classifier.
We may obtain good or acceptable results with Tesseract for OCR, but the best accuracy will come from training custom character classifiers on specific sets of fonts that appear in actual real-world images.
Don’t let the results of Tesseract OCR discourage you — simply manage your expectations and be realistic on Tesseract’s performance. There is no such thing as a true “off-the-shelf” OCR system that will give you perfect results (there are bound to be some errors).
Note: If your text is rotated, you may wish to do additional pre-processing as is performed in this previous blog post on correcting text skew. Otherwise, if you’re interested in building a mobile document scanner, you now have a reasonably good OCR system to integrate into it.
Summary
In today’s blog post we learned how to apply the Tesseract OCR engine with the Python programming language. This enabled us to apply OCR algorithms from within our Python script.
The biggest downside is with the limitations of Tesseract itself. Tesseract works best when there are extremely clean segmentations of the foreground text from the background.
Furthermore, these segmentations need to be as high resolution (DPI) as possible and the characters in the input image cannot appear “pixelated” after segmentation. If characters do appear pixelated then Tesseract will struggle to correctly recognize the text. We found this out even when applying OCR to images captured under ideal conditions (a PDF screenshot).
OCR, while no longer a new technology, is still an active area of research in the computer vision literature especially when applying OCR to real-world, unconstrained images. Deep learning and Convolutional Neural Networks (CNNs) are certainly enabling us to obtain higher accuracy, but we are still a long way from seeing “near perfect” OCR systems. Furthermore, as OCR has many applications across many domains, some of the best algorithms used for OCR are commercial and require licensing to be used in your own projects.
My primary suggestion to readers when applying OCR to their own projects is to first try Tesseract and if results are undesirable move on to the Google Vision API.
If neither Tesseract nor the Google Vision API obtains reasonable accuracy, you might want to reassess your dataset and decide if it’s worth it to train your own custom character classifier. This is especially true if your dataset is noisy and/or contains very specific fonts you wish to detect and recognize. Examples of specific fonts include the digits on a credit card, the account and routing numbers found at the bottom of checks, or stylized text used in graphic design.
I hope you are enjoying this series of blog posts on Optical Character Recognition (OCR) with Python and OpenCV!
Downloads:
The post Using Tesseract OCR with Python appeared first on PyImageSearch.
from PyImageSearch http://ift.tt/2uIABj6
via IFTTT