From the June Mastering CorelDRAW newsletter


OCR Comes to CorelDRAW 4

Rich Zaleski

With the arrival of version 4, CorelDRAW has made a serious move into the 
page layout arena. With 4's enhancements to Draw's bulk text handling and 
formatting capabilities, it's only natural that the program's link to scanned-in 
images should evolve from a tool that just converts bitmaps to editable vector 
files, to one that can also turn scanned bitmap pictures of text into editable text.
This is done via the new incorporation of Optical Character Recognition (OCR) 
functionality into the CorelTRACE program. Trace's implementation of OCR may 
not be on the level of dedicated OCR programs, but it is functional and has some 
useful features that you might not expect to find in an add-on to what many 
perceive as merely an add-on utility itself. In fact, it does a remarkable job of 
handling this complex task, especially when you consider that many top-of-the-
line, standalone OCR packages sell for more than the entire Draw 4 suite of 
applications.
Users with heavy OCR requirements will still find it advantageous to invest in a 
more robust, dedicated OCR application. But those with an occasional or limited 
need to convert a scanned page of text or an incoming fax into editable text, 
whether for use in Draw or a word processor or just to save as an ASCII text 
file, should find Trace's OCR capabilities adequate.

The OCR Advantage
Uncompressed, a full-page, 1-bit (black-and-white) bitmap in Windows .BMP file 
format will occupy the better part of 500 Kb of precious hard disk space. That 
same file can be stored in compressed TIFF format, which will cut the file size 
down to just over 100 Kb, if the page isn't too tightly packed with text. However, if 
like so many users today you're using disk compression software, much of the 
advantage usually gained in compressing graphics files is lost, because Stacker, 
DoubleSpace or whatever compression scheme is being used can't squeeze the 
file much tighter -- it occupies nearly the original amount of real hard disk space. 
Compare such bulky file sizes to the 2 or 3 Kb that the same page of text will 
occupy when converted to ASCII text format, and the advantage to OCR-ing 
any faxes or scanned-in text files that you need to keep on hand is soon evident. 
And, of course, they become editable at the same time.
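
To put rough numbers on the comparison above, here is a minimal sketch that 
estimates the size of an uncompressed 1-bit .BMP page and sets it against a 
3 Kb text file. The US letter page size and the 200 dpi resolution are my 
assumptions, not figures from the tests described here, and the function name 
is purely illustrative.

```python
# Rough size comparison: uncompressed 1-bit BMP page vs. plain ASCII text.
# Assumptions (not from the article): US letter page, 200 dpi scan.

def bmp_1bit_size(width_in, height_in, dpi):
    """Return the byte size of an uncompressed 1-bit Windows BMP."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each scan line is padded to a multiple of 4 bytes (32 bits).
    row_bytes = ((width_px + 31) // 32) * 4
    # 14-byte file header + 40-byte info header + 2 palette entries of 4 bytes.
    header_bytes = 14 + 40 + 2 * 4
    return header_bytes + row_bytes * height_px

page_bytes = bmp_1bit_size(8.5, 11, 200)
text_bytes = 3 * 1024  # upper end of the "2 or 3 Kb" quoted for a page of text

print(f"bitmap: {page_bytes / 1024:.0f} KB")
print(f"text:   {text_bytes / 1024:.0f} KB")
print(f"ratio:  roughly {page_bytes / text_bytes:.0f} to 1")
```

Under those assumptions the bitmap comes out a little under 500 KB, in line 
with the figure quoted above. A 300 dpi scan would be more than twice as 
large.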
Happily, if you use a scanner, the huge bitmap created when scanning pages of 
text need never be stored on your hard disk. Simply use Trace's TWAIN interface 
to scan in the image directly, by choosing Acquire Image from the File menu, 
then clicking on Acquire. If the image needs cleaning up or deskewing, choose 
Edit Image from the Edit menu to open it in PhotoPAINT via Object Linking and 
Embedding (OLE). Then in Trace select the area of the page that you want to 
convert to text by clicking and dragging a marquee, and click on the OCR icon.

Memory Considerations
You should keep in mind that OCR is a memory-intensive task. For example, a 
full page of text requires over 10 megabytes of memory to process. Even if 
you've got plenty of available RAM, you may find it necessary to either maintain 
a very large permanent swap file, avoid using Trace's OCR function while other 
tasks run in the background, or both. I've choked Trace with a full page of small 
type, on a 16 Mb system using a 4 Mb swap file. In this case, shutting down 
other applications allowed the job to proceed to completion. If you're relying on a 
swap file to provide the needed memory, you have to be willing to accept the 
performance degradation that comes with virtual memory usage. (Adjust the size 
of your swap file by double-clicking on the 386 Enhanced icon in the Windows 
Control Panel.)
Since you may not be able to keep other memory-intensive apps running while 
you perform an OCR operation, one solution is to set up all the bitmaps on 
which you need to perform the recognition as a batch trace. Then start the 
batch process just before leaving the office for the day, when no other apps 
will be running. In any case, you should click on Modify in the Settings menu, 
then click on Batch Output, since it's here that you set the default output 
directory and the file overwrite/make read only options for all of Trace's 
output.
Trace provides some controls to work with scanned text files of varying quality. 
Choose OCR Method by clicking on Modify in the Settings menu. The default is 
designed for 300 dpi bitmaps scanned from hard copy of at least laser printer 
quality. Settings for dot matrix and fine-quality faxes (200 by 100 dpi) can also be 
selected. These settings are sticky, and will remain active until you change them 
or select Default from the main Settings menu. How much of a difference do 
these settings make? On a one-page test file generated via fax, tracing it in the 
Normal, rather than Fax, mode produced a text file with 42 errors. With the OCR 
method set to Fax, the same file converted with only a single error.

A Few Rough Spots
You'll also notice an option for Check Spelling in this dialog box. In my tests, I 
found this option to be virtually useless. When Draw, or your word processor, 
checks spelling and comes across a combination of letters that it doesn't 
recognize, it offers you the choice of accepting or correcting the spelling error. 
Trace, however, simply ignores the word and doesn't trace it. I'd rather have the 
output file say "The spell chec~er needs some improvement," than leave the 
word out entirely and give me "The spell needs some improvement." At least in 
the former case the spell checker in my word processor will have something to 
catch. 
This situation is aggravated by the fact that (as far as I've been able to tell) 
Trace's use of the spell checker does not incorporate any user dictionary that 
you might have created. Proper names and specialized terms simply get 
dropped, rather than being flagged by having the rejected letters converted and 
marked with a ~ or some other uncommon character. All in all, I'd strongly 
recommend that you give Trace's Check Spelling option a miss.
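
The difference between the two behaviors can be shown with a small sketch. 
This is a hypothetical illustration only, not how Trace is implemented: the 
tiny dictionary, the function name and the "drop" and "keep" policy labels 
are all invented for the example.

```python
# Two ways an OCR pass could treat a word its dictionary rejects.
# Hypothetical sketch; this is not Trace's actual implementation.

DICTIONARY = {"the", "spell", "checker", "needs", "some", "improvement"}

def filter_words(words, policy):
    """Apply the 'drop' or 'keep' policy to words missing from DICTIONARY."""
    kept = []
    for word in words:
        known = word.strip(".,").lower() in DICTIONARY
        if known or policy == "keep":
            kept.append(word)
        # Under "drop", an unrecognized word is silently discarded.
    return " ".join(kept)

# "chec~er" stands for a word with one character the OCR could not read.
recognized = ["The", "spell", "chec~er", "needs", "some", "improvement."]

print(filter_words(recognized, "drop"))
print(filter_words(recognized, "keep"))
```

With "drop" the damaged word vanishes and nothing downstream can tell it was 
ever there. With "keep" it survives, tilde and all, for a word processor's 
spell checker to flag.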
Another place where the OCR function could stand some improvement is text 
formatting. In short, it doesn't do any. It's not bad with straight paragraphs of 
text, but with columnar data or anything out of the ordinary it just treats each 
string of text as a line followed by a return and linespace. In the end, despite the 
unexpected accuracy of the character recognition, you're still likely to face some 
meaningful editing and reformatting time. Perhaps by the time 5.0 rolls around, 
we'll at least see Rich Text Format (RTF) output with some semblance of 
maintaining the format of the original image. As long as we're wishing, limited 
font identification might be within reach as well.

The Forms Approach
Having stumbled across the weakest feature of Trace's OCR function, it's time to 
look at what may be its strongest capability, and is certainly its most intriguing. In 
addition to the standard OCR operation of converting to an ASCII text file, you 
have the option of using the Forms tracing method. This routine first examines 
the bitmap and traces any non-text elements as a graphic in outline and/or 
centerline method, as appropriate. It then OCRs the text, but rather than saving 
it as ASCII, it inserts it into the usual .EPS output file created by Trace as strings 
of Artistic text laid out in the positions appropriate to the image that was traced, 
but in the default font. It seems to want to use a sans serif font by default, since 
depending on which fonts are in the Ares FontMinder Font Packs I have loaded, 
it will be either 12.5-point Avant Garde or Arial. While it's not as fast as straight 
OCR tracing, this feature is particularly handy when tracing logos with 
accompanying text, letterheads, maps and technical illustrations. 
In the accompanying illustrations, I faxed myself a blank invoice and used 
Trace's Forms method to convert it to .EPS. The first trace took just over three 
minutes on my 33 MHz 486 with 16 MB of memory, and it never required disk-
based virtual memory. Since Trace does not treat white text on a black 
background as text, I then saved the .EPS file, cleared the .EPS window (press 
Delete), inverted the image (choose Modify, then Image Filtering from the 
Settings menu), and marquee selected the areas containing that text. After 
running the Forms trace on these leftover text strings, I saved the second .EPS 
file under a different name.
I then imported both .EPS files into Draw and placed them side by side.  After 
ungrouping the .EPS file created with the second scan, I changed the fonts as 
necessary and applied a white fill to them before turning my attention to the 
other copy. I deleted the curves that represented the white text, then used the 
Node Edit roll-up's Auto-Reduce function on the larger and more complex curves 
that made up the form. I changed all the curve segments in the table part of the form 
to lines, and performed minor cleaning up and aligning by snapping the corners 
to the grid. 
Finally, I dragged the white text that remained from the second trace on top of 
the form. Total time from loading the .PCX scan of the form into Trace to printing 
out virtual duplicates of the original from Draw was just over half an hour. Could I 
have drawn and lettered the form from scratch more quickly in Draw? I doubt it.

Is it for You?
If you have heavy-duty text conversion needs, you might not ever use Trace's 
OCR capabilities, except for perhaps the occasional need to generate a text-
inclusive .EPS trace of a mixed text and graphic bitmap. But then again, if your 
OCR needs are that intensive, you didn't buy CorelDRAW to fill them. That's why 
Caere and Calera are in business. But for most graphics professionals, who 
don't deal in lengthy text documents, Trace's OCR capabilities should fill the bill 
reasonably well.
Those of you interested in trying out Trace's OCR capabilities for yourselves can 
use the INV001.PCX file that was placed in the INVOICE directory of this 
month's disk when you installed it. This is the scan of the form I discussed in the 
article.


TIP
You can also continue an OCR session that halted due to insufficient memory by 
closing the warning dialog box, selecting a smaller area to process, and then 
doing the page in two passes.

Contents Copyright Kazak Communications 1993


Subscription Information

While the regular subscription rate is $75 per year (in US dollars for Americans, 
Canadian dollars for Canadians), charter subscriptions to the Mastering 
CorelDRAW newsletter are available for a limited time at $60 (add $30 U.S. for 
overseas). A free sample disk, from our exclusive disk-of-the-month service 
(value $20), is included with your paid subscription. 

To subscribe, or for more information, contact:

Chris Dickman
16 Ottawa St.
Toronto, ON M4T 2B6
Canada
416-924-0759 (voice)
416-924-4875 (fax)
CServe: 70730,2265



                                                 - 30 -
