{"id":616,"date":"2023-02-16T17:53:07","date_gmt":"2023-02-16T06:53:07","guid":{"rendered":"https:\/\/www.samontab.com\/web\/?p=616"},"modified":"2023-02-16T17:53:08","modified_gmt":"2023-02-16T06:53:08","slug":"how-to-install-the-latest-version-of-the-open-source-ocr-tesseract-in-ubuntu-22-04-lts","status":"publish","type":"post","link":"https:\/\/www.samontab.com\/web\/2023\/02\/how-to-install-the-latest-version-of-the-open-source-ocr-tesseract-in-ubuntu-22-04-lts\/","title":{"rendered":"How to install the latest version of the open source OCR tesseract in Ubuntu 22.04 LTS"},"content":{"rendered":"\n<p>If you install tesseract from the Ubuntu 22.04 LTS repositories, like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nsudo apt-get install tesseract-ocr\n<\/pre><\/div>\n\n\n<p>You&#8217;ll end up with tesseract v4.1.1. Since tesseract v5.3.0 is out already, we&#8217;re going to install that version instead. So, if you already installed it from the repositories, make sure to first uninstall it:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nsudo apt-get remove tesseract-ocr\n<\/pre><\/div>\n\n\n<p>Now we&#8217;re going to install it. 
First, let&#8217;s make sure you have libraries for reading different types of image files:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nsudo apt-get install libpng-dev libjpeg-dev libtiff-dev libgif-dev libwebp-dev libopenjp2-7-dev zlib1g-dev\n<\/pre><\/div>\n\n\n<p>Now, let&#8217;s get the latest version of leptonica (v1.83.1), an image processing library used by tesseract:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\ncd ~\/Desktop\nwget https:\/\/github.com\/DanBloomberg\/leptonica\/releases\/download\/1.83.1\/leptonica-1.83.1.tar.gz\ntar -xzvf leptonica-1.83.1.tar.gz\ncd leptonica-1.83.1\nmkdir build\ncd build\ncmake ..\nmake -j`nproc`\nsudo make install\n<\/pre><\/div>\n\n\n<p>Now we&#8217;re going to grab the tesseract source code and compile it:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\ncd ~\/Desktop\nwget https:\/\/github.com\/tesseract-ocr\/tesseract\/archive\/refs\/tags\/5.3.0.tar.gz\ntar -xzvf 5.3.0.tar.gz\ncd tesseract-5.3.0\/\nmkdir build\ncd build\ncmake ..\nmake -j`nproc`\nsudo make install\n<\/pre><\/div>\n\n\n<p>Now we need to tell the system where the <strong>tessdata<\/strong> folder is. Open your <strong>~\/.bashrc<\/strong> file like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nnano ~\/.bashrc\n<\/pre><\/div>\n\n\n<p>And simply add the following line at the end of the file:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nexport TESSDATA_PREFIX=\/usr\/local\/share\/tessdata\n<\/pre><\/div>\n\n\n<p>Now save the file (Ctrl-O) and exit (Ctrl-X). 
Now run this to activate the setting in your current shell:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nsource ~\/.bashrc\n<\/pre><\/div>\n\n\n<p>We now need to grab some language models and other data files and put them in that folder. Note that we&#8217;re going to get the most accurate version of the English model, which is based on the relatively new (since v4) LSTM neural network engine. You can read more about these files <a href=\"https:\/\/github.com\/tesseract-ocr\/tessdoc\/blob\/main\/Data-Files.md\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. Let&#8217;s get them:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nwget https:\/\/raw.githubusercontent.com\/tesseract-ocr\/tessdata_best\/main\/eng.traineddata\nwget https:\/\/github.com\/tesseract-ocr\/tessdata\/raw\/3.04.00\/osd.traineddata\nwget https:\/\/raw.githubusercontent.com\/tesseract-ocr\/tessdata\/3.04.00\/equ.traineddata\nsudo mv *.traineddata \/usr\/local\/share\/tessdata\n<\/pre><\/div>\n\n\n<p>And now we should be able to use tesseract from anywhere. Open a new console and test that it&#8217;s all working properly:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\ntesseract --version\n<\/pre><\/div>\n\n\n<p>The output should include: <strong>tesseract 5.3.0<\/strong>, <strong>leptonica-1.83.1<\/strong>.<\/p>\n\n\n\n<p>Now, let&#8217;s actually use it. In general, you&#8217;ll need to preprocess your images beforehand. For example, here&#8217;s how you can align the images with <a rel=\"noreferrer noopener\" href=\"https:\/\/www.samontab.com\/web\/2020\/11\/align-text-images-with-opencv-using-python\/\" target=\"_blank\">Python<\/a> or <a rel=\"noreferrer noopener\" href=\"https:\/\/www.samontab.com\/web\/2020\/11\/align-text-images-with-opencv\/\" target=\"_blank\">C++<\/a>. 
Once you have aligned the text correctly, you should have an image like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eeeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeee;\" loading=\"lazy\" decoding=\"async\" width=\"717\" height=\"1024\" src=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1-717x1024.png\" alt=\"\" class=\"wp-image-617 has-transparency\" srcset=\"https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1-717x1024.png 717w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1-210x300.png 210w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1-768x1097.png 768w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1-1075x1536.png 1075w, https:\/\/www.samontab.com\/web\/wp-content\/uploads\/2023\/02\/aligned-1.png 1140w\" sizes=\"auto, (max-width: 717px) 100vw, 717px\" \/><\/figure>\n\n\n\n<p>Now you can simply call tesseract like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\ntesseract ~\/Desktop\/image.png -\n<\/pre><\/div>\n\n\n<pre class=\"wp-block-code\"><code>458 ADDITIONAL EXAMPLES:\n\nof the beam was brought over the prop, it required the weight of\n2 man, which was 200 \/0. at the less end to keep it in equilibrios\nHence the weight is required ?\n\nAns. 3000 1b.\n\n100. The weight of a ladder 20 feet long is 70 \u00a2b. and its cen=\ntre of gravity 11 feet from the less end; now what weight will a\nman sustain in raising this ladder when he pushes directly against\nit at the distance of 7 fect from the greater end, and his hands are\n5 feet above the ground?\n\nAns. 63 1b. nearly.\n\n101. If the quantity of matter in the moon, be to that of the\nearth, as 1 to 39, and the distance of their centres 240000 miles ;\nwhere is their common centre of gravity ?\n\nAns. 
6000 miles from the earth\u2019s centre.\n\n102. Supposing the data as in the last question, to find the\ndistance from the moon in the line joining the centres, where a\nbody would be equally attracted by the carth and moon; the\nforce of attraction in bodies being directly as the quantities of\nmatter, and inversely as the squares of the distances from the\ncentres.\n\n240000 .\nAns. \u2014\u2014\u2014\u2014 = 331264 miles, nearly.\n9 y\n\n103. If two fires, one giving 2 times the heat of the other, are\n6 yards asunder; where must I stand directly between them to\nbe heated on both sides alike; the heat being inversely as the\nsquare of the distance?\n\nAns. 2 yards from the less fire, or 4 from the greater.\n104. To what height above the carth\u2019s surface should a body\nbe carricd to lose 5 of its weight; the ecarth\u2019s radius being\n\n3970 miles, and the force of gravity inversely as the square of\nthe distance from its centre?\n\nAns. 214} miles.\n\n<\/code><\/pre>\n\n\n\n<p>If you want to save the output text to a file, simply specify a filename and it will create a .txt file. In this example it will create a file in your working directory, named <strong>image_ocr.txt<\/strong>:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\ntesseract ~\/Desktop\/image.png image_ocr\n<\/pre><\/div>\n\n\n<p>As you can see, it works fairly well for most of the text. As long as you give a reasonably clear input image, tesseract will be able to generate the correct text from it. 
You can read more about how to improve the quality of the output <a rel=\"noreferrer noopener\" href=\"https:\/\/tesseract-ocr.github.io\/tessdoc\/ImproveQuality.html\" target=\"_blank\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you install tesseract from the Ubuntu 22.04 LTS repositories, like this: You&#8217;ll end up with tesseract v4.1.1. Since tesseract v5.3.0 is out already, we&#8217;re going to install that version instead. So, if you already installed it from the repositories, make sure to first uninstall it: Now we&#8217;re going to install it. First, let&#8217;s make [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"0","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[29,21,75],"tags":[],"class_list":["post-616","post","type-post","status-publish","format-standard","hentry","category-computer-vision","category-open-source","category-ubuntu"],"_links":{"self":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts\/616","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/comments?post=616"}],"version-history":[{"count":0,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/posts\/616\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/media?parent=616"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/categories?post=616"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.samontab.com\/web\/wp-json\/wp\/v2\/tags?post=616"}],"c
uries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}