<h1>My Journey at Google Summer of Code 2022</h1>
<p>This blog contains descriptions of my projects: internship projects, open-source contributions, and personal projects. These projects express my competences, skills, and interests.</p>
{"email"=>"quochungtran1999@gmail.com"}quochungtran1999@gmail.com<Weeks 10,11,12> Code refactoring and demo application2022-08-30T14:41:53+00:002022-08-30T14:41:53+00:00https://quochungtran.github.io/junk/2022/08/30/lasteweeks<h1 id="resuming">Resuming</h1>
<p>In the final weeks of Google Summer of Code 2022, I spent my time polishing the code and writing documentation for all parts of it, in preparation for the final submission and integration into the master branch.</p>
<p>As discussed with my mentor throughout these weeks, these tasks mainly focused on:</p>
<ul>
<li>Adding a license header at the top of each source file.</li>
<li>Dropping extra spaces using the bash script: https://invent.kde.org/graphics/digikam/-/blob/master/project/scripts/dropextraspaces.sh.</li>
<li>Running <strong>cppcheck</strong>, a static analysis tool for C/C++ code that detects bugs and focuses on undefined behavior and dangerous constructs.</li>
<li>Reducing the text size of each item in the QComboBox, adding context notes for translators to limit translation sizes, and using a tooltip to host the long string description of each item.</li>
<li>Limiting digiKam and Qt headers to export the minimum dependencies outside digiKam.</li>
</ul>
<p>To summarize, I would like to demo the functionality of the digiKam OCR tool that I have implemented:</p>
<ol>
<li>
<p>The user can run OCR on multiple document images via an items list; if the list is empty, a pop-up warning appears.</p>
</li>
<li>
<p>There are four settings that users can configure, based on four basic Tesseract options.</p>
</li>
<li>
<p>When the user clicks the “Start OCR” button, the batch process begins; the progress bar tracks it to 100%, and the detailed results can be displayed in the text editor by double-clicking an item.</p>
</li>
<li>
<p>Double-clicking an item in the list allows users to review the recognized text.</p>
</li>
<li>
<p>With the support of the spell-checking engine, users can adjust the text and store it in separate text files or in XMP metadata by clicking the “Save” button.</p>
</li>
<li>
<p>The text stored in XMP can be translated into another language and stored as an additional language entry.</p>
</li>
</ol>
<p>You can view the demo of the plugin here :</p>
<p><a href="https://drive.google.com/file/d/1wyiHLaJbHDna1QLUZ5wS6vrrKwiyPkts/view?usp=sharing">demo</a></p>
<h3 id="main-commits">Main commits</h3>
<ul>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=0e1d1958ac9d85eb73ce5b07ab8272e52d1d40f9">0e1d1958</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=3d78df68a83dd4e618fd8cc90e92d4e92fdb43d6">3d78df68</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=91db8752a9c12a49f92766394ba47242c6fad24a">91db8752</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=49469a9296ee7932a971ded55b3f5f2dcd083545">49469a92</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=69aa5d07ed3fe7fad925ef28b0e5213217be8621">69aa5d07</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/commit/cd1ace90ea27beab6aad8d02cd1f4d879eac14bd">cd1ace90</a></li>
</ul>
<h1 id="improvement">Improvement</h1>
<p>As mentioned in weeks 1 and 2, the accuracy of the OCR still needs to be enhanced, so a dialog gathering the pre-processing methods is necessary. This feature will be helpful in the future.</p>
<h1>&lt;Weeks 9&gt; Storing OCR result</h1>
<p>This week, I implemented two ways to store the output text.</p>
<p>The first way is to record the output text and save it as a text file. I implemented the method <code class="language-plaintext highlighter-rouge">saveTextFile(const QString&amp; filePath, const QString&amp; text)</code> in the OCR Tesseract engine; this function uses QTextStream’s streaming operators, so we can conveniently write and update text in the file. The text file is saved in the same location as the image URL.</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week9/txt_save.png?raw=true" alt="figure1" /></p>
<p>The second way is to save the text in XMP. I implemented the method <code class="language-plaintext highlighter-rouge">saveXMP(const QString&amp; filePath, const QString&amp; text)</code> for this task. XMP uses a structured container to host this kind of metadata. An alternative-language string is an entry in the XMP tree (which is based on XML) that acts as an additional property of a title or caption tag. A new language version of a label is appended to the XML; there is no limit on size or character encoding. XMP is hosted in the image in a dedicated chunk, outside the image data/properties.</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week9/xmp.png?raw=true" alt="figure1" /></p>
<p>This is precisely where digiKam stores the comment entered by the user in the right sidebar named Captions &amp; Tags in the icon view. The advantage of this location is the capability to set more than one text in different languages when a translation is available. In all cases, the default language must be x-default; variants are hosted as de-DE for German, fr-FR for French, it-IT for Italian, etc. Furthermore, translating the recognized text and storing it in a new language is an attractive feature too. My mentor prepared the class <a href="https://github.com/crow-translate/QOnlineTranslator">QOnlineTranslator</a> to be included in the digiKam core. Here is an example of translating recognized text from English into French:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week9/translate_ocr.png?raw=true" alt="figure1" /></p>
<p>These two methods are also encapsulated in <code class="language-plaintext highlighter-rouge">slotUpdateText()</code>, a slot that is called in response to the “clicked” signal of the “Save” button, after the text has been post-processed in the text editor.</p>
<div style="page-break-after: always;"></div>
<h3 id="main-commits">Main commits</h3>
<ul>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=19d64c645bc8800857d20802ea990799811782a8">19d64c64</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=9def2afaf8c04018280a6b94e829f381b217cc51">9def2afa</a></li>
</ul>["TRAN Quoc Hung"]This week, I implemented two ways to store the output text.<Weeks 7,8> OCR batch processing based on internal-multi threading2022-08-08T14:41:53+00:002022-08-08T14:41:53+00:00https://quochungtran.github.io/junk/2022/08/08/weed7-8<h1 id="seventh-week"><strong>Seventh week</strong></h1>
<p>In the seventh week, I implemented the OcrTesseract engine, an internal object that takes a scanned document image as input, reads the optional Tesseract values from the dialog, and produces the OCR output text and the text file.</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week7-8%20%20/ocrEngine.png?raw=true" alt="figure1" /></p>
<p>The detailed architecture of this object is shown in the following :</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week7-8%20%20/ocrEngineUML.png?raw=true" alt="figure1" /></p>
<p>It implements a function <code class="language-plaintext highlighter-rouge">runOcrProcess()</code> that uses the Tesseract CLI, a <strong>command line program that accepts text input to execute operating system functions</strong>. This function returns an enum describing the state of the process, and the results are saved in member variables as well:</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>PROCESS_COMPLETE</td>
<td>All stages done.</td>
</tr>
<tr>
<td>PROCESS_FAILED</td>
<td>A failure happened while processing.</td>
</tr>
<tr>
<td>PROCESS_CANCELED</td>
<td>User has canceled processing.</td>
</tr>
</tbody>
</table>
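<p>To make the contract concrete, here is a hedged sketch of how such a function can drive the Tesseract CLI with QProcess. The command-line flags (<code class="language-plaintext highlighter-rouge">-l</code>, <code class="language-plaintext highlighter-rouge">--psm</code>, <code class="language-plaintext highlighter-rouge">--oem</code>, <code class="language-plaintext highlighter-rouge">--dpi</code>) are real Tesseract options, but the surrounding structure is an assumption, not the engine’s actual code:</p>
<pre><code class="language-cpp">
#include <QProcess>
#include <QString>
#include <QStringList>

enum OcrProcessState
{
    PROCESS_COMPLETE,
    PROCESS_FAILED,
    PROCESS_CANCELED
};

// Hypothetical sketch of runOcrProcess(): run "tesseract <image> stdout ..."
// and capture the recognized text from standard output.
OcrProcessState runOcrProcess(const QString& inputImage,
                              const QString& language,
                              int psm, int oem, int dpi,
                              QString& recognizedText)
{
    QStringList args;
    args << inputImage
         << QLatin1String("stdout")                          // print text to stdout
         << QLatin1String("-l")    << language
         << QLatin1String("--psm") << QString::number(psm)
         << QLatin1String("--oem") << QString::number(oem)
         << QLatin1String("--dpi") << QString::number(dpi);

    QProcess process;
    process.start(QLatin1String("tesseract"), args);

    if (!process.waitForFinished(-1) || (process.exitCode() != 0))
    {
        return PROCESS_FAILED;
    }

    recognizedText = QString::fromUtf8(process.readAllStandardOutput());

    return PROCESS_COMPLETE;
}
</code></pre>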
<h1 id="eighth-week"><strong>Eighth week</strong></h1>
<p>This week, I started implementing the backend part: an internal multi-thread to manage the batch processing, and a data object delivered from the backend to the frontend.</p>
<h2 id="text-converter-thread-management"><strong>Text Converter Thread Management</strong></h2>
<p>This plugin uses internal multi-threading for OCR processing of images. The classes in this part are implemented on top of existing objects used to manage and chain threads in digiKam. The idea is inspired by <a href="https://doc.qt.io/qt-6/qthreadpool.html"><strong>QThreadPool</strong></a> and <a href="https://doc.qt.io/qt-6/qrunnable.html"><strong>QRunnable</strong></a>: existing threads can be reused for new tasks, and <strong>QThreadPool</strong> is a collection of reusable <strong>QThreads</strong>.</p>
<p>The <strong>TextConverterActionThread</strong> part manages the functioning of <strong>TextConverterTask</strong>; concretely, it manages the instantiation of TextConverterTask objects.</p>
<p>Each TextConverterTask initializes one <strong>OcrTesseractEngine</strong> object to manage one image URL.</p>
<p>The purpose is to allow the OCR processes for individual images to run in parallel and stop properly. The <strong>run()</strong> method of TextConverterTask is a virtual method that is reimplemented to facilitate advanced thread management (a sketch follows the architecture diagram below). Here is the architecture of this part:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week7-8%20%20/thread_UML.png?raw=true" alt="figure1" /></p>
<p>Here is a sequence diagram representing the communication between the GUI and the backend interface.</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week7-8%20%20/diagram_sequence.png?raw=true" alt="figure1" /></p>
<p>When the user clicks the “Start OCR” button, the <strong>TextConverterThread</strong> object instantiates <strong>TextConverterTask</strong> objects and passes the Tesseract options from the dialog to them. Upon receiving the “clicked” signal from the dialog button, each <strong>TextConverterTask</strong> creates an OCR engine to handle the image’s URL. When the process is finished, all the necessary outputs are set on the widget list of pictures and in the text editor.</p>
<p>The most important part is how to deliver the output and set it up on the dialog interface. For this, I implemented a class <strong>TextConverterActionData</strong> containing the status of a process, the destination path of the output file, and the recognized text extracted from the image. Output data is transferred to the dialog through two signals:</p>
<p><em>signalStarting(TextConverterActionData), signalFinished(TextConverterActionData)</em></p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week7-8%20%20/connect.png?raw=true" alt="figure1" /></p>
<p>In the next few weeks, I will:</p>
<ul>
<li>Implement the functionalities of storing OCR result.</li>
<li>Polish and re-implement code if necessary.</li>
</ul>
<h1 id="main-commits">Main commits</h1>
<ul>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=1fcc41e2219def0cb5f052ce999cd9ebe7df39dc">1fcc41e2</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=4b521c2244af8436bebda0b1db7d99b3b70a8b67">4b521c22</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=63fe2ee3cf51c5d5132e4cc2139cdd563b5f579c">63fe2ee3</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=82a373a7dc0700270049f5a4ea5ad8aa1d2195af">82a373a7</a></li>
<li><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=fe9f129d05114e38cf365216fb79da04c55c2072">fe9f129d</a></li>
</ul>["TRAN Quoc Hung"]Seventh week<Weeks 3,4> Preprocessing for improving the quality of the output2022-07-25T14:41:53+00:002022-07-25T14:41:53+00:00https://quochungtran.github.io/junk/2022/07/25/week3-4<p>In the following weeks (June 27,2022 to July 10, 2022), for each test case, I will try to :</p>
<ul>
<li>Research algorithms for pre-processing image.</li>
<li>Gather many test cases for improving pre-processing methods.</li>
</ul>
<p>Based on the results of previous research, I realized that Tesseract does various processing internally before doing the actual OCR. However, some instances still exist where Tesseract suffers a significant reduction in accuracy.</p>
<p>So in this post, I would like to introduce some pre-processing methods that can be applied directly before passing the image to Tesseract.</p>
<h1 id="removing-shadow"><strong>Removing Shadow</strong></h1>
<ul>
<li>
<p>The main idea is to extract and segment the text from each channel by using background subtraction (BS) on the grayscale image, a common and widely used technique for generating a foreground mask.</p>
</li>
<li>
<p>BS calculates the foreground mask by performing a subtraction between the current frame and a background model containing the static part of the scene or, more generally, everything that can be considered background given the characteristics of the observed scene.</p>
</li>
<li>
<p>The purpose is to remove the shadow or noise from the background of each test case (here, the background of a scanned image can be influenced by many external factors), converting the background to white or black, and then to merge the channels back into one image.</p>
</li>
</ul>
<p>To illustrate, here is an example where we can see that the Tesseract output is affected by the lighting.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/data/shadow_page_book.png?raw=true" alt="figure1" /></p>
<p>Looking at the three picture channels, the shadow is most prominent in the blue channel.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_1.png?raw=true" alt="figure1" /></p>
<h2 id="implementation">Implementation</h2>
<h3 id="first-step">First step</h3>
<p>Extract the background by blurring out the text. Morphological operations are helpful here.</p>
<ul>
<li>
<p>Grayscale dilation of an image assigns to each pixel the maximum value found over the neighborhood of the structuring element, using a square kernel with odd dimensions (e.g., a 3 x 3 kernel).</p>
</li>
<li>
<p>The dilated value of a pixel x is thus the maximum value of the image in the defined neighborhood.</p>
</li>
</ul>
<p>Dilation is a type of morphological operation, i.e., one of the set of operations that process images according to their shapes. As the kernel B is scanned over the image, we compute the maximal pixel value overlapped by B and replace the image pixel at the anchor point with that maximal value. As you can deduce, with a white background and black text, this maximizing operation causes dark regions within the image to shrink. The bigger the kernel, the blurrier the text becomes; the text pixels are treated as the noise we need to remove.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_2.png?raw=true" alt="figure1" /></p>
<h3 id="second-step">Second step</h3>
<p>Median blur helps smooth the image and remove the noise produced by the previous operation, giving a more general estimate of each background.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_3.png?raw=true" alt="figure1" /></p>
<h3 id="third-step">Third step</h3>
<p>cv2.absdiff finds the absolute difference between the pixels of two image arrays. Using it, we can extract just the pixels belonging to the text; put another way, the background becomes white and the shadow is entirely removed.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_4.png?raw=true" alt="figure1" /></p>
<p>And finally, merging all channels back into one image, we get the following result. We can see that the noise in the background is removed, and we obtain a better OCR output.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/results_remove_shadow.png?raw=true" alt="figure1" /></p>
<p>Here are more example test cases that I have tried; in almost all of them we get better results:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_5.png?raw=true" alt="figure1" /></p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_6.png?raw=true" alt="figure1" /></p>
<h1 id="perspective-correction"><strong>Perspective correction</strong></h1>
<p>This issue is specific to some image acquisition devices (digital cameras or mobile devices). As a result, the acquired area is not a rectangle but a trapezoid or a parallelogram. Once the image transformation is applied, the corrected font size looks almost uniform and gives a better OCR result.</p>
<p>These resources were really helpful in explaining this topic:</p>
<p><a href="https://pyimagesearch.com/2014/08/25/4-point-opencv-getperspective-transform-example/">Post 1</a></p>
<p><a href="https://pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/">Post 2</a></p>
<h2 id="implementation-1">Implementation</h2>
<p>Building a document scanner with OpenCV can be executed in the following steps:</p>
<ul>
<li>Step 1: Detect edges.</li>
<li>Step 2: Use the edges in the images to find the contour (outline) representing the piece of paper being scanned.</li>
<li>Step 3: Apply a perspective transform to obtain the top-down view of the document.</li>
</ul>
<p>Take a look at an example.</p>
<p>First, to speed up the image processing and make our edge detection more accurate, we compute the ratio of the image height to 500 pixels and resize the image accordingly.</p>
<p>Convert the image from colored to grayscale, use <strong>Gaussian blurring</strong> to remove high-frequency noise, and then apply a <strong>bilateral Filter</strong>, which is highly effective in noise removal while keeping edges sharp. Then I used one <strong>Median blur</strong> operation similar to the other averaging methods. Here, the central element of the image is replaced by the median of all the pixels in the kernel area. This operation processes the edges while removing the noise. All these methods are helpful for edge detection and improving accuracy. See the following steps of this process:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_7.png?raw=true" alt="figure1" /></p>
<p>Second, the piece of paper we scan usually takes the shape of a rectangle, so we know we are looking for a rectangle with four points and four edges. Therefore, we assume that the most prominent contour in the resized image with exactly four points is our piece of paper; in other words, the most prominent edges have a higher probability of belonging to the document we are scanning.</p>
<p>The algorithm to find the largest contours :</p>
<ul>
<li>
<p>Use the <code class="language-plaintext highlighter-rouge">cv.findContours()</code> function, which finds contours in a binary image. The result is a list of all the contours in the image, each in the form of the (x, y) coordinates of the boundary points of an object; we then sort the contours by area and keep only the largest ones. This allows us to examine only the largest contours, discarding the rest.</p>
</li>
<li>
<p>We then loop over the contours and use the <code class="language-plaintext highlighter-rouge">cv2.approxPolyDP</code> function to smooth and approximate each quadrilateral. <code class="language-plaintext highlighter-rouge">cv2.approxPolyDP</code> works well for shapes with sharp edges, like a document boundary.</p>
</li>
<li>
<p>Having four points, we determine the width and height and the top-left corner of each rectangle; the largest rectangle should be our document.</p>
</li>
</ul>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_8.png?raw=true" alt="figure1" /></p>
<ul>
<li>Apply the four-point transformation algorithm; you can read more <a href="https://pyimagesearch.com/2014/08/25/4-point-opencv-getperspective-transform-example/">here</a>. This gives us a top-down, “bird’s-eye view” of the document: it computes the perspective transform matrix using the getPerspectiveTransform function and applies it to obtain the top-down view by calling <code class="language-plaintext highlighter-rouge">cv2.warpPerspective</code> (see the sketch after the result image below).
Here is the good result:</li>
</ul>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/bl2_9.png?raw=true" alt="figure1" /></p>
<p>Now we take a look at some test cases :</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/perspec_res1.png?raw=true" alt="figure1" /></p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/perspec_res2.png?raw=true" alt="figure1" /></p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/perspec_res3.png?raw=true" alt="figure1" /></p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/perspec_res4.png?raw=true" alt="figure1" /></p>
<p>The method works quite well, but in some cases the algorithm does not behave well because the contour of the piece of paper is quite blurred compared to the background color; the user should therefore take a photo in which the edges are apparent. For example:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/perspec_bad_res.png?raw=true" alt="figure1" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>These main ideas may be implemented in the plugin as an OCR pre-processing dialog in the future.</p>
<h1>&lt;Weeks 5,6&gt; Plugins Interface components</h1>
<p>In the following weeks (July 11, 2022 to July 25, 2022), I will try to:</p>
<ul>
<li>Make decisions to choose the architecture of the plugin.</li>
<li>Design UML for each component of the plugin.</li>
<li>Write documentation.</li>
</ul>
<p><strong>Frontend</strong>:</p>
<p>The idea of the OCR processing plugin in digiKam is inspired by another plugin used for the conversion from RAW to DNG. The GUI for the batch processing of the plugin is a dialog that consists of:</p>
<p>The widget list of images to be processed by OCR, on the left of the dialog.</p>
<p>On the right side, optional widgets display all the settings components for setting up Tesseract options:</p>
<ul>
<li>Language : Specify language(s) used for OCR.</li>
<li>Segmentation mode : Specify page segmentation mode.</li>
<li>Engine mode : Specify OCR Engine mode.</li>
<li>Resolution dpi : Specify DPI for the input image.</li>
</ul>
<p>A text Editor for visualizing the content of text detected in the image.</p>
<p>A button to save the content in files.</p>
<p><strong>Important link</strong>:</p>
<p>I would like to share the single merge request that contains essentially all of the plugin implementation during the GSoC period:</p>
<p><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/commits#f674cbba0762d717fd1e6268b810b27c7bb43af8">https://invent.kde.org/graphics/digikam/-/merge_requests/177</a></p>
<h1 id="implementation-"><strong>Implementation :</strong></h1>
<p>First of all, a <strong>TextConverterPlugin</strong> is created: an interface that contains brief information about the OCR processing plugin. TextConverterPlugin inherits from the class <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/dplugins/core/dplugingeneric.h">DPluginGeneric</a>, a digiKam external plugin class whose virtual functions are overridden to provide the new features.</p>
<p>This object includes methods overridden from the parent class:</p>
<table>
<thead>
<tr>
<th>Functions</th>
<th>Descriptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>name()</td>
<td>Returns the user-visible name of the plugin, providing enough information as to what the plugin is about in the context of digiKam.</td>
</tr>
<tr>
<td>iid()</td>
<td>Returns the plugin’s unique top-level internal identification property. In this case, the identification text is a constant token string: “org.kde.digikam.plugin.generic.TextConverter”</td>
</tr>
<tr>
<td>icon()</td>
<td>Returns an icon for the plugin as a QIcon</td>
</tr>
<tr>
<td>authors()</td>
<td>Returns the list of plugin authors, with details such as names, emails, copyright years, and roles</td>
</tr>
<tr>
<td>description()</td>
<td>Returns a short description of the plugin</td>
</tr>
<tr>
<td>detail()</td>
<td>Returns a long description of the plugin</td>
</tr>
<tr>
<td>setup()</td>
<td>Creates all internal object instances for a given parent.</td>
</tr>
</tbody>
</table>
<p>The interface is shown like:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/about.png?raw=true" alt="figure1" /></p>
<h2 id="text-converter-dialog"><strong>Text Converter Dialog</strong></h2>
<p>The idea is to set up a dialog widget: a dialog box using TextConverterDialog to list the files processed by OCR, with a status to indicate the processing.</p>
<p><strong>TextConverterDialog</strong> is a <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/dplugins/widgets/dplugindialog.h"><strong>DPluginDialog</strong></a> (digiKam’s default plugin dialog class) that uses <a href="https://doc.qt.io/qt-6/qdialogbuttonbox.html"><strong>QDialogButtonBox</strong></a>, which presents buttons in a layout that conforms to the interface guidelines of the platform, lets the developer add buttons to it, and automatically uses the appropriate format for the user’s desktop environment. We can see the design of the dialog here:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/dialog_UML.png?raw=true" alt="figure1" /></p>
<p>It principally implements a slot <code class="language-plaintext highlighter-rouge">TextConverterAction()</code> that uses internal methods and is called to apply OCR processing to the images after pre-processing.</p>
<p>The main dialog consists of all Tesseract options widgets (Text converter Settings) and a text editor to view the OCR result discussed in the following sections.</p>
<h2 id="text-converter-settings"><strong>Text converter Settings</strong></h2>
<p>The <strong>Text Converter Settings</strong> object is a widget containing all the components for setting up the Tesseract options for users to select. The main widgets in these settings are:</p>
<p>Three <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/widgets/combo/dcombobox.h"><strong>DcomboBox</strong></a> widgets (a combo box widget re-implemented with a reset button to switch to a default item ):</p>
<ul>
<li>Page Segmentation mode (psm).</li>
<li>Specify OCR Engine mode.</li>
<li>The language or script to use.</li>
</ul>
<p>One <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/widgets/range/dnuminput.h"><strong>DNumInput</strong></a>, an integer input widget in digiKam, used for the Tesseract DPI resolution option.</p>
<p>Two <a href="https://doc.qt.io/qt-6/qcheckbox.html"><strong>QCheckBox</strong></a> widgets for the two options of saving the OCR text into separate text files and of hosting the recognized text in XMP (Extensible Metadata Platform).</p>
<p>TextConverterSettings is a member object of the dialog. Here is a visualization of the OCR settings:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/ocr_settings.png?raw=true" alt="figure1" /></p>
<h2 id="text-editor"><strong>Text Editor</strong></h2>
<p>A text editor for visualizing the text detected in the image: a QTextWidget plus the Sonnet spell checker, showing the recognized text from a scanned document. If I select a file from the list on the left, the editor content changes accordingly. The text editor is a <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/widgets/text/dtextedit.h"><strong>DTextEdit</strong></a>, which combines these two functionalities.</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/text_edit.png?raw=true" alt="figure1" /></p>
<h2 id="text-converter-list"><strong>Text Converter List</strong></h2>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/ocrList_UML.png?raw=true" alt="figure1" /></p>
<p>The widget list of images to be processed by OCR, whose URLs point to the pictures, is based on a generic list for all plugins built on QTreeWidget. TextConverterList inherits from <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/dplugins/widgets/ditemslist.h"><strong>DItemsList</strong></a>, an abstract items list. TextConverterList is composed of <a href="https://invent.kde.org/graphics/digikam/-/blob/master/core/libs/dplugins/widgets/ditemslist.h"><strong>DItemsListViewItem</strong></a> objects, tree widget items used to hold rows of information. Rows usually contain several columns of data, each of which can contain a text label and an icon.</p>
<p>Each text converter item consists of four specific columns:</p>
<table>
<thead>
<tr>
<th>Column</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>File Name</td>
<td>The URL pointing to the image.</td>
</tr>
<tr>
<td>Recognized Words</td>
<td>The number of recognized words.</td>
</tr>
<tr>
<td>Target File</td>
<td>The target file where the converted text is saved.</td>
</tr>
<tr>
<td>Status</td>
<td>An indication of the processing state.</td>
</tr>
</tbody>
</table>
<p>Here is a capture of the text converter list widget that I implemented:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/list.png?raw=true" alt="figure1" /></p>
<h2 id="results"><strong>Results</strong></h2>
<p>The architecture and position of each widget component are designed in the following image. This visualization made the implementation easier:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/layoutplugin.png?raw=true" alt="figure1" /></p>
<p>Here is the expected GUI for the batch processing of the plugin:</p>
<p><img src="https://github.com/quochungtran/quochungtran.github.io/blob/master/image_blog/week5-6/widget.png?raw=true" alt="figure1" /></p>
<h3 id="main-commits">Main commits</h3>
<p><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=3f9d78957906eac190f457a080b764bb958ab41a">3f9d7895</a></p>
<p><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=81bf0d9c458ab643449a592efeb395896963c0d6">81bf0d9c</a></p>
<p><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=5970d961f7c5512d46ed93effc6a579f218e6c3a">5970d961</a></p>
<p><a href="https://invent.kde.org/graphics/digikam/-/merge_requests/177/diffs?commit_id=f17dce695b14c0c4387ed6e1d49166b86518bf52">f17dce69</a></p>
<h3 id="next-step">Next step</h3>
<p>In the next few weeks, I will:</p>
<ul>
<li>Implement the Ocr Tesseract Engine object used for OCR text from the image.</li>
<li>Implement an internal multi-thread for OCR processing image.</li>
<li>Polish and re-implement code if necessary.</li>
</ul>["TRAN Quoc Hung"]In the following weeks (June 11, 2022 to July 25, 2022 ), I will try to : Make decisions to choose the architecture of the plugin. Design UML for each component of the plugin. Write documentation.<Weeks 1,2> Tesseract Page Segmentation Modes (PSMs) Explained and their relations2022-06-27T14:41:53+00:002022-06-27T14:41:53+00:00https://quochungtran.github.io/junk/2022/06/27/week1-2<p>The OCR practitioners use the relevant page segmentation mode (<code class="language-plaintext highlighter-rouge">psm</code>). The Tesseract API provides several page segmentation modes (default behavior of <strong>Tesseract</strong> in mode psm 3) if we want to extract a single line, a single word, a single character, or a different orientation. Here is the list of the supported page segmentation modes by <strong>Tesseract</strong>:</p>
<table>
<thead>
<tr>
<th><strong>modes</strong></th>
<th><strong>Descriptions</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Orientation and Script Detection (OSD) only.</td>
</tr>
<tr>
<td>1</td>
<td>Automatic page segmentation with OSD.</td>
</tr>
<tr>
<td>2</td>
<td>Automatic page segmentation, but no OSD or OCR.</td>
</tr>
<tr>
<td>3</td>
<td>Fully automatic page segmentation, but no OSD. (Default)</td>
</tr>
<tr>
<td>4</td>
<td>Assume a single column of text of variable sizes.</td>
</tr>
<tr>
<td>5</td>
<td>Assume a single uniform block of vertically aligned text.</td>
</tr>
<tr>
<td>6</td>
<td>Assume a single uniform block of text.</td>
</tr>
<tr>
<td>7</td>
<td>Treat the image as a single text line.</td>
</tr>
<tr>
<td>8</td>
<td>Treat the image as a single word.</td>
</tr>
<tr>
<td>9</td>
<td>Treat the image as a single word in a circle.</td>
</tr>
<tr>
<td>10</td>
<td>Treat the image as a single character.</td>
</tr>
<tr>
<td>11</td>
<td>Sparse text. Find as much text as possible in no particular order.</td>
</tr>
<tr>
<td>12</td>
<td>Sparse text with OSD.</td>
</tr>
<tr>
<td>13</td>
<td>Raw line. Treat the image as a single text line, bypassing Tesseract-specific hacks.</td>
</tr>
</tbody>
</table>
<p>In the first two weeks (June 13, 2022 to June 27, 2022), I will try to:</p>
<ul>
<li>
<p>Learn how choosing a PSM can be the difference between a correct and an incorrect OCR result, and review the 14 PSMs built into the Tesseract OCR engine.</p>
</li>
<li>
<p>Witness examples of each of the 14 PSMs in action.</p>
</li>
<li>
<p>Share my tips, suggestions, and best practices when using these PSMs.</p>
</li>
</ul>
<h4 id="source-code-and-github-repository"><strong>Source code and github Repository</strong></h4>
<p>This blog is accompagied by my <a href="https://invent.kde.org/quochungtran/gsoc2022-ocr-tesseract-test">Kde gitlab resposiory</a> that containes the source code.</p>
<p>The purpose is to decode these options into appropriate and relevant choices that are easier for the user to understand. The second purpose is to help me design a pipeline of image pre-processing methods to enhance accuracy and compensate for the constraints of Tesseract.</p>
<h2 id="what-are-page-segmentation-modes"><strong>What Are Page Segmentation Modes?</strong></h2>
<p>The notion of a “page of text” is significant. For example, the default Tesseract PSM may work well if you are <strong>OCR’ing</strong> a scanned chapter from a book. But if we are trying to OCR only a single line, a single word, or maybe even a single character, then this default mode will produce either an empty string or nonsensical results.</p>
<p>Despite being a critical aspect of obtaining high OCR accuracy, Tesseract’s page segmentation modes are somewhat of a mystery to many new OCR practitioners. I will review each of the 14 Tesseract PSMs, gaining hands-on experience using them to correctly OCR images with the Tesseract OCR engine.</p>
<h2 id="getting-started"><strong>Getting Started</strong></h2>
<p>I use one or more Python scripts for this review. Setting a PSM in Python is as easy as setting an options variable.</p>
<h3 id="psm-0-orientation-and-script-detection-only"><strong>PSM 0. Orientation and Script Detection Only</strong></h3>
<p>The <code class="language-plaintext highlighter-rouge">--psm 0</code> mode does not perform OCR directly, but at least it tells us the context of the scanned image.</p>
<p>I have constructed some images with different rotations and watched it happen. In this case, we have an original text image like :</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/data/psm0/org.png?raw=true" alt="figure1" /></p>
<p>Tesseract has determined that this input image is unrotated (i.e., 0◦) and that the script is correctly detected as Latin, as the following output :</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/im11.png?raw=true" alt="figure2" /></p>
<p>Considering various rotations of this image:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/im10.png?raw=true" alt="figure3" /></p>
<p>As we can see, the output is correct for images rotated by one of the angles in <code class="language-plaintext highlighter-rouge">{0, 90, 180, 270}</code> clockwise. For other rotation angles, such as 115 or 130 degrees, the reported orientation is snapped to one of these fixed orientations and is therefore incorrect.</p>
<p>The orientation and script detection (OSD) examines the input of the image and returns two values :</p>
<ul>
<li>
<p>How the page is oriented, in degrees, where <code class="language-plaintext highlighter-rouge">angle = {0, 90, 180, 270}</code>.</p>
</li>
<li>
<p>The confidence of the script (i.e., graphics signs/writing system), such as Latin, Han, Cyrillic, etc.</p>
</li>
</ul>
<p>Think of the <code class="language-plaintext highlighter-rouge">--psm 0</code> mode as a “meta-information” mode in which <strong>Tesseract</strong> provides you with just the script and rotation of the input image; it may help when implementing pre-processing methods such as deskewing the text image.</p>
<h3 id="psm-1-automatic-page-segmentation-with-osd">PSM 1. Automatic Page Segmentation with OSD</h3>
<p>For this option, automatic page segmentation for OCR is performed, and the OSD information is utilized in the OCR process. I attempted to take the images in figure 1 and pass them through Tesseract using this mode; we notice that no OSD information is returned. Tesseract must be performing OSD internally but not returning it to the user.</p>
<blockquote>
<p>Note: PSM 2 (Automatic Page Segmentation, But No OSD, or OCR) is not implemented in Tesseract.</p>
</blockquote>
<h3 id="psm-3-fully-automatic-page-segmentation-but-no-osd">PSM 3. Fully Automatic Page Segmentation, But No OSD</h3>
<p>PSM 3 is the default behavior of Tesseract: it automatically attempts to segment the text, then OCRs it and returns the result.</p>
<h3 id="psm-4-assume-a-single-column-of-text-of-variable-sizes">PSM 4. Assume a Single Column of Text of Variable Sizes</h3>
<p>A typical example for this mode is a spreadsheet, table, receipt, etc., where data needs to be concatenated row-wise. I take a small sample, a receipt from the grocery store, and try to OCR this image using the default <code class="language-plaintext highlighter-rouge">--psm 3</code> mode:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img1.png?raw=true" alt="figure4" /></p>
<p>We notice that the result is not what we expected; in this mode, Tesseract cannot infer that we are examining column data and that text along the same row should be associated. So let’s see the output with the <code class="language-plaintext highlighter-rouge">--psm 4</code> option. The result is better:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img2.png?raw=true" alt="figure5" /></p>
<h3 id="psm-5-assume-a-single-uniform-block-of-vertically-aligned-text">PSM 5. Assume a Single Uniform Block of Vertically Aligned Text</h3>
<p>In this mode, we wish to OCR a single block of vertically aligned text, positioned at the top, the middle, or the bottom of the page. However, I could not find an instance that corresponds exactly to this mode.</p>
<p>In my experiment, <code class="language-plaintext highlighter-rouge">--psm 5</code> behaves like <code class="language-plaintext highlighter-rouge">--psm 4</code> combined with a 90° clockwise rotation, and performs well exclusively on images rotated 90° clockwise. For example:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img3.png?raw=true" alt="figure5" /></p>
<p>Tesseract can process the receipt from the illustration above, and we obtain a more acceptable output:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img4.png?raw=true" alt="figure6" /></p>
<h3 id="psm-6-assume-a-single-uniform-block-of-text">PSM 6. Assume a Single Uniform Block of Text</h3>
<p>Uniform text here means a single font without any variation, as on a typical page of a book or novel.</p>
<p>Passing a page from the famous book The Wind in the Willows through the default mode 3, we see the result below:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img5.png?raw=true" alt="figure7" /></p>
<p>We can see that the output for the image above contains many extra newlines and much whitespace that the user would have to take time to remove. By using the <code class="language-plaintext highlighter-rouge">--psm 6</code> mode, we are better able to OCR this big block of text; the output contains fewer errors and demonstrates the correct form of a text page:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/img6.png?raw=true" alt="figure8" /></p>
<h3 id="psm-7-treat-the-image-as-a-single-text-line-and-psm-8-treat-the-image-as-a-single-word">PSM 7. Treat the Image as a Single Text Line and PSM 8. Treat the image as a Single Word</h3>
<p>As their names suggest, modes 7 and 8 are suitable when we want to OCR a single line or a single word in an image. The test case is often an image of the name of a place or restaurant, or a small one-line slogan.</p>
<p>For example, we may need to extract a license/number plate, which takes time for the user to transcribe. With mode 3, we do not obtain any result. Conversely, modes 7 and 8 tell Tesseract to treat the input as a single line or a single word (horizontal on the line):</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/data/psm7/broadway.png?raw=true" alt="figure9" /></p>
<h3 id="psm-9-treat-the-image-as-a-single-word-in-a-circle">PSM 9. Treat the Image as a Single Word in a Circle</h3>
<p>I attempted to find a test image corresponding to the meaning of a single word in a circle, but Tesseract produced empty or incorrect results. Accordingly, the test case for this mode is rare, and we can ignore it.</p>
<h3 id="psm-10-treat-the-image-as-a-single-character">PSM 10. Treat the Image as a Single Character</h3>
<p>Treating the image as a single character is beneficial when we extract characters one by one. However, modes 7 and 8 can work just as well, because a single character can also be considered a single word, so we can usually skip <code class="language-plaintext highlighter-rouge">--psm 10</code>.</p>
<h3 id="psm-11-sparse-text-find-as-much-text-as-possible-in-no-particular-order">PSM 11. Sparse Text: Find as Much Text as Possible in No Particular Order</h3>
<p>With automatic page segmentation, Tesseract tries to infer document structure and inserts additional whitespace and newlines. Therefore, for unstructured text (sparse text), <code class="language-plaintext highlighter-rouge">--psm 11</code> may be the best choice.</p>
<p>In my experiments, the perfect sample is a table of contents, a menu, etc., where the text is relatively sparse and does not stick together as one block. Using the default <code class="language-plaintext highlighter-rouge">--psm 3</code>, we get results with several extra whitespaces and newlines: Tesseract tries to infer the structure of the document, but there is no document structure here.</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/im7.png?raw=true" alt="figure10" /></p>
<p>Meanwhile, the input image is sparse text with <code class="language-plaintext highlighter-rouge">--psm 11</code>, and this time the results from Tesseract are better :</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/image_blog/im8.png?raw=true" alt="figure11" /></p>
<h3 id="psm-12-sparse-text-with-osd">PSM 12. Sparse Text with OSD</h3>
<p>The <code class="language-plaintext highlighter-rouge">--psm 12</code> mode is identical to <code class="language-plaintext highlighter-rouge">--psm 11</code> but adds OSD (similar to <code class="language-plaintext highlighter-rouge">--psm 0</code>).</p>
<h3 id="psm-13-raw-line-treat-the-image-as-a-single-text-line-bypassing-hacks-that-are-tesseract-specific">PSM 13. Raw Line: Treat the Image as a Single Text Line, Bypassing Hacks That Are Tesseract-Specific</h3>
<p>In this case, Tesseract’s internal <strong>Tesseract-specific</strong> pre-processing techniques may hurt OCR performance, or <strong>Tesseract</strong> may not automatically identify the font face. For example, we have some samples with different fonts, like:</p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/data/psm13/test.png?raw=true" alt="figure12" /></p>
<p><img src="https://github.com/quochungtran/Gsoc2022-tesseract-ocr/blob/master/data/psm13/test2.png?raw=true" alt="figure13" /></p>
<p>With the default mode, Tesseract returns empty output. In contrast, we obtain the expected results by treating the image as a single raw line of text, ignoring all page segmentation algorithms and Tesseract pre-processing functions.</p>
<h2 id="conclusion">Conclusion</h2>
<p>To make it easier to differentiate between the modes and decide which to ignore, I would like to classify them by document type.</p>
<ul>
<li>
<p>Mode 3 (default) should always be tried first for all cases. In the best case, the accuracy of the OCR output is high and we are done. Otherwise, we have to modify the configuration.</p>
</li>
<li>
<p>Mode 4 is often used for spreadsheets, tables, receipts, etc., where we need to concatenate data row-wise.</p>
</li>
<li>
<p>Mode 6 is helpful for pages of text that use a single font, such as books, novels, or emails.</p>
</li>
<li>
<p>Mode 11 is useful for unstructured or sparse text, like menus.</p>
</li>
<li>
<p>Modes 7, 8, 10, or even 13 treat the image as a single line or a single word, with unique fonts or unusual characters, such as logos, license plates, and labels.</p>
</li>
</ul>
<p>Based on this exploration, in the next step I will construct pre-processing methods, such as properly deskewing text images to work around the limitations of the modes, removing shadows, and correcting perspective. The purpose is to create a pipeline that automatically pre-processes images for Tesseract, so that the user can efficiently reduce the choice of modes.</p>
<h1>Project Introduction - New digiKam Plugin to Process Optical Character Recognition (OCR)</h1>
<h2 id="problem-description"><strong>Problem description</strong></h2>
<p>digiKam is an advanced open-source digital photo management application that runs on Linux, Windows, and macOS. The application provides a comprehensive set of tools for importing, managing, editing, and sharing photos and raw files.</p>
<p>However, many digiKam users take many kinds of document pictures containing text that needs to be extracted for specific reasons. Therefore, it would be practical to generate tags and add descriptions or captions automatically.</p>
<p>Implementing Optical Character Recognition (OCR) technology is a proposed solution for automating and extracting data. Printed or written text from a scanned document or image file can be converted to text in a machine-readable form and can be used for data processing, such as editing or searching.</p>
<p>The goal of this project is to implement a new generic DPlugin to process images in batch with Tesseract, an open-source OCR engine. Even though it can sometimes be painful to set up and tune, only a few free and powerful OCR alternatives are available on the current market. Tesseract is compatible with many programming languages and frameworks through wrappers. It can be used with the existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.</p>
<p>Thanks to the OCR plugin in digiKam, users will be able to select optional parameters to improve the quality of the detected text recorded in the image metadata. The output text will be saved in XML files, recorded in the Exif of JFIF files, or stored as a text file in the location the user chooses. Furthermore, digiKam users will be able to review the results and correct any OCR errors (spell checking).</p>
<p>In this document, I will first present my planned implementation in detail and, finally, my provisional schedule for each step.</p>
<h2 id="plan"><strong>Plan</strong></h2>
<p>The project consists of three components:</p>
<h3 id="make-a-new-base-for-evaluating-algorithms"><strong><em>Make a new base for evaluating algorithms</em></strong></h3>
<p>Firstly, I will construct the test set for evaluation. Images can be collected from websites with samples from popular cameras like Nikon, Sony, etc. Then unit tests will be implemented using the current function interface; they will evaluate the performance of the OCR plugin. This evaluation will give a clearer status and perspective. These tests can also be used to benchmark the accuracy of the algorithm and its execution time, and later to improve the performance of the plugin.</p>
<h3 id="implement-pipeline-to-evaluate-good-prepost-preprocessing-algorithm-in-general-ocr-cases"><strong><em>Implement pipeline to evaluate good pre/post preprocessing algorithm in general OCR cases</em></strong></h3>
<p>Optical Character Recognition remains a challenging problem when text occurs in unconstrained environments, due to brightness, natural scenes, geometrical distortions, complex backgrounds, and/or diverse fonts. According to the document [2], Tesseract performs various image processing operations internally before actually doing OCR.</p>
<p>Therefore, each type of document image needs to be pre-processed or segmented before the text conversion provided by Tesseract. During the implementation of each algorithm, we can identify the parts of these processes that are duplicated across the cases in the test data set. Each OCR algorithm needs image pre-processing before analysis; the purpose of this phase is to optimize the pre-processing time and to increase the accuracy of OCR processing.</p>
<p>Post-processing is the next step to detect and correct linguistic misspellings in the OCR output text after the input image has been scanned and completely processed.</p>
<p>And finally, the output text will be recorded in XML files or saved in the form of a text file.</p>
<p>I plan to cover some of the most basic and important pre-processing and post-processing techniques for OCR.</p>
<h3 id="plugin-implementation"><strong><em>Plugin implementation</em></strong></h3>
<p>The idea of the OCR processing plugin in digiKam is inspired by the conversion from the RAW to the DNG version. Most of the generic plugins of digiKam inherit their components from abstract base classes, each providing its own independent implementation through the same interface. The following sections propose the general conception of a text converter plugin to process OCR. The general architecture of the plugin will be introduced in this part; the details of the plugin will be determined explicitly after having a well-tested working version for validating the pre/post-processing algorithms.</p>