Document Cleaner: Get Rid of Too Many Tags

When you open a document in a CAT tool (e.g. memoQ, Trados, Wordfast), you might notice too many tags in some segments; such files have usually been converted from PDF sources. Incorrect formatting causes many problems in translation, especially in CAT tools, where it produces excessive tags that make the text hard to translate. There are a few ways to safely remove as many of these unnecessary tags as possible while retaining formatting and layout. One such tool is CodeZapper, which we have explored before.

Another option is Document Cleaner, part of TransTools. It is a collection of tools for preparing badly formatted documents for translation. It cleans a document of the tags that invariably appear in CAT tools (e.g. SDL Trados Studio, memoQ, Wordfast Pro, DejaVu) when the document has many bookmarks, non-standard character spacing, text and paragraph shading, hyphenation, character styles on top of regular formatting, and so on.

Available commands

Document Cleaner provides the following commands:

Tag Cleaner – When OCR or PDF conversion software processes text in PDF files or images, it often applies many small, complex formatting changes, each of which shows up as a tag in CAT software. Most of the time, such complex formatting was not present in the original editable document, so you can safely remove it and still have a document which looks like the original.

The Tag Cleaner command performs the following operations to minimise tags and make the document more user-friendly:

  • fixes invisible formatting problems,
  • removes text and paragraph shading,
  • removes text highlighting,
  • resets uneven character spacing,
  • removes manual hyphenation,
  • fixes formatting problems, and
  • turns ‘black’ font colour into ‘automatic’ colour.
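
To see why operations like these reduce tag counts, it helps to look at how a .docx file stores text: each stretch of uniformly formatted text is a separate run, and CAT tools typically show a tag wherever the run formatting changes. The following is a minimal sketch, not TransTools' actual implementation. It assumes WordprocessingML markup, and the helper names (`q`, `format_key`, `is_plain_text_run`, `clean_paragraph`) and the choice of which properties count as noise are hypothetical, picked only for illustration. It strips character spacing, shading and highlighting, turns explicit black into automatic colour, and then merges neighbouring runs whose remaining formatting is identical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return f"{{{W}}}{tag}"


# Run properties this sketch treats as noise (an illustrative assumption).
NOISE = {q("spacing"), q("shd"), q("highlight")}


def format_key(run):
    """Hashable summary of a run's remaining formatting, used to compare neighbours."""
    rpr = run.find(q("rPr"))
    if rpr is None:
        return ()
    return tuple((c.tag, tuple(sorted(c.attrib.items()))) for c in rpr)


def is_plain_text_run(run):
    """True if the run holds only optional properties plus a single text node."""
    tags = [c.tag for c in run]
    return tags.count(q("t")) == 1 and set(tags) <= {q("rPr"), q("t")}


def clean_paragraph(p):
    """Strip noisy run properties, then merge adjacent runs whose formatting matches."""
    for run in p.findall(q("r")):
        rpr = run.find(q("rPr"))
        if rpr is None:
            continue
        for prop in list(rpr):
            if prop.tag in NOISE:
                rpr.remove(prop)
            elif prop.tag == q("color") and prop.get(q("val")) == "000000":
                prop.set(q("val"), "auto")  # explicit black becomes automatic
        if len(rpr) == 0:
            run.remove(rpr)  # drop the now-empty properties element

    prev = None
    for child in list(p):
        if child.tag != q("r") or not is_plain_text_run(child):
            prev = None  # anything else (fields, tabs, bookmarks) breaks the chain
            continue
        if prev is not None and format_key(prev) == format_key(child):
            prev_t = prev.find(q("t"))
            prev_t.text = (prev_t.text or "") + (child.find(q("t")).text or "")
            # Keep leading/trailing spaces of the merged text significant.
            prev_t.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
            p.remove(child)
        else:
            prev = child
    return p


# One word split into three runs by uneven spacing and shading.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:rPr><w:spacing w:val="-2"/></w:rPr><w:t>Tra</w:t></w:r>'
    '<w:r><w:rPr><w:spacing w:val="3"/></w:rPr><w:t>nsl</w:t></w:r>'
    '<w:r><w:rPr><w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/></w:rPr>'
    '<w:t>ation</w:t></w:r>'
    '</w:p>'
)
paragraph = clean_paragraph(ET.fromstring(SAMPLE))
print(len(paragraph.findall(q("r"))))  # number of runs left after cleaning
```

In the sample, the three fragments of one word collapse into a single unformatted run, which is the kind of segment a CAT tool can show without inline tags. Runs that differ in meaningful formatting, such as bold, are left separate.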

You can also watch this video by Dominique Pivard about Tag Cleaner.

Resave – Some documents, when imported into CAT tools, contain too many tags in each segment. This command saves the document as RTF and then back to its original format, which often eliminates such ‘rogue’ tags.

Table Column Aligner – When you run OCR on a document containing multi-page tables, each such table is recognised as several tables, one per page. When you join these tables, however, they will often have misaligned vertical borders. This command helps to format such tables properly.

Line Removal – Some OCR tools insert vertical or horizontal floating lines instead of borders. This command helps you track them down and remove them much more easily.

UnFrame – Most OCR tools insert frames around tables, images, or text blocks, but these frames are often unnecessary and can cause problems for translators: when the text expands, the translation may not fit inside the frame or flow onto several pages. This command helps you remove such frames while retaining their contents.

Bookmark Cleanup – If a document contains bookmarks, they are imported into CAT tools as pairs of tags. This command removes specific types of bookmarks, such as bookmarks which are not referenced from fields or hyperlinks, or ‘table of contents’ bookmarks, etc.
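
As an illustration of the "not referenced" rule, here is a minimal sketch, not TransTools' actual implementation. It assumes WordprocessingML markup, where a bookmark is a `w:bookmarkStart`/`w:bookmarkEnd` pair sharing an id, internal hyperlinks name their target in `w:anchor`, and fields carry their instructions in `w:instrText` or `w:fldSimple`. The function names are hypothetical, and the field parsing is deliberately crude.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return f"{{{W}}}{tag}"


def referenced_bookmarks(root):
    """Collect names that internal hyperlinks or field instructions point at."""
    names = set()
    for link in root.iter(q("hyperlink")):
        if link.get(q("anchor")):
            names.add(link.get(q("anchor")))
    # Crude assumption: any token in a field instruction may be a bookmark name.
    for instr in root.iter(q("instrText")):
        names.update((instr.text or "").split())
    for field in root.iter(q("fldSimple")):
        names.update((field.get(q("instr")) or "").split())
    return names


def remove_unreferenced_bookmarks(root):
    """Delete bookmark start/end pairs that nothing refers to; return elements removed."""
    used = referenced_bookmarks(root)
    drop_ids = {
        b.get(q("id"))
        for b in root.iter(q("bookmarkStart"))
        if b.get(q("name")) not in used
    }
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            is_bookmark = child.tag in (q("bookmarkStart"), q("bookmarkEnd"))
            if is_bookmark and child.get(q("id")) in drop_ids:
                parent.remove(child)
                removed += 1
    return removed


SAMPLE = (
    f'<w:body xmlns:w="{W}"><w:p>'
    '<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>'
    '<w:bookmarkStart w:id="1" w:name="Intro"/>'
    '<w:r><w:t>Introduction</w:t></w:r>'
    '<w:bookmarkEnd w:id="1"/>'
    '<w:hyperlink w:anchor="Intro"><w:r><w:t>see intro</w:t></w:r></w:hyperlink>'
    '</w:p></w:body>'
)
body = ET.fromstring(SAMPLE)
print(remove_unreferenced_bookmarks(body))  # bookmark elements removed
```

The sketch keeps `Intro` because a hyperlink targets it and drops `_GoBack` because nothing does. Each dropped bookmark means one fewer tag pair in the CAT tool, without breaking cross-references.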

Apply Variable Row Height – When OCR tools process tables, they apply special formatting to each row so that the row height is equal to or larger than the height of the row in the original PDF/scan. Some tools use a fixed row height, which prevents the row from expanding as more text is added to it. This command allows you to remove this formatting so that the row expands or contracts in height depending on the amount of text in the row.
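
To make the idea concrete, here is a minimal sketch, not TransTools' actual implementation. It assumes WordprocessingML markup, where a row's height lives in a `w:trHeight` element whose `w:hRule` attribute is `exact` for a fixed height or `atLeast` for a minimum height. The function name `apply_variable_row_height` is hypothetical. Removing the element leaves the row to size itself to its contents.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return f"{{{W}}}{tag}"


def apply_variable_row_height(root):
    """Remove every row-height setting so rows size themselves to their text."""
    removed = 0
    for row_props in list(root.iter(q("trPr"))):
        for height in row_props.findall(q("trHeight")):
            row_props.remove(height)
            removed += 1
    return removed


# Two rows as an OCR tool might emit them: one fixed, one with a minimum height.
SAMPLE = (
    f'<w:tbl xmlns:w="{W}">'
    '<w:tr><w:trPr><w:trHeight w:val="300" w:hRule="exact"/></w:trPr>'
    '<w:tc><w:p><w:r><w:t>Fixed row</w:t></w:r></w:p></w:tc></w:tr>'
    '<w:tr><w:trPr><w:trHeight w:val="400" w:hRule="atLeast"/></w:trPr>'
    '<w:tc><w:p><w:r><w:t>Minimum-height row</w:t></w:r></w:p></w:tc></w:tr>'
    '</w:tbl>'
)
table = ET.fromstring(SAMPLE)
print(apply_variable_row_height(table))  # count of height settings removed
```

A gentler variant would change only `exact` rows to `atLeast`, which keeps the original look for short text but still lets rows grow when the translation is longer.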

Formatting Tools – In an attempt to match the original document format, OCR and PDF conversion tools use paragraph spacing, indentation, paragraph and text shading, character styles, etc. Formatting Tools is a collection of commands that allow you to reset specific types of paragraph and text formatting to their default values.

Click here to download TransTools, and here to learn more about Document Cleaner.

