Which tool to align PDF files and create TM? Thread poster: Jan Sundström
Hi all, Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss? OCR with a third-party application and then just use Trados WinAlign, or is there anything better/more direct? If it's good, it's worth paying for, so the price tag is not a problem. What are your experiences? | | | RobinB United States Local time: 10:31 German to English "Minimum fuss" | Aug 22, 2006 |
Jan Sundström wrote: Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss? I suppose it depends on your concept of "minimum".... There isn't any quick and easy way to convert and align PDFs, AFAIK. We use either the ABBYY PDF converter or Iceni Gemini - some PDFs convert better with ABBYY, others with Gemini. But whichever you use, there will still be quite a lot of cleaning up to do before you align, for example: - eliminating unnecessary page headers/footers - eliminating "hard" line breaks in sentences - eliminating "manual" line break hyphens - reformatting: it often happens that multiple columns get screwed up. The same applies to tables. Additionally, depending on the font used in the DTP system, you might find yourself confronted with ligatures, which tend to leave spaces in the middle of words when you convert. These have to be identified (manually!) and eliminated. These are just some of the issues relating to PDFs created from DTP systems. The quality of PDFs created by scanning in paper copy will depend critically on the OCR system you use. Then there are the TM-specific edits you need to do before it's worth aligning. For Trados, for example, this means inserting non-breaking spaces after colons, inserting hard line breaks if the sentence ends in a figure, that sort of thing. And then you can move on to the joy of alignment... | | | Heinrich Pesch Finland Local time: 18:31 Member (2003) Finnish to German + ... Don't bother | Aug 22, 2006 |
If the documents are longer than a few pages, it is not worth the effort. I once tried to align two Word files, which I had translated mostly myself, but the resulting TM was not usable. You can retranslate the file by copying and pasting from the translation; that will deliver a decent TM. Regards Heinrich | |
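The cleanup steps RobinB lists above (hard line breaks, manual line-break hyphens, ligatures) can be partly scripted once the text is out of the PDF. The following is only a minimal sketch in Python, not what any of the tools named in this thread actually do. The function name `clean_extracted_text` and the `LIGATURES` table are made up for illustration, and it assumes that blank lines mark paragraph breaks in the extracted text. It only handles ligatures that survive as Unicode ligature characters; the stray spaces in the middle of words that RobinB describes still need a manual check.

```python
import re

# Latin ligature code points that PDF text extraction sometimes leaves behind.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(text: str) -> str:
    """Undo some typical PDF layout artifacts before alignment (illustrative only)."""
    # Replace ligature characters with their plain-letter equivalents.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across a line break: "align-\nment" -> "alignment".
    # Caution: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: blank lines separate paragraphs; single line breaks are layout only.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)
```

Page headers/footers and scrambled columns or tables are not touched here; those depend too much on the individual document.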
Hi Jan, Logiterm does this without any trouble - but of course you need Logiterm! (www.terminotix.com) | | | RobinB United States Local time: 10:31 German to English Logiterm experience | Aug 22, 2006 |
Barnaby Capel-Dunn wrote: Logiterm does this without any trouble - but of course you need Logiterm! (www.terminotix.com) Hi Barnaby, Several colleagues have recently recommended Logiterm to me - so forcefully that I'll probably buy a copy for evaluation. Perhaps you could let us know your experience with this software. Can it handle complex formats in PDFs (multiple columns, tables, that sort of thing)? How good is it at converting special characters? How long does it take to make a bitext out of e.g. two 200-page PDFs (one in each language)? TIA, Robin | | |
First off, I find that OCR isn't the way to go here unless your PDFs are image files (scans, faxes, etc.). If they are text PDFs, this is what I do: I use AutoUnbreak (look it up on Google, it's free and does the trick nicely). First, I select all in the PDF and copy. Then I paste it into AutoUnbreak. The software only takes 65000 characters at once, so with longer PDFs, you may want to break it up into several smaller sections. AutoUnbreak removes carriage returns and creates RTF text, so you keep the formatting, but the unnecessary carriage returns are removed. You paste this into a Word document. You repeat the procedure with the target text and paste it into another empty Word file. Then, you simply align with your usual align tool (I use WinAlign, it works well for this purpose). Once aligned, all you have to do is check that the alignment is done OK (sometimes, a source segment is broken up into two target segments and vice versa). Once the alignment is satisfactory, export the aligned file pair (or project) and voilà! I have used this several times to create TMs using parallel texts from government websites prior to starting an assignment and it helps a lot. Good luck! | | | Samuel Murray Netherlands Local time: 17:31 Member (2006) English to Afrikaans + ... OCR and align with PlusTools | Aug 22, 2006 |
Jan Sundström wrote: Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss? Today I had to do just that. I have an OCR scanner with document feeder so that I can scan multiple pages quickly. I then extracted the segments using Wordfast, and aligned the two files using PlusTools. I don't know how good WinAlign is... is it user-friendly? PlusTools's align feature basically puts the text into a two column table with one row per segment, and you have keyboard shortcuts for merging and splitting cells (Alt+S to split at cursor point, and Alt+M to merge the text with the cell beneath it). I've seen aligners that work with the mouse, where you have to draw lines from one side of the screen to the other, but those are IMO hopelessly too cumbersome. What other free aligners are there? I know of http://sourceforge.net/projects/bitext2tmx and the old Cypressoft aligner. What others are there? | |
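The table-style alignment described above (one row per segment, split at the cursor, merge with the cell beneath) boils down to two small list operations. Here is a rough sketch in Python of that idea; it is not PlusTools code, and the names `split_cell`, `merge_down` and `to_rows` are invented for this example. Each column is assumed to be a plain list of segment strings.

```python
from typing import List, Tuple


def split_cell(column: List[str], row: int, position: int) -> None:
    """Split the cell at `row` at character `position`; the tail becomes a new row below."""
    text = column[row]
    column[row] = text[:position].rstrip()
    column.insert(row + 1, text[position:].lstrip())


def merge_down(column: List[str], row: int) -> None:
    """Merge the cell at `row` with the cell beneath it."""
    if row + 1 >= len(column):
        raise IndexError("no cell beneath row %d" % row)
    below = column.pop(row + 1)
    column[row] = (column[row] + " " + below).strip()


def to_rows(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair the two columns row by row, padding the shorter one with empty cells."""
    length = max(len(source), len(target))
    padded_source = source + [""] * (length - len(source))
    padded_target = target + [""] * (length - len(target))
    return list(zip(padded_source, padded_target))
```

For instance, if the source column holds "Hello. How are you?" in one cell while the target column has "Hej." and "Hur mår du?" in two, `split_cell(source, 0, 6)` brings the rows back into step, which is the kind of fix the Alt+S shortcut is used for.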
Will get back to you later in the day! | | |
Robin, I've been in touch with Logiterm and this is what they say: Can it handle complex formats in PDFs (multiple columns, tables, that sort of thing)? Yes, it can handle PDFs with columns, tables, etc. Keep in mind though, that PDFs are the most difficult format to handle and it may cause certain misalignments. The best way of seeing how it handles complicated PDF files is to either ask for a 30-day trial of LogiTerm or send the files to Terminotix so they process them. How good is it at converting special characters? Which special characters are we talking about? The Professional Edition handles Latin-only languages. How long does it take to make a bitext out of e.g. two 200-page PDFs (one in each language)? I just created a bitext with a 155-page French document and its corresponding 153-page English document and it took 25 seconds. I hope this is of some use to you? I personally am a great fan of Logiterm. I must admit I don't use all its features by a long chalk but IMO it's worth its price for its alignment tool and its Logitrans component alone. Best Barnaby | | | Jan Sundström Sweden Local time: 17:31 English to Swedish + ... TOPIC STARTER Logiterm exporting to TM | Aug 25, 2006 |
Hi all, I've been reading the Logiterm specs, but the part on exporting to a TM is very brief: "Compatible with translation memories [...] Inversely, you can also take one or more bitexts and create documents that can be imported into a translation memory." This is the function that I'm specifically looking for. A question to those of you who have tried this: Is the alignment/creation of TM through Logiterm more accurate or user-friendly compared to WinAlign or other similar alignment tools? It seems like a very competent tool, but the documentation on exactly which charsets are supported is a bit sketchy. Will it handle Scandinavian characters (åäö), and not mangle them while exporting?! I guess the best way to find out is to download the demo, but if you have any user experience, it would be valuable to find out first! Thanks a lot, Jan | | | Jan Sundström Sweden Local time: 17:31 English to Swedish + ... TOPIC STARTER AutoUnbreak - a detour? | Aug 25, 2006 |
Viktoria Gimbe wrote: I use AutoUnbreak (look it up on Google, it's free and does the trick nicely). First, I select all in the PDF and copy. Then I paste it into AutoUnbreak. The software only takes 65000 characters at once, so with longer PDFs, you may want to break it up into several smaller sections. AutoUnbreak removes carriage returns and creates RTF text, so you keep the formatting, but the unnecessary carriage returns are removed. You paste this into a Word document. Without having tried AutoUnbreak, this sounds like a "poor man's solution" to what can be done in fewer steps with other commercial software. My guess is that the copy-paste way is also very vulnerable to tables, inserted text blocks etc. Converting a PDF with tagged text into an RTF document can just as well be achieved by saving the PDF as RTF in Acrobat 7.0, with just a few mouse clicks. Recent versions of Acrobat interpret line breaks very well, so extra carriage returns hardly occur anyway. ABBYY PDF Transformer also does this virtually automatically, with hardly any erroneous CRs. I'm not sure if ABBYY is just an "optical" character recognition tool, but my guess is that it extracts tagged text directly, rather than interpreting it optically. Anyway, these points aside, I was imagining a tool that would merge the conversion and the alignment steps of a PDF, rather than treating them as separate individual tasks. And it seems that Logiterm is the closest match so far... /Jan | |
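On the question of getting aligned segments into a TM without mangling åäö: one tool-neutral route is to write the pairs out as a TMX file in UTF-8, since TMX is the usual exchange format for translation memories. The sketch below uses only Python's standard library and is not how Logiterm or WinAlign export; the function names `build_tmx` and `write_tmx` and the header values (including the `creationtool` name) are placeholders chosen for this example, and whether a given TM tool accepts the result should be tested with a small file first.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-aligner",  # hypothetical tool name
        "creationtoolversion": "0.1",       # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",              # placeholder value
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


def write_tmx(tree, path):
    """Write the tree as UTF-8 so characters such as å, ä and ö are stored as-is."""
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Usage would be along the lines of `write_tmx(build_tmx(pairs, "en", "sv"), "aligned.tmx")`, where `pairs` is the checked list of aligned segments and the file name is arbitrary.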
Jan Sundström Sweden Local time: 17:31 English to Swedish + ... TOPIC STARTER Some answers | Aug 25, 2006 |
Hi, I got a reply to some of my own questions by writing to Terminotix: **************** 1) LogiTerm's Professional edition supports Scandinavian characters. 2) All alphabets using ISO Latin 1 characters are supported. 3) All file formats [sic!], except for QuarkXpress and Framemaker. PDF image files are, of course, not supported, all others are. When a PDF was scanned in an image file, we recommend that it be rescanned using, for example, Omnipage. **************** Good enough, I'm downloading the demo /Jan | | | From a Logiterm user | Aug 25, 2006 |
Hope you enjoy it Jan! I certainly like it a lot. Let us know your impressions. Best Barnaby | | |