Indexing of small (<12 KB) DOC files does not work

Author: Edward Hardin  Reference Number: AA-00538  Last Updated: 08/10/2010 11:57 AM

KMP uses the free AntiWord tool to index DOC files. This tool has a limitation that prevents it from indexing attachments smaller than 12 KB that were created in Word for Mac or OpenOffice.

To avoid this issue, we recommend using DOC files created in the Windows version of Word. If a file is created in the Mac version of Word or in OpenOffice, make sure it contains more than 1024 bytes of text (which corresponds to roughly 12 KB of total file size).
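As a rough illustration of the size guideline above, the following Python sketch flags DOC files that fall under the 12 KB mark before they are attached. It is only a heuristic and is not part of KMP or AntiWord: the function name `may_be_skipped_by_indexer` and the constant `SMALL_DOC_THRESHOLD_BYTES` are hypothetical, and the check cannot tell which application produced the file, so small files saved by Word for Windows will be flagged as well even though they index correctly.

```python
import os

# Threshold taken from the article: affected files are mostly under about 12 KB.
# The exact cutoff is an assumption; the AntiWord developers only say "about".
SMALL_DOC_THRESHOLD_BYTES = 12 * 1024


def may_be_skipped_by_indexer(path, threshold=SMALL_DOC_THRESHOLD_BYTES):
    """Return True if `path` is a .doc file small enough to be at risk.

    Hypothetical helper: it looks only at the file extension and size,
    not at the internal structure of the document.
    """
    is_doc = path.lower().endswith(".doc")
    return is_doc and os.path.getsize(path) < threshold
```

A file flagged by this check can be re-saved from Word for Windows or extended with more text, as recommended above.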

Here are the comments on this issue from the AntiWord developers.

AntiWord Developers' Comments

This is not a bug, it is a missing feature.

Let me explain.
Inside a Word file the text is stored in a so-called text stream. There are
two possible text streams: a small block text stream and a large block text
stream. The small blocks are 64 bytes in size, the large blocks are 512
bytes in size. Because of the difference in size, Antiword would need two
different methods for reading those two text streams. The method for
reading the small block text stream has not been implemented yet. The
result is that Word files with no large block text stream cannot be read by
Antiword. Such Word files are mostly smaller than about 12 kilobytes and
have less than 1024 bytes of text.

The reason for not implementing the missing feature is simple. Word
documents that use the small block text stream cannot be produced by Word
for Windows (all versions), but only by Word for Mac, and now by OpenOffice.
Note that these documents can be read by all versions of Word.

Kind Regards,
Adri van Os