How to identify MS Office File types?

Charming2060
Posts: 2
Joined: Tue Jun 23, 2009 1:27 pm

How to identify MS Office File types?

Post by Charming2060 »

Hi Everyone,

I am supposed to write (in any language) an application for Debian Linux that identifies different Microsoft Office file types (.mdb, .ppt, .xls, .doc, etc.). The problem is how to differentiate between these file types. The 'file' command in Linux only tells me that a file is a Microsoft Office file; it doesn't differentiate between the types, i.e. .ppt, .xls, .doc, .mdb.

I posted this question in this community because OpenOffice automatically selects the right application for Microsoft Office files, so OpenOffice must be using some method that can identify Microsoft Office file types.

Kindly help me with this problem. :-)

Please don't double-post the same question in different areas of the forum. I have removed your other identical thread. See: The Survival guide (TheGurkha, Moderator)
OOo 2.3.X on OTHER
TheGurkha
Volunteer
Posts: 6482
Joined: Thu Mar 13, 2008 12:13 pm
Location: North Wales, UK.

Re: How to identify office File types?

Post by TheGurkha »

Why not, in the language of your choice, just read in the filename(s) from the target folder and then check the extension part of the filename?

Which programming languages do you know? Or are you saying you don't know where to start programming this at all?
Or do you mean you have to look further (deeper into the file itself) and not rely on the extension to identify the file type?
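For what it's worth, the extension check could be sketched roughly like this in Python. The function name and the mapping are made up for illustration, and it only looks at the name, so it is fooled as soon as a file is renamed:

```python
from pathlib import Path

# Hypothetical mapping, for illustration only: extension -> description.
EXTENSION_TYPES = {
    ".doc": "Word document",
    ".xls": "Excel spreadsheet",
    ".ppt": "PowerPoint presentation",
    ".mdb": "Access database",
}


def type_from_extension(filename: str) -> str:
    """Guess the Office file type from the filename extension alone."""
    suffix = Path(filename).suffix.lower()  # e.g. ".PPT" -> ".ppt"
    return EXTENSION_TYPES.get(suffix, "unknown")
```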
Ubuntu 14.10 Utopic Unicorn, LibreOffice Version: 4.3.3.2
Gurkha Welfare Trust
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: How to identify office File types?

Post by acknak »

Here's what I see with a random .ppt (http://moosehead.cis.umassd.edu/cis552/ ... torage.ppt):
    $ file --version
    file-5.03
    magic file from /usr/share/misc/magic
    $ file 05_storage.ppt
    05_storage.ppt: CDF V2 Document, Little Endian, Os: Windows, Version 5.0, Code page: 1252, Title: Storage and File Structure, Author: Yuejun Hou, Last Saved By: CITS, Revision Number: 71, Name of Creating Application: Microsoft PowerPoint, Total Editing Time: 17:35:43, Create Time/Date: Sat Feb 17 21:47:44 2001, Last Saved Time/Date: Sat Oct 8 09:48:17 2005, Number of Words: 2705
That looks like a pretty thorough report to me.

What do you see for that document?

Different versions of file, files saved by different versions of MS Office, and files saved in different formats will all give different results.

You can study the "magic" file and its manual page ("man 5 magic") to find out what signatures and locations "file" is looking at to extract the information.
AOO4/LO5 • Linux • Fedora 23
Charming2060
Posts: 2
Joined: Tue Jun 23, 2009 1:27 pm

Re: How to identify office File types?

Post by Charming2060 »

@Gurkha

Well, I mean identifying the type of MS Office file based on its contents, not the extension, because the extension may have been changed. :0)

@acknak

The problem is that the 'file' command can't differentiate between the different MS Office file types (.doc, .xls, .ppt, .mdb); it just says that this is an MS Office file for all of them. :( I am supposed to write a program (no restriction on language) that can differentiate between MS Office files. Your suggestions would be quite helpful. :0)
OOo 2.3.X on OTHER
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: How to identify office File types?

Post by acknak »

Maybe I'm misunderstanding something, but it sure looks to me like file can distinguish between different types of MS Office documents. Are you by chance referring to the new Office 2007 XML document formats?
AOO4/LO5 • Linux • Fedora 23
TheGurkha
Volunteer
Posts: 6482
Joined: Thu Mar 13, 2008 12:13 pm
Location: North Wales, UK.

Re: How to identify office File types?

Post by TheGurkha »

If you don't want to use file, then you need to find the specifications of the file formats from somewhere, a Microsoft forum perhaps.

Once you have the specifications of the headers, write a program to open the file, read in part of the header, determine the file type from the bytes contained in the header, report that to the user, and then close the file again.
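
As a rough illustration of that approach, here is a minimal Python sketch. It rests on a few assumptions that are not from this thread: the legacy .doc, .xls and .ppt formats share the OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1) and differ in the stream names stored inside the container, while .mdb files start with a "Standard Jet DB" header. The function name is hypothetical, and instead of parsing the container's directory properly it simply searches the raw bytes for the UTF-16LE stream names, so treat it as a heuristic rather than a definitive implementation. It does not cover the Office 2007 XML formats.

```python
from pathlib import Path

# Assumed signatures (not taken from the thread):
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")   # compound file header
JET_MAGIC = b"\x00\x01\x00\x00Standard Jet DB"   # Access .mdb header

# Stream names stored inside the compound file, checked in this order.
# "Book" is the older Excel 5/95 name and is tried last because it is
# the most likely to match by accident.
STREAM_MARKERS = [
    ("doc", "WordDocument"),
    ("ppt", "PowerPoint Document"),
    ("xls", "Workbook"),
    ("xls", "Book"),
]


def identify_office_type(data: bytes) -> str:
    """Guess the legacy MS Office type from the file contents."""
    if data.startswith(JET_MAGIC):
        return "mdb"
    if not data.startswith(OLE2_MAGIC):
        return "unknown"
    # Heuristic: directory entry names are UTF-16LE, so look for the
    # encoded stream names anywhere in the file.
    for kind, name in STREAM_MARKERS:
        if name.encode("utf-16-le") in data:
            return kind
    return "ole2-unknown"


def identify_office_file(path: str) -> str:
    """Read a file from disk and classify it by content."""
    return identify_office_type(Path(path).read_bytes())
```

Calling identify_office_file("05_storage.ppt") should then report "ppt" even if the file has been renamed. A document with another Office document embedded in it can confuse the search, which is where walking the container's directory according to the specification would be the more reliable route.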
Ubuntu 14.10 Utopic Unicorn, LibreOffice Version: 4.3.3.2
Gurkha Welfare Trust