Page 1 of 1

Reading contents of spreadsheet - best practice

Posted: Sat Apr 19, 2008 4:05 am
by mathew
I have made my first attempt at writing some java code to extract the contents of a spreadsheet. However, I am not particularly happy with the code as it stands, so I was hoping that some people here could offer some advice.
  1. It opens up calc to display the spreadsheet. I would prefer for the code to just run in the background.
  2. It appears that a cell contains either a double or a formula.
    • Have I missed an interface?
    • Is there a method to determine what the formula represents?
  3. Dates & times appear to be represented as doubles in the same way as lotus & excel
    • I grabbed the code from Jakarta POI to convert the double to a date.
    • Is there an OOo equivalent?
  4. Is there a better way to approach this problem?
    • To be honest what I am looking for is the ODF equivalent of Jakarta POI
  5. Can use see any other obvious areas for improvement?
The java source code is available here: http://www.cahaya.com.au/ProcessTimeRecord.java. Please note, that even though it is a single class file, it depends on a large number of libraries. I have also added the file a hubung sourceforge project.

Re: Reading contents of spreadsheet - best practice

Posted: Sat Apr 19, 2008 1:31 pm
by Villeroy
4. Is there a better way to approach this problem?

* To be honest what I am looking for is the ODF equivalent of Jakarta POI
As far as I can see on the first glace, POI tries it's best to read (and write?) binary office files on a server, particulary without having MSOffice installed. You are using the office API so it is not comparable to POI.
Well, everybody loves to read MSOffice files. Since OOXML has been acknowleged as ISO standard there are strong effords to develop toolkits for the new shit.
However, there is a promising project related to ODF: http://odftoolkit.openoffice.org/. Unfortunately, it has been promising for quite a long time now. Personally I believe that Microsoft is going to win another format race.
1. It opens up calc to display the spreadsheet. I would prefer for the code to
just run in the background.
Search this forum or the old one http://www.oooforum.org/forum/search.phtml for "headless"
2. It appears that a cell contains either a double or a formula.
* Have I missed an interface?
* Is there a method to determine what the formula represents?
A cell has a formula, a value and a string.

- The formula is a string, stored within the file in English notation. It is localized in the GUI: c.s.s.table.XCell.getFormula(), for instance "3.14159", "Text Value", "=SUM(A1:B5)"

- The value is the numeric value (zero for text values, errors or blanks), c.s.s.table.XCell.getValue()

- The string is the displayed text (literal text value, formatted number or error-string), c.s.s.text.XTextRange.getString()

- Property c.s.s.sheet.SheetCell.FormulaLocal reflects the localized formula string as it is currently used in the formula bar: "3,14159" (comma), "Text Value", "=SUMME(A1:B5)" (German SUMME)

- Nested sequence (list of lists) of all formulas in a range: []string c.s.s.sheet.XCellRangeFormula.getFormulaArray()

- Nested sequence (list of lists) of all values in a range: c.s.s.sheet.XCellRangeData.getDataArray() [string, double, or Null for errors]

- The array formula of a range, for instance "{=TRANSPOSE(A1:D4)}" through c.s.s.sheet.XArrayFormulaRange.getArrayFormula(). Unfortunately, this interface has a bug. It returns the localized array-formula, for instance German "{=MTRANS(A1:D4)}". You get the correct English notation from c.s.s.table.XCell.getFormula() which is by definition the same for all cells in the array. Nother strange issue: c.s.s.sheet.XArrayFormulaRange.setArrayFormula("TRANSPOSE(A1:D4)") [English without "{=}"]
3. Dates & times appear to be represented as doubles in the same way as lotus & excel

* I grabbed the code from Jakarta POI to convert the double to a date.
* Is there an OOo equivalent?
This is where things become complicated ;)
Calc's day zero is 1899-12-30 by default. This is a per-document setting. The GUI offers 2 common alternatives 1900-01-01 (ISO-SQL) and 1904-01-01 (Quattro?). In fact this date can be any other day (founding day of my company? my birhtday?). c.s.s.NumberFormatSettings.NullDate
Excel's default is none of these. Excel's day zero defaults to 1899-12-31, 1900-01-01 is day #1. Due to the infamous leap-year bug in Excel, this one-day offset makes a difference for the first 60 days until 1900-02-28. From 1900-03-01 (day #61) both Excel and Calc map the same numbers to the same dates when using their respective defaults. See http://user.services.openoffice.org/en/ ... 490#p14490 with Excel document attached. For string conversion there is another relevant parameter: c.s.s.NumberFormatSettings.TwoDigitDateStart

Re: Reading contents of spreadsheet - best practice

Posted: Sat Apr 19, 2008 2:05 pm
by mathew
Villeroy wrote:However, there is a promising project related to ODF: http://odftoolkit.openoffice.org/. Unfortunately, it has been promising for quite a long time now. Personally I believe that Microsoft is going to win another format race.
odftoolkit does look like what I want and I did find it in my searching, but it doesn't look ready yet which is a shame.
Villeroy wrote:
3. Dates & times appear to be represented as doubles in the same way as lotus & excel

* I grabbed the code from Jakarta POI to convert the double to a date.
* Is there an OOo equivalent?
This is where things become complicated ;)
Calc's day zero is 1899-12-30 by default. This is a per-document setting. The GUI offers 2 common alternatives 1900-01-01 (ISO-SQL) and 1904-01-01 (Quattro?). In fact this date can be any other day (founding day of my company? my birhtday?). c.s.s.NumberFormatSettings.NullDate
Excel's default is none of these. Excel's day zero defaults to 1899-12-31, 1900-01-01 is day #1. Due to the infamous leap-year bug in Excel, this one-day offset makes a difference for the first 60 days until 1900-02-28. From 1900-03-01 (day #61) both Excel and Calc map the same numbers to the same dates when using their respective defaults. See http://user.services.openoffice.org/en/ ... 490#p14490 with Excel document attached. For string conversion there is another relevant parameter: c.s.s.NumberFormatSettings.TwoDigitDateStart
I was aware of the excel leap year bug, not aware of the complexities of date handling in OpenOffice.

Would the best approach be to use the c.s.s.text.XTextRange.getString() method and use java to parse the date string?
Alternatively I could look at changing the getJavaDate method in my code to accept a starting date based on c.s.s.NumberFormatSettings.NullDate but that sounds messy. Surely I cannot be the first person who wants to do this?

Re: Reading contents of spreadsheet - best practice

Posted: Sat Apr 19, 2008 2:42 pm
by Villeroy
Surely I cannot be the first person who wants to do this?
Possibly you are the first person who really cares :D Basic macros, embedded in some template or document use to rely on some predefined document settings such as language, day zero, styles, well prepared range names, sheets and links.
Things become really complicated when you need to process any spreadsheet thrown at your program. Same issues in any spreadsheet, nothing Calc specific. IMHO the main problem is that spreadsheets accept any value in all the cells, be it text, number, empty or error.
I would not recommend date conversion by strings. "YY||DD@MMM" could be a userdefined date format in any locale (foreign names of days and months plus two-digit year) or even "On a glory "dddd" in the year of the Lord "YYYY ("on a glory Sunday in the year of the Lord 1999") is a valid number format, representing a numeric value as date. The numeric conversion with known day zero is more reliable. Why not use the spreadsheet itself?
This is how one could try to handle the pitfalls in Python.

Code: Select all

[...]
class FunctionProvider(XFunctionAccess, unohelper.Base):
    def __init__(self, smgr):
        self.FunctionAccess = smgr.createInstance('com.sun.star.sheet.FunctionAccess')

    def setProperties(self, **dArgs):
        for p in dArgs.keys():
            self.FunctionAccess.setPropertyValue(p, dArgs[p])
            
    def setPropertiesByInfo(self, propInfo):
        """Sync FunctionAccess service with a document's PropertySetInfo."""
        for p in self.FunctionAccess.PropertySetInfo.getProperties():
            s = p.Name
            if propInfo.hasPropertyByName(s):
                v = propInfo.getPropertyByName(s)
                self.FunctionAccess.setPropertyValue(s, v)

    def callFunction(self, name, *args):
        return self.FunctionAccess.callFunction(name, *args)

[...]

# Adjust all FunctionProvider's properties to the corresponding properties of the respective document.
FP =  FunctionProvider(Office.smgr,)
FP.setPropertiesByInfo(myDoc.getPropertySetInfo())
Now it is possible to call built-in spreadsheet functions DATE(y;m;d), YEAR(n), MONTH(n), DAY(n), WEEKDAY(n), TEXT(n;strFormat) with the same settings as in the respective spreadsheet document. Yes, even Writer has a NullDate. callFunction supports implicit array context, so you could, for instance, retrieve many string-dates with one call:

Code: Select all

# as above:
FP =  FunctionProvider(Office.smgr,)
FP.setPropertiesByInfo(myDoc.getPropertySetInfo())

# getting an array of ISO-date strings from all values in A2:A1000
oRange = oSheet.getCellRangeByPosition(0,1,0,999) 
aData = oRange.getDataArray()
aISOStrings = FP.callFunction("TEXT", (aData, "YYYY-MM-DD"))
Not shure yet about the implications of text, blanks and errors passed to the function.

Re: Reading contents of spreadsheet - best practice

Posted: Thu Aug 14, 2008 1:47 pm
by Crls7
Aww, I thank you for your answer, but seems I can't implement it : (
I am so iliterate about all this.

Could you please give me syntax to for example print $A$2:$D$46 from current sheeet into c:/bla.txt ?


dim args3(0) as new com.sun.star.beans.PropertyValue
args3(0).Name = "ToPoint"
args3(0).Value = "$A$2"
args3(0).Value = "$A$2:$D$46"
dispatcher.executeDispatch(document, ".uno:GoToCell", "", 0, args3())


dim args4(0) as new com.sun.star.beans.PropertyValue
args4(0).Name = "ToPoint"
args4(0).Value = "$A$2:$D$46"

dispatcher.executeDispatch(document, ".uno:GoToCell", "", 0, args4())

dispatcher.executeDispatch(document, ".uno:Copy", "", 0, Array())

rem ----------------------------------------------------------------------
Dim FileNo As Integer
Dim Filename As String

Filename = "c:/bla.txt"
FileNo = Freefile

Open Filename For Output As #FileNo
Print #FileNo, <<< what should I put here so all content of $A$2:$D$46 is saved in txt file
Close #FileNo

Thank you