acknak wrote: I want to use the forum history to develop a list of FAQs. I wonder if you have any ideas for some way to mine the data. I don't want to make a research project out of this--I can make a fairly good start just off the top of my head, or by skimming some popular posts--but it seems like we should be able to use the data to get a better list that doesn't depend on my crappy memory. I'm sure you have a much better feel for what's actually in the tables than I do--any ideas? ...

First thing, you either need to have access to the tables and understand their structure (which is different in phpBB2 and phpBB3) or use a database extract. I run a periodic extract of OOoF and could do the same on usooo. What this produces is a large zipped tab-separated file (about 40 MB zipped for the current OOoF). The easiest thing to do is to download this to your local PC and process it there. To do this sort of processing you really need a language that is good at text munging, supports associative arrays and has a good regexp syntax. I use Perl, which I am very familiar with. Possible alternatives could be PHP and JavaScript. Most munging routines will only be 100 or so lines long. I typically generate outputs in CSV format to load into Calc so that you can manipulate them further there.
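As a rough illustration of the kind of munging described above, here is a minimal sketch in Python (not the Perl used for the real extracts). The column names `post_id`, `forum_name` and `post_text` are assumptions made up for the example, since the actual layout of the extract is not given here; the function names are likewise hypothetical.

```python
import csv
import io
import zipfile
from collections import Counter


def count_posts_by_forum(tsv_lines, forum_column="forum_name"):
    """Count rows per forum in a tab-separated extract that has a header row.

    forum_column is an assumed column name, not the real extract layout.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()  # plays the role of an associative array
    for row in reader:
        counts[row[forum_column]] += 1
    return counts


def write_counts_csv(counts, out_file):
    """Write the counts as CSV, largest first, ready to load into Calc."""
    writer = csv.writer(out_file)
    writer.writerow(["forum", "posts"])
    for forum, n in counts.most_common():
        writer.writerow([forum, n])


def count_posts_in_zip(zip_path, member_name):
    """Read one tab-separated member straight out of the zipped extract."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return count_posts_by_forum(text)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real extract.
    sample = [
        "post_id\tforum_name\tpost_text",
        "1\tCalc\tHow do I sum a column?",
        "2\tCalc\tUse the SUM function.",
        "3\tWriter\tPage numbering question",
    ]
    out = io.StringIO()
    write_counts_csv(count_posts_by_forum(sample), out)
    print(out.getvalue())
```

Keeping the counting separate from the file handling means the same routine can be tried on a few pasted lines before pointing it at the full extract.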
So one analysis I did was to search all Calc and Macro posts for the pattern XXXXX( where XXXXX was a valid Calc function name, and produce a breakdown of posts by Calc function. OK, you get a few false alarms, but it's still a good start.
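A search of that kind might look roughly like the following Python sketch. The short `KNOWN_FUNCTIONS` set is only a stand-in for a full list of Calc function names, and the regular expression is one plausible choice, not the one used in the original analysis.

```python
import re
from collections import Counter

# Illustrative subset only; a real run would use the full Calc function list.
KNOWN_FUNCTIONS = {"SUM", "IF", "VLOOKUP", "SUMIF", "COUNTIF", "INDEX", "MATCH"}

# An upper-case name followed (optionally after spaces) by an opening bracket.
CALL_PATTERN = re.compile(r"\b([A-Z][A-Z0-9.]*)\s*\(")


def count_posts_by_function(post_texts, known=KNOWN_FUNCTIONS):
    """Count how many posts mention each known function at least once."""
    counts = Counter()
    for text in post_texts:
        # A set, so a post using SUM( five times still counts once for SUM.
        found = {name for name in CALL_PATTERN.findall(text) if name in known}
        counts.update(found)
    return counts


if __name__ == "__main__":
    posts = [
        "Why does =SUM(A1:A3) give Err:502?",
        "Try IF(A1>0; SUM(B1:B2); 0) instead.",
        "WHAT (no function here)",
    ]
    for name, n in count_posts_by_function(posts).most_common():
        print(name, n)
```

Ordinary prose such as "IF (you want)" would still match, which is the sort of false alarm mentioned above.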
Another thing to think about is defining some 'interestingness' measures to rank posts by. Again this would be noisy, but it would still be worthwhile as a means of ranking. Factors that might count towards the interestingness score might be:
- Number of topic reads
- Some weighted measure of the total length of the posts
- Replies by 'interesting' people
- Whether the topic has been referenced in another topic.
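One way the factors above could be combined is a simple weighted sum, sketched below in Python. Everything specific here is an assumption for illustration: the field names, the weights, and the choice to take logarithms of reads and length so that very busy or very long topics do not swamp the other factors.

```python
import math
from dataclasses import dataclass


@dataclass
class TopicStats:
    # Hypothetical per-topic figures, assumed to come from an earlier pass.
    title: str
    reads: int
    total_post_chars: int
    replies_by_interesting_people: int
    referenced_elsewhere: bool


def interestingness(topic, w_reads=1.0, w_length=0.5, w_people=2.0, w_ref=3.0):
    """Weighted sum of the four factors; the weights are arbitrary guesses."""
    return (
        w_reads * math.log1p(topic.reads)
        + w_length * math.log1p(topic.total_post_chars)
        + w_people * topic.replies_by_interesting_people
        + (w_ref if topic.referenced_elsewhere else 0.0)
    )


def rank_topics(topics):
    """Return the topics ordered from most to least interesting."""
    return sorted(topics, key=interestingness, reverse=True)


if __name__ == "__main__":
    topics = [
        TopicStats("Quiet one-liner", 50, 300, 0, False),
        TopicStats("Long-running Calc thread", 5000, 4000, 3, True),
    ]
    for t in rank_topics(topics):
        print(f"{interestingness(t):6.2f}  {t.title}")
```

The weights would need tuning by eye against topics already known to be good FAQ candidates.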