Data mining Forum Knowledge

TerryE
Volunteer
Posts: 1401
Joined: Sat Oct 06, 2007 10:13 pm
Location: UK

Data mining Forum Knowledge

Post by TerryE »

acknak wrote: I want to use the forum history to develop a list of FAQs. I wonder if you have any ideas for some way to mine the data. I don't want to make a research project out of this (I can make a fairly good start just off the top of my head, or by skimming some popular posts), but it seems like we should be able to use the data to get a better list that doesn't depend on my crappy memory. I'm sure you have a much better feel for what's actually in the tables than I do. Any ideas? ...
First thing: you either need access to the tables and an understanding of their structure (which differs between phpBB2 and phpBB3), or you use a D/B extract. I run a periodic extract of OOoF and could do the same on usooo. This produces a large zipped tab-separated file (about 40 Mbytes zipped for the current OOoF). The easiest thing to do is to download this to your local PC and process it there. For this sort of processing you really need a language that has good text munging, supports associative arrays and has a good regexp syntax. I use Perl, which I am very familiar with. Possible alternatives could be PHP and JavaScript. Most of the munging routines will only be 100 or so lines long. I typically generate outputs in CSV format to load into Calc, so that they can be manipulated further there.

So one analysis I did was to search all Calc and Macro posts for "XXXXX(" (where XXXXX is a valid Calc function name) and produce an analysis of posts by Calc function. OK, you get a few false alarms, but it's still a good start.
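A minimal sketch of that kind of tally, under stated assumptions: the whitelist below holds only a handful of Calc function names for illustration, the sub name count_function_posts is hypothetical, and the two sample posts are made up. It counts each function at most once per post, which may or may not match the original analysis.

```perl
use strict;
use warnings;

# Hypothetical whitelist; a real run would list every Calc function name.
my %is_calc_function = map { $_ => 1 } qw(sum vlookup if countif sumproduct);

# Return a hash ref mapping function name => number of posts that mention it.
sub count_function_posts {
    my (@posts) = @_;
    my %posts_per_function;
    for my $src (@posts) {
        my %seen;                 # count each function at most once per post
        my $text = lc $src;
        # Any word followed by "(" is treated as a candidate function call.
        while ($text =~ /(\w+)\s*\(/g) {
            next unless $is_calc_function{$1};
            $posts_per_function{$1}++ unless $seen{$1}++;
        }
    }
    return \%posts_per_function;
}

# Made-up sample posts, for illustration only.
my $counts = count_function_posts(
    'Try =SUM(A1:A9) or =VLOOKUP(B1;D1:E9;2;0)',
    'My macro calls print ("hello") and =sum(B:B)',
);

# Emit CSV, most-mentioned function first, ready to load into Calc.
print "function,posts\n";
for my $fn (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$fn,$counts->{$fn}\n";
}
```

A name such as "if" shows where the false alarms come from: macro code containing "if (" is counted just like a Calc IF() call.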

Another thing to think about is defining some 'interestingness' measures for ranking posts. Again this would be noisy, but it would still be worthwhile as a means of ranking. Factors that might count towards the interestingness quotient are:
  • number of topic reads
  • some weighted length of the total posts
  • replies by 'interesting' people
  • whether the topic has been referenced in another topic.
I could go on, but am I on the right lines?
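One way such a score could be put together is sketched below. Everything specific in it is an assumption: the weights are arbitrary, the field names (reads, length, interesting_replies, back_references) are hypothetical, the logarithmic damping is just one plausible choice, and the sample figures are invented.

```perl
use strict;
use warnings;

# Hypothetical weights; they would need tuning against topics known to be useful.
my %weight = (
    reads               => 1.0,
    length              => 0.5,
    interesting_replies => 2.0,
    back_references     => 3.0,
);

# Score one topic from a hash ref of raw counts. The logarithm damps the
# heavy-tailed counts so that one huge topic does not swamp the ranking.
sub interestingness {
    my ($topic) = @_;
    my $score = 0;
    for my $factor (keys %weight) {
        $score += $weight{$factor} * log(1 + ($topic->{$factor} // 0));
    }
    return $score;
}

# Invented sample figures, for illustration only.
my @topics = (
    { title => 'Calc Examples',   reads => 20000, length => 5000,
      interesting_replies => 4,   back_references => 146 },
    { title => 'Printer problem', reads => 150,   length => 400,
      interesting_replies => 0,   back_references => 0 },
);

# Print the topics as CSV, highest score first.
for my $topic (sort { interestingness($b) <=> interestingness($a) } @topics) {
    printf "%.2f,%s\n", interestingness($topic), $topic->{title};
}
```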
Ubuntu 11.04-x64 + LibreOffice 3 and MS free except the boss's Notebook which runs XP + OOo 3.3.
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: Data mining Forum Knowledge

Post by acknak »

Great! Thanks for putting this up--I just didn't think anyone would be interested in discussing it.

Just at first glance, I would think the simplest metrics would be view count and back references (which I guess is also the core of Google's index, no?). The problem we have with the forum data is that it's not so easy to find and link back to an older discussion, so people often just answer the question again.

Maybe I should just start with the OOo FAQs and evaluate each one to see if it really is an FAQ. I can publish that and then add more from the collective memory.

How do you access the data tables from Perl? Simple library calls? No, you're probably running a local db server and use DBI to throw SQL at that--or something else?
AOO4/LO5 • Linux • Fedora 23

Re: Data mining Forum Knowledge

Post by TerryE »

As for usooo, I can use the DBI interface or mine one of the extracts that we take to back up the system.

As far as OOoF goes, we still haven't finalised interactive system access with Ed, so I use web-crawl technology to unload the system. The last unload was updated about a month ago. See http://files.ellisons.org.uk/OOo/bbcode4.zip for a working example. I process this with simple file manipulation. For example:

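Code:

# Q&D to do some posting analysis
use strict;
use warnings;

# First argument: the extract file; second argument: a count
my $file = shift or die "No extract file given\n";
open(my $bbc, '<', $file) or die "Cannot open $file: $!\n";
my $cnt = shift;
while (<$bbc>) {
  my %wc;   # reset for every post
  chomp;
  # One post per line, tab-separated
  my ($topicNo,$forumNo,$topicTitle,$poster,$postNo,$postDate,$postSubj,$src) = split /\t/;
  $src = lc $src;
  # Pick out every word followed by "(", i.e. anything shaped like a function call
  while ($src =~ /(\w+)\s*\(/g) {
    ...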
As I said, I usually spit out the analysis in CSV to load and post-manipulate in a spreadsheet. Download that extract and have a look. Be aware that it's 40 Mbytes. Also, UTF-8 and other extended characters are still escaped in HTML format, and <cr> is coded as a bbcode br pseudo tag.
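If the escapes get in the way of an analysis, they could be undone with something like the sketch below. It rests on two assumptions that should be checked against the extract itself: that the line-break pseudo tag is spelled [br], and that escaped characters appear as numeric entities such as &#233; or &#xE9;. Named entities (&amp; and friends) are deliberately left alone, and the sub name unescape_post is hypothetical.

```perl
use strict;
use warnings;

# Turn the assumed [br] pseudo tag back into newlines and decode numeric
# HTML character references (decimal and hexadecimal) into characters.
sub unescape_post {
    my ($src) = @_;
    $src =~ s/\[br\]/\n/gi;
    $src =~ s/&#x([0-9a-f]+);/chr(hex $1)/gie;
    $src =~ s/&#(\d+);/chr($1)/ge;
    return $src;
}

# Made-up sample text; output is encoded as UTF-8 for the terminal.
binmode STDOUT, ':encoding(UTF-8)';
print unescape_post('Caf&#233; menu[br]Na&#xEF;ve r&#233;sum&#233;'), "\n";
```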

Re: Data mining Forum Knowledge

Post by TerryE »

Acknak, I've just done another extract of OOoF, which I will be uploading to http://files.ellisons.org.uk/OOo/bbcode5.zip. It takes a while to upload, but it should be there in about an hour. Pull it down, have a play and tell me what you think. If you want, I can give you a ring and talk through some of the options, but I will need to do it in the evening UK time.

Re: Data mining Forum Knowledge

Post by acknak »

Well, here's a start:

The top ten OOoForum topics linked to from other threads:
Count  TopicNo  Title
  181    50862  Tutorial for Spell check and Language configuration
  175     3772  How to convert Word -> PDF from the command line
  146     4996  Calc Examples
  130    47619  Rough step-by-step instructions reporting bugs
   98     9815  Using COM for OOo with different languages
   83     3549  Here is complete list of Filter names in OOo 1.1rc1
   80     6049  Writer Examples
   66     6429  Convert ASCII text by eliminating extra paragraph breaks
   64     7995  How to install a macro found here.
   63    11890  using openoffice headless ( macro, shell, php )
...
   63     8833  ???
   23     4490  ???
   19     7673  ???
   16     2735  ???
... (73 more)

Any idea what's happening with those last links?

In some cases (e.g. 8833), the topic has apparently been deleted; anyway, it's not there now. The others, however, are accessible at oooforum, but are not in the extracted data.

There are 5817 links from a total of 246504 posts, or about 1 link in every 42 posts.
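For anyone wanting to reproduce a tally like this, here is a rough sketch. It assumes the eight-column tab-separated layout from TerryE's script and assumes that links to topics look like viewtopic.php?t=N or viewtopic.phtml?t=N; the sub name count_back_references and all sample records are made up. Links from a topic to itself are skipped, since the list is about links from other threads. Target topics that are absent from the extract come out with a ??? title.

```perl
use strict;
use warnings;

# Tally links per target topic. Each record is one tab-separated line of the
# extract. Returns (links per topic, title per topic) as two hash refs.
sub count_back_references {
    my (@lines) = @_;
    my (%links_to, %title_of);
    for my $line (@lines) {
        chomp $line;
        my ($topicNo, $forumNo, $topicTitle, $poster, $postNo, $postDate, $postSubj, $src)
            = split /\t/, $line, 8;
        next unless defined $src;
        $title_of{$topicNo} //= $topicTitle;
        # Assumed link shape: viewtopic.php?t=N or viewtopic.phtml?t=N,
        # optionally with other parameters before t=.
        while ($src =~ /viewtopic\.(?:php|phtml)\?(?:[^\s"\]]*?&(?:amp;)?)?t=(\d+)/g) {
            $links_to{$1}++ unless $1 == $topicNo;   # skip links within the same topic
        }
    }
    return (\%links_to, \%title_of);
}

# Made-up sample records; posters, dates and text are purely illustrative.
my @extract = (
    join("\t", 7995, 9, 'How to install a macro found here.', 'alice', 1, '2004-05-01',
         'How to install a macro found here.', 'Open Tools > Macros ...'),
    join("\t", 60001, 2, 'Macro will not run', 'bob', 1, '2007-03-02',
         'Macro will not run', 'See http://www.oooforum.org/forum/viewtopic.phtml?t=7995'),
    join("\t", 60002, 2, 'PDF from a script', 'carol', 1, '2007-03-03',
         'PDF from a script', 'Try viewtopic.phtml?t=3772 and viewtopic.phtml?t=7995'),
);

my ($links_to, $title_of) = count_back_references(@extract);
print "Count,TopicNo,Title\n";
for my $target (sort { $links_to->{$b} <=> $links_to->{$a} || $a <=> $b } keys %$links_to) {
    printf "%d,%d,%s\n", $links_to->{$target}, $target, $title_of->{$target} // '???';
}
```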

Re: Data mining Forum Knowledge

Post by TerryE »

As to why the posts are missing, the answer is that I've found a bug in my robot utilities. Since we do not have access to the OOoF MySQL D/B, I use exactly the same approach as Google and the other forum crawlers to do the forum extract. I use a Perl script to walk viewforum.php (or phtml in the case of OOoF) for each forum and every page in that forum. I then parse the HTML to extract the table content, that is, the list of topics per forum.

To avoid overloading the forum, I then download only the topics that have changed since the last capture and parse the HTML to extract each post (another Perl script). I then post-process the posts to convert them back from HTML to BBCode (another Perl script). I have kept all the intermediate files, so for most bugs I can replay the process without doing a download.

I used one of the missing topics that you identified (Closing problems with OOo 1.1.1). The issue was with my delta algorithm and how it interacted with forum time-outs. I only downloaded the topics that had been updated. If I missed a topic because the forum timed out, it showed no update on the later runs, so I never picked it up. So what we have are some holes. I need to do a bit of analysis so that I can do a smart download of the missing topics. It's a bit late to do that tonight.
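A sketch of a delta selection that avoids this kind of hole is below. It is not the actual robot code; the sub names and the two hashes are hypothetical, the topic numbers are borrowed from the list above, and the timestamps are invented. The idea is to compare the forum listing against what has actually been captured, and to record a capture only once the download has succeeded, so a timed-out topic stays on the to-do list.

```perl
use strict;
use warnings;

# Decide which topics to fetch on this run.
#   $listing  : topic number => last-post timestamp, parsed from the forum index pages
#   $captured : topic number => last-post timestamp of the copy actually held
# A topic absent from %$captured (never fetched, or its fetch timed out) is
# always selected, which a pure "changed since the last run" test would miss.
sub topics_to_fetch {
    my ($listing, $captured) = @_;
    my @todo = sort { $a <=> $b }
               grep { !exists $captured->{$_} || $captured->{$_} < $listing->{$_} }
               keys %$listing;
    return @todo;
}

# Call this only after a topic's download and parse have succeeded.
sub mark_captured {
    my ($captured, $topicNo, $timestamp) = @_;
    $captured->{$topicNo} = $timestamp;
    return;
}

# Invented timestamps, for illustration only.
my %listing  = (8833 => 1_200_000_300, 4490 => 1_200_000_100, 7673 => 1_200_000_200);
my %captured = (4490 => 1_200_000_100, 7673 => 1_200_000_000);   # 8833 timed out earlier
print join(',', topics_to_fetch(\%listing, \%captured)), "\n";   # prints 7673,8833
```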

Re: Data mining Forum Knowledge

Post by TerryE »

I actually stayed up until 02:30 doing that reconciliation. I failed to download about 1,500 single-post topics and timed out on about another 50 multi-post topics. Not sure why the single-post topics were so heavily hit; that's a bit counter-intuitive. Also, I had a load of topics where the number of replies fell over the course of the snapshots. As far as I could see from a small sample of these, they had the same cause: one of the posters had at some stage fallen out with OOoF and deleted many of his posts from the topic threads. (I won't name him, but some of us will guess who.) Still, out of a total of 250,000 posts I have captured 99.5%, so that isn't a bad basis for mining for now.
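The reconciliation checks described here could be sketched roughly as follows. The sub names and snapshot hashes are hypothetical, the reply counts are invented, and the 248,750 figure is simply 99.5% of 250,000 worked backwards for the example.

```perl
use strict;
use warnings;

# Compare two snapshots (topic number => reply count) and report the topics
# whose reply count dropped, which suggests posts were deleted in between.
sub shrunken_topics {
    my ($earlier, $later) = @_;
    my @shrunk = sort { $a <=> $b }
                 grep { exists $later->{$_} && $later->{$_} < $earlier->{$_} }
                 keys %$earlier;
    return @shrunk;
}

# Share of the expected posts actually captured, as a percentage.
sub coverage_percent {
    my ($captured_posts, $total_posts) = @_;
    return 100 * $captured_posts / $total_posts;
}

# Invented reply counts, for illustration only.
my %snap1 = (101 => 12, 102 => 3, 103 => 7);
my %snap2 = (101 => 9,  102 => 3, 103 => 8);
print "Replies fell in topic(s): ", join(',', shrunken_topics(\%snap1, \%snap2)), "\n";
printf "Coverage: %.1f%%\n", coverage_percent(248_750, 250_000);
```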

Another complication is that draude got his data conversion a bit screwed up when he moved the content from the old system to the new, and managed to corrupt characters in the high UTF pages (lots of funny little black ? characters appeared in the posts). So where possible I want to collect the uncorrupted form. Sometime over the next few days, I'll tweak the algo to pull the missing pages.

Re: Data mining Forum Knowledge

Post by acknak »

Ouch. Not fun. Thanks for having a look.

This is certainly "good enough" for what I need.