I've told my teacher that in the long run I want to implement all 4 scenarios, but what I do in the short term depends on what he expects. He told me to implement 2 of my choice.
For the first one I chose spam classification, because it seemed trivial... Well... Let's see what happened.
I asked for the full flexwiki.com repository on the FlexWikiMailingList, and CraigAndera was kind enough to give it to me within a few hours: a 50 MB zip file with 36 471 files, of which 1564 are actual wiki pages and the rest are older versions (including those of deleted topics).
I wrote an Excel macro to read in the filenames and determine which versions are spam and which are not:
If a version of a topic contains the single word "delete", the topic was deleted at that point, so the previous version was spam.
(1) I assume that if a topic was modified and not deleted, then it wasn't spam, so the version before the latest is a non-spam version.
So I got 2885 spam and 1258 notspam samples. With another Excel macro, I copied these into a spam and a notspam folder.
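A minimal sketch of the two labelling rules above, in Python rather than the Excel macro actually used. The input layout (a dict mapping each topic to its version texts, oldest first) and the function names are assumptions for illustration only; the real macro worked on filenames in the repository dump.

```python
def is_delete_marker(text):
    # A version consisting of the single word "delete" marks a deleted topic.
    return text.strip().lower() == "delete"


def label_versions(versions):
    """Label versions as 'spam' or 'notspam'.

    versions: dict of topic name -> list of version texts, oldest first
    (hypothetical layout). Returns a list of (topic, version_index, label).
    """
    labelled = []
    for topic, texts in versions.items():
        last = len(texts) - 1
        for i in range(1, len(texts)):
            previous = texts[i - 1]
            if is_delete_marker(previous):
                continue  # a delete marker is not a sample itself
            if is_delete_marker(texts[i]):
                # Deleted right afterwards: the previous version was spam.
                labelled.append((topic, i - 1, "spam"))
            elif i == last:
                # Assumption (1): the latest version is a normal edit,
                # so the version before it is taken as non-spam.
                labelled.append((topic, i - 1, "notspam"))
    return labelled
```

With this reading, a topic contributes at most one notspam sample but may contribute several spam samples, one for each time it was deleted.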
And now, I thought, I'm almost ready: find a classification tool, train it and enjoy the results. NOPE.
http://kt.ijs.si/Dunja/textgarden/ seemed to do the trick, but some of the links are broken, and the others work but the program itself doesn't. No source code.
http://www.cs.cmu.edu/~mccallum/bow/ another contestant, but "The library does not: Claim to be finished. Have good documentation. Claim to be bug-free." I spent a day trying to compile it on Cygwin, without success.
SVMLight: http://svmlight.joachims.org/ now this is a classifier, but an abstract one: it doesn't read text files, you have to give it the feature-value pairs yourself. I didn't want to do this.
There is nice javadoc, but no specification or howto, so I spent more than a day finding out what it does and how to run it. I'm considering writing a short spec and howto and sending it back to the author...
LOL: TCT writes out a matrix to a text file, and because of my Hungarian locale, Java put , instead of . in the float values. When I wanted to use the file, TCT threw a "not valid float value" exception... So I had to replace them back. BUT! In a 2 MB text file it took Notepad 2 seconds to replace one occurrence. Even Notepad++ and Word died. So I had to write a script that did it line by line...
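A line-by-line replacement like the one described can be sketched as follows. This is not the original script; it assumes that in the matrix file a comma between two digits is always a decimal separator (the exact file layout is not shown here), and the function names are made up for the example.

```python
import re

# A comma with a digit on both sides is treated as a decimal comma.
# Assumption: the file never uses commas as field separators between numbers.
DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")


def fix_decimal_commas(lines):
    """Yield each line with decimal commas turned into decimal points."""
    for line in lines:
        yield DECIMAL_COMMA.sub(".", line)


def fix_file(src_path, dst_path, encoding="utf-8"):
    """Stream src_path to dst_path one line at a time, fixing the floats."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(fix_decimal_commas(src))
```

Because the file is streamed, memory use stays flat regardless of file size, which avoids the editor problem above.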
Right now, I've parsed my training data with TCT and written a VBScript to convert it into an SVMLight input file.
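For illustration, a conversion of this kind might look like the sketch below (in Python, not the VBScript actually used). It assumes the parsed data is available as (doc_id, term_id, weight) triples plus a label per document; that layout is an assumption, not TCT's documented output. SVMLight expects one line per document in the form `<target> <feature>:<value> ...`, with feature numbers in increasing order.

```python
from collections import defaultdict


def to_svmlight_lines(triples, labels):
    """Build SVMLight input lines.

    triples: iterable of (doc_id, term_id, weight)  (assumed layout)
    labels:  dict of doc_id -> +1 (spam) or -1 (notspam)
    """
    rows = defaultdict(dict)
    for doc_id, term_id, weight in triples:
        # Keep the order explicit: documents are rows, terms are features.
        rows[doc_id][term_id] = weight

    lines = []
    for doc_id in sorted(rows):
        # Features must appear in increasing feature-number order.
        features = " ".join(
            f"{term_id}:{weight:g}"
            for term_id, weight in sorted(rows[doc_id].items())
        )
        lines.append(f"{labels[doc_id]:+d} {features}")
    return lines
```

Unpacking the triples by name makes it harder to mix up which ID is the document and which is the term.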
Before trying it, I looked at the files and realized that the "bag of words" model with stemming is not good for wiki content, mainly because of the WikiWords; a modified split algorithm would solve that problem. But there are links too, and these are crucial in detecting spam. (2) So my heuristic is that an n-gram model would be much better. So I started a new parsing run with n-grams and got 15000 terms. I started another run with dimension cutting (a term must occur in at least 10 docs) and got 7700 terms; maybe this will be good.
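The two ideas in this paragraph can be sketched roughly as below. This is not what TCT does internally; it assumes character n-grams with n = 3 (the n-gram type and size are not stated above), and the splitting rule for WikiWords is one simple possibility. Only the threshold of 10 documents comes from the text.

```python
import re
from collections import Counter


def split_wikiword(word):
    """Split a WikiWord at capital letters, e.g. for bag-of-words parsing."""
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", word)


def char_ngrams(text, n=3):
    """Return the character n-grams of text (lowercased, whitespace collapsed)."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(docs, n=3, min_docs=10):
    """Keep only n-grams that occur in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): count each n-gram once per document, not once per occurrence.
        doc_freq.update(set(char_ngrams(doc, n)))
    return sorted(g for g, df in doc_freq.items() if df >= min_docs)
```

Character n-grams also cover URLs without any special tokenizing, which is why they are attractive when links matter.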
Aham, precision 38%. This is not good. What's the problem? The notspam training data contains lots of spam, so (1) is not true. I'll now say: a topic that is very long and not deleted is not spam. Also, too many modifications mean a play page, so I'll filter those out too. Back to the Excel macro.
Precision for the BOW training file is 45%, so maybe (2) isn't true either.
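For reference, and assuming "precision" is meant in the usual sense (of all samples predicted as spam, the fraction that really are spam), it can be computed as in this small sketch. The function name and the +1/-1 label encoding are assumptions for the example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A precision below 50% on a two-class problem means most of the samples flagged as spam were actually labelled notspam, which is consistent with noisy labels.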
OK, found the bug: I used the termID-docID pairs as docID-termID.
I used Access to swap docID and termID, but it's still not good...
Status report 2006.12.30.
I gave up on the SVMLight spam/notspam classification for now (maybe the bag-of-words vector doesn't correlate with usefulness? Or I did something wrong? Or the training data is too noisy?). I thought it would be easier. Of course I made a few more attempts before giving up:
http://tcatng.sourceforge.net/ seemed good, but not very elegant: the download link points nowhere. I checked out the source from CVS, compiled it, and saw that the command-line tool doesn't do anything. As far as I can see, it was designed for classifying languages based on n-grams, nothing more.
http://rapid-i.com/ is a free, full data mining solution with text processing and clustering toolkits. I didn't find a straightforward tutorial for text classification, and the whole software seemed very complicated, so I left it.
Returning to textgarden, I found that if I download the zip files instead of just running them, I don't get corrupted files. So I got the parser and trainer programs; the parser works well (with a few tricks), but the trainers exit with an unspecified error. But no problem, because the classifiers' links are broken on the web page, so even if I had models, I couldn't do anything with them.
But seeing that the parser works, I went on to clustering with textgarden. There are two clustering programs: one does hierarchical binary clustering, the other a flat one. And they can produce XML output, which made me very happy. And they work, and they are fast. So I wrote XSLs to make use of them.
First, for the hierarchical clustering, I tried to make a sitemap of flexwiki.com with indented lists, then with nested tables. Neither is informative, because the page is too big. Maybe I'll write JavaScript for opening and closing the hierarchy tree.
Second, the flat clustering: I made 400 clusters from the 1500 topics, to give users "see also" tips.
With XSL, I created the _ClusteringResults page, which WikiTalk scripts can read.
For some topics it gives horrible results; for AdministratorsGuide it looks OK.
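The "see also" idea for the flat clustering can be sketched as below. This is an illustration in Python, not the XSL/WikiTalk implementation; it assumes the clustering result has already been reduced to a topic-to-cluster mapping, and the cap of 5 tips per topic is made up.

```python
from collections import defaultdict


def see_also(assignments, max_tips=5):
    """Map each topic to other topics from the same cluster.

    assignments: dict of topic name -> cluster id (assumed layout)
    max_tips:    hypothetical cap on the number of suggestions per topic
    """
    members = defaultdict(list)
    for topic, cluster in assignments.items():
        members[cluster].append(topic)

    tips = {}
    for topic, cluster in assignments.items():
        others = sorted(t for t in members[cluster] if t != topic)
        tips[topic] = others[:max_tips]
    return tips
```

With 400 clusters for about 1500 topics, a cluster holds fewer than four topics on average, so many topics will get only a handful of suggestions and some will get none.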
Today was a very long, tiring and not too useful day... At least I refreshed my Eclipse and XSLT knowledge.
SzaMa's project for enhancing FlexWiki with text mining algorithms
A border element that displays the similar topics, based on automatic clustering. See: TextMiningProject
Marcell Szabó, student in computer sciences at bme.hu