14 Jun
The basic idea here is to build a classification system based on self-organizing map (SOM) algorithms that can be used to pick “interesting” articles from sites like delicious, diigo, magnolia, and lilisto (I have a partial list of possible sites here).
There are currently several parts to this idea, and it’s quite possible that it could provide material for several project-length experiments.
Can I build a classifier which rates documents on how closely they match your interests, based on placing them in a self-organizing map which uses keywords to position each document? I have already built an algorithm which applies GHSOM (Growing Hierarchical Self-Organizing Map) to the relationships between the keywords that were applied to documents, so the task here is mainly to see how useful this information is for mapping additional documents and for classifying them by interest.
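As a rough illustration of the question above, here is a minimal sketch of rating a document by where its keywords land on a map. It uses a plain fixed-size SOM in NumPy rather than GHSOM, and everything in it (the function names, the tiny vocabulary, the `1 / (1 + distance)` score) is an assumption made for the example, not the algorithm I have already built.

```python
import numpy as np

def keyword_vector(keywords, vocabulary):
    """Binary bag-of-keywords vector over a fixed vocabulary (hypothetical encoding)."""
    vec = np.zeros(len(vocabulary))
    for kw in keywords:
        if kw in vocabulary:
            vec[vocabulary[kw]] = 1.0
    return vec

def best_matching_unit(weights, x):
    """Grid index (row, col) of the unit whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.1     # neighbourhood shrinks
            bmu = best_matching_unit(weights, x)
            dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def interest_score(weights, x):
    """Score in (0, 1]; higher means x sits closer to some unit of the map."""
    d = np.linalg.norm(weights - x, axis=-1).min()
    return 1.0 / (1.0 + d)

# Toy data: keywords on documents I bookmarked (made up for the example).
vocabulary = {kw: i for i, kw in enumerate(
    ["python", "som", "clustering", "neural-networks", "cooking", "travel", "recipes"]
)}
liked = [
    ["python", "som"],
    ["som", "clustering"],
    ["neural-networks", "som"],
    ["python", "clustering", "neural-networks"],
]
data = np.array([keyword_vector(k, vocabulary) for k in liked])
weights = train_som(data)

similar = keyword_vector(["som", "clustering"], vocabulary)
unrelated = keyword_vector(["cooking", "travel", "recipes"], vocabulary)
print(interest_score(weights, similar), interest_score(weights, unrelated))
```

The map is trained only on documents assumed to be interesting, so a new document whose keywords fall far from every unit gets a low score. Whether that distance is actually a useful proxy for interest is exactly what the experiment would need to show.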
The next steps:
Is this method more effective when using keywords generated by actual people than when it uses machine-generated keywords? There are many existing document summarizing and keyword extraction algorithms, and even commercial products (e.g. brevity intellexer). One or more of these could be run on the document to extract keywords instead of using the human-generated keywords available on delicious et al. This would make the algorithm capable of analyzing “any” document, and would reduce its dependency on the websites mentioned earlier. Although this seems important, it may be of limited use: the intent is to classify interesting documents from an incoming “stream” of documents, and currently my “stream” comes from the same sites the keywords come from.
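To make "machine-generated keywords" concrete, here is a deliberately naive sketch that picks the most frequent non-stopword terms from a document's text. It is only a stand-in for the real summarizing and extraction tools mentioned above. The stopword list, the `extract_keywords` name, and the tokenizing regex are all assumptions for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for",
    "on", "with", "that", "this", "as", "are", "be", "by", "can",
}

def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text."""
    # Lowercase words of two or more characters; hyphenated words stay whole.
    words = re.findall(r"[a-z][a-z\-]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

text = (
    "Self-organizing maps cluster documents. "
    "Maps place similar documents near each other on the map."
)
machine_keywords = extract_keywords(text, top_n=2)
print(machine_keywords)
```

The resulting list could be used wherever human-assigned tags would otherwise go, which would allow a side-by-side comparison of the two keyword sources on the same documents.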
The most important open question (to me) is whether this idea is original enough to work as a thesis at RIT (as opposed to becoming a project). If it’s not, I’m leaning toward working on a different project which is somewhat more interesting to me.
However, there are several other open questions: