The basic idea is to build a classification system based on self-organizing map (SOM) algorithms that can pick “interesting” articles from sites like delicious, diigo, magnolia, and lilisto (I have a partial list of possible sites here).

There are currently several parts to this idea, and it could well be fodder for several project-length experiments.

The first question

Can I build a classifier that rates documents by how closely they match your interests, by placing them in a self-organizing map that uses keywords to position each document? I have already built an algorithm which applies GHSOM to the relationships between the keywords applied to documents, so the task here is mainly to see how useful this information is for mapping additional documents and for classifying them by interest.

The next steps:

  1. Apply the algorithm to individual documents and see where they are placed
  2. Determine the area of the map that represents the user’s interests (either by inference from having them rate documents, or by directly “circling” on the map their area(s) of interest)
  3. Rate documents by their (multi-dimensional) proximity to these areas.
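
As a rough illustration of the three steps above, here is a minimal Python sketch. It is not the GHSOM implementation mentioned earlier: it assumes a small, flat, already-trained map, and the vocabulary, the unit weights, the function names, and the scoring formula (1 / (1 + distance)) are all hypothetical choices made only for this example.

```python
import math

# Hypothetical keyword vocabulary; each map unit holds one weight per keyword.
VOCAB = ["python", "som", "cooking", "travel"]

# Hypothetical, already-trained 2x2 map: grid position -> weight vector.
SOM_UNITS = {
    (0, 0): [0.9, 0.8, 0.0, 0.1],
    (0, 1): [0.7, 0.1, 0.1, 0.2],
    (1, 0): [0.1, 0.0, 0.9, 0.3],
    (1, 1): [0.0, 0.1, 0.4, 0.9],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def keyword_vector(keywords):
    """Encode a document's keywords as a binary vector over VOCAB."""
    present = set(keywords)
    return [1.0 if word in present else 0.0 for word in VOCAB]


def best_matching_unit(vector):
    """Step 1: grid position of the unit whose weights are closest to the vector."""
    return min(SOM_UNITS, key=lambda pos: euclidean(SOM_UNITS[pos], vector))


def interest_score(keywords, interest_cells):
    """Step 3: score in (0, 1]; 1.0 means the document lands in an interest cell."""
    bmu = best_matching_unit(keyword_vector(keywords))
    nearest = min(euclidean(bmu, cell) for cell in interest_cells)
    return 1.0 / (1.0 + nearest)


# Step 2: the user "circles" the cell(s) of the map that hold their interests.
interests = {(0, 0)}
print(interest_score(["python", "som"], interests))
print(interest_score(["travel"], interests))
```

Measuring proximity on the grid rather than in keyword space is only one option; with a hierarchical map the distance would also have to account for which sub-map a document falls into.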

The second question

Is this method more effective when using keywords generated by actual people than when using machine-generated keywords? There are many existing document summarizing and keyword extraction algorithms, and even commercial products (e.g. brevity intellexer). One or more of these could be run on the document to extract keywords instead of using the human-generated keywords available on delicious et al. This would make the algorithm more capable of analyzing “any” documents, and would reduce dependency on the websites mentioned earlier. Although this seems important, it may be of limited use: the intent is to classify interesting documents from an incoming “stream” of documents, and currently my “stream” comes from the same sites the keywords come from.

The next steps:

  1. Create a collection of documents with their human-generated keywords
  2. Run machine summarizing algorithms on these documents
  3. Compare the resulting mappings for relevancy (what is the metric here?)
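
The metric is still an open question. One possible starting point, sketched below, is to measure how far apart the same document lands when it is positioned by its human keywords and by its machine keywords. This measures agreement between the two mappings rather than relevancy itself, and it assumes both keyword sources are placed on the same trained map with two-dimensional grid positions. The function name and the sample placements are hypothetical.

```python
import math


def placement_agreement(human_positions, machine_positions):
    """Compare two mappings of document id -> (row, col) on the same map.

    Returns a tuple: (fraction of shared documents placed on the same unit,
    mean grid distance between the two placements).
    """
    shared = sorted(set(human_positions) & set(machine_positions))
    if not shared:
        raise ValueError("no documents in common")
    distances = [
        math.hypot(
            human_positions[doc][0] - machine_positions[doc][0],
            human_positions[doc][1] - machine_positions[doc][1],
        )
        for doc in shared
    ]
    same_unit = sum(1 for d in distances if d == 0) / len(shared)
    return same_unit, sum(distances) / len(shared)


# Hypothetical placements for three documents.
human = {"doc1": (0, 0), "doc2": (1, 1), "doc3": (2, 0)}
machine = {"doc1": (0, 0), "doc2": (1, 2), "doc3": (0, 0)}
print(placement_agreement(human, machine))
```

A low mean distance would suggest the machine keywords are a usable substitute; whether either mapping is actually relevant to the user would still need rated documents to check against.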

Additional questions

The most important open question (to me) is whether this idea is original enough to work as a thesis at RIT (as opposed to becoming a project). If it’s not, I’m leaning toward working on a different project which is somewhat more interesting to me.

However, there are several other open questions:

  • Is GHSOM better than a non-hierarchical growing SOM, or even a simple SOM algorithm, for this task? (Instinctively, it seems that the key requirement is that the map size must be inferred, and thus that a growing algorithm is required, but the hierarchy may be unnecessary.)
  • How does this system using free keywords (anything can be a keyword, including the user-name of the person who creates the keywords) compare to a system which has set categories? It seems that the classifier would have much less adapting to do when categories are limited, since in the current system new keywords are constantly being added to the database and the algorithm must infer a user’s interest in these new keywords.