Logarithmic Tag Clouds

may 30 2009 7:50 pm

Tag clouds are a pretty popular way of displaying and indexing data, but they’re not without their faults. I added a tag cloud to a helpdesk system I created, and while it worked well at first, it eventually suffered from a common tag cloud problem: one tag had been used much more frequently than the others, and the resulting cloud had one big tag and a million tiny tags. The way to solve the problem is to weight the data.
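
To make the problem concrete, here is a minimal sketch of plain linear scaling. The function name linear_sizes and the sample counts are made up for illustration; they are not from the helpdesk system.

```python
def linear_sizes(counts, minsize=0.75, maxsize=1.75):
    """Map each count linearly onto [minsize, maxsize] (font sizes in em)."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        tag: minsize + (count - lo) * (maxsize - minsize) / span
        for tag, count in counts.items()
    }


# Hypothetical data: one tag used far more often than the rest.
counts = {"printer": 200, "email": 12, "vpn": 5, "password": 3, "monitor": 1}
sizes = linear_sizes(counts)
for tag, size in sizes.items():
    print(tag, round(size, 3))
```

With data shaped like this, everything except the dominant tag ends up within a few hundredths of an em of the minimum size, which is the "one big tag" effect described above.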

A popular way of weighting the data is to apply a logarithmic scale to it. Google turns up a bunch of different ways to accomplish this, but I’m not satisfied by any of them. Most solutions rely on grouping the data into steps and performing a rough logarithm on it. In the end, the tag cloud is easier to read, but it feels a little silly to be “dumbing down” the calculation. Computers are designed to handle complex calculations.

My initial thought was to take one of those “stepped” algorithms apart and have it treat the number of steps as the difference between the maximum and minimum tag counts. I was able to get it to generate relatively correct numbers, but it forced me to treat the minimum as a special case. If I didn’t treat it as such, it produced a font size smaller than the minimum was supposed to be.

I knew there had to be a correct algorithm to draw a curve between two points, but Google wasn’t turning up anything particularly informative. So, I turned to Stack Overflow for some help. While one of the suggestions I got didn’t produce the answer I needed, it did push me in the right direction… though I must admit it took some time for the suggestion to fully sink in.

I explained how all the math works in my answer on Stack Overflow, but below you’ll find two example functions to perform the calculations: one for use in Django, one in PHP.
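
As a quick reference, the calculation that both functions perform can be written out as follows. This is read directly off the code; the symbol names (n for a tag's count, s for a font size, c for the scaling constant) are shorthand chosen here, not notation from the Stack Overflow answer.

```latex
\[
c = \frac{\ln\left(n_{\max} - n_{\min} + 1\right)}{s_{\max} - s_{\min}},
\qquad
s(n) = s_{\min} + \frac{\ln\left(n - n_{\min} + 1\right)}{c}
\]
```

Plugging in the least-used tag gives ln(1) = 0, so it lands exactly on the minimum size, and the most-used tag lands exactly on the maximum size. That is why the minimum no longer needs to be treated as a special case.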

django

This function pulls its data from Django models, so change Tag.objects.all() and tag.items.count() to fit your own models and it will do all the work for you.

from math import log
def tagcloud(threshold=0, maxsize=1.75, minsize=.75):
    """usage: 
        -threshold: Tag usage less than the threshold is excluded from
            being displayed.  A value of 0 displays all tags.
        -maxsize: max desired CSS font-size in em units
        -minsize: min desired CSS font-size in em units
    Returns a list of dictionaries of the tag, its count and
    calculated font-size.
    """
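    counts, taglist, tagcloud = [], [], []
    for tag in Tag.objects.all():
        count = tag.items.count()
        if count >= threshold:
            counts.append(count)
            taglist.append(tag)
    if not counts:
        # No tag met the threshold; max() and min() raise on an empty list.
        return tagcloud
    maxcount = max(counts)
    mincount = min(counts)
    # Shift the counts so the least-used tag maps to log(1) == 0, i.e. minsize.
    # The "or 1" avoids dividing by zero when maxsize == minsize.
    constant = log(maxcount - (mincount - 1)) / ((maxsize - minsize) or 1)
    for tag, count in zip(taglist, counts):
        if constant:
            size = log(count - (mincount - 1)) / constant + minsize
        else:
            # Every tag has the same count, so constant is 0; use minsize.
            size = minsize
        tagcloud.append({'tag': tag, 'count': count, 'size': round(size, 7)})
    return tagcloud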
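
If you'd rather try the calculation without a Django project, here is a minimal sketch of the same math as a plain function. The name log_sizes, the dict-of-counts input, and the sample data are assumptions for illustration, not part of the function above, and the sketch leaves out the threshold filtering.

```python
from math import log


def log_sizes(counts, minsize=0.75, maxsize=1.75):
    """Map each count onto [minsize, maxsize] (em units) along a log curve."""
    if not counts:
        return {}
    mincount = min(counts.values())
    maxcount = max(counts.values())
    # Shifting by (mincount - 1) makes the least-used tag evaluate to log(1) == 0.
    spread = log(maxcount - (mincount - 1))
    if spread == 0 or maxsize == minsize:
        # All counts equal, or no size range to spread over: one size for all.
        return {tag: minsize for tag in counts}
    constant = spread / (maxsize - minsize)
    return {
        tag: log(count - (mincount - 1)) / constant + minsize
        for tag, count in counts.items()
    }


# Hypothetical data: one tag used far more often than the rest.
counts = {"printer": 200, "email": 12, "vpn": 5, "password": 3, "monitor": 1}
sizes = log_sizes(counts)
for tag, size in sizes.items():
    print(tag, round(size, 3))
```

The least-used tag comes out at minsize and the most-used at maxsize, while the tags in between are spread across the range instead of being crowded near the bottom.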

php

All you need to do to use this function is feed it an array of tags in the following format: $tags = array(array('tag'=> , 'count'=> ),)

function tagcloud($tags, $threshold=0, $maxsize=1.75, $minsize=.75) {
    /* usage:
        $tags -an array of tags and their corresponding counts
               format: $tags = array(
                                     array('tag'   => tagname,
                                           'count' => tagcount),
                               );
        $threshold -Tag usage less than the threshold is excluded from
            being displayed.  A value of 0 displays all tags.
        $maxsize -max desired CSS font-size in em units
        $minsize -min desired CSS font-size in em units
       Returns an array of the tag, its count and calculated font size.
    */
    $counts = $tagcount = $tagcloud = array();
    foreach($tags as $tag) {
        if($tag['count'] >= $threshold) {
            $counts[] = $tag['count'];
            $tagcount += array($tag['tag'] => $tag['count']);
        }
    }
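    if(empty($counts)) {
        // No tag met the threshold; max() and min() need a non-empty array.
        return $tagcloud;
    }
    $maxcount = max($counts);
    $mincount = min($counts);
    $constant = log($maxcount - ($mincount - 1))/(($maxsize - $minsize)==0 ? 1 : ($maxsize - $minsize));
    foreach($tagcount as $tag => $count) {
        // $constant is 0 when every tag has the same count; fall back to $minsize.
        $size = ($constant == 0) ? $minsize : log($count - ($mincount - 1)) / $constant + $minsize;
        $tagcloud[] = array('tag'=> $tag, 'count'=> $count, 'size'=> round($size, 5));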
    }
    return $tagcloud;
}