Popfile against twitter?

OK, so I'm a newbie and I apologize in advance if this has been discussed / dismissed before, or if this is the wrong forum. That said, here goes...

I am interested in tracking the frequency of Twitter "tweets" that contain the word "sick", with a view to following the course of the current swine flu pandemic. The initial approach (not my idea, I admit) is simply to fetch from Twitter all of the "tweets" containing the word "sick" in a given time period, and to calculate the rate of such tweets per 30 seconds. (For the uninitiated, a "tweet" is a single submission to Twitter, of <= 140 bytes.) The hypothesis is that this rate will indicate the progress of the pandemic among the segment of the population that (a) uses Twitter; and (b) is not too sick to tweet. There is precedent for such an approach in the Google "Flu Trends" project.
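The rate calculation described above can be sketched roughly as follows. This is only an illustration: it assumes the tweets have already been fetched and that each one carries a timezone-aware timestamp, and the name `tweets_per_window` is made up for this example.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 30  # the 30-second period mentioned in the post


def tweets_per_window(timestamps, window=WINDOW_SECONDS):
    """Count tweets per fixed-size window.

    Returns a dict mapping each window's start (epoch seconds) to the
    number of tweets whose timestamp falls inside that window.
    Windows that contain no tweets are simply absent from the result.
    """
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        window_start = epoch - epoch % window
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical timestamps of three tweets containing "sick".
sample = [
    datetime(2009, 5, 1, 12, 0, 5, tzinfo=timezone.utc),
    datetime(2009, 5, 1, 12, 0, 20, tzinfo=timezone.utc),
    datetime(2009, 5, 1, 12, 0, 45, tzinfo=timezone.utc),
]
rates = tweets_per_window(sample)
```

Here the first two tweets land in one 30-second window and the third in the next one.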

But you can see the problem immediately. The word "sick" can be used in many ways that do not indicate illness (e.g. "I'm sick of hearing about Britney Spears." or "That's a sick rock group."). As a zeroth-order approximation, one hopes that the non-illness rate of use of "sick" remains fairly constant, while the illness usage reflects the pandemic (at least once all other illnesses dwindle into insignificance by comparison).

So: Maybe there's a better way. What if you could capture the content of the tweets and feed them to popfile, as one normally does with e-mails, and have it classify them as relevant or irrelevant based on your particular criteria?

That's what I'd like to try to do, ending up with a proportion of relevant to irrelevant tweets that can be used to correct the "tweet rate."
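One way to read that correction, as a sketch only: scale the raw rate by the fraction of classified tweets that came out relevant. The function name and the sample numbers below are invented for illustration, and the sketch assumes the classified sample is representative of all "sick" tweets in the period.

```python
def corrected_rate(raw_rate, relevant, irrelevant):
    """Scale a raw tweet rate by the fraction of tweets classified as relevant.

    raw_rate   -- tweets containing "sick" per 30 seconds
    relevant   -- number of classified tweets judged to be about illness
    irrelevant -- number of classified tweets judged not to be about illness
    """
    classified = relevant + irrelevant
    if classified == 0:
        raise ValueError("need at least one classified tweet")
    return raw_rate * relevant / classified


# Hypothetical numbers: 40 tweets per 30 seconds, 30 of 100 classified as relevant.
estimate = corrected_rate(40, relevant=30, irrelevant=70)
```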

My questions:

(1) Has anyone done this before, and would they be willing to share their experience?

(2) Can someone point me to something like a flow chart of how all the popfile parts work together, so I can start to implement this idea?

(3) will the fact that tweets are limited to 140 bytes be an insurmountable barrier to classification? (it doesn't give many words to work with. Nevertheless, humans can classify such tweets with good accuracy.)

[legal] Note that if successful, this usage of popfile's methodology could be very valuable in many applications. As it were, an artificial intelligence window into the hive mind of a generation or two. Formally, I relinquish any claim to the idea presented. I commit it to the public domain. Just show me the code, and don't try to patent it. :) /legal :)

Thanks in advance for any pointers!

Albert

  • Message #1087

    Hi Albert,

    (2) Can someone point me to something like a flow chart of how all the popfile parts work together, so I can start to implement this idea?

    You can use the XMLRPC interface of POPFile to classify messages.
    For more information about the interface, please see:

    http://getpopfile.org/docs/popfilemodules:xmlrpc
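A rough sketch of what calling that interface from Python might look like. Everything specific here is an assumption to check against the page linked above: the endpoint `http://localhost:8081/RPC2`, the method names `POPFile/API.get_session_key`, `POPFile/API.classify` and `POPFile/API.release_session_key`, and the idea that `classify` takes the path of a message file the POPFile server can read. Wrapping each tweet in a minimal mail-style message is likewise just one possible approach, and the helper names are invented for this example.

```python
import os
import tempfile
import xmlrpc.client
from email.message import EmailMessage


def connect(url="http://localhost:8081/RPC2"):
    """Create a proxy for the XML-RPC endpoint (assumed URL; no request is sent yet)."""
    return xmlrpc.client.ServerProxy(url)


def tweet_to_message(text, sender="tweet@example.invalid"):
    """Wrap a tweet in a minimal mail-style message so it resembles an e-mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = "tweet"
    msg.set_content(text)
    return msg.as_string()


def classify_tweet(server, session, text):
    """Write the tweet to a temporary file and ask the server to classify it.

    The method name contains a slash, so it is looked up with getattr
    rather than ordinary attribute access. (Assumed method name.)
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".msg", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(tweet_to_message(text))
        path = handle.name
    try:
        return getattr(server, "POPFile/API.classify")(session, path)
    finally:
        os.remove(path)


# Intended use against a running server (assumed method names, not executed here):
#   server = connect()
#   session = getattr(server, "POPFile/API.get_session_key")("admin", "")
#   bucket = classify_tweet(server, session, "home sick with the flu")
#   getattr(server, "POPFile/API.release_session_key")(session)
```

The temporary file is used because the sketch assumes the server wants a file name rather than the message text itself; if the server runs on another machine, that part would need a different design.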

    (3) will the fact that tweets are limited to 140 bytes be an insurmountable barrier to classification? (it doesn't give many words to work with. Nevertheless, humans can classify such tweets with good accuracy.)

    Maybe.
    POPFile analyzes not only the message body but also the headers of the message.
    POPFile may not classify tweets with satisfactory accuracy because they don't have such additional information.
    For more information, please see the last paragraph of this page:

    http://getpopfile.org/docs/faq:goodstatisticlaratings

    Also, POPFile uses the Naive Bayes algorithm to classify messages. This algorithm assumes that each word appears independently of the other words.
    This means that POPFile ignores combinations of words (phrases or context).
    For more information about the "Naive Bayes", please see:

    http://getpopfile.org/docs/faq:howitworks
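To make the independence assumption concrete, here is a toy Naive Bayes classifier. It is a generic textbook sketch with add-one smoothing, not POPFile's actual implementation, and the training tweets and bucket names are invented for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Collect per-bucket word counts and message counts from (text, bucket) pairs."""
    word_counts = {}
    message_counts = Counter()
    for text, bucket in examples:
        word_counts.setdefault(bucket, Counter()).update(text.lower().split())
        message_counts[bucket] += 1
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the bucket with the highest log-probability for the given text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_messages = sum(message_counts.values())
    scores = {}
    for bucket, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(message_counts[bucket] / total_messages)
        # Each word contributes on its own; word order and phrases play no role.
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[bucket] = score
    return max(scores, key=scores.get)


# Invented training data for the two buckets discussed in this thread.
training = [
    ("home sick with fever and cough", "relevant"),
    ("feeling sick fever all night", "relevant"),
    ("so sick of this song", "irrelevant"),
    ("that band is sick", "irrelevant"),
]
word_counts, message_counts = train(training)
illness = classify("sick with a fever", word_counts, message_counts)
slang = classify("sick of this band", word_counts, message_counts)
```

Because the score is a sum over individual words, shuffling the words of a tweet gives the same result, which is exactly the limitation described above: "sick of" is never seen as a phrase, only as "sick" and "of" separately.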

    Naoki

    • Message #1088

      Hi Naoki,

      Thank you for the fast and helpful reply! It does seem that the twitter messages are awfully short, and they won't have the regular email headers (though they do have SOME sort of headers), so it may not work too well. But it is probably worth a try, at least till I get tired of trying to train popfile. :-)

      We'll see how it goes.

      Albert