I did a little experiment to see what people discuss on the most popular blogs.
I downloaded Technorati’s list of the top 1,000 blogs, and crawled 50,000 pages from those blogs. Then I determined the percentage of pages containing words such as “sex”, “Einstein” and “Gaga”. As a baseline for comparison and a sanity check, I figured out the percentage of pages containing each of the ten most frequently used nouns in English:
- time 94%
- person 52%
- year 71%
- way 83%
- day 91%
- thing 78%
- man 96%
- world 70%
- life 59%
- hand 56%
To get these numbers I did the simplest possible pattern matching: ignoring case completely and just looking for each string anywhere inside a page’s HTML. This will throw up some false positives, especially for words like “day” and “man”, which are part of longer words (“daydream”, “manual”). Such words will have inflated numbers (less so words like “person” and “year”).
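Here is a minimal sketch of that kind of matching. It is not the code used for this post; the function names and the tiny `pages` sample are made up for illustration. The first function does the case-insensitive substring test described above, and the second is a stricter whole-word variant (an assumption on my part about how one might reduce false positives) for comparison.

```python
import re


def percent_containing(pages, word):
    """Percentage of pages whose raw HTML contains `word` anywhere, ignoring case."""
    if not pages:
        return 0.0
    needle = word.lower()
    hits = sum(1 for html in pages if needle in html.lower())
    return 100.0 * hits / len(pages)


def percent_containing_whole_word(pages, word):
    """Stricter variant: `word` must appear as a whole word, ignoring case."""
    if not pages:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(1 for html in pages if pattern.search(html))
    return 100.0 * hits / len(pages)


# Hypothetical stand-in for the crawled pages (the real crawl had 50,000).
pages = [
    "<p>Read the manual first.</p>",
    "<p>A man walked in.</p>",
    "<p>Daydream all day.</p>",
    "<p>Nothing relevant.</p>",
]

print(percent_containing(pages, "man"))             # substring match counts "manual"
print(percent_containing_whole_word(pages, "man"))  # whole-word match does not
```

On this toy sample the substring count for “man” is double the whole-word count, which is the kind of inflation to keep in mind when reading the numbers.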
Some stats for celebrities:
- Gaga 9%
- Bieber 5%
- LeBron 3%
- Tiger 5%
Note that I don’t distinguish between “tiger woods” and “tiger” the animal. Here are a couple of scientists:
- Einstein 2%
- Feynman 0.1%
Einstein isn’t quite at the level of Lady Gaga et al, but he clearly remains a cultural icon. Feynman, by contrast, seems to be more an icon of geek culture. Indeed, I’d guess most of that 0.1% comes from the fact that quite a few of the top blogs have a geeky focus. Speaking of geeky focus, let’s take a look at a couple of technology companies:
- Apple 38%
- Google 93%
The number for Google sounds high, but in fact a huge fraction of blog pages include things like Google Analytics, Feedburner, and so on. One day I may redo this on just the non-boilerplate plain text from a page. The numbers for Apple surprised me, even accounting for the fact that this includes mentions of apple-the-fruit. I can’t think of anything analogous to Google Analytics to account for the high numbers. Maybe people who read blogs really are obsessed with Apple.
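As a rough sketch of what matching on plain text rather than raw HTML could look like: the snippet below drops tags plus the contents of `script` and `style` elements, using Python’s standard `html.parser`. The class and function names are hypothetical, and this is only a crude approximation. It does nothing about navigation bars, sidebars or footers, so it is not real boilerplate removal.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text of `html` with tags, scripts and styles removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up page: "google" appears only in a script URL, not in the visible text.
sample = (
    '<html><head><script src="http://www.google-analytics.com/ga.js"></script>'
    "<style>body { color: black; }</style></head>"
    "<body><p>Nothing about search engines here.</p></body></html>"
)

print("google" in sample.lower())                # matched in the raw HTML
print("google" in visible_text(sample).lower())  # not matched in the visible text
```

Running the word counts on `visible_text(page)` instead of the raw page would stop script URLs and attribute values from counting as mentions, though how much that would change the Google figure is something only a rerun could show.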
Athletes like Tiger Woods and LeBron James to some extent transcend their sport. I was curious to see how they compare to the top athletes within a sport, and chose tennis:
- Nadal 0.2%
- Djokovic 0.1%
- Federer 0.1%
The numbers for Djokovic don’t surprise me, but I was surprised by how low Federer is, and Nadal too, to some extent. I thought both were cultural icons, but I guess not.
And finally, does sex really sell? The stats for sex (and related topics) suggest that it does, but apparently not as well as Apple sells:
- sex 31%
- porn 7%
- nude 5%
- naked 7%
The Apple numbers may have a lot to do with the media blitz surrounding Apple over the last few months: the iPad 2 launch, the iPhone location data fiasco, the upcoming iCloud launch, Steve Jobs’s health, Apple crossing Microsoft’s market cap, etc. Apple has consistently been in the news!
I guess so. I still find it pretty amazing. Lady Gaga is in the news a lot, too, but Apple apparently gets 4 times as many mentions!
Some data supporting this guess: http://www.google.com/insights/search/#q=apple%2Capples%2Capplet%2Capplets%2Cgaga&cmpt=q .
Very nice post. Would it be too much to ask for the code you used to do the analysis? Did you use Python?
@Manoel – Thanks. Yes, it was done in Python. For some boring reasons I can’t release the code right now, although I hope to later.
Hi
I think that there is a big confusion overall between the notion of a forced, or directed (type of) relation, one that might wear the name tag, “causal”, and the idea that a set of relations may describe, directly (if in proximity) or indirectly through (an as yet undiscerned) ‘web’ of relations.
There are many ideas out there about how to describe, then qualify, quantify, the selection and description of those relations. An analysis based on hypotheses may define a set of relationships having statistical properties. These may or may not be otherwise related to other properties, correlations, larger webs of significance.
This last notion, with that of relevance, or pertinence if the process is tied to an epistemic actor, reminds us that statistical analysis is a set of METHODS and not a set of concepts.
The concepts come from doing the thinking – mostly somewhat philosophical, but also much practical – in which what we KNOW or BELIEVE is linked to both the recognition of features (in the analysis) as well as in the background (knowledge …) used to interpret results.
That’s why statistics LIE so well, as well as telling us Reality’s dirty little secrets.
It depends on what you’re tuning into.
That said, the recognition of SIGNIFICANCE of patterns, webs of meaning, lying in those objects that are considered in statistical interpretation, explanations derived from that, are all epistemological issues – NOT statistical ones.
There has to be some separation here or you’re trying to compare apples with oranges, nuclear structures with star clusters. If you think they’re homologous somewhere along the way, then read your pre-Socratic Greek philosophers, and those who preached, “as above, so below”.
Maybe so? Maybe the secrets our methods find teach us while reminding us of so much we’ve already learned?
cheers