{"id":4,"date":"2011-06-02T16:25:33","date_gmt":"2011-06-02T20:25:33","guid":{"rendered":"https:\/\/michaelnielsen.org\/ddi\/?p=4"},"modified":"2011-06-02T16:46:35","modified_gmt":"2011-06-02T20:46:35","slug":"sex-einstein-and-lady-gaga","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/ddi\/sex-einstein-and-lady-gaga\/","title":{"rendered":"Sex, Einstein, and Lady Gaga: what&#8217;s discussed on the most popular blogs"},"content":{"rendered":"<p>I did a little experiment to see what people discuss on the most popular blogs.<\/p>\n<p>I downloaded Technorati&#8217;s <a href=\"http:\/\/technorati.com\/blogs\/directory\/overall\/\">list<\/a> of the top 1,000 blogs, and crawled 50,000 pages from those blogs.  Then I determined the percentage of pages containing words such as &#8220;sex&#8221;, &#8220;Einstein&#8221; and &#8220;Gaga&#8221;.  As a baseline for comparison and sanity check, I figured out the percentage of pages containing the top ten <a href=\"http:\/\/en.wikipedia.org\/wiki\/Most_common_words_in_English\">most frequently used nouns in English<\/a>:<\/p>\n<ol>\n<li> time  94%\n<li> person 52%\n<li> year 71%\n<li> way 83%\n<li> day 91%\n<li> thing 78%\n<li> man 96%\n<li> world 70%\n<li> life 59%\n<li> hand 56%\n<\/ol>\n<p>To get these numbers I did the simplest possible pattern matching, ignoring case completely and just looking for strings anywhere inside a page&#8217;s html. This will throw up some false positives, especially for words like &#8220;day&#8221; and &#8220;man&#8221; which are part of longer words (&#8220;daydream&#8221;, &#8220;manual&#8221;), and such words will have inflated numbers (less so words like &#8220;person&#8221; and &#8220;year&#8221;).<\/p>\n<p>Some stats for celebrities:<\/p>\n<ul>\n<li> Gaga 9%\n<li> Bieber 5%\n<li> LeBron  3%\n<li> Tiger 5%\n<\/ul>\n<p>Note that I don&#8217;t distinguish between &#8220;tiger woods&#8221; and &#8220;tiger&#8221; the animal.  
Here are a couple of scientists:<\/p>\n<ul>\n<li>Einstein 2%\n<li>Feynman 0.1%\n<\/ul>\n<p>Einstein isn&#8217;t quite at the level of Lady Gaga et al., but he clearly remains a cultural icon. Feynman, by contrast, seems to be more an icon of geek culture.  Indeed, I&#8217;d guess most of that 0.1% comes from the fact that quite a few of the top blogs have a geeky focus.  Speaking of geeky focus, let&#8217;s take a look at a couple of technology companies:<\/p>\n<ul>\n<li> Apple 38%\n<li> Google 93%\n<\/ul>\n<p>The number for Google sounds high, but in fact a huge fraction of blog pages include things like Google Analytics, Feedburner, and so on.  One day I may redo this on just the non-boilerplate plain text from a page.  The number for Apple surprised me, even accounting for the fact that this includes mentions of apple-the-fruit.  I can&#8217;t think of anything analogous to Google Analytics to account for the high number.  Maybe people who read blogs really are obsessed with Apple.<\/p>\n<p>Athletes like Tiger Woods and LeBron James to some extent transcend their sport.  I was curious to see how they compare to the top athletes within a sport, and chose tennis:<\/p>\n<ul>\n<li> Nadal 0.2%\n<li> Djokovic 0.1%\n<li> Federer 0.1%\n<\/ul>\n<p>The number for Djokovic doesn&#8217;t surprise me, but I was surprised by how low Federer is, and Nadal too, to some extent.  I thought both were cultural icons, but I guess not.<\/p>\n<p>And finally, does sex really sell?  The stats for sex (and related topics) suggest that it does, but apparently not as well as Apple sells:<\/p>\n<ul>\n<li> sex 31%\n<li> porn 7%\n<li> nude 5%\n<li> naked 7%\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I did a little experiment to see what people discuss on the most popular blogs. I downloaded Technorati&#8217;s list of the top 1,000 blogs, and crawled 50,000 pages from those blogs. 
Then I determined the percentage of pages containing words such as &#8220;sex&#8221;, &#8220;Einstein&#8221; and &#8220;Gaga&#8221;. As a baseline for comparison and sanity check, I&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/ddi\/sex-einstein-and-lady-gaga\/\">Continue reading <span class=\"screen-reader-text\">Sex, Einstein, and Lady Gaga: what&#8217;s discussed on the most popular blogs<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/4","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/comments?post=4"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/4\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/media?parent=4"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/categories?post=4"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/tags?post=4"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}