Tuesday, December 28, 2010

Culturomics: Not Quite Yet

Well, it’s time to stick my oar in on the Google Ngrams discussion. While a number of computational linguistics scholars have pointed out the pitfalls of Google’s latest toy, I think I have a unique perspective to offer on the issue. I understand what the Ngrams creators were trying to do, because I’m trying to do exactly the same thing: get some things cooking. My research on contemporary literary reception is not exhaustive or dependent on highly complex statistical models. That’s because literary reception is a huge, multiply mediated field ranging from café conversations to book reviews, and my access to data is limited. But where I have adopted a “core sample” model, choosing a few accessible data sources to make some robust but limited generalizations about readers and reading culture, Google has gone for the moon shot. By creating an opaque front-end to their 5 million book archive, they offer the illusion of a truly global Ngram search, and they emphasize the scale of their ambition by claiming their tool isn’t merely a corpus search mechanism but the portal to a new science of “culturomics.”

As my colleague Matthew Jockers noted in his own oar-insertion post, “To call these charts representations of ‘culture’ is, I think, a dangerous move.” He goes on to suggest it “may be,” but I have to go a bit farther and say “definitely not.” Here’s the problem: we can’t get reasonable, arguable claims about things like culture or literary history unless the limitations of the corpus are acknowledged and dealt with from the outset. Typically, projects like this limit themselves either by going too small or too big, and Google has gone way big. Let me explain what I mean.

Too Small:

The opposite example would be a research project on a small, meticulously tended patch of texts. Classic humanities research, really, but of limited usefulness for making grounded claims about larger literary-historical or cultural issues (at least until enough such small projects emerge with commensurable results that we can begin to construct some causal chains). Traditional humanities as a whole is full of projects that are “too small” for making broad cultural claims because they are limited to a small data footprint. The walled garden of closely tended results is fascinating and lovely to explore, but it’s difficult or impossible to compare the work to anything outside.

Too big:

Google, by contrast, flies off the macro end of the scale by trying to do too much and claim too much. The corpus is amazing, but nevertheless limited and contingent in many ways. As others have pointed out, the OCR is problematic; the metadata is sloppy; the text distribution almost certainly has a number of biases (how could it not? What is the gender, historical and language distribution of the world’s universal library supposed to be anyway?). By choosing to obscure these limitations instead of illuminating them, Google turns “culturomics” into a toy, not a tool.

Fortunately, the data is all there, and these problems can be fixed. Google loves a good algorithm and will presumably figure out solutions to the various technical problems. With luck (and the persistence of its academic research partners) the Ngrams team will also come to acknowledge and reveal the limitations on its data. Once that happens, we can really get cooking and make a clear case for when this vast corpus really does reveal broad cultural trends.

For now, Ngrams is a blunt object but it still has some value as a tool. I’ll post some examples next time.

Friday, December 10, 2010

Stanford Dissertation Browser

While I've had the dissertation specter floating before me for several years now, it has never looked so beautiful. Created by two Stanford graduate students in Computer Science, the Stanford Dissertation Browser uses topic modeling to graph recent dissertations by their disciplinary affiliation. The visualization was created with Flare, successor to Prefuse, which I was using for my own visualizations for a while (this being Stanford, the guy who created all of these visualization tools, Jeffrey Heer, is advising the project).

I'm looking forward to adding my dissertation to the mix next June. I wonder where it will line up?

Friday, October 15, 2010

Map Marathon

I received an email about a wonderful new exhibit/collaboration, "Map Marathon," organized by the Serpentine Gallery in London and those intrepid thinkers at Edge. The whole online gallery is fascinating, but what really caught my fancy was this image, apparently submitted by Bruce Sterling. It's a map of writers who are associated with Sterling, and therefore it has a lot in common with my research.

After some investigation it looks like the map was generated with Gnod, or Gnooks to be exact: "a self-adapting community system based on the gnod engine." I'm intrigued--it seems like the site's connections are based on user input to its adaptive learning system. I'd love to compare these networks to my own data.

Thursday, October 14, 2010

Things are Cooking at First Person

I’m happy to share news of some exciting developments on the First Person thread over at the electronic book review.

First, we published a great riposte by Daniel Worden to Sean O’Sullivan’s essay on Deadwood, one of my favorite shows. The original essay appeared in Third Person and discussed the inherent tension between the plot demands of the television episode and the television series. Worden responded by thinking about the different definitions of necessity at work in the show, including the crossover between the narratological manifest destiny of a canceled season and the kind that drove all those characters to settle the deadly Black Hills of South Dakota.

Second, and ongoing, we’re running a series of entries drawn from the Critical Code Studies Working Group. The group took on the challenge of interpreting software not just as the mechanism for all of our new digital texts and toys, but as text itself. The conversation is a virtual who’s who of software studies, and I’m very excited to be editing its ebr instantiation. I find the subject fascinating and this is a great experiment in new models for digital scholarship. In Mark Marino’s introduction and the Week 1 discussion, participants tried to hammer out some basic definitions and discussed readings of the infamous Anna Kournikova worm.

Friday, September 17, 2010

Franzen on Oprah

As a follow-up to my last post I was planning to talk a little more about the images I posted there. But before I get to that, I need to digest the latest wrinkle in this canon conflict--Franzen's Freedom has been named the latest Oprah's Book Club pick! (Of course I learned of this from a Barnes and Noble email.) This is a bit shocking because of the awkward kerfuffle that happened last time Oprah picked a Franzen novel, when the author said some disparaging things about the whole idea and got himself uninvited.

According to Reuters: "This time, Winfrey said she sent Franzen a note asking for his permission to feature his latest novel 'because we have a little history.'" I wonder if that means Franzen will appear on the show? If so, it's interesting to speculate about what's changed in the literary world since 2001. My off-the-cuff guess would be that we're seeing a kind of flattening of the literary universe as professional critics thin their ranks and the publishing industry struggles to adapt to new realities. But on the other hand, it's entirely possible that Franzen won't go on the show, and that Oprah's taking the high road (as in both moral and -brow) on her own.

All of this circles back to the 'Franzenfreude' debate. The things that presumably attracted Oprah to the book, its themes of American families, love and the struggle for a new domesticity (or so I hear, not having read it yet), are the same things that make the novel appealing to more than just the 'male readers' Franzen was so worried about losing during his previous Oprah spat. And of course these themes would (so critics argue) condemn Franzen to chick-lit middlebrow status if he happened to be a woman.

What we can glean from the images I posted previously is that Franzen really is successful at breaking out of the 'challenging young novelist' box. Unlike, say, David Foster Wallace (whom I'm working on right now), Franzen's books are avenues of exchange for readers of Jane Smiley, Jennifer Egan, David Mitchell, and a host of other writers of both sexes (though, it must be said, more men than women). Oprah's latest pick proves what we see in the images below: Franzen has managed to snag the ring of elite literary prestige while still appealing to diverse audiences. His books lead readers to varied literary clusters, not just to more Franzen. And his links to the canon-spanning roster of previous Oprah selections will only proliferate in the coming months.

Friday, September 3, 2010

Gender Bias in Reviews

I was fascinated to read an analysis Slate's DoubleX staff ran yesterday about gender bias in New York Times book reviews. They discovered that there is a significant slant towards men getting reviewed (and men doing the reviewing), particularly for authors who get the coveted double-coverage treatment (a review in the newspaper as well as one in the weekend Book Review).

One question they pose is about contextualizing writers—would Nick Hornby be a chick-lit writer if he was female? They say: “Our tools are not fine-tuned enough to answer these questions.”

As a former Slatester myself and a current grad student who's working in precisely this area (not on gender, per se, but on reviews, how writers become famous and how books live their own lives online), I have some tools I can bring to the table.

One of the reasons this stuff is hard to pin down is that the literary marketplace is vast, fluid, and poorly documented. The New York Times bestseller list is something of a black box itself, so why not take a look inside some other black boxes to see what distinguishes authors? This is the logic that has led me to spend some serious time looking at Amazon (after all, the world's largest bookseller) to see how authors get contextualized there. I decided to see what the gender breakdown is for books that are recommended¹ from the main subjects of the Slate article: Franzen, Hornby, Weiner and Picoult (who kicked off the debate with an angry comment about Franzen's rave in the Times, if I recall correctly).

The results are shocking. See below: boys in blue, girls in yellow (click on the thumbnails to see larger images).² Yes, Franzen and Hornby are linked to a lot more men than women, which is not too surprising. Weiner is linked almost exclusively to women, which is again not a huge surprise. But take a look at Picoult: she is a literary island unto herself, according to the Amazon recommendation engines. This is very rare in my research, and I think it indicates an author who's distinctive in a stylistically interior way: her books lead readers to more of her books, not to things outside the Picoult universe.

I did this quickly so I might have gotten a gender wrong somewhere or messed up a book network somehow, but as a quick sketch of the differences between Franzen, Hornby, Weiner and Picoult, I think this is quite interesting. (Or at least the perceived differences, which in the literary world are more or less the whole of reality anyway). I don't have a strong opinion in the debate; it seems clear that more men than women are reviewed in the Times, while it's almost certainly true that many more women than men read novels. But as some comments on the Slate article pointed out, gender bias doesn't happen in a vacuum--readers, authors and critics are all players in the same complicated literary game.

1. I look at these recommendations because I think they're one of our best models for what books people actually buy together. In practice books connected this way tend to jump the boring categories like genre and author and link together in much more idiosyncratic ways. Obviously Amazon plays with these results...but they're always trying to sell more books, and they're pretty good at it, so I use the recommendations as a best approximation of the marketplace.

2. "What am I looking at?" The nodes here are books on Amazon, and the arrows connecting them are recommendations from one book page to another. These results represent the first ten "Customers who bought X also bought Y" recommendations for each book, starting with each writer's most recent fiction publication (Freedom; Juliet, Naked; Fly Away Home; and House Rules, respectively).
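
For the curious, here is a rough sketch in Python of how a network like the ones above could be assembled and tallied. It is an illustration rather than the code behind the images: the book titles and gender labels are placeholders, the `RECOMMENDATIONS` table stands in for recommendation lists that would have to be collected separately, and the `max_depth` cutoff is my own assumption about how far to follow the links.

```python
from collections import Counter, deque

# Placeholder data standing in for collected recommendation lists:
# book -> ordered list of "also bought" books. Not real Amazon results.
RECOMMENDATIONS = {
    "Seed Novel": ["Book A", "Book B", "Book C"],
    "Book A": ["Seed Novel", "Book D"],
    "Book B": ["Book A"],
}

# Placeholder author-gender labels; in practice these are assigned by hand.
AUTHOR_GENDER = {
    "Seed Novel": "M",
    "Book A": "F",
    "Book B": "M",
    "Book C": "F",
    "Book D": "F",
}


def crawl(seed, recommendations, per_book=10, max_depth=2):
    """Breadth-first walk from a seed book, keeping the first
    `per_book` recommendations on each page as directed edges."""
    edges = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        book, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the node, but do not follow its links
        for target in recommendations.get(book, [])[:per_book]:
            edges.append((book, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen, edges


def gender_breakdown(nodes, seed, genders):
    """Count author genders among linked books, leaving out the seed."""
    return Counter(genders.get(book, "unknown") for book in nodes if book != seed)


nodes, edges = crawl("Seed Novel", RECOMMENDATIONS)
tally = gender_breakdown(nodes, "Seed Novel", AUTHOR_GENDER)
print(sorted(nodes))
print(edges)
print(dict(tally))
```

The edges stay directed because a recommendation from one book page to another is not necessarily returned, and that asymmetry is part of what makes an author look like an island.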

Saturday, July 17, 2010

London Nerd Tourism

Last night I was thinking about Douglas Adams' bathtub and I realized I should post about the few touristy things I had time to do while I was in London.

-I visited the National Portrait Gallery and said hello to the Romantic poets, who first induced me to really enjoy studying English. NB: air-conditioned!

-I went on a bat walk. This was exciting for a number of reasons, not least of which was meeting my fellow Batwalkers (distant cousins of the Skywalkers). Also, they hand out bat detectors and keep some sample bats on hand for demonstration purposes. Finally, you get to walk through British parks at night, which is apparently a huge subversive thrill. But the highlight is the audio from the bat detectors, which lend a whole new dimension to the experience.

-I had lunch with my friend Scott at Google's London offices. The major highlight for me was to brush a hand over Douglas Adams' bathtub, which now resides among a forest of deck chairs. Baths are, needless to say, very important in the Hitchhiker's Guide mythos. This one looked sleek and self-satisfied, as if it quietly devoured an AdSense salesperson once every week or two.

-I saw Keats House, the cottage where he fell in love with Fanny Brawne, moved from persistently to gravely ill, and wrote some fine poetry. They've even got the (replacement) tree under which he composed Ode to a Nightingale, a poem that presages the entirety of Yeats in eighty lines.

All in all, a great trip!

Wednesday, July 14, 2010

Back to work

I returned from London on Monday and have been slowly gearing back up into work mode. DH2010 was a great conference experience and I met a lot of people I'm hoping to keep in touch with. I think my talk went pretty well and it seems like more people are doing work similar to mine this year, which is comforting.

And, since Digital Humanities is such an impressively techie and well-organized affair, they've already got an audio interview that I did right after my talk posted online. I understand that more materials from the conference will be posted on arts-humanities.net in the days to come; it would be great if they post slides from presentations that I missed. It will have to console us until next year's conference...at Stanford!

Thursday, July 8, 2010

London Dispatch

I've once again fallen way behind in my blogging, but fortunately I have much to report. I'm writing from Digital Humanities 2010, where I'll be presenting my latest research on Saturday. The conference is in London and it's been exciting and a little befuddling to wrestle with jet lag amidst such a rich array of panels and posters.

The paper I'm giving is on Toni Morrison, the subject of the recently completed Chapter 2. It's in its fourth iteration now, after a trial run among the friendly brains at Stanford and great panels at ASU's Southwest English Grad Students conference and ACLA. At each point I've been refining my methodologies and slides (lesson one: visualization is endlessly finicky).

As before, this is a case study where Morrison's work is really a jumping-off point for an exploration of her reading publics and the nature of literary fame. When I presented at DH2009, I was still working out how to approach these questions and adopted a kind of shotgun strategy, using every data set and methodology I could think of to see what worked. That paper, on Thomas Pynchon, had a lot going on: networks of Amazon recommendations; Wordle images based on word counts of book reviews; bar graphs of library copies; graphs of MLA citations and comparisons of MLA, Amazon and newsgroup publications by year.

Most of these ideas were interesting, but only some of them 'stuck' for me. The cyclical nature of academic and other kinds of publication, for example, was revealing to see but a point that probably only needs to be proven once. This year I've decided to focus on the richest results from the past and push the envelope. My paper will look at the social lives of Morrison's novels, and the 'social' networks they inhabit online. I've worked hard in the past year to create collocation-based networks and to use network analysis to identify the most significant nodes and clusters in Morrison's ideational networks online. These are the most interesting, and the messiest, of my datasets, and network analysis has revealed some surprising patterns that I'll be sharing on Saturday.

So that's the major news. I have a couple of other projects cooking that I'm going to write up when I have some solid bulletins to report.

Tuesday, April 13, 2010

Farewell, Leland

Blogger is ending its support for FTP publishing of blog files. That means that my blog can no longer be hosted on Stanford's servers in its current form. Ergo, we're back here. Assuming this works. Further bulletins as events warrant.

Sunday, February 21, 2010

Talking in Tempe

I had a great time speaking on Saturday at the Southwest English Graduate Student Symposium, or SWEGS, according to its intimidating acronym. This was a good way to introduce some of my research on chapter 2 of the dissertation, which is a case study on Toni Morrison's oeuvre. It was a pleasure to meet some other members of the local English grad student community, and I was shocked (and pleased) to encounter a fellow panelist who's also looking at Amazon's recommendation networks. I'm looking forward to sharing ideas with him down the road.

This was the first stop in the 2010 road tour, which will include Stanford, New Orleans and, hopefully, London. I'll be updating the presentation with new bells and whistles as I make more progress on some new ways of looking at references in book reviews.

Until then, back to the mines.

Tuesday, January 12, 2010

A Big Year

2010! Where is my jetpack?

It's been a busy year so far, and I'm hoping to keep up with this new, futuristic energy. After a bit of a slow autumn (we use the term metaphorically here in Phoenix) and the usual distraction of the holidays, I finally got to check a few major items off my list this week. Yesterday I completed a funding application for the Stanford Humanities Center--they offer a few dissertation fellowships each year. Today I finally--FINALLY--finished revising a paper submission based on my Pynchon chapter and sent it back for round two.

Now it's time to buckle down and return to data analysis. I've assembled a great pile of book reviews and recommendations in a MySQL database, and I have a few discrete challenges ahead of me:

First, I need to come up with an effective way to identify and then tag proper nouns in book reviews. This is easy to do badly and then clean up by hand, which is what I did for the last chapter. But there are a lot of Morrison reviews out there, so now I really need a computer for this. As a first pass/proof of concept I'm hand-editing a little "dictionary" of all the proper noun literary references made in professional reviews of Morrison's work. Then I'll write some kind of program to search for and tag those references in the reviews.
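
A minimal sketch of what that first pass might look like in Python, assuming the "dictionary" is just a mapping from the forms a name takes in reviews to one canonical label. The entries, the `[[...]]` tag format and the sample sentence are all made up for illustration; the real list would come out of the hand-editing.

```python
import re

# Hypothetical hand-edited "dictionary": surface form -> canonical label.
# These entries are placeholders, not the real list from the reviews.
REFERENCE_DICTIONARY = {
    "Faulkner": "William Faulkner",
    "William Faulkner": "William Faulkner",
    "Woolf": "Virginia Woolf",
    "Beloved": "Beloved",
}


def build_pattern(dictionary):
    """Compile one regex that matches any dictionary entry as a whole word."""
    # Longest entries first so "William Faulkner" wins over "Faulkner".
    forms = sorted(dictionary, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in forms)
    return re.compile(r"\b(?:" + alternation + r")\b")


def tag_references(text, dictionary):
    """Return (tagged_text, labels) with each match wrapped in [[label]]."""
    pattern = build_pattern(dictionary)
    found = []

    def replace(match):
        label = dictionary[match.group(0)]
        found.append(label)
        return "[[" + label + "]]"

    return pattern.sub(replace, text), found


# Made-up sentence, only to exercise the tagger.
sample = "Like William Faulkner before her, Morrison builds on Woolf."
tagged, labels = tag_references(sample, REFERENCE_DICTIONARY)
print(tagged)
print(labels)
```

Matching is case-sensitive on purpose, so the title "Beloved" gets tagged while the ordinary adjective "beloved" does not. A title that opens a sentence would still need cleaning up by hand.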

Once I get that figured out, the second trial process is going to be creating network graphs of these literary references based on collocations. I think I'll probably start by defining links as "in the same paragraph," but this might change depending on how useful the graphs end up being.
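
The "same paragraph" rule is simple enough to sketch in a few lines of Python. This assumes each paragraph has already been reduced to the list of references found in it; the function name and the sample labels are hypothetical, and the output is a plain weighted edge list that any graphing tool could take from there.

```python
from collections import Counter
from itertools import combinations


def paragraph_links(tagged_paragraphs):
    """Count how often two references share a paragraph.

    tagged_paragraphs: one list of reference labels per paragraph.
    Returns a Counter mapping (label_a, label_b) -> shared paragraphs.
    """
    edges = Counter()
    for labels in tagged_paragraphs:
        # Dedupe and sort so each pair is counted once per paragraph
        # and always appears in the same order.
        unique = sorted(set(labels))
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Made-up input: the labels found in three paragraphs of one review.
paragraphs = [
    ["Toni Morrison", "William Faulkner", "Toni Morrison"],
    ["Toni Morrison", "Virginia Woolf", "William Faulkner"],
    ["Virginia Woolf"],
]

edges = paragraph_links(paragraphs)
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "weight", weight)
```

Changing the definition of a link later (same sentence, or within some number of words) would only mean changing how the text is split before it reaches this function.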

If I can get all this working in the next week or two, hopefully I will get some kind of epiphany for how to automate the process elegantly for a much larger, and badly proofread, set of consumer reviews of Morrison. It's 2010...where is my artificial intelligence research assistant?