Friday, June 26, 2015

Indo-European Origins

2015 isn't even half over, but we've already seen a flood of high-profile papers and books weighing in on the question of the 'Indo-European homeland'. It will probably take a while for everything to sink in and get properly digested, but here are my preliminary reactions to what I've read so far.

For those who don't spend their spare time thinking about Avestan verbal conjugations, the basic issue is about where a particular ancient language called Proto-Indo-European (usually abbreviated to PIE) was spoken. We have no direct records of this language, but reconstruct it on the basis of a number of ancient and modern languages that seem to have developed from it. This is sort of like how French, Spanish, Italian, and the other Romance languages all developed out of Latin, only in this case it's as if Latin were never written down so that we have to figure out what it was like by comparing the later languages. Some of the more familiar early Indo-European languages include Greek, Latin, and Sanskrit, but the family is very large and includes languages as diverse as English, Lithuanian, Kurdish, and Albanian. People have been working on reconstructing PIE for a couple of centuries now, so that the method of reconstruction is pretty advanced and we really know quite a lot about the language (though there are some very important points of debate too).

(Note that PIE isn't some sort of primordial language. Humans have been speaking for tens of thousands of years, and PIE is very far removed from the first human language(s). It's also not the only reconstructed proto-language in Eurasia. We also have Proto-Afro-Asiatic (including Proto-Semitic, the ancestor of Babylonian, Hebrew, Arabic, and others), Proto-Uralic (whose most famous descendants are Finnish and Hungarian), Proto-Kartvelian (from which a number of languages in the Caucasus come), and others -- and of course there are more yet around the rest of the world. But PIE is a particularly interesting proto-language, since its descendants include a large number of interesting ancient and modern languages, and its study can shed light on the prehistory of Europe and other regions not available from any other source.)

Naturally, people have often been curious about when and where PIE was spoken, and a lot of possibilities have been brought up over the years. There isn't really an obvious answer, since even the oldest Indo-European languages are found across an enormous area: from Greek in Greece to Sanskrit in India to Tocharian in Western China (!), to the Baltic, Germanic, and Celtic languages across northern Europe. This breadth not only makes it hard to figure out where the original language came from, but raises the question of just how these languages got spread so far and wide. Nowadays only two 'homelands' really receive much attention, the 'Anatolian' and 'Steppe' hypotheses. They differ not only on where PIE was spoken, but also on when, and on how it spread. Here they are in brief:

1) The Anatolian origin places PIE in Anatolia (Turkey, basically) at a very early date, around 7000-6000 BC or so. Around this time, farming was spreading from this area into Europe (which had only been inhabited by hunter-gatherers until this point), and the idea is that as early farmers settled in slow waves from Anatolia, they spread their language as well as their way of life into these areas. This idea was first laid out by the archaeologist Colin Renfrew in his 1987 book Archaeology and Language: The Puzzle of the Indo-Europeans, which remains an interesting read.

2) The Steppe hypothesis looks to the Eurasian steppe, especially in the region that's now Ukraine, around 4000-3000 BC. This is over 1000 miles away and on the other side of the Black Sea from Anatolia, as well as several millennia later. The Steppe hypothesis sees the spread of the IE languages as a somewhat more complicated process, but points especially to the invention of wheeled vehicles in this area at this time. The notion is that PIE culture was mobile and competitive, and for various cultural and economic reasons was well-placed to spread rapidly -- partly through the actual migration of people, but also partly by assimilating other people into their lifeways (including language). The modern form of this hypothesis goes back to Marija Gimbutas in the 1950s, but it's been developed by a whole host of people since.

Both hypotheses have been argued at length for a number of years now, but something new has been added to most of the main points during the past few months.

Statistical Dating of Proto-Languages

One of the first big items of the year was the publication by a team from UC Berkeley of the rather technically titled article Ancestry Constrained Phylogenetic Analysis Supports the Indo-European Steppe Hypothesis. The topic of this paper is an approach that goes back to 2003, when a couple of New Zealand biologists tried to apply statistical methods from evolutionary biology to date when PIE began to split up into its various descendant languages. The New Zealanders had found that PIE was very old (7800-5800 BC), a date that was more in keeping with the Anatolian hypothesis than with the Steppe. There was a major follow-up in 2012, which claimed to find the same thing. Now the Berkeley team has responded on the other side of the debate, claiming that if you use a better methodology, the dates actually end up being much more recent, in keeping with the Steppe rather than the Anatolian hypothesis.

The basic method behind all of these studies is to look at the replacement of basic vocabulary words. The idea is that words for things get replaced from time to time -- the Old English word wamb*, for instance, has been replaced in modern English by belly, though other basic words like hand and night have stuck around (though sometimes they've undergone changes of pronunciation). We know that the rate at which this sort of replacement happens can be extremely variable, but this approach rests on the (questionable) notion that a sufficiently advanced statistical model can still get at least some idea of how long ago related languages split, based on how much their basic vocabulary has diverged.

*[Edit: I should probably specify that wamb is still around in English, in the specialized meaning 'womb'. But it's been replaced as a part of the basic vocabulary by belly, which is all that this sort of lexical statistical model cares about.]
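To make the vocabulary-replacement idea above a bit more concrete, here is a minimal sketch of the old 'glottochronology' calculation. This is emphatically not the Bayesian phylogenetic machinery used by either the New Zealand or the Berkeley team; it's the much cruder classical formula, t = ln(c) / (2 ln(r)), where c is the share of basic meanings still expressed by cognate words in two languages and r is an assumed retention rate per millennium. The word list, the cognate labels, and the function names are all invented for illustration, and the 0.86 figure is the conventional textbook value for a 100-item list, used here purely as an assumption.

```python
import math

# Toy data, invented for illustration: each basic meaning is mapped to a
# cognate-class label in two related languages. The same label in both
# languages means both have kept the inherited word for that meaning.
LANG_A = {"hand": "c1", "night": "c1", "belly": "c1", "water": "c1", "stone": "c1",
          "fire": "c1", "dog": "c1", "tree": "c1", "eat": "c1", "big": "c1"}
LANG_B = {"hand": "c1", "night": "c1", "belly": "c2", "water": "c1", "stone": "c1",
          "fire": "c1", "dog": "c2", "tree": "c1", "eat": "c1", "big": "c2"}


def shared_fraction(lang_a, lang_b):
    """Share of meanings (present in both lists) with the same cognate class."""
    meanings = [m for m in lang_a if m in lang_b]
    if not meanings:
        raise ValueError("the two word lists have no meanings in common")
    same = sum(1 for m in meanings if lang_a[m] == lang_b[m])
    return same / len(meanings)


def split_age_millennia(shared, retention_per_millennium=0.86):
    """Classical glottochronology: t = ln(c) / (2 * ln(r)).

    The default retention rate is an assumed constant; the whole point of
    the criticism is that real replacement rates are not constant at all.
    """
    if not 0 < shared <= 1:
        raise ValueError("shared fraction must be in (0, 1]")
    if not 0 < retention_per_millennium < 1:
        raise ValueError("retention rate must be in (0, 1)")
    return math.log(shared) / (2 * math.log(retention_per_millennium))


if __name__ == "__main__":
    c = shared_fraction(LANG_A, LANG_B)
    print(f"shared basic vocabulary: {c:.0%}")
    # Same data, three different assumed retention rates.
    for r in (0.80, 0.86, 0.90):
        age = split_age_millennia(c, r)
        print(f"retention {r:.2f} per millennium -> split about {age:.2f} millennia ago")
```

With exactly the same word lists, moving the assumed retention rate from 0.80 to 0.90 more than doubles the estimated age of the split, which is a small-scale version of why the dates from these models depend so heavily on modelling choices.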

The New Zealanders have been arguing that the early datings by these statistical models provide substantial support for the Anatolian theory, but a lot of people have criticized all aspects of their study. This new paper by the Berkeley team is the first major attempt to actually use their own methods to obtain a different result. The key difference is in the 'ancestry constrained' part of the title. Basically, the New Zealanders had just fed data from a bunch of Indo-European languages in, and let the computer figure out how they were related -- it didn't say anything, for instance, about Latin being an ancestor of Italian, and in fact their model doesn't have Latin as ancestral to the Romance languages. It puts it more as an aunt or uncle. The Berkeley folks figured they'd try putting in these 'ancestry constraints', and tell the computer that the Romance languages come from Latin (and that Modern Irish comes from Old Irish, etc.). When they did this, the age of every part of the family, including PIE itself, came out as more recent.

This is probably because the New Zealand model was having to produce more prehistoric language stages. The Berkeley model has the Romance languages coming from Latin, which in turn comes from Proto-Italic, and that from PIE. The New Zealand approach reconstructs a Proto-Romance, which is significantly different from Latin, and so needs a sort of Proto-Latino-Romance stage that's older than Latin, which means the age of Proto-Italic also gets pushed older, etc. When this happens all over the family tree, the average age of PIE can ultimately be pushed back quite significantly. By eliminating this effect (which they call 'jogging'), and making a few other technical changes, the Berkeley team got dates ranging from 5100-2800 BC. This encompasses all of the Steppe dates, but is too late for the Anatolian origin -- the farmers had already left Turkey.

Like most linguists, I'm pretty sceptical about all this, whatever the conclusions. Lexical data is among the least reliable in language change, for a variety of reasons, and it's hard to see how the best statistical model in the world could get useful results from nearly useless data. The Berkeley team claims that their study shows that statistical models can be useful, and that theirs actively supports the Steppe hypothesis. My own feelings are a bit different: I'd suggest that this study basically makes these statistical models irrelevant. Proponents of either origin can now point to or disregard statistical studies as they wish. Since most linguists have been happily disregarding them already, they'll probably just keep on doing so. The actual evidence comes from other sources.

Traditional Arguments Restated

My favourite paper on all this to appear this year is a well-written piece by archaeologist David Anthony and linguist Don Ringe, titled The Indo-European Homeland from Linguistic and Archaeological Evidence. This paper tries to present the strongest possible arguments in favour of the steppe hypothesis, and it's hard to think of any two authors better qualified to do so.

David Anthony is not new to this area of study. His 2007 book The Horse, the Wheel, and Language is basically an extended archaeological argument for the Steppe origin, and is probably the most up-to-date thing of its kind. Building off of earlier work like Jim Mallory's In Search of the Indo-Europeans and ultimately Marija Gimbutas's idea of the 'Kurgan hypothesis', Anthony has tried hard to put together a coherent picture of what the Steppe origin might have looked like in detail.

This paper says little that's fundamentally new, but it tries to state the traditional arguments in the most rigorous way possible. This is helped by the presence of Don Ringe, an Indo-European linguist who has the expertise to make the linguistic case in a coherent way.

The biggest emphasis in this paper is an approach that's sometimes called Wörter und Sachen (German for 'words and things'), or else linguistic palaeontology. The idea is that if we can securely reconstruct a particular word with a particular meaning for a proto-language, the implication is that the speakers of that proto-language had or knew that thing.

Older versions of the homeland debate have often focused on flora and fauna, since a word for a rarer animal or plant might help pinpoint where the PIE speakers were. This approach has never worked very well, mainly because most of the reconstructible words of this sort are for fairly widespread things, like beavers and wolves. This makes sense: any really specific word would have been lost or changed meaning in most branches of Indo-European, once the speakers left the area that plant or animal lived in -- this would make it really hard to recover the original meaning.

In the current debate, the focus has been not on the natural world, but on technology, particularly (but not exclusively) wheels. This paper points out that while a word for 'wheel' is not reconstructible for PIE as such, it can be reconstructed for the next best thing. In Indo-European linguistics, the various descendant languages are grouped into ten sub-families or branches, and most people assume that the first of these families to split off and go its own way was the Anatolian family (not to be confused with the Anatolian hypothesis!). Ringe uses the term NPIE, 'nuclear Proto-Indo-European' for all the other IE languages, which continued to develop after the departure of Anatolian, and shows that words for 'wheel' (and various related words for wheeled vehicles) can be reconstructed for NPIE.

Wheels are interesting because they are relatively late, archaeologically speaking, only showing up around 4000 BC at the earliest. This seems to show pretty conclusively that NPIE was still around at this date, long after the first farmers had dispersed across Europe.

There are quite a few people who doubt that the wheel vocabulary is significant. People have argued that these words could have been borrowed around later, or been invented independently, or been old words for different things that shifted meaning. This paper does a particularly good job of addressing these alternatives, and spelling out the highly unlikely assumptions required in each case.

Anthony and Ringe also look at other parts of the (N)PIE vocabulary, focusing on words for feasting, the celebration of glory and martial prowess, leaders and followers, guest-host relationships, and the like. They paint a picture of (N)PIE society as interestingly fluid, based around competing chiefs who accumulated followers and maintained long-distance networks. The society was fairly mobile, based around stock-breeding and warfare, and tied together by people visiting and staying with each other over long distances, feasting with each other, and sharing a common culture of religious ritual and poems praising successful chiefs and heroes. These things encouraged spread of the cultural system, the recruitment of new potential followers and allies, and the use of the prestigious language of these networks (if your success is partly dependent on maintaining your reputation in a particular poetic tradition, that at least encourages the further use of the language used for that poetry).

It's hard to know just how precise we can get with these sorts of cultural reconstructions, but their conclusions are plausible and really pretty restrained. They also provide a good model for how the spread of the IE languages might have worked.

They also touch on various other topics, such as contact between PIE and the Uralic languages, and how certain archaeological events might be related to language history. If you want a single, pretty concise and well-written overview of the traditional Steppe arguments, you really hardly need look further than this one piece.

The Indo-European Controversy

It was neat to see another major book on the subject appear in April, Asya Pereltsvaig and Martin Lewis's The Indo-European Controversy. It's expensive, but looks very interesting. I haven't had a chance to read it yet, but judging from the description, and from online contributions by the authors, it's pretty clear that it focuses at least in large part on dismantling the New Zealand team's computational models, and on the proper methodology of developing a theory that links linguistic reconstruction with archaeological fact.

Genetic Studies in Nature

The most recent major event in the debate comes not from archaeology, linguistics, or computational methods, but from genetics. Two pieces have just appeared in Nature arguing that the genetic population of Europe was significantly influenced by genes from the steppes (i.e. that there was a significant migration from the steppes into parts of northern Europe) sometime in the period before 3000 BC.

At first glance, this looks like pretty straightforward support for the Steppe hypothesis. People are moving in precisely the right places at the right time, from the steppe into Europe around the 4th millennium BC. Nonetheless, it's harder to prove that the language these people brought with them was PIE (or even that a single language was involved, or really what sort of patterns of linguistic shift we might expect all around). It certainly fits a lot better with the Steppe origin than the Anatolian one, though.

This is assuming that the genetic side of the studies is rigorous and reliable, which is something I don't have enough of a background in to comment on.

Final Thoughts

All in all, this has not been a good year for the Anatolian hypothesis. It's never really received much support from linguists, and has chiefly been promoted by some archaeologists (Renfrew) and evolutionary biologists (the New Zealand team of Atkinson, Gray, and others). Personally, I feel it mainly rests not so much on any real evidence, but on a feeling that for the IE languages to have spread so far and wide, we need to look for some single Big Event. The development of agrarian farming and its spread across Europe certainly provides such an event, at least for the Western IE languages, but otherwise has little to recommend it.

In fact, it really has quite a lot going against it. There's the issue of wheels, of course, and other pieces of technology associated with the 'secondary products revolution' -- the use of milk, wool, and other animal products, which only came about millennia after the initial spread of purely agrarian (crop-based) farming. Beyond this, there's also the problem that the Indo-European languages seem to be relative latecomers in places the Anatolian hypothesis predicts they should have arrived early. 

In Europe, for instance, a number of western Indo-European languages seem to have borrowed words for things like beans and peas (Guus Kroonen has done excellent recent work in gathering the evidence for this 'agricultural substrate' language or languages). The same sound correspondences that prove that the wheel words can't be later borrowings prove that these agricultural words were borrowed independently from some non-IE language into Latin, Germanic, Celtic, Baltic, and Slavic, in the process adjusting in different ways to fit into each of these already separated languages. The implication is that there were already agrarians living in Europe and speaking their own languages when the Indo-European tongues arrived. This is hardly consistent with the hypothesis that Indo-European was brought into Europe by the first wave of agriculturalists.

Beyond this, we know there were quite a few non-IE languages scattered across parts of Europe, even well into the Iron Age, which would be odd if Indo-European had so thoroughly carpeted Europe in the Stone Age. Basque is the only living remnant of these non-IE languages, but we have ample records from across the Mediterranean of languages like Etruscan (the language of an important agrarian civilization), Tartessian, and whatever's written in the still-undeciphered Linear A script (which is highly unlikely to be anything IE).* Europe seems to have been pretty linguistically diverse for a very long time, nowhere near as Indo-Europeanized as the Anatolian hypothesis might predict (given that the whole point of the theory is to explain how Europe could have gotten so thoroughly Indo-Europeanized).

*[Edit: Don Ringe has a nice discussion of all of this in a guest post for LanguageLog.]

I think there's enough evidence from Wörter und Sachen, borrowings, non-IE languages of ancient Europe, and the distribution of the Indo-European languages, along with enough doubt about the underpinnings of the theory (in its need for a single flashy linguistic vector and earlier lexicostatistical studies) to say confidently that the Anatolian hypothesis is not viable -- not just unproven, but highly unlikely to be correct.

Whether this means the Steppe origin is right is another matter. One of the problems with a polarizing debate is that evidence against one theory can be taken as positive support for another. I personally like the Steppe hypothesis a lot. It seems to put PIE in the right time period, and the place is plausible. There are archaeological connections between the steppes and several important places that the IE languages ended up, and the method of linguistic transmission strikes me as well thought out.

Still, I'm not quite sure we ought to call the Steppe hypothesis 'proven'. It's certainly plausible, and the way that it ties together a wide range of evidence quite neatly gives it some real support. But it's really hard to link an unrecorded language with material culture, and there's got to be some room for doubt. I'm not quite sure that, barring the invention of time machines, we should ever let ourselves make the last leap from 'PIE was probably spoken on the steppes' to 'PIE was spoken there'. But however it goes, it's been an interesting half-year so far, and I'm very much looking forwards to seeing what else is waiting down the road.


  1. Nelson

    Brilliant post, and it really helps get to the core of the argument between these two competing theories. I am slowly going through the Anthony/Ringe paper. I like your style; it reminds me of Tolkien's reviews of philology books in the 1920s for The Year's Work in English Studies. Look forward to more! Best, Andy

  2. Great first post, well done! Thank you for mentioning our book, The Indo-European controversy -- would you be interested in doing a more detailed review of it on your blog? We could send you a free copy, if you are interested. Let me know. Best wishes, Asya

  3. I'm glad you both enjoyed the post!

    Asya, I'd love to do a review of your book (though I would indeed need a copy).

  4. Hello, Nelson! I've just discovered you've started this blog. A nice post #1. I'm adding you to my blog roll, looking forward to #2, #3 etc.

    I just wanted to comment on the *kʷekʷlo- business. Hittite hurki- also has a plausible IE etymology (*h₂werg- isn't attested as a verb in Anatolian, but then *kʷekʷlo- is more widespread than *kʷelh₁- in the other part of the family). Since there's no reason to treat either of the sister branches (Anatolian and "Nuclear") as privileged, *kʷekʷlo- and hurki should enjoy equal status as early IE 'wheel' words. It's hard to decide if both of them are post-PIE innovations. Either of them, or even both, may go back to PIE. *kʷekʷlo-, at any rate, often has meanings like 'circle, ring, disk' (not surprisingly, in view of its etymology), so it may have existed long before the invention of wheeled vehicles.

  5. I'm glad you enjoyed the post, Piotr. You're right about *kʷekʷlo-, of course - I think one of the strengths of the Anthony-Ringe paper is that they stress the importance of word groups, and the cumulative improbability of too many separate innovations. Any one word on its own will always have too much doubt to really be that useful. (Though the cumulative factor also applies, if less strongly, to the number of branches in which a word has a given meaning - *kʷekʷlo- does mean wheel or wheeled vehicle in quite a few rather disparate branches.)

    I've often wondered whether the semantics of *kʷekʷlo- have any inherent probabilities in them. 'Wheel' and 'circle/ring' are pretty closely related concepts, but would a derivation *turn > *wheel > *circle maybe be more likely than *turn > *circle > *wheel? Intuitively I find it attractive to derive a word for 'wheel' directly from a root for 'turn', especially if the reduplication is meant to have some sort of iconic iterative connotation - 'the thing that turns around and around'. But semantic change is slippery, and I don't suppose this question can actually be answered with any useful certainty. I'm sure my gut feelings about it don't count for much at all!