Date: 3 September 2013
Tags: python
An interesting diagram has been making the rounds on the Internet. Attributed to a Twitter personality named @TheBigPharaoh — whose tweets draw attention to the humanitarian and human rights situation in Egypt — it has been cited by no less an authority than the Washington Post, which calls it a “sort of terrifying” depiction of the modern Middle East. But as a consumer of data, I was immediately skeptical: there are many ways to make quite simple information look like chaos if it is presented poorly.
Is the information in the diagram really that complex?
I decided to try building a very simple data model to see if it could predict every single relationship on the diagram. Not because I think that the real Middle East (or anywhere else) can be adequately described with a simple model, but because I strongly suspected that the diagram itself was in fact modeled on only a few basic regional divisions.
Update: reader Alex Burr points out five missing edges in [`pharoahs-chart.json`](http://rhodesmill.org/brandon/2013/pharoahs-chart.json) so please use [`pharoahs-chart-v2.json`](http://rhodesmill.org/brandon/2013/pharoahs-chart-v2.json) instead, which inspired an improvement in the article below: Iran has been added to the `islamists` set, which now overrules the Shia-Sunni split to match the diagram’s assertion that they support Hamas.
So I opened an IPython Notebook and got to work! This blog post is, in fact, the notebook itself, with some Markdown cells full of paragraphs and text added to provide structure and commentary. You can download the original notebook here:
So that every IPython Notebook does not begin with the same series of verbose import statements, IPython provides a `pylab` directive which imports a few dozen essential NumPy features. It is the first step that I took in getting ready to code:

```
%pylab inline
```

```
Populating the interactive namespace from numpy and matplotlib
```
So that other people can play with the diagram — and probably do an even better job of analysis than I will here! — I have chosen to represent it as JSON instead of using a Python-specific format. You can download my small data file here:

Once this file is saved to the current directory, Python can load it quite easily with the `json.load()` function:

```python
import json

with open('pharoahs-chart-v2.json') as f:
    edges = sorted(json.load(f))

a, verb, b = array(edges).T
print 'Loaded', len(verb), 'edges'
```

```
Loaded 42 edges
```
As is common when doing information processing in modern Python, note that I have not left the data as a list-of-lists as it is represented in the underlying JSON file. Instead, I have passed the entire data structure to the NumPy `array()` function, which I have then transposed so that the input’s list of 3-element items becomes three big vectors: a vector of actors, a vector of verbs, and finally another vector of actors at which those verbs are respectively directed.
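The transpose trick is easy to see with a toy edge list. This is a minimal standalone sketch: the actor names are invented, it uses plain NumPy imports instead of the `%pylab` namespace, and it uses fresh variable names so as not to clobber the real `a`, `verb`, and `b`:

```python
import numpy as np

# A toy edge list in the same [actor, verb, actor] shape as the JSON file.
toy_edges = [
    ['A', 'supports', 'B'],
    ['A', 'hates', 'C'],
    ['B', 'hates', 'C'],
]

# array() builds a 3x3 array of strings; .T transposes it so that
# each column of the input becomes one long vector.
left, verbs, right = np.array(toy_edges).T

print(left)   # the actors doing the supporting or hating
print(verbs)  # the verbs
print(right)  # the actors on the receiving end
```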
A count of the unique nodes is a quick check against misspellings, since a misspelling will create two unique nodes where the original diagram had only one. Happily, computing the number of unique strings across `a` and `b` combined yields exactly the number of unique nodes in the actual diagram:
```python
print 'Loaded', len(unique(append(a, b))), 'nodes'
```

```
Loaded 15 nodes
```
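To see why the unique-node count catches misspellings, here is a minimal sketch with invented toy vectors, in which 'Asad' is a deliberate typo for 'Assad':

```python
import numpy as np

a_toy = np.array(['Iran', 'Russia', 'Iran'])
b_toy = np.array(['Assad', 'Assad', 'Asad'])  # 'Asad' is a deliberate typo

# The typo splits the single node 'Assad' into two distinct strings,
# so we count 4 unique nodes where clean data would have only 3.
n = len(np.unique(np.append(a_toy, b_toy)))
print(n)
```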
We are nearly ready to explore the data! I will propose a series of simple political models of the Middle East, each of which is a function that, given a political actor `a` like `"Turkey"` and a potential client `b` like `"Syria Rebels"`, returns one of three predictions:

- `"supports"` — the diagram’s blue lines
- `"hates"` — the diagram’s red lines
- `"clueless"` — the diagram’s green lines

These predictions can then be compared to the actual arrows on the diagram to rate the political model for its accuracy. Note carefully that these models are only being judged for their ability to correctly color-code the arrows that actually exist in the diagram; they can return whatever nonsense they want for arrows not in the diagram, like `("USA", "Turkey")`, because we are only testing the functions against the input data set.
Because NumPy supports vector operations that operate simultaneously on whole vectors of input values, it only takes a single `==` operation to compare a series of predictions against the series of actual supports/hates verbs from the diagram. The only catch is that, to perform the actual prediction, we need to “vectorize” each little prediction function to produce a routine that works on a whole vector at a time. And we use another trick: since a series of `==` decisions like `True` and `False` are in fact equivalent to a series of numbers `1` and `0`, we can use `sum()` to count how many `True` values are present!
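These two tricks can be sketched in isolation with a throwaway predictor and three invented edges:

```python
import numpy as np

a_toy = np.array(['Iran', 'USA', 'Russia'])
b_toy = np.array(['Assad', 'Assad', 'Assad'])
verb_toy = np.array(['supports', 'hates', 'supports'])

def optimist(a, b):
    """A scalar predictor: everyone supports everyone."""
    return 'supports'

# vectorize() lifts the scalar function so one call handles whole vectors...
prediction = np.vectorize(optimist)(a_toy, b_toy)

# ...and a single == produces a vector of True/False matches.
match = (prediction == verb_toy)

# True counts as 1 and False as 0, so sum() tallies the correct guesses.
print(sum(match))
```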
Aside from these two nuances, the reporting routine is rather simple Python:
```python
def try_predictor(predictor, report=True, verbose=False):
    """Report on how well a `predictor` function performs."""

    # What does the predictor predict for each situation?
    prediction = vectorize(predictor)(a, b)

    # How does that stack up against the diagram?
    match = (prediction == verb)
    percent = 100.0 * sum(match) / len(match)
    print 'Accuracy: %.03f %%' % percent

    # What specific predictions is it making?
    if report and (verbose or not all(match)):
        print
        for is_match, ai, bi, pi in zip(match, a, b, prediction):
            if is_match and not verbose:
                continue
            print ' ' if is_match else 'WRONG:',
            print ai, pi, bi
        print
```
Before getting all political, we should test this analysis and reporting tool by feeding it one or two dummy predictors that are not actually interesting, to see its output. We will try exercising a pair of functions that represent the perfect optimist and the perfect pessimist: the one assumes that members of the human species always support one another, while the other assumes that `"hates"` is the universal relationship.
```python
try_predictor(lambda a, b: 'supports', report=False)
try_predictor(lambda a, b: 'hates', report=False)
```

```
Accuracy: 47.619 %
Accuracy: 47.619 %
```
It is a happy fact that the pessimist and optimist are so perfectly balanced in this particular case: the number of friendly links in the diagram, in other words, is equal to the number of enemy relationships. Which almost gives one hope for the world — not quite, but almost.
Given this infrastructure, it will take only a few steps to predict every single political relationship in the Big Pharaoh’s diagram. The real Middle East may be more complex than this, but you would not know it from the diagram!
The first thing that strikes me is how many red arrows cut left-to-right across the diagram between the upper right, where we see Russia, Assad, and Iran, and most of the rest of the state and non-state actors that are depicted. This has deep roots: Islam became separated within its first few centuries into a Sunni majority and a Shia minority (as well as many smaller groups), the latter of which claims both Assad and the Iranian leadership as adherents. If we place all of the Shia in a group and throw in Russia — which faces Iran across the Caspian Sea and has served as an ally following the overthrow of the United-States-backed Shah in 1979 — then we find that we are almost halfway to explaining the entire diagram:
```python
shias = {'Assad', 'Iran', 'Lebanon Shias', 'Russia'}

def p1(a, b):
    if (a in shias) != (b in shias):
        return 'hates'
    else:
        return 'supports'

try_predictor(p1)
```
```
Accuracy: 71.429 %

WRONG: Al Qaeda supports Saudi & Gulf
WRONG: Hamas supports Sisi
WRONG: Iran hates Hamas
WRONG: Israel supports Hamas
WRONG: Qatar supports Sisi
WRONG: Saudi & Gulf supports Muslim Brotherhood
WRONG: Sisi supports Muslim Brotherhood
WRONG: Turkey supports Sisi
WRONG: USA supports Muslim Brotherhood
WRONG: USA supports Sisi
WRONG: USA supports Al Qaeda
WRONG: USA supports Hamas
```
You may be a bit confused about why I am performing a pair of `in` operations and then comparing the output with an `!=` inequality operator. The reason is that I am looking for situations where the answers are either `True` and `False` or else the values `False` and `True`, either one of which indicates that `a` and `b` fall on opposite sides of the division.
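In other words, the `!=` between two membership tests acts as an exclusive-or. A quick sketch, reusing the article’s `shias` set:

```python
shias = {'Assad', 'Iran', 'Lebanon Shias', 'Russia'}

# != on two booleans is True exactly when they differ, i.e. when
# the two actors fall on opposite sides of the Shia/non-Shia split.
print(('Iran' in shias) != ('USA' in shias))     # opposite sides: True
print(('Iran' in shias) != ('Russia' in shias))  # both inside: False
print(('USA' in shias) != ('Israel' in shias))   # both outside: False
```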
This predictor brings our success rate to more than 70%.
But there is obviously more going on here, because nearly 30% of the links in the diagram are still being reported incorrectly. Take a moment to read over the list of mis-predictions above. Do they share anything in common?
What our first predictor seems blind to is the opposition between populist Islamist movements and most of the nation-states involved in the region. The Arab Spring has raised the possibility that several of these organizations will make significant political gains if they can turn their popular support into votes in newly created democracies, but they are considered terrorist organizations by many Western nations and their allies.
Three state actors, though, have allied themselves with the Islamist movements instead of opposing them. Theocratic Iran was itself born of an Islamist revolution in 1979. Turkey is a secular democracy that has been flirting with the idea of a more explicitly Islamist government. And Qatar is a more interesting case: while the government itself is an autocracy, it is a Wahhabi state and thus is strongly aligned with the earnestly conservative Islam that motivates many of these political and religious groups.
Adding these two rough allegiances into our model, and assuming that Islamists always aid one another while Islamists and moderates are always at odds, very nearly completes the entire diagram!
```python
islamists = {'Al Qaeda', 'Hamas', 'Muslim Brotherhood', 'Iran', 'Turkey', 'Qatar'}
moderates = {'Saudi & Gulf', 'Sisi', 'Israel', 'USA'}

def p2(a, b):
    either = {a, b}
    if (a in islamists) and (b in islamists):
        return 'supports'
    elif (either & islamists) and (either & moderates):
        return 'hates'
    elif (a in shias) != (b in shias):
        return 'hates'
    else:
        return 'supports'

try_predictor(p2)
```
```
Accuracy: 95.238 %

WRONG: USA hates Muslim Brotherhood
WRONG: USA supports Sisi
```
Note my careful use of Python set operations to contrive a succinct expression for “if one of the players is populist and the other is autocratic” — if it were not for the ability to do a quick test for an intersection between one of the inputs and either the `islamists` set or the `moderates` set, this new `if` statement would have had to run to several lines.
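The intersection test works because a non-empty set is truthy in Python. A quick sketch with the article’s two camps:

```python
islamists = {'Al Qaeda', 'Hamas', 'Muslim Brotherhood', 'Iran', 'Turkey', 'Qatar'}
moderates = {'Saudi & Gulf', 'Sisi', 'Israel', 'USA'}

either = {'Hamas', 'Israel'}

# Each & yields the overlap between the pair and one camp; a non-empty
# result is truthy, so together the tests ask "is one player in each camp?"
print(bool(either & islamists))               # 'Hamas' overlaps: True
print(bool(either & moderates))               # 'Israel' overlaps: True
print(bool({'Russia', 'Assad'} & islamists))  # no overlap: False
```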
The only thing now missing is that our political predictor never outputs the result `"clueless"` and thus cannot correctly predict the stance of the United States with respect to the power struggle in Egypt. I will leave to more informed political commentators whether this characterization of the current administration is fair or not; for our purposes, the only point is that it requires the addition of but a third clause to our predictor, yielding an absolutely perfect `p3()`:
```python
egypt = {'Muslim Brotherhood', 'Sisi'}

def p3(a, b):
    either = {a, b}
    if a == 'USA' and b in egypt:
        return 'clueless'
    elif (a in islamists) and (b in islamists):
        return 'supports'
    elif (either & islamists) and (either & moderates):
        return 'hates'
    elif (a in shias) != (b in shias):
        return 'hates'
    else:
        return 'supports'

try_predictor(p3)
```

```
Accuracy: 100.000 %
```
And we are done.
For all of its chaotic hand-drawn relationships, the Big Pharaoh diagram really models only two regional feuds, combined with a swipe at the United States for its caution in engaging with either of two warring factions within today’s Egypt.
I draw three lessons about information visualization from the fact that a diagram whose politics are so simplistic has been re-blogged as evidence that the Middle East is complicated.
First, the diagram presents a puzzle for which human vision is simply not optimized. Never, to my knowledge, does Nature present a hunter-gatherer with a web of different-colored links and demand a quick intuition about whether the nodes form only a few basic groupings or are hopelessly splintered into many. So presenting the information this way makes it basically opaque.
Second, our eyes are very sensitive to similarities between shapes, yet the diagram takes a uniform relationship like “supports” and splays it across the page at a half-dozen different angles and sizes to create a perception of chaos. The fact that the arrows are hand-drawn adds an extra level of visual noise that is simply icing on the cake.
Finally, edge-coloring turns out to be a fairly expensive way to illustrate nodes that fall into a few groups, because in the general case you wind up drawing n² edges when instead you could just use 3 or 4 colors to label broad groups and then explain the relationships among them. You could even use a mix of node-colorings and edges: imagine a map of the Thirty Years’ War that colors Catholic countries one color, Protestant countries another, and then has a few annotations thrown in to explain the exceptions to those natural allegiances that arose during the protracted conflict. I suspect that the same approach would work better here.