compilerbitch | The number of the geek is 0x029A

Hi folks,

I was chatting to foxypinkninja the other day, expounding my usual rant about the inadequacy of English when defining imperfectly understood concepts, particularly those related to gender and sexuality. It occurred to me that there might be a more effective, admittedly geekier, solution that could put an end to all world strife. Well, probably not, but it would be fun, anyway. Chatting a bit more to doseybat today made me decide to post about the idea before brain fade did for it and consigned it to the maybe-perhaps-when pile.

The basic thesis is pretty simple: we are all usually defined by large characteristics, such as male, female, tall, short, sporty, couchpotatoish, gay, straight, bi, lesbian, mono, poly, polyfi, swinger, BDSM, vanilla, trans, cisgendered, etc. These large characteristics are all inherently broad-brush definitions that are easily broken by specific counterexamples, and in any case they have definitions that are difficult to agree upon. My argument is that, in reality, we are defined by a potentially very large number of small characteristics that collectively (and often imperfectly) entail the large characteristics. By finding a large (though hopefully reasonably minimal) set of small characteristics that are individually easy to agree upon, it then becomes possible to define something like a sociological genome for people, potentially allowing the aforementioned large characteristics to be discarded.

By definition, a characteristic is small if and only if it doesn't subdivide into component characteristics (e.g. male is not a small characteristic because it can be subdivided into (at the very least) physical primary and secondary sexual characteristics, which may or may not match the person's gender orientation).

If it's possible to come up with a set of such characteristics that is necessary (in the sense that removing any of them would reduce specificity) and sufficient (in the sense that adding more characteristics wouldn't do much to improve specificity), we can encode it as a bit string that can then be represented conveniently. I had a couple of ideas for this. One obvious way is to encode it as a hex number, because this is reasonably compact, ASCII compatible and relatively easy to cut and paste or to type manually. Given your 'geek ID', it should then be relatively simple to build gadgets like PDA/cellphone/web applications that can encode and decode these things, and (interestingly) compare them with those of people near you, possibly automatically.
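As a rough sketch of the hex idea: assume each small characteristic is a yes/no answer and that everyone agrees on the order of the questions. The trait names below are made up purely for illustration, and the function names are hypothetical too. Packing the answers one bit at a time gives a number that prints as a short hex string, and XOR-ing two IDs counts the traits on which two people differ.

```python
# Hypothetical trait list; the names and their order are illustrative only.
TRAITS = ["likes_sport", "owns_a_cat", "writes_code", "drinks_tea"]

def encode_geek_id(answers):
    """Pack yes/no answers into a hex string, first trait = most significant bit."""
    value = 0
    for trait in TRAITS:
        value = (value << 1) | int(bool(answers.get(trait, False)))
    width = (len(TRAITS) + 3) // 4  # number of hex digits needed
    return f"0x{value:0{width}X}"

def decode_geek_id(geek_id):
    """Recover the yes/no answers from a hex string."""
    value = int(geek_id, 16)
    n = len(TRAITS)
    return {trait: bool((value >> (n - 1 - i)) & 1) for i, trait in enumerate(TRAITS)}

def shared_traits(id_a, id_b):
    """Count the traits on which two IDs agree (XOR marks the differences)."""
    differing = bin(int(id_a, 16) ^ int(id_b, 16)).count("1")
    return len(TRAITS) - differing

my_id = encode_geek_id({"likes_sport": True, "writes_code": True})
print(my_id, decode_geek_id(my_id), shared_traits(my_id, "0xB"))
```

Note that comparing IDs bit by bit like this treats every trait as equally important, which is a simplification.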

OK, here's the second (and possibly geekier) part of the idea. This came out of figuring out how it might be possible to derive a reasonably compact set of characteristics -- we don't really want to have the okcupid.com umpty gazillion not-very-relevant questions here. It occurred to me that it is probably possible to have a bunch of people take an online test that deliberately has way more questions than are strictly necessary, then feed all their results into the kind of algorithm used for pulling trees from DNA sequences. I originally thought of tools like MrBayes, but doseybat recommends that I should probably find out about Principal Components Analysis. Once you get a tree, it's probably fairly straightforward to do an analysis on the results to figure out which characteristics are predictive and which can be discarded as noise, so from that it should be possible to find a reduced set that, when analysed, generates the same (or at least a very similar) tree.
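To make the "discard the noise" step concrete, here is a much cruder stand-in than either tree-building or PCA, and it shouldn't be mistaken for them. Assuming answers are coded as 0/1, it drops questions that nearly everyone answers the same way, and questions that almost perfectly duplicate (or invert) one already kept. The thresholds and the function names are arbitrary assumptions.

```python
from statistics import pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def prune_questions(responses, min_variance=0.05, max_corr=0.95):
    """responses: one list of 0/1 answers per person. Returns indices of questions kept."""
    columns = list(zip(*responses))  # one tuple of answers per question
    kept = []
    for i, col in enumerate(columns):
        if pvariance(col) < min_variance:
            continue  # nearly everyone gives the same answer: tells us nothing
        if any(abs(pearson(col, columns[j])) > max_corr for j in kept):
            continue  # repeats (or mirrors) a question we already kept
        kept.append(i)
    return kept

# Six imaginary people, four questions: question 1 is constant,
# and question 2 is the exact opposite of question 0.
responses = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
print(prune_questions(responses))
```

A real analysis would also want to check that the reduced question set still reproduces the same grouping of people, which this sketch doesn't attempt.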

Anyway, there are more things that you can do with a tree like that. For a start, you can use it to classify new data, so (assuming the original tree is detailed enough) you don't need to do PCA every time you have a new person come along and want to know where they end up in the tree. The second (and IMO very cool) thing you can do is, once you've found somewhere in the tree to place someone, you can encode their position in the tree by starting at the root and making a note of which branches lead to the relevant leaf. Such encodings are likely to be very compact and efficient, certainly O(log N) in size with respect to the number of leaves in the tree (at least if the tree is reasonably balanced), but (cooler still) if you encode the position in the tree most-significant-bit first as you go from the root to the leaf, you end up with a number with some interesting properties: given any two data points on the tree in close proximity to each other, their encodings are also guaranteed to be close to each other. A simple binary encoding running MSB to LSB isn't ideal, though, because the reverse doesn't hold: you can still end up with adjacent codes that actually reflect the upper and lower bounds of not-really-very-connected main branches. A better encoding would space the allocation of codes such that, if you take the tree and encode it, then do cluster analysis on the codes, you get the tree back again. This would be incredibly cool -- we could then have a single number, say in the range 0-9999 decimal, which both accurately places us in the tree, and also makes it possible to compare ourselves with other people just by taking the difference in the code. If I'm a 1378 and you're a 1392, we're probably going to be very similar, but if you're a 9326, that's unlikely. The actual ordering isn't necessarily meaningful, of course, but differences actually are. I am tempted to cheekily flip the tree around a bit so that the numbers form some kind of conservative -- liberal continuum, but it might be better to avoid that for political reasons. Or not. I'd certainly like it if it worked like that. I suspect that the fattest branches will correspond most closely to the familiar large characteristics, though it's quite possible that they won't.
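A small sketch may help show both the nice property and the catch. Assume a binary tree in which each leaf is described by its path from the root, written as a string of '0' (left) and '1' (right). Reading that string as a binary number, MSB first, is the simple encoding. The spaced version is only one possible way to get the better encoding, not a worked-out scheme: it makes the gap between neighbouring leaves grow with how far up the tree their branches part company. All of the names and the `spread` factor are assumptions.

```python
def path_code(path, depth):
    """Read a root-to-leaf path ('0' = left, '1' = right) as an MSB-first number."""
    return int(path.ljust(depth, "0"), 2)  # pad short paths to a fixed width

def shared_prefix(a, b):
    """How many branch choices two paths have in common, counting from the root."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def spaced_codes(paths, spread=4):
    """Give leaves codes whose neighbour gaps grow with the height of the split."""
    ordered = sorted(paths)  # left-to-right order of the leaves
    depth = max(len(p) for p in ordered)
    codes = {ordered[0]: 0}
    for prev, cur in zip(ordered, ordered[1:]):
        levels_apart = depth - shared_prefix(prev, cur)
        codes[cur] = codes[prev] + spread ** levels_apart
    return codes

a, b, c = "0110", "0111", "1000"
# Simple encoding: a and b are siblings, b and c split at the root,
# yet both pairs differ by exactly 1.
print([path_code(p, 4) for p in (a, b, c)])
# Spaced encoding: the root-level split now shows up as a big gap.
print(spaced_codes([a, b, c]))
```

With the spaced codes, sorting the differences between neighbours from smallest to largest replays the tree from the leaves upwards, which is roughly what a single-linkage cluster analysis would do. Squeezing the result into a fixed range like 0-9999 would need a further rescaling step that isn't shown here.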

It's also possible to use this a bit like a dating site by answering the questions as if you were someone you'd want to date. Even cooler, you could also answer it as if you were someone you'd never date in a million years. We could then have a pin badge with three numbers on it: a green one specifying who we'd like to meet, a black one specifying ourselves, and a red one saying that if you're close to this number, just don't even ask!
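The badge check itself would be trivial once the codes exist. As a sketch, assuming codes where a small difference really does mean "similar", and with a made-up tolerance of 50:

```python
def badge_verdict(their_black, my_green, my_red, tolerance=50):
    """Compare someone's black (self) number against my green and red numbers."""
    if abs(their_black - my_red) <= tolerance:
        return "avoid"  # red wins if someone is somehow close to both
    if abs(their_black - my_green) <= tolerance:
        return "approach"
    return "neutral"

# Reusing the example numbers from above as green and red.
print(badge_verdict(1392, my_green=1378, my_red=9326))
print(badge_verdict(9300, my_green=1378, my_red=9326))
```

How wide the tolerance should be would depend entirely on how the codes end up being spaced.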

So, cool, or a step too far?


(though principal components analysis is probably better -- I must get around to reading about it)

Coincidentally, I've been doing lots of work recently on clustering text documents by similarity, and have just implemented a simple clustering algorithm like the one you describe. Am just about to move on to some more advanced algorithms which basically involve doing PCA first...

Cool!

As an aside, I'm intending to visit the UK for a week or two around the end of January/beginning of February. I'll probably be staying in Cambridge for at least half of that time, so if you're up for meeting up at some point for a good chat, that would be great. :-)
