What Data To Collect

The most interesting part of Recursive Exhaustion is the recursive step of multiplying the number of data fields.  My genealogical example shows how a sequence of five numbers serving as a descriptor for a person can be expanded first to a sequence of fifteen, then to a sequence of forty-five.

The question of which numbers should begin the sequence is important.  In that example the first few are ridiculous.  In a similar version using six numbers, the first three are badly chosen.  Mine would be 45, 80085, 8.  They represent the 45th most frequent first name in a list from an old US census, which is Douglas; the 80085th most frequent last name in a list of last names in that census, Pardoe; and the 8th most frequent name in that same list of last names, Wilson.  This does indeed produce a unique descriptor for me because nobody else in the world has those three names.

But this is absurd.  Collecting the numerically closest names from the lists would give the name Henry Parayuelos Moore.  That is probably unique because of the rare middle name, but even considering just the first and last names, the person would be an unlikely match.  There are no Henrys in my family that I know of, and the only Moore is six generations back.  The problem is that name frequency is a terrible way of assigning number sequences to represent names.  I discuss this in another post on the same genealogy site.

A proper sequence of numbers for describing a person’s name might include three or four numbers for each name, totaling from nine to twelve.  If done properly, a nearby point in the vector space representing a person’s name would actually be a similar name.

Whether those fields would be useful or not is an open question, but it is possible.  In my genealogical examples, mentioned here because they are easy to understand, fields representing a birthdate would be very useful.  Even more useful would be two fields representing the latitude and longitude of the person’s birth.

Other useful fields would include a few forming a vector representation of the person’s occupation.  Still others might include the distance the person traveled from their birthplace during their life.  On one side of my family were mariners who traveled thousands of miles.  The other side included many farmers who probably didn’t travel more than fifty miles from their birthplace.
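As a rough sketch of how such fields could be held together, the record below flattens a few of them into a list of numbers.  The class name, the field names, and every value are hypothetical placeholders chosen for illustration, not data from my family tree, and the occupation vector is left out to keep it short.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class FirstOrderFacts:
    """A few numeric fields describing one person (names are illustrative)."""
    birth_year: float
    birth_latitude: float
    birth_longitude: float
    max_miles_from_birthplace: float

    def as_vector(self) -> List[float]:
        # Flatten the fields, in declaration order, into a plain list.
        return list(astuple(self))


# Made-up placeholder values: a farmer who stayed close to home
# and a mariner who did not.
farmer = FirstOrderFacts(1850.0, 40.0, -75.0, 50.0)
mariner = FirstOrderFacts(1850.0, 40.0, -75.0, 3000.0)

farmer_vector = farmer.as_vector()
mariner_vector = mariner.as_vector()
```

Adding a field to the class lengthens every person's vector by one, which is all that "adding fields" means here.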

As more and more fields are added, the description of the person gets better and better.

These can be called first order facts.  They would say a lot about the individual, but many more fields could be added, producing a better description.  To the first order facts about me, one could add the second order facts, which are those of my father and mother.

It might be possible to create a vector of 100 numbers which would be my first order description.  If the fields were well chosen, the linear vector space would have the desired properties:  if the dot product of someone else’s vector with my own was a high positive value, we would be similar people.
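The comparison can be sketched in a few lines.  The three short vectors are invented for illustration (a real description would have about 100 components), and the cosine version is my own addition: it divides out the lengths of the vectors so that a person with large field values does not look similar to everyone.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product scaled by both vector lengths; 0.0 if either is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Invented three-component descriptions, for illustration only.
me = [1.0, 2.0, 0.0]
similar_person = [1.0, 2.0, 0.5]
different_person = [-1.0, 0.0, 3.0]

print(dot(me, similar_person))    # 5.0
print(dot(me, different_person))  # -1.0
```

This only works as a similarity measure if the fields are on comparable scales, which is part of what "well chosen" has to mean.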

But I would also be well described as a son of each parent.  I am somewhat like my father and somewhat like my mother.  Appending their 100-number descriptions to my own would produce a vector of 300 components which would be a much better description of me.  It would have the advantage of making my description much closer to that of my brother, which is a clear improvement.  Our basic 100-component vectors differ too much.

Using genealogy alone, one could continue this backward, producing 900-component vectors.  That strategy would make 800 of the 900 components identical to my brother’s, exaggerating our similarities.
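The arithmetic can be checked with a small sketch.  It assumes the recursive step is "my description, then my father's, then my mother's, each taken one order lower", which is how 100 becomes 300 and then 900.  The function name, the family dictionary, and the placeholder vectors are all hypothetical.

```python
def describe(person, parents, facts, order):
    """Return the order-N description of `person` as one flat list."""
    if order == 1:
        return list(facts[person])
    father, mother = parents[person]
    return (describe(person, parents, facts, order - 1)
            + describe(father, parents, facts, order - 1)
            + describe(mother, parents, facts, order - 1))


# Placeholder family: two brothers, their parents, four grandparents.
people = ["me", "brother", "father", "mother",
          "fathers_father", "fathers_mother",
          "mothers_father", "mothers_mother"]
# Each person gets a distinct 100-component vector of made-up values.
facts = {name: [float(i)] * 100 for i, name in enumerate(people)}
parents = {
    "me": ("father", "mother"),
    "brother": ("father", "mother"),
    "father": ("fathers_father", "fathers_mother"),
    "mother": ("mothers_father", "mothers_mother"),
}

mine = describe("me", parents, facts, 3)
his = describe("brother", parents, facts, 3)
shared = sum(a == b for a, b in zip(mine, his))

print(len(describe("me", parents, facts, 2)))  # 300
print(len(mine))                               # 900
print(shared)                                  # 800
```

Note that under this scheme each parent's 100 numbers appear twice in the 900, once in my own second order description and once at the head of theirs.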

That is a strong argument for not using genealogy alone.  Other connections between people could be used to make the differences stand out.  We did not marry the same kind of woman and had quite different children.  We had very different friends while growing up.

Adding all of these social connections makes for a much larger vector and also increases the multiplication factor by which one iteration adds second, third and higher order fields to a description.

A difficult problem is missing data.  I will discuss that elsewhere.  For now assume that all of the data mentioned is available and just consider the most useful fields.

Posted in Uncategorized | Leave a comment

Recursive Exhaustion Method for Social Data Collection

This website explores a remarkable new method for social data collection which makes the notorious acts of Cambridge Analytica look trivial.


What is Recursive Exhaustion?

Recursive Exhaustion is an algorithm. Though it could be used for various purposes, here I discuss it only in the context of social technology.

Among other techniques applicable to society are those for collecting and using personal data. The most notorious example of this to date is the misuse of Facebook information by Cambridge Analytica.

To me, this was almost trivial. They collected a small amount of information on a mere 87 million people. Through Recursive Exhaustion it is possible to collect a vast amount of information about almost everybody.

Essentially, Recursive Exhaustion works by repeatedly exhausting the space of known individuals and their attributes.

In computer science there is a method known as an exhaustive search, also known as a brute-force search. Sometimes it is referred to by its fundamental technique and is known as generate and test. It is one of the most powerful of the general problem-solving techniques, but is computationally expensive.

An exhaustive search is usually conducted on a tree structure, which is a discrete combinatorial object. One might somehow transform a list of people into a tree structure then perform a search to find a person meeting certain characteristics.
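A minimal sketch of such a search follows. It visits every node of a fixed tree and tests each one, which is the generate and test idea in its simplest form. The tree, the single-letter names, and the birth years are invented placeholders, and how a real list of people would be turned into a tree is left open.

```python
def exhaustive_search(node, is_match):
    """Visit every node in the tree and collect the names that pass the test."""
    matches = [node["name"]] if is_match(node) else []
    for child in node.get("children", []):
        matches.extend(exhaustive_search(child, is_match))
    return matches


# A made-up tree of four people, for illustration only.
tree = {
    "name": "A", "birth_year": 1850,
    "children": [
        {"name": "B", "birth_year": 1880,
         "children": [{"name": "D", "birth_year": 1910}]},
        {"name": "C", "birth_year": 1905},
    ],
}

born_before_1900 = exhaustive_search(tree, lambda n: n["birth_year"] < 1900)
print(born_before_1900)  # ['A', 'B']
```

The cost is plain from the code: every node is tested, whether or not it could possibly match.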

The problem with this is that the human population changes. People are born and die. Living people change all the time. A fixed tree structure for the human race is impossible.

A recursive exhaustive search is one in which the attributes of one person are reevaluated at each step by considering all changes in his or her social environment. The individuals in that social environment will also have to be reevaluated, so the search for one person requires a data collection step which can propagate recursively throughout the whole population.

As applied to the whole of human society, this violates the most fundamental requirement of a recursive algorithm: it has no end condition.

Nor should it. There is no end to the changes society goes through.
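The propagation can be sketched as follows. This is only an illustration under stated assumptions, not one of the implementations discussed elsewhere: it uses a queue in place of true recursion, and it adds an artificial step budget, `max_steps`, purely so the example halts. The function name, the three-person graph, and the always-changed `reevaluate` stand-in are hypothetical.

```python
from collections import deque


def propagate(start, neighbours, reevaluate, max_steps):
    """Re-evaluate `start`, then everyone affected, for at most `max_steps` steps."""
    queue = deque([start])
    order = []
    while queue and len(order) < max_steps:
        person = queue.popleft()
        changed = reevaluate(person)
        order.append(person)
        if changed:
            # A change in one person forces a fresh look at their social circle.
            queue.extend(neighbours[person])
    return order


# A made-up social environment of three people.
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# Assume every re-evaluation finds a change, as in a society that never settles.
visited = propagate("a", neighbours, lambda person: True, max_steps=5)
print(visited)  # ['a', 'b', 'a', 'c', 'b']
```

Without the budget the loop never empties its queue, which is the missing end condition in concrete form.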

Various implementations of this algorithm are discussed on another page. Details of its application to human society are elsewhere.
