What Data To Collect

The most interesting part of Recursive Exhaustion is the recursive step of multiplying the number of data fields.  My genealogical example shows how a sequence of five numbers serving as a descriptor for a person can be expanded first to a sequence of fifteen, then to a sequence of forty-five.

The question of what numbers to begin the sequence with is important.  In that example the first few are ridiculous.  In a similar version using six numbers, the first three are badly chosen.  Mine would be 45, 80085, 8.  They represent the 45th most frequent first name in a list from an old US census, which is Douglas, plus the 80085th most frequent last name in that census's list of last names, Pardoe, and the 8th most frequent name in that same list of last names, Wilson.  This does indeed produce a unique descriptor for me because nobody else in the world has those three names.

But this is absurd.  Collecting the numerically closest names from the lists would give the name Henry Parayuelos Moore.  That is probably unique because of the rare middle name, but even considering just the first and last names, the person would be an unlikely match.  There are no Henrys in my family that I know of, and the only Moore is six generations back.  The problem is that name frequency is a terrible way of assigning number sequences to represent names.  I discuss this in another post on the same genealogy site.

A proper sequence of numbers for describing a person’s name might include three or four numbers for each name, totaling from nine to twelve.  If done properly, a nearby point in the vector space representing a person’s name would actually be a similar name.

Whether those fields would be useful or not is an open question, but it is possible.  In my genealogical examples, mentioned here because they are easy to understand, fields representing a birthdate would be very useful.  Even more useful would be two fields representing the latitude and longitude of the person’s birth.

Other useful fields would include a few forming a vector representation of the person’s occupation.  Another field might be the distance the person traveled from their birthplace during their life.  On one side of my family were mariners who traveled thousands of miles.  The other side included many farmers who probably didn’t travel more than fifty miles from their birthplace.
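To make the idea of first order fields concrete, here is a minimal sketch in Python.  The class name FirstOrderFacts, its four field names, and the sample values are all hypothetical, chosen only to mirror the fields mentioned above.  They are not real data about anyone.

```python
from dataclasses import dataclass, astuple


@dataclass
class FirstOrderFacts:
    """A hypothetical handful of first order fields for one person."""
    birth_year: float
    birth_latitude: float    # degrees, north positive
    birth_longitude: float   # degrees, east positive
    max_travel_miles: float  # farthest distance from the birthplace

    def as_vector(self):
        # Flatten the record into a plain list of numbers, in field order.
        return [float(value) for value in astuple(self)]


# Made-up values for illustration only.
mariner = FirstOrderFacts(1850.0, 41.5, -70.9, 5000.0)
farmer = FirstOrderFacts(1850.0, 43.1, -72.3, 50.0)

print(mariner.as_vector())
print(farmer.as_vector())
```

The raw numbers are on very different scales, so a real version would probably need to rescale each field before comparing people.  That step is left out here.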

As more and more fields are added, the description of the person gets better and better.

These can be called first order facts.  They would say a lot about the individual, but many more fields could be added, producing a better description.  To the first order facts about me, one could add the second order facts, which are those of my father and mother.

It might be possible to create a vector of 100 numbers which would be my first order description.  If the fields were well chosen, the linear vector space would have the desired properties:  if the dot product of someone else’s vector with my own were a large positive value, we would be similar people.
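The dot product test can be sketched in a few lines of Python.  The three-field vectors below are made up for illustration, and I am assuming the fields have been centered so that values can be negative as well as positive.  The cosine function is my own addition, a scale-independent variant of the same comparison.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two descriptors with the same number of fields."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of fields")
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by both lengths, so it lies between -1 and 1."""
    norm = sqrt(dot(a, a)) * sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Hypothetical centered field values, not real data.
me = [0.9, -0.2, 0.4]
similar_person = [0.8, -0.1, 0.5]
different_person = [-0.7, 0.6, -0.3]

print(dot(me, similar_person))    # positive: the fields agree in sign
print(dot(me, different_person))  # negative: the fields disagree in sign
```

With 100 well chosen fields the same functions would apply unchanged.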

But I would also be well described as a son of each parent.  I am somewhat like my father and somewhat like my mother.  Adding their 100-number descriptions to my own would produce a vector of 300 components which would be a much better description of me.  It would have the advantage of making my description much closer to that of my brother, which is a clear improvement.  Our basic 100-component vectors differ too much.

Using genealogy alone, one could continue this backward, producing 900-component vectors.  That strategy would make 800 of those components identical to my brother’s, exaggerating our similarities.
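The counting can be checked with a short Python sketch of the recursive step.  The function expand, the placeholder names such as pgf (paternal grandfather), and the constant-valued vectors are assumptions of mine, used only so that the fields can be counted.  Each step replaces a person’s descriptor with their own followed by those of their father and mother, so the length triples each time.

```python
FIELDS = 100  # size of a first order description


def expand(person, first_order, parents, depth):
    """Return the descriptor of `person` after `depth` recursive steps."""
    if depth == 0:
        return list(first_order[person])
    result = []
    # One step: the person, then the father, then the mother,
    # each expanded to the previous depth.
    for relative in (person, *parents[person]):
        result.extend(expand(relative, first_order, parents, depth - 1))
    return result


# Placeholder people; each gets a distinct constant vector (not real data).
names = ["me", "brother", "father", "mother", "pgf", "pgm", "mgf", "mgm"]
first_order = {name: [float(i)] * FIELDS for i, name in enumerate(names)}
parents = {
    "me": ("father", "mother"),
    "brother": ("father", "mother"),
    "father": ("pgf", "pgm"),
    "mother": ("mgf", "mgm"),
}

mine = expand("me", first_order, parents, 2)
his = expand("brother", first_order, parents, 2)
shared = sum(1 for a, b in zip(mine, his) if a == b)
print(len(mine), shared)  # 900 800
```

In this sketch only the first 100 positions differ between the two brothers.  Everything contributed by parents and grandparents is the same for both.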

That is a strong argument for not using genealogy alone.  Other connections between people could be used to make the differences stand out.  We did not marry the same kind of woman and had quite different children.  We had very different friends while growing up.

Adding all of these social connections makes for a much larger vector and also increases the multiplication factor by which one iteration adds second, third and higher order fields to a description.

A difficult problem is missing data.  I will discuss that elsewhere.  For now assume that all of the data mentioned is available and just consider the most useful fields.
