Recursive Exhaustion Wipes Out Privacy

The summary:  it is possible to collect very large amounts of social data without anyone’s knowledge or permission.   What Cambridge Analytica did was trivial, a drop in a large bucket.  See RecursiveExhaustion.com for more information.

The easy explanation:  start with some data about you. Combine that with similar data about important people, places and institutions in your life. Do that for everyone, every place, every institution, forming an enlarged data record for each. Boil off useless and conflicting data. Repeat until your supercomputer installation runs out of disk space.

Optional:  from time to time, do this on the transpose of the database, thereby increasing the number of entities described.
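To make the steps above concrete, here is a minimal Python sketch of a single expansion pass, under loose assumptions: each entity's record is a dictionary of numeric fields, a separate link table lists the people, places and institutions connected to each entity, and the "boil off" step is reduced to a toy filter. The names and data are invented for illustration, and the optional transpose step is not shown.

```python
# Minimal sketch of one recursive-exhaustion pass, purely illustrative.
# Assumptions: each entity's record is a dict of named numeric fields, and
# a separate table lists the entities connected to each one. Neither
# structure comes from any real system.
from typing import Dict, List

Record = Dict[str, float]

def expand_once(records: Dict[str, Record],
                links: Dict[str, List[str]]) -> Dict[str, Record]:
    """Enlarge every record with copies of the records of linked entities."""
    expanded: Dict[str, Record] = {}
    for entity, record in records.items():
        new_record = dict(record)                          # first-order fields
        for neighbour in links.get(entity, []):
            for field, value in records[neighbour].items():
                new_record[f"{neighbour}:{field}"] = value  # higher-order fields
        expanded[entity] = new_record
    return expanded

def boil_off(record: Record, keep: int) -> Record:
    """Toy stand-in for discarding useless and conflicting fields."""
    items = sorted(record.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(items[:keep])

# One "generation" of growth: every record roughly multiplies in width,
# which is why the process is exponential in the literal sense.
records = {"alice": {"age": 34.0}, "bob": {"age": 36.0}, "acme": {"size": 120.0}}
links = {"alice": ["bob", "acme"], "bob": ["alice"], "acme": ["alice", "bob"]}
for _ in range(3):
    records = {k: boil_off(v, keep=50) for k, v in expand_once(records, links).items()}
```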

I am in the process of updating my many websites, pages and posts to reflect the consequences of this process, which grows exponentially in the literal mathematical sense, not as a synonym for “very fast” as in popular usage.  This is an endless process, so if you find something out of date, please check in again later.


What’s a Gigabyte Good For?

This is cross-posted from my new site intended for notes about my unfinished novel, along with some excerpts.  This is the first excerpt, which explains Recursive Exhaustion in a single short paragraph, then says just how much could be done with it.

“OK, enough mystery. Tell me”, he insisted.

She sighed.

“You may regret asking. Here’s the short version. In the mid-1970s a small group of grad students at NYU played about with new kinds of technology for collecting and using information about various entities. It wasn’t very sophisticated at the time, more of a mathematical curiosity. But they all thought about using it to collect a lot of information about a lot of people.”

“I assume they have, which explains the spy vs. spy stuff.”

“Well, they needed a good implementation first, which took a while. There was another problem, now solved. Computers and mass storage devices at the time were not powerful enough for what they wanted to do, but there has been a drastic expansion in capability.”

“Is this technology something I would know about?”

“Probably not. Does the term ‘recursive exhaustion’ sound familiar?”

“No. What is it?”

“In a nutshell, start with some data about you. Combine that with similar data about important people, places and institutions in your life. Do that for everyone, every place, every institution, forming an enlarged data record for each. Boil off useless and conflicting data. Repeat until your supercomputer installation runs out of disk space.”

“Oh. Good idea, remind me to try it sometime.”

“That can be arranged.”

“What did they want to do with all that data?”

“Different things. Some people wanted to exploit the technology for personal gain, while others were idealists and thought it could be used to improve society.”

“What happened?”

“The people wanting to exploit the technology for personal gain had no qualms about acquiring data illegally.”

“Like what?”

“Raw census data with names and addresses is good, income tax records are better. Everything they could get their hands on.”

“Credit card numbers?”

“Useful for stealing money, but pretty crude. Far too easy to detect. Money is one motive, but they wanted larger amounts obtained in undetectable ways. Manipulating the stock market for example. We have evidence that they may have obtained a trillion dollars that way. A million million. Other motives include political power and sexual domination.”

“Lovely. How successful have they been?”

“Very. They started using blackmail and intimidation to make people give them masses of illicit data. From there it was just a short step to using the same means to get whatever they wanted. And whomever they wanted to use, for any reason.”

“What about the other group, the idealists?”

“They decided early on never to use illegally obtained information. They wanted everything completely aboveboard. That attracted some exceptional people who have helped them flourish in unanticipated ways. But the new people were even more idealistic and demanded even more ethical behavior.”

“How has this worked out?”

“Starting with a limited amount of information, the bad guys now have an enormous amount on just about everybody. They could tell you the name of the first girl you took to bed, and how well you performed.”

“You are making this up. I simply do not believe you.”

“My rule of thumb is Clarke’s First Law. Do you know it?”

“You mean what the science fiction author Arthur C. Clarke wrote? ‘Any sufficiently advanced technology is indistinguishable from magic’?”

“That’s his third law, you idiot. His first law is ‘When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.'”

“Hey, I’m not elderly.”

“Not so distinguished either, but you ask the right questions.”

“So what you just told me about is possible?”

“My own attitude is this: don’t try to judge whether something is possible until you’ve tried to figure out how you would do it. Suppose you absolutely had to find out when I first seduced a boy, how would you go about doing it?”

“I have no idea.”

“Poor man. Work on it a while, wait for an epiphany, or whatever it takes. If you disappoint me by failing to figure it out, I’ll tell you. I hope it won’t come to that. If we’re going to hook up sometime, I need to maintain some respect for your intelligence.”

“You are a horrible young woman.”

“Oh no, I’m sure you’d find me thoroughly satisfying, given a chance. Play along and you might get one.”

“OK. How much information do the good guys have?”

“Not nearly as much. Probably only a million terabytes or so altogether. More data on more people than either of us can fully grasp, but all legally obtained.”

“Oh, that’s reassuring. You had me worried for a minute there.”

“Well, using recursive exhaustion and data mining the public stuff can be turned into things you’d rather keep private. It’s ethical to use it because access is strictly limited and every byte originated in public data. There may be a lot of information about your sex life in the database, but no way anyone can see or download it. Nobody can query for anything about you personally. It can only help you find the best people to associate with and the most suitable jobs you can get.”

“And whether I’m in danger or not.”

“That’s true, but the information is even more restricted.”

“What do you know about me?”

“I was looking for someone compatible with me in some useful way, perhaps a lover, perhaps a friend, maybe a co-worker. An ID code for you came up, along with a red flag indicating danger for one or both of us. I was able to see a summary of similar situations in the past. In most of those, both people had been in danger, but the only way of bringing the new one in safely was to meet in person, like this.”

“So you think I am compatible with you in some useful way. I can imagine at least one.”

“I bet.”


Please Do Whatever You Can to Publicize This Website

The prospect of Very Large Scale Social Data Collection is very disturbing. I’m certain that the ability to collect vast amounts of information on everyone, including children, without anyone’s knowledge or permission will change society in ways none of us can imagine. It might even destroy our society — it could put nuclear weapons in the hands of many dangerous people.

I am certain that what’s on this website is correct. Even while I continue to update it, what’s here should be publicized. Please do whatever you can to make sure it reaches not only the experts but the general public.

http://DouglasPardoeWilson.SocialTechnology.ca/


Good vs. Evil, really!

Good vs. Evil — the oldest plot line in history.  This post is about available data — do we try to keep it out of the hands of the bad guys, or do we use it to defend ourselves?   There is an argument used over and over again by the National Rifle Association:  with gun control in place, only the criminals will have guns, and we will be at their mercy.

There is no hiding an algorithm like Recursive Exhaustion from the criminals.  Ultimately, there is no hiding our data from them either.

What about hiding it from ourselves, or from others we may consider the “good guys”?    I say that’s fundamentally impossible.  There are masses of information available in public records.  What to do about it?   Should we withdraw it from public scrutiny?  Make it accessible only by the in-person request of individual users?

I think that falls into the “we should have thought about that” category.  A large part of this data has been collected in digital form or transcribed into that form.  Much of it is available on the Internet.   Once public, always public, so the use of this information is legal.   That means even the most well-meaning individuals determined to stay within the bounds of the law have access to a lot of information already.  Using methods like recursive exhaustion, this can be multiplied millions of times.  At least.

What criminals can do with a lot of illegally obtained information is truly staggering.

I have too much to say about this for one post.  I will link to another, a copy of an article I published on Medium (which accepts almost anything for publication).  It is about using anonymously collected data to derive good estimates of data otherwise unavailable for known individuals.


Using Anonymous Data

Enormous Value and Use of Anonymously Collected Social Data

I don’t want to give the game away, but I have too many technical details on a (hopefully) imaginary conspiracy in the techno-thriller novel I’ve been writing to actually fit within its pages. I can’t resist mentioning some here.

One thing I have tried to explain is the enormous value of data collected anonymously, and I do mean anonymously. I’ve tried to explain how few people would supply what I want over the Internet, and why in-person collection seems essential. Then I tried to explain why elaborate methods are needed, such as giving respondents gloves to wear (to avoid leaving fingerprints) and proving to them that their answers cannot be connected to them in any way. Hard to do, very hard. I’ve taken page after page to explain this, but it’s just too much to put into a work of fiction. Here is why it is important:

Suppose you asked a diverse set of 10,000 people 101 questions. Among them, 100 are ordinary ones about interests, personality, educational background, and so on. But one question is highly charged, like “Have you ever molested a child?” If the respondents doubt their anonymity, they are most likely to answer “No”, regardless of the facts. On the other hand, if they truly believe nobody will ever know who provided the answers on their response pages, then many will risk telling the truth.

I also tried explaining ways of using cross-validation to weed out people who are fundamentally dishonest even in conditions of anonymity, but that was too technical for a work of fiction. Anyway, it doesn’t matter that much, as long as enough people feel free to tell the truth.

The next step is to train some piece of machine-learning software, such as a neural network, on the answers to the 100 innocuous questions, with the answer to the charged question as the target. As tests should verify, once trained, the mechanism (e.g. the trained network) would be useful for predicting the felonious behavior of an individual from the answers to the 100 innocuous questions.
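As a rough illustration of this training step, here is a minimal sketch assuming the anonymous answers have already been coded as numbers, with a plain logistic regression standing in for the neural network. The arrays are random placeholders, so the sketch only shows the mechanics; real survey data would be needed for the model to find any signal.

```python
# Illustrative sketch only: X holds the coded answers to the 100 innocuous
# questions for the anonymous respondents, y holds the answer to the single
# charged question (1 = "yes", 0 = "no"). Both are invented here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(10_000, 100)).astype(float)   # stand-in survey answers
y = rng.integers(0, 2, size=10_000)                        # stand-in charged answer

# Hold out part of the anonymous data to check that the model generalises.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Later, the same 100 questions are put to named individuals, and the model
# estimates the probability of the behaviour they were never asked about.
new_answers = rng.integers(1, 6, size=(1, 100)).astype(float)
print("estimated probability:", model.predict_proba(new_answers)[0, 1])
```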

It would probably not be hard to get potentially important people, job applicants, or criminal suspects to answer the 100 questions which don’t seem at all incriminating. Their answers could be used to estimate which of those people have done something terrible.

This could be useful to employers or law enforcement, but also very dangerous. In the novel I am trying to write, evil conspirators use this technique to find people susceptible to blackmail.

Now comes the tricky part. I think I can demonstrate that to filter out the felons it’s not necessary to ask people to answer the 100 apparently harmless questions. That information is already out there. I’m doing a bit better explaining the technique for accessing it, but it’s still pretty technical. I’ll put out an account of it later, if anyone seems interested in what I’ve written here.

The basic message I have for people trying to use data science in the social realm is that a response dataset from individuals who are entirely convinced of their anonymity and feel free to tell the truth is of enormous value. If it contains the right questions, it could be worth millions of dollars, many times what it cost to collect.

Consider even what Cambridge Analytica did, which influenced the US election. A more insidious way of doing this would be to ask an anonymous group of respondents 100 seemingly non-political questions and one political one, such as “Republican or Democrat?” A trained neural network could then be used to predict the political leanings of people asked only the 100 apparently non-political questions. This is a well-known technique, but users of data science often forget the importance of a truly anonymous dataset, collected from a very diverse group of people. Such a dataset could be of great value.

The basic message I have for people who like techno-thriller novels is that the use of such anonymous datasets could put enormous power in the hands of an evil conspiracy. My goal is to scare people with valid techniques of data science, without boring technical detail. But I love the details myself, and can’t resist writing about them, hence this story.

By the way, there is already enough data in publicly available datasets from social surveys to do this. Not enough attention has ever been given to anonymity, but probably enough that those datasets could be used to reveal some nasty personal data, for purposes of blackmail or intimidation.


What’s Wrong With The Brooklyn Bridge Example

Section of Brooklyn Bridge Example Image

This is a piece of the Brooklyn Bridge example image in the Internal Recursive Exhaustion post.  It is remotely possible that in my lazy slapdash way I constructed a poor example.  If you read the full page, you will see that the final sign has a sequence of ever-smaller inset photos, which ideally would all be the same size and taken from the same perspective.  Ideally they should be taken at identical intervals of time, too.  Perhaps one every week.

Clearly the smaller inset images contain less information than the larger one.  They have in effect been compressed.

Recovering data from compressed versions is done every day.  Look at the example on my Acronymic Languages page.  The left hand image is at full resolution.  The right hand image is a reconstructed version of it, obtained from a JPG file which was compressed to one tenth the size of the original file.  Yet it is quite recognizable and not just a blurry version of the first image.

The numerical version of this follows if we can assume that some data fields have been thoroughly linearized.   Suppose the first ten numbers in each 100 component vector are set to zero to start with, while components 11 through 100 represent some valuable data.  For the next iteration, apply a data compression step to the whole 100 numbers, reducing the number of components to 10.   Replace the first 10 numbers (all zeros at the start) with those newly derived ones, which are a compressed version of the whole 100 component vector.  Replace the remaining 90 numbers with the new data.

And repeat.   If the data is nice linear stuff, a good data compression algorithm is to take the SVD of the matrix formed by stacking all the 100 component vectors and project each vector onto the singular vectors corresponding to the largest singular values.  If the projection matrix used at each step is stored, then it is possible to invert the process and recover approximations of the vectors at each step.
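Here is a minimal numerical sketch of one such step, assuming the 100 component vectors for all entities are stacked as rows of a matrix and that a truncated SVD provides the 10 number compressed version. The random data and the use of NumPy are illustrative only; real, highly correlated social data would compress far better than random numbers do.

```python
# Sketch of one internal-recursive-exhaustion step on numeric records.
# Rows of X are the 100-component vectors for all entities; the first 10
# components are reserved for a compressed copy of the previous state.
import numpy as np

rng = np.random.default_rng(1)
n_entities, width, slots = 1000, 100, 10

X = np.zeros((n_entities, width))
X[:, slots:] = rng.normal(size=(n_entities, width - slots))   # current data

# Compress the whole current state to 10 numbers per entity using a
# truncated SVD (the best rank-10 linear compression of X).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:slots]                      # keep this matrix to allow recovery later
compressed = X @ basis.T                # 10 numbers per entity

# Next iteration: old state squeezed into the first 10 slots,
# fresh data written into the remaining 90.
X_next = np.empty_like(X)
X_next[:, :slots] = compressed
X_next[:, slots:] = rng.normal(size=(n_entities, width - slots))

# Because `basis` was kept, an approximation of the previous state can be
# reconstructed from those 10 slots alone (rough here, since the placeholder
# data is random rather than correlated).
X_recovered = X_next[:, :slots] @ basis
print("relative recovery error:", np.linalg.norm(X_recovered - X) / np.linalg.norm(X))
```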

This is Internal Recursive Exhaustion because the same basic algorithm is used but the number of fields is kept the same.   If you consider the example of a sequence of images representing something like stages in the construction of a building, the first image might be made of 1024 lines of 1024 single-byte greyscale pixels.  Such an image would contain about one million single-byte pixels.

An image of exactly the same size and resolution with an inset representing an earlier stage of construction would contain exactly the same number of pixels.  The number of fields does not increase, only the content of the image.

The numerical example is better, because the first 10 components of the vector would always be reserved for a compressed version of the whole earlier 100 component vector.  None of the following 90 components would be overwritten by any inset data.

In the Brooklyn Bridge example, the insets always cover some of the next larger image, and the insets are not exactly the same size or from the same perspective.

I hope it still helps to explain a difficult concept.


Internal Recursive Exhaustion Example

This image shows not only the Brooklyn Bridge but its history.

Here is a fictional account of recursive exhaustion over the time domain, from an unfinished novel, in which a brilliant young woman mathematician explains internal recursive exhaustion.  The idea is not to extend the number of data fields, but to capture changes over time in a single set of fields.  A builder could put up multiple signs to show stages in the construction process, but they would obscure the onlookers’ view of the site itself.  Instead a single sign can be used.

“Let’s start with an analogy. Suppose that a company is putting up a new building. To inspire the workers and keep the public informed of their progress the owner requests that a large sign showing the building at the previous state of development be posted in front of the site.

“The first sign shows a messy building lot, with a lot of junk and garbage on it. Behind it, an actual viewer on the ground would see a nicely cleared lot, ready for work to begin.

“Can you visualize that, the cleared lot with a sign in front showing the old uncleared one?”

The others nodded and looked intensely at her. Such an intelligent young woman, all four men thought, not quite able to ignore her other appealing attributes.

“Alright then. Suppose that this viewer is actually the official company photographer. The picture he now takes shows that empty lot, with the existing sign in the lower right hand corner. To again show their progress, the company takes down the old sign and puts a copy of this picture up as the new one.

“As soon as the foundations are dug, with forms and rebar in place, the photographer comes and takes a new picture. It shows that amount of progress, plus the last sign, which is in the lower right corner. A copy of this photograph is put up as the new sign.

“Someone taking a close look at that lower right hand corner would see that it has the image of a sign in its lower right hand corner. And that sign has the image of a smaller one in its corner.

“Months later a person could see the entire history of the building, though she might have to use a magnifying glass to see the original uncleared lot.”

The image above is a badly flawed example.  See why in a post about the details of compressing image and numerical data.


New Tool can Make Society Work or Destroy It

For more than half of my life I have had faith in the potential of social technology to change the world for the better — making all of us happier, ending worldwide conflicts, eliminating poverty and funding medical research.  But there is no guarantee that the tools and techniques invented will be used only for good.   Now there is a very dangerous technique,  Recursive Exhaustion, which can be used for the very large scale collection of social data.

This data can (and will) be collected automatically without people’s knowledge or permission.  It will be used to collect information on people who do not use computers.  It will be used to collect information on people whose friends and family do not use computers.  It will be used to collect information on young children.  Even children who have never seen a cellphone.

On the one hand, this could be a very good thing.  Having a lot of information about poor people or refugees in distant countries could make it possible to direct governmental and non-governmental aid to the most needy without delay.  It could also be used for all the other worthy goals discussed on these websites.  On the other hand there will be a complete loss of privacy.  Unless kept out of the wrong hands, this information will leave individuals wide open to blackmail and intimidation.

I hope that this is not already being done.  It is almost impossible to stop.  Perhaps it could be used by law enforcement officers to  find the people abusing it.  That would mean governments around the world using this technology, justifying their use of it by the need to stop the new wave of crime.  I cannot bring myself to trust even the most benevolent democratic government with an overwhelming amount of information about everybody.  The prospect of every government in the world using it scares me.

Using recursive exhaustion for the very large scale collection of social data means acquiring vast amounts of information about every person in the world, including children.  The idea of governments run by self-obsessed dictators having so much knowledge about people in truly civilized free world countries terrifies me.

I am sorry to say that I have had a hand in this, discovering or more likely rediscovering it.   On reflection I must assume other people know of what I call recursive exhaustion.  I hope they are not using it, but I must assume they are.

So what can I do?  Aside from my usual attempts to come up with new ideas by writing fiction, all I can do is warn people.  Consider yourself warned.


What Data To Collect

The most interesting part of Recursive Exhaustion is the recursive step of multiplying the number of data fields.  My genealogical example shows how a sequence of five numbers serving as a descriptor for a person can be expanded first to a sequence of fifteen, then to a sequence of forty-five.

The question of what numbers to begin the sequence with is important; in that example the first few are ridiculous.  In a similar version using six numbers, the first three are badly chosen.  Mine would be 45, 80085, 8.  They represent the 45th most frequent first name in a list from an old US census, which is Douglas, plus the 80085th most frequent last name in a list of last names in that census, Pardoe, and the 8th most frequent name in that same list of last names, Wilson.  This does indeed produce a unique descriptor for me because nobody else in the world has those three names.

But this is absurd.  Collecting the numerically closest names from the lists would give the name Henry Parayuelos Moore.  That is probably unique because of the rare middle name, but even considering just the first and last names, the person would be an unlikely match.  There are no Henrys in my family that I know of, and the only Moore is six generations back.  The problem is that name frequency is a terrible way of assigning number sequences to represent names.  I discuss this in another post on the same genealogy site.

A proper sequence of numbers for describing a person’s name might include three or four numbers for each name, totaling from nine to twelve.  If done properly, a nearby point in the vector space would actually represent a similar name.
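For what it is worth, here is one crude, purely illustrative way to get that property: code each name by counts of its character pairs, hashed into a handful of buckets, so that similarly spelled names land near each other. This is not the method discussed on the genealogy site, just a sketch of the general idea.

```python
# Crude illustration of encoding a name as a few numbers so that nearby
# vectors correspond to similarly spelled names. The bucket count (4) and
# the hashing trick are arbitrary choices for this sketch; a careful
# encoding would be designed far more thoughtfully.
import numpy as np

def name_code(name: str, buckets: int = 4) -> np.ndarray:
    name = name.lower()
    vec = np.zeros(buckets)
    for a, b in zip(name, name[1:]):                 # adjacent character pairs
        vec[(ord(a) * 31 + ord(b)) % buckets] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def similarity(x: str, y: str) -> float:
    return float(name_code(x) @ name_code(y))

# Similar spellings usually score higher than unrelated names.
print(similarity("Douglas", "Douglass"))   # high
print(similarity("Douglas", "Henry"))      # usually lower
```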

Whether those fields would be useful or not is an open question, but it is possible.  In my genealogical examples, mentioned here because they are easy to understand, fields representing a birthdate would be very useful.  Even more useful would be two fields representing the latitude and longitude of the person’s birthplace.

Other useful fields would include a few forming a vector representation of the person’s occupation.  Other fields might include the distance the person traveled from their birthplace during their life.  On one side of my family were mariners who traveled thousands of miles.  The other side included many farmers who probably didn’t travel more than fifty miles from their birthplace.

As more and more fields are added, the description of the person gets better and better.

These can be called first order facts.  They would say a lot about the individual, but many more fields could be added, producing a better description.  To the first order facts about me, one could add the second order facts, which are those of my father and mother.

It might be possible to create a vector of 100 numbers which would be my first order description.  If the fields were well chosen, the linear vector space would have the desired properties:  if the dot product of someone else’s vector with my own was a high positive value, we would be similar people.

But I would also be well described as a son of each parent.  I am somewhat like my father and somewhat like my mother.  Adding their 100 number descriptions to my own would produce a vector of 300 components which would be a much better description of me.  It would have the advantage of making my description much closer to that of my brother, which is a clear improvement.  Our basic 100 component vectors differ too much.
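A small sketch of that 100 to 300 component expansion, with random placeholder vectors standing in for the carefully chosen fields discussed above, shows the effect on two brothers who share the same parents:

```python
# Sketch of the 100 -> 300 component expansion described above.
# The first-order vectors are random placeholders; in practice they would
# be the carefully chosen fields discussed in this post.
import numpy as np

rng = np.random.default_rng(2)

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

me, brother, father, mother = (unit(rng.normal(size=100)) for _ in range(4))

# Second-order description: my own fields followed by my parents' fields.
me_300      = np.concatenate([me,      father, mother])
brother_300 = np.concatenate([brother, father, mother])

print("first-order similarity: ", cosine(me, brother))          # near 0 for random data
print("second-order similarity:", cosine(me_300, brother_300))  # roughly (s + 2) / 3
```

Because two of the three blocks are identical for the brothers, their similarity rises from s to roughly (s + 2) / 3, which is the pull toward my brother described above.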

Using genealogy alone, one could continue this backward, producing 900 component vectors.  That strategy would make 800 of them identical to my brother’s, exaggerating our similarities.

That is a strong argument for not using genealogy alone.  Other connections between people could be used to make the differences stand out.  We did not marry the same kind of woman and had quite different children.  We had very different friends while growing up.

Adding all of these social connections makes for a much larger vector and also increases the multiplication factor by which one iteration adds second, third and higher order fields to a description.

A difficult problem is missing data.  I will discuss that elsewhere.  For now assume that all of the data mentioned is available and just consider the most useful fields.


Recursive Exhaustion Method for Social Data Collection

This website explores a remarkable new method for social data collection which makes the notorious acts of Cambridge Analytica look trivial.
