Data Science and Terminology

This post is an extensively edited version of my Data Science page, together with my Terminology page and some new material.  The first part, exploring the differences between science, technology and engineering, is a bit pedantic, but addresses a serious problem — even the experts are badly misusing the terms.

Data Science and Data Technology are not the same thing.  Even the book called Data Science in the MIT Press Essential Knowledge series gets this wrong from the start.

Perhaps the term Data Technology is rare because we have Information Technology, and nobody wants to get into an argument about the difference between data and information.  OK, I do — but I’ll try to restrain myself.

The term engineering is also a part of this mess.   People working in various branches of IT may be called software engineers or hardware engineers.   But engineering is not a part of technology.

I really don’t want to be pedantic, but Science, Technology, and Engineering are different things and we really do know what they are.  Why we misuse the terms in combination is a mystery to me.

  • We all know that science is the empirical or deductive search for knowledge, as in physics or mathematics.  During scientific endeavor some basic technologies may emerge, but these are not science by themselves.
  • Technology is the study and collection of tools and techniques.
  • Engineering is the application of technology.   The technology of iron-working came before engineers built bridges with it.

OK, this is pretty pedantic.  Sorry about that.  But as I said, even books from MIT get it wrong.  Their book, supposedly on Data Science, is clearly about Data Technology, and much of it about Data Engineering.  Read it and see for yourself.  More than anything it is about applications.

When it comes to data stuff, it’s pretty obvious that the first attempts to model the human nervous system with artificial neural networks were by people doing science.  The early explorations of them as possible tools were technology, and their deployment in large scale systems for biometric user authentication is engineering.

So endeth the pedantry.  I hope.

Now on to my use of the various terms found on this site.  I don’t care whether you choose the same ones; I just want to explain what I do.  Most of this terminology is an illegitimate combination of standard usage in these disciplines and my own, which evolved over many long years of work on my own projects, such as my mildly crazy Acronymic Language project.

By data I mean structured facts — the presentation of facts in a form useful in data science, technology, or engineering.  Almost always numerical.

A fact may be an attribute of an entity or a relationship between entities.  The fact that I was born at the Vancouver General Hospital can be turned into numerical data by giving its latitude and longitude.
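
In code, that example might look like this minimal sketch (the coordinates are approximate, and the record layout is just for illustration):

```python
# A fact as an (entity, relationship, entity) triple, turned into
# numerical data by replacing the place with its coordinates.
# The coordinates below are approximate, for illustration only.
fact = ("Douglas", "born_at", "Vancouver General Hospital")

coordinates = {"Vancouver General Hospital": (49.261, -123.124)}

entity, relation, place = fact
numerical_fact = (entity, relation, coordinates[place])
print(numerical_fact)   # ('Douglas', 'born_at', (49.261, -123.124))
```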

A dataset is a collection of data.

When you download a dataset, you will usually be offered a choice of several database formats.

When I use the term database, that’s not quite correct.  I almost always mean dataset, because the format doesn’t matter and because unstructured facts can also be stored in a database.

The newer term ‘factbase’ suggests a collection of facts.  When I use factbase I almost always mean a factset, being equally sloppy.

Someone seems to have trademarked the name Factset to use as the name for their business.  I will use the lowercase term factset, by analogy with dataset.

One converts or partially converts a factset into a dataset by extracting the data from a set of facts.

Recursive Exhaustion is the name I use for an algorithm which seems to have been floating around out there.  Basically it converts a factset into a dataset by exhaustively extracting all of the data from it.
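
I have no published implementation to point to, so here is only a hypothetical sketch of that conversion, assuming the factset is stored as (entity, attribute, value) triples, with all names invented:

```python
# A hypothetical sketch: converting a factset into a dataset by
# exhaustively extracting every value it contains. The triples
# and names are invented for illustration.
facts = [
    ("alice", "age", 34.0),
    ("alice", "height_cm", 170.0),
    ("bob", "age", 41.0),
    ("bob", "height_cm", 182.5),
]

rows = sorted({entity for entity, _, _ in facts})     # entities
columns = sorted({attr for _, attr, _ in facts})      # attributes
values = {(e, a): v for e, a, v in facts}

# The dataset: a matrix of numbers plus row and column labels.
dataset = [[values.get((r, c), 0.0) for c in columns] for r in rows]
print(rows, columns, dataset)
```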

Regardless of database format, a dataset can be fixed or fluid.  The most interesting ones are fluid — constantly being updated.

The term recursive is not meant to imply any implementation, just that the use of it in working with a fluid dataset is a recursive process.   To retrieve the most current data record from its database, you need to retrieve all related ones.  They may also have changed, so the ones on which they depend need to be examined.  And so on.
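
A sketch of that recursive process, with a hypothetical record structure and a stand-in for the retrieval call:

```python
# Hypothetical sketch: refreshing one record from a fluid dataset
# forces a recursive refresh of every record it depends on.

def fetch_latest(entity):
    """Stand-in for retrieving the most current record (assumed)."""
    return {"entity": entity, "version": "latest"}

def refresh(entity, records, depends_on, seen=None):
    """Re-fetch a record and, recursively, all it depends on."""
    seen = set() if seen is None else seen
    if entity in seen:                  # guard against mutual dependence
        return
    seen.add(entity)
    records[entity] = fetch_latest(entity)
    for related in depends_on[entity]:  # they may also have changed
        refresh(related, records, depends_on, seen)

depends_on = {"me": ["my_school", "my_city"],
              "my_school": ["my_city"],
              "my_city": []}
records = {}
refresh("me", records, depends_on)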

For the purpose of this and all of my other websites, I seek only numerical datasets which are in a vector format, for use in linear (or non-linear) algebra.  This can be very hard to do.  How would you turn your first and last names into sequences of numbers which are well-behaved in the mathematical sense?  No table-lookups permitted.
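
To see why, consider the naive attempt, shown only to illustrate the difficulty: character codes give you numbers, but distances between the resulting vectors reflect spelling accidents, not anything meaningful about the people.

```python
# The naive encoding: character codes padded to a fixed width.
# This is NOT well-behaved in the mathematical sense: nearby
# vectors mean similar spellings, not similar people.

def naive_encode(name, width=10):
    codes = [float(ord(c)) for c in name.lower()[:width]]
    return codes + [0.0] * (width - len(codes))

print(naive_encode("Wilson"))
# [119.0, 105.0, 108.0, 115.0, 111.0, 110.0, 0.0, 0.0, 0.0, 0.0]
```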

A numerical dataset is a matrix of numbers, plus labels for the rows and columns.

Ignoring column ordering and labels, some datasets will have the kind of algebraic structure that mathematicians call categories, where the row labels represent the entities (objects, nodes) and the contents of the matrix encode the morphisms (arrows) between them.

Ideally the underlying structure is algebraically closed, in which case it forms a monoid (or a group, if every relationship can be inverted). If we are talking about the whole of human society and everyone has at least some relationship to some other person, its structure would be algebraically closed.

To be a category, a mathematical structure must have an identity for each entity, something which takes that entity into itself.  That will not be true of many human relationships.  You might marry a divorced aunt and become your own uncle, but you cannot become your own father, so the relation of parenthood has no identities.

In general, a dataset will not have identities, and so will not be a category.  Instead it will be a semicategory, also known as a semigroupoid.  Ideally the underlying structure is algebraically closed, in which case it forms a semigroup.  This assumes associativity and gets into math I discuss elsewhere, so don’t bother with it for now.
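
One concrete way to picture this, as a sketch with invented names: encode each relationship as a 0/1 matrix over the entities, so that composition becomes matrix multiplication, which is associative by construction.

```python
import numpy as np

# A sketch, with invented names: entities index rows and columns,
# a relationship is a 0/1 matrix, and composing two relationships
# is matrix multiplication, which is associative by construction.
entities = ["alice", "bob", "carol"]

# parent_of[i, j] = 1 means entities[i] is a parent of entities[j].
parent_of = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]])

# Composition: a parent of a parent is a grandparent.
grandparent_of = (parent_of @ parent_of > 0).astype(int)
print(grandparent_of[0, 2])      # 1: alice is a grandparent of carol

# No identities: nobody is their own parent, so this is a
# semicategory (semigroupoid) rather than a category.
print(parent_of.diagonal())      # [0 0 0]
```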

Since recursive exhaustion exhausts not only the space of data but also the adjoint space of entities described by the data, the result is really a series of weights for connections between all of the entities.  This makes the output a spectrum, as briefly discussed on my Social Spectra site.

You will find the term ‘social’ all over my websites, because my original goal was the application of Social Technology.  That term dates from a time when people pointed out that my notion of Social Network Optimization was suspiciously like Social Engineering, a term to evoke horror in all who remember Stalin and Hitler.

I have tried to explain that instead of some global project of social engineering supposedly intended to optimize society, I seek tools and techniques which individuals can use to optimize (let’s just say improve) their own local networks.

Unfortunately the use of algorithms like recursive exhaustion will produce truly global databases filled with information about every individual on the planet.  The result may be an explosion as everyone makes drastic changes to their social environments at once.  Or it may lead to a more subtle social network pessimization (a term used in C++ programming),  as evil people do their own insidious social engineering behind the scenes.

I don’t have any idea what will happen or how to prevent disaster.  I have a new site about Keeping Society Alive, but right now it has only a short front page with a warning about how this might be difficult.  Clearly I have too many websites, but I think this one about very large scale social data collection with recursive exhaustion deserves the most attention.


Recursive Exhaustion Wipes Out Privacy

The summary:  it is possible to collect very large amounts of social data without anyone’s knowledge or permission.   What Cambridge Analytica did was trivial, a drop in a large bucket.  See RecursiveExhaustion.com for more information.

The easy explanation:  start with some data about you. Combine that with similar data about important people, places and institutions in your life. Do that for everyone, every place, every institution, forming an enlarged data record for each. Boil off useless and conflicting data. Repeat until your supercomputer installation runs out of disk space.

Optional:  from time to time, do this on the transpose of the database, thereby increasing the number of entities described.
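
No implementation accompanies that description, so the following is only a rough sketch of the loop, with every name hypothetical and the optional transpose step left out:

```python
# A rough sketch of the process above. All names are hypothetical:
# 'combine' enlarges a record with data from related entities, and
# 'prune' boils off useless and conflicting data.

def recursive_exhaustion(records, related_to, combine, prune, rounds=10):
    """records maps each entity to its data record."""
    for _ in range(rounds):              # 'until disk space runs out'
        enlarged = {}
        for entity, data in records.items():
            neighbours = [records[r] for r in related_to(entity)
                          if r in records]
            enlarged[entity] = prune(combine(data, neighbours))
        records = enlarged
    return records

# Toy usage with trivial stand-in steps:
people = {"you": [1.0], "friend": [2.0]}
result = recursive_exhaustion(
    people,
    related_to=lambda e: [p for p in people if p != e],
    combine=lambda d, ns: d + [sum(v[0] for v in ns)],
    prune=lambda d: d[-3:],              # keep records bounded
    rounds=3)
```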

I am in the process of updating my many websites, pages and posts to reflect the consequences of this process, which grows exponentially in the literal mathematical sense, not as a synonym for “very fast” as in popular usage.  This is an endless process, so if you find something out of date, please check in again later.


What’s a Gigabyte Good For?

This is cross-posted from my new site intended for notes about my unfinished novel, along with some excerpts.  This is the first excerpt, which explains Recursive Exhaustion in a single short paragraph, then says just how much could be done with it.

“OK, enough mystery. Tell me”, he insisted.

She sighed.

“You may regret asking. Here’s the short version. In the mid-1970s a small group of grad students at NYU played about with new kinds of technology for collecting and using information about various entities. It wasn’t very sophisticated at the time, more of a mathematical curiosity. But they all thought about using it to collect a lot of information about a lot of people.”

“I assume they have, which explains the spy vs. spy stuff.”

“Well they needed a good implementation first, which took a while. There was another problem, now solved. Computers and mass storage devices at the time were not powerful enough for what they wanted to do, but there has been a drastic expansion in capability.”

“Is this technology something I would know about?”

“Probably not. Does the term ‘recursive exhaustion’ sound familiar?”

“No. What is it?”

“In a nutshell, start with some data about you. Combine that with similar data about important people, places and institutions in your life. Do that for everyone, every place, every institution, forming an enlarged data record for each. Boil off useless and conflicting data. Repeat until your supercomputer installation runs out of disk space.”

“Oh. Good idea, remind me to try it sometime.”

“That can be arranged.”

“What did they want to do with all that data?”

“Different things. Some people wanted to exploit the technology for personal gain, while others were idealists and thought it could be used to improve society.”

“What happened?”

“The people wanting to exploit the technology for personal gain had no qualms about acquiring data illegally.”

“Like what?”

“Raw census data with names and addresses is good, income tax records are better. Everything they could get their hands on.”

“Credit card numbers?”

“Useful for stealing money, but pretty crude. Far too easy to detect. Money is one motive, but they wanted larger amounts obtained in undetectable ways. Manipulating the stock market for example. We have evidence that they may have obtained a trillion dollars that way. A million million. Other motives include political power and sexual domination.”

“Lovely. How successful have they been?”

“Very. They started using blackmail and intimidation to make people give them masses of illicit data. From there it was just a short step to using the same means to get whatever they wanted. And whomever they wanted to use, for any reason.”

“What about the other group, the idealists?”

“They decided early on never to use illegally obtained information. They wanted everything completely aboveboard. That attracted some exceptional people who have helped them flourish in unanticipated ways. But the new people were even more idealistic and demanded even more ethical behavior.”

“How has this worked out?”

“Starting with a limited amount of information, the bad guys now have an enormous amount on just about everybody. They could tell you the name of the first girl you took to bed, and how well you performed.”

“You are making this up. I simply do not believe you.”

“My rule of thumb is Clarke’s First Law. Do you know it?”

“You mean what the science fiction author Arthur C. Clarke wrote? ‘Any sufficiently advanced technology is indistinguishable from magic’?”

“That’s his third law, you idiot. His first law is ‘When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.'”

“Hey, I’m not elderly.”

“Not so distinguished either, but you ask the right questions.”

“So what you just told me about is possible?”

“My own attitude is this: don’t try to judge whether something is possible until you’ve tried to figure out how you would do it. Suppose you absolutely had to find out when I first seduced a boy, how would you go about doing it?”

“I have no idea.”

“Poor man. Work on it a while, wait for an epiphany, or whatever it takes. If you disappoint me by failing to figure it out, I’ll tell you. I hope it won’t come to that. If we might hook up sometime, I need to maintain some respect for your intelligence.”

“You are a horrible young woman.”

“Oh no, I’m sure you’d find me thoroughly satisfying, given a chance. Play along and you might get one.”

“OK. How much information do the good guys have?”

“Not nearly as much. Probably only a million terabytes or so altogether. More data on more people than either of us can fully grasp, but all legally obtained.”

“Oh, that’s reassuring. You had me worried for a minute there.”

“Well, using recursive exhaustion and data mining, the public stuff can be turned into things you’d rather keep private. It’s ethical to use it because access is strictly limited and every byte originated in public data. There may be a lot of information about your sex life in the database, but no way anyone can see or download it. Nobody can query for anything about you personally. It can only help you find the best people to associate with and the most suitable jobs you can get.”

“And whether I’m in danger or not.”

“That’s true, but the information is even more restricted.”

“What do you know about me?”

“I was looking for someone compatible with me in some useful way, perhaps a lover, perhaps a friend, maybe a co-worker. An ID code for you came up, along with a red flag indicating danger for one or both of us. I was able to see a summary of similar situations in the past. In most of those, both people had been in danger, but the only way of bringing the new one in safely was to meet in person, like this.”

“So you think I am compatible with you in some useful way. I can imagine at least one.”

“I bet.”


Please Do Whatever You Can to Publicize This Website

The prospect of Very Large Scale Social Data Collection is very disturbing. I’m certain that the ability to collect vast amounts of information on everyone, including children, without anyone’s knowledge or permission will change society in ways none of us can imagine. It might even destroy our society — it could put nuclear weapons in the hands of many dangerous people.

I am certain that what’s on this website is correct. Even while I continue to update it, what’s here should be publicized. Please do whatever you can to make sure it reaches not only the experts but the general public.

http://DouglasPardoeWilson.SocialTechnology.ca/


Good vs. Evil, really!

Good vs. Evil — the oldest plot line in history.  This post is about available data — do we try to keep it out of the hands of the bad guys, or use it to defend ourselves?  There is an argument used over and over again by the National Rifle Association: with gun control in place, only the criminals will have guns, and we will be at their mercy.

There is no hiding an algorithm like Recursive Exhaustion from the criminals.  Ultimately, there is no hiding our data from them either.

What about hiding it from ourselves, or others we may consider the “good guys”?  I say that’s fundamentally impossible.  There are masses of information available in public records.  What to do about it?  Should we withdraw it from public scrutiny?  Make it accessible only by in-person request from individual users?

I think that falls into the “we should have thought about that” category.  A large part of this data has been collected in digital form or transcribed into that form.  Much of it is available on the Internet.  Once public, always public, so the use of this information is legal.  That means even the most well-meaning individuals determined to stay within the bounds of the law have access to a lot of information already.  Using methods like recursive exhaustion, this can be multiplied millions of times.  At least.

What criminals can do with a lot of illegally obtained information is truly staggering.

I have too much to say about this for one post.  I will link to another, a copy of an article I published on Medium (which accepts almost anything for publication).  It is about using anonymously collected data to derive good estimates of data otherwise unavailable for known individuals.


Using Anonymous Data

Enormous Value and Use of Anonymously Collected Social Data

I don’t want to give the game away, but I have too many technical details on a (hopefully) imaginary conspiracy in the techno-thriller novel I’ve been writing to actually fit within its pages. I can’t resist mentioning some here.

One thing I have tried to explain is the enormous value of data collected anonymously — and I do mean anonymously. I’ve tried to explain how few people would supply what I want over the Internet, and why in-person collection seems essential. Then I tried to explain why elaborate methods are needed, such as giving respondents gloves to wear (to avoid leaving fingerprints) and proving to them that their answers cannot be seen in any way. Hard to do, very hard. I’ve taken page after page to explain this, but it’s just too much to put into a work of fiction. Here is why it is important:

Suppose you asked a diverse set of 10,000 people 101 questions. Among them, 100 are ordinary ones about interests, personality, educational background, and so on. But one question is highly charged, like “Have you ever molested a child?” If the respondents doubt their anonymity, they are most likely to answer ‘No’, regardless of the facts. On the other hand, if they truly believe nobody will ever know who provided the answers on their response pages, then many will risk telling the truth.

I also tried explaining ways of using cross-validation to weed out people who are fundamentally dishonest even in conditions of anonymity, but that was too technical for a work of fiction. Anyway, it doesn’t matter that much, as long as enough people feel free to tell the truth.

The next step is to train some piece of machine-learning software such as a neural network on the answers to the 100 innocuous questions, with the incriminating answer as the goal. As tests should verify, once trained, the mechanism (e.g. the trained network) would be useful for predicting the felonious behavior of an individual from the answers to 100 innocuous questions.
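
As a sketch of that training step, using scikit-learn on purely synthetic stand-in data (no real survey appears here; a weak statistical link is faked so the model has something to find):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: 10,000 respondents, 100 innocuous answers,
# and one sensitive yes/no answer as the training target. A weak
# statistical link between them is faked so there is something
# for the model to find; no real survey data appears here.
rng = np.random.default_rng(0)
innocuous = rng.normal(size=(10_000, 100))
sensitive = (innocuous @ rng.normal(size=100)
             + rng.normal(size=10_000)) > 0

model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, innocuous, sensitive, cv=5).mean())

# Once trained, the model estimates the sensitive answer for people
# who were asked only the 100 innocuous questions.
model.fit(innocuous, sensitive)
estimates = model.predict_proba(innocuous[:5])[:, 1]
```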

It would probably not be hard to get potentially important people, job applicants, or criminal suspects to answer the 100 questions which don’t seem at all incriminating. Their answers could be used to estimate which of those people have done something terrible.

This could be useful to employers or law enforcement, but also very dangerous. In the novel I am trying to write, evil conspirators use this technique to find people susceptible to blackmail.

Now comes the tricky part. I think I can demonstrate that to filter out the felons it’s not necessary to ask people to answer the 100 apparently harmless questions. That information is already out there. I’m doing a bit better explaining the technique for accessing it, but it’s still pretty technical. I’ll put out an account of it later, if anyone seems interested in what I’ve written here.

The basic message I have for people trying to use data science in the social realm is that a response dataset from individuals who are entirely convinced of their anonymity and feel free to tell the truth is of enormous value. If it contains the right questions, it could be worth millions of dollars, many times what it cost to collect.

Consider even what Cambridge Analytica did, which influenced the US election. A more insidious way of doing this would be to ask an anonymous group of respondents 100 seemingly non-political questions and one political one, such as ‘Republican or Democrat?’ A trained neural network could then be used to predict the political leanings of people asked only the 100 apparently non-political questions. This is a well-known technique, but users of data science often forget the importance of a truly anonymous dataset, collected from a very diverse group of people. Such a dataset could be of great value.

The basic message I have for people who like techno-thriller novels is that the use of such anonymous datasets could put enormous power in the hands of an evil conspiracy. My goal is to scare people with valid techniques of data science, without boring technical detail. But I love the details myself, and can’t resist writing about them — hence this story.

By the way, there is already enough data in publicly available datasets from social surveys to do this stuff. Not enough attention has ever been given to anonymity, but probably enough that these datasets could be used to reveal some nasty personal data, for purposes of blackmail or intimidation.


What’s Wrong With The Brooklyn Bridge Example

Section of Brooklyn Bridge Example Image

This is a piece of the Brooklyn Bridge example image in the Internal Recursive Exhaustion post.  It is remotely possible that in my lazy slapdash way I constructed a poor example.  If you read the full page, you will see that the final sign should have a sequence of ever-smaller inset photos which are exactly the same size and taken from the same perspective.  Ideally they should be taken at identical intervals of time, too.  Perhaps one every week.

Clearly the smaller inset images contain less information than the larger one.  They have in effect been compressed.

Recovering data from compressed versions is done every day.  Look at the example on my Acronymic Languages page.  The left hand image is at full resolution.  The right hand image is a decompressed version of it, obtained from a JPG file which was compressed to one tenth the size of the original file.  Yet it is quite recognizable and not just a blurry version of the first image.

The numerical version of this follows if we can assume that some data fields have been thoroughly linearized.  Suppose the first ten numbers in each 100 component vector are set to zero to start with, while components 11 through 100 represent some valuable data.  For the next iteration, apply a data compression step to the whole 100 numbers, reducing the number of components to 10.  Replace the first 10 numbers (all zeros at the start) with those newly derived ones, which are a compressed version of the whole 100 component vector.  Replace the remaining 90 numbers with the new data.

And repeat.  If the data is nice linear stuff, a good data compression algorithm is to take the SVD of the dataset and project each 100 component vector onto the singular vectors corresponding to the largest singular values.  If the matrix used to perform the projection is stored for each step, then it is possible to invert the process and recover approximations of the vectors at each step.
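
Here is a sketch of one compression round, reading the SVD step as projection onto the dataset’s top singular vectors; the data is a random stand-in:

```python
import numpy as np

# A sketch of the compression step, reading the SVD step as
# projection onto the dataset's top singular vectors. The data
# here is a random stand-in for the real 100-component vectors.
rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 100))

# Rows of Vt are the principal directions of the dataset.
_, _, Vt = np.linalg.svd(dataset, full_matrices=False)
basis = Vt[:10]                      # keep the top 10 directions

compressed = dataset @ basis.T       # each row: 100 numbers -> 10
recovered = compressed @ basis       # approximate inversion

# One iteration of the scheme above: the first 10 slots hold the
# compressed history, the remaining 90 hold fresh data.
history = compressed[0]              # 10 numbers
fresh = rng.normal(size=90)          # stand-in for the new data
next_vector = np.concatenate([history, fresh])
```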

This is Internal Recursive Exhaustion because the same basic algorithm is used but the number of fields is kept the same.  If you consider the example of a sequence of images representing something like stages in the construction of a building, the first image might be made of 1K lines of 1K single-byte greyscale pixels.  The resulting image would contain 1M single-byte pixels.

An image of exactly the same size and resolution with an inset representing an earlier stage of construction would contain exactly the same number of pixels.  The number of fields does not increase, only the content of the image.
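
A sketch of the image version, with illustrative sizes and a made-up downscale factor:

```python
import numpy as np

# A sketch of the image version, with illustrative sizes: each new
# 1K x 1K frame carries a shrunken copy of the previous frame in
# its lower right corner, so the pixel count never grows.

def downscale(image, factor=4):
    """Shrink a square greyscale image by block-averaging."""
    h, w = image.shape
    return image.reshape(h // factor, factor,
                         w // factor, factor).mean(axis=(1, 3))

def next_frame(new_photo, previous_frame):
    """Embed the shrunken previous frame as a corner inset."""
    frame = new_photo.copy()
    inset = downscale(previous_frame)
    frame[-inset.shape[0]:, -inset.shape[1]:] = inset
    return frame

frame = np.zeros((1024, 1024))          # the first sign
for stage in range(5):                   # five construction stages
    photo = np.random.rand(1024, 1024)   # stand-in photograph
    frame = next_frame(photo, frame)     # same size, history inside
```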

The numerical example is better, because the first 10 components of the vector would always be used for earlier versions of the whole 100 component vector.  None of the following 90 components would be overwritten by any inset data.

In the Brooklyn Bridge example, the insets always cover some of the next larger image, and the insets are not exactly the same size or taken from the same perspective.

I hope it still helps to explain a difficult concept.


Internal Recursive Exhaustion Example

This image shows not only the Brooklyn Bridge but its history.

Here is a fictional account of recursive exhaustion over the time domain, from an unfinished novel, in which a brilliant young woman mathematician explains internal recursive exhaustion.  The idea is not to extend the number of data fields, but to capture changes over time in a single set of fields.  A builder could put up multiple signs to show stages in the construction process, but they would obscure the onlookers’ view of the site itself.  Instead a single sign can be used.

“Let’s start with an analogy. Suppose that a company is putting up a new building. To inspire the workers and keep the public informed of their progress the owner requests that a large sign showing the building at the previous state of development be posted in front of the site.

“The first sign shows a messy building lot, with a lot of junk and garbage on it. Behind it, an actual viewer on the ground would see a nicely cleared lot, all ready for work to begin.

“Can you visualize that, the cleared lot with a sign in front showing the old uncleared one?”

The others nodded and looked intensely at her. Such an intelligent young woman, all four men thought, not quite able to ignore her other appealing attributes.

“Alright then. Suppose that this viewer is actually the official company photographer. The picture he now takes shows that empty lot, with the existing sign in the lower right hand corner. To again show their progress, the company takes down the old sign and puts a copy of this picture up as the new one.

“As soon as the foundations are dug, with forms and rebar in place, the photographer comes and takes a new picture. It shows that amount of progress, plus the last sign, which is in the lower right corner. A copy of this photograph is put up as the new sign.

“Someone taking a close look at that lower right hand corner would see that it has the image of a sign in its lower right hand corner. And that sign has the image of a smaller one in its corner.

“Months later a person could see the entire history of the building, though she might have to use a magnifying glass to see the original uncleared lot.”

The image above is a badly flawed example.  See why in a post about the details of compressing image and numerical data.


New Tool can Make Society Work or Destroy It

For more than half of my life I have had faith in the potential of social technology to change the world for the better — making all of us happier, ending worldwide conflicts, eliminating poverty and funding medical research.  But there is no guarantee that the tools and techniques invented will be used only for good.   Now there is a very dangerous technique,  Recursive Exhaustion, which can be used for the very large scale collection of social data.

This data can (and will) be collected automatically without people’s knowledge or permission.  It will be used to collect information on people who do not use computers.  It will be used to collect information on people whose friends and family do not use computers.  It will be used to collect information on young children.  Even children who have never seen a cellphone.

On the one hand, this could be a very good thing.  Having a lot of information about poor people or refugees in distant countries could make it possible to direct governmental and non-governmental aid to the most needy without delay.  It could also be used for all the other worthy goals discussed on these websites.  On the other hand there will be a complete loss of privacy.  Unless kept out of the wrong hands, this information will leave individuals wide open to blackmail and intimidation.

I hope that this is not already being done.  It is almost impossible to stop.  Perhaps it could be used by law enforcement officers to  find the people abusing it.  That would mean governments around the world using this technology, justifying their use of it by the need to stop the new wave of crime.  I cannot bring myself to trust even the most benevolent democratic government with an overwhelming amount of information about everybody.  The prospect of every government in the world using it scares me.

Using recursive exhaustion for the very large scale collection of social data means acquiring vast amounts of information about every person in the world, including children.  The idea of governments run by self-obsessed dictators having so much knowledge about people in truly civilized free world countries terrifies me.

I am sorry to say that I have had a hand in this, discovering or more likely rediscovering it.   On reflection I must assume other people know of what I call recursive exhaustion.  I hope they are not using it, but I must assume they are.

So what can I do?  Aside from my usual attempts to come up with new ideas by writing fiction, all I can do is warn people.  Consider yourself warned.
