Data Science and Terminology

This post is an extensively edited version of my Data Science page, together with my Terminology page and some new material.  The first part, which explores the differences between science, technology, and engineering, is a bit pedantic, but it addresses a serious problem: even the experts are badly misusing the terms.

Data Science and Data Technology are not the same thing.  Even the book called Data Science in the MIT Press Essential Knowledge series gets this wrong from the start.

Perhaps the term Data Technology is rare because we have Information Technology, and nobody wants to get into an argument about the difference between data and information.  OK, I do — but I’ll try to restrain myself.

The term engineering is also a part of this mess.   People working in various branches of IT may be called software engineers or hardware engineers.   But engineering is not a part of technology.

I really don’t want to be pedantic, but Science, Technology, and Engineering are different things and we really do know what they are.  Why we misuse the terms in combination is a mystery to me.

  • We all know that science is the empirical or deductive search for knowledge, as in physics or mathematics.
  • Technology is the study and collection of tools and techniques.
  • Engineering is the application of technology.   The technology of iron-working came before engineers built bridges with it.

OK, this is pretty pedantic.  Sorry about that.  But as I said, even books from MIT get it wrong.  Their book, supposedly on Data Science, is clearly about Data Technology, and much of it about Data Engineering.  Read it and see for yourself.  More than anything it is about applications.

When it comes to data stuff, it’s pretty obvious that the first attempts to model the human nervous system with artificial neural networks were by people doing science.  The early explorations of them as possible tools were technology, and their deployment in large-scale systems for biometric user authentication is engineering.

So endeth the pedantry.  I hope.

Now on to my use of the various terms found on this site.  I don’t care if you choose the same ones, I just want to explain what I do.  Most of this terminology is an illegitimate combination of standard usage in these disciplines and my own, which evolved over many long years of work on my own projects, such as my mildly crazy Acronymic Language project.

By data I mean structured facts — the presentation of facts in a form useful in data science, technology, or engineering.  Almost always numerical.

A fact may be an attribute of an entity or a relationship between entities.  The fact that I was born at the Vancouver General Hospital can be turned into numerical data by giving its latitude and longitude.

A dataset is a collection of data.

When you download a dataset, you will usually be offered it in a choice of several database formats.

When I use the term database, that’s not quite correct.  I almost always mean dataset, because the format doesn’t matter and because unstructured facts can also be stored in a database.

The newer term ‘factbase’ suggests a collection of facts.  When I use factbase I almost always mean a factset, being equally sloppy.

Someone seems to have trademarked the name Factset to use as the name for their business.  I will use the lowercase term factset, by analogy with dataset.

One converts or partially converts a factset into a dataset by extracting the data from a set of facts.

Recursive Exhaustion is the name I use for an algorithm which seems to have been floating around out there.  Basically it converts a factset into a dataset by exhaustively extracting all of the data from it.

Regardless of database format, a dataset can be fixed or fluid.  The most interesting ones are fluid — constantly being updated.

The term recursive is not meant to imply any implementation, just that the use of it in working with a fluid dataset is a recursive process.   To retrieve the most current data record from its database, you need to retrieve all related ones.  They may also have changed, so the ones on which they depend need to be examined.  And so on.
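The "and so on" above can be sketched as a traversal that keeps following relationships until nothing new turns up.  This is only a minimal illustration, not the algorithm itself: the function name `exhaust`, the `related` mapping, and the toy records are all hypothetical.  It is written as a loop rather than a recursive call, in keeping with the point that the term implies no particular implementation.

```python
from collections import deque


def exhaust(start, related):
    """Return every record reachable from `start` by following relations.

    `related` maps a record id to the ids it depends on (a hypothetical
    structure). Each record is examined once, even when relations loop.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        record = queue.popleft()
        order.append(record)
        for other in related.get(record, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order


# Toy example: "a" depends on "b" and "c", which in turn depend on others.
related = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
result = exhaust("a", related)
```

With a fluid dataset the traversal would have to be rerun, or updated incrementally, whenever any of the visited records changes.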

For the purpose of this and all of my other websites, I seek only numerical datasets which are in a vector format, for use in linear (or non-linear) algebra.  This can be very hard to do.  How would you turn your first and last names into sequences of numbers which are well-behaved in the mathematical sense?  No table-lookups permitted.

A numerical dataset is a matrix of numbers, plus labels for the rows and columns.
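As a rough sketch of that definition, assuming plain Python lists rather than any particular library, a numerical dataset could be held like this.  The class name `NumericalDataset`, its fields, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NumericalDataset:
    """A matrix of numbers plus labels for its rows and columns."""
    row_labels: list
    column_labels: list
    values: list  # one inner list of numbers per row

    def __post_init__(self):
        # The labels must match the shape of the matrix.
        if len(self.values) != len(self.row_labels):
            raise ValueError("need one row of values per row label")
        for row in self.values:
            if len(row) != len(self.column_labels):
                raise ValueError("each row needs one value per column label")

    def cell(self, row_label, column_label):
        """Look up a single number by its row and column labels."""
        i = self.row_labels.index(row_label)
        j = self.column_labels.index(column_label)
        return self.values[i][j]


# Made-up entities and numbers, purely illustrative.
dataset = NumericalDataset(
    row_labels=["alice", "bob"],
    column_labels=["latitude", "longitude"],
    values=[[49.0, -123.0], [51.5, 0.0]],
)
```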

Ignoring column ordering and labels, some datasets will be the kind of algebraic structure that mathematicians call categories, where the row labels represent the entities (objects, nodes) and the contents of the matrix encode the morphisms (arrows) between them.

Ideally the underlying structure is algebraically closed, in which case it forms a monoid (it would be a group only if every arrow also had an inverse). If we are talking about the whole of human society and everyone has at least some relationship to some other person, its structure would be algebraically closed.

To be a category, a mathematical structure must have an identity, something which takes an entity into itself.   That will not be true of many human relationships.  You might marry a divorced aunt and become your own uncle, but you cannot become your own father, so the relation of parenthood has no identity.

In general, a dataset will not have identities, and so will not be a category.  Instead it will be a semicategory, also known as a semigroupoid.  Ideally the underlying structure is algebraically closed, in which case it forms a semigroup.  This assumes associativity and gets into math I discuss elsewhere, so don’t bother with it for now.
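For readers who want something concrete, here is a small sketch of the parenthood example, treating a relation as a set of (from, to) pairs.  It rests on assumptions of mine: "closed" is read as "composing the relation with itself yields nothing new", and the helper names and the three-person family are hypothetical.

```python
def compose(r, s):
    """Compose relations: (a, c) whenever (a, b) is in r and (b, c) is in s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def has_identities(relation, entities):
    """True if every entity is related to itself."""
    return all((e, e) in relation for e in entities)


def is_closed(relation):
    """True if composing the relation with itself yields nothing new."""
    return compose(relation, relation) <= relation


entities = {"grandparent", "parent", "child"}
parent_of = {("grandparent", "parent"), ("parent", "child")}

# Nobody is their own parent, so there are no identities.
no_identity = not has_identities(parent_of, entities)
# A parent of a parent is a grandparent, not a parent, so it is not closed.
not_closed = not is_closed(parent_of)

# Keep adding composed pairs until nothing new appears: "ancestor of".
ancestor_of = set(parent_of)
while not is_closed(ancestor_of):
    ancestor_of |= compose(ancestor_of, ancestor_of)
```

The resulting `ancestor_of` relation is closed under composition but still has no identities, which is the semigroup-like situation described above.  Composition of relations is always associative, so that assumption costs nothing here.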

Since recursive exhaustion exhausts not only the space of data but also the adjoint space of entities described by the data, the result is really a series of weights for connections between all of the entities.  This makes the output a spectrum, as briefly discussed on my Social Spectra site.

You will find the term ‘social’ all over my websites, because my original goal was the application of Social Technology.  That term dates from a time when people pointed out that my notion of Social Network Optimization was suspiciously like Social Engineering, a term to evoke horror in all who remember Stalin and Hitler.

I have tried to explain that instead of some global project of social engineering supposedly intended to optimize society, I seek tools and techniques which individuals can use to optimize (let’s just say improve) their own local networks.

Unfortunately the use of algorithms like recursive exhaustion will produce truly global databases filled with information about every individual on the planet.  The result may be an explosion as everyone makes drastic changes to their social environments at once.  Or it may lead to a more subtle social network pessimization (a term used in C++ programming),  as evil people do their own insidious social engineering behind the scenes.

I don’t have any idea what will happen or how to prevent disaster.  I have a new site about Keeping Society Alive, but right now it has only a short front page with a warning about how this might be difficult.  Clearly I have too many websites, but I think this one about very large scale social data collection with recursive exhaustion deserves the most attention.
