We want... information...

Unlike #6 of The Prisoner, I am interested in how information (data, people, whatever) can be pushed, filed, stamped, indexed, briefed, debriefed, and numbered. In a database, at least. (World domination can wait until after this guy's Ph.D., huh? Underachiever.) The core of database theory is indeed manipulating this information or data to produce other data. These manipulations, however, raise the question: "Well, what is it that we're manipulating?" (Data, stupid. It says so right in the name.) And yes, while the substance under examination is "data", treating the label as the definition doesn't exactly produce any useful insights.

Thus, for today's discussion, I want to explore what data means from a database perspective. This, unlike the broader case, does have a definition formed in a constructed reality: that of a programmed piece of software. To begin, let's see what DB documentation has to say about data. Oracle, the database I know best, never defines data in its documentation. (Admittedly I only searched 10.2, but I'll make the claim for everything.)

In their "what is Oracle" section, they have:
Oracle is a relational database. In a relational database, all data is stored in two-dimensional tables that are composed of rows and columns. The Oracle Database enables you to store data, update it, and efficiently retrieve it.

Oracle provides software to create and manage the Oracle database. The database consists of physical and logical structures in which system, user, and control information is stored. The software that manages the database is called the Oracle database server. Collectively, the software that runs Oracle and the physical database are called the Oracle database system.

Their "master glossary", which should have a definition of data (as they keep bloody using the term) .... doesn't. Clearly Oracle either assumes that data is such fundamental knowledge that it needs no definition, or admits that there are so many competing definitions that getting into that debate would be counter-productive. We'll return to Oracle when we look at data types. Interestingly, MySQL fails to address the "what is a database" issue at all, on top of its "what is data?" avoidance.

We can approach a definition by sidling up to it, as it were, in the PostgreSQL manual's definition of a table (tellingly, in its Data Definition section):

A table in a relational database is much like a table on paper: It consists of rows and columns. The number and order of the columns is fixed, and each column has a name. The number of rows is variable — it reflects how much data is stored at a given moment. SQL does not make any guarantees about the order of the rows in a table. When a table is read, the rows will appear in random order, unless sorting is explicitly requested. This is covered in Chapter 7. Furthermore, SQL does not assign unique identifiers to rows, so it is possible to have several completely identical rows in a table. This is a consequence of the mathematical model that underlies SQL but is usually not desirable. Later in this chapter we will see how to deal with this issue.

Each column has a data type. The data type constrains the set of possible values that can be assigned to a column and assigns semantics to the data stored in the column so that it can be used for computations. For instance, a column declared to be of a numerical type will not accept arbitrary text strings, and the data stored in such a column can be used for mathematical computations. By contrast, a column declared to be of a character string type will accept almost any kind of data but it does not lend itself to mathematical calculations, although other operations such as string concatenation are available.

Here, there is no claim that "data" is being stored, only that the "stuff" stored in a column has a data type. This is a fascinating failing of the manuals: they assume that complete newcomers will either learn the terms inductively through the examples or arrive with a ready-made definition. It also makes declaring any official definition of the term "data" problematic. (Of course, making pronouncements here will just be accepted as Gospel, right Brian?)

Hence, here, I'll need to draw on my experience and the ... definitions around data shown above. Data is something that exists in a database, for it has "types." This is not a useful basis for defining data, however, as these types are merely a programmatic convenience, restricting the data to certain symbol-domains either for error-checking, space-saving, or symbol-manipulating reasons.
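To make the manual's description concrete, here is a toy model in Python. It is only a sketch of the quoted text, not how PostgreSQL (or any real database) is implemented, and the names `Table`, `insert`, and `row_count` are made up for illustration. The columns are fixed and typed, the rows are kept in a bag so that identical rows are allowed, and a value of the wrong type is refused.

```python
from collections import Counter


class Table:
    """Toy model of the quoted description of a table (hypothetical, not a real DB)."""

    def __init__(self, columns):
        # The number and order of columns is fixed: a list of (name, type) pairs.
        self.columns = list(columns)
        # A multiset of rows: identical rows are counted, not merged.
        self.rows = Counter()

    def insert(self, *values):
        if len(values) != len(self.columns):
            raise ValueError("wrong number of values for this table")
        # The data type constrains which values a column will accept.
        for (name, col_type), value in zip(self.columns, values):
            if not isinstance(value, col_type):
                raise TypeError(f"column {name!r} expects {col_type.__name__}")
        self.rows[values] += 1

    def row_count(self):
        # The number of rows is variable and includes duplicates.
        return sum(self.rows.values())


people = Table([("name", str), ("age", int)])
people.insert("alice", 30)
people.insert("alice", 30)  # a completely identical row is accepted

try:
    people.insert("bob", "unknown")  # text in a numeric column
    rejected = False
except TypeError:
    rejected = True
```

Note what the sketch does not tell us: the type check says which symbols are allowed in a column, and nothing at all about what the "stuff" in the column is.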

So, we're reduced to the idea that "data" is stored in a particular row and column of a database. Except that the introduction paragraph disputes that point: "The number of rows is variable — it reflects how much data is stored at a given moment." Despite the wanton abuse of the poor hyphen in that sentence (and we were never guilty of that, were we, Brian?) the intent of that sentence seems to indicate that a row is data.

Once again, we must back off from the definition of the term to discuss the whole singular/plural problem. Can we have a piece of data? (I'd prefer a piece of pie, brain...) Or is that a datum? Modern usage (and what a reliable guide that is) seems to indicate that we're using data as singular and that prats who use it as a plural are merely putting on airs. "The data are very interesting." But, standard anti-intellectual elitism aside, we can actually differentiate between datum, data (singular), and data (plural).

The easiest case is data (plural). This data can be found in a table, and it's easy to claim that an entire database table is (or has) data (plural). A table, in turn, is made of a set of rows (or tuples, to be more accurate in set theory terms), and each of those tuples is data (singular). It's not too hard to imagine saying "can you get me that piece of data?" as a reference to getting a row from a database. And thus we see that data (singular) in turn comprises a set of individual datums.

To reiterate (for i=0; i<stupidLots; i++): the indivisible (at least without getting into single-cell functions) component of a database is the datum. It is related to other datums to form data (singular): a row of the database, a tuple. These consistently formed tuples are then related in a table to form data (plural). Now, having shown how this data is structured in a database, can we come to a good definition of it?
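The three levels can be written down in a few lines of Python. This is only an illustration of the vocabulary used here; the `Story` relation and its values are invented for the example.

```python
from typing import NamedTuple


class Story(NamedTuple):
    """Hypothetical relation: every row has the same two named fields."""
    title: str
    word_count: int


# Data (plural): a table, i.e. a collection of consistently formed tuples.
table = [Story("A", 100), Story("B", 250)]

# Data (singular): one row of the table, a tuple.
row = table[0]

# Datum: one indivisible value inside that row.
datum = row.title
```

Asking for "that piece of data" here means asking for `row`; asking for the title alone means asking for a `datum`.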

Hell no. We know that a datum is composed of symbols entered by either a computer or a human. We know that those symbols are encoded as bits. We know that the symbols are generally of the order of either "language" or "numbers", and we know that some language or numbers are inappropriate.

The problem here is that it's theoretically possible to encode anything in a database. We can't claim that a datum is "an observation of reality" as it's possible to store an entire short story in one tuple. (I would link to the Ficlets site, but it's dead.) Given that this short story, if fictitious, contains multiple observations of a fantastic non-reality, we cannot say that there's any basis of singular observation, the "real" or anything like that.

Data types, as a requirement of columns, (My old nemesis. We meet again... for the last time... again...) do show one consistent feature of sets of tuples: they must be of the same nature. The domain of all datums (eeew, hate that word -- and the spellcheck doesn't have a problem with it. WTF?) in a column must be the same, and the relationship between all datums (::twitch::) in a row is also fixed.
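That one consistent feature is easy to check mechanically. The sketch below uses Python types as a stand-in for column domains, which is an assumption made purely for illustration (a real database domain is richer than a type), and the function names are hypothetical.

```python
def column_domains(rows):
    """Return, for each column position, the set of Python types found there."""
    return [{type(value) for value in column} for column in zip(*rows)]


def is_consistent(rows):
    """True if all rows have the same width and each column holds a single type."""
    widths = {len(row) for row in rows}
    same_width = len(widths) <= 1
    one_type_per_column = all(len(domain) == 1 for domain in column_domains(rows))
    return same_width and one_type_per_column


good = [("a", 1), ("b", 2)]
bad = [("a", 1), (2, "b")]  # same values, but the columns are swapped in one row
```

Here `is_consistent(good)` holds and `is_consistent(bad)` does not, even though both contain the same kinds of symbols. The consistency lives in the structure, not in the individual values.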

Thus, a datum in a database must be defined as: "Something of interest to the designer of the database which can be clearly articulated and is indivisible within the universe of discourse of the database." That definition completely sucks.

Do any of you have ideas?

4 comments:

Wanna 737 RIT said...

Let me try! Data is a collection of meaningful information that is of interest to someone or a group of people. However, I don't believe that all data is stored in a database. There is data that exists only temporarily and is not stored in a database. There is data that cannot be categorized or has yet to be structured.

phavanhna

Brian Ballsun-Stanton said...

::grins:: Okay wanna, but now try defining data without using any ambiguous or potentially synonymous terms. (Data, information, and knowledge are the big ones.)

Or, if you do use terms like that, automatically recurse into them and define them. "It depends" is always a great start.

Unknown said...

As you say, data is synonymous with information ^^ and my own take on information goes something like "information = objective facts describing an entity", where an entity is just about anything: a piece of text, a photon, a geometric form, and so on. By objective I mean that which is not dependent on an observer. Colors are not objective facts, while the wavelength of a photon is. The meaning of a text is not objective, while that which makes up the text is (for it is the reason for the text's existence, something which won't go away whether the text is interpreted or not).

Now if you're after even deeper meanings, I guess information can be seen as that which separates order from chaos (or rather, that which makes order what it is, in contrast to chaos), or similarly, that which separates low-entropy entities from high-entropy entities. It is by extension the "essence" of things, that which makes things existent.

Perhaps information is not really "objective facts", but rather "information = that which is described by objective facts". Which is why a random pattern is so much tougher to represent in contrast to a repeating pattern. To describe 1011100101010000010111101000 is tougher than describing 1010101010101010101010101010. Why? That which is described in the first example is greater in terms of data-density. Basically, I guess information just boils down to patterns. :D information = a measure of how easy it is to represent a pattern.

It's really hard to grasp, even I am far from convinced by what I've written ^^ which probably is your point anyway. Oh well, that was quite a lot of text, sorry if it's inconsistent. Philosophical trains of thought usually end up like that once I get started ^^

Brian Ballsun-Stanton said...

Quick comment before I rush out to class, Matt. This is a philosophy blog -- if the comments are short, we're not asking good questions :)

Anyways, the comment. Can we replace your term "fact" with "statement?" The question of "what is a fact" is a reallllly long one. :)
