Why should databases care about Aristotle?

Aristotle and the Database

For the last 9 months, I’ve been playing Ars Magica, a game set in Mythic Europe in the 13th century. The beauty of this game is that it requires thinking in a different paradigm. And yes, I use the term Paradigm in the proper Kuhnian sense – a complete way of thinking and understanding the universe.

At the root of the Aristotelian paradigm, instead of the scientific falsification that we espouse (note: not actually do, but say we do), is the search for rigorously logical and plausible explanations based on prior authority. Consider someone in the Aristotelian paradigm trying to explain a volcano. We start by exploring the things known. I’ll quote here from Art and Academe, the sourcebook for Ars Magica that explores these things:
At the very center of the world is a perfect sphere of earth upon the surface of which all material life can be found. Surrounding the sphere of earth on all sides is the sphere of water that constitutes the seas, although in some places the expansive water enters into the pores of the earth and runs beneath it, leaving dry land. Surrounding the sphere of water is the sphere of air and surrounding that is the sphere of fire. The fire in this outer layer is of the purest kind found, not the smoky fire found on earth, and when illuminated by the sun it burns with a bright blue flame giving the sky its colour.
Of note here is the rigorous logic and acceptance of precedent. Given one thing and observations, we logically deduce others.

Given that authority has stated the above (and that phrase, "given that authority", is essential to Aristotelian scholarship; not our own "holy grail" of experimental results, not that anyone actually follows our holy grail, but that's beside the point), let us look at how we'd explain a volcano. We start by considering the basis of natural philosophy: earth is below us, and loves itself, which is why it seeks to rejoin itself. Fire is above us, and as fire loves fire, it rises. Thus a volcano, fire which has somehow been drawn under the earth, is perfectly logical. The lava rises because the fire wants to return to its natural place, but falls again, due to love of the earth, as the essence of fire departs. It is liquid because it uses the same pores of the earth that water does, and thus shares its properties. Just as air trapped underground causes earthquakes as it escapes, so too does fire cause volcanoes.

The above explanation is perfectly logical… and horribly incorrect as our modern understanding goes. The question now becomes – how does this relate to databases? The process described above is functionally building a constructed reality from observed phenomena and applying observations within that framework to arbitrary categories. Whether those categories involve picking up a lump of earth and calling it “earth!” or doing the same with data, the process of metaphysical creation is the same.

Metaphysical creation: the creation of those categories, those frameworks, those relationships. In the process of creating databases, we find ourselves doing the same thing. We’re taking stuff, making categories for it, and thinking that those categories explain the stuff.

But the parallels continue. Our present practices are the Aristotelian fabrication of data and data models. We have a profound tradition of scholarship designed around these practices, these constructed artifacts. We have “best practices” for designing a database, a data warehouse. We have a number of design methodologies for taking a project through to a theoretical conclusion. We have an entire field of Knowledge Management that purports to give us ways of manipulating knowledge for business gain. We have extensive debates on whether positivist or post-positivist or interpretivist or (my own favorite) post-modern methodologies are scientifically useful for constructing facts in Information Systems. And yet, to a mind used to the paradigm switches between a constructed reality of Aristotelian logic and whatever we want to label whatever it is I’m living in… my academic reality seems just as artificial as this game.

We have rules, we have tenets, we have peer review, but all of that creates scholarship. A profound scholarship, to be sure: a useful scholarship. We explore that which has come before us, identify holes in the literature, and try to fill those holes with created knowledge. And here is why I firmly believe that the contributions we're making aren't science.

In the investigations I’ve made so far in the literature, there are tens or hundreds of definitions of data, information, and knowledge. Within the constructed reality of each speaker’s perspective, every single damn definition is true. The problem is that we’re not trying to disprove certain definitions, nor trying to create a covering model of information that we can use as a scientific basis for business interactions in the world. We’re performing scholarship into how different facets of knowledge, information, data, observations, business intelligence, facts, whatever-the-hell-it-is-we’re-talking-about are used by different people, constructing logical and convincing stories about that research, and calling those stories science.

Discussing qualitative and quantitative approaches to data-gathering, along with other uses of the artifacts of science, matters. It matters because these approaches determine the relative merits of our scholarship in this field. Just as citing the correct “authorities” in an Aristotelian treatise on lava made one more believable to the bishops and cardinals and other intelligentsia of the time, discussing the right methodology here makes it more likely that I’ll get a bloody piece of paper saying I’m a Ph.D., or that the paper will be published in a journal.

More to the point, better scholarship does indeed produce better results. If I carefully (as I have) refine my research objectives to look at interesting things, and then carefully refine my methodology to accurately capture those objectives, I will have produced better scholarship. What I won’t have produced is science.

Here’s my parting thought. If we accept that what I’m doing is not science… then where does the tugging on these strings of scholarship stop? What “scientific” fields unravel? What don’t? Why?

We want... information...

Unlike #6 of The Prisoner, I am interested in how information (data, people, whatever) can be pushed, filed, stamped, indexed, briefed, de-briefed, and numbered. In a database, at least. (World domination can wait until after this guy's Ph.D. Huh. Underachiever.) The core of database theory is indeed in manipulating this information or data to produce other data. These manipulations, however, give rise to the question of "well, what is it that we're manipulating?" (Data, stupid. It says so right in the name.) And yes, while the substance under examination is "data", calling the label the definition doesn't exactly produce any useful insights.

Thus, for today's discussion, I want to explore what data means from a database perspective. This, unlike the broader case, does have a definition formed in a constructed reality: that of a programmed piece of software. To begin, let's see what DB documentation has to say about data. Oracle, the database I'm most knowledgeable in, never defines data in its documentation. (Admittedly I only searched 10.2, but I'll make the claim for everything.)

In their "what is oracle" they have:
Oracle is a relational database. In a relational database, all data is stored in two-dimensional tables that are composed of rows and columns. The Oracle Database enables you to store data, update it, and efficiently retrieve it.

Oracle provides software to create and manage the Oracle database. The database consists of physical and logical structures in which system, user, and control information is stored. The software that manages the database is called the Oracle database server. Collectively, the software that runs Oracle and the physical database are called the Oracle database system.

In their "master glossary" which should have a definition of data (as they keep bloody using the term) .... it doesn't. Clearly Oracle either assumes that data is such fundamental knowledge that they don't need to define it or admit that there are so many competing definitions that it would be counter-productive to get into that debate. We'll return to Oracle when we look at data types. Interestingly, mySQL fails to address the "what is a database" issue at all along with its "what is data?" avoidance.

We can approach a definition by sidling up to it, as it were, in the PostgreSQL manual's definition of a table (tellingly, in its Data Definition section):

A table in a relational database is much like a table on paper: It consists of rows and columns. The number and order of the columns is fixed, and each column has a name. The number of rows is variable — it reflects how much data is stored at a given moment. SQL does not make any guarantees about the order of the rows in a table. When a table is read, the rows will appear in random order, unless sorting is explicitly requested. This is covered in Chapter 7. Furthermore, SQL does not assign unique identifiers to rows, so it is possible to have several completely identical rows in a table. This is a consequence of the mathematical model that underlies SQL but is usually not desirable. Later in this chapter we will see how to deal with this issue.

Each column has a data type. The data type constrains the set of possible values that can be assigned to a column and assigns semantics to the data stored in the column so that it can be used for computations. For instance, a column declared to be of a numerical type will not accept arbitrary text strings, and the data stored in such a column can be used for mathematical computations. By contrast, a column declared to be of a character string type will accept almost any kind of data but it does not lend itself to mathematical calculations, although other operations such as string concatenation are available.

Here, there is no claim that "data" is being stored, only that the "stuff" stored in a column has a data type. This is a fascinating failing of the manuals: they assume that complete newcomers will either learn the terms inductively through the examples or will arrive with a pre-known definition of the term. It also makes declaring any official definition of the term "data" problematic. (Of course, any pronouncements made here will just be accepted as Gospel, right Brian?)

Hence, here, I'll need to draw on my experience and the ... definitions around data shown above. Data is something that exists in a database, for it has "types." This is not a useful basis for defining data, however, as these types are merely a programmatic convenience, restricting the data to certain symbol-domains either for error-checking, space-saving, or symbol-manipulating reasons.
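To make that "restricting to certain symbol-domains" concrete, here's a minimal sketch in Python using the standard-library sqlite3 module. The table, columns, and values are invented for illustration, and because SQLite's native typing is looser than Oracle's or PostgreSQL's, an explicit CHECK constraint stands in for a strictly enforced column type:

```python
import sqlite3

# A throwaway in-memory database; table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")

# SQLite's default typing is loose, so an explicit CHECK constraint stands in for
# the strict column domain that Oracle or PostgreSQL would enforce directly.
conn.execute("""
    CREATE TABLE eruption (
        volcano     TEXT    NOT NULL,
        lava_temp_c INTEGER CHECK (typeof(lava_temp_c) = 'integer')
    )
""")

conn.execute("INSERT INTO eruption VALUES ('Etna', 1100)")  # fits the column's domain

try:
    conn.execute("INSERT INTO eruption VALUES ('Etna', 'quite hot')")  # wrong symbol-domain
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # the "type" doing its error-checking job
```

The point of the sketch is only that the declared type is a gatekeeper on symbols, not a statement about what the stored stuff *is*.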

So, we're reduced to the idea that "data" is stored in a particular row and column of a database. Except that the introduction paragraph disputes that point: "The number of rows is variable — it reflects how much data is stored at a given moment." Despite the wanton abuse of the poor hyphen in that sentence (and we were never guilty of that, were we, Brian?) the intent of that sentence seems to indicate that a row is data.

Once again, we must back off from the definition of the term to discuss the whole singular/plural problem. Can we have a piece of data? (I'd prefer a piece of pie, Brain...) Or is that a datum? Modern usage (and what a reliable guide that is) seems to indicate that we're using data as singular, and that prats who use it as a plural ("The data are very interesting") are merely putting on airs. But, standard anti-intellectual elitism aside, we can actually differentiate between datum, data (singular), and data (plural).

The easiest case is data (plural). This data can be found in a table, and it's easy to claim that an entire database table is (or has) data (plural). As a table is made of a set of rows (or tuples, to be more accurate in set-theory terms), each of those tuples constitutes data (singular). It's not too hard to imagine saying "can you get me that piece of data?" as a reference to getting a row from a database. And thus we see that data (singular) is, in turn, composed of individual datums.
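A toy sketch of that hierarchy, again in Python with sqlite3 and invented names; which of the three you are holding depends only on how much of the table you asked for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eruption (volcano TEXT, erupted_on TEXT, lava_temp_c INTEGER)")
conn.executemany(
    "INSERT INTO eruption VALUES (?, ?, ?)",
    [("Etna", "2002-10-26", 1100), ("Vesuvius", "0079-08-24", 1050)],
)

table = conn.execute("SELECT * FROM eruption").fetchall()  # data (plural): the whole table
row = table[0]                                             # data (singular): one tuple
datum = row[2]                                             # a datum: one indivisible cell

print(table)  # [('Etna', '2002-10-26', 1100), ('Vesuvius', '0079-08-24', 1050)]
print(row)    # ('Etna', '2002-10-26', 1100)
print(datum)  # 1100
```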

To reiterate (for i = 0; i < stupidLots; i++): the indivisible component of a database (at least without getting into single-cell functions) is the datum. A datum is related to other datums to form data (singular), a row of the database, a tuple. These consistently formed tuples are then related in a table to form data (plural). Now, having shown how this data is structured in a database, can we come to a good definition of it?

Hell no. We know that a datum is composed of symbols entered by either a computer or a human. We know that those symbols are rendered as bits. We know that the symbols are generally of the order of either "language" or "numbers", and we know that some language or numbers are inappropriate for a given column.

The problem here is that it's theoretically possible to encode anything in a database. We can't claim that a datum is "an observation of reality", as it's possible to store an entire short story in one tuple. (I would link to the Ficlets site, but it's dead.) Given that such a short story, if fictitious, contains multiple observations of a fantastic non-reality, we cannot say that there's any basis in singular observation, the "real", or anything like that.
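A contrived illustration of that point, under the same assumptions as the sketches above: a single cell happily swallows a whole (very short, entirely invented) piece of fiction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE story (id INTEGER, body TEXT)")

# One cell, one "datum" -- yet it holds an entire miniature fiction,
# not a single observation of anything real.
conn.execute(
    "INSERT INTO story VALUES (?, ?)",
    (1, "The magus watched the mountain burn blue at its peak, "
        "and knew the sphere of fire had slipped beneath the earth."),
)

(datum,) = conn.execute("SELECT body FROM story WHERE id = 1").fetchone()
print(len(datum), "characters in one indivisible datum")
```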

Data types, as a requirement of columns (my old nemesis: we meet again... for the last time... again...), do show one consistent feature of sets of tuples: they must be of the same nature. The domain of all datums (eeew, I hate that word -- and the spellchecker doesn't have a problem with it. WTF?) in a column must be the same, and the relationship between all datums (::twitch::) in a row is also fixed.

Thus, a datum in a database must be defined as: "Something of interest to the designer of the database which can be clearly articulated and is indivisible within the universe of discourse of the database." That definition completely sucks.

Do any of you have ideas?

On the need for a Pidgin

In the burgeoning field of philosophy I call the Philosophy of Data, there exist over 150 definitions of the term "data." While in my research I am proposing the use of experimental philosophy to narrow down these definitions, or at least to explore which groups hold which definitions, recent conversations between Catherine and me indicate the potential for a more subtle strand of research. (If we can call this research science at all; the beard says that this is voodoo science.)

Data, like "Science fiction" and "Art" is one of those things most easily categorized as "we know it when we see it." For most people, the differentiation between data and not-data happens at the intuitive level. However, the interesting point here is that differentiation [citation needed] also changes based on context and conversation. When discussing my research with my colleagues here at UNSW we easily toss around "collecting data about Data" and understand each other as the definitions shift. Furthermore, as we're all part of Information Systems, our shared research background (despite having a range of social-science, business, and infoTech people in the room) allows us to share a common definition. The essence of our activities in these meetings is tacitly constructing a pidgin between our various "sub-disciplines" (as no-one actually can say what IS is. If we can't define it, are we really doing it?)

The construction of a pidgin is easily done within the same contextual domain: that of information systems. Pidgin construction, however, encounters problems in communication between highly inbred fields: those of Information Technology and the Philosophy of AI, for example. In a recent conversation where we were editing each other's work, Catherine and I had to pause at practically every paragraph and recursively, painfully define terms down to a basic non-technical depth. This specific articulation of terms was required because of the very technical and precise jargon each of us was using: clearly understandable to members of each scientific "tribe" but incomprehensible outside it.

Peter Galison discusses this tendency, and the trading zones it gives rise to in physics, in his book Image and Logic:
"... the subcultures of Physics are diverse and differently situated in the broader culture in which they are prosecuted. But if the reductionist picture of physics-as-theory or physics-as-observation fails by ignoring this diversity, a picture of physics as a merely an assembly of isolated subcultures also falters by missing the felt interconnectedness of physics as a discipline. ... I repeatedly use the notion of a trading zone, an intermediate domain in which procedures could be coordinated locally even where broader meanings clashed."
Thus, when studying how people interact with data, the "Philosophy of Data" has a deep requirement for understanding trading zones. For to understand how people conceptualize data, I have to explore how people construct meanings of (and for) data at the moment of communication with another person, not the "objective" definition of data that they may not functionally use. By going "meta", I look not at the nature of data but at the nature of the definition of data in the briefly constructed reality of a communication from one person to one or more others. With this shift in focus to the meta-aspect, the dilemma of over one hundred and fifty definitions of data becomes easier to deal with: they're just ephemeral pidgins, dependent on context.

Why you cannot remove the ‘Artificial’ from ‘Artificial Intelligence’


There has been a rather disturbing trend of late. This trend involves technology-development ‘spectators’ casually waltzing into any given field, mistakenly declaring some minor aspect of said field an issue, resolving (usually through some kind of semantic redefinition) said issue, and then having the general public declare these misanthropes geniuses.

In no field do these leeches of academic spirit irritate me more than in AI. The particular argument I am well sick of reading can best be summed up by a recent article posted on ‘The Online Investing AI Blog’, aptly titled “There is no such thing as Artificial Intelligence”. The basic argument is that these machines display real intelligence: there is nothing ‘artificial’ or ‘fake’ about the intelligence they are exhibiting. Also, by calling these machines and the programs that drive them ‘artificial’ we are, as a community, somehow cheapening the intelligence they do show. (And may eventually hurt their feelings.)

The confusion here is more than just linguistic; it is a misunderstanding driven by the technology spectators’ misinterpretation of what AI is and where it genuinely stands today. The term ‘Artificial Intelligence’ was coined more than half a century ago. The field was first brought firmly into the academic environment by Alan Turing in the famous paper ‘Computing Machinery and Intelligence’. This paper described the kind of programs that were considered ‘intelligent’ at the time, and presented a testing scenario to examine what level of intelligence these programs might genuinely have. The motivation for calling this intelligence ‘Artificial’ was not to minimize the potential or reduce the relevance of said intelligence, but to separate the academic debate on computer intelligence from the complicated and much-discussed debate over biological intelligence that was then, and still is now, very prevalent in the philosophical literature.

After Turing’s paper was dropped on the academic world, the field of AI very quickly broke down into many different sub-fields dealing with many different approaches to creating AI. First, the Functionalism debate broke out amongst philosophers and cognitive psychologists, who were trying to decide if a functional representation of the mind could be classified as an equivalent re-enactment of the mind, and hence be more than a mimic of our intelligence. This debate led to cognitive psychologists using computer programs to test their models of cognition: if their program could react to stimuli in the same way that humans do, then maybe their model of cognition was correct. This led to the distinction between two types of AI. The first is Weak AI: that which mimics intelligence or real-world events. For instance, a Weak AI program can simulate weather conditions in real time and show what will happen to terrain or buildings in the area. These programs are very helpful for predicting and understanding real-world events.

The second category is called Strong AI. These are programs that claim to be more than just simulations or mimics of our intelligence; they are intelligent in their own right. There has been much debate over what must necessarily be present to classify any program as Strong AI. Some cognitive psychologists would like to claim that some of their programmed cognitive models are intelligent in their own right and satisfy the conditions of Strong AI, as you cannot tell their responses from a human’s. However, this seems deeply unsatisfactory. We would not classify a weather simulation as real weather, and the cognitive-model mimic follows the same principle, so what exactly would be an accepted basis for Strong AI?

This debate split the field even further. The philosophers divided into many factions: there were those who thought cognitive functional equivalence was enough for intelligence, and those who thought that neurological functional equivalence was enough. Then there were those, such as Searle, who didn’t really mind where you got your equivalence from, as long as it carried all the necessary conditions for intelligence with it and did not fall into the mimic trap, as demonstrated by his ‘Chinese Room’ thought experiment, debated for thirty years now. But the field of AI split in another direction as well: as computer languages became more developed and more complex, and the technology improved to support faster and faster processing, software engineers started developing AI that had no basis in human intelligence at all.

These AI programs very quickly gained all kinds of status, especially after the Turing Challenge was established. To win the prize, one had to create a program that, when placed behind a screen, could not be distinguished from a human. The only interaction available to the judge was a keyboard and a text screen. The trick was that sometimes there would be a human, not a computer, behind the screen. The most popular method of trying to beat this test was simply programming a massive reference board with as many possible responses to as many questions as the programmer could think of. This was enhanced with word-recognition software that would try to string common words from previous questions together so as to sound like it knew what it was talking about, often with hilarious results. To this day, no program has ever satisfied the conditions of the Turing Challenge, and as such, prizes such as the ‘Loebner Prize’ remain unclaimed.
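As a toy illustration of that "reference board" style of program (not any actual contest entry; the keywords and canned replies here are made up, and real entries used vastly larger tables), the whole trick is keyword lookup plus a plausible deflection:

```python
import random

# A toy "reference board": canned replies keyed on words the judge might type.
RESPONSES = {
    "weather": ["Lovely day, isn't it?", "I do hope it doesn't rain."],
    "chess":   ["I prefer draughts, honestly.", "Kasparov still haunts my dreams."],
    "you":     ["Let's talk about you instead.", "Why do you ask about me?"],
}
FALLBACKS = ["How fascinating. Do go on.", "What makes you say that?"]

def reply(question: str) -> str:
    """Echo-and-dodge: match any known keyword, otherwise deflect."""
    for word in question.lower().split():
        if word.strip("?,.!") in RESPONSES:
            return random.choice(RESPONSES[word.strip("?,.!")])
    return random.choice(FALLBACKS)

print(reply("What do you think of the weather?"))
print(reply("Do you dream?"))
```

Nothing in the lookup understands the question; it only has to sound as though it might, for five minutes, to one judge.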

Confusion on this point also came from the well-known chess match between IBM’s ‘Deep Blue’ and Kasparov. Deep Blue did win the match against the grandmaster, but onlookers would be mistaken in thinking that this was an example of computer ‘intelligence’ winning over mankind. Why? Because of the way Deep Blue is programmed. The method used by Deep Blue’s developers is known as “brute force”: giving a computer enough computational power to run through every scenario it can, then choose and play the move that gives it the best odds of winning the match. This is not the computer making an intelligent choice about chess; it is a glorified number-crunching machine.
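A sketch of what "brute force" means here, in Python and on the trivially small game of Nim rather than chess (which no machine could literally exhaust this way): enumerate every continuation, then play the move with the best guaranteed outcome. There is no judgement in it, only counting.

```python
# Brute-force game search, illustrated on Nim: players take 1-3 sticks in turn,
# and whoever takes the last stick wins. Chess is far too large for this literal
# approach, but the principle is the same: enumerate everything, pick the best move.

from functools import lru_cache

@lru_cache(maxsize=None)
def can_win(sticks: int) -> bool:
    """True if the player to move can force a win from this position."""
    if sticks == 0:
        return False  # no move left: the previous player took the last stick and won
    # Try every legal move; if any leaves the opponent in a losing position, we win.
    return any(not can_win(sticks - take) for take in (1, 2, 3) if take <= sticks)

def best_move(sticks: int) -> int:
    """Play a move whose resulting position the opponent cannot win; else take one stick."""
    for take in (1, 2, 3):
        if take <= sticks and not can_win(sticks - take):
            return take
    return 1  # every continuation loses against perfect play; take one and hope

print(best_move(10))  # 2 -- leaves 8 sticks, a losing position for the opponent
```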

Ideas for developing other methods of choice that did not involve brute force kept software engineers entertained for some time. Programs that have been developed since then and classified under the AI banner include Neural Networks, Bayesian Networks, Genetic Algorithms, Fuzzy Logic, and a wide variety of other such program bases. Despite the fact that they are all very different in their conception, they are still classified as Artificial Intelligence, whatever their results, simply because that is what they are: they are made by man, as man’s attempt at creating something else that is functionally equivalent enough to our understanding of the world to be actually intelligent. No programmer, philosopher or scientist, when facing a real and functional Artificial Intelligence, would ever think of it as “fake”; that is simply the wrong sense of the word in question. Hence it is offensive and naive to walk into the field and criticize those who care about the development and foundations of Artificial Intelligence for calling the results of their blood, sweat and tears fake.

One could argue that so far no program has been developed that can be classified as real intelligence, merely a set of highly developed mimic machines; you can even debate whether this mimicking is a sufficient basis for ‘real’ intelligence, but that’s beside the point. Knowing the history of their field, knowing how many areas AI now covers, and knowing how diverse and complicated it has become, AI researchers are well aware that there may come a time when some of these programs need to be renamed and re-categorized. But this relabeling will not be done by the sideline watchers, the buyers and sellers of technology, and those who love to read the paper and get excited by the iPhone’s new music-recognition software. This renaming will be done by those who understand the field and the place their projects have within it.

Your Roomba and your sorting machines and your autonomous warehouse mechanisms and your search engines and your computer viruses and your programmed computer-game opponents and your spam bots may be intelligent or they may not be. It is important to remember that that which looks intelligent may not be. But claiming that your particular example of technology is intelligent, that no-one but you recognizes this, and that everyone else is giving the poor thing an inferiority complex, is plain ignorant. Next time, talk to those in the field and ask them what they mean when they call something Artificially Intelligent. They will clearly have a better idea than most of you.