Capt. Horatio T.P. Webb
Parks -- Fall 2014
Version 1 -- Last Updated 9:00 PM 4/19/2013
Data Processing began with:
Until the 1960s, all data processing was predicated on "sequential data" stored media such as punched cards and tapes (IBM referred to such methods as QSAM -- seQuential Access Method). These devices were called "unit record" equipment since they processed punched cards. The concept of a "record" thus became associated with a "card" as a unit of data (multiple characters puched sequentially on a card). Later in the 1950s, punched cards could also be transfered to magnetic tape (see 7 track tape)-- but processing was still sequential.
IBM owned the the majority of the computer market until the late 1990s (see this IBM timeline). Their hardware solutions for data processing enjoyed an unprecedented popularity because of IBM's constant innovation, effective marketing and excellent technical support. Major releases of IBM mainframe hardware were:
See the IBM competitors -- the BUNCH (IBM and the seven dwarfs)
Contrary to what most people (people under 60 years of age) say about computers and the Internet: OLTP (on-line transaction processing) is nothing new; Al Gore didn't invent the internet; and there is not much really, really new. It was been done quite well for decades and decades. There is nothing new under the sun.
The steps beyond reading "sequential" data files from cards and tapes are:
Generally, the concept of non-sequential access file is predicated on a file that contains:
With this data file, an "index file" can be constructed containing a pair of data for each record:
The disk location for data is specified by three data values:
Data transfer time depends on how long it takes for the data to pass under the read/write heads. If we have 100 sectors along one track and the disc rotation speed is 120 rotations per second, then to transfer 1 sector would be:
transfer time = time for one rotation / number of sectorsThe "typical total time" to access data is the sum of:
seek time + rotational latency + data transfer time 4 milliseconds + 4 milliseconds + .083 → 8.083 milliseconds total time/sector
This is ONLY about 125 sectors per second -- astoundingly slow. This is generally the reason large data applications can appear to be incredibly slow when data volume gets large (see the solution to this problem below)
Additional information required is:
Below is a sample INDEX FILE that would contain the data for each record's disk location (drive, cylinder, surface (or head), and sector). It is shown with:
Assume we are trying to find the record with KEY=98. Searching the index file below, we find the KEY=98 value and retrieve its disk address:
DISK ADDRESS BUFFER INFO [VOLUME=DF3 CYLINDER=6 HEAD=1 SECTOR=3] [BLOCK=2 RECORD=2]
[ All records in Volume=DF3 CYLINDER=6 SECTOR=3 is shown are blue.]
Volume - Cylinder - R/W Head - Sector - Block - Record Number Number Number Number Number Number
DF3 6 1 3 2 3
DF3 6 1 4 1 2
DF3 6 1 3 3 1
DF3 6 1 4 2 3
DF3 6 1 3 1 2
DF3 6 1 4 3 3
DF3 6 1 3 3 3
DF3 6 1 3 2 1
DF3 6 1 4 1 1
DF3 6 1 4 3 1
DF3 6 1 3 1 1
DF3 6 1 4 2 2
DF3 6 1 4 3 2
DF3 6 1 4 1 3
DF3 6 1 3 2 2
DF3 6 1 4 2 1
DF3 6 1 3 1 3
DF3 6 1 5 1 1
DF3 6 1 5 1 3
DF3 6 1 3 3 2
DF3 6 1 5 1 2
This index file had to constructed as each "record" was written to the disk drive. For a new "record", the values of the "key" and "disk address" pair were added to the the "index file". When a program wished to retrieve a "record" by using its "key", the program had to:
In programs, the difference in QSAM and ISAM can be seen in the COBOL statements that read a "record". In QSAM (sequential) files, records are read in physical sequential order (like a deck of punched cards, records on a tape, or records on disk which are read in the order in which the were physically written). In COBOL a record from a sequential file is retrieved like this:
READ file-name RECORD AT END statements
...file-name is the name of the file as known to the operating system and defined in program
...the AT END clause is followed by a list of statements to be performed when the end of file is encountered
For ISAM, the COBOL statement to read a record is:
READ file-name KEY key-name INVALID KEY statements
...file-name is the name of the file as known to the operating system and defined in program
(the file also has an additional index file associated with it)
...key-name is a program variable that contains the value of the "key" the user is trying to retrieve
...INVALID KEY clause is followed by a list of statements to performed if the value of the "key" if NOT found
Older developers will recall with fondness, the IBM utility programs: IEBGENER used to create files; IEBISAM which performed a re-sort of the index file; b>IEHMOVE used to "move" files;
ISAM became a critical part of early programming languages: FORTRAN, COBOL and eventually C. Each of these steps underwent improvements and changes as the size of the files and the capacities of the disk drives improved.
The second major disk access method invented by IBM was VSAM (Virtual Sequential Access Method) in the late 1970s. This technique in general added "empty" space in the index list so that new records could be inserted into the indexed file without having to resort the index file.
Even though ISAM -- and eventually VSAM -- provided IBM customers with direct data access by "key", most file processing designs and decisions were made by programmers at the lowest level of sofware and hardware. Based on the work of IBM employee Vern Watts, the IMS Database was developed. IMS became the first commercial viable "full function" database product. It was based in the "hierarchical" model where data is represented as a tree-like structure (as opposed to Codd's "relational" model of tables -- see below).
The hierarchial model is like the XML discusssions we have had elsewhere. The "parent-child" relationship of XMl are representative of the type of structure IMS contains (see this on IMS).
IMS was an extremely successful product for IBM and became the IBM's main database product for several years.
The move from hierarchical databases to the relational model began at IBM. System R was an experimental database product that was designed by IBM in the early 1970s (see this). The key development was SQL (Structured Query Language) by by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s:
Michael Stonebraker and Eugene Wong of University of California Berkeley became interested in the System R work. With modest support from NSF, , Air Force Office of Scientific Research, the Army Research Office, and the Navy Electronic Systems Command, students and staff developed INGRES in 1974. By 1980 over a thousand copies of INGRES had been distributed (for free) mostly to universities.
Berkeley students Jerry Held and later Karel Youseffi (at Tandem Computers) built a system that evolved into NonStop SQL. The Tandem database system was a re-implementation of the Ingres technology. By 1989 Tandem was competing with Oracle and lost market share because of Oracle's massive marketing efforts.
Robert Epstien was the lead programmer at Berkeley and later founded to Sybase (see below)
In 1980, Stonebraker and Wong and Lawrence Rowe founded Relational Technology, Inc. (RTI) and renamed their company INGRES in the late 1980s. In 1994, INGRES was purchased by Computer Associates.
Stonebraker returned to Berkeley in 1985 and began an improved version of INGRES whose prototype of 1988 became in the POSTGRES. In 1996 the project was renamed PostgreSQL. Since 1997 the code has been "open source".
Larry Ellison heard about the IBM System R database from an article in the IBM Research Journal provided by Ed Oates (a future co-founder of Oracle). Ellison co-founded Oracle Corporation in 1977 with Bob Miner and Ed Oates under the name "Software Development Laboratories" (SDL). ORACLE version 1 was released in 1978 for the Digital Equipment Corporation PDP 11 hardware. In 1979 SDL changed its name to Relational Software, Inc. (RSI). In 1982, RSI renamed itself Oracle Systems Corporation. In the same year ORACLE was upgraded for the new Digital Equipment VAX hardware system.
Oracle continued its dominance in the non-IBM marketplace for years. With the rise of the web and the client-server architecture, ORACLE took over as the top commercial database product through aggressive marketing and acquisitions such as: Peoplesoft in 2004; Siebel in 2005; Hyperion in 2007; Sun Microsystems (along with all the JAVA patents) in 2012.
From the Relational Database Systems, Inc. founded by Roger Sippl and Laura King, 1980
INFORMation on unIX in 1981.
In 1996 Michael Stonebraker became Informix's CTO, a position he held until September 2000.
Purchased by IBM 2001
The most successful implementation of the relation model was IBM's relational product DB2 which was released for it mainframe customers in 1983.
Founded 1984 by Mark Hoffman, Bob Epstein (lead programmer at INGRES above), Jane Doughty and Tom Haggin in Epstein’s home in Berkeley, California
Sybase's SQL Server 4.9 was used by Microsoft to create Microsoft's SQL Server first used circa 1989
(Microsoft's SQL Server carried Sybase copyright notices until 1997)
Acquired by SAP 2010
Adapted from Sybase Version 4.9 in 1989, the product became Microsoft's flagship database.
MySQL was created by a Swedish company, MySQL AB, founded by David Axmark, Allan Larsson and Michael "Monty" Widenius. The first version of MySQL appeared on 23 May 1995. Based on the low-level language ISAM, which the creators considered too slow and inflexible. They created a new SQL interface, while keeping the same API as mSQL.
In January 2008, Sun Microsystems bought MySQL for $1 billion. Then, in April 2009, Oracle Corporation entered into an agreement to purchase Sun Microsystems,
A movement against Oracle's acquisition of MySQL, to "Save MySQL"from Oracle was started by one of the MySQL founders, Monty Widenius.
The Top Ten DBMS systems from http://db-engines.com/en/ranking (all but MongoDB and Cassandra are Relational DBMS) are shown below:
3. Microsoft's SQL Server
5. MongoDB (document store, see below)
7. Microsoft Access
9. Sybase ASE
10. Cassandra (wide column store)
Most applications begin to perform slowly when the query volume exceeds the server/disk performance capability. The noticeable slowdown in response time necessitates either "fine tuning" the database structure or in many cases splitting the database into several parts and assigning each portion of the database to its own separate server/disk configuration.
This process is called "horizontal partitioning" because the tables are divided along row boundaries. For example, one might split a database table that whose primary key is "LastName" into equal fourths by creating a server and disk for last names that begin with : A→-F, G→L; M→R, and S→Z. Each of the partitions can also be referred top as a "shard" as in a piece of broken glass (i.e., fragment, sliver, splinter, shiver, chip, piece, bit, particle, ...)
There are obvious performance issues with this idea:
Automatic sharding is available in: IBM's INFORMIX (version 12.1) , MongoDB (version 1.6), MySQL Cluster, MS SQL Azure, and Spanner (by Google)
More recently, during 2010 and 2011, Michael Stonebraker has been a critic of the NoSQL movement.