Native XML storage with Berkeley DB XML - part one

Command Line Warriors - Thu, 14/08/2008 - 5:12pm

XML databases have not really broken through yet in a big way, primarily because SQL has proved more resilient than expected in storing a wide range of data.

SQL has limits

For example, object-relational mappers such as SQLAlchemy, the one inside Django and ActiveRecord in Ruby, have become very popular recently, as they all you to serialise in-memory objects and store them in a database. Very little SQL experience is required and you can create very elaborate relational storage.

However, when you have data that is not relational at all, converting it to rows and columns can be either lossy or a lot of work, and often you end up unable to round-trip, i.e. you can get the data in, but you cannot get it out again in the original format.

The most common approach is not to even try, i.e. often people will not store XML in a database, but run a series of XML files though some XSLT stylesheets and then store the compiled HTML in a database. This removes XML's advantage as a processable exchange format. In order to get the data in a different way, the human has to go back to the beginning of the pipeline and rerun the complication and database import. If you want to allow other computers to make flexible requests over the Internet, well then you are out of luck.

Meanwhile, a lot of the XML extensions to SQL databases are still relational underneath, for example, with MS SQL, you can give it XML, which it then splits up into rows and columns; you can then also ask for the data back in XML, which it then retrieves from the rows and columns and outputs in XML. You have not actually gained very much however, as your data is still stored in rows and columns, you have to output the data before you can process or query the XML. This can prove very long-winded with highly complex hierarchical data, it can sometimes prove faster to just dump the XML to disk yourself.

Even worse is storing a whole XML document within a single SQL field in a single row. Again, this is using a database instead of a hard disk.

Elliotte Harold's article, Managing XML data: Native XML databases is a few years old so the software examples he mentions are probably a little old, however the discussion of why one might want an XML database is quite good, this interview with Jonathan Robie is also worth browsing.

Berkeley DB

Berkeley DB, more commonly BDB or just DB, is the classic embedded database, Started in 1991, it is used more or less everywhere. Despite some recent completion from SQLite, it is still by far the leading embedded database. It bundled inside in everything: Linux, BSD, OS X, OpenOffice, Python, Apache, Sendmail, Postfix, subversion, GNOME and no doubt millions of other things.

DB, like much of software that powers the world, was started at the University of California, Berkeley. It was then spun off in Sleepycat Software, which was bought by Oracle a few years ago. Note DB has nothing to do with Oracle's eponymous relational database product. They are at the complete other end of the spectrum, Oracle database is for large corporate data storage, while DB is a fast and light embedded database.

DB is also non-relational, i.e. there is no SQL interface, there are key-value pairs stored in byte arrays and that is more or less it. If you want SQL or network interfaces, then you stick that on top, for example DB is one of the back-end options for MySQL.

Berkeley DB XML

The folks at Sleepycat/Oracle have been working on Berkeley DB XML for a couple of years, but it is not so well known as its older brother. DB XML is a layer on top of DB that forms a fully XML-native database with "XQuery-based access to documents stored in containers and indexed based on their content". (Source)

There are two good things about it. Firstly, it is an XML database written in C and C++, not in Java. Secondly, it is a proper XML database, not putting XML into SQL columns.

The other notable XML database is the Java-based eXist which seems to be more well known, primarily among Java XML developers. eXist is more of a service, rather than something light to embed into applications. It also rumoured to be significantly slower than DB XML, but I have never used eXist so I can't show any benchmarks for that.

There are also a load of proprietary and half-finished XML databases we do not care about. Including the abandoned Java XML database called dbXML which has no relation to the Berkeley on but confuses everyone because the sourceforge page is often the first search engine result.

Install Berkeley DB XML - Windows

Go over to the Oracle/Berkeley DB XML homepage and grab the installer. I have not used it so I have no idea what happens, hopefully it is straightforward.

Install Berkeley DB XML - Posix platforms

DB XML works on all major posix compatible systems, including Linux, OS X, BSD, Solaris and so on. Installing it varies according to distribution. If your distribution has packaged it, then lucky you, you just install it through that. Of the two distributions I usually use as examples on this site, Gentoo Linux and Ubuntu Linux, the former has a DB XML package, the latter currently does not.

Install Berkeley DB XML - Gentoo

sudo emerge dbxml

The package is new, so if you are running stable, then the emerge command will moan and tell you that the package and some of the dependencies are not yet in stable. You will need to add these to /etc/portage/package.unmask, see Using Masked Packages in the Gentoo Handbook for more details.

Install Berkeley DB XML - Ubuntu

Hopefully, in the near future, you will be able to go:

sudo apt-get install dbxml python-dbxml

But we are not there yet. When we are there, the rest of this post will be irrelevant. Therefore this following part will date badly, I will try to remember to update it as events unfold, but if you are reading this paragraph in 2009 or later, it means I have forgotten and it might be worth checking other sources.

What follows shows how important packagers are. Almost any complex package will need to be optimised for that distribution, and normally all the work is done for us behind the scenes. The pointy-heads in the Mozilla corporation that didn't want Firefox to be patched downstream in the distributions are not living in the real world. But that is another story. Let's get going.

Dependencies on Ubuntu

First, we need to get the dependencies. We need Berkeley DB, xqilla and libxerces, the latter two are not all yet in the default Ubuntu repositories, but my mate txwikinger has packaged them for us in his private archive. You need to add these lines to the bottom of your /etc/apt/sources.list with a comment so you remember why you put the lines there:

# Deps for DBXML
deb http://ppa.launchpad.net/txwikinger/ubuntu hardy main
deb-src http://ppa.launchpad.net/txwikinger/ubuntu hardy main

Then run: sudo apt-get update

sudo apt-get install libdb4.6++ libxerces28 xqilla

sudo apt-get install libxerces28-dev libxqilla-dev libdb4.6++-dev

Working directory

We need to have a working directory, in my examples we will assume ~/Sandbox, but use where you like.

mkdir ~/Sandbox/

cd ~/Sandbox/

Go over to the Oracle/Berkeley DB XML homepage and grab the tarball and extract it into your ~/Sandbox directory, or you can just use the commands:

cd ~/Sandbox

wget http://download.oracle.com/berkeley-db/dbxml-2.4.13.tar.gz

tar -xvvzf dbxml-2.4.13.tar.gz

Compile DB XML

The tarball from Oracle contains a lot of dependencies which we have already installed. So ignore the 'buildall' instructions, as they compile all the dependencies which takes all day. You then have half a dozen or so packages that are in your main system folders but are not managed by Apt, this is not really the done thing.

Now because we are doing it the proper Unix/Ubuntu way rather than the Oracle big-tarball-of-mud approach, we want to use the shared library of xqilla that we installed earlier. There is a slight problem in the configure file which stops it from working. Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/dist/configure

In line 4396, change .la to .so

So the line:

elif test `ls "$with_xqilla"/libxqilla*.la 2>/dev/null | wc -l` -gt 0 ; then

Becomes:

elif test `ls "$with_xqilla"/libxqilla*.so 2>/dev/null | wc -l` -gt 0 ; then

Now we can build DB XML

../dist/configure --with-berkeleydb=/usr/lib/ --with-xqilla=/usr/lib/ --with-xerces=/usr/

cd ~/Sandbox/dbxml-2.4.13/dbxml/build_unix

If that works without error, you can type:

make

sudo make install

Now if that works without error, you can try to run dbxml:

/usr/local/BerkeleyDBXML.2.4/bin

./dbxml

Type help to see the list of commands. Press Ctrl+D to quit when you have had enough.

Python Bindings on Ubuntu

So far so good, now we need the Python bindings:

cd ~/Sandbox/dbxml-2.4.13/dbxml/src/python/

Now we have to do a little more patching to let the Python bindings know where dbxml is installed.

Open setup.py in your text editor We are going to add three lines, in each case make sure the indentation lines up with the line above it.

After line 18, add the following line:

db_xml_home = '/usr/local/BerkeleyDBXML.2.4'

After what is now line 65, add the following line:

INCLUDES.append(os.path.join(db_xml_home, "include"))

After what is now line 69, add the following line:

os.path.join(db_xml_home, "lib"),

Now in dbxml-2.4.13, there is another little bug we need to deal with first.

Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/dist/swig/dbxml_python.i

Go to line 533, and change 'Vaue' to 'Value'.

So change:

class XmlInvalidVaue(XmlException):

To:

class XmlInvalidValue(XmlException):

Now this file eas used to automatically generate the Python bindings, so the generated file needs to be fixed too. The dbxml developers need to generate them, but until they do, we can just fix it ourselves. Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/src/python/dbxml.py

Go to line 121, as before change:

class XmlInvalidVaue(XmlException):

To

class XmlInvalidVaue(XmlException):

Now we can finally install the bindings:

python setup.py build

sudo python setup.py install

Now lets test it, which involves, yes you guessed it, more patches:

cd ~/Sandbox/dbxml-2.4.13/dbxml/examples/python/

Now edit examples.py using your text editor. Remove the 3 from line 9. So from:

from bsddb3.db import *

To:

from bsddb.db import *

Now you can go:

python examples.py 7

Which should give you:

Running example 7.
book1 = <book><title>Knowledge Discovery in Databases.</title></book>

If you have that then everything should be in working order. This has been a very long post so I will break here and come back to DB XML in Python another day.

Discuss this post - Leave a comment