On the 13th of March I attended an interesting workshop on techhnology for Genealogy. My interest in this is based on interest in genealogy itself (my family tree contains currently about 3000 persons from various parts of Sweden down to some farmers in northern Sweden born around 1400) and my interest in technology and in particular how MySQL and MySQL Cluster can be used for genealogy applications. Being an LDS myself also adds to my interest in the subject.
The LDS church has developed a Web API FamilySearchAPI where genealogists through their genealogy software can work on a common database where they can add, edit information about our ancestors. The system handling this system currently contains 2.2 PB of data and is going to grow significantly as images and more genealogy information is added.
There were quite a few interesting discussions on how to link information between the source information (scanned images of historical documents), transscribed information from sources and derived family trees. The most complex problem in this application is the fuzziness of the base data and that different genealogists can have many different opinion about how to interpret the fuzzy base data. Thus in order to solve the problem one has to handle quality of genealogists somehow in the model.
From a database point of view this application requires a huge system with large clusters of information, it contains one part which is the base data (the scanned images) and this is typically stored in a large clustered file system containing many petabytes of data. Then the derived data is smaller but given that all versions need to be stored will still be a really huge data set and this is a fairly traditional relational database with large amounts of relations between data.
So what I take home from the workshop is ideas on what MySQL and MySQL Cluster should support in 3-5 years from now to be able to work in applications like this one.