Unification and filtering of the stock of proper names in the Petőfi Museum of Literature (contribution to the name space project of NDA)

Kómár Éva
Petőfi Irodalmi Múzeum

Lengyel Monika
MTA SZTAKI

Simon András
BCE EKK

The Museum of Literature’s classification databases (the Hungarian biographic index, the Hungarian emigrant writers and their works, the prize winners after 1945, and the genealogy of Hungarian aristocrats) are listed first among the databases declared authentic in the NDA name space project, announced in 2004.

As a research institute, the Museum stands out among public collections: besides the catalogues of its various collections, it has created many databases that serve basic literary research. These databases, which aim at high lexicographic standards, are the result of decades of comprehensive background research.

The databases were originally built in Access. With the introduction of HunTeka, they were merged with the museum’s other classification databases. Converting databases that served different purposes and were compiled from different sources raised many theoretical and practical problems. One of the most crucial was the unification of the stock of proper names, as the conversion had produced 600 thousand so-called quasi-duplicates.

Checking and maintaining a data collection of this size is unimaginable without appropriate computing technology. To build a classification database of the required quality, we needed more efficient and comprehensive software. The old practice of unification, based on the mechanical collation of text files, proved insufficient because of the unusual detail of the background content of the data. To deepen the collation, the contents of the individual database fields had to be analysed and checked against similarity rules, so that the quality and the content of the data together could qualify a record as unique.
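The difference between mechanical collation and a similarity rule can be illustrated with a minimal sketch. It is not the software used in the project: the function names, the normalisation steps and the use of Python's standard difflib module are assumptions chosen for illustration only. Note that stripping diacritics may conflate Hungarian names that are genuinely distinct, so a real rule set would have to be more careful.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace.

    Hypothetical normalisation; the project's actual rules are not given.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(cleaned.split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two name strings."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


# Mechanical collation treats these two spellings as different strings,
# while the similarity rule treats them as the same name form.
exact_match = "Petőfi Sándor" == "Petofi  Sandor,"
similar = name_similarity("Petőfi Sándor", "Petofi  Sandor,")
```

Here `exact_match` is false while `similar` reaches the maximum ratio, which is the kind of quasi-duplicate that plain text comparison misses.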

During the check, each data field is first examined for form and content and then classified according to the level of its content and its completeness. By comparing the similarity of the data fields, records that appear identical can be filtered out. Within each such group, the outstanding record is selected according to the preferences of the uploaded databases, and all secondary personal records are linked to it.
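The steps described above (classifying fields by completeness, filtering records that appear identical, selecting the outstanding record by database preference, and linking the secondary records to it) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the source ranking in `SOURCE_PREFERENCE`, the 0.9 threshold and the birth-year rule are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical ranking of the uploaded databases: lower value = preferred.
SOURCE_PREFERENCE = {"biographic_index": 0, "emigrant_writers": 1, "prize_winners": 2}


@dataclass
class PersonRecord:
    record_id: int
    source: str
    name: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None
    linked_ids: List[int] = field(default_factory=list)


def completeness(rec: PersonRecord) -> int:
    """Count how many optional fields are filled in."""
    return sum(v is not None for v in (rec.birth_year, rec.birth_place))


def looks_identical(a: PersonRecord, b: PersonRecord, threshold: float = 0.9) -> bool:
    """Treat two records as the same person if the names are similar enough
    and the birth years do not contradict each other."""
    if a.birth_year is not None and b.birth_year is not None and a.birth_year != b.birth_year:
        return False
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio() >= threshold


def unify(records: List[PersonRecord], threshold: float = 0.9) -> List[PersonRecord]:
    """Return the outstanding records; secondary records are linked to them."""
    # Preferred source first, then the more complete record.
    ordered = sorted(
        records,
        key=lambda r: (SOURCE_PREFERENCE.get(r.source, 99), -completeness(r)),
    )
    masters: List[PersonRecord] = []
    for rec in ordered:
        for master in masters:
            if looks_identical(master, rec, threshold):
                master.linked_ids.append(rec.record_id)
                break
        else:
            masters.append(rec)  # no similar master found: keep as unique
    return masters


records = [
    PersonRecord(1, "prize_winners", "Márai Sándor", 1900),
    PersonRecord(2, "biographic_index", "Márai Sándor", 1900, "Kassa"),
    PersonRecord(3, "emigrant_writers", "Marai Sándor"),
    PersonRecord(4, "biographic_index", "Arany János", 1817, "Nagyszalonta"),
]
masters = unify(records)
```

In this example records 1 and 3 are linked to record 2, which comes from the preferred source and is the most complete, while record 4 remains a separate unique entry.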

Because the identity of individual data contents is established more or less arbitrarily from their degree of similarity, and because such data contents carry different weights in the classification of a given record, the quality of the data must also be compared to ensure appropriate classification. By running the qualifying, evaluating and identifying algorithm repeatedly and rechecking the result set each time, the program receives feedback, learns from it and operates accordingly. It is no exaggeration to consider this method similar to artificial intelligence solutions.
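One possible reading of the weighting and feedback idea is sketched below. The field weights, the scoring formula and the threshold tuning from reviewed pairs are all assumptions made for illustration; the abstract does not specify how the program actually learns from its results.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Hypothetical field weights; in practice they would be tuned from feedback.
FIELD_WEIGHTS: Dict[str, float] = {"name": 0.6, "birth_year": 0.25, "birth_place": 0.15}


def field_similarity(a, b) -> Optional[float]:
    """Similarity of two field values, or None if either value is missing."""
    if a is None or b is None:
        return None
    if isinstance(a, str) and isinstance(b, str):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 1.0 if a == b else 0.0  # non-text values: exact comparison


def record_score(rec_a: dict, rec_b: dict, weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted mean similarity over the fields present in both records."""
    total, weight_sum = 0.0, 0.0
    for name, weight in weights.items():
        sim = field_similarity(rec_a.get(name), rec_b.get(name))
        if sim is None:
            continue  # a missing field neither supports nor contradicts identity
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def tune_threshold(reviewed: List[Tuple[float, bool]], default: float = 0.9) -> float:
    """Pick the cut-off that agrees with the most reviewed decisions.

    `reviewed` holds (score, is_same_person) pairs from a manual check.
    """
    if not reviewed:
        return default
    candidates = sorted({score for score, _ in reviewed})

    def agreement(t: float) -> int:
        return sum((score >= t) == is_same for score, is_same in reviewed)

    return max(candidates, key=agreement)


score = record_score(
    {"name": "Márai Sándor", "birth_year": 1900, "birth_place": None},
    {"name": "Márai Sándor", "birth_year": 1900, "birth_place": "Kassa"},
)
threshold = tune_threshold([(0.95, True), (0.9, True), (0.8, False), (0.6, False)])
```

Each checking round would add new reviewed pairs, so the cut-off is recomputed from a growing set of decisions. That is a very simple form of the feedback loop the text describes.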