Services for digital libraries

The resources of Polish digital libraries comprise nearly 2.5 million digital objects and are growing by more than 0.5 million objects per year. This is a large volume of data that often needs to be processed.

Long-term storage and sharing require conversions, both lossy and lossless, between different data formats (TIFF, JPEG, DjVu, PDF, etc.).

Many libraries publish their digitized resources without a text layer. As a result, such resources cannot be full-text indexed in the digital library or in the Federation of Digital Libraries.

It also happens that the metadata prepared for digital objects are incomplete, which causes problems when presenting the objects on different devices.

In response to these challenges (data format conversion, OCR, and metadata enrichment), the MAN-HA project has implemented scalable, distributed, and fault-tolerant services.

The services are implemented and run as so-called computing topologies in the Apache Storm system (https://storm.apache.org), which the project has extended with the ability to distribute processing automatically to additional virtual machines reserved ad hoc. When the number of requests for a service increases, the system starts further data processing nodes (VMs) and runs on them the Apache Storm processes that carry out subtasks of a computational topology. Conversely, when requests for the service stop arriving, the system scales down and shuts down the virtual machines.
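
The scale-up and scale-down behaviour described above can be illustrated with a small sketch. It is not taken from the MAN-HA code: the class name ScalingPolicy, the method targetWorkers, and the idea of sizing the pool from a count of pending requests and a fixed per-VM capacity are all assumptions made for the example.

```java
/**
 * Hypothetical scaling rule (illustrative only, not the MAN-HA implementation).
 * It answers one question: how many worker VMs should be running right now?
 */
final class ScalingPolicy {
    private ScalingPolicy() {}

    /**
     * @param pendingRequests   requests currently waiting for the service
     * @param requestsPerWorker assumed number of requests one VM can absorb
     * @param minWorkers        lower bound (0 lets the pool shut down completely)
     * @param maxWorkers        upper bound on VMs that may be reserved
     */
    static int targetWorkers(int pendingRequests, int requestsPerWorker,
                             int minWorkers, int maxWorkers) {
        if (requestsPerWorker <= 0 || minWorkers < 0 || maxWorkers < minWorkers) {
            throw new IllegalArgumentException("invalid scaling bounds");
        }
        long pending = Math.max(0, pendingRequests);
        // Ceiling division: 25 pending requests at 10 per VM need 3 VMs.
        long needed = (pending + requestsPerWorker - 1) / requestsPerWorker;
        // Clamp to the allowed pool size.
        return (int) Math.min((long) maxWorkers, Math.max((long) minWorkers, needed));
    }
}
```

With a minimum of 0 and a maximum of 8 VMs, a call such as ScalingPolicy.targetWorkers(25, 10, 0, 8) asks for three machines, while an empty queue lets every machine be closed.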

The individual subtasks of a computational topology are implemented in Java using proven, open tools installed on each virtual machine, such as tesseract-ocr (https://github.com/tesseract-ocr) and ImageMagick (http://www.imagemagick.org).
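
As a rough illustration of how a Java subtask might delegate work to these tools, the sketch below builds command lines for the ImageMagick and Tesseract executables and runs them with ProcessBuilder. The class and method names are hypothetical, the tools are assumed to be on the PATH of the virtual machine, and the ImageMagick command is assumed to be the "convert" executable (newer installations may use "magick" instead).

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper (illustrative only): wraps command-line tools installed on a VM. */
final class ExternalTools {
    private ExternalTools() {}

    /** Format conversion; ImageMagick infers the target format from the output file extension. */
    static List<String> convertCommand(String inputFile, String outputFile) {
        return List.of("convert", inputFile, outputFile);
    }

    /**
     * OCR with Tesseract: the "pdf" config asks for a PDF with a text layer,
     * written to outputBase + ".pdf". The language code (e.g. "pol") must be installed.
     */
    static List<String> ocrCommand(String imageFile, String outputBase, String language) {
        return List.of("tesseract", imageFile, outputBase, "-l", language, "pdf");
    }

    /** Runs the command, forwards its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A subtask could then call ExternalTools.run(ExternalTools.ocrCommand("page.tif", "page", "pol")) and treat a non-zero exit code as a failed item.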

Specific topology sources (i.e. Apache Storm spouts) can receive and properly handle each type of input data. The input can be single data files, whole digital objects, or only the administrative metadata of the digital objects to be processed. These data may be provided in the form of a resource's URI.
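
One way to picture the three kinds of input and the URI that accompanies them is the small value class below. It is only a sketch: the class ProcessingRequest, the Kind names, and the one-line "KIND URI" message format are invented for the example and do not describe the actual MAN-HA spouts or their wire format.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical request model (illustrative only): what a topology source might hand on. */
final class ProcessingRequest {
    /** The three input types named in the text. */
    enum Kind { SINGLE_FILE, DIGITAL_OBJECT, ADMINISTRATIVE_METADATA }

    private final Kind kind;
    private final URI resource;

    ProcessingRequest(Kind kind, URI resource) {
        this.kind = Objects.requireNonNull(kind, "kind");
        this.resource = Objects.requireNonNull(resource, "resource");
    }

    /** Parses an assumed message format: the kind, whitespace, then the resource URI. */
    static ProcessingRequest parse(String message) {
        String[] parts = message.trim().split("\\s+", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<kind> <uri>' but got: " + message);
        }
        // valueOf throws IllegalArgumentException for an unknown kind.
        Kind kind = Kind.valueOf(parts[0].toUpperCase(Locale.ROOT));
        return new ProcessingRequest(kind, URI.create(parts[1]));
    }

    Kind kind() { return kind; }

    URI resource() { return resource; }
}
```

Downstream subtasks could then branch on kind() to decide whether to fetch one file, a whole object, or only its metadata from resource().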