Abstract

Comparison of distributed fault-tolerant file systems for HPC usage

Gergely Pál Dávid
NIIF Intézet

The growing need to interconnect and aggregate resources (processing power, network, storage, etc.) for better utilization is a clear trend in modern computing.
High-performance computing (HPC) is no exception.

This talk addresses the storage aspect of this aggregation by presenting possible solutions for merging local HPC storage space into a single pool.

The NIIF Institute has started actively researching this topic for two reasons. The first is our users' need to access the same content simultaneously at all of our distinct HPC sites.
Meeting this need could improve not only user satisfaction but also the utilization of storage and computing power. The second reason is the Partnership for Advanced Computing in Europe (PRACE) project and one of its tasks.

One very important requirement of this aggregation is that it must be fully transparent to programs and users. For this reason the talk will focus mainly on file systems and file system-like solutions,
because they can easily satisfy this constraint.

During the talk the following topics will be addressed to introduce and explain the problem and the possible solutions:

What is the exact problem or need? In which cases does the problem even appear? What would be solved and what positive effect would we get after introducing a good solution?
What kinds of "offline" solutions exist? What are their drawbacks?
What is a distributed file system? Why would we need one?
What does fault-tolerant mean? Which parts of the system are fault-tolerant? Why is full fault tolerance important?
What is a parallel file system? Why is this feature beneficial?
What are the main differences between distributed, parallel and replicated file systems? Which of these features is needed in this case, and why?
Which file system implementations could be considered valid solutions?
What are the results of the evaluation, tests and measurements?