Comparison of distributed fault-tolerant file systems for HPC usage
Gergely Pál Dávid
<>
NIIF Institute
The emerging need to interconnect and aggregate resources (processing power, network, storage, etc.) for better utilization is a clear trend in modern computing, and the field of high-performance computing is no exception. This talk addresses the storage aspect of this aggregation by presenting possible solutions for merging local HPC storage space into a single pool.
The NIIF Institute has started actively researching this topic for two currently pressing reasons. The first is our users' need to access the same content simultaneously at all of our distinct HPC sites; meeting this need would bring not only user satisfaction but also better storage and computing-power utilization. The second reason is the Partnership for Advanced Computing in Europe (PRACE) project and one of its tasks.
One very important requirement for this aggregation is that it must be fully transparent to programs and users. For this reason the talk will focus mainly on file systems and file system-like solutions, because they can fulfill this constraint easily.
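To illustrate what transparency means in practice, here is a minimal sketch (assuming a hypothetical mount point /mnt/shared): an application written against ordinary POSIX I/O works unchanged whether that path is backed by a local disk or by a distributed file system such as Lustre, GlusterFS or Ceph mounted in the same place.

    /* Minimal sketch: the program neither knows nor cares whether
       /mnt/shared (a hypothetical mount point) is a local disk or a
       distributed file system mounted there. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("/mnt/shared/results.txt", "w");
        if (f == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fprintf(f, "job finished\n");   /* same call path as on local storage */
        fclose(f);
        return EXIT_SUCCESS;
    }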
During the talk, the following topics will be addressed to introduce and explain the problem and the possible solutions:
- What is the exact problem or need? In which cases does the problem even appear? What would be solved, and what positive effects would we gain, by introducing a good solution?
- What kind of "offline" solutions exist? What are their drawbacks?
- What is a distributed file system? Why would we need one?
- What does fault-tolerant mean? Which parts of the system are fault-tolerant? Why is full fault tolerance important?
- What is a parallel file system? Why is this feature beneficial?
- What is the main difference between distributed, parallel and replicated features? Which is needed for this case and why?
- Which file system implementations could be considered valid solutions?
- What are the results of the evaluation, tests and measurements?