Guest blog post by Kyvos Insights
The HDFS design is based on the design of the Google File System (GFS). Its implementation addresses a number of problems present in earlier distributed filesystems such as the Network File System (NFS). Specifically, the HDFS implementation was designed:
➤ To store a very large amount of data (petabytes). HDFS is designed to spread the data across a large number of machines, and to support much larger file sizes than distributed filesystems such as NFS.
➤ To store data reliably, and to cope with the loss or malfunction of individual machines in the cluster. For this, HDFS uses data replication.
➤ To integrate well with Hadoop’s MapReduce, HDFS allows data to be read and processed locally, on the machines where it is stored.
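The first two points can be made concrete with a small sketch. The snippet below is a toy model, not the real HDFS placement policy: it splits a file into fixed-size blocks (128 MB is a common HDFS default) and assigns each block to several datanodes using a naive round-robin scheme. The node names and the placement function are purely illustrative.

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Illustrative only -- real HDFS placement is rack-aware and more complex.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for the blocks a file splits into."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(block_index, datanodes, replication=REPLICATION):
    """Naive round-robin replica placement across datanodes (toy policy)."""
    n = len(datanodes)
    return [datanodes[(block_index + r) % n] for r in range(replication)]

datanodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb)
print(len(blocks))  # a 1 GB file splits into 8 blocks of 128 MB

for i, (offset, length) in enumerate(blocks):
    print(i, place_replicas(i, datanodes))
```

Because every block exists on three different machines, losing any single machine leaves two live copies of each of its blocks, and the namenode can schedule re-replication to restore the target count.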
The scalability and high-performance design of HDFS comes at a cost. HDFS is suited only to a particular class of applications; it is not a general-purpose distributed filesystem. A number of additional decisions and trade-offs govern the HDFS architecture and implementation, including the following:
➤ HDFS is optimized for high-throughput streaming reads, and this comes at the expense of random seek performance. This means that an application reading from HDFS should avoid (or at least minimize) seeks. Sequential reads are the preferred way to access HDFS files.
➤ HDFS supports only a limited set of operations on files: writes, appends, deletes, and reads, but not updates. It assumes that data will be written to HDFS once, and then read multiple times.
➤ HDFS does not provide a mechanism for local caching of data. The overhead of caching such large volumes is high enough that data should simply be re-read from the source, which is not a problem for applications that mostly perform sequential reads of large data files.
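The restricted operation set described above can be sketched as a small in-memory model. This is not an HDFS client (a real application would go through the Hadoop FileSystem API); it is a toy class whose whole point is what it deliberately lacks: a method for in-place updates.

```python
# Toy in-memory model of HDFS's write-once, read-many file semantics.
# Supports create, append, read, and delete -- but, like HDFS, offers
# no operation for updating existing bytes in place.

class WriteOnceStore:
    def __init__(self):
        self._files = {}

    def create(self, path, data):
        """Write a new file; existing files cannot be overwritten in place."""
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = bytearray(data)

    def append(self, path, data):
        """Add bytes to the end of an existing file."""
        self._files[path].extend(data)

    def read(self, path):
        """Return the full contents of a file."""
        return bytes(self._files[path])

    def delete(self, path):
        """Remove a file entirely."""
        del self._files[path]

    # Deliberately no update/write_at method: random writes are unsupported.

store = WriteOnceStore()
store.create("/logs/a.log", b"first")
store.append("/logs/a.log", b" second")
print(store.read("/logs/a.log"))  # b'first second'
```

Applications that need to "modify" data under this model typically do so by writing a new file (or appending) and deleting the old one, which is exactly the pattern MapReduce-style jobs follow.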