Tuesday, December 01, 2009

Storage System and File System Courses

This year I spent a lot of time looking into the storage system classes taught at good universities. I had two reasons: first, this post by a researcher at NetApp about the lack of a good textbook for storage or file system classes, and second, our own storage systems class, where I was the TA.

In this post I want to give a short overview of the various courses and their focus. Please note that the following text might contain errors or misconceptions on my part. I might also have missed other storage courses at these universities.

University of California, Santa Cruz: 

Let's begin with the course at the University of California, Santa Cruz. Storage is huge at UCSC, where the Storage Systems Research Center partners with nearly everyone. The Ceph file system and the CRUSH hash function are two outcomes of their research.

The course consists of a series of lectures (two per week), lots of reading material, and a project. The lectures cover file systems, beginning with uniprocessor file systems and performance analysis and moving (very quickly) to distributed file systems. They also cover fault tolerance and other advanced topics. The reading material consists of 37 papers, from classics like "File System Design for an NFS File Server Appliance" to state-of-the-art research papers like "An Analysis of Data Corruption in the Storage Stack" (FAST 2008) that had come out only about two weeks earlier.

I miss some basics that IMHO are important for understanding storage system design, like the properties of modern hard disks, and I am not that into archival storage (my boss is), but it is a really well-designed course. Unfortunately, the lecture slides are not available online.

Columbia University, New York
Advanced Topics in Network Storage Systems, Spring 2004:
http://www1.cs.columbia.edu/~magoutis/cs699810-spring04/index.html


I may have missed one, but the last storage-related course at Columbia University was given in 2004 by Kostas Magoutis. The course focuses on network storage and probably relies on basics from an operating systems class or a basic storage class. There was one lecture per week, with one to three papers as reading material each week.

It is really nice that the lecturer has posted notes on how to read the papers, with questions and annotations for some of the material. Interestingly, data deduplication is covered with the LBFS paper, the Venti paper, and Henson's compare-by-hash papers.

Three books are recommended for the course: "UNIX Internals (1996)", "The Design and Implementation of the 4.4BSD Operating System (1996)", and "NFS Illustrated (1999)".

Cornell University, New York
Advanced Distributed Storage Systems, Spring 2009: 

At Cornell University, I found the course on advanced distributed storage systems by Hakim Weatherspoon (who took part in the OceanStore project). The lectures, given twice a week, cover "Cloud Computing", "Network File Systems", and the important topics of consistency, availability, replication, and scalability.

I think the major strength of this course is that it seems to focus much more than the other courses on the important concepts needed for storage system design, implementation, and research, rather than on standards, products, and storage management issues. The major weakness is that the individual lectures are very focused on the research papers whose content is presented, even to the point that there is no single presentation scheme. I think this weakens the overall consistency of the lecture series.

One interesting aspect of the course is that the students have to write and hand in short reviews of the papers in the reading material, each consisting of a summary (3-4 sentences), two or three major strengths, two or three weaknesses, and one question for future work that, in the opinion of the student, should be pursued.

There are two projects as part of the course: In the first, the students have to develop a distributed file system based on Amazon Web Services infrastructure. The second is a research project that the students have to come up with by themselves.

Six books are recommended for the course: two books by Richard Stevens (UNIX Network Programming, Advanced Programming in the UNIX Environment), two books by Tanenbaum (Modern Operating Systems, Distributed Systems), "The Design and Implementation of the 4.4BSD Operating System", and "The C++ Programming Language" by Stroustrup.

Johns Hopkins University
Storage Systems, Fall 2007:

At Johns Hopkins University -- where our professor Christian Scheideler and my advisor Andre Brinkmann (as a visiting PhD student) had formerly been -- I found the Storage Systems course by Randal Burns.

As usual, the course consists of a lecture series (two 50-minute lectures per week), homework, and a project. I like that the course covers some basics, like disk drive architecture, that are essential to understanding the design of storage systems. On the other hand, it is a bit short on distributed file systems.


University of Notre Dame:

In 2005, the University of Notre Dame offered the course "Distributed Storage" by Surendar Chandra.

As usual, the course consists of a series of lectures (two per week) and a project. The lecture topics are "Naming and location", "Consistency and Replication", "Distributed Storage Management", "Security", "Peer-to-Peer Storage and Sensors", and "Energy Management". The reading material consists of no fewer than 40 papers. My impression is that the collection of reading material differs considerably from the material of the other courses covered here, e.g. the well-known "classical" papers are not linked.

Technion

Technion is the "Israel Institute of Technology" in Haifa, and as I said before, I am pretty envious of the students there. However, not especially because of the "Filesystems" course.

The lecture series consists of a short introduction to disk drive architecture, RAID, sequential data processing on tapes (hey, I am only inferring this from the pictures in the slides), disk-based sorting, B-trees, hashing, concurrency and transactions, as well as recovery.

The course recommends five books: "File Structures: An Analytic Approach", "Transactional Information Systems", "Principles of Database and Knowledge-Base Systems", "Database Management Systems", and "Database System Implementation". The books match the lectures exactly: they mostly cover the basics shared between databases and storage systems, but none of them is directly related to file systems.

The assignments seem to be pretty similar to ours. They seem to consist of multiple steps building up a simple file system implementation. However, the assignments are written in Hebrew, so I cannot read them. I expected more from a Technion course.

University of Wisconsin in Madison: 
Advanced Storage Systems, Spring 2006:
http://pages.cs.wisc.edu/~remzi/Classes/738/Spring2006/


The advanced storage systems class given at the University of Wisconsin seems to be a nicely structured class with interesting topics: It begins with local storage systems, but moves very quickly (by the third topic) to distributed and mobile systems. Then important concepts like reliability and fault tolerance, performance and scalability, as well as caching, replication, and consistency are discussed. The reading material is a nice list of what are now classics, like the WAFL paper, the AutoRAID paper, and the Google File System and MapReduce papers, but also Row-Diagonal Parity and the "soft updates" paper.

Which universities are missing:
The University of California, Berkeley is missing: The home of BSD (and therefore the Fast File System), RAID, and a lot of early work in P2P storage seems to have no course focused on storage or file systems. I also could not find classes at Stanford, Harvard, MIT, or Carnegie Mellon.

Summary
To sum these courses up a bit: Most courses have large amounts of reading material. This is unusual in Germany (or at least at UPB). I had plenty of courses (especially in the SE part) without any reading material. We followed this "US style" in our course, but with only 12 papers. Most courses have a project assignment for which the students have to come up with their own topic. I really like this, too.

Our own courses
Storage Systems (German), University of Paderborn, Spring 2009:
http://pc2.uni-paderborn.de/teaching/lectures/speichersysteme/


"Our" own storage systems course consists of a lecture series with 15 lectures of 90 minutes each and 6 assignments.

The lecture series starts very slowly, with "Magnetic Storage Systems" (week 1), disk scheduling (week 2), an introduction to MEMS and flash storage (week 3), and RAID (weeks 4 and 5). Next came file systems (weeks 6 and 7) and storage connection technologies from SCSI (week 8) to SANs (week 9). Network and parallel file systems are treated in weeks 10 to 12.

The assignments consisted of programming a small FUSE file system in C (step by step).

In the last third of the lecture series, the course treated advanced storage topics that are interesting for our current research projects, like long-term archiving, HPC I/O (MPI-IO), Continuous Data Protection (CDP), data deduplication, and P2P storage.

In addition to the reading material, we referred to the book "Linux Device Drivers".

Our professor, Andre Brinkmann, also gave a short course (6 lectures) called "Theoretical Aspects of Storage Systems Research" at the Politechnika Wrocławska in Poland, which is a very condensed version of our course focused on the theoretical aspects.

Last words:
I really liked studying and comparing the storage system lectures. These lectures provide a pretty good overview of the classical (I should call them "essential") research papers of our field and of related books, which helps as long as a real storage systems textbook is missing.

I am impressed that so many universities have "project" assignments where the students have to come up with a topic by themselves. These lectures show what is possible at good (mainly US) universities, with motivated students, and with the right foundations.
