Many Faces of Publish/Subscribe

Posted on 2018-02-05 | In Study Note , Distributed System


Study note for the paper The Many Faces of Publish/Subscribe

3 Dimensional Decoupling by Pub/Sub

The publish/subscribe pattern enables decoupling along three dimensions:

  1. Time:

    Publishers and subscribers do not have to be up at the same time; they can come and go, as in P2P systems.

  2. Space:

    They do not need to be aware of each other and do not share their locations (references).

  3. Synchronization:

    Neither side blocks the other.

The decoupling in these dimensions increases scalability, improves resilience, and reduces coordination.

Some Other Communication Models

  • Message passing
    • Low-level approach to send/receive messages
  • Remote Procedure Call (RPC)
    • Builds on top of message passing
    • Appears as though a normal procedure call is made, but it is actually a remote invocation
    • Java RMI, CORBA request-response, Sun RPC, DCOM
  • Tuple spaces (distributed shared memory)
    • Support the anonymous, decoupled messaging across time and space BUT not the synchronization decoupling that pub/sub provides
  • Message queuing
    • More of an implementation detail of the message passing protocol
  • Notification
    • Provides synchronization decoupling but not time and space decoupling.

Types of Pub/Sub Models

Topic-based

Publishing/subscription is based on events that are named or have an ID.

Every topic has a unique name. The interfaces for both publishers and subscribers are simple. Topic-based pub/sub also supports hierarchical addressing, which means topics can form a hierarchy through containment.

Note that a subscriber receives everything published on the topic it subscribes to; filtering of the events is the responsibility of the subscriber.

Content-based

Publishing/subscription is based on the contents of the event, which provides finer granularity, but the implementation is more complicated and causes more overhead.

Now the filtering is the responsibility of the pub/sub infrastructure rather than the subscribers.
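To make the contrast concrete, here is a minimal Python sketch (the broker, names, and event shapes are my own illustration, not from the paper) showing a topic-based subscription next to a content-based one with a predicate:

from collections import defaultdict

class Broker:
    # Toy broker contrasting topic-based and content-based subscriptions.

    def __init__(self):
        self.topic_subs = defaultdict(list)   # topic name -> callbacks
        self.content_subs = []                # (predicate, callback) pairs

    def subscribe_topic(self, topic, callback):
        # Topic-based: the subscriber names a topic and receives every event on it.
        self.topic_subs[topic].append(callback)

    def subscribe_content(self, predicate, callback):
        # Content-based: the broker filters events with the subscriber's predicate.
        self.content_subs.append((predicate, callback))

    def publish(self, topic, event):
        for cb in self.topic_subs[topic]:
            cb(event)                          # subscriber filters further if needed
        for pred, cb in self.content_subs:
            if pred(event):                    # the infrastructure does the filtering
                cb(event)

broker = Broker()
broker.subscribe_topic("stocks", lambda e: print("topic sub got", e))
broker.subscribe_content(lambda e: e.get("price", 0) > 100,
                         lambda e: print("content sub got", e))
broker.publish("stocks", {"symbol": "MSFT", "price": 120})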

Type-based

Like topic-based, but uses a programming-language-level type instead of a name. Events become type-safe, i.e. they can be checked at compile time.

Filtering is still the subscriber's work, as in the topic-based model.

Data-centric

Another model that uses typed topics and also deals with content.

It's a hybrid of the previous 3 models.

Implementation Concerns

Event

Events can be message-based or invocation-based.

Media
  • The architecture:
    • Central
    • Distributed
  • Dissemination:
    • Point to point
    • Multicast

Centralized approaches generally use point-to-point dissemination and provide strong guarantees rather than high throughput and scalability.

Distributed architectures often use reliable protocols to propagate events to all brokers or to a subset of them.

Quality of Service
  • Persistence of events
  • Priorities between events
  • Transactional semantics for groups of events
  • Reliability

RSA Algorithm

Posted on 2018-01-27 | In Reading Note


RSA (Rivest–Shamir–Adleman) is one of the first public-key cryptosystems and is widely used for secure data transmission.

How it works

Thanks to my number theory class, I can appreciate and understand how the algorithm works.

Before We Start

As a convention, we use Alice and Bob to explain the algorithm. Suppose Alice and Bob want to pass some information between them. Before they start, they choose a large number N which is the product of two prime numbers. For a simple example, let's use 61 and 53; then N is 3233.

Generally people make N pretty large. The length of N in binary notation is the length of the key, and in general, people choose it to be 1024 or 2048 bits. In our case, $3233_{10} = 110010100001_{2}$, which is 12 bits.

Now all we have is $N = 3233$.

Preparation

Apply the Euler totient function $\phi$ to N. By definition, $\phi(N)$ equals the number of positive integers less than N that are coprime to N. There is a general formula for $\phi$: $$ \phi(N) = N\prod_{p \mid N}\left(1-\frac{1}{p}\right) $$ The proof is pretty simple and straightforward. Since N is just a product of two primes p and q, we have $\phi(N) = (p-1)(q-1)$. We denote $\phi(N)$ by $r$; in our example, $r = (61-1)(53-1) = 3120$.

Then Bob and Alice need to choose a number $e$ that is coprime to $r$. The reason is to ensure that a modular multiplicative inverse of $e$ modulo $r$ exists.

By definition, $b$ is a modular multiplicative inverse of $a$ modulo $r$ if $$ a \cdot b \equiv 1 \pmod{r} $$ Mathematicians have developed systematic ways (such as the extended Euclidean algorithm) to find a multiplicative inverse, and with a computer it only takes a few lines of code.
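For example, here is a short Python sketch of the extended Euclidean algorithm (my own illustration, not part of the original note):

def modinv(e, r):
    # Return d such that (e * d) % r == 1, assuming gcd(e, r) == 1.
    old_s, s = 1, 0          # coefficients of e
    old_g, g = e, r          # running remainders
    while g != 0:
        q = old_g // g
        old_g, g = g, old_g - q * g
        old_s, s = s, old_s - q * s
    if old_g != 1:
        raise ValueError("e and r are not coprime")
    return old_s % r

print(modinv(17, 3120))      # 2753, the d used in the example below
# In Python 3.8+, pow(17, -1, 3120) gives the same result.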

Let's say the multiplicative inverse of $e$ modulo $r$ is $d$. In our case, take $e = 17$ and $d = 2753$. Now Alice and Bob have finished their preparation and are ready to pass messages.

In addition, as a spoiler, $(N, e)$ will be the public key and $(N, d)$ will be the private key.

Encryption

Suppose Bob has a message $M$ to pass. He first translates it into a large integer $m$ using some scheme he and Alice agree on. Then he encrypts the message by $$ c \equiv m^e \pmod{N} $$ and sends $c$ to Alice.

Decryption

After receiving $c$, Alice recovers the message by computing $$ c^d \equiv (m^e)^d \equiv m \pmod{N} $$ The proof is as follows: $$ ed = k \cdot r + 1 $$ for some integer $k$, by the definition of the modulo operation and the fact that $e$ and $d$ are multiplicative inverses of each other modulo $r$.

Then $$ m^{ed} \equiv m \cdot (m^{r})^k \pmod{N} $$ Note that $r = \phi(N)$. According to Euler's theorem, $$ a^{\phi(N)} \equiv 1 \pmod{N} $$ if $a$ and $N$ are coprime positive integers. This theorem is a generalization of Fermat's Little Theorem and is closely related to Wilson's Theorem; they can all be learned in an intro number theory class.

Thus we have $$ c^d \equiv m^{ed} \equiv m \cdot (m^r)^k \equiv m \cdot 1^k \equiv m \pmod{N} $$ based on the properties of modular arithmetic.

Now Alice has the information Bob sent, and all she needs to do is use the agreed translation scheme to recover the original message.
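Putting the pieces together, here is a small Python sketch of the whole round trip with the numbers above; the message value m = 65 is a hypothetical choice for illustration:

p, q = 61, 53
N = p * q                    # 3233
r = (p - 1) * (q - 1)        # phi(N) = 3120
e, d = 17, 2753              # e coprime to r; d is its inverse modulo r

m = 65                       # hypothetical message, already encoded as an integer < N
c = pow(m, e, N)             # encryption: c = m^e mod N
recovered = pow(c, d, N)     # decryption: m = c^d mod N

print(c, recovered)          # 2790 65
assert recovered == m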

To Break RSA

Since the public key $(N, e)$ is, as its name suggests, public, an attacker only needs to factor the large number $N$ to obtain $\phi(N)$ and break the whole algorithm. However, factoring large integers has been well known in mathematics as a very hard problem for centuries. At least for now, no known method can guarantee to break the algorithm with a fixed amount of resources.

Last Thought

I dug into the RSA algorithm, and public-key cryptosystems in general, because I am reading a book about blockchain. Public-key cryptography is one of the building blocks of Bitcoin, and I believe it uses secp256k1 in particular. Other algorithms used include various kinds of hashing, e.g. SHA-256, and some encoding/decoding algorithms.

Klondike Solitaire Web App

Posted on 2017-11-29

Solitaire Web Application

In the fall 2017 semester, I implemented a React card game web application. It is now deployed on AWS. This app has many carefully designed features, which I will introduce one by one.

HDFS

Posted on 2017-11-05 | In Study Note , Bigdata , Hadoop

Hadoop Distributed File System

Since HDFS is quite similar to GFS (to me, it's just an open-source GFS 2.0), I will not go into detail on many subjects in this paper.

Architecture

NameNode and DataNode

In short, the architecture of HDFS is quite similar to that of GFS. Both adopt a centralized structure: the NameNode in HDFS corresponds to the master in GFS, and the DataNodes are the chunkservers. Just like in GFS, the NameNode handles the metadata. It maintains the namespace tree and the mapping of file blocks to DataNodes. Typically, each cluster has one NameNode and many DataNodes.

The NameNode can alternatively run as a CheckpointNode or a BackupNode.

HDFS keeps the entire namespace in RAM. It also uses checkpoints and a journal to provide fault recovery. Note that the locations of block replicas on DataNodes are not stored in the checkpoint.

The DataNode also uses checksums to prevent data loss. It stores two files for each block replica: one for the data and one for the metadata, including checksums and the generation stamp. The generation stamp (version number) is used to discard stale replicas, and the namespace ID is used to decide whether a DataNode belongs to the cluster.

During normal operation, each DataNode sends heartbeats to the NameNode. Heartbeats carry information about the DataNode, such as its storage capacity and the number of data transfers in progress. The NameNode replies to heartbeats to send instructions, including:

  • replicate blocks to other nodes
  • remove local block replicas
  • re-register or shut down the node
  • send an immediate block report (a report of the block replicas stored on the node)

Image and Journal

The namespace image is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint.

The journal is in fact a write-ahead log, a technique widely applied in modern databases. For each transaction, the change is recorded in the journal, and the journal file is flushed and synced before the change is committed.

This design has a drawback: since the NameNode is a multithreaded system, saving a transaction to disk may become a bottleneck, as all other threads need to wait. HDFS optimizes this by batching transactions when saving them.

File I/O Operation

File Read & Write

After a file is closed, the bytes written cannot be altered or removed, except that new data can be appended to the end. HDFS implements a single-writer, multiple-reader model. It uses leases to grant permission to clients, just as GFS does, which means a write with concurrent reads is supported as well.

When writing a file, the client pushes a sequence of packets along a pipeline to the blocks that store the file and its replicas. The pipeline is set up before the write and closed afterwards. Note that in the figure, an hflush operation is issued right after packet 2 is sent; it forces the client to wait for the acknowledgment of packet 2 before sending packet 3.

<img src="../images/Papers/HDFS-datapipeline.png" width="400">

Block Placement and Others

As in GFS, HDFS tends to spread replicas across different machines and also different racks. The default HDFS replica placement policy can be summarized as follows (a minimal sketch of the check appears after the list):

  1. No DataNode contains more than one replica of any block.
  2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster.
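Here is a minimal Python sketch of that constraint check (the node and rack names are hypothetical; this is not the actual HDFS code):

def can_place(replica_locations, candidate_node, candidate_rack):
    # replica_locations: list of (node, rack) pairs already holding the block.
    nodes = [n for n, _ in replica_locations]
    racks = [r for _, r in replica_locations]
    if candidate_node in nodes:              # rule 1: at most one replica per DataNode
        return False
    if racks.count(candidate_rack) >= 2:     # rule 2: at most two replicas per rack
        return False
    return True

existing = [("dn1", "rackA"), ("dn2", "rackB")]
print(can_place(existing, "dn3", "rackA"))   # True
print(can_place(existing, "dn1", "rackC"))   # False: dn1 already holds a replica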

Other functionality provided by HDFS includes the balancer, a tool to balance disk usage across the cluster, and the block scanner, which periodically scans block replicas and verifies their checksums.

In my view, some of the mechanisms in HDFS simply split the functions of GFS into smaller units and manage them separately.

Conclusion

The rest of the paper covers practical experience, future work, and acknowledgements. In short, HDFS is a powerful distributed file system, essentially an open-source, upgraded GFS.

One final thing to note: the major drawback of having many namespaces is the cost of managing them. I guess that's an important reason for YARN to show up.

Reference

[1] K. Shvachko, H. Kuang, S. Radia, and R. Chansler (Yahoo!). 2010. The Hadoop Distributed File System.

Hello World

Posted on 2017-10-29

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment

MapReduce-2

Posted on 2017-10-29 | In Study Note , Bigdata , Hadoop

Second Look at MapReduce

Earlier I wrote a very simple post about the idea of MapReduce; it was in fact part of my lecture notes for the Big Data course. Recently I re-read the paper, MapReduce: Simplified Data Processing on Large Clusters [1], and I want to add some notes on this topic. I will not follow the order of the paper and will only write down some of the things that stand out, including my own thoughts and other related topics.

Programming Model

The interface is pretty simple: typically MapReduce is an imported library, e.g. mrjob; the user only needs to define the map and reduce functions and pass them to the library to run.
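As an example, here is a minimal word-count sketch using the mrjob library mentioned above (the class and file names are my own):

# word_count.py -- run with: python word_count.py input.txt
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts collected for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()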

Execution Overview

As discussed in the paper, the input data is first partitioned into M splits for the map tasks; later there will be R workers for the reduce operation. Typically, M and R are much larger than the number of machines.

Each worker executes the user-defined map function, and results are written to local disk periodically. The master is responsible for a scheduling plan that reduces data transfer; typically, it tries to schedule a map task on or near a machine that holds a replica of its input split. In this way, most input data is read locally and consumes no network bandwidth.

Note:

  1. One reason that Spark is much faster is that in MapReduce the intermediate results are written to local disk and the reduce tasks have to read them from disk, while Spark keeps its RDDs in memory.
  2. Reduce tasks cannot start until all map tasks finish; there is a barrier between the two phases.

A Confusion

The paper indicates that after all the intermediate keys are received, they are sorted on the reduce node so that all occurrences of the same key are grouped together. I did not quite understand how they achieve this. To me, it would make more sense for the master to handle it: after all the intermediate results are read, the master sorts them, possibly with an external sort since the data volume is too large to fit in memory, and the sorted result is fed to the reduce nodes. But in that case the network bandwidth would become a restriction.

After a discussion with a professor, I realized that the shuffle phase between map and reduce is not covered in this paper. In fact, after a mapper writes its data to local disk, it sorts the data; the master then tells the reduce workers where the sorted partitions are, and the reducers fetch and merge them before running the reduce function. This explains why there has to be a barrier between map and reduce.

Fault Tolerance

Worker

The MapReduce library supports fairly strong fault tolerance on worker nodes, but compared to Spark it is quite inefficient.

For a map task, if a worker fails, the task has to be re-executed in its entirety, as the intermediate results are stored on its local disk. For a reduce task, since the output is written to a global file system (GFS, no doubt, at that time, 2004), there is no need to re-run a completed reduce task.

Master

In the paper, they mention that they could use checkpoints, as in GFS, to preserve the master's state. However, due to the small chance of failure, they decided to simply abort the computation if the master fails.

I think they could also adopt the idea of the shadow master from GFS. The primary master could record-append the status and locations of worker tasks to a chunk in GFS; when the master fails, the shadow master could continue from that chunk and handle the rest.

Just some naive thoughts.

Others

The paper also mentions how they take advantage of location information to better schedule tasks, and it discusses task granularity. In addition, they handle stragglers by scheduling backup tasks.

Combiner

In some cases, when keys repeat frequently and the reduce operation is commutative and associative, MapReduce allows the user to add a combiner function as an optimization.

In my view, it is essentially a pre-reduce performed on the map side.
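Continuing the earlier word-count sketch, a combiner in mrjob is declared like a reducer but runs on the map side (again an illustrative sketch, not code from the paper):

from mrjob.job import MRJob

class MRWordCountWithCombiner(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Pre-reduce locally on the map side to shrink the intermediate data.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCountWithCombiner.run()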

Conclusion

In short, MapReduce is a very powerful programming model which paved the way for modern big data processing techniques.

Reference

[1] Dean J, Ghemawat S. 2008. MapReduce: Simplified data processing on large clusters. Commun ACM 51: 107–113.

Pyspark in Jupyter Notebook with SSH Port Forwarding on a Local VM

Posted on 2017-10-28


This is a note I used to set up a Pyspark environment on a local VM with Vagrant.

Step One — Spawn Local VM

Vagrant up a local virtual machine. There is only one line in the Vagrantfile that needs to change: uncomment the following line. Note that the guest port should be 8888, as the default port for Jupyter notebook is 8888. The host port can be whatever you like.

config.vm.network "forwarded_port", guest: 8888, host: 8888

It is recommended that you set up a shared folder, as always. We need it to pass the Spark .tgz file into the VM. The corresponding line in the Vagrantfile is:

config.vm.synced_folder "./ShareFile", "/vagrant_data"

where ShareFile is a folder I created in the same directory as the Vagrantfile. You can name and place your shared folder however you like.

Step Two — Install Needed Packages

Use vagrant ssh to shell into your local VM. Install the packages using the following commands:

sudo apt-get update
sudo apt-get install -y python3
sudo apt-get install -y python3-pip
pip3 install py4j
pip3 install jupyter
sudo apt-get install -y default-jre
sudo apt-get install -y scala

These commands install Python 3, pip3, py4j, Jupyter notebook, Java and Scala respectively. Java and Scala are needed because Spark is written in Scala, which runs on the Java Virtual Machine.

Step Three — Download Spark Latest Version

Download the latest Spark version for Ubuntu <a href="https://spark.apache.org/downloads.html">here</a>. I downloaded spark-2.2.0-bin-hadoop2.7.tgz, which should be roughly 200 MB. After downloading, put it in the shared folder.

In the local VM, use mv to move the .tgz file to the /home/ubuntu directory, i.e. the home directory. Then type the following command to unpack the file.

sudo tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz

Step Four — Configure the Pyspark Environment

In the home directory of the local VM, use whatever text editor you prefer to open the .bashrc file. Copy and paste the following lines at the end of the file.

export SPARK_HOME='/home/ubuntu/spark-2.2.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

These lines specify the paths needed to find the pyspark module in the Jupyter notebook. After editing, run this command:

source ~/.bashrc

Finally, change the permission of the Spark files.

sudo chmod 777 spark-2.2.0-bin-hadoop2.7
cd spark-2.2.0-bin-hadoop2.7/
sudo chmod 777 python
cd python/
sudo chmod 777 pyspark

These commands will give the user privileges to access the pyspark module.

Step Five — Restart the VM and Check

Now restart the virtual machine. You can exit it and use vagrant reload to do so.

After the restart, shell into the local VM and, from any folder, type the following command to start the Jupyter notebook:

jupyter notebook --ip=0.0.0.0

Note: the --ip argument is needed to enable port forwarding!

Now you can open a browser on the host machine and go to http://localhost:8888 (or whatever host port you have chosen). You should be able to access the Jupyter notebook running in the local VM.

Then in the notebook, try the following command:

import pyspark

This should not give any output.

If no error is generated, then congratulations, you have successfully set up the environment!
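To go one step further, you can sanity-check the installation with a tiny job in the same notebook (a quick sketch; the app name is arbitrary):

import pyspark

# Start a local SparkContext and run a trivial computation.
sc = pyspark.SparkContext(appName="sanity-check")
print(sc.parallelize(range(10)).sum())   # expected output: 45
sc.stop()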

GFS-2

Posted on 2017-10-19 | In Study Note , Bigdata

Google File System Note 2

Continuing from note 1, this note is also based on the Google File System paper [1].

Master Operations

The master executes all namespace operations. It makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across the chunkservers, and to reclaim unused storage.

Namespace Management and Locking

To allow multiple master operations to run concurrently, the master uses locks over regions of the namespace to ensure proper serialization.

GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. Each node in the namespace tree, whether a file name or a directory name, has an associated read-write lock.

Each master operation acquires a set of locks before it runs. It acquires read locks on all the ancestor directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that the leaf can be a directory or a file.

The read/write locks in GFS are the same as locks in other file systems. Two things to note:

  1. File creation does not require a write lock on the parent directory, because there is no "directory" (or inode-like) data structure to be protected; a read lock on its name is enough.
  2. Locks are acquired in a consistent total order to prevent deadlock.

One nice property of this locking scheme is that it allows concurrent mutations in the same directory.
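As an illustration (my own sketch, not code from the paper), the lock set for an operation on a full pathname could be computed like this:

def lock_set(path, write_leaf):
    # Return (read_locks, write_locks) for an operation on `path`.
    # Read locks go on every ancestor directory name; the leaf gets a
    # write lock for mutations, otherwise a read lock.
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    if write_leaf:
        return ancestors, [leaf]
    return ancestors + [leaf], []

# Creating /home/user/foo takes read locks on /home and /home/user
# and a write lock on /home/user/foo.
print(lock_set("/home/user/foo", write_leaf=True))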

Replica Placement

The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization.

GFS spreads chunk replicas not only across machines but also across racks, to guard against a whole rack being damaged or going offline. This also allows reads of a chunk to exploit the aggregate bandwidth of multiple racks.

Garbage Collection

When a file is deleted, GFS reclaims the physical storage lazily during regular garbage collection at both the file and chunk levels.

Mechanism

When a file is deleted, the operation is logged just like any other. However, the file is only renamed to a hidden name that includes the deletion timestamp. The file is actually removed during the master's regular scan of the file system namespace after it has existed for more than three days (configurable). When the hidden file is removed from the namespace, its in-memory metadata is erased.

This mechanism provides an efficient way to undelete files.

Stale Replica Detection

For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas. Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas.

Whenever chunk version numbers conflict, the largest one is always assumed to be up to date, even when the master's own record is not the largest.

Fault Tolerance and Diagnosis

One of the greatest challenges in designing the system is dealing with frequent component failures.

High Availability

Two simple yet effective strategies: fast recovery and replication.

Fast Recovery

GFS does not distinguish between normal and abnormal termination. Both the master and the chunkserver are designed to restore their state and start in seconds.

Replication

For chunks, each chunk is replicated on multiple chunkservers on different racks.

For master, its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas.

Shadow Master

GFS also supports "shadow masters". They provide read-only access to the file system when the primary master is down. These shadows may lag the primary slightly, but they enhance read availability for files that are not being actively mutated or for applications that do not mind getting slightly stale results.

To keep informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.

Data Integrity

Each chunkserver uses checksumming to detect corruption of stored data. As discussed in the first note, the semantics of GFS mutations, in particular atomic record append, do not guarantee identical replicas.

The paper does not discuss which checksum GFS uses. One typical and simple checksum algorithm is the parity byte or parity word, which breaks the data into "words" with a fixed number n of bits and then computes the exclusive or (XOR) of all those words. The result is appended to the message as an extra word. To check the integrity of a message, the receiver computes the exclusive or of all its words, including the checksum; if the result is not a word of n zeros, the receiver knows a transmission error occurred. A detailed discussion of checksums can be found on Wikipedia.
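A minimal Python sketch of such a parity-word checksum, using 8-bit words (purely illustrative; this is not the checksum GFS actually uses):

def xor_checksum(data: bytes) -> int:
    # XOR all 8-bit words together into a one-byte checksum.
    check = 0
    for byte in data:
        check ^= byte
    return check

def verify(data: bytes, checksum: int) -> bool:
    # XOR of all words including the checksum is zero if the data is intact.
    return xor_checksum(data) ^ checksum == 0

msg = b"hello gfs"
c = xor_checksum(msg)
print(verify(msg, c))           # True
print(verify(b"hellx gfs", c))  # False: the corrupted byte is detected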

Measurements & Related Works

I will skip these parts, as they are pretty outdated (2003) and are not what I want to learn from this paper.

Conclusion

In GFS, they treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.

The system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery. Additionally, it uses checksumming to detect data corruption at the disk or IDE subsystem level.

GFS achieves high aggregate throughput by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. Master involvement in common operations is minimized by a large chunk size and by chunk leases.

Reference

  1. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung (Google). "The Google File System." Symposium on Operating Systems Principles (SOSP) '03, Association for Computing Machinery. Published 2003.

GFS-1

Posted on 2017-10-16 | In Study Note , Bigdata

Google File System Note 1

During my leisure time, I plan to review some of my big data papers. I will try to write study notes for some of the papers that I find important or interesting. This is the note for the first half of the famous Google File System paper [1].

Introduction

There are several traditional design choices that the authors decided to reexamine, as they wanted to explore radically different points in the design space:

  1. Component failures are the norm rather than the exception.

There are all sorts of possible failures. Constant monitoring, error detection, fault tolerance and automatic recovery must be integral to the system.

  2. Files are huge.
  3. Most files are mutated by appending new data rather than overwriting existing data.
  4. Co-designing the applications and the file system API benefits the overall system by increasing flexibility.

Design Overview

Assumptions

They list quite a few assumptions. Of those, I extract a few that caught my attention:

  • There are typically two types of reads:
    • Large streaming reads
    • Small random reads
      • Performance-conscious applications often batch and sort their small reads.
  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. (Spoiler: I believe they use record append to handle this problem.)
  • High sustained bandwidth is more important than low latency.

Interface

GFS supports the usual file operations, namely create, delete, open, close, read, and write. Moreover, GFS provides snapshot and record append operations, which are discussed later.

Architecture

A GFS cluster consists of a single master and multiple chunkservers. The architecture is shown below.

<img src="/images/Papers/GFS-architecture.png">

Files are divided into fixed-size chunks, typically 64 MB. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. For reliability, each chunk is replicated on multiple chunkservers.

The master maintains all the metadata. It periodically communicates with each chunkserver in HeartBeat messages to give instructions and collect its state.

Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.

Neither the client nor the chunkserver caches file data. Clients do cache metadata, however; the metadata is used to locate chunkservers and is sent by the master.

Single Master

Since there is only one master, GFS must minimize its involvement in reads and writes so that it does not become a bottleneck. The client gets chunkserver information from the master and then interacts with those servers directly.

For example, to perform a simple read, the client first translates the file name and byte offset into a chunk index within the file, using the fixed chunk size. It sends this information to the master and gets back the corresponding chunk handle and the locations of its replicas. The client then contacts a chunkserver directly for the subsequent operations, and the master is not involved.
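A minimal sketch of that client-side translation (the 64 MB chunk size is from the paper; the function itself is my own illustration):

CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB, the chunk size suggested in the paper

def to_chunk_request(filename, byte_offset):
    # Translate (filename, byte offset) into what the client asks the master for.
    chunk_index = byte_offset // CHUNK_SIZE      # which chunk of the file
    offset_in_chunk = byte_offset % CHUNK_SIZE   # used later against the chunkserver
    return filename, chunk_index, offset_in_chunk

# Reading byte 200,000,000 of /data/log falls into chunk 2 of that file.
print(to_chunk_request("/data/log", 200_000_000))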

Chunk Size

The authors suggested 64 MB chunk size. GFS uses lazy space allocation to avoid wasting space due to internal fragmentation.

There are several advantages with a big chunk size:

  1. It reduces the clients' need to interact with the master, as reads and writes on the same chunk need only one initial request.
  2. It reduces network overhead by keeping a persistent TCP connection.
  3. It reduces the size of metadata stored on master.

There is also a disadvantage:

  • Small files that fit in one chunk may turn that chunk into a hot spot when many clients access it at the same time.

Metadata

The master stores three major types of metadata in its memory: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. The first two types are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines.

In-Memory Data Structure

Master operations are fast because the metadata is in memory. Furthermore, the master periodically scans through its entire state in the background.

One might worry that the system is limited by the master's memory size. In fact this is not a serious limit, as each chunk needs less than 64 bytes of metadata.

Chunk Locations

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup.

Operation Log

This is the only persistent record of metadata in GFS. It also serves as a logical timeline that defines the order of concurrent operations. It is replicated on multiple remote machines, and the master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely.

The master recovers its file system state by replaying the operation log. To keep the log small, the master checkpoints its state whenever the log grows beyond a threshold. The checkpoint is in a compact B-tree-like form that can be directly mapped into memory.

Consistency Model

GFS has a relaxed consistency model.

Guarantees by GFS

Two definitions:

  • A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
    • A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation.

System Interactions

The GFS system is designed to minimize the master's involvement in all operations.

Lease and Mutations Orders

A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append operation. GFS uses leases to maintain a consistent mutation order across all replicas.

The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations.

<img src="/images/Papers/GFS-dataflow.png" width="400" class="center">

The figure above shows the detailed order of write control and data flow.

Data Flow

To fully utilize each machine’s network bandwidth, the data is pushed linearly along a chain of chunkservers. Each machine forwards the data to the “closest” machine in the network topology that has not received it. Once a chunkserver receives some data, it starts forwarding immediately.

Atomic Record Appends

In a record append operation, the client specifies only the data. GFS appends it to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client.

When doing a record append, the primary checks whether appending the record would make the current chunk exceed the maximum size.

  • If so, it pads the chunk and its replicas to the maximum size, then informs the client that the operation should be retried on the next chunk. (Record appends are restricted to at most one-fourth of the chunk size.)
  • If the record fits inside the chunk, the primary appends the data, tells the secondaries to do the same, and finally replies success to the client.

GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit.
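Here is a small Python sketch of the primary's decision described above (the chunk size and the one-fourth limit are from the paper; the function shape is my own):

CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4     # record appends are limited to 1/4 of the chunk size

def record_append(chunk_used, record_len):
    # Decide what the primary does for a record append on the current chunk.
    assert record_len <= MAX_APPEND
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk on all replicas and ask the client to retry
        # on the next chunk.
        return "pad_and_retry", CHUNK_SIZE - chunk_used
    # Otherwise append at the current offset on the primary and all secondaries.
    return "append_at", chunk_used

print(record_append(chunk_used=CHUNK_SIZE - 100, record_len=4096))   # ('pad_and_retry', 100)
print(record_append(chunk_used=1024, record_len=4096))               # ('append_at', 1024)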

Snapshot

The snapshot operation makes a copy of a file or a directory tree almost instantaneously. It is implemented using copy-on-write techniques.

To perform snapshot:

  1. Revoke all outstanding leases on the chunks of the files that are about to be snapshotted.
  2. The master then logs the operation and duplicates the metadata for the source file or directory tree.
  3. When a client later writes to a snapshotted chunk, the replicas are copied locally by creating new chunks on the same chunkservers.

Reference

  1. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung (Google). "The Google File System." Symposium on Operating Systems Principles (SOSP) '03, Association for Computing Machinery. Published 2003.

Offer From Microsoft

Posted on 2017-10-04


Preparation

This summer, while doing research with Professor Schumaker, I got this idea: why not find an internship at big Tech Company right before my grad school?

So I started to prepare in August. I had applied to Google and Microsoft for internships last year, but I did it too late. It turned out that I got an open offer from Google but did not match any team. For Microsoft, I passed the first round, but they did not have enough headcount to even schedule a final round for me.

With some experience, I made a plan for applying to tech companies. The most important thing is problem-solving skill, namely solving algorithm problems on a whiteboard. I used Leetcode, a free website, and the famous Cracking the Coding Interview to prepare.

I got an email from Microsoft in mid-August and we scheduled an onsite interview in early October. Before that, all I did was go over all the CS courses I had taken and keep practicing.

Interview

Last Monday, I flew all the way from Nashville to Seattle, to the Microsoft campus in Redmond. The interview was scheduled at 7:30 a.m., and I had to get up at about 6:30, as I stayed in a hotel 20 minutes away from the building by taxi.

The interview consisted of 4 rounds, each around 45 minutes, from 8:00 to 12:00. Due to confidentiality concerns, I cannot disclose any of the problems I got, but the overall process was pretty smooth. The interviewers were nice; most of them have been working at Microsoft for more than 10 years. The group I interviewed with was the Azure group, and the intern and full-time offers that I got later are both for this group.

After the interview, all the interviewees were brought to the Microsoft campus dining hall, called the Commons, for lunch. I made acquaintance with some of the interviewees from all over the U.S. Though the food was not that tasty, I did have a good time there.

After that, we were dismissed and I went straight back to my hotel, since I really hadn't slept well the night before.

Offer

Two days ago, I finally got my offer as an intern. I say 'finally' because they generally make decisions within a few days. The reason for my delay is hilarious: it turns out that I have pretty messy handwriting, and they recorded a wrong phone number in the system. My recruiter had been calling me since last Wednesday, when they had already made the decision.

Just like the good old Chinese phrase "好事多磨", which means good fortune may come after a fair amount of trouble. Since I am graduating next year right before my internship starts, I had no idea how and where to get a visa to legally work in the U.S., so I called my recruiter to get some advice.

That is when I got the news that, since I am eligible for a full-time position, they decided to just give me a full-time offer. If I prefer to go to grad school later, I can decline it and rearrange it into an internship!

This was amazing and totally surprising! Later, one of my friends who works at Facebook told me it may be due to my good performance at the interview that Microsoft thought I was qualified for a direct full-time offer. Now my faith in graduate school is shaken. Some friends suggested I work for a few years first and then go to grad school. Since I am in the middle of this tedious and consuming graduate school application process, I really don't know what the right choice is.

Hopefully, when I look back at this post next year, I will have made a promising decision and won't regret it in the future. We will see.
