Dense Passage Retrieval for Open-Domain Question Answering (2020)
Open-domain question answering (QA) involves answering fact-based questions using a large collection of documents. An open-domain QA system can be divided into two components: one that retrieves relevant passages and another that extracts the answer spans from those passages (Chen et al., 2017). While traditional approaches use sparse vector space models like BM25 for the retrieval step, Dense Passage Retrieval for Open-Domain Question Answering shows that the retrieval component can be practically implemented using dense representations instead.
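As a rough illustration of the retrieval step, the sketch below ranks precomputed passage embeddings against a question embedding by inner product. The array shapes and the random vectors standing in for encoder outputs are assumptions for illustration; the paper itself uses BERT-based dual encoders.

```python
# Minimal sketch: dense retrieval by inner product over a passage index.
import numpy as np

def retrieve(question_vec: np.ndarray, passage_matrix: np.ndarray, k: int = 5):
    """Return indices of the top-k passages by inner-product similarity.

    question_vec:   (d,) dense embedding of the question
    passage_matrix: (n_passages, d) precomputed passage embeddings
    """
    scores = passage_matrix @ question_vec          # (n_passages,)
    return np.argsort(-scores)[:k]

# Toy usage with random vectors in place of real encoder outputs.
rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 768))             # hypothetical passage index
question = rng.normal(size=768)                      # hypothetical question embedding
top_idx = retrieve(question, passages, k=5)
```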
The embeddings are learned from the training data by maximizing the inner product between each question vector and the vector of its relevant (positive) passage. The training is essentially metric learning, so each question also needs irrelevant (negative) passages. In experiments, the best performance came from combining two kinds of negatives: top passages returned by BM25 that do not contain the answer, and positive passages paired with the other questions in the same training batch (in-batch negatives); a sketch of this objective follows.
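The sketch below illustrates this objective with in-batch negatives plus one BM25-style hard negative per question: the scores over all candidate passages are fed to a softmax, and the loss is the negative log-likelihood of the true positive. The tensor shapes and the random embeddings standing in for the question and passage encoders are assumptions, not the paper's exact implementation.

```python
# Sketch of the training objective: for each question, its negatives are the
# other questions' positive passages in the batch plus one hard negative.
import torch
import torch.nn.functional as F

def dpr_loss(q_emb, pos_emb, hard_neg_emb):
    """q_emb, pos_emb, hard_neg_emb: (batch, dim) outputs of the two encoders."""
    batch = q_emb.size(0)
    # Candidate passages for every question: all positives + all hard negatives.
    candidates = torch.cat([pos_emb, hard_neg_emb], dim=0)   # (2*batch, dim)
    scores = q_emb @ candidates.t()                           # (batch, 2*batch) inner products
    # The correct passage for question i is its own positive at column i.
    targets = torch.arange(batch, device=q_emb.device)
    return F.cross_entropy(scores, targets)                   # softmax over all candidates

# Toy usage with random embeddings standing in for encoder outputs.
q = torch.randn(8, 768, requires_grad=True)
pos = torch.randn(8, 768)
neg = torch.randn(8, 768)
loss = dpr_loss(q, pos, neg)
loss.backward()
```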