Spark SQL is a module in Apache Spark that integrates relational processing (e.g., declarative queries) with Spark's functional programming API. It provides a declarative DataFrame API to bridge relational and procedural processing, and it supports both external data sources (e.g., JSON, Parquet, and Avro) and internal data collections (i.e., RDDs). It also ships with Catalyst, a highly extensible optimizer that makes it easy to add complex rules, control code generation, and define extension points.
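As a minimal sketch of how the two sides meet, the snippet below loads an external JSON source into a DataFrame, queries it declaratively with SQL, and then drops down to the underlying RDD for procedural processing. It assumes Spark 1.6 APIs; the file path and column names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataframe-sketch"))
    val sqlContext = new SQLContext(sc)

    // Relational side: load an external JSON source (hypothetical path) and query it with SQL.
    val people = sqlContext.read.json("people.json")
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

    // Procedural side: the result is still backed by an RDD.
    adults.rdd.map(row => row.getString(0)).take(10).foreach(println)

    sc.stop()
  }
}
```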
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Many of its ideas and techniques are widely used in Big Data stacks, e.g., BigTable and HBase. The code is well written and well documented, which makes it an ideal project to learn from.
This post illustrates the implementation of UDFs (user-defined functions) in Spark SQL; the targeted version is Spark 1.6.0 and the targeted language is Scala. I will cover UDFs in roughly two parts: registration and execution.
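For context, here is a minimal sketch of the two common ways a UDF is registered and invoked in Spark 1.6 Scala code; the data and function names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("alice", 1), ("bob", 2))).toDF("name", "id")

    // Path 1: register a Scala function under a name, so SQL text can call it.
    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
    df.registerTempTable("users")
    sqlContext.sql("SELECT toUpper(name) FROM users").show()

    // Path 2: wrap the function with `udf` for direct use in the DataFrame API.
    val toUpperUdf = udf((s: String) => s.toUpperCase)
    df.select(toUpperUdf($"name")).show()

    sc.stop()
  }
}
```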
Bulk loading is an HBase feature for ingesting large amounts of data efficiently. In this post, I am going to share some basic concepts of bulk loading and how to use it from MapReduce and Spark.
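As a rough sketch of the Spark side, the snippet below writes sorted KeyValues as HFiles and then hands them to the region servers. It assumes HBase 1.x client APIs; the table name, column family, input path, and staging directory are all hypothetical.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object BulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bulkload-sketch"))
    val conf = HBaseConfiguration.create()
    val tableName = TableName.valueOf("demo")   // hypothetical table with family "cf"
    val family = Bytes.toBytes("cf")
    val qualifier = Bytes.toBytes("q")

    // 1. Turn the input into (rowkey, KeyValue) pairs, sorted by row key
    //    (HFiles must be written in sorted key order).
    val kvs = sc.textFile("hdfs:///input/data.txt")   // hypothetical "key,value" lines
      .map { line =>
        val Array(key, value) = line.split(",", 2)
        (key, value)
      }
      .sortByKey()
      .map { case (key, value) =>
        val row = Bytes.toBytes(key)
        (new ImmutableBytesWritable(row),
         new KeyValue(row, family, qualifier, Bytes.toBytes(value)))
      }

    // 2. Write HFiles into a staging directory with HFileOutputFormat2.
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(tableName)
    val regionLocator = conn.getRegionLocator(tableName)
    val job = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
    kvs.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/hfiles",
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat2], job.getConfiguration)

    // 3. Move the finished HFiles into the regions.
    new LoadIncrementalHFiles(conf)
      .doBulkLoad(new Path("hdfs:///tmp/hfiles"), conn.getAdmin, table, regionLocator)

    sc.stop()
  }
}
```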
This is actually an old post from my GitHub; I put it here for better exposition and presentation.
This summer, I was really lucky to work with Prof. Andy Pavlo at the Database Group of Carnegie Mellon University. We worked on a project named the Carnegie Mellon Database Application Catalog (CMDBAC), a repository of thousands of ready-to-run database applications for analysis and benchmarking.