Zeyuan Shang

2021 Weekly Summary Report

Fri, 01 Jan 2021 00:00:00 +0000

2022/1/3 - 2022/1/9

Davos
- Python event optimization
- Check data size while caching
- WhatIf optimization to avoid an idle thread
- Memory issue investigation
- Nginx configuration for embedded views
- Fix random bugs
Learning
- LeetCode 101 Chap 11
- Japanese
Misc
- STEM OPT evaluation
- ICDE reviews

2021/12/27 - 2022/1/2

Davos
- Refactor cache hints
- Investigate EMR Kerberos
Learning
- LeetCode 101 Chap 8-10
- Japanese

2021/12/20 - 2021/12/26

Davos
- Davos slides
- Backend job descriptions
Learning
- LeetCode 101 Chap 7
- Japanese

2021/12/13 - 2021/12/19

Davos
- Multi-job benchmark
- Shared sample store & cache manager
- Fix random bugs
Learning
- LeetCode 101 Chap 6

2021/12/6 - 2021/12/12

Davos
- Expose UDO on ECS
- Enable auto scaling
- Prometheus integration
- Fix random bugs
Learning
- LeetCode 101 Chap 1-5

2021/11/29 - 2021/12/5

Misc
- Preparation for N2
- I-485 submission

2021/11/22 - 2021/11/28

Davos
- UDO on AWS ECS
- Multi-job benchmark
- Verbier as a job
Reading
- 百面深度学习 Chap 5
Misc
- Preparation for N2
- I-485 preparation

2021/11/15 - 2021/11/21

Davos
- More logging in control plane
- Rename workspace id
- Python runtime fix
- Support mean for timestamp
Misc
- Preparation for N2

2021/11/8 - 2021/11/14

Davos
- Enable expose in UDO
- Investigate AWS Lambda
- Data source load refactor
- Reorganize K8s files (decoupling)
- Deploy istio on staging
- Deprecate load data source
- UDO use cases
- Fix random bugs
Misc
- Preparation for N2

2021/11/1 - 2021/11/7

Davos
- Setup conbench
- Datadog integrations (Redis, Mongo)
- Cache manager refactor
- Investigate & install istio
- Final version optimization
- Fix random bugs
Reading
- 百面深度学习 Chap 3-4
Misc
- Preparation for N2
- EB1 submission

2021/10/25 - 2021/10/31

Davos
- OpenShift deployment
- Refactor K8s files
- Rewrite process mining
- Fix random bugs
Reading
- 百面深度学习 Chap 2
Misc
- Preparation for N2

2021/10/18 - 2021/10/24

Davos
- Local sample store
- Shared sample store
- Skip intermediate versions for operators
- Fix random bugs
Reading
- 百面深度学习 Chap 1
- C++ tips 175-187 (finished)
Misc
- Preparation for N2

2021/10/11 - 2021/10/17

Davos
- Pipeline validator
- Pipeline scorer
- Remove log to file
- Better stack trace
Reading
- 容器实战高手课 15-24 (finished)
- C++ tips 152-173
Misc
- Preparation for N2

2021/10/4 - 2021/10/10

Davos
- Containerize KDA
- Containerize text feature
- Silence pystan
Reading
- 容器实战高手课 8-14
- C++ tips 131-136
Misc
- Preparation for N2

2021/9/27 - 2021/10/3

Davos
- K8s deployment documentation
- Optimize K8s QoS class
- Deploy & test control plane
- Job resource accounting
Reading
- 容器实战高手课 0-7
- C++ tips 122-130
Misc
- Preparation for N2

2021/9/20 - 2021/9/26

Davos
- Release script for AWS
- Deploy UDO disk limit on AWS
- Migrate staging, demo and saas
Reading
- 深入剖析Kubernetes 32-47 (finished)
- C++ tips 88-120
Misc
- Preparation for N2

2021/9/13 - 2021/9/19

Davos
- Fix UDO network limit
- Deploy UDO disk limit on GCP
- Shared cache
- Migrate CI to K8s
Reading
- 深入剖析Kubernetes 5-31
- C++ tips 55-77
Misc
- Preparation for N2

2021/9/6 - 2021/9/12

Davos
- Einblick deployment on AWS
- Python package limit for UDO
- Better shared cache
- Investigate AWS DataSync
Reading
- C++ tips 1-49
- 深入剖析Kubernetes 0-4
Misc
- N2 reservation

2021/8/30 - 2021/9/5

Davos
- Network limit for UDO
- Disk limit for UDO
- Data adapter fix
- Introduce kustomize
- Fix random bugs
Reading
- 现代C++实战14-30 (finished)

2021/8/23 - 2021/8/29

Davos
- Better header files with forward declarations
- Join optimization
- Add planner convert options
- Ccache with CLion
- Datadog K8s integration
- Datadog monitors
- Refactor UDO client
- Fix random bugs
Reading
- 现代C++实战0-13

2021/8/16 - 2021/8/22

Davos
- Time limit for UDO
- Finalize Python operators
- Refactor UDO server
- Containerize profiler
- Reorganize header files
- Switch to Arrow’s thread pool
- Switch to Ninja
- Fix random bugs
Reading
- More Effective C++ Chap 30-35 (finished)

2021/8/9 - 2021/8/15

Davos
- Refactor operator
- Switch to Arrow 5.0
- Set up cache on GitLab
- Clean up streams in operators
- Fix random bugs
Misc
- Mail EB1B evidence

2021/8/2 - 2021/8/8

Davos
- Refactor Python data sources to readers
- Refactor CI scripts
- Log redaction through Replicated
- Fix random bugs
Research
- VLDB camera ready & talk
Learning
- N2 Preparation
Reading
- More Effective C++ Chap 29
Misc
- EB1B Preparation

2021/7/26 - 2021/8/1

Davos
- Shared cache v1.0
- Fixed batch size for all connectors
- Job refactor
- Update SaaS configs
- Containerize pipeline trainer
Learning
- N2 Preparation
Reading
- More Effective C++ Chap 27-28
Misc
- EB1B Preparation

2021/7/19 - 2021/7/25

Davos
- Shared cache v1.0
- Refactor sample
- Fix random bugs
Learning
- N2 Preparation
Reading
- More Effective C++ Chap 25-26
Misc
- EB1B Preparation

2021/7/12 - 2021/7/18

Davos
- Support week in binning
- Count not null in aggregation
- Turn on containerized AutoML by default
- Add validation info to AutoML
- Shared cache: control plane
- Fix random bugs
Learning
- N2 Preparation
Reading
- More Effective C++ Chap 9-24
Misc
- EB1B Preparation

2021/7/5 - 2021/7/11

Davos
- Support overall progress in Python operators
- Parquet timestamp conversion
- Refactor Python operator & job
- Fix random bugs
Learning
- N2 Preparation
Reading
- More Effective C++ Chap 1-8

2021/6/28 - 2021/7/4

Davos
- Deprecate old data sources (e.g., CSV)
- AutoML containerization deployment & optimization
- Fix random bugs
Learning
- N2 Preparation
Reading
- Kubernetes in Action Chap 18 (finished)
Misc
- EB1B Preparation

2021/6/21 - 2021/6/27

Davos
- Projection column limit
- File based data source for S3
- Better scaling for aggregation
- Fix random bugs
Learning
- N2 Preparation
Reading
- Kubernetes in Action Chap 17

2021/6/14 - 2021/6/20

Davos
- AutoML containerization
- Improve BigQuery connector for large datasets
- Fix random bugs
Learning
- N2 Preparation

2021/6/7 - 2021/6/13

Davos
- Scaling factor framework
- AutoML containerization
- Fix random bugs
Learning
- N2 Preparation

2021/5/31 - 2021/6/6

Davos
- Add stream progress in logging
- Support null type in tensor
- Script for automatic update
- Containerization framework
Learning
- 新完全掌握语法9-12
- Exercises
Reading
- Kubernetes in Action Chap 16
Misc
- Recommendation letters

2021/5/24 - 2021/5/30

Davos
- Snowflake experimental reader
- Containerization framework
- Expose more AutoML info
- Vertica pushdown for predicate and sampling
- Davos CPU and memory limits
- Fix random bugs
Learning
- 新完全掌握语法5-8
Reading
- Kubernetes in Action Chap 13-15

2021/5/17 - 2021/5/23

Davos
- Switch to Nginx ingress
- Limit UDO jobs
- Fix random bugs
Reading
- Kubernetes in Action Chap 12

2021/5/10 - 2021/5/16

Davos
- UDO finish
- Fix random bugs
Learning
- Japanese lecture 329-331
Misc
- EB1B recommenders

2021/5/3 - 2021/5/9

Davos
- UDO profiling
- UDO container logs
- UDO network optimization
- Limit AutoML jobs
- Force HTTPS
- Fix random bugs
Learning
- Japanese lecture 319-328
- 新完全掌握语法4
Reading
- Kubernetes in Action Chap 11
Misc
- EB1B preparation

2021/4/26 - 2021/5/2

Davos
- Dynamic batching for UDO
- Referenced memory usage for batch
- Vertica ODBC
- Investigate Python GIL impact
- Better error message for failed UDO docker container
- Fix random bugs
Learning
- Japanese lecture 318
- 新完全掌握语法1-3
Reading
- Kubernetes in Action Chap 10
Misc
- Gym 3
- EB1B preparation

2021/4/19 - 2021/4/25

Davos
- Refactor sample store
- Multi-node migration
- Fix random bugs
Reading
- Kubernetes in Action Chap 9
Misc
- Gym 3

2021/4/12 - 2021/4/18

Davos
- Better profiling in AutoML
- Fix random bugs
Learning
- Japanese lecture 314-316
- Japanese vocabulary 61-80
- 新完全掌握语法A-D

2021/4/5 - 2021/4/11

Davos
- Specify labels for f1/precision/recall
- Improve UDOs
- Join result size control
- Vertica connector
- Fix random bugs
Learning
- Japanese lecture 305-313
- Japanese vocabulary 56-60

2021/3/29 - 2021/4/4

Davos
- Consider expose keys in caching
- Airbyte for K8s
- Fix random bugs
Learning
- Japanese lecture 298-302
- Japanese vocabulary 46-55
- 新完全掌握语法25-26
Reading
- Kubernetes in Action Chap 8

2021/3/22 - 2021/3/28

Davos
- Integrate Valgrind
- Refactor unit tests
- Throw an error if AutoML times out
- Rename UDFEquation to Transformation
Learning
- Japanese vocabulary 31-45
- 新完全掌握语法19-24
Reading
- Kubernetes in Action Chap 6-7

2021/3/15 - 2021/3/21

Davos
- Binning in KDA
- Remove ; in macros
- Expose metrics in C++ server
- Grafana dashboards
Learning
- Japanese lecture 297
- Japanese vocabulary 16-30
- 新完全掌握语法18

2021/3/8 - 2021/3/14

Davos
- Make Binning callable
- Nginx for Einblick
- Dynamic sampling pushdown
- Decouple UDO and UDOOperator
- Cluster-wide monitoring
Learning
- Japanese lecture 295-296
- Japanese vocabulary 6-15
- 新完全掌握语法15-17
Reading
- Kubernetes in Action Chap 5

2021/3/1 - 2021/3/7

Davos
- Handle date types in Python for Arrow
- Cache hint request
- SQL predicate pushdown translation
- Sampling pushdown
- Refactor Python data source
- Fix random bugs
Learning
- Japanese lecture 283-294
- Japanese vocabulary 1-5
- 新完全掌握语法12-14
Misc
- Gym 5

2021/2/22 - 2021/2/28

Davos
- Improve Snowflake
- Integrate horloge
- Fix random bugs
Paper
- Writing
- Experiments
- Submit
Learning
- Japanese lecture 277-282
- Japanese vocabulary 168-177
- 新完全掌握语法1-11

2021/2/15 - 2021/2/21

Davos
- Expose external UDO’s outputs
- Verify gottardo for embedded
- Refactor what if
- Better error message
- Investigate cloud databases
- Set up Snowflake
- Upgrade to Arrow 3.0
- Support datetime
- Fix random bugs
Learning
- Japanese lecture 271-276
- Japanese vocabulary 148-167
Reading
- 股票大作手回忆录 Chap 20-24

2021/2/8 - 2021/2/14

Davos
- Integrate gottardo
- Integrate Keycloak
- What if optimization
- Fix random bugs
Learning
- Japanese lecture 268-270
- Japanese vocabulary 138-147
Reading
- 股票大作手回忆录 Chap 19

2021/2/1 - 2021/2/7

Davos
- Paper
- Refactor AutoML pipelines
- Cache long-chain job
- Support join prefix option
- Reduce AutoML latency
- Support Google BigQuery
- Fix random bugs
Learning
- Japanese lecture 260-267
- Japanese vocabulary 131-137
Reading
- 股票大作手回忆录 Chap 11-15
Misc
- ICDE reviews

2021/1/25 - 2021/1/31

Davos
- Paper
- Collect feedback
- Refine coding tasks
- Fix random bugs
Learning
- Japanese lecture 258-259
Reading
- Kubernetes in Action Chap 4

2021/1/18 - 2021/1/24

Davos
- Load balancing
- Support single pass mode
- Paper
Learning
- Japanese lecture 251-257
- Japanese vocabulary 116-130
Reading
- Kubernetes in Action Chap 3
- 股票大作手回忆录 Chap 9-10

2021/1/11 - 2021/1/17

Davos
- Paper full pass
- Set up a server for NLP
Learning
- Japanese lecture 247-250
- Japanese vocabulary 106-115
Reading
- Kubernetes in Action Chap 2
- 股票大作手回忆录 Chap 7-8
Misc
- O-1 documents

2021/1/4 - 2021/1/10

Davos
- Progressiveness benchmark
- Investigate Prometheus operator in Kubernetes
- WhatIf optimization
- Update paper
- Fix random bugs
Learning
- Japanese lecture 245-246
- Japanese vocabulary 96-105
Reading
- 股票大作手回忆录 Chap 3-6
Misc
- O-1 documents

Alpine Meadow 2.0: New Horizons in AutoML

Thu, 30 Jan 2020 00:00:00 +0000

TL;DR: This post is about something I would really like to do but unfortunately haven’t had time for yet :).

Following up on my last post, I would like to share my ideas about future directions in AutoML research, which are mostly based on my personal interests and some “pain points” I encountered while working on AutoML-related things.

Online AutoML

As far as I know, most AutoML studies focus on a relatively “closed” problem setting: given the dataset (with or without train/validation splits), the target column, and a selection metric (e.g., accuracy), find a pipeline that is as good as possible in terms of the given metric.

However, in the real world, the data keeps changing over time: either the volume increases as more data arrives, or the distribution shifts. An “online” AutoML system should therefore be able to take advantage of new data and adapt its decision making over time. Whether the decision-making process is a model or a set of heuristics, it should learn the trends of incoming data without forgetting the knowledge learned from previous inputs.

Interpretable and Interactive AutoML

AutoML has always been a black box, especially its decision-making process, which usually involves meta-learning, Bayesian optimization, multi-armed bandits, and even deep reinforcement learning. These techniques are mostly hard to interpret, which poses a huge challenge for understanding the behavior of an AutoML system.

However, besides the predictive performance of a pipeline, real-world users also want to understand how the pipelines were selected. What kinds of primitives or pipelines has the system tried? Based on what observations did the system decide to switch to another primitive/pipeline, or to put more resources into the current one?

We are currently using a cost model (with a set of heuristics), which provides good explainability: you can easily see that a pipeline was selected because of its relative advantages in performance and speed. We are working on making this process more visually interpretable.

Ensemble

AutoML is a process of trial and error, which usually involves training and evaluating many pipelines. Ensemble learning therefore fits naturally with AutoML, since these pipelines can be combined into a more powerful one. Previous work such as Auto-Sklearn employs ensembling at the end of the search and builds an ensemble of pipelines using greedy methods.

However, simply ensembling all the evaluated pipelines is not efficient, since ensembling itself has overhead. Further, ensembling exploits the diversity of models, whereas the AutoML search only favors pipelines with better performance. It could be the case that, at the end of the search, there are lots of good pipelines that are similar to each other, and then ensembling would not help much. To better promote ensembling, we want to consider the potential benefit to the ensemble when we select pipelines: if a pipeline might greatly improve diversity, we should try it even though it may not have the best individual performance.

In other words, ensembling changes the goal of AutoML from finding the best pipeline to finding the best group of pipelines (to form the best ensemble). We are going to update the cost model in our system to favor exploration, thus encouraging more diversity in the pipeline traces.
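To make the point about diversity concrete, here is a minimal sketch of greedy ensemble construction. It assumes each pipeline is represented only by its predicted positive-class probabilities on a validation set, that the ensemble averages those probabilities, and that the loss is the Brier score. The function names and the toy numbers are illustrative assumptions, not the Auto-Sklearn implementation or our cost model.

```python
def brier_score(probabilities, truth):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - t) ** 2 for p, t in zip(probabilities, truth)) / len(truth)


def average(prediction_lists):
    """Average the per-example probabilities of several pipelines."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]


def greedy_ensemble(predictions, truth):
    """Repeatedly add the pipeline that lowers the ensemble's validation loss the most."""
    chosen, best_loss = [], float("inf")
    remaining = dict(predictions)
    while remaining:
        loss, name = min(
            (brier_score(average([predictions[m] for m in chosen] + [preds]), truth), name)
            for name, preds in remaining.items()
        )
        if loss >= best_loss:  # no remaining pipeline improves the ensemble
            break
        chosen.append(name)
        best_loss = loss
        del remaining[name]
    return chosen, best_loss


# Toy validation data (made up for illustration).
truth = [1, 1, 0, 0]
predictions = {
    "A": [0.9, 0.6, 0.4, 0.1],  # best single pipeline
    "B": [0.9, 0.6, 0.4, 0.2],  # nearly identical to A
    "C": [0.5, 0.9, 0.2, 0.4],  # weakest alone, but makes different mistakes
}
chosen, loss = greedy_ensemble(predictions, truth)
print(chosen, round(loss, 5))
```

In this toy example the greedy procedure picks A and then C, skipping B: C is the weakest pipeline on its own, but it is the only one that adds diversity. A search that only kept the top performers would never have evaluated C in the first place.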

Pruning

Efficient evaluation of pipelines has been neglected by many previous works. Most systems simply do cross-validation at this step, which can be impractical or sub-optimal for big datasets. At the same time, classical learning methods (e.g., SVM) have relatively small capacity and do not require many data points. This implies that we can estimate the predictive performance of a model while training it only on a small subset.

In our SIGMOD 2019 paper, we proposed the Adaptive Pipeline Selection algorithm, which trains models over increasingly larger samples and evaluates them on both the samples they were trained on and the validation dataset. We further used the training error as a lower bound for the validation error. Therefore, if the training error exceeds the current best validation error, it is very unlikely that this model will outperform the current best model, and we can simply prune it to save computational resources. Another advantage of this algorithm is training efficiency: since we train over increasingly larger samples, learning algorithms that support incremental training only need to train over the difference between samples.
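The pruning rule can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: it assumes each pipeline is a callable `fit_and_score(n)` (a hypothetical interface) that trains on the first `n` rows and returns the training and validation errors, and that the best validation error is only updated after a pipeline survives all sample sizes. The error curves are simulated stand-ins.

```python
def adaptive_pipeline_selection(pipelines, sample_sizes):
    """Train each pipeline on growing samples; prune when train error > best validation error."""
    best_name, best_val_error = None, float("inf")
    pruned_at = {}  # pipeline name -> sample size at which it was pruned
    for name, fit_and_score in pipelines.items():
        val_error = None
        for n in sample_sizes:
            train_error, val_error = fit_and_score(n)
            # Training error is treated as a lower bound on validation error.
            if train_error > best_val_error:
                pruned_at[name] = n
                break
        else:
            # The pipeline survived every sample size; compare its final validation error.
            if val_error < best_val_error:
                best_name, best_val_error = name, val_error
    return best_name, best_val_error, pruned_at


# Simulated (train_error, validation_error) per sample size, made up for illustration.
curves = {
    "good": {100: (0.05, 0.30), 400: (0.08, 0.18), 1600: (0.10, 0.12)},
    "weak": {100: (0.25, 0.35), 400: (0.27, 0.33), 1600: (0.28, 0.30)},
    "ok": {100: (0.06, 0.28), 400: (0.09, 0.20), 1600: (0.11, 0.15)},
}
pipelines = {name: (lambda n, curve=curve: curve[n]) for name, curve in curves.items()}

best_name, best_val_error, pruned_at = adaptive_pipeline_selection(pipelines, [100, 400, 1600])
print(best_name, best_val_error, pruned_at)
```

Here the "weak" pipeline is dropped after the smallest sample, because its training error already exceeds the best validation error seen so far, while "ok" is trained to completion because the bound never rules it out.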

There are a number of potential improvements here. Instead of using the training error to estimate the validation error, we can predict the learning curve. We can also adopt a “soft” pruning method: instead of killing the pipeline immediately, we can starve it by allocating fewer resources to it, to avoid killing a good pipeline at an early stage.
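Both ideas can be sketched together. The example below assumes the validation error follows a simple power law, error(n) = a * n^(-b), fitted by least squares in log-log space, and it turns the extrapolated errors into resource shares with a softmax so that no pipeline is cut off completely. The power-law form, the temperature value, and the observations are all assumptions made for illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a sample of size n."""
    return a * n ** (-b)


def soft_budget(predicted_errors, temperature=0.05):
    """Softmax over negative predicted errors: worse pipelines get less, but never zero."""
    weights = {name: math.exp(-err / temperature) for name, err in predicted_errors.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


sizes = [100, 400, 1600]
observed = {
    "A": [0.40, 0.20, 0.10],  # improves quickly with more data
    "B": [0.30, 0.27, 0.25],  # improves slowly
}
predicted = {
    name: predict_error(*fit_power_law(sizes, errors), 6400)
    for name, errors in observed.items()
}
shares = soft_budget(predicted)
print(predicted, shares)
```

Pipeline B keeps a small share of the budget instead of being killed outright, which leaves room to recover if the extrapolation turns out to be wrong.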

Execution

Execution is another important topic that has rarely been discussed in previous work. At the end of the day, if we were able to execute pipelines 1,000 times faster than other systems, we probably would not need a fancy pipeline selection algorithm; we could just do grid search and it would be good enough. We therefore want to build a specialized system for executing AutoML workloads.

One thing we noticed, which is probably specific to AutoML, is that lots of pipelines share similar structures (e.g., the same primitives), which means caching the intermediate outputs of primitives is probably a good idea. Suppose you are working on an image classification problem and want to use a pre-trained neural network (maybe trained on ImageNet) to extract high-level features and then select a simple predictive model (e.g., SVM, logistic regression). If you cache the outputs of the neural network, you can save lots of time, since neural network inference is slow. In other words, with caching you can try many more models and hyper-parameter settings in the same amount of time.
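A minimal sketch of this kind of caching, assuming a pipeline is a list of named steps and that the cache key is the dataset identifier plus the sequence of step names executed so far. The class, the key scheme, and the toy "feature extractor" are illustrative assumptions, not our system's design; a real key would also need to include each primitive's hyper-parameters.

```python
class PrimitiveCache:
    """Stores intermediate outputs keyed by (dataset id, step name, step name, ...)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def run(self, key, compute):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


def run_pipeline(steps, data, dataset_id, cache):
    """Execute steps in order, reusing cached outputs for any shared prefix of steps."""
    key = (dataset_id,)
    output = data
    for name, function in steps:
        key = key + (name,)
        output = cache.run(key, lambda function=function, output=output: function(output))
    return output


calls = {"extract": 0}


def extract_features(rows):
    """Stand-in for an expensive pre-trained network."""
    calls["extract"] += 1
    return [[sum(row), max(row)] for row in rows]


def model_a(features):
    return [f[0] > 3 for f in features]


def model_b(features):
    return [f[1] > 2 for f in features]


cache = PrimitiveCache()
data = [[1, 2], [3, 4]]
pred_a = run_pipeline([("extract", extract_features), ("model_a", model_a)], data, "images-v1", cache)
pred_b = run_pipeline([("extract", extract_features), ("model_b", model_b)], data, "images-v1", cache)
print(calls["extract"], cache.hits, cache.misses)
```

The second pipeline reuses the cached features, so the expensive extractor runs only once even though two different models are evaluated on top of it.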

You can even employ more fine-grained caching: if you have already run the pre-trained neural network on the first 100 data points and now want the high-level features for the first 200 data points, you can execute the neural network over the difference only and use the cached output for the overlap.
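The prefix-based variant can be sketched like this, assuming the row order of the dataset is stable between requests and that features are computed independently per row. The class name and the squaring "extractor" are placeholders for illustration.

```python
class PrefixFeatureCache:
    """Caches features for the first k rows and only computes rows beyond the cached prefix."""

    def __init__(self, extract_one):
        self._extract_one = extract_one
        self._features = []
        self.computed_rows = 0

    def features_for_first(self, rows, k):
        # Rows [0, len(self._features)) are already cached; compute only the difference.
        for row in rows[len(self._features):k]:
            self._features.append(self._extract_one(row))
            self.computed_rows += 1
        return self._features[:k]


rows = list(range(300))
prefix_cache = PrefixFeatureCache(lambda x: x * x)  # stand-in for network inference

first_100 = prefix_cache.features_for_first(rows, 100)
after_100 = prefix_cache.computed_rows
first_200 = prefix_cache.features_for_first(rows, 200)
after_200 = prefix_cache.computed_rows
first_150 = prefix_cache.features_for_first(rows, 150)
after_150 = prefix_cache.computed_rows
print(after_100, after_200, after_150)
```

Asking for the first 200 rows after the first 100 computes only the 100 new rows, and asking for the first 150 afterwards computes nothing at all.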

We are currently building an awesome system designed for general data analytics workloads (including AutoML, OLAP and many others), and I will probably write a post about it later.

Conclusion

To sum up, there are new needs for AutoML, and we should do an end-to-end redesign of the AutoML system, from the interface, through the decision process, down to the execution. There are millions of opportunities in this emerging area, and I hope there will be more publications on these interesting topics.

2020 Weekly Summary Report

Wed, 01 Jan 2020 00:00:00 +0000

2020/12/28 - 2021/1/3

Davos
- Scheduling experiments
Learning
- Japanese lecture 237-244
- Japanese vocabulary 81-90
Reading
- Kubernetes in Action Chap 1
- 股票大作手回忆录 Chap 1-2

2020/12/21 - 2020/12/27

Davos
- Scheduling experiments
- Fix web socket for ingress
- Optimize UDF/UDA
- Optimize AutoML
- Update paper
- Fix random bugs
Learning
- Japanese lecture 233-236
- Japanese vocabulary 71-80

2020/12/14 - 2020/12/20

Davos
- UDF/UDA executor
- Secure UDF/UDA executor
- Support UDF/UDA returning UDF/UDA
- Integrate migration with Kubernetes
- Support list operators call
Learning
- Japanese lecture 229-232
- Japanese vocabulary 51-70
Misc
- O-1 documents
- Linked search

2020/12/7 - 2020/12/13

Davos
- Blog post
- Fix SQL reader for complex query
- Investigate Numba
- Refactor dataset API
- Improve first response for histogram
- Stop a job with no subscribers
- Paper
Learning
- Japanese lecture 224-228
- Japanese vocabulary 41-45
Reading
- Hulu ML book Chap 9-14 (finished)

2020/11/30 - 2020/12/6

Davos
- HTTPS support
- Caching experiments
- Frequency-based caching strategy
Learning
- Japanese lecture 220-223
- Japanese vocabulary 36-40
Reading
- Hulu ML book Chap 6-8
Misc
- ICDE reviews
- O-1 documents

2020/11/23 - 2020/11/29

Davos
- Better output for UDF/UDA
- WhatIf
- Implement the server in C++
- Upgrade gRPC and Protobuf
Learning
- Japanese lecture 216-219
- Japanese vocabulary 6-25
Reading
- Hulu ML book Chap 2-5
Misc
- O-1 documents

2020/11/16 - 2020/11/22

Davos
- Handle duplicate columns in join
- Single experiments
- Reduce CPU usage while idle
- Make C++ unit tests faster
- Fix memory leak with AutoML
Learning
- Japanese lecture 210-215
- Japanese vocabulary 1-5
Reading
- STL 源码剖析 Chap 4-8
- Hulu ML book Chap 1
- Random papers (2)
Misc
- O-1 documents

2020/11/09 - 2020/11/15

Davos
- Basic WhatIf operator
- Integrate timestamp parsers in CSV
- Replicated LDAP configurations
- Upgrade Arrow to 2.0
- Single experiments
- Fix random bugs
Learning
- Japanese lecture 207-209
- Japanese vocabulary 136-147
Reading
- OSTEP Chap 45-51 (finished)
- STL源码剖析 Chap 1-3
- Random papers (2)

2020/11/02 - 2020/11/08

Davos
- Switch to Arrow streaming CSV reader
- Make CSV loading more robust
- Speedup UDF/UDA startup
- Add an option for logging level
- Improve shuffling
- Scheduling experiments
- Fix random bugs
Learning
- Japanese lecture 205-206
- Japanese vocabulary 127-135
Reading
- OSTEP Chap 42-44
- Random papers (1)
Misc
- Mail STEM OPT

2020/10/26 - 2020/11/01

Davos
- Skip unnecessary responses
- MonetDB comparison
- Stop jobs in closed workspaces
- Update configs for Replicated
- Integrate secure UDF/UDA for local, docker and K8s
- Better algorithm for reservoir sampling
- Fix job step index in error message
- Add ephemeral sampling
- Fix AutoML bug
Learning
- Japanese lecture 199-204
- Japanese vocabulary 112-126
Reading
- OSTEP Chap 38-41
- Random papers (2)
Misc
- Gym
- Prepare STEM OPT

2020/10/19 - 2020/10/25

Davos
- Fix the oil dataset
- Fix random bugs
- Technical report
- Run experiments
Learning
- Japanese lecture 195-198
Reading
- OSTEP Chap 37
Misc
- Gym

2020/10/12 - 2020/10/18

Davos
- Migrate staging to Replicated
- Expose the output in close function for external UDF/UDA
- Optimize aggregation
- Optimize order by
- Make AutoML multi-processing
- Fix Loki logs
- Fix random bugs
Learning
- Japanese lecture 194
- Japanese vocabulary 103-111
Reading
- OSTEP Chap 30-36
- Random papers (2)
Misc
- Submit STEM OPT MIT application

2020/10/05 - 2020/10/11

Davos
- Add dockerignore
- Remove exceptions in C++
- Add incremental external UDF/UDA
- Try out embedded cluster for Replicated
- Improve nominal binning
- Fix random bugs
Learning
- Japanese lecture 188-193
- Japanese vocabulary 94-102
Reading
- OSTEP Chap 26-29

2020/09/28 - 2020/10/04

Davos
- Technical report
- Upgrade Arrow to 1.0
- Code coverage for Python
- External UDF/UDA installing requirements and monitoring
- Natural binning
- Fix replay benchmark
- Experiments
Learning
- Japanese lecture 184-187
Reading
- OSTEP Chap 25
Misc
- OPT STEM preparation
- O-1 letters

2020/09/21 - 2020/09/27

Davos
- Investigate Replicated
- Sandboxing UDF/UDA design & implementation
- Build Apache Arrow from source
- Fix random bugs
Learning
- Japanese lecture 177-183
- Japanese vocabulary 82-93
Reading
- OSTEP Chap 19-24
- Codebase of Aria
- Codebase of AutoBazaar
- Random papers
Misc
- O-1 letters

2020/09/14 - 2020/09/20

Davos
- Compiled expression optimization
- Expose key refactor
- External UDF/UDA shortcut
- Filter pushdown optimization
- Investigate Replicated
Learning
- Japanese lecture 175-176
- Japanese vocabulary 64-81
Reading
- OSTEP Chap 13-18
- Random papers
Misc
- O-1 letters

2020/09/07 - 2020/09/13

Davos
- Technical report
- Figure out Excel solutions
- Support hybrid UDF equation
- Better error handling in AutoML
- Adaptive caching strategy
- Fix the replay benchmark
- Fix random bugs
Learning
- Japanese lecture 167-174
- Japanese vocabulary 49-63
Reading
- OSTEP Chap 7-8
- Random papers

2020/08/31 - 2020/09/06

Davos
- Put expose into loop
- Run benchmarks (scheduling, caching, disk/memory management, pushdown)
- New workload for replay
- Fix PM2 error
Learning
- Japanese lecture 160-166
- Japanese vocabulary 40-48
Reading
- OSTEP Chap 3-6
- Random papers

2020/08/24 - 2020/08/30

Davos
- Refactor operator & job
- Safe memory usage
- Release resources in close
- Try Replicated
- Better Grafana dashboard
- Progress semantic
- Add a configuration file for benchmark
- Technical report skeleton
- Fix random bugs
Learning
- Japanese lecture 156-159
- Japanese vocabulary 34-39
Reading
- OSTEP Chap 1-2
- uvloop codebase
Misc
- O-1 letters

2020/08/17 - 2020/08/23

Davos
- Refactor binning
- Better profiling & monitoring
- Infer job type
- New aggregation methods
- Refactor TableResult
- Improve Python metrics
- More C++ profiling
- More Python profiling
- Fix random bugs
Alpine Meadow
- Overview of pipelines
Learning
- Japanese lecture 152-155
- Japanese vocabulary 31-33
Reading
- Linux多线程服务端编程 Chap 11-12
- 15721 Lecture 1-25
- Google C++ Style Guide
- Google Python Style Guide
Misc
- O-1 letters

2020/08/10 - 2020/08/16

Davos
- Refactor locks
- Data model for the cloud web portal
- Support dumping as CSV
- Investigate code generation
- Refresh data source
- Fix random bugs
Alpine Meadow
- Overview of pipelines
Learning
- Japanese lecture 150-151
- Japanese vocabulary 25-30
Reading
- Linux多线程服务端编程 Chap 9-10
- The Algorithmic Foundations of Differential Privacy Chap 1-4

2020/08/03 - 2020/08/09

Davos
- Make protobuf TableResult smaller
- Make release notes
Alpine Meadow
- Refactor API
Learning
- Japanese vocabulary 22-24
- Japanese calligraphy
Reading
- Linux多线程服务端编程 Chap 7-8
Misc
- Move

2020/07/27 - 2020/08/02

Davos
- Optimize aggregation
- Cloud web portal design
- Benchmark for scheduling
- Reduce docker image size
- Make benchmark plan
- Add timestamp arithmetic
- Refactor subscribe response
- Fix random bugs
Learning
- Japanese lecture 147-149
- Japanese vocabulary 13-21
- Japanese calligraphy
Reading
- Linux多线程服务端编程 Chap 4-6
Misc
- Prepare for moving

2020/07/20 - 2020/07/26

Davos
- Refactor the synthesized benchmark
- Support AutoML in the workload generator
- Add simple user interaction simulator
- Fix random bugs for DARPA demo
Alpine Meadow
- Show tasks
Learning
- Japanese lecture 145-146
- Japanese calligraphy
Reading
- Linux多线程服务端编程 Chap 1-3
- ICDE reviews (3)
- MonetDB papers (3)
Misc
- Prepare for moving

2020/07/13 - 2020/07/19

Davos
- Improve AWS EKS script
- Investigate Azure
- Categorize unit tests
- Tutorial for AWS EKS
- Fix random bugs
Learning
- Japanese lecture 144
- Japanese vocabulary 1-12
- Japanese calligraphy
Reading
- Random papers (5)
- ICDE reviews (2)
Misc
- O-1 letters

2020/07/06 - 2020/07/12

Davos
- Investigate AWS
- Investigate Databricks
- Improve caching manager
- Automate AWS EKS deployment
Learning
- Japanese lecture 140-143
- Japanese vocabulary 1-15
- Japanese calligraphy
Reading
- Random papers (4)
Misc
- O-1 letters

2020/06/29 - 2020/07/05

Davos
- Improve disk cache
- New join operator
- Integrate Postgres
- Fix random bugs
Alpine Meadow
- API server
Learning
- Japanese lecture 134-139
- Japanese vocabulary 82-98 (finished)
- Japanese calligraphy
Reading
- Random papers (1)
Misc
- O-1 letters

2020/06/22 - 2020/06/28

Davos
- Disk management design document
- Improve sampling
- Refactor context
Learning
- Japanese lecture 131-133
- Japanese vocabulary 79-81
- Japanese calligraphy
Misc
- O-1 letters

2020/06/15 - 2020/06/21

Davos
- Recalculate bins
- Refactor removing samples
- Integrate HDFS
- Investigate big data systems
- Arrow disk storage serialization benchmark
- Fix random bugs
Learning
- Japanese lecture 128-130
- Japanese vocabulary 67-78
- Japanese calligraphy
Reading
- RL book Chap 18 (finished)
- Streaming 101 & 102
- Random papers (1)
Misc
- O-1 letters

2020/06/08 - 2020/06/14

Davos
- Upgrade Arrow to 0.17
- Refactor from include guard to pragma
- Implement calculated data source in C++
- Disable third-party tests
- Refactor C++/Python bindings
- Remove unused code
- Refactor AutoML
Alpine Meadow
- Integrate the thresholding primitive
Learning
- Japanese lecture 125-127
- Japanese vocabulary 58-66
- Japanese calligraphy
Reading
- RL book Chap 14-17
Misc
- Walk

2020/06/01 - 2020/06/07

Davos
- Test memory management
- Scheduling strategy
- Python process priority
Alpine Meadow
- Read NNI web UI code
- Set up React codebase
Learning
- Japanese lecture 119-124
- Japanese vocabulary 52-57
- Japanese calligraphy
Reading
- RL book Chap 12-13
Misc
- Walk

2020/05/25 - 2020/05/31

Davos
- Job memory management
- Sample memory management fix
- Migrate GCP
- Scheduling design document
- Support Python configuration for context
- Fix random bugs
Learning
- Japanese lecture 113-118
- Japanese vocabulary 46-51
- Japanese calligraphy
Reading
- Paper reviews
Misc
- Prepare O-1 letters

2020/05/18 - 2020/05/24

Davos
- Sample memory management
- Stream memory management
Learning
- Japanese lecture 105-112
- Japanese vocabulary 37-45
- Japanese calligraphy
Reading
- RL book Chap 8-11

2020/05/11 - 2020/05/17

Davos
- New aggregation
- Integrate Loki
- Prepare coding interview questions
- Design document for memory management
- Fix random bugs
Alpine Meadow
- React tutorial
- Investigate NNI
Learning
- Japanese lecture 97-104
- Japanese vocabulary 28-36
- Japanese calligraphy
Reading
- Random papers (1)
- RL book Chap 7
Misc
- Prepare O-1 letters

2020/05/04 - 2020/05/10

Davos
- C++/Python binding refactor
- Support reloading data source
- Refactor aggregation
- Refactor header files
- Customized aggregation in Python
- Fix random bugs
Alpine Meadow
- Set up OpenML benchmark
- Export pipeline as script
- Upgrade sklearn to 0.22
Learning
- Japanese lecture 89-96
- Japanese vocabulary 19-27
Reading
- RL book Chap 5-6
- VLDB paper review
- IDEA paper
- Linguist book

2020/04/27 - 2020/05/03

Davos
- Quantile binning extending
- Folder union data source
- Refactor planning framework
- Integrate Prometheus and Grafana
- External UDF/UDA
- Replay benchmark V2
- Documentation for UDF/UDA
- Better asserts and exceptions
- Design document for customized aggregation
- Documentation for server
Alpine Meadow
- Run benchmarks
Learning
- Japanese lecture 83-88
- Japanese vocabulary 4-18
Reading
- RL book Chap 4
- Random papers (7)

2020/04/20 - 2020/04/26

Davos
- Convert unit test workloads
- Better micro benchmark
- Conform to Google C++ style
- Set up email for daily benchmark
- Replay benchmark
- Log failed job
- Support nullness function
- Quantile binning
- Documentation for scheduling
- Documentation for caching
- Documentation for memory management
- Fix random bugs
Alpine Meadow
- Run benchmarks
- Upgrade SMAC to 0.12
- Integrate NNI’s curve fitting
Learning
- Japanese lecture 81-82
- Japanese vocabulary 1-3
- Japanese review 19-24
- N5 tests

2020/04/13 - 2020/04/19

Davos
- Support shrinking sample
- Support sample TTL
- Fix sampling for large data sources
- Fix sample semantics for aggregation
- Support numerical to string
- Add warning mechanism for operators
- Daily regression benchmark
- Fix GitLab JIRA integration
- Fix random bugs
Alpine Meadow
- Run benchmarks
- Check target type in AutoML
Learning
- Japanese vocabulary 91-98
- Japanese review 9-18

2020/04/06 - 2020/04/12

Davos
- Improve data source loading
- Documentation for planning, benchmarks, C++, Python
- Memory management for large Parquet files
- Add warning for data sources
- Prepare image for BMW
- Fix random bugs
Alpine Meadow
- Run benchmarks
Niseko
- Investigate papers
- Section 2
Learning
- Japanese lecture 79-80
- Japanese vocabulary 82-90
- Japanese review 1-8
Misc
- Prepare O-1 documents

2020/03/30 - 2020/04/05

Davos
- Documentation for stream, sample, operator, job, tests
- Upgrade Arrow to 0.16
- Refactor header files
- Export job as pure scikit-learn
- Support dump streams to disk
- Make Brewfile
- Fix random bugs
Alpine Meadow
- Run benchmarks
- Support multiple metrics
Niseko
- Draft outline
Learning
- Japanese lecture 71-78
- Japanese vocabulary 78-81
- Workouts

2020/03/23 - 2020/03/29

Davos
- Group by numerical values
- Time extraction functions
- Test Oracle DB
- Fix random bugs
- Fix job lifetime management
Alpine Meadow
- Run benchmarks
- Fix hanging bug
- More meta-learning
- Make meta-learning faster
Learning
- Japanese lecture 66-70
- Japanese vocabulary 69-77
Reading
- Park paper

2020/03/16 - 2020/03/22

Davos
- Refactor loading in Python
- Refine sampling strategies
- Test datasets with errors
- Scripts for release
- Configure k8s environment
- Fix binning bug
- Fix timestamp conversion bug
- Fix rtail output
- Exceptions with context
- Fix random bugs
Alpine Meadow
- Run benchmarks
Learning
- Japanese lecture 62-65
- Japanese vocabulary 66-68
- Japanese practice 1-4
Reading
- RL book Chap 1-2
- Random papers (1)

2020/03/09 - 2020/03/15

Davos
- Fix null type info
- Expose logs through HTTP
- Make Oracle faster
- Fix random bugs
Alpine Meadow
- Run benchmarks
- Support class weights
Learning
- Japanese lecture 57-61
- Japanese vocabulary 56-65
- English speaking
- Workouts
Reading
- Building Micro Services Chap 9-12 (finished)
- TDS review
- Random papers (2)
Misc
- File federal tax return
- File state tax return

2020/03/02 - 2020/03/08

Davos
- Work with BMW (found lots of bugs!)
- Support more types in SQL
- Handle inf in binning and aggregation
Alpine Meadow
- Run benchmarks
Learning
- Japanese lecture 56
- Japanese vocabulary 48-55
- English speaking
- Workouts
Reading
- Building Micro Services Chap 6-8

2020/02/24 - 2020/03/01

Davos
- Set up Oracle on GCP
- Support bulk operation of SOCI in SQL data source
- Stop AutoML operators
- Convert workloads to JSON
- Scaling test of BMW
- Add type cast operator
- Add docker compose
Alpine Meadow
- Run benchmarks
- Refactor metadata management
Learning
- Japanese lecture 52-55
- Japanese vocabulary 35-47
- English speaking
Reading
- VLDBJ paper review
- Building Micro Services Chap 5

2020/02/17 - 2020/02/23

Davos
- Documentation skeleton
- Support specifying time granularity in datetime binning
- Log operator running time
- Support predicting probabilities in pipeline
- Support reading from OracleDB
- Refactor third-party modules
- Fix random bugs
Alpine Meadow
- Re-enable meta-learning
- Better epochs in APS
- Support computing naive score
- Simple explainability
- Set up benchmarks
- Run benchmarks
- Re-enable ensembling
- Remove null values in target columns
Learning
- Japanese lecture 46-51
- English speaking
Reading
- Building Micro Services Chap 4
- Random papers (1)
- NAS reviews (2)

2020/02/10 - 2020/02/16

Davos
- Exception handling for Python operators
- SHAP primitive
- Set up new GitLab runners
- Fix random bugs
Alpine Meadow
- Set up new linters
- More profiling and logging
- Constrain the search space for explainability
- Better dump
Learning
- Japanese lecture 42-45
- English speaking
- Workout

2020/02/03 - 2020/02/09

Davos
- Test workload on Parquet
- Improve binning
- Keep track of memory usage for stream
- Make CI faster
- Fix random bugs
D3M
- Support mini-metadata
- Support more problems
- Submit TA2
Learning
- Japanese lecture 34-41
- English speaking
- Workout
Blog
- AM 2.0 Post

2020/01/27 - 2020/02/02

Davos
- Monitoring in C++ and Python
- Set up monitoring server
- Expose UDF executor
- Support UDF executor with parameters
- Support reading Parquet file
- Fix random bugs
Niseko
- Brainstorm ideas
- Design data model
Master thesis
- Print, sign and submit
Learning
- English speaking
- Workout
- Japanese lecture 29-33
- CS285 lecture 10-21
- CS285 Homework 3, 4, 5
Reading
- Building Micro Services Chap 2-3

2020/01/20 - 2020/01/26

Davos
- Expose more information in AutoML
- Support model persistence
- Support literal data source
- Support exporting pipeline as a Python script
- Fix random bugs
Master thesis
- Revise
- Sent for review
Learning
- English speaking
- Workout
- Japanese lecture 18-28
- CS285 lecture 8-9
- CS285 Homework 2
Reading
- Agile Development Chap 1-10 (finished)
Work out

2020/01/13 - 2020/01/19

Davos
- Support Python UDF
- Support Python UDA
- UDF Equation
- Fix random bugs
Learning
- English speaking
- Workout
- CS285 lecture 1-7
- CS285 Homework 1
- Japanese lecture 1-17
Reading
- Building Micro Services Chap 1
Work out

2020/01/06 - 2020/01/12

Davos
- Fix random bugs
Research
- Release Niseko 0.1
Reading
- Designing Distributed Systems Chap 1-13 (finished)
- Reinforcement Learning Chap 1-2 (finished)

Some Interesting ML-related Papers in VLDB 2018

Sun, 01 Sep 2019 00:00:00 +0000

TL;DR: VLDB 2019 just happened and now I am going to write a post about papers in VLDB 2018 :).

In this post, I am going to write up my reading report on ML-related research papers in VLDB 2018. I will go through the papers that match my own research interests and describe their basic ideas, methods and my personal evaluations. Please let me know if you find anything inappropriate.

MLBench: Benchmarking Machine Learning Services Against Human Experts
- Summary
  1. This paper presents MLBench, a benchmark providing a best-effort baseline of both feature engineering and machine learning models for each dataset, proposes a performance metric measuring the gap between an ML system and top-ranked Kaggle performers, and extracts some interesting insights.
  2. The performance metric, namely “quality tolerance”, is \(\pi\) if the user is satisfied by only being ranked among the top \(\pi\%\) in a Kaggle competition.
  3. They manually collected the winning code from 41 Kaggle competitions as the best-effort baseline and compared Azure ML and Amazon ML services on these datasets (with or without hyper-parameter tuning). They found that model diversity helps, model selection is necessary and hyper-parameter tuning also makes a difference.
  4. Further, nonlinear models outperform linear models on big datasets (but they are also more likely to overfit on small datasets). Linear models perform similarly to each other in general, so they follow a hit-or-miss pattern. Another interesting insight is that nonlinear models perform similarly within each model family (e.g., SVM or decision tree), and longer training time doesn’t help within either the nonlinear or the linear model space.
  5. Not surprisingly, feature engineering helps a lot.
- Comments
  1. This paper is well written and easy to understand, and also it discusses the limitations and potential alternative methods to justify its designs.
  2. This is definitely a good paper with lots of insightful findings. Based on my experience of building AutoML systems, some models are just better most of the time, but this is not easy to explain in theory (especially given the No Free Lunch Theorem).
  3. I think it would be more helpful to make a large-scale comparison across hundreds of datasets and different libraries.
On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
- Summary
  1. This paper presents an exact, cost-based framework for optimizing operator fusion plans over DAGs of linear algebra operations, which guarantees finding the optimal plan regarding the considered decisions.
  2. For candidate exploration, they enumerate partial fusion plans per operator. They propose a memo table to store all the candidate plans and design a DFS algorithm to populate the memo table bottom-up.
  3. For candidate selection, they first split the plans into independent partitions, and linearize the search space per partition and enumerate plans while skipping plans that can be safely pruned. They also cache plans for repeated optimization problems.
  4. They compare against Julia and TensorFlow on some synthetic datasets and real datasets (e.g., Airline78, MNIST, Netflix, Amazon product review).
- Comments
  1. This paper is well written and clearly structured.
  2. I do think there are better baselines for comparison in the experiments (e.g., TensorFlow is not built for such scenarios).
Snorkel: Rapid Training Data Creation with Weak Supervision
- Summary
  1. This paper presents Snorkel, a system that enables users to write labeling functions that express heuristics, learns a generative model over the labeling functions, and then trains a discriminative model.
  2. Snorkel constructs the generative model as a factor graph (including three factors, labeling propensity, accuracy and pairwise correlations of labeling functions).
  3. They train a discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss, i.e., the expected loss.
  4. They present a theoretical analysis of when a simple majority vote will work just as well as modeling the accuracies of labeling functions, and introduce an optimizer for deciding when to model accuracies of labeling functions and which correlations to model among labeling functions.
  5. Snorkel provides 132% average improvements to predictive performance over prior heuristics and comes within an average 3.60% of the predictive performance over large hand-curated training sets.
- Comments
  1. This paper is well written and clearly structured.
  2. Although the overall architecture is normal, the analysis of tradeoffs is really interesting and insightful.
Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads
- Summary
  1. This paper presents ease.ml, a declarative machine learning service platform for multi-tenant resource sharing.
  2. It proposes a novel multi-tenant, cost-aware model selection algorithm. As the first step, it determines the best model for each user by estimating the “potential for accuracy improvement” of each model, and then picks the user with the highest potential. They develop a cost-aware variant of the standard GP-UCB algorithm for selecting the model of each user.
  3. As the second step, they use a greedy algorithm that selects the user with a confidence bound above the average, and in practice, they pick the user with the maximum gap between the largest upper confidence bound and the best accuracy so far.
- Comments
  1. This paper is well written and clearly structured. Its introduction motivates the work with some real-world failures.
  2. It is always helpful to provide some examples for better illustration and some alternative designs to justify your proposed designs.
  3. It is good to discuss the limitations to better scope the paper and explain why the limitations are currently not resolved.
  4. They mention Bayesian Optimization in the abstract but it is not discussed in the paper. I guess it is probably used in the hyperparameter tuning?

AutoML: Methods, Challenges and Opportunities

Fri, 05 Apr 2019 00:00:00 +0000

Recently, we got a paper about interactive AutoML accepted by SIGMOD 2019 (Shang et al., 2019). In that paper, we discussed how we tackled the problem of automated machine learning while providing interactive responses. We architected our system into several components and discussed implementations and techniques for each of them. Although the overall design seems pretty straightforward in hindsight, it took us almost a year and a half to get into the kingdom of AutoML and figure out a reasonable architecture for a practical system. But I am really glad that we eventually figured this out and had this paper accepted to SIGMOD.

Also, an awesome book about AutoML was recently released, written by several well-known researchers from the AutoML community (and their papers are really helpful for us!). Inspired by both the paper acceptance and the new book, I am going to write a blog post summarizing the methods we used or invented, the challenges we faced and opportunities for future research. However, since this is just a blog post, some details of our methods will not be covered, so if you would like to know more about our system, please refer to our paper at SIGMOD 2019.

1. Introduction

First, what is AutoML? Based on my understanding, it removes humans from the loop of the tedious process of cleaning data, preprocessing features, searching for a machine learning model and tuning its hyper-parameters. For example, a doctor has a large volume of data (e.g., age, height, blood test results) collected from patients and would like to know if it is possible to build a machine learning pipeline predicting whether a new patient has some disease or not. AutoML is a perfect solution here as it doesn’t require the user to have machine learning or computer science knowledge. In other words, it democratizes machine learning and makes it accessible to the general public. This probably explains why it is getting more and more popular.

Nevertheless, AutoML is not supposed to remove humans entirely from the building of a machine learning pipeline. Otherwise, even if AutoML finds a perfect model, everything is still a black box for the user and it is difficult to derive useful findings. This requires AutoML to leave room for human input in the decision process, especially domain knowledge, e.g., a doctor can give the system some hints on how to clean or process the data. If the machine learning pipeline also has good interpretability, that is even more helpful since the doctor has a chance to make some interesting discoveries.

Further, it doesn’t make sense for AutoML to be a standalone component. It should reside on a data exploration platform to better combine traditional data analytics and machine learning. We cannot expect our users to write code to run our AutoML system; we should have an interactive, easy-to-use GUI where the user can trigger the AutoML operation. Therefore, it requires a general-purpose data platform to make AutoML actually accessible. That’s why we have integrated our AutoML system Alpine Meadow with Northstar.

Last but not least, AutoML can help users find a reasonable ML pipeline, but there is still the so-called “last mile” to go, e.g., to achieve better performance, users should be able to hand the pipeline over to engineers or data scientists to further improve it. For example, the pipeline found by AutoML can be exported as a Python script, and the engineering team can convert this script to an Apache Spark job to scale it to larger volumes of data.

To summarize, in my opinion, a good AutoML system shall be Automated, Interpretable, Interactive, Exportable.

2. Methods

For an end-to-end AutoML system, the input is essentially the task description (i.e., dataset and problem, e.g., predict the digits for the MNIST dataset), and the output is the so-called “best” pipeline. The metrics to evaluate a pipeline can be based on the predictions of the pipeline (e.g., accuracy or MSE), on its speed (i.e., how long it takes to train or test the pipeline), or on a combination of both.

Now that we know what the input and output are, we come to the problem of designing the steps in the AutoML system. An intuitive design is that we build the search space (i.e., the space of applicable pipelines), select promising pipelines from the space, evaluate them in some way and return the best pipeline.

2.1 Building Search Space

As far as I know, currently all AutoML systems use some template-like methods more or less. Basically we pre-define the search space for each problem type (e.g., classification or regression) and column type (e.g., different scaling methods for numerical features or different encoding methods for categorical features) in some templates. At runtime, the system just reads out the search space based on the input task.

Our system Alpine Meadow improves the template-based method by abstracting the construction of the space into the execution of rules (we adopt this idea from database systems). This enables more flexibility than the template-based method, as rules can be programmed and added easily, so we are able to support multiple dataset/problem types (e.g., image classification, collaborative filtering). We further define the search space as a space of logical pipeline plans, where each logical pipeline plan is a pipeline DAG with domains of hyper-parameters (i.e., its hyper-parameters are ranges, not exact values).

2.2 Selecting Promising Pipelines

The search space is usually huge and heterogeneous, so simply drawing a pipeline from the search space at random is often not efficient enough (although some randomness is necessary to avoid being trapped in sub-optimal regions). Some systems (e.g., Auto-sklearn (Feurer et al., 2015)) model the selection of a pipeline as a hyper-parameter tuning problem; in other words, they convert the primitive structure (i.e., DAG) of a pipeline into hyper-parameters. They are then able to employ hyper-parameter tuning techniques for finding promising pipelines. For example, Bayesian Optimization has proved successful for hyper-parameter tuning and is now widely used. I am not going to talk about Bayesian Optimization here as there is already a very good review (Shahriari et al., 2016).

In our system, the selection of a pipeline consists of two parts: selecting the primitive structure (i.e., DAG) of a pipeline (which is defined as a logical pipeline plan), and fine-tuning the hyper-parameters. We model the selection of a logical pipeline plan as a Multi-Armed Bandit problem, and we adopt the idea of a cost model from the DB world to estimate a score for each logical pipeline plan. The cost model considers performance (e.g., accuracy) and speed (e.g., time to train or test) at the same time, to trade off between performance and interactiveness. The cost model also employs some meta-learning techniques to improve the estimation by using history data from similar datasets (e.g., the accuracy and execution time of a pipeline on a similar dataset). For the multi-armed bandit problem, we employ a combination of \(\epsilon\)-greedy and upper confidence bound to select promising logical pipeline plans. After selecting some promising logical pipeline plans, we use Bayesian Optimization to fine-tune their hyper-parameters.
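To make the bandit part more concrete, here is a minimal sketch of one way to combine \(\epsilon\)-greedy with an upper confidence bound. It is not the actual Alpine Meadow implementation: the `stats` structure, the field names, the exploration constant `c` and the plan names are all hypothetical, and `mean_score` simply stands in for whatever score the cost model produces.

```python
import math
import random


def select_plan(stats, epsilon=0.1, c=1.0, rng=random):
    """Pick one logical plan id from `stats`.

    `stats` maps a plan id to {"pulls": int, "mean_score": float}; both the
    structure and the field names are hypothetical.
    """
    plans = list(stats)
    # Try every plan once before trusting any estimate.
    untried = [p for p in plans if stats[p]["pulls"] == 0]
    if untried:
        return untried[0]
    # Exploration: with probability epsilon, pick a plan uniformly at random.
    if rng.random() < epsilon:
        return rng.choice(plans)
    # Exploitation: pick the plan with the highest upper confidence bound.
    total = sum(stats[p]["pulls"] for p in plans)

    def ucb(p):
        bonus = c * math.sqrt(2.0 * math.log(total) / stats[p]["pulls"])
        return stats[p]["mean_score"] + bonus

    return max(plans, key=ucb)


# Toy statistics for two made-up logical plans.
stats = {
    "scale+svm": {"pulls": 10, "mean_score": 0.90},
    "onehot+tree": {"pulls": 10, "mean_score": 0.50},
}
choice = select_plan(stats, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely UCB-driven, so a plan that has been tried only rarely can still win thanks to its large confidence bonus.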

2.3 Evaluation of Pipelines

When we evaluate a candidate pipeline, most of the time we would like to find a pipeline with good predictive power, e.g., with high accuracy on the test dataset. However, since we don’t have access to the test dataset at runtime, to get a sense of how the pipeline will perform on the test dataset (i.e., the generalization error), we can test it on a validation dataset. There are usually two ways. One is a holdout validation dataset, usually with an 80%-20% split (in other words, we use 80% of the input dataset as the train dataset and the remaining 20% as the validation dataset). The other is cross-validation, usually k-fold cross-validation where k is set to 3 or 5.
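As a small illustration of the two evaluation styles, the sketch below uses scikit-learn on a synthetic dataset. The dataset, the logistic regression model and the random seeds are arbitrary choices for the example, not something our system prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (made up for the example).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Holdout: 80% of the rows for training, the remaining 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
holdout_score = model.fit(X_train, y_train).score(X_val, y_val)

# 5-fold cross-validation: five train/validate rounds over the same data.
cv_scores = cross_val_score(model, X, y, cv=5)
```

The holdout score needs a single fit, while cross-validation fits the model five times and returns one accuracy per fold.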

However, both have disadvantages. The holdout way is fast but its estimate of the generalization power is not accurate, while cross-validation is more accurate but tends to be slow. Based on this observation and also inspired by HyperBand (Li et al., 2016), we devise an Adaptive Pipeline Selection (APS) method which trades off between speed and accuracy of evaluation. Essentially, APS is a resource-efficient way to evaluate pipelines. The basic idea is that we train and test the pipelines on a small sampled dataset, prune those pipelines that perform badly on this dataset, then increase the size of the dataset and continue this process. One of the major differences between APS and HyperBand is our pruning condition: we treat the train error as an optimistic (lower) bound on the validation error and compare it with the best-so-far validation error. If the train error is bigger, this pipeline is not likely to be better than the current best, so we can safely prune it.
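The pruning loop can be sketched roughly as follows. This is a simplified reading of the description above rather than the real APS code: the `evaluate` callback, the sample sizes and the toy error values are all hypothetical, and the real system has to deal with details (sampling, time budgets, scheduling) that are ignored here.

```python
def adaptive_pipeline_selection(pipelines, evaluate, sample_sizes):
    """Evaluate pipelines on growing samples and prune hopeless ones.

    `evaluate(pipeline, n_rows)` is a hypothetical callback returning
    (train_error, validation_error) measured on a sample of n_rows rows.
    """
    best_pipeline, best_val_error = None, float("inf")
    survivors = list(pipelines)
    for n_rows in sample_sizes:
        next_round = []
        for pipeline in survivors:
            train_error, val_error = evaluate(pipeline, n_rows)
            # Pruning rule: the train error is treated as an optimistic bound
            # on the validation error, so if even that is worse than the best
            # validation error so far, drop the pipeline.
            if train_error > best_val_error:
                continue
            if val_error < best_val_error:
                best_pipeline, best_val_error = pipeline, val_error
            next_round.append(pipeline)
        survivors = next_round
    return best_pipeline, best_val_error


# Toy errors (train, validation) that do not depend on the sample size.
TOY_ERRORS = {"good": (0.05, 0.10), "bad": (0.30, 0.40), "ok": (0.08, 0.15)}
calls = []


def toy_evaluate(pipeline, n_rows):
    calls.append((pipeline, n_rows))
    return TOY_ERRORS[pipeline]


best, best_error = adaptive_pipeline_selection(
    ["good", "bad", "ok"], toy_evaluate, sample_sizes=[100, 1000]
)
```

In this toy run the "bad" pipeline is pruned after the first, cheap round and is never evaluated on the larger sample, which is where the savings come from.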

Furthermore, we also looked a little into optimizing the evaluation from the system perspective. One observation is that any AutoML system will try out lots of pipelines at the same time, so it is likely that some pipelines share common primitives, which means some computations can be reused. This is the so-called inter-pipeline caching in our system. The other angle is that with APS, since we train and test a pipeline on increasingly larger datasets with overlaps, there are opportunities to reuse the computation from the last iteration, which is the so-called intra-pipeline caching in Alpine Meadow.
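To give a feeling for inter-pipeline caching, here is a toy sketch that memoizes the output of every pipeline prefix, so that two pipelines sharing their first primitives only compute them once. The function names, the registry and the cache key are all made up for illustration, and the sketch only handles linear pipelines over hashable data, not general DAGs.

```python
def run_pipeline(steps, data, registry, cache):
    """Run `steps` (a tuple of primitive names) on `data`, reusing any prefix
    that was already computed for the same input data."""
    result = data
    prefix = ()
    for name in steps:
        prefix += (name,)
        key = (data, prefix)  # hypothetical cache key: input data + prefix
        if key not in cache:
            cache[key] = registry[name](result)
        result = cache[key]
    return result


call_counts = {"scale": 0, "square": 0, "negate": 0}


def scale(values):
    call_counts["scale"] += 1
    top = max(values)
    return tuple(v / top for v in values)


def square(values):
    call_counts["square"] += 1
    return tuple(v * v for v in values)


def negate(values):
    call_counts["negate"] += 1
    return tuple(-v for v in values)


registry = {"scale": scale, "square": square, "negate": negate}
cache = {}
data = (1.0, 2.0, 4.0)

# Both pipelines start with "scale", so it is only executed once.
out_a = run_pipeline(("scale", "square"), data, registry, cache)
out_b = run_pipeline(("scale", "negate"), data, registry, cache)
```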

3. Challenges and Opportunities

Learning of Rules. For now, the rules or heuristics in most AutoML systems are mostly hand-crafted, which limits the scaling and coverage of rules. If we manage to learn these rules either over time or from external history (e.g., learning from models on Kaggle or OpenML), the performance is expected to improve greatly.

Learned Cost Model. Similar to the learning of rules, cost models can be fine-tuned by machine learning. First, we can better estimate execution time with learning-based methods; second, the selection or ranking of pipelines can be optimized by learning as well.

Large-Scale Meta-Learning. Meta-learning is getting more and more popular as it aims to find the common “knowledge” underneath different machine learning tasks. Given a new task, if we are able to reuse some prior knowledge from similar tasks we have seen before, the whole search for an optimal pipeline can be warm-started. Therefore, if we can figure out a common language for meta-learning (e.g., a common description of pipelines and runs, including execution time and predictive performance), and scale up the meta-learning data, we have a good chance of finding a good pipeline for a new task in little time.

Better Ensemble. AutoML systems will evaluate lots of pipelines by design, therefore, if we can ensemble them together into a model, the expected performance (e.g., generalization error) can be further improved. Better ensemble requires us to have it in mind in every aspect of our system, for example, we should encourage diversity when we select pipelines.

Efficient Execution. If our AutoML system is able to evaluate one or two orders of magnitude more pipelines than other systems, then we are very likely to win the game. Caching is one important technique as mentioned above; other opportunities include using GPUs and a more fine-grained pruning strategy in APS.

Interpretability. Considering AutoML is an end-to-end process, it would be awesome if we were able to show our users how our system decides to select a given pipeline. Rule-based methods hit a sweet spot here as they are explainable by nature. The cost model provides some intuition as well.

4. References

AutoML book

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. Advances in Neural Information Processing Systems, 2962–2970.
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv Preprint ArXiv:1603.06560.
Shang, Z., Zgraggen, E., Buratti, B., Kossmann, F., Eichmann, P., Chung, Y., Binnig, C., Upfal, E., & Kraska, T. (2019). Democratizing Data Science through Interactive Curation of ML Pipelines. Proceedings of the 2019 International Conference on Management of Data.

A Practical Overview of Feature Engineering in Machine Learning and How to Automate it

Tue, 05 Mar 2019 00:00:00 +0000

In this blog post, I am going to talk about feature engineering in machine learning, which is fundamental but difficult. I will first examine what techniques we have for applying feature engineering and then discuss some works on automating the process.

1. Introduction

Feature engineering is probably the most time-consuming process when we fine-tune a machine learning pipeline. It consists of pre-processing single features (e.g., scaling numerical features or encoding categorical features), exploring the data to find correlations (e.g., feature crossing), feature selection and dimensionality reduction.

Feature engineering is very important to the final performance of the predictive pipeline, and in some way decides its predictive power. In general, the better the features we compose, the better the results we get. On Kaggle (a website for all kinds of machine learning competitions), most winners spend most of their effort on feature engineering.

In this blog post, I will first talk about the basic methods used in feature engineering, and then review the recent works in automating this process.

2. Methods for Feature Engineering

Basically, there are four different groups of methods for feature engineering:

Pre-processing single feature, e.g., scaling, encoding and embedding
Combination of multiple features, e.g., feature crossing
Feature selection, e.g., decision tree-based feature selection
Dimensionality reduction, e.g., PCA

In this section, I will go over all four groups of methods in detail, and for each group I will discuss some widely-used techniques and their benefits, along with their implementations in scikit-learn.

2.1 Single Feature Processing

For single features, we first check if there are missing values (if so we need to impute them), then scale numerical features to make them have the same magnitude and encode categorical features for later computation. For some specific features, we can apply quantization-based methods to transform them into a quantile range to remove data redundancy.

Imputation of Missing Values

For some features, there may be missing values, so we have to apply some imputation. In sklearn, we can use sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None), where strategy specifies which value replaces the missing values, e.g., mean, median, most_frequent, or constant.
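A minimal sketch of both a mean and a constant imputation is shown below; the toy matrix and the fill value 0.0 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry per column (values are made up).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each missing value with the mean of its column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# Replace each missing value with a fixed constant instead.
const_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_const = const_imputer.fit_transform(X)
```

Here the missing entry in the first column becomes 2.0 (the mean of 1.0 and 3.0) under the mean strategy and 0.0 under the constant strategy.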

Scaling of Numerical Features

Since different numerical features often differ by orders of magnitude in scale (e.g., the number of students and the average number of subjects registered per student at MIT), the predictive model can end up dominated by the features with large scale and variance. Therefore, it is important to bring each numerical feature to a comparable scale.

There are two commonly used scalers in sklearn:

sklearn.preprocessing.StandardScaler: standardizing features by removing the mean and scaling to unit variance, i.e., \(\frac{x - \mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the feature;
sklearn.preprocessing.MinMaxScaler: transforming features by scaling each feature to a given range, e.g., to transform features into the range (0, 1), we can scale the features as \(\frac{x - x_{min}}{x_{max} - x_{min}}\).

There are also some other scalers, for example, sklearn.preprocessing.RobustScaler is more robust to outliers as it uses statistics like the median and quantile ranges. Besides, there is also sklearn.preprocessing.Normalizer, which normalizes each sample (row) into a unit vector.
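As a quick illustration, the sketch below applies the two main scalers to a toy matrix whose columns differ by three orders of magnitude (the numbers are made up).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# Zero mean and unit variance per column.
X_std = StandardScaler().fit_transform(X)

# Each column rescaled linearly into the range [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
```

After either transformation the two columns become identical, which shows that the original difference in magnitude no longer matters.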

Encoding of Categorical Features

There are two types of encodings in sklearn, i.e., ordinal encoding and one-hot encoding. Ordinal encoding transforms features into integers in the range between 0 and the number of categories minus 1, implemented by sklearn.preprocessing.OrdinalEncoder (for converting features) and sklearn.preprocessing.LabelEncoder (for converting labels).

One-hot encoding converts features into a vector whose length is the number of categories (which is a one-to-one match to the categories), and sets 1 for the corresponding category and 0 for other categories. There are sklearn.preprocessing.OneHotEncoder (for converting features) and sklearn.preprocessing.LabelBinarizer (for converting labels) in sklearn.
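The following sketch shows both encoders on a toy color column (the category values are made up). By default both encoders sort the categories, so here blue, green and red get the codes 0, 1 and 2.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A single categorical feature with three distinct categories.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal encoding: one integer code per category.
ordinal = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one 0/1 column per category. The default output is a
# sparse matrix, so convert it to a dense array for inspection.
one_hot = OneHotEncoder().fit_transform(X).toarray()
```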

Binarization and Quantization

Sometimes the absolute value of a feature makes little difference once it is above some threshold (e.g., the GRE score for graduate school applications), and then it is useful to simply binarize the data. Sklearn provides an easy-to-use sklearn.preprocessing.Binarizer which binarizes data (sets feature values to 0 or 1) according to the given threshold.

Binarization is actually a special case of quantization. We can employ quantization to transform features to follow some specific distribution (e.g., a uniform or a normal distribution). There is a sklearn.preprocessing.QuantileTransformer in sklearn implementing quantization. There are other transformers as well, e.g., sklearn.preprocessing.PowerTransformer.
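A small sketch of both transformers follows. The scores, the threshold of 160 and the exponential toy data are assumptions made up for the example. Note that Binarizer maps values strictly greater than the threshold to 1, so a score equal to the threshold becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, QuantileTransformer

# Binarize made-up test scores at a threshold of 160.
scores = np.array([[150.0], [160.0], [165.0], [170.0]])
above = Binarizer(threshold=160.0).fit_transform(scores)

# Map a skewed feature onto an (approximately) uniform distribution.
rng = np.random.default_rng(0)
skewed = rng.exponential(size=(200, 1))
qt = QuantileTransformer(n_quantiles=50, output_distribution="uniform")
uniform = qt.fit_transform(skewed)
```

The quantile transformation keeps the ordering of the values and only changes their spacing, so the output lies in [0, 1].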

2.2 Combination of Multiple Features

Sometimes it is important to combine several related features together, for example, if we have the height and weight information, we can compute BMI as a new feature. However, this requires

Polynomial Features

An easy trick to extend linear regression to polynomial regression is to augment the data by adding polynomial features. For example, assuming the feature vector is \((X_1, X_2)\), we can augment it into \((X_1, {X_1}^2, X_2, {X_2}^2, X_1 X_2)\), so that the linear model can fit quadratic relationships. We can easily implement this using sklearn.preprocessing.PolynomialFeatures in sklearn by specifying the degree (e.g., 2 in the example).
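Here is a minimal sketch for a single sample with two features. Setting include_bias=False drops the constant column, and sklearn orders the output columns as \(X_1, X_2, {X_1}^2, X_1 X_2, {X_2}^2\), which differs from the order written above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with X1 = 2 and X2 = 3.
X = np.array([[2.0, 3.0]])

# Degree-2 expansion without the constant (bias) column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Columns: X1, X2, X1^2, X1*X2, X2^2
```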

Functional Transformations

It is also possible to support more complex transformations, e.g., \(X_3 = \frac{X_1}{X_2}\). Sklearn provides sklearn.preprocessing.FunctionTransformer for a user-defined function to transform features.
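For instance, the ratio feature above could be added roughly like this; `add_ratio` is a hypothetical helper written for the example, not something sklearn ships.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def add_ratio(X):
    """Append X3 = X1 / X2 as a new column (hypothetical helper)."""
    ratio = X[:, 0] / X[:, 1]
    return np.column_stack([X, ratio])


X = np.array([[6.0, 3.0],
              [10.0, 4.0]])

# Wrap the function so it can be used like any other sklearn transformer,
# e.g., as a step inside a Pipeline.
X_aug = FunctionTransformer(add_ratio).fit_transform(X)
```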

2.3 Feature Selection

Filtering

Filter-based methods score each feature with some metric that evaluates its importance to the problem at hand, and keep only the most important features. For example,

Variance, where sklearn.feature_selection.VarianceThreshold removes all low-variance features based on the given threshold;
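Pearson correlation, where we pass a score function based on the Pearson correlation (e.g., a wrapper applying scipy.stats.pearsonr to each feature, or sklearn.feature_selection.r_regression in recent sklearn versions) as the score_func for sklearn.feature_selection.SelectKBest;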
Chi-squared stats, where we pass the chi-squared stats (e.g., sklearn.feature_selection.chi2) as the score_func for sklearn.feature_selection.SelectKBest;
ANOVA F-value between label/feature, where we pass sklearn.feature_selection.f_classif (for classification) or sklearn.feature_selection.f_regression (for regression) as the score_func for sklearn.feature_selection.SelectKBest.

There are also some other score functions, e.g., mutual information (sklearn.feature_selection.mutual_info_classif or sklearn.feature_selection.mutual_info_regression).
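The sketch below chains two of these filters on synthetic data: a variance threshold to drop a constant column, followed by SelectKBest with the ANOVA F-value. The data generation is made up so that the first column is clearly the informative one.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

informative = y + 0.1 * rng.normal(size=n)  # closely follows the label
noise = rng.normal(size=n)                  # unrelated to the label
constant = np.ones(n)                       # zero variance
X = np.column_stack([informative, noise, constant])

# Step 1: drop features whose variance is not above the threshold.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: keep the single best feature according to the ANOVA F-value.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_var, y)
selected = selector.get_support(indices=True)
```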

Wrapper Methods

One of the most widely-used methods is recursive feature elimination. Given a base model that is able to assign an importance to each feature (e.g., a decision tree), we train it in a recursive way: in each round, we remove the features with the smallest weights, until we reach the desired number of features. This method is implemented by sklearn.feature_selection.RFE.
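A minimal sketch with a decision tree as the base model is shown below. The synthetic data (where only the first two columns influence the label) and the choice of keeping two features are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only the first two features determine the label.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Remove one feature per round (step=1) until two features remain.
rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    step=1,
)
rfe.fit(X, y)

kept = rfe.get_support(indices=True)  # indices of the surviving features
ranking = rfe.ranking_                # 1 = kept, larger = eliminated earlier
```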

Embedded Methods

Embedded methods perform feature selection as part of training the predictive model. For example, we can use L1 or L2 regularization for feature selection (especially L1, which drives some weights to exactly zero). This is implemented by sklearn.feature_selection.SelectFromModel.
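As a sketch of the L1 case, the example below fits a Lasso model inside SelectFromModel on synthetic regression data. The data, the coefficients and alpha=0.1 are arbitrary choices for the example; only columns 0 and 2 actually influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The target depends only on features 0 and 2, plus a little noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

# The L1 penalty shrinks the weights of irrelevant features to zero;
# SelectFromModel then keeps the features with non-negligible weights.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X)
```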

2.4 Dimensionality Reduction

After all the preprocessing steps above, we are essentially ready to train our first model for prediction. However, if the number of features is too big (the so-called Curse of Dimensionality), training will be super slow and the performance in general will not be good. Therefore, it is sometimes necessary to reduce the dimensionality. There are several methods we can use:

PCA (Principal Component Analysis), implemented in sklearn.decomposition.PCA, with a kernel variant (sklearn.decomposition.KernelPCA), a sparse variant (sklearn.decomposition.SparsePCA) and an online variant (sklearn.decomposition.IncrementalPCA);
Truncated SVD (PCA is equivalent to truncated SVD on centered data), implemented in sklearn.decomposition.TruncatedSVD;
Factor Analysis, implemented in sklearn.decomposition.FactorAnalysis;
Independent Component Analysis, implemented in sklearn.decomposition.FastICA;
Non-Negative Matrix Factorization, implemented in sklearn.decomposition.NMF;
Dictionary Learning, implemented in sklearn.decomposition.DictionaryLearning;
Feature agglomeration (agglomerative clustering applied to features), implemented in sklearn.cluster.FeatureAgglomeration.
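As a small sketch of the relationship between the first two methods in the list above, the snippet below reduces random data to 3 dimensions with PCA and with truncated SVD on manually centered data, then checks that the projections agree up to the sign of each component. The synthetic data, the number of components and the solver choices are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Synthetic data (assumption): 200 samples, 10 features, shifted away from zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) + 5.0

# PCA centers the data internally before the decomposition.
pca = PCA(n_components=3, svd_solver="full")
X_pca = pca.fit_transform(X)

# TruncatedSVD does not center, so we subtract the column means ourselves.
X_centered = X - X.mean(axis=0)
svd = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0)
X_svd = svd.fit_transform(X_centered)

# Components are only defined up to sign, so compare absolute values.
same_projection = np.allclose(np.abs(X_pca), np.abs(X_svd), atol=1e-6)
print("PCA matches TruncatedSVD on centered data:", same_projection)
```

In practice, TruncatedSVD is mostly useful for sparse inputs, where centering would destroy sparsity.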

3. Automatic Feature Engineering

Although deep learning has been employed for feature engineering on image, video and text data, it requires a huge volume of training instances and is therefore not suitable for small or medium-sized datasets. Also, it is difficult to find a good representation of the features in ordinary tabular datasets to serve as input for deep learning (whereas data like images and video have a good representation by nature). Furthermore, the features learned by deep learning are difficult to explain and interpret, which prevents users from understanding the machine learning pipeline.

To this end, several other methods have been proposed to automate the tedious process of feature engineering. In general, there are three categories: (1) expansion-reduction; (2) greedy evolution; (3) learning-based transformation.

3.1 Rule-based Feature Expansion-Reduction

(Kanter & Veeramachaneni, 2015) propose a rule-based method, Deep Feature Synthesis, to expand the search space for relational data. Basically, they have some pre-defined functions for processing the rows of a table (e.g., aggregating some columns with MIN/MAX/SUM), and they follow the links between tables (i.e., primary-foreign key relationships) to join tables together and run these pre-defined functions on the joined tables.

After getting the processed features, they further use Truncated SVD for feature selection and dimensionality reduction. They build a random forest on top and since the processed features are fixed, they can fine-tune the hyper-parameters of the random forest by Bayesian Optimization.

3.2 Greedy Feature Evolution

(Katz et al., 2016) construct features greedily by evaluating the performance of the model trained with each candidate feature added. To avoid some of the expensive model training, they use learning to rank to order the newly constructed features and evaluate only the most promising ones. However, since they still need to train models to expand the feature space, this category of methods is still considered time-consuming.

3.3 Learning-based Transformation

Supervised Learning

(Nargesian et al., 2017) propose a machine learning-based method for feature transformation. Specifically, they have a pre-defined set of unary (e.g., log, square root) and binary (e.g., sum, subtraction) operations on features, and they train a classifier that predicts the most promising transformation for each feature, taking the Quantile Sketch Array of the feature(s) for all classes as its input.

To train this classifier, they evaluate the model with the original feature and with the transformed feature; if the improvement exceeds a threshold, they use this transformation as a positive sample and the other transformations as negative samples. One limitation is that their method only works for binary classification problems and considers a single feature at a time (it doesn't support operations over multiple features).
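To make the classifier input more concrete, here is a simplified sketch of a fixed-size, per-class histogram representation in the spirit of the Quantile Sketch Array. The function name, the equal-width binning and the default of 10 bins are my assumptions for illustration; the exact construction in the paper may differ.

```python
import numpy as np

def quantile_sketch_array(values, labels, n_bins=10):
    """Hypothetical helper: one normalized histogram per class, concatenated.

    All classes share the same bin edges (the feature's global min/max), so
    features with any number of rows map to a vector of the same length.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    lo, hi = values.min(), values.max()
    sketches = []
    for cls in np.unique(labels):
        counts, _ = np.histogram(values[labels == cls], bins=n_bins, range=(lo, hi))
        sketches.append(counts / counts.sum())
    return np.concatenate(sketches)

# Toy binary-classification feature (assumption): class 1 has larger values.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
label = np.array([0] * 500 + [1] * 500)

sketch = quantile_sketch_array(feature, label, n_bins=10)
print(sketch.shape)
```

Because the representation has a fixed length regardless of the dataset size, one classifier can be trained across features from many different datasets.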

Reinforcement Learning

(Khurana et al., 2018) is something of a follow-up to the above-mentioned paper. They again define a set of transformations for features, and they model feature engineering as a traversal of a transformation graph: each node is a dataset, and each edge is a transformation that turns one dataset (the source node) into another (the destination node) by applying the corresponding transformation to all applicable features. The traversal starts from the initial dataset, and the ideal solution applies a sequence of transformations until the optimal node is constructed.

To guide the expansion of the transformation graph, we need a strategy that picks the “correct” transformation and the “correct” source node at each step. Many factors influence this decision, e.g., the node’s accuracy, the transformation’s average performance, the number of times the transformation has been applied, the accuracy gain of the source node over its parent, the depth of the node, the remaining budget, the number of features in the node, and so on. They use reinforcement learning (more specifically, Q-learning with function approximation) to find the optimal strategy. One limitation is that they have to train on every dataset to learn the Q-value function for that dataset.
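For readers unfamiliar with Q-learning with function approximation, the following is a generic sketch of one update step with a linear Q-function. It is the textbook update rule, not the authors' code; the function names, the two-element feature vectors and the hyper-parameters are hypothetical.

```python
def q_value(weights, features):
    """Linear approximation: Q(s, a) = w . phi(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def q_learning_update(weights, features, reward, next_features_per_action,
                      alpha=0.1, gamma=0.9):
    """Return the weights after one temporal-difference update.

    features: phi(s, a) for the action just taken (e.g., node accuracy,
        node depth, remaining budget; hypothetical choices here).
    next_features_per_action: phi(s', a') for every action available next.
    """
    best_next = max((q_value(weights, f) for f in next_features_per_action),
                    default=0.0)
    td_error = reward + gamma * best_next - q_value(weights, features)
    return [w + alpha * td_error * f for w, f in zip(weights, features)]

# One update starting from zero weights (toy numbers).
new_weights = q_learning_update(
    weights=[0.0, 0.0],
    features=[1.0, 2.0],
    reward=1.0,
    next_features_per_action=[[1.0, 0.0], [0.0, 1.0]],
)
print(new_weights)
```

With zero initial weights, the temporal-difference error equals the reward, so each weight moves in proportion to its feature value.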

4. References

Kanter, J. M., & Veeramachaneni, K. (2015). Deep feature synthesis: Towards automating data science endeavors. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 1–10.
Katz, G., Shin, E. C. R., & Song, D. (2016). Explorekit: Automatic feature generation and selection. 2016 IEEE 16th International Conference on Data Mining (ICDM), 979–984.
Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E. B., & Turaga, D. S. (2017). Learning Feature Engineering for Classification. IJCAI, 2529–2535.
Khurana, U., Samulowitz, H., & Turaga, D. (2018). Feature engineering for predictive modeling using reinforcement learning. Thirty-Second AAAI Conference on Artificial Intelligence.

Some Good Papers in SIGMOD 2018 (Industry Sessions)

Sat, 16 Feb 2019 00:00:00 +0000

In this post, I am going to write my reading report for the industry papers in SIGMOD 2018. I will go through all the papers I find interesting and describe their basic ideas and methods, along with my personal evaluations. Please let me know if you find anything inappropriate.

Session 1: Adaptive Query Processing

Computation Reuse in Analytics Job Service at Microsoft
- Summary
  1. This paper presents a computation reuse framework, CLOUDVIEWS, addressing the computation overlap problem in Microsoft’s SCOPE job service.
  2. To materialize overlapping computations over recurring jobs (jobs that appear repeatedly, have template changes in each instance, and operate over new data each time), they use normalized signatures (which normalize the recurring changes) to identify subgraphs across recurring instances for materialization, and precise signatures to identify subgraphs within a recurring instance for reuse.
  3. To provide accurate estimation for materialization, they use a feedback loop which extracts runtime statistics from the previous runs (by enumerating all possible subgraphs of all jobs seen within a time window in the past).
  4. For the runtime, they build a metadata service for managing information about materialized views, which supports lookup (with an inverted index) and saving (with an exclusive lock). To prevent multiple jobs with the same overlapping computation from being scheduled concurrently, they reorder recurring jobs in the client job submission systems.
  5. Computation reuse actually finds hidden redundancies, promotes data sharing across teams, provides better reliability and better cost estimates.
- Comments
  1. This paper is well written and structured. As an industry paper, it depicts the challenge, architecture, interface in a clear way. The section for “Lessons Learned” is really interesting.

Session 2: Real-time Analytics

Pinot: Realtime OLAP for 530 Million Users
- Summary
  1. This paper presents Pinot, a single system at LinkedIn serving tens of thousands of analytical queries per second, while offering near-realtime data ingestion from streaming data sources and handling the operational requirements of large web properties.
  2. Pinot is used to power customer facing applications such as “Who viewed my profile” (WVMP) and newsfeed customization which requires very low latency, as well as internal business analyst dashboards where users want to slice and dice data.
  3. Pinot follows the lambda architecture, and supports near-realtime data ingestion by reading from Kafka and offline data from Hadoop. Zookeeper is used as persistent metadata store and as the communication mechanism between nodes in the cluster.
  4. Pinot uses fixed schema for tables, and tables are composed of segments. Segments are replicated and data in segments is immutable. Data orientation in Pinot segments is columnar, and various encodings are supported.
  5. Pinot has been designed as a shared-nothing architecture with stateless instances so that it can run on cloud infrastructure.
- Comments
  1. This paper is well written and structured. It provides a detailed and comprehensive solution to realtime OLAP analytics.
Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter
- Summary
  1. This paper presents TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device, engagement type, etc.
  2. TSAR is built on top of Summingbird, an open-source framework for integrating batch and online MapReduce computations.
  3. TSAR relies on Twitter’s Manhattan key-value store to provide access for high-load dashboard applications. The output of batch jobs is first written to HDFS, and then bulk imported into Manhattan.
- Comments
  1. This paper mentions the Lambda Architecture, and the Kappa Architecture - where everything is a stream and therefore there is no distinction between batch and stream processing. In the future, it seems that there will be a unified architecture incorporating these two architectures.
TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time
- Summary
  1. This paper presents TcpRT, the instrumentation and diagnosis system in Alibaba Cloud RDS for real-time anomaly detection.
  2. The overall workflow/architecture of TcpRT is as follows: the kernel module collects the metrics of each query and sends them to a local process for aggregation; the results are written to Kafka, and ETL jobs running on JStorm are triggered to process the data in Kafka and transform it into time series. These time series outputs are cached in a Redis cluster for a while and then flushed to HybridDB. The automatic anomaly detection module periodically scans the time series data in the Redis cluster and HybridDB.
  3. They implement the collector module on top of the TCP congestion control module in Linux kernel, and they use a customized debugfs (a high performance in-memory filesystem) to transfer the collected trace records to the user space for further aggregation and transmission.
  4. To support exactly-once semantics, they run an independent offline repair job that replays the data within the failure window with eventual consistency.
  5. As for anomaly detection, they use a self-adjustable Cauchy distribution statistical model from historical performance data for each DB instance. They also refer to the network topology and anomalous events (e.g., out-of-order, retransmissions) to detect network issues.
- Comments
  1. This paper is well written and it provides a detailed solution to real-time monitoring. The visualization/reporting component is as important as the infrastructure.

Session 3: DB systems in the Cloud and Open Source

Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
- Summary
  1. This paper presents Amazon Aurora, a high-throughput cloud-native relational database, which pushes redo processing to a multi-tenant scale-out storage service.
  2. Aurora consists of database instances which act as SQL endpoints and include most of the components of a database kernel (query processing, access methods, transactions, locking, buffer caching and undo management) and storage fleet which takes over redo logging, materialization of data blocks, garbage collection and backup/restore.
  3. Aurora uses a quorum model for reads and writes, where a system that employs \(V\) copies must obey two rules: (1) \(V_r + V_w > V\); (2) \(V_w > V/2\). Storage is partitioned into segments, which are the minimum unit of failure; they are small, with no more than 10GB of addressable data blocks.
  4. Each storage node in Aurora maintains a local Segment Complete LSN (SCL), and piggybacks it to the database instance which advances the Protection Group Complete LSN (PGCL) if four of six members of the protection group of storage nodes have advanced. They further have a Volume Complete LSN (VCL) on top of PGCL.
  5. For crash recovery, it simply computes VCL and annuls any log records beyond VCL, and it uses volume epoch for establishing write quorum.
  6. Aurora doesn’t do quorum read as the database instance knows which segments have the last durable version of a data block and can request it directly.
  7. Aurora uses epochs for membership changes, and it makes two transitions (adding the new member and then discarding the failed member) so that each transition is reversible.
  8. In Aurora, a protection group is composed of three full segments, which store both redo log records and materialized data blocks, and three tail segments, which contain redo log records alone. By doing this, it gives a smaller cost amplification and provides better flexibility.
- Comments
  1. This paper is well written. However, it doesn’t talk about how these database instances communicate.
  2. This paper presents an elegant and effective solution for scaling a relational database in the cloud: it isolates the complex logic and the heavy computation/storage to make better use of the cloud environment.
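The two quorum rules from the Aurora summary above can be checked in a few lines. The sketch below also includes a deliberately simplified model of how a Protection Group Complete LSN could be derived from the six Segment Complete LSNs: it returns the largest LSN acknowledged by at least four segments. The function names are hypothetical, and the configuration of six copies with a write quorum of four and a read quorum of three is the one described in the Aurora papers.

```python
def is_valid_quorum(v, v_r, v_w):
    """Check the two rules: V_r + V_w > V and V_w > V / 2."""
    return v_r + v_w > v and v_w > v / 2

def protection_group_complete_lsn(segment_scls, write_quorum=4):
    """Simplified model: the largest LSN that at least `write_quorum`
    segments have reached, i.e. the write_quorum-th largest SCL."""
    return sorted(segment_scls, reverse=True)[write_quorum - 1]

# Aurora-style configuration: 6 copies, read quorum 3, write quorum 4.
print(is_valid_quorum(v=6, v_r=3, v_w=4))

# A write quorum of 3 out of 6 would allow two conflicting writes to succeed.
print(is_valid_quorum(v=6, v_r=3, v_w=3))

# Toy SCL values reported by the six segments of one protection group.
pgcl = protection_group_complete_lsn([100, 98, 97, 95, 90, 80])
print(pgcl)
```

The first rule guarantees that every read quorum overlaps the latest write quorum; the second guarantees that two write quorums always overlap.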
Survivability of Cloud Databases - Factors and Prediction
- Summary
  1. This paper presents a solution to predicting how long public cloud databases survive before being dropped on Azure SQL DB.
  2. They use Kaplan-Meier (KM) estimator to estimate the survival curve empirically.
  3. They formulate the problem as predicting whether a database will live for more than 30 days given 2 days of telemetry data. This is a classification problem, and they train a random forest for prediction. Features include creation time, server and database names (patterns), database size, edition and performance level, subscription type and subscription history.
- Comments
  1. This paper is well written. Although the method is relatively simple, it researches around an interesting and useful problem in cloud databases, which is like predicting user churn rate in subscription-based websites.
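For reference, the Kaplan-Meier estimator mentioned in the summary above can be sketched in plain Python. The function name and the toy lifetimes are hypothetical; "observed" is False for databases that were still alive at the end of the observation window (censored).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct drop time.

    S(t) is the product over drop times t_i <= t of (1 - d_i / n_i), where
    d_i is the number of drops at t_i and n_i is the number still at risk.
    """
    events = sorted(zip(durations, observed))
    n_at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        drops = 0
        leaving = 0
        # Group all databases whose lifetime (or censoring time) equals t.
        while i < len(events) and events[i][0] == t:
            drops += 1 if events[i][1] else 0
            leaving += 1
            i += 1
        if drops > 0:
            survival *= 1.0 - drops / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Toy data (assumption): lifetimes in days; False means still alive (censored).
durations = [1, 2, 2, 3, 5]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for day, prob in curve:
    print(day, round(prob, 3))
```

Censored databases reduce the number at risk without counting as drops, which is what lets the estimator use databases that are still alive.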

Graph Databases & Query Processing on Modern Hardware

RAPID: In-Memory Analytical Query Processing Engine with Extreme Performance per Watt
- Summary
  1. This paper presents RAPID, a relational query processing engine designed to support modern analytical workloads with an emphasis on architecture-conscious performance at lower power consumption compared to existing database systems. RAPID provides a novel design from scratch with hardware aware data/storage model, query optimizations and data processing operators.
  2. For the hardware architecture, RAPID adopts a Data Processing Unit (DPU) consisting of 32 data processing cores (dpCore), an on-chip programmable data movement engine Data Movement System (DMS), and a hardware block called the Atomic Transaction Engine (ATE) to communicate among dpCores efficiently.
  3. The RAPID software is integrated into the host DBMS by offloading the query to RAPID for execution and sending results back, therefore durability and persistence are provided by the host database system.
  4. RAPID stores the entire data in main-memory in columnar format, and it extensively uses decimal scaled binary number (DSB) encoding, dictionary encoding and run length encoding.
  5. RAPID query execution framework adopts: (1) push-based execution model to avoid deep call stacks and save valuable resources (e.g., instruction caches and program stack memory); (2) an actor model for parallelism; (3) hardware-aware design for relational data access; (4) vectorized query processing.
  6. RAPID’s query compiler and optimizer is a cost-based physical query optimizer working on top of the logical query optimizations by the host database. It uses “task” as the materialization point to stage pipelines. It also proposes some optimizations on partitioning and has optimized implementations for data processing operators. RAPID implements hash join by using a hash join kernel optimized for the DPU and DMEM.
- Comments
  1. This paper is well written. It points out a fruitful direction for future research: co-design of hardware and software for specific applications (e.g., database and deep learning).

2019 Weekly Summary Report

Tue, 01 Jan 2019 00:00:00 +0000

2019/12/30 - 2020/01/05

Davos
- Fix random bugs

2019/12/23 - 2019/12/29

Relax in Japan

2019/12/16 - 2019/12/22

Relax in Japan

2019/12/9 - 2019/12/15

Davos
- Discuss UDF and UDA
- C++ UDF
- C++ UDA
Travel for NeurIPS 2019

2019/12/2 - 2019/12/8

Research
- Camera ready for MetaLearn 2019
Davos
- Load a large dataset (100GB) into a DB on GCP
- Refactor UDF/UDA docs
- New image
Alpine Meadow
- Refactor
D3M
- Submit TA2 image
Reading
- Exceptional C++ (done)
Misc
- Review SIGMOD papers (2)

2019/11/25 - 2019/12/1

Davos
- Design new architecture
- Update UDF/UDA design doc
- Update other design docs
- Add planner
Alpine Meadow
- 1.0 design doc
Reading
- DuckDB code
Misc
- Update J-1 waiver doc

2019/11/18 - 2019/11/24

Research
- DuckDB benchmark
- Throughput benchmark
- Sampling benchmark
Davos
- Fix random bugs
- Support removing job
Alpine Meadow
- Refactor code
Niseko
- Refactor code
Reading
- Interpretable Machine Learning Chap 1-7 (finished)
- DuckDB code
Misc
- Prepare J-1 waiver doc

2019/11/11 - 2019/11/17

Davos
- Design documents
- Refactor Python operators
- Refactor job-related code (e.g., removing job runner)
- Make Python test easier
Reading
- Random papers (3)
Misc
- Prepare Japan tourist visa
- International driving permit

2019/11/04 - 2019/11/10

Research
- Paper section 3-4
- Comparison with MonetDB
- Investigate VerdictDB
- Discuss design
Davos
- Investigate pybind11
Reading
- Linux多线程服务端编程 Chap 7-12 (finished)

2019/10/28 - 2019/11/03

Research
- Design and implement scheduling algorithm
- Prepare DSAIL Retreat slides & poster
- Collect ground truths and validate
- Support caching for streams
- Restructure the paper
Davos
- Fix random bugs
- Support context for job
- Implement rename operator
- Support time range
- Fix caching for calculated data source
- Fix null values for binning and aggregation
- Fix float/double precision problem
Reading
- Linux多线程服务端编程 Chap 2-6
- VLDB paper (1)

2019/10/21 - 2019/10/27

Research
- Generate workloads with IDE-Bench
- Experiment plan
- Adaptive scheduling strategy
- Profile and improve throughput
- Benchmark for huge dataset
Davos
- Fix random bugs
Reading
- LLVM Essentials Chap 5-7
- LLVM Slides 1-3 (finished)
- VLDB papers (4)
- Linux多线程服务端编程 Chap 1
Misc
- Baidu scholarship application
- Polish website
- Clean room

2019/10/14 - 2019/10/20

Research
- Discuss with Tim about paper submission (VLDB Dec)
- Benchmark for multiple jobs
- Profile and improve throughput
- Improve binning with Arrow kernels
Davos
- Support deprecating streams in Python
Reading
- Designing Data-Intensive Applications Chap 11-12 (finished)
- LLVM Essentials Chap 1-4
Misc
- Thesis proposal submission
- Mail OPT application

2019/10/07 - 2019/10/13

Research
- Davos paper section 6-9
Davos
- Assign operator and job id
- Create operator/job from description
- Upgrade Apache Arrow to 0.15
- Support random in filter
- Integrate Prometheus cpp
Alpine Meadow
- Integrate feature engineering
Niseko
- Run dump experiments
Reading
- Designing Data-Intensive Applications Chap 9-10

2019/09/30 - 2019/10/06

Research
- Davos paper section 2-5
Davos
- Refactor AutoML
- Support feature importance operator
- Design memory management
- Make AutoML and feature importance faster
- Refactor job management
- Refactor Cython structure
- Store aggregation results in nested layout
- Investigate running Python in C++
Niseko
- Run dump experiments
Reading
- Designing Data-Intensive Applications Chap 5-8
Misc
- Make thesis proposal

2019/09/23 - 2019/09/29

Davos
- Simple scheduling strategy
- Support min/max/distinct count aggregation
- Keep track of job response timeline
- Keep track of Python operator
- Support startswith in filter
- Refactor job runner
- Refactor AutoML
- Switch to pytest
- Conform to Google C++ style
- Fix random bugs
Niseko
- Expose more information
- Run dump experiments

2019/09/16 - 2019/09/22

Research
- Davos paper new outline
- Davos paper writing
Davos
- Support training and testing ML pipelines
- Make a list of TODO features
- Add SubJob operator
- Refactor code to support job-based scheduling
- Set up CppLint and clang-tidy
- Refactor code
- Keep track of jobs and operators in both C++ and Python
- Implement new scheduling model
Reading
- C++ Concurrency in Action Chap 8-10 (finished)
- Random English readings (2)
- Random paper (1)
- Designing Data-Intensive Applications Chap 1-4
Misc
- Practice piano (p58)

2019/09/09 - 2019/09/15

Research
- Niseko paper submission (NeurIPS Meta-Learning workshop)
- Alpine Meadow & Davos paper submission (NeurIPS Systems for ML workshop)
- Davos paper writing
Davos
- Make starter projects for UROPs
- Support faster projection
- Implement OrderBy operator
- Framework for translating jobs to SQL
Reading
- C++ Concurrency in Action Chap 4-7
Misc
- Practice piano (p50)

2019/09/02 - 2019/09/08

Research
- Davos paper outline
- Davos paper writing
Davos
- Fix Gandiva offset bug
- Support group by for strings
- Improve sample info
Alpine Meadow
- Fix random bugs
Niseko
- Run dump experiments
Reading
- C++ Concurrency in Action Chap 1-3

2019/08/26 - 2019/09/01

Research
- Davos paper introduction
Davos
- Support reading from S3
- Support shuffling when sampling
- Attend Arrow talk at VLDB
- Set up Arrow development environment
- Investigate Gandiva bug
- Upload a 100 GB file to S3 and test
- Fix CSV data source progress for S3
Alpine Meadow
- Fix random bugs
Niseko
- Understand AM dumps
Reading
- Effective Modern C++ Chap 6-8 (finished)
- Make VLDB 2018 reading list

2019/08/19 - 2019/08/25

Davos
- Support reading from MySQL database
- Generate documentation
- Test huge dataset
- PM2 to manage server process
- Support stopping job
- Fix several bugs
- Translator framework
- Simple benchmark
- Refactor CMake files
Reading
- Effective Modern C++ Chap 4-5
- Star code

2019/08/12 - 2019/08/18

Davos
- Support reading from SQL database (SQLite3)
- AutoML operator
- Update API for sampling
- Fix several bugs
- Simple monitor
- Fix hanging issue
- Support creating data source from job
- In operator
- Distinct operator
- Renamings in API
- Refactor server code
Reading
- Effective Modern C++ Chap 3
Misc
- Move!

2019/08/05 - 2019/08/11

Davos
- Deprecated stream
- Join operator
- Run Alpine Meadow
- Design AutoML API
Alpine Meadow
- Refactor to remove D3M dependencies
- Refactor to run on Davos
- Fix exiting (adding timeout)
- Fix logging
Reading
- Effective STL Chap 3-7 (finished)
- Effective Modern C++ Chap 1-2

2019/07/29 - 2019/08/04

Research
- Davos paper outline
- Read NIPS reviews
- Submit NIPS rebuttal
Davos
- Add logging
- Simple AutoML operator
- Python script operator
- Refactor server code
- Architecture design
- Semantic model
- Code coverage report
- More comments
Reading
- 七周七并发 Chap 3-9 (finished)
- RxCpp code
- Effective STL Chap 0-2
- Terrier code
- HpBandSter code
- Tuplex code

2019/07/22 - 2019/07/28

Research
- Fix memory leak bug in Davos
- Better memory management
- Better job management
- Adopt publisher/subscriber model for stream in Davos
- Refactor C++ and Python code
Reading
- More Effective C++ Chap 5 (finished)
- Essential C++ Chap 1-7 (finished)
- 七周七并发 Chap 1-2

2019/07/15 - 2019/07/21

Research
- Add progress in data stream
- Investigate Python Embedding
- Implement filter in C++
- Implement calculate attribute in C++
- Fix docker image
- Introduce ANTLR for UDF
- Internal metrics
- Support sampler in Davos
D3M
- Fix bugs
- Prepare submission for TA2
Reading
- More Effective C++ Chap 4

2019/07/08 - 2019/07/14

Research
- Implement new server
- Prepare docker images for development and server
- Upgrade Apache Arrow
D3M
- Fix server bugs
- Switch back to using D3M primitives
- Merge TA2 stuff together
- Evaluation script for TA2
- Support more datasets
- Support data augmentation
Reading
- More Effective C++ Chap 3
- Random papers (1)

2019/07/01 - 2019/07/07

Research
- Fix binning, aggregation operator
- Set up docker
- SIGMOD talk & poster
- Set up CI
- Land new design to master branch
- Fix warnings & linters
- Support task discovery
- Support machine learning primitives
- Add more tests
D3M
- Prepare DARPA D3M Brief
Reading
- TPOT code
- More Effective C++ Chap 1-2

2019/06/24 - 2019/06/30

Research
- Support filter, brush in Davos
- Support histogram workflow
- Update SIGMOD talk slides
D3M
- Fix server bug
Reading
- Random papers (4)

2019/06/17 - 2019/06/23

Research
- Implement new design

2019/06/10 - 2019/06/16

Research
- Prepare SIGMOD talk slides
D3M
- TA2 submission
Reading
- Python High Performance Chap 1-9 (finished)
- Random papers (2)

2019/06/03 - 2019/06/09

Research
- Discuss new design
Reading
- Effective C++ Chap 7-9 (finished)
- Random papers (2)

2019/05/27 - 2019/06/02

Research
- Support merge tables in Davos
- Support reservoir sampling in Davos
- Support UDF in Davos
- ML primitives and pipeline in Davos
- Support projection in Davos
- Davos paper outline
- Support single and multiple batches in Davos
- Support scoring pipelines in Davos
Reading
- Effective C++ Chap 1-6
- Random papers (2)

2019/05/20 - 2019/05/26

Research
- NIPS paper submission
D3M
- TA3 submission
Class
- 6.858 Exam
Reading
- 分布式机器学习 Chap 7-12 (finished)
- Random papers (1)

2019/05/13 - 2019/05/19

Research
- NIPS paper writing
D3M
- TA2 submission
- TA3 submission
Class
- 6.858 checkoff meeting
- 6.888 poster session
Reading
- Cython book Chap 11-13
- Review AutoML papers (2)
- 分布式机器学习 Chap 1-6
- Random papers (2)

2019/05/06 - 2019/05/12

Davos
- More aggregation functionality
- Support chunk-based reading of CSV
Alpine Meadow
- Run benchmarks for Alpine Meadow
D3M
- Fix TA2 submission
Class
- 6.858 final project
- 6.888 final project
Reading
- Cython book Chap 7-10
Misc
- Netherlands visa interview

2019/04/29 - 2019/05/05

Davos
- Start writing design document
- Faster implementation of filter
- Full implementation of binning
- Full implementation of brushes
- Aggregation with confidence interval
Alpine Meadow
- Run benchmarks
D3M
- Fix TA2TA3 CI
Class
Reading
- NIPS papers (3)
- Cython book Chap 5-6
Misc
- Netherlands visa appointment

2019/04/22 - 2019/04/28

Davos
- Understand histogram in IDEA
- Investigate Bazel
- Use docker to compile Davos API protobuf
- C++ implementation of binning
Alpine Meadow
- Run benchmarks
Class
- 6.888 PC Discussion
- 6.888 Project Meeting
- 6.888 PC Discussion Summary
Reading
- NIPS papers (3)
- Cython book Chap 1-4
- Understand ConfigSpace
- Understand Apache Arrow documentation
- Understand DataLinter

2019/04/15 - 2019/04/21

D3M
- JIRA/Gitlab integration
- Expose outputs in the new backend
- Support simple D3M dataset in the new backend
- Set up CI for the new backend
- Interview with UROPs
- Show metrics for the new backend
- Scramble code of Alpine Meadow
Research
- Brainstorm NIPS paper idea
- Submit the final SIGMOD camera-ready paper
- Run benchmarks for Alpine Meadow
Class
- 6.858 Lab 4
- 6.888 Paper Reviews
- 6.888 Paper Review Slide
Reading
- NIPS papers (5)
- AutoML book Chap 10
- Understand autosklearn
- Understand SMAC
- Tensorflow paper
- AlphaD3M paper

2019/04/08 - 2019/04/14

D3M
- Fix benchmark of Alpine Meadow
- Support graph for fast D3M dataset loader
- Scrum tutorial
- Add examples for Alpine Meadow
- Support catalog for new backend
- Better logging in the new backend
- Simple Python prototype for the new backend
Research
- Submit SIGMOD camera-ready paper
Class
- 6.888 Project Meeting
- 6.888 Lab 4
- 6.858 Lab 4
Reading
- AutoML book Chap 1-9
- Hulu Deep Learning posts
Misc
- AutoML post
- Add research of Alpine Meadow on personal website
- NIPS papers (3)

2019/04/01 - 2019/04/07

D3M
- Discuss new backend design
- Investigate Jira
- Dump metrics as json
- Setup and run benchmarks on K8s
Research
- Run experiments for SIGMOD camera-ready paper
Class
- 6.888 final project proposal
Reading
- PhD grind (finished)
- AutoML post Section 0 and 1
Misc
- Update research page for personal website

2019/03/25 - 2019/03/31

D3M
- Support persistent caching for image datasets
- Support for Strata
- Add benchmark for TPOT and autosklearn
- Fast D3M dataset loader for image dataset
- Refactor ensemble as an optimization
Research
- New idea for feature engineering
- New idea for backend
- Run experiments for SIGMOD camera-ready paper
- Update plots for SIGMOD camera-ready paper
Class
- Find final project group for 6.888
Reading
- Mastering PhD Chap 11-19 (finished)

2019/03/18 - 2019/03/24

D3M
- Set up Kubernetes volume on GCP with Filestore
Research
- Discuss the new backend system design
- Run experiments for SIGMOD camera-ready paper
Class
- 6.888 Lab 3 Part II (finished)
- Prepare for 6.858 Quiz
- 6.858 Lab 3
Reading
- 推荐系统实践 Chap 2-8 (finished)
- Mastering PhD Chap 4-10

2019/03/11 - 2019/03/17

D3M
- Set up dockerfile for GPU and test the system with GPU
- Make the marketing dataset faster by pre-computing results
- Export a pipeline as a script
- Submit TA3 image for dry-run evaluation
Class
- 6.888 Lab 3 Part I
Reading
- Automatic feature engineering papers
- Mastering PhD Chap 1-3

2019/03/04 - 2019/03/10

Research
- Collect some related works for new backend
- Implement the first prototype for histogram in the new backend
D3M
- Submit TA2 image for dry-run evaluation
- Run the marketing dataset
Class
- 6.888 Lab 2
Reading
- Hands on Machine Learning Chap 9-16, Appendix (finished)
- SIGMOD 2018 industry papers (3)
- 百面机器学习 Chap 13-14
- CS231n slides
- Write feature engineering post
- CNN book (解析卷积神经网络——深度学习实践手册)
- CS20 slides
- 推荐系统实践 Chap 1

2019/02/25 - 2019/03/03

Research
- Discuss new backend design
- Propose the first version of Protobuf for the new backend
- Investigate streaming engine
D3M
- Submit TA2/TA3 to D3M program index
- Upgrade to latest TA2-TA3 API
Class
- 6.858 Lab 2
Reading
- Hands on Machine Learning Chap 4-6, 8
- SIGMOD 2018 industry papers (4)
- 百面机器学习 Chap 4-12

2019/02/18 - 2019/02/24

Research
- Submit revised paper to SIGMOD
D3M
- Add routine heartbeat message
- Add tag for pipeline
Reading
- Deep Learning Chap 12-20 (finished)
- Hands on Machine Learning Chap 1-3, 7
- 百面机器学习 Chap 1-3

2019/02/11 - 2019/02/17

Research
- Address SIGMOD paper reviews
D3M
- Speed up time model
- Support graph matching
- Dump cost model as CSV/JSON
- Check performance variation on the mimic dataset (it appears to be much smaller now)
- Visualize dumps/logs (time on the x-axis, pipeline performance on the y-axis)
Class
- 6.858 Lab 1
- 6.888 Lab 0, 1
Reading
- Deep Learning Chap 10, 11
- SIGMOD 2018 research papers (18)
Japanese kana (gojūon)

Some Good Papers in SIGMOD 2018 (Research Sessions 9-15)

Mon, 15 Oct 2018 00:00:00 +0000

In this post, I write my reading report for the research papers in SIGMOD 2018. I will go through all the papers I find interesting and describe their basic ideas and methods, along with my personal evaluations. Please let me know if you find anything inappropriate.

Session 9: Similarity Queries & Estimation

Lightweight Cardinality Estimation in LSM-based Systems
- Summary
  1. This paper presents a light-weight way of collecting statistics that alleviates the high cost of building synopses by incorporating the statistics accumulation into the common LSM-based database storage layer lifecycle events, e.g., flush, merge.
  2. They support three kinds of synopses: equi-width histograms, equi-height histograms and wavelets, only on primary keys (PK) or on secondary keys (SK). To address the issue of anti-matter records (e.g., deletions), they construct a separate “anti”-synopsis for anti-matter records and compute the difference of the two synopses at query time. As for distributed computation, every LSM-framework event creates a local synopsis that is sent over the network to the master node.
  3. The mergeability of synopses is important in the distributed setting; however, only equi-width histograms can be combined losslessly (wavelets allow merging with some loss of accuracy). Since this paper primarily focuses on using the statistics for query optimization, where a slight mis-estimation could lead to significant errors, the authors choose to keep all statistics, even mergeable ones, as separate entries. To mitigate the cost of querying synopses, the system periodically merges appropriate synopses.
  4. They used an experimental framework from a SIGMOD 1996 paper that supports generating synthetic data following many distributions. In addition, they used a real-life dataset of web server log entries from the 1998 World Cup. They used four types of range queries: fixed-length, half-open, random and point. They measured the overhead and accuracy in the experiments.
- Comments
  1. This paper is well written with enough background and smooth logic flow. However, the pseudocode of the algorithms is not well formatted.
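To make the synopsis and anti-synopsis idea concrete, here is a minimal sketch of a mergeable equi-width histogram in Python. The class name, the integer key domain and the uniform-within-bucket estimate are my own assumptions for illustration; this is not the paper's implementation.

```python
class EquiWidthHistogram:
    """Equi-width histogram over a numeric key domain [lo, hi)."""

    def __init__(self, lo, hi, num_buckets):
        self.lo, self.hi, self.num_buckets = lo, hi, num_buckets
        self.width = (hi - lo) / num_buckets
        self.counts = [0] * num_buckets

    def add(self, key):
        # Clamp so that keys at the upper edge fall into the last bucket.
        idx = min(int((key - self.lo) / self.width), self.num_buckets - 1)
        self.counts[idx] += 1

    def merge(self, other):
        # Equi-width histograms over the same domain merge by adding counts.
        merged = EquiWidthHistogram(self.lo, self.hi, self.num_buckets)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def estimate(self, start, end):
        # Estimate the number of keys in [start, end), assuming keys are
        # spread uniformly inside each bucket.
        total = 0.0
        for i, count in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(end, b_hi) - max(start, b_lo))
            total += count * overlap / self.width
        return total


def estimate_cardinality(synopsis, anti_synopsis, start, end):
    # Deleted (anti-matter) records are tracked separately and subtracted.
    return max(0.0, synopsis.estimate(start, end) - anti_synopsis.estimate(start, end))


# One component inserted keys 0..99, a later one deleted keys 0..9.
synopsis = EquiWidthHistogram(0, 100, 10)
for key in range(100):
    synopsis.add(key)
anti = EquiWidthHistogram(0, 100, 10)
for key in range(10):
    anti.add(key)
estimate = estimate_cardinality(synopsis, anti, 0, 50)
```

In this toy setup the estimate for the range [0, 50) is the 50 inserted keys minus the 10 deleted ones.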
Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation
- Summary
  1. This paper presents the Column Sketch, a data structure for accelerating scans which uses lossy compression to improve performance regardless of selectivity, data-value distribution, and data clustering.
  2. Traditional indexes (e.g., B-trees) don’t perform well with high selectivity; lightweight indexes (e.g., ZoneMap, Column Imprints and Feature Based Data Skipping), which use summary statistics over groups of data to enable data skipping, provide no help when data does not exhibit clustering properties; early pruning methods (e.g., Byte-Slicing, Bit-Slicing and Approximate and Refine) partition data, decompose the predicate into conjunctions of disjoint sub-predicates and use them to skip blocks of data. The authors therefore propose the column sketch for queries with moderate selectivity over unclustered data.
  3. The column sketch consists of two parts: a compression map that maps the values in the base data to their assigned codes in the sketched column, and a sketched column that stores the output of the compression map. During a query, we read the sketched column first and only look at the base data when the sketched code alone cannot decide the predicate.
  4. The objectives of the compression map are to: (1) assign frequently seen values their own unique code; (2) assign non-unique codes a similar number of values; (3) preserve order when necessary; (4) handle unseen values in the domain without re-encoding; (5) exploit frequently queried values (optional). For numerical compression maps, they use equi-depth histograms with reserved codes for frequent values; for categorical compression maps, they use dictionary encoding with a limit on the number of unique codes.
  5. They use SIMD instructions when doing predicate evaluation over column sketches.
  6. They compared against an optimized sequential scan, BitWeaving/V, Column Imprints and a B-tree index. To eliminate the effects of NUMA, each of the experiments is run on a single socket. They measure the performance in terms of cycles per tuple.
- Comments
  1. This paper mentions an interesting fact: currently scans outperform B-trees for query selectivities as low as 1%.
  2. This paper is well written with clear logic flows. It is important to first clarify your design goals then describe the implementation details to let readers better understand your techniques.
  3. This paper finds a scenario (or a problem setting) where the current approaches don’t work well (i.e., queries with moderate selectivity over unclustered data), then proposes the targeted methods to address it.
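A small Python sketch of the two parts described above, under simplifying assumptions: the compression map is a plain order-preserving equi-depth bucketing (no reserved codes for frequent values, no SIMD), and the function names are hypothetical. It shows why only the boundary code needs a look at the base data for a `value < x` predicate.

```python
import bisect


def build_compression_map(values, num_codes):
    # Pick equi-depth bucket boundaries; boundaries[i] is the smallest
    # value that maps to code i + 1, so the mapping preserves order.
    ordered = sorted(values)
    step = len(ordered) / num_codes
    return sorted({ordered[int(i * step)] for i in range(1, num_codes)})


def encode(boundaries, value):
    return bisect.bisect_right(boundaries, value)


def scan_less_than(base, sketch, boundaries, x):
    # Evaluate `value < x` on the sketch first; touch the base data only
    # for positions whose code equals the code of x.
    target = encode(boundaries, x)
    result, base_accesses = [], 0
    for pos, code in enumerate(sketch):
        if code < target:
            result.append(pos)          # certainly qualifies
        elif code == target:
            base_accesses += 1          # undecided: check the base value
            if base[pos] < x:
                result.append(pos)
        # code > target: certainly does not qualify
    return result, base_accesses


base = [37, 5, 92, 18, 64, 41, 77, 3, 58, 29, 86, 12]
boundaries = build_compression_map(base, 4)
sketch = [encode(boundaries, v) for v in base]
positions, base_accesses = scan_less_than(base, sketch, boundaries, 40)
```

Here only a quarter of the positions fall into the boundary code, so most of the scan is answered from the small sketched column.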
Overlap Set Similarity Joins with Theoretical Guarantees
- Summary
  1. This paper presents a solution to the set similarity join problem with overlap constraints. It divides the sets into small and large ones and processes them separately. They propose some optimization heuristics for small sets since there are already existing methods for large sets.
  2. The size boundary between small and large sets is crucial to efficiency, so they further propose a cost-based method to select the size boundary.
- Comments
  1. This paper is well written and structured.

Session 10: Analytical Queries

Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality
- Summary
  1. This paper presents an algorithm to solve the k-regret query problem, which is an integration of top-k and skyline queries.
  2. They propose the algorithm SPHERE, which is a variation of the \(\epsilon\)-kernel algorithm.
- Comments
  1. This paper is well written as a theory paper. However, I don’t have the patience to carefully read it.
A Rating-Ranking Method for Crowdsourced Top-k Computation
- Summary
  1. This paper presents a rating-ranking method for crowdsourced top-k computation, which asks either rating or ranking questions to the crowd.
  2. Rating questions are used to get a rough score for each object, from which the objects with much smaller scores can be pruned; ranking questions are used to refine the scores.
- Comments
  1. This paper is well written and structured. It uses some statistics and probability methods to estimate the top-k objects, and the combination of rating and ranking is novel.

Session 13: Machine Learning & Knowledge-base Construction

SketchML: Accelerating Distributed Machine Learning with Data Sketches
- Summary
  1. This paper presents SketchML, which compresses the sparse and nonuniform-distributed gradient values to better support distributed machine learning (assuming that the communication cost of gradients is nontrivial, i.e. communication-intensive workloads).
  2. For a sparse gradient vector consisting of key-value pairs, they use a sketch-based algorithm (quantile sketch with loss of accuracy) to compress values and a delta-binary encoding method (without loss of accuracy) to compress keys.
  3. They first convert the gradient values to the bucket indexes by quantile sketch, then use a MinMaxSketch to store the bucket indexes. This is based on the assumption that SGD can still converge with quantification error and underestimated gradients. They also propose separation of positive/negative gradients to avoid reversed gradient (where gradient is actually overestimated) and adaptive learning rate and grouped MinMaxSketch to address vanishing gradient.
  4. They use datasets from KDD Cup 2010 and KDD Cup 2012 to compare against Adam SGD and ZipML on Logistic Regression, SVM and Linear Regression.
- Comments
  1. This paper is well written and structured. However, some parts are a little bit wordy (e.g., why they didn’t use frequency sketch).
  2. They also point out the limitations of their methods in the paper, which I think is good.
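The two compression steps for a sparse gradient can be sketched roughly as follows in Python. This is only an illustration under my own assumptions: keys are sorted integer indexes, the quantile split points are computed exactly by sorting rather than with a quantile sketch, and the MinMaxSketch stage is left out entirely.

```python
import bisect


def delta_encode(keys):
    # Lossless: store the gaps between consecutive sorted keys, which are
    # much smaller numbers than the keys themselves.
    prev, deltas = 0, []
    for k in keys:
        deltas.append(k - prev)
        prev = k
    return deltas


def delta_decode(deltas):
    keys, total = [], 0
    for d in deltas:
        total += d
        keys.append(total)
    return keys


def quantile_splits(values, num_buckets):
    # Equal-frequency split points; assumes len(values) >= num_buckets.
    ordered = sorted(values)
    size = len(ordered) / num_buckets
    return [ordered[int(i * size)] for i in range(1, num_buckets)]


def to_bucket_indexes(values, splits):
    # Lossy: each gradient value is replaced by its bucket index.
    return [bisect.bisect_right(splits, v) for v in values]


keys = [3, 10, 11, 250, 1024]
deltas = delta_encode(keys)

values = [-0.9, -0.1, 0.05, 0.1, 0.2, 0.8, -0.05, 0.01]
splits = quantile_splits(values, 4)
indexes = to_bucket_indexes(values, splits)
```

With 4 buckets each value needs only 2 bits, at the cost of losing its exact magnitude inside the bucket.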
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
- Summary
  1. This paper presents MISTIQUE, a system that works with traditional ML pipelines and deep neural networks to efficiently capture, store and query model intermediates for diagnosis.
  2. To store the model intermediates, they use quantization (including lower precision float representation, k-bit quantization and threshold-based quantization), exact and approximate de-duplication (e.g., sharing the intermediate results between pipelines/models) and adaptive materialization (only materializing frequently-queried intermediates), to reduce the storage footprint.
  3. They propose a cost model to decide whether to store an intermediate and whether to execute a query by running a model or reading an intermediate.
  4. They used the dataset and pipelines from the Kaggle Zestimate competition for traditional ML models, and CIFAR10 with VGG16 and another simple CNN model for DNN models.
- Comments
  1. This paper is well written and structured.
  2. The topic of storing intermediate results is very interesting and also important, especially for model diagnosis and AutoML. By smart use of intermediate results, we can save lots of computation and also use them to better guide the model search.
A General and Efficient Querying Method for Learning to Hash
- Summary
  1. This paper presents a new fine-grained similarity indicator, quantization distance (QD), to replace the Hamming Rank (HR) used in learning to hash for the approximate nearest neighbors (ANN) search problem. They further develop two efficient querying methods based on QD.
  2. The quantization distance is defined as the sum, over all dimensions, of the product of the XOR of the binary codes and the absolute value of the projected vector. Intuitively, it measures the minimum change required to the projected vector such that it can be quantized to the bucket.
- Comments
  1. This paper is well written and structured.
  2. Learning to hash is an interesting topic and the proposed indicator QD is more reasonable than coarse-grained Hamming Rank. It reminds me that sometimes we can replace some intermediate metric (or indicator) to achieve better performance and help algorithm design.
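The definition of QD above translates almost directly into code. The sketch below assumes sign-based binarization of the projected query vector (bit 1 for a non-negative projection), which is my assumption for illustration; the function names are hypothetical.

```python
def binary_code(projected):
    # Assumed binarization: 1 if the projection is non-negative, else 0.
    return [1 if p >= 0 else 0 for p in projected]


def quantization_distance(projected, bucket):
    # Sum over dimensions of (query bit XOR bucket bit) * |projection|.
    code = binary_code(projected)
    return sum((c ^ b) * abs(p) for c, b, p in zip(code, bucket, projected))


def hamming_distance(projected, bucket):
    return sum(c ^ b for c, b in zip(binary_code(projected), bucket))


projected = [0.1, -2.0, 0.8]      # the query's own code is [1, 0, 1]
near_bucket = [0, 0, 1]           # differs in the dimension that is barely positive
far_bucket = [1, 1, 1]            # differs in the dimension that is strongly negative

qd_near = quantization_distance(projected, near_bucket)
qd_far = quantization_distance(projected, far_bucket)
```

Both buckets are at Hamming distance 1 from the query, so Hamming ranking cannot order them, while QD prefers the bucket that only requires a small change to the projected vector.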
DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions
- Summary
  1. This paper presents DimBoost, a system for training gradient boosting decision tree (GBDT) on high-dimensionality data.
  2. DimBoost uses the parameter server (PS) architecture, where several machines together store a parameter to prevent the single-point bottleneck, and provide interfaces for workers to push and pull parameters. Each worker holds a local copy of the parameter, and periodically pushes parameter updates to the PS.
  3. To speed up histogram construction, they propose a sparsity-aware algorithm, and construct it in batch with layer-wise parallelism.
  4. To speed up split finding, they compress the gradient histogram (with loss of precision), use round-robin scheduling to assign the task of splitting active nodes to workers, and use a two-phase approach (first, each server finds its local optimal split and sends it to the assigned worker; second, the worker finds the global optimum among the local optimal splits) to reduce the communication overhead.
  5. DimBoost is implemented in Java and deployed on Yarn. They use Netty to manage the message passing between physical machines.
- Comments
  1. This paper is well written and structured.
  2. It is good to cooperate with companies because they have huge volumes of real-world data and real-world problems to be solved.
Auto-Detect: Data-Driven Error Detection in Tables
- Summary
  1. This paper presents Auto-Detect, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection.
  2. They convert values to patterns using some generalization languages, and detect single-column errors by checking the point-wise mutual information (PMI). By aggregating the scores (PMIs) of different generalization languages, it gives the final prediction of whether a value is an error.
  3. To generate the training data, they use distant supervision to augment the labeled data into a large training dataset. They also develop an approximate static-threshold aggregation algorithm to find a threshold for each generalization language. If any of the generalization languages gives a score lower than its threshold, the value pair is predicted as incompatible. To reduce the memory footprint, they use sketches (and thus control the memory budget).
- Comments
  1. This paper is well written and structured.
  2. The generalization languages still have to be composed manually. And we have to run the whole framework for each combination of columns.
  3. If I understand correctly, for each column, they need to find another paired column. I didn’t find anything in the paper about how to find this paired column.
  4. I am curious about the running time of their algorithms; it seems that they would be slow. However, there are no related experiments in the paper.
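As a rough illustration of the pattern-plus-PMI idea, the Python sketch below uses a single hypothetical generalization language (digits become `D`, letters become `L`) and plain PMI with additive smoothing over a toy corpus of columns. The smoothing constant, the corpus and the function names are assumptions of mine, not details from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def generalize(value):
    # One hypothetical generalization language: keep punctuation, map
    # digits to D and letters to L.
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("L")
        else:
            out.append(ch)
    return "".join(out)


def build_stats(corpus_columns):
    # Count in how many columns each pattern, and each pair of patterns, occurs.
    single, pair = Counter(), Counter()
    for column in corpus_columns:
        patterns = {generalize(v) for v in column}
        single.update(patterns)
        pair.update(frozenset(p) for p in combinations(sorted(patterns), 2))
    return single, pair, len(corpus_columns)


def pmi(p1, p2, single, pair, n, smoothing=0.1):
    # log( P(p1, p2) / (P(p1) * P(p2)) ) for two distinct patterns,
    # with additive smoothing so that unseen pairs get a finite score.
    joint = (pair[frozenset((p1, p2))] + smoothing) / n
    marg1 = (single[p1] + smoothing) / n
    marg2 = (single[p2] + smoothing) / n
    return math.log(joint / (marg1 * marg2))


corpus = [
    ["2011-01-01", "2012-03-04"],
    ["2011-01-01", "2013-05-06"],
    ["2011/01/01", "2012/02/02"],
    ["2014/01/01", "2015/02/02"],
    ["2011-01-01", "1999"],
    ["2001", "1999"],
]
single, pair, n = build_stats(corpus)
score_mixed_dates = pmi(generalize("2011-01-01"), generalize("2011/01/01"), single, pair, n)
score_date_year = pmi(generalize("2011-01-01"), generalize("1999"), single, pair, n)
```

The two date formats never share a column in the toy corpus, so their PMI is strongly negative and a column mixing them would be flagged, while a dashed date next to a bare year scores slightly positive.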

Session 14: Approximate Query Processing

VerdictDB: Universalizing Approximate Query Processing
- Summary
  1. This paper presents VerdictDB, a middleware rewriting analytical queries to compute an approximate answer and error estimates.
  2. VerdictDB creates samples offline and rewrites the query to make it execute on the samples, including uniform, hashed, stratified and irregular samples.
  3. They propose the variational sub-sampling to enable faster error estimation while retaining the same asymptotic properties.
- Comments
  1. This paper is well written and structured.
  2. I don’t think middleware is the right way to go for AQP. It provides good generalization and flexibility; however, it also makes low-level optimizations difficult.
  3. This paper talks about offline sampling, while online sampling is also an important topic.
AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics
- Summary
  1. This paper presents AQP++, which connects sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) and achieves a better trade-off among preprocessing cost, query response time, and answer quality. The basic idea is that it uses precomputed exact answers as the starting point, and then estimates the difference between the precomputed answer and the real answer by sampling. Therefore, AQP++ subsumes AQP and AggPre.
  2. Since AQP++ uses AQP to estimate the difference, if we compare the variances of their estimations, AQP++ returns a more accurate result when the correlation between the original query and the difference query is large.
  3. To build the pre-computed aggregation, they propose blocked prefix cube which computes a small portion of the cells in the traditional prefix cubes. The problem of selecting blocks can be regarded as an optimization problem that minimizes the expected query error. They propose an adaptive hill climbing approach to address this problem.
  4. To decide which pre-computed aggregation to use, they use sub-sampling to estimate the confidence interval of the difference query, and choose the one with the smallest confidence interval.
  5. They ran their system on TPCD-Skew, BigBench (a synthetic dataset from the Big Data Benchmark) and TLCTrip (from the NYC Taxi and Limousine Commission).
- Comments
  1. This paper is well written and structured.
  2. It is always interesting and useful to combine two seemingly orthogonal (but actually connected in some way) ideas together to achieve better overall performance in the system.
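The core identity, answer = precomputed answer + sampled estimate of the difference, can be sketched in a few lines of Python. The table layout, the systematic sample (used only to keep the example reproducible) and the function names are my own assumptions; the blocked prefix cube and the choice among several precomputed aggregates are not modeled.

```python
def sample_sum(sample, scale, lo, hi):
    # Plain AQP: scale up the SUM over sampled rows with lo <= key < hi.
    return scale * sum(v for k, v in sample if lo <= k < hi)


def aqp_pp_sum(sample, scale, query, pre_range, pre_answer):
    # AQP++: start from the exact precomputed answer and use the sample
    # only to estimate the difference between the two ranges.
    diff = sample_sum(sample, scale, *query) - sample_sum(sample, scale, *pre_range)
    return pre_answer + diff


def exact_sum(data, lo, hi):
    return sum(v for k, v in data if lo <= k < hi)


data = [(k, k % 7 + 1) for k in range(10000)]
sample = data[::20]               # assumed 5% systematic sample
scale = 20

pre_range = (1000, 5000)
pre_answer = exact_sum(data, *pre_range)     # computed offline
query = (1000, 5050)

plain_estimate = sample_sum(sample, scale, *query)
aqp_pp_estimate = aqp_pp_sum(sample, scale, query, pre_range, pre_answer)
```

Because the same sample is used for both ranges, the rows inside the shared range cancel out, and only the 50 keys that differ between the query and the precomputed range contribute sampling error.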

Random Sampling over Joins Revisited
- Summary
  1. This paper presents a general join sampling framework that can be combined with any join size upper bound method, which can process general multi-way joins (acyclic or cyclic, with or without selection predicates).
  2. It implements sampling over a chain of joins by starting from a single root tuple and joining it with all tuples in the next relation; then, with some probability, it either rejects the whole sample (and restarts) or picks a uniform random tuple from the joined results and continues with the next relation. These probabilities can be computed from the upper bounds on join sizes. With this method, each join result is returned with equal probability (uniform and independent).
  3. They used the TPC-H dataset and a social graph dataset of Twitter friendship links and user profiles. They used a KS test to verify that their samples are indeed uniform.
- Comments
  1. This paper is well written and structured.
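For the simplest case of a two-relation chain, the accept/reject step can be sketched as below. This is the classic rejection scheme that the framework generalizes, written under my own assumptions: `max_degree` is a known upper bound on the number of matches per tuple, at least one tuple joins, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def sample_join(R, S, max_degree, rng):
    """Return one uniform sample of R join S, where R holds (a, b) and S holds (b, c)."""
    index = defaultdict(list)
    for s in S:
        index[s[0]].append(s)
    while True:
        r = rng.choice(R)                       # uniform root tuple
        matches = index.get(r[1], [])
        # Accept with probability |matches| / max_degree, else restart.
        if rng.random() < len(matches) / max_degree:
            s = rng.choice(matches)             # uniform tuple among the matches
            return (r[0], r[1], s[1])


R = [(1, "x"), (2, "y"), (3, "z")]
S = [("x", 10), ("x", 11), ("y", 20)]
rng = random.Random(0)
samples = [sample_join(R, S, max_degree=2, rng=rng) for _ in range(200)]
```

Each join result is produced with probability (1/|R|) * (d/max_degree) * (1/d) = 1/(|R| * max_degree) per attempt, where d is the number of matches of the root tuple, so the output is uniform regardless of d.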

Session 15: Database for Emerging Hardware

Efficient Top-K Query Processing on Massively Parallel Hardware
- Summary
  1. This paper presents several algorithms for top-k problem on GPU, including a new algorithm based on bitonic sort.
  2. The algorithm includes three steps: (1) local sort, which generates sorted sequences of size \(k\) using partial bitonic sort; (2) merge, which bitonically merges two sorted sequences of size \(k\); (3) rebuild, which sorts the subsequence with the greater \(k\) elements and discards the subsequence with the smaller \(k\) elements. After each merge and rebuild, the size of the problem is halved, and it recursively applies these two steps until \(k\) elements are left.
  3. They further propose several detailed optimizations about implementing these top-k algorithms on GPU.
  4. They also present a cost model that predicts the performance of these algorithms with respect to \(k\), allowing a query optimizer to choose the best top-\(k\) implementation for a particular query.
- Comments
  1. This paper is well written and structured. It gives a short but useful description of the background knowledge (GPU data access, Sorting on the GPU) with intuitive examples.
  2. This paper is a must-read for people who want to do DB-related research on GPUs.

Some Good Papers in SIGMOD 2018 (Research Sessions 1-8)

Fri, 15 Jun 2018 00:00:00 +0000

Session 1: Data Integration & Cleaning

Deep Learning for Entity Matching: A Design Space Exploration
- Summary
  1. This paper proposes a categorization of DL solutions in EM: (1) attribute embedding; (2) attribute similarity representation: attribute summarization and attribute comparison; (3) classifier.
  2. This paper considers three types of EM problems: structured, textual and dirty.
  3. DL is competitive with state-of-the-art on structured instances with far longer training time, while it outperforms state-of-the-art on textual and dirty instances.
- Comments
  1. This paper is well-written and provides a good summary of DL solutions in EM. It proposes the design space for DL in EM and summarizes the problem types in EM, i.e., both the problem and solutions are well-defined.
  2. This paper is a good example of how to apply DL in database research: categorizing the problem types, summarizing design space, applying DL and doing empirical evaluations.
  3. It is really important to provide heuristics for the design choices, discussions for why the methods are better, goals and takeaways for the experiments.

Session 2: Usability & Security/Privacy

The Data Interaction Game
- Summary
  1. Users actually interact with DBMSs, during which they learn and modify how to express their information needs. This is rarely captured in current query interfaces, i.e., user feedback is not well utilized or user strategies are assumed fixed. Therefore this paper models this as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. It further proposes a reinforcement learning method to explore this model.
  2. This paper firstly uses a reinforcement learning algorithm (Roth and Erev’s model) to imitate and model the user behavior (justified by empirical analysis), then uses the same reinforcement algorithm to train the interactions between users and DBMSs.
- Comments
  1. This paper is too dry to read, probably because it contains too much theoretical material and few illustrative examples. I will probably come back to it when I have a better knowledge of reinforcement learning.
  2. One thing I don’t understand is that they model the user behaviors with a reinforcement learning algorithm, and then they train the interactions with the same algorithm. Since the user behavior is already modeled, it doesn’t make sense to me that reinforcement learning can model the interactions in the real world. Although I understand that collecting user feedback for training the interactions is difficult, this algorithm still seems to make overly strong assumptions and not to be very practical.

Session 3: Transactions & Indexing

Carousel: Low-Latency Transaction Processing for Globally-Distributed Data
- Summary
  1. Most storage systems support geo-distributed and multi-partitioned transactions by layering transaction management, such as concurrency control and two-phase commit (2PC), on top of a consensus protocol, which incurs high latency to commit a transaction because it sequentially executes the layered protocols. This paper addresses this by targeting 2-round Fixed-set Interactive (2FI) transactions (where each transaction consists of a round of reads followed by a round of writes with read and write keys that are known at the start of the transaction) to enable transaction processing to overlap with 2PC and state replication.
  2. By knowing the read and write sets (the assumption of 2FI transactions), the prepare request for 2PC can be sent along with the read request for reading data, therefore the total number of wide-area network roundtrips (WANRTs) observed by the client is at most two (read/prepare + write/commit).
  3. This paper achieves the fast path for preparing (parallelizing 2PC and consensus) by sending prepare requests to all participants rather than only participant leaders, which requires some conditions as discussed in the paper. This saves the time the leader would spend replicating the prepare decision to its followers.
  4. This paper uses two workloads: Retwis and YCSB+T, for comparison against TAPIR.
- Comments
  1. This paper is well written. It gives a very detailed design overview, including assumptions, restrictions and architecture. I think this is really important for system papers.
  2. This paper assumes a restricted transaction (i.e., 2FI transaction) model to parallelize transaction processing, consensus and replication. It points out the drawbacks of this model, i.e., no support for dependent reads and writes, and proposes the solution itself: using reconnaissance transactions (divide the original transaction into two dependent transactions). I think it is a good way to justify your assumptions by pointing out the drawbacks directly and providing solutions.
  3. When justifying anything (e.g., design assumptions), this paper always uses experimental results or cites other papers. I think it is necessary to justify your assumptions in such a way.
  4. It is a good practice to describe the basic version of the system, and then propose some optimizations on top of it to make everything more elegant.
  5. It is important to discuss failure-tolerance, although it is not necessary to implement it.
  6. It is a good practice to describe how you implement your system, especially for system papers.
FASTER: A Concurrent Key-Value Store with In-Place Updates
- Summary
  1. This paper uses an epoch-based synchronization framework combined with trigger actions to facilitate lazy propagation of global changes to all threads. This generalization of the threading model helps simplify the scalable concurrency design.
  2. This paper designs a concurrent, latch-free, resizable, cache-friendly hash index using cache-aligned arrays, atomic operations for deleting keys, tentative bits for latch-free two-phase insert, the aforementioned epoch protection for resizing, and atomic operations for checkpointing.
  3. By combining the hash index with a simple in-memory record allocator such as jemalloc, this paper states that they can build an in-memory key-value store easily.
  4. The epoch-based framework also manages the flushing of log records to secondary storage in a latch-free manner. They use head and tail offsets to control the circular in-memory buffer.
  5. This paper proposes the HybridLog, which combines in-place updates (in memory) and log-structured organization (on-disk), and consists of three regions: stable region (on secondary storage and append-only), read-only region (in-memory and read-copy-update) and mutable region (in-memory and in-place update). The HybridLog can also be used for checkpointing and recovery.
  6. This paper uses one workload: YCSB-A, for comparison against Masstree, Intel TBB concurrent hashmap, RocksDB and Redis.
- Comments
  1. For a system paper, it is important to describe your design goals, user interfaces and architecture to justify that your system is a well-developed system.
  2. One thing I don’t understand: “A thread has guaranteed access to the memory location of a record, as long as it does not refresh its epoch”. Does this mean that this thread’s operations will be coordinated by the epoch framework, so it has the guaranteed access?
  3. This paper is written in an incremental way: describing the basic framework, presenting a basic component, optimizing this component.
  4. Cache-behavior is important for system papers, especially for systems sensitive to IO latency.
  5. A good design philosophy: make the common case faster.
  6. In-place updates are critical for building a fast in-memory key-value store.
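The three HybridLog regions can be illustrated with a single-threaded toy model in Python. Everything here (class name, a Python list as the log, a dict as the hash index, the `shift` method) is a hypothetical simplification of mine: there is no epoch protection, no real storage, and no concurrency.

```python
class HybridLogSketch:
    def __init__(self):
        self.log = []        # logical address -> (key, value)
        self.index = {}      # key -> latest logical address
        self.head = 0        # addresses below head would live on secondary storage
        self.read_only = 0   # addresses in [head, read_only) are in memory but immutable

    def region(self, address):
        if address < self.head:
            return "stable"
        if address < self.read_only:
            return "read-only"
        return "mutable"

    def upsert(self, key, value):
        address = self.index.get(key)
        if address is not None and self.region(address) == "mutable":
            # Mutable region: update the record in place.
            self.log[address] = (key, value)
            return address
        # Otherwise append a new version at the tail (read-copy-update).
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1
        return self.index[key]

    def read(self, key):
        return self.log[self.index[key]][1]

    def shift(self, head, read_only):
        # Move the region boundaries forward as the log grows.
        self.head, self.read_only = head, read_only


store = HybridLogSketch()
first = store.upsert("a", 1)
store.upsert("b", 2)
in_place = store.upsert("a", 3)     # still mutable: same address
store.shift(head=0, read_only=2)
copied = store.upsert("a", 4)       # now read-only: new version at the tail
```

Hot records stay in the mutable region and are updated without growing the log, which is where the in-place-update benefit comes from.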
Workload-Aware CPU Performance Scaling for Transactional Database Systems
- Summary
  1. This paper proposes an on-line workload-aware scheduling and frequency scaling algorithm, POLARIS, which controls both transaction execution order and processor frequency to minimize CPU power consumption while observing per-workload latency targets.
  2. The control of processor frequency is implemented by DVFS (dynamic voltage and frequency scaling), which is standardized as the Advanced Configuration and Power Interface (ACPI). ACPI defines P-States for different voltage and frequency operating points, and C-States for different idle levels. Linux provides a generic CPU power control module cpufreq.
  3. POLARIS aims to find the smallest processor frequency such that all transactions will finish running before their deadlines. It does so by estimating the execution time of transactions using statistical methods (i.e., keeping track of the execution times of past transactions).
  4. This paper gives a theoretical analysis of POLARIS, Yao-Demers-Schenker (YDS) and Optimal Available (OA).
  5. This paper implements the prototype within Shore-MT and uses two workloads: TPC-C and TPC-E for comparison against OS baselines.
- Comments
  1. This paper researches around an interesting problem where few previous studies exist, and it is also important since power consumption is directly related with the maintenance cost of clusters.
  2. The theoretical part is really interesting, in my opinion, there is nothing better than a combination of good theory and useful system.
  3. They have a training phase for POLARIS to gain information about execution time, but I wonder if this undermines the on-line nature of POLARIS.
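The "smallest frequency that still meets all deadlines" idea can be sketched as follows. This is a simplified model under my own assumptions, not the algorithm from the paper: transactions are run in earliest-deadline-first order, each has an estimated cost in CPU cycles, and the processor offers a small discrete set of frequencies.

```python
def smallest_feasible_frequency(transactions, frequencies, now=0.0):
    """transactions: list of (estimated_cycles, deadline_seconds)."""
    ordered = sorted(transactions, key=lambda t: t[1])   # assumed EDF order
    for freq in sorted(frequencies):
        finish, feasible = now, True
        for cycles, deadline in ordered:
            finish += cycles / freq          # estimated completion time
            if finish > deadline:
                feasible = False
                break
        if feasible:
            return freq
    # No frequency meets every deadline: fall back to the fastest one.
    return max(frequencies)


GHZ = 1_000_000_000
frequencies = [1 * GHZ, 2 * GHZ, 3 * GHZ]
queue = [(2 * GHZ, 2.0), (1 * GHZ, 1.0)]     # (cycles, deadline in seconds)
chosen = smallest_feasible_frequency(queue, frequencies)
```

At 1 GHz the second transaction would finish at 3.0 s and miss its 2.0 s deadline, while at 2 GHz both finish in time, so the middle frequency is chosen instead of the fastest one.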

Session 4: Query Processing

How to Architect a Query Compiler, Revisited
- Summary
  1. This paper uses Futamura projections to link interpreters and compilers through specialization, which is also the guiding principle in the design of query compilers. Partially evaluating an interpreter with respect to a source program produces a compiled version of that program; this is known as the first Futamura projection.
  2. For code generation, the query engine has different options for where to place it: (1) pure template expansion: each operator is specialized as a string with placeholders for parameters; (2) programmatic specialization: push the specialization into the structures that make up the query engine; (3) optimized programmatic specialization: push code generation to the level of primitive types and operations; (4) lightweight modular staging (LMS): an intermediate representation similar to LLVM.
  3. LB2 (the system built in this paper) uses an abstract class Record as the entry point and an abstract class Buffer as the storage to support both row and column layout; it also abstracts the data structures and indexes. LB2 also hoists memory allocation and other expensive operations from frequently executed paths to less frequent paths, e.g., pre-allocating memory in advance for hash join and aggregate queries. LB2 enables parallelism by modifying the internal logic of operators through ParOp.
  4. This paper compares against Postgres, Hyper and DBLAB. It uses TPC-H as the workload.
- Comments
  1. This paper is well written. However, since I am not familiar with query compilers, I don’t understand some parts, e.g., how parallelism works in practice. It is worth reading again if needed.
  2. Although this paper proposes a quite novel way for compiling query, almost all real world systems (both in industry and academia) still stick to the old paradigm. It will be interesting to see if new DBMSs adopt such design.
  3. I don’t know if code generation is only used inside database community. I guess the answer is probably yes since code generation is simply a technique to convert domain-specific languages (DSLs) to low-level language for performance’s sake.
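To illustrate the link between an interpreter and a compiler through specialization, here is a tiny Python example with a hypothetical expression language of my own. The specializer is written by hand rather than derived by an automatic partial evaluator, so it only mimics the effect of the first Futamura projection: the expression tree is walked once, and what remains is code that depends only on the row.

```python
def interpret(expr, row):
    # Interpreter: walks the expression tree again for every row.
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    if op == "add":
        return interpret(expr[1], row) + interpret(expr[2], row)
    if op == "gt":
        return interpret(expr[1], row) > interpret(expr[2], row)
    raise ValueError(op)


def specialize(expr):
    # Same structure as the interpreter, but it emits source code instead of
    # computing values: the dispatch on `op` happens once, at compile time.
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "const":
        return repr(expr[1])
    if op == "add":
        return f"({specialize(expr[1])} + {specialize(expr[2])})"
    if op == "gt":
        return f"({specialize(expr[1])} > {specialize(expr[2])})"
    raise ValueError(op)


def compile_expr(expr):
    source = f"lambda row: {specialize(expr)}"
    return eval(source), source


predicate = ("gt", ("add", ("col", "price"), ("const", 5)), ("const", 100))
compiled, source = compile_expr(predicate)
row = {"price": 97}
```

The generated source contains no trace of the tree walk, which is the kind of residual program a query compiler aims for.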
SuRF: Practical Range Query Filtering with Fast Succinct Tries
- Summary
  1. SuRF is a data structure for approximate membership test, which supports both single-key lookups and range queries. The core data structure in SuRF is the Fast Succinct Trie (FST), which encodes upper levels with a fast bitmap-based encoding scheme (LOUDS-Dense, speed-efficient) and encodes lower levels with a space-efficient LOUDS-Sparse scheme. Level-Ordered Unary Degree Sequence (LOUDS) traverses the nodes in a breadth-first order and encodes each node’s degree using the unary code.
  2. Based on the observation that all bit-sequences require either rank or select support but not both, SuRF further optimizes rank, select and local search. They use sampling with pre-computed values to optimize rank and select, and 128-bit SIMD instructions to perform the label search (which is faster than binary search). They also use prefetching so that relevant addresses in other sequences can be computed for future use.
  3. Since FST itself is a trie-based index structure, SuRF must truncate it (i.e., remove lower levels and replace them with suffix bits extracted from the key) to balance a low false positive rate against small memory usage.
  4. This paper compares against B-trees (and their variants), other succinct tries and Bloom filters. It uses YCSB and a synthesized dataset of integer and string keys as the workloads. They also integrate SuRF into RocksDB and run an evaluation on time-series data.
- Comments
  1. In general, this paper is well written and it gives comprehensive evaluations from both algorithm’s and system’s perspective.
  2. Although the idea of using fast succinct tries is not novel, combining two different encodings together is a good way to balance between speed and space.
  3. I think this paper is influential (and got the best paper award) because it states that it solves a practical problem: range queries for membership tests. However, the practicality still needs to be proved since real world workloads are far more complicated.
  4. It is important to show your techniques have applications when you write the paper, and it will be even better when you can integrate your stuff into a real world widely-used system.

Session 5: Graph Data Management

TopPPR: Top-k Personalized PageRank Queries with Precision Guarantees on Large Graphs
- Summary
  1. Top-k PPR (Personalized PageRank) query is an important building block for web search and social networks, such as Twitter’s Who-To-Follow recommendation service. However, previous studies cannot guarantee the precision and performance with small overheads at the same time. TopPPR provides a method that (1) is \(\rho\)-precise with at least 1 - 1/n probability, (2) doesn’t require preprocessing, and (3) is computationally efficient.
  2. TopPPR first performs forward search from the source node, and then conducts random walks from those nodes with non-zero forward residues; after that, it applies backward search from some target nodes and combines the results with the random walks to estimate PPR values for top-k derivation.
  3. TopPPR adopts the filter-refinement paradigm for top-k processing. In the filter step, it computes a rough estimation of each node’s PPR, based on which it identifies a candidate node set C; and then in the refinement step, it iteratively refines the PPR estimation for each node in C, until it derives the top-k results with high confidence. Specifically, after forward search, during sampling, it utilizes Bernstein inequality to find a confidence bound, then it employs backward search on those potential nodes to reduce the variation of the confidence bound.
  4. TopPPR employs \(\sqrt{1 - \alpha}\)-walk to improve over random walk with restart, which can update the estimation of more nodes and produce lower variance.
- Comments
  1. This paper is very well written. It has a good motivation (e.g., a real-world application at Twitter), and it points out the drawbacks of previous studies. Then it formulates the problem, describes the basic techniques for PPR computation, and explains the state-of-the-art solutions. After that, it lays out the challenges and ideas and provides an analysis and discussion of the proposed algorithm. The evaluation part is also comprehensive.
  2. It is a fruitful direction combining random methods (e.g., sampling) and deterministic methods (e.g., breadth-first search) adaptively.
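
The forward search step in the TopPPR summary can be illustrated with the standard local forward push procedure. This is only a minimal sketch of that building block, not the TopPPR algorithm itself: the function name `forward_push`, the adjacency-dict graph format, and the default values of `alpha` and `r_max` are assumptions made for illustration.

```python
from collections import defaultdict


def forward_push(graph, source, alpha=0.2, r_max=1e-4):
    """Approximate personalized PageRank from `source` by forward push.

    graph: dict mapping every node to a non-empty list of out-neighbors
           (an assumption of this sketch; dangling nodes are not handled).
    Returns (reserve, residue). reserve[v] is a lower estimate of the PPR of v,
    and residue[v] is the probability mass that has not been pushed yet.
    """
    reserve = defaultdict(float)
    residue = defaultdict(float)
    residue[source] = 1.0
    frontier = [source]
    while frontier:
        u = frontier.pop()
        deg = len(graph[u])
        if residue[u] / deg <= r_max:
            continue  # entry is stale: u was already pushed
        r_u = residue[u]
        residue[u] = 0.0
        # Keep an alpha fraction at u, spread the rest over its out-neighbors.
        reserve[u] += alpha * r_u
        share = (1.0 - alpha) * r_u / deg
        for v in graph[u]:
            residue[v] += share
            if residue[v] / len(graph[v]) > r_max:
                frontier.append(v)
    return dict(reserve), dict(residue)


graph = {0: [1, 2], 1: [2], 2: [0]}
reserve, residue = forward_push(graph, source=0)
```

The nodes left with non-zero residue are the ones from which, according to the summary, TopPPR would then start its random walks.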

Session 6: Storage & Indexing

The Case for Learned Index Structures
- Summary
  1. This paper observes that B-trees can be seen as a model mapping a key to the position of a record, and utilizes deep-learning to train learned indexes based on this observation. The experimental results show that learned indexes have significant advantages (both in space and speed) over traditional indexes.
  2. For learned indexes, this paper proposes a recursive model index, which is similar to the mixture of experts in the machine learning community and learns the distribution of data level by level. Further, it can also include traditional indexes (e.g., B-trees) as nodes in the model tree if the distribution is difficult to learn. During search, it queries the recursive model index and then uses either binary or biased quaternary search.
  3. Besides range index (i.e., B-trees), learned indexes can also work well on point index (i.e., hash maps) and existence index (e.g., bloom filter).
- Comments
  1. As the paper says, the idea of replacing core components of a DBMS with learned models has far-reaching implications for future system designs. Probably led by this paper, there has been a trend of research at the intersection of machine learning and systems, the so-called machine learning for systems and systems for machine learning. I think this area is a very promising direction in both systems and machine learning research, with huge real-world impact.
  2. Although this paper doesn’t contain much complicated theory or system design, it brings up a fundamental question: can machine learning and systems benefit each other? Therefore, a great paper not only makes good technical contributions, but also guides the future direction of the whole community.
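
The observation that an index is a model mapping a key to a position can be made concrete with a small sketch. This is a deliberately simplified, single-stage version (one linear model instead of the paper's recursive model index), and the class name `LinearLearnedIndex` and its methods are hypothetical names chosen for illustration.

```python
import bisect


class LinearLearnedIndex:
    """One linear model over a sorted key array, plus a bounded local search."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst prediction error over the stored keys bounds every lookup.
        self.max_err = max(
            abs(p - self._predict(k)) for p, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return a position of `key` in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(n, max(0, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 108])
position = index.lookup(23)
```

The model replaces the tree traversal, and the recorded maximum error keeps the final search window small when the key distribution is easy to fit.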
Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
- Summary
  1. This paper investigates the space-time trade-off in LSM-trees by introducing Lazy Leveling (which removes merge operations from all levels of the LSM-tree but the largest) and Fluid LSM-tree (which generalizes the entire LSM-tree design by parameterizing the merges). They put everything together to design the key-value store Dostoevsky and implement it on top of RocksDB, and they show that it strictly dominates state-of-the-art designs in terms of performance and storage space.
  2. LSM-tree optimizes for write-heavy workloads, and it organizes runs into L conceptual levels of exponentially increasing sizes. All designs today use either one of two merge policies: tiering or leveling (e.g., Cassandra and RocksDB use tiering and leveling by default, respectively). With tiering, we merge runs within a level only when the level reaches capacity; with leveling, we merge runs within a level whenever a new run comes in.
  3. There is a trade-off between update cost and the costs of lookups and space-amplification. Leveling has strictly better lookup costs and space-amplification and strictly worse update cost than tiering. Furthermore, point lookup cost, long range lookup cost, and space-amplification derive mostly from the largest level, while update cost derives equally from across all levels.
  4. They present lazy leveling, which applies leveling at the largest level and tiering at all other levels. It improves the cost complexity of updates, maintains the same complexity for point lookups, long range lookups, and space-amplification, and provides competitive performance for short range lookups.
  5. They propose fluid LSM-tree, a generalization of LSM-tree that enables switching and combining merge policies. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels.
- Comments
  1. This paper gives a good introduction of LSM-tree and summarizes its properties and operations in great detail.
  2. This paper is well written and organizes its content in an array of bullet points, making the explanation clear and the transitions smooth.
  3. For the evaluation section, we can start each paragraph with the key observation (e.g., Dostoevsky dominates existing systems).
  4. This paper reminds me that traditional data structures can actually be parameterized to allow for flexibility and even better trade-offs.
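
The difference between the merge policies summarized above can be sketched by counting how many sorted runs a level may hold before it is merged. The function name `max_runs_per_level` and the parameters `runs_small` and `runs_largest` are hypothetical, and the bound of `size_ratio - 1` runs per tiered level is an assumption of this sketch rather than something stated in these notes.

```python
def max_runs_per_level(num_levels, runs_small, runs_largest):
    """Run bound per level, smallest level first.

    runs_small:   bound on runs at every level except the largest
    runs_largest: bound on runs at the largest level
    Controlling the two bounds separately mirrors the fluid LSM-tree idea.
    """
    return [runs_small] * (num_levels - 1) + [runs_largest]


def policy_bounds(policy, num_levels, size_ratio):
    """Express the three named policies as settings of the two bounds."""
    tiered = size_ratio - 1  # assumed bound for a tiered level
    if policy == "leveling":
        return max_runs_per_level(num_levels, 1, 1)
    if policy == "tiering":
        return max_runs_per_level(num_levels, tiered, tiered)
    if policy == "lazy_leveling":
        return max_runs_per_level(num_levels, tiered, 1)
    raise ValueError(f"unknown policy: {policy}")


for name in ("leveling", "tiering", "lazy_leveling"):
    print(name, policy_bounds(name, num_levels=4, size_ratio=10))
```

Under these assumptions, lazy leveling keeps a single run at the largest level, where most of the lookup and space cost comes from, while smaller levels merge as rarely as under tiering.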
HOT: A Height Optimized Trie Index for Main-Memory Database Systems
- Summary
  1. This paper presents the Height Optimized Trie (HOT), a fast and space-efficient in-memory index whose core idea is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. They also carefully engineer the layout of each node for compactness and fast search using SIMD instructions.
  2. To use the space more efficiently, HOT combines the nodes of a binary trie into compound nodes, and it features a data-dependent span and a fixed maximum fanout. During insertion, there are four cases: a normal insert only modifies an existing node, whereas leaf-node pushdown creates a new node. Overflows are either handled using a parent pull up or intermediate node creation.
  3. As for the node layout, the size of the partial keys and the representation of the bit positions (single-mask or multi-mask) can be adapted to fit the data distribution. Partial keys are used to parallel the lookup with SIMD instructions.
  4. HOT uses the combination of copy-on-write and CAS for lookup. For modification, it detects the set of affected nodes and acquires a lock for each of them. It also marks nodes as obsolete instead of directly reclaiming their memory, and it uses an epoch-based memory reclamation strategy.
  5. This paper compares HOT with ART, Masstree, and STX B+-tree. The workload is an extension of YCSB. The results show that HOT is 2x more space efficient, generally outperforms its state-of-the-art competitors in terms of lookup and scan performance, and features the same linear scalability.
- Comments
  1. The paper follows a general-to-specific style of writing: it first presents the overall data structure, then fills in the missing details. This style makes it much easier for readers to grasp the main idea and understand the algorithm.
The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models
- Summary
  1. This paper presents the Data Calculator, an interactive and semi-automated design engine for data structures. It offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. It also supports computation of performance using learned cost models.
  2. The Data Calculator firstly proposes a set of design primitives as fundamental design choices with different domains. Then it introduces elements as a full specification of a single data structure node which defines the data and access methods used to access the node’s data.
  3. The Data Calculator computes the cost (latency) of running a given workload on a given hardware for a particular data structure specification by analyzing the data access primitives and using learned cost models to synthesize the cost of complex operations. These cost models are trained and fitted for combinations of data and hardware profiles.
  4. The Data Calculator supports what-if design (comparing different design specifications) and auto-completion (benchmarking all candidate elements).
- Comments
  1. This paper provides lots of well-depicted figures with colorful elements, structured layout and illustrative text. This is a good practice when the concepts are difficult to explain in plain text.
A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases
- Summary
  1. This paper presents a taxonomy of NoSQL secondary indexes: Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). They build a system, LevelDB++, on top of LevelDB to measure two embedded indexes and three state-of-the-art stand-alone indexes.
  2. The experimental study and theoretical evaluation show that none of these indexing techniques dominate the others: the embedded indexes offer superior write throughput and are more space efficient, whereas the stand-alone indexes achieve faster query response times. Thus the optimal choice of secondary index depends on the application workload.
Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration
- Summary
  1. This paper proposes four features that a map rendering system should support: representativeness, visibility constraint, zooming consistency and panning consistency. They further formulate the Interactive Spatial Object Selection (ISOS) problem and devise a greedy algorithm to address it. They also propose sampling and pre-fetching strategies to improve the efficiency of their algorithm.
- Comments
  1. This paper is well written and easy to read. Also, the authors formulate the problem in a precise and easy-to-understand way, and the solutions are presented with good explanation and illustrative examples.
  2. I think the problem discussed in this paper is pretty important, and those constraints are representative for interactive map exploration. The authors did a good job by formulating the complex problem into a well-formed “mathematical” problem and used good techniques to address it. They also provided interesting optimizations and solid proof.
  3. It is important to prove that your proposed metric is useful, especially by experimental evaluations.

Session 7: Tuning, Monitoring & Query Optimization

Query-based Workload Forecasting for Self-Driving Database Management Systems
- Summary
  1. This paper presents QueryBot 5000, which forecasts the expected arrival rate of queries in the future. It first pre-processes the queries by converting them into templates, then maps each template to the most similar group of previous queries based on its semantics (e.g., the accessed tables) using clustering techniques, and finally trains models on the large clusters to predict their arrival rates.
  2. The features for clustering are based on the arrival rate history. They combine the ensemble of linear regression and RNN and kernel regression (which is good at predicting spikes) to predict the arrival rate.
- Comments
  1. This paper is well written and easy to read.
  2. The whole method is essentially a two-step clustering (by query template, then by arrival rate history) followed by a simple model that predicts the future arrival rate from past arrival rates. Basically, all the query-specific features (e.g., the logical and physical features mentioned in the paper) are discarded, so it predicts the future from the past without using any DB-related context information. I think it is probably better to embed more DB information into the feature space.
  3. Also, the whole framework is not end-to-end: there are three components that are trained or tuned individually. An end-to-end approach may further improve the performance.
  4. If I understand correctly, the system doesn’t work for new queries (ones it has never seen before). Basically, this system can only find arrival patterns of relatively fixed workloads.
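
The pre-processing step described in the QueryBot 5000 summary (converting queries into templates so that their arrivals can be counted together) can be sketched as follows. This is a rough regex-based illustration under simplifying assumptions, not the paper's implementation: the function name `to_template` and the `?` placeholder are hypothetical, and a real system would more likely rely on a SQL parser.

```python
import re
from collections import Counter

# Single-quoted SQL string literals, allowing '' as an escaped quote.
_STRING = re.compile(r"'(?:[^']|'')*'")
# Standalone integer or decimal literals (digits inside identifiers are kept).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def to_template(sql):
    """Replace literal constants with placeholders and normalize spacing/case."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split()).lower()


queries = [
    "SELECT * FROM users WHERE id = 42",
    "select *  from users where id = 7",
    "SELECT name FROM t1 WHERE city = 'New York' AND age > 30",
]
# Arrival counts per template; a time-bucketed version of these counts would
# form the arrival rate history used for clustering and forecasting.
arrivals = Counter(to_template(q) for q in queries)
```

The first two queries collapse into the same template, so their arrivals are counted together.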
On the Calculation of Optimality Ranges for Relational Query Execution Plans
- Summary
  1. This paper analyzes the optimality range, which is the range of cardinalities of an intermediate result within which the current plan remains optimal.
  2. To compute the optimality range, they propose the Parametric Cost Function (where the cardinalities of some intermediate results are parameters) and maintain the optimality ranges in a data structure called the Optimal Plans Container. The basic idea is to find the intersection points with the cost functions of other candidate plans.
  3. To keep the candidate space small, they adopt Bellman’s principle of optimality (an optimal solution can always be constructed from optimal sub-solutions) and enumerate only plans consisting of subplans that are optimal somewhere. They further prune plans whose optimality ranges fall outside the current optimality range. They also derive theoretical worst-case bounds for the number of enumerated pipelines.
  4. Optimality ranges can be used in the following cases: (1) execution plan caching: once the cardinality is out of its optimality range, the cached plan is not optimal anymore and evicted from the cache; (2) parametric queries: store a range in which a plan is optimal instead of a cost point (i.e., a configuration of parameters); (3) mid-query re-optimization: deciding if the optimizer should be invoked again while the current approach uses simple heuristics or considers only a subset of the alternatives.
  5. The workloads are TPC-H, Join Order Benchmark (JOB) and a generated one. However, the improvement for mid-query re-optimization is not very impressive (around 20%).
- Comments
  1. Although the authors stated that they considered all the alternative plans, actually they made some assumptions on the possible search space.
  2. The optimality ranges are an interesting and important topic in query optimizer. I think probably we may design a better query optimizer with small computation overheads by using these optimality ranges.
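
The idea of finding intersection points between parametric cost functions can be illustrated with a toy case. The sketch below assumes each plan's cost is linear in a single cardinality `x`, i.e. `cost(x) = intercept + slope * x`; this linear form, the function name `optimality_range`, and the example numbers are all assumptions for illustration, since the paper's cost functions are more general.

```python
import math


def optimality_range(current, alternatives):
    """Cardinality interval in which `current` is no more costly than any alternative.

    Each plan is an (intercept, slope) pair with cost(x) = intercept + slope * x.
    Returns (low, high), or None if the current plan is never the cheapest.
    """
    a0, b0 = current
    low, high = 0.0, math.inf
    for a, b in alternatives:
        if b == b0:
            if a < a0:
                return None  # parallel and always cheaper than the current plan
            continue
        crossing = (a - a0) / (b0 - b)  # where both plans cost the same
        if b0 > b:
            high = min(high, crossing)  # alternative wins for larger inputs
        else:
            low = max(low, crossing)  # alternative wins for smaller inputs
    return (low, high) if low <= high else None


# Hypothetical plans: the current plan has a setup cost of 100 and costs 1 per row.
rng = optimality_range((100, 1.0), [(0, 5.0), (400, 0.5)])
print(rng)  # → (25.0, 600.0)
```

If the observed cardinality leaves this interval, a cached plan would be evicted or the optimizer re-invoked, matching the use cases listed in the summary.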
Adaptive Optimization of Very Large Join Queries
- Summary
  1. This paper presents an adaptive optimization framework that scales to queries with thousands of joins. It uses the search space linearization technique to find near-optimal execution plans for large classes of queries.
  2. PostgreSQL uses dynamic programming to find the optimal join order for queries with fewer than 12 relations and switches to genetic algorithms for larger queries, while DB2 uses dynamic programming and switches to a greedy strategy when the queries become large.
  3. For small queries (fewer than 14 relations, or the number of connected subgraphs is within the chosen budget of 10,000), the framework uses DPHyp. For medium queries (up to 100 relations), the framework first uses IKKBZ to linearize the search space into a linear ordering of relations, and it restricts the DP algorithm to consider only connected subchains of this linear relation ordering. For large queries, it adopts the idea from Iterative DP: it first constructs an execution plan using a greedy algorithm (Greedy Operator Ordering), and then improves that plan by running a more expensive optimization algorithm (the algorithm for medium queries) on subplans up to size k (where k can be changed to control the optimization time).
  4. This paper uses standard benchmarks: TPC-H, TPC-DS, LDBC BI, Join Order Benchmark, SQLite test suite. They also generate synthetic queries for scalability experiments.
- Comments
  1. This paper is well written and easy to read. It also offers a pretty useful related work section that presents almost all the methods for finding a join order, and the authors essentially provide a hybrid approach that uses different methods for different cases.
  2. It provides a section of implementation details, which is pretty important for system papers.
  3. I think this paper not only is a comprehensive summary of previous studies on joins, but also provides a practical way to handle joins in different cases.
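
The size-based choice of algorithm described in the summary can be written down as a small dispatch function. The thresholds come from the notes above; the function name `choose_join_ordering_strategy`, its parameters, and the returned labels are hypothetical, and the sketch only selects a strategy without implementing any of the named algorithms.

```python
def choose_join_ordering_strategy(num_relations, num_connected_subgraphs,
                                  budget=10_000):
    """Pick a join ordering strategy by query size, following the notes above."""
    if num_relations < 14 or num_connected_subgraphs <= budget:
        # Small query: exact dynamic programming is affordable.
        return "DPHyp"
    if num_relations <= 100:
        # Medium query: linearize with IKKBZ, then DP over connected subchains.
        return "linearized DP"
    # Large query: greedy plan first, then refine subplans up to size k.
    return "greedy + iterative DP"


print(choose_join_ordering_strategy(8, 200))          # → DPHyp
print(choose_join_ordering_strategy(60, 5_000_000))   # → linearized DP
print(choose_join_ordering_strategy(2000, 10 ** 9))   # → greedy + iterative DP
```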
Improving Join Reorderability with Compensation Operators
- Summary
  1. This paper presents a novel approach for join reordering problem for queries involving inner-joins, single-sided outer-joins, and/or antijoins, i.e., providing a more comprehensive enumeration of possible orders for join.
  2. There are two state-of-the-art approaches for the join reordering problem: (1) Transformation-Based Approach (TBA), which enumerates all valid join reorderings using the associativity and commutativity properties of the join operators; (2) Compensation-Based Approach (CBA), which permits certain invalid join reorderings as long as they can be compensated to become valid. This paper proposes an algorithm based on the compensation-based approach, which introduces two new compensation operators to allow more rules for reordering joins.
  3. This paper presents the enumeration algorithm based on the reordering algorithm, and optimizes it to enable reuse of query subplans by finding the equivalence while considering the dependencies of query nodes.
  4. This paper uses TPC-H as its workload and implements the algorithm in PostgreSQL.
- Comments
  1. This paper essentially provides a way to extend the search space of all possible joins and explains how to explore this search space.
  2. The idea of compensation is an interesting topic in many areas: by introducing some compensation operations, we can apply some “invalid” transformations to gain flexibility and performance.

Session 8: Spatial Data & Streams

DITA: Distributed In-Memory Trajectory Analytics
- Summary
  1. This paper presents DITA, a distributed in-memory trajectory analytics system that supports trajectory similarity search and join (both threshold-based and KNN-based) with numerous similarity functions. The core idea is a filter-verification framework that employs a trie-based method to efficiently search for similar trajectories. To distribute the computation, a partitioning method is proposed based on the trie indexing mechanism, and a cost model is constructed to balance the workloads. The whole system is built on Spark.
- Comments
  1. This is my paper and I think it is good. It supports huge volumes of trajectories and significantly outperforms the state of the art.
Sketching Linear Classifiers over Data Streams
- Summary
  1. This paper presents the Weight-Median Sketch for learning linear classifiers over data streams that supports approximate retrieval of the most heavily-weighted features. In other words, it does dimension reduction in the streaming setting.
  2. It essentially adopts the idea of count-sketch to save the gradients (for training the classifier) in a sketch-like structure. By “active set”, they simply introduce a heap to store heavy hitters (i.e., heavy weights).
  3. They show three applications: streaming explanation (outlier detection), network monitoring (identifying differences between concurrent traffic streams), streaming pointwise mutual information (finding correlation between events).
- Comments
  1. This paper is well written. It sort of “transfers” an idea from one sub-area to another, which I think is interesting.
  2. It is important to show some real world applications in the paper.
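
The count-sketch idea mentioned in the Weight-Median Sketch summary can be sketched as below. This is a plain Count-Sketch that accumulates signed updates and answers with a median over rows; it is not the paper's Weight-Median Sketch (there is no classifier training or active-set heap here). The class name `CountSketch`, the default sizes, and the use of BLAKE2 for hashing are assumptions made for illustration.

```python
import hashlib
from statistics import median


class CountSketch:
    """Fixed-size table that approximates per-key sums of signed updates."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, key):
        # Derive a row-specific bucket and a +/-1 sign from one hash value.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket = (h >> 1) % self.width
        sign = 1 if h & 1 else -1
        return bucket, sign

    def update(self, key, delta):
        """Add `delta` (e.g., a gradient step for feature `key`) to the sketch."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            self.table[row][bucket] += sign * delta

    def estimate(self, key):
        """Median over rows of the signed bucket values for `key`."""
        values = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, key)
            values.append(sign * self.table[row][bucket])
        return median(values)


sketch = CountSketch()
sketch.update("feature_a", 0.5)
sketch.update("feature_a", -0.2)
weight = sketch.estimate("feature_a")  # close to 0.3 when nothing collides
```

Taking the median across rows is what makes the estimate robust to collisions in a minority of rows, which is why heavily weighted features can still be recovered from a small table.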