Big Data

The problem

As the volume of data increases rapidly, most traditional machine learning algorithms become computationally prohibitive. Moreover, a dataset can be so large that it no longer fits in a single machine's memory. We propose Coreset-Based Conformal Prediction, a strategy for dealing with big data by working on a small weighted summary of the original input: a coreset.

Most of this research was carried out by my student Nery Riquelme-Granada.

Our approach

A coreset is a small weighted set of data (i.e. a summary), such that the solution computed on the summary is provably close to the solution computed on the full data. Ideally, a coreset should be significantly smaller than the original dataset.
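As a toy illustration of the idea (not our actual construction), consider a dataset with many duplicated points: collapsing the duplicates into unique points weighted by their multiplicity gives a trivially exact "coreset", since any weighted learner recovers the full-data solution from far fewer points. A minimal sketch:

```python
import numpy as np

# Toy illustration: duplicates collapsed into weighted unique points
# act as an exact weighted summary for any weighted estimator.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 2))
# Build a "big" dataset by repeating each base point many times.
full = np.repeat(base, rng.integers(1, 50, size=10), axis=0)

# Summary: unique rows plus multiplicity weights.
coreset, weights = np.unique(full, axis=0, return_counts=True)

full_mean = full.mean(axis=0)
coreset_mean = (weights[:, None] * coreset).sum(axis=0) / weights.sum()
assert np.allclose(full_mean, coreset_mean)
print(len(full), "->", len(coreset), "weighted points")
```

Real coresets handle the harder case where no exact duplicates exist, trading exactness for a provable approximation guarantee.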

The most effective randomised approach for constructing coresets is nonuniform sampling, specifically importance sampling. In this approach, rather than assigning equal probability to all input data points, we first compute an importance score that tells us "how redundant" a data point is for our learning problem. This score is called the sensitivity of the point, and it is the central quantity for constructing coresets non-uniformly. Once we have computed the sensitivity of each input point, we sample M points according to these sensitivities. The final step is to compute weights for the sampled points, which are generally inversely proportional to the sensitivity scores, and return the M weighted points.
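The three steps above can be sketched as follows. This is a hedged illustration, not our implementation: the sensitivity proxy used here (distance from the data mean, a common surrogate for clustering-type costs) is an assumption for demonstration purposes, since the true sensitivities depend on the learning problem at hand.

```python
import numpy as np

def importance_coreset(X, M, seed=None):
    """Sample an M-point weighted coreset via importance sampling.

    Assumed sensitivity proxy: distance from the data mean. The real
    sensitivity scores are problem-specific; this is illustrative only.
    """
    rng = np.random.default_rng(seed)
    # Step 1: sensitivity score per point (small epsilon avoids zero probabilities).
    s = np.linalg.norm(X - X.mean(axis=0), axis=1) + 1e-12
    # Step 2: sample M indices with probability proportional to sensitivity.
    p = s / s.sum()
    idx = rng.choice(len(X), size=M, replace=True, p=p)
    # Step 3: weights inversely proportional to the sampling probabilities,
    # which makes weighted sums over the coreset unbiased estimates of
    # the corresponding sums over the full data.
    w = 1.0 / (M * p[idx])
    return X[idx], w

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 3))
C, w = importance_coreset(X, M=2_000, seed=2)
print(C.shape, w.shape)  # the 2000-point weighted summary
```

The 1/(M p) weighting is the standard importance-sampling correction: in expectation, the total weight of the coreset matches the size of the full dataset.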


Interested researchers are invited to read our publications:

Riquelme-Granada, Nery, Khuong Nguyen, and Zhiyuan Luo. "Coreset-based Conformal Prediction for Large-scale Learning." In Conformal and Probabilistic Prediction and Applications, pp. 142-162. 2019.