If you had to choose one single idea to work on for the rest of your life, what would it be?
In 2003, my grandma passed away with severe dementia. At the time, there was no existing technology to help communicate her daily needs or her whereabouts, so that a carer could attend to her swiftly. It was a struggle for my family to provide her with the end-of-life care she deserved.
For the past decade, my research interests have centred on Machine Learning theory and its applications in Navigation and Smart Healthcare. I have published extensively in the sub-fields of contact tracing, indoor navigation, and reliable predictions, earning 2 Best Paper Awards and 1 Best Poster Award.
Since 2020, I have published an average of 5 papers and journal articles a year. See below for a full list of my publications. I am helping my 9 PhD students make their dreams a reality.
Aspects disclosed provide systems and methods for attaching reliability measures to the outputs of large language models (LLMs). The systems and methods do this by integrating LLMs into a multi-label classification setting, utilising the Conformal Prediction (CP) framework. This approach ensures that the predictions made by the LLM are accompanied by mathematically guaranteed error bounds, enhancing the LLM's reliability and trustworthiness.
(Nominated for the Best Paper Award)
The transcritical mixing of liquid fuel sprays is a process characterised by a fuel droplet's transition from a classical evaporation state to a state of diffusive mixing. Although we previously proposed a phenomenological model identifying distinct mixing regimes (classical evaporation, transitional mixing, and diffusive mixing), analysing such large image datasets still requires significant human intervention. This is primarily due to these mixing regimes being identified through temporal criteria (i.e. evolution of droplet shape in time), which is particularly challenging to automate using traditional image processing algorithms.
To address this issue, we designed a deep spatiotemporal learning algorithm trained on human-annotated frames of synthetic transcritical droplets, inspired by high-speed long-distance microscopy videos. In this paper we present our two-stage detection and classification pipeline, where a multiple-object tracking (MOT) algorithm based on YOLOv11 and an integrated BoT-SORT tracking layer initially detect moving droplets and isolate them at the object level. Then, a novel residual convolutional neural network and bidirectional long short-term memory network with a temporal attention module (CNN-BiLSTM-TAM) is proposed to classify the mixing regimes using the extracted object-level droplet images. Our algorithm is designed to learn both the rich visual characteristics of the droplets and their time-based evolution, using spatial and temporal attention to capture the most informative frames in the droplet image sequences.
We provide robust empirical validation of our work through attention map visualisation and performance comparison with two state-of-the-art image classifiers, showcasing improvements in precision, recall, and F1 score of +50%, +31%, and +40% against the first classifier, and +81%, +56%, and +72% against the second.
WiFi fingerprinting has been a prominent solution for indoor positioning, yet its dependence on labour-intensive data collection and its susceptibility to environmental dynamics remain major ongoing challenges.
Thus, this paper presents a comprehensive survey and analysis of data augmentation techniques designed to enhance WiFi fingerprinting datasets, focusing on efficiency in data construction and robustness in positioning accuracy. We reviewed over 70 studies and propose a novel taxonomy that categorises existing methods into 6 groups: traditional (e.g., interpolation, perturbation), propagation models, machine learning, deep learning, hybrid approaches, and other emerging techniques. Our quantitative analysis correlates key metrics, such as input data size, synthetic data volume, and augmentation ratios, with positioning performance.
We found that traditional methods achieved notable performance enhancements with minimal computational overhead. Surprisingly, deep learning models became less efficient when generating more data, particularly when the synthetic data exceeded a threefold ratio over the input samples. Our findings provide actionable guidance for selecting data augmentation strategies and bridge the gap between theoretical advancements and practical deployment for WiFi fingerprinting dataset enhancement.
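As an illustration of the traditional augmentation methods surveyed above, the sketch below synthesises new RSS fingerprints by linear interpolation between pairs of reference points, capping the synthetic volume at the threefold ratio the survey identifies. This is a minimal sketch, not a method from any single reviewed paper; all function names and parameters are hypothetical.

```python
import numpy as np

def interpolate_fingerprint(fp_a, fp_b, pos_a, pos_b, t=0.5):
    """Linearly interpolate two RSS fingerprints and their ground-truth
    positions to synthesise a new reference point."""
    fp_a, fp_b = np.asarray(fp_a, float), np.asarray(fp_b, float)
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    return (1 - t) * fp_a + t * fp_b, (1 - t) * pos_a + t * pos_b

def augment(fingerprints, positions, max_ratio=3, rng=None):
    """Synthesise fingerprints between random pairs of reference points,
    capping the synthetic volume at `max_ratio` times the input size
    (the threefold guideline observed in the survey)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(fingerprints)
    synthetic = []
    while len(synthetic) < max_ratio * n:
        i, j = rng.choice(n, size=2, replace=False)
        t = rng.uniform(0.2, 0.8)  # avoid duplicating the endpoints
        synthetic.append(interpolate_fingerprint(
            fingerprints[i], fingerprints[j], positions[i], positions[j], t))
    return synthetic
```

With three reference points and `max_ratio=3`, the sketch yields nine synthetic fingerprint-position pairs, tripling the dataset at negligible computational cost.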
This paper introduces a novel algorithm to enhance the reliability and interpretability of Value at Risk (VaR) estimates in financial markets, using Conformal Prediction.
We address the lack of a general mechanism to quantify the uncertainty of VaR estimates, especially under volatile market conditions, by employing the Adaptive Conformal Inference (ACI) methodology on both synthetic and real data. Two ACI techniques, Aggregation-based ACI (AgACI) and Dynamically-tuned ACI (DtACI), showed that Conformal Prediction can successfully construct valid predictive intervals around VaR forecasts. Additionally, we demonstrate that these intervals dynamically adapt to changes in market volatility, widening during periods of financial stress, such as the COVID-19 crisis and the 2022 geopolitical shocks.
Multimodal classification models, particularly those designed for fine-grained tasks, offer significant potential for various applications. However, their inability to effectively manage uncertainty often hinders their effectiveness. This limitation can lead to unreliable predictions and suboptimal decision-making in real-world scenarios.
We propose integrating conformal prediction into multimodal classification models to address this challenge. Conformal prediction is a robust technique for quantifying uncertainty by generating sets of plausible classifications for unseen data. These sets are accompanied by guaranteed confidence levels, providing a transparent assessment of the model's prediction reliability. By integrating conformal prediction, our objective is to increase the reliability and trustworthiness of multimodal classification models, thereby enabling more informed decision-making in contexts where uncertainty is a significant factor.
This paper addresses the challenge of efficient and accurate retrieval of Waste Electrical and Electronic Equipment (WEEE) products, specifically printer cartridges. Traditional recycling methods, heavily reliant on manual inspections, are labour-intensive and inefficient.
To overcome these limitations, we propose a hybrid search and re-ranking approach which incorporates a combination of NLP-based methods. By leveraging these methods, we aim to improve the recall@1k performance of WEEE product retrieval, facilitating more efficient and sustainable WEEE recycling practices. Our experiments achieved a recall@1k score of 60.18%, demonstrating the effectiveness of our approach. This improvement has significant implications for the recycling industry, enabling more accurate identification and sorting of WEEE products and ultimately contributing to a more sustainable circular economy.
Over the past few years, several universities and educational institutes have introduced e-learning platforms as robust alternatives to face-to-face teaching, allowing students to revisit topics covered in class without the constraints of time and space. However, despite this considerable flexibility, the role of the instructor as a facilitator remains crucial to support learners when they have doubts about their learning or get stuck, by encouraging them to consider suitable strategies to approach the problem, or by providing clarification on organisational aspects of the module.
Providing quality feedback that is tailored to the individual needs of each learner, including personality and neurodiversity, is a challenging task for educators. Developing different methods of learner-specific feedback increases the workload and often fails to fully address learning gaps. The lecturer's empathy, which consists of a deep understanding of students' personal and social situations, care and concern for students' emotions, and compassionate responses, also plays a critical role in student success. Several intelligent tutoring systems have been implemented in e-learning platforms to provide immediate feedback to support students, but they focus more on content and often do not tailor feedback with adaptive empathy based on different students' personalities or neurodiversity.
In this paper, an LLM-based intelligent tutoring system has been implemented within an e-learning platform, fine-tuned to the content and organisational aspects of the final-year project module in the IT programme, with the aim of providing immediate feedback based on students' requests. The software can tailor comments to each student's personality and, where appropriate, neurodiversity, for example, showing genuine interest in responses from introverts or paraphrasing content to improve written comprehension for dyslexic students. The neurodiversity information was taken from the user's profile, while personality was extracted using the MBTI (Myers-Briggs Type Indicator). Finally, the software was tested using a bespoke algorithm consisting of a matchmaking process that detects the level of communication strategies (empathy, creativity, sensitivity) by cross-matching the responses received with open online dictionaries, to evaluate the effectiveness of the tailored responses.
Accurately identifying a household demographic profile based on its television viewing pattern is important for content personalisation, targeted advertising, and programme design. By understanding who is watching what and when, broadcasters can tailor content to match viewers' interests. Although machine learning can predict household attributes, uncertainty is often high due to overlapping viewing patterns across demographic groups, shared device usage, and limited samples.
This paper implements the Conformal Prediction algorithm to provide an uncertainty measure for machine prediction. We also introduce a new nonconformity score to improve prediction efficiency.
Experiments on a large-scale, imbalanced TV dataset show that our method achieves an average prediction set size (APS) of 1.18 and an 82.8% singleton rate (OneC) at the 95% confidence level, outperforming conventional nonconformity measures in both reliability and efficiency.
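For readers unfamiliar with these metrics, the sketch below shows a standard inductive conformal predictor using the common inverse-probability nonconformity score (not the paper's novel score), together with the APS and OneC efficiency metrics reported above. A minimal sketch with hypothetical function names, assuming calibrated class probabilities are available.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Inductive conformal prediction with the common inverse-probability
    nonconformity score: 1 minus the predicted probability of a class."""
    # Nonconformity of each calibration example w.r.t. its true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A class enters the prediction set if its nonconformity is within q.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

def efficiency(sets):
    """Average prediction set size (APS) and singleton rate (OneC)."""
    sizes = np.array([len(s) for s in sets])
    return sizes.mean(), (sizes == 1).mean()
```

An APS close to 1 and a high OneC mean most prediction sets contain exactly one label, i.e. the predictor is decisive while still carrying the coverage guarantee.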
Traditional printer cartridge classification systems struggle with the inherent difficulties of fine-grained product classification, due to the significant intra-class variability and inter-class similarity among cartridge types.
To address this urgent challenge, this paper develops a resilient, fine-grained classification methodology that utilises multimodal learning of images and text, coupled with hybrid search techniques. Crucially, to ensure robustness and trustworthiness in real-world applications, our work includes a thorough methodology for uncertainty quantification, implemented through the use of conformal prediction adapted for multimodal learning. This enables the classification model to produce probabilistic outputs, indicating both the predicted cartridge type and a quantifiable level of confidence in the prediction. By addressing the dual challenges of precise classification and robust uncertainty estimation, this research aims to advance the development of intelligent systems for enhanced sorting of waste electronic and electrical equipment (WEEE), promoting a more sustainable and ecologically sound management of electronic waste streams.
Multimodal models can experience multimodal collapse, leading to sub-optimal performance on tasks like fine-grained e-commerce product classification. To address this, we introduce an approach that leverages multimodal Shapley values (MM-SHAP) to quantify the individual contributions of each modality to the model’s predictions.
By employing weighted stacked ensembles of unimodal and multimodal models, with weights derived from these Shapley values, we enhance overall performance and mitigate the effects of multimodal collapse. Using this approach, we improve the previous F1-score from 0.67 to 0.79.
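A minimal sketch of a Shapley-weighted ensemble in the spirit described above, assuming per-model MM-SHAP contribution scores have already been computed; the function name and the proportional weighting scheme are illustrative, not the paper's exact method.

```python
import numpy as np

def shapley_weighted_ensemble(model_probs, contribution_scores):
    """Blend per-model class probabilities with weights proportional to
    each model's (MM-)Shapley contribution score.

    model_probs: list of (n_samples, n_classes) probability arrays,
                 one per unimodal or multimodal model.
    contribution_scores: one non-negative score per model."""
    w = np.asarray(contribution_scores, float)
    w = w / w.sum()                             # normalise into weights
    stacked = np.stack(model_probs)             # (n_models, n, k)
    blended = np.tensordot(w, stacked, axes=1)  # weighted average, (n, k)
    return blended.argmax(axis=1), blended
```

When one modality has collapsed, its contribution score shrinks and the ensemble automatically leans on the models that still carry signal.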
A major challenge in quantitative finance is not just predicting which stocks will outperform but quantifying the uncertainty and reliability of those predictions. This is critical because financial markets are inherently noisy, volatile, and affected by countless unpredictable factors, meaning that even the best models can be dramatically wrong. Reliable measures of uncertainty are essential for risk-aware investment decisions: they help portfolio managers judge when to trust a prediction, size positions appropriately, and avoid overconfidence that can lead to costly losses.
Currently, most machine learning approaches for stock selection produce only point predictions, offering no meaningful measure of confidence, which limits their practical value for investors who need to manage risk.
Thus, in this paper, we benchmark classical and deep learning models for US stock selection (Fu et al., 2018), and apply conformal prediction (CP) to generate well-calibrated prediction sets. Across all models, CP achieves empirical coverage closely matching the nominal confidence level, with most prediction sets being singletons.
We propose the Hybrid Calibration Score (HCS), a new nonconformity measure for inductive conformal prediction. HCS combines instance-level scoring with global model calibration via Expected Calibration Error.
On a real-world demographic classification task, HCS achieves 99% coverage with smaller prediction sets (APS = 1.55) and higher decisiveness (OneC = 55.19%) than standard measures, while preserving formal coverage guarantees.
With the proliferation of increasingly complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demands of data-hungry models.
However, reliably assessing the quality of a ‘synthesiser’ model’s output is an open research question with significant associated risks for high-stake domains.
To address this challenge, we propose a unique synthesis algorithm that generates data from high-confidence feature space regions based on the Conformal Prediction framework. We support our proposed algorithm with a comprehensive exploration of the core parameter's influence, an in-depth discussion of practical advice, and an extensive empirical evaluation on five benchmark datasets. To show our approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance, and non-separability. In all trials, training sets extended with our confident synthesised data performed at least as well as the original set, and frequently improved Deep Learning performance significantly, by up to 61 percentage points in F1-score.
This volume presents the proceedings of the Fourteenth Symposium on Conformal and Probabilistic Prediction with Applications (COPA 2025). The symposium takes place at Royal Holloway University of London, the birthplace of Conformal Prediction, on September 10–12, 2025. Notably, this event also marks the 20th anniversary of the influential book "Algorithmic Learning in a Random World", authored by the founders of Conformal Prediction.
Overall, 36 full papers and 13 extended abstracts have been accepted for publication in the Proceedings of Machine Learning Research, Volume 266. The full papers are divided into 7 topics.
(Impact factor: 3.4)
WiFi fingerprinting is one of the most widely used techniques for indoor positioning systems. However, existing fingerprinting datasets come in different shapes and forms, with varying levels of information and no standardised format. They are also dispersed across multiple platforms, making it challenging for new researchers to identify and access a suitable dataset to evaluate their own positioning systems.
To address this challenge, this paper provides a comprehensive review of more than 50 publicly available WiFi fingerprinting datasets. We examine the most critical elements for fingerprinting, including the size and location of the testbed, the WiFi signal input, the number of locations, the temporal and spatial intervals of data collection, the positioning performance, and more. Surprisingly, we observed that a large number of reference and access points, the use of 3D coordinates, a denser sampling grid, and higher data collection frequencies do not always guarantee improved performance, contrary to what is often reported in the literature.
The paper also outlines current challenges and proposes guidelines for creating new WiFi fingerprinting datasets.
Amidst the digital transformation, traditional linear TV faces major challenges, including fragmented viewership, fixed schedules, and inaccurate targeting.
Therefore, this paper proposes a novel Machine Learning framework to understand the audience's demographics from their viewing behaviour. By employing state-of-the-art classification models on an extensive TV first-party dataset, we achieved an average 88.6% accuracy in correctly identifying each household's demographics.
Our results offer promising outcomes for refining strategies within linear TV to improve viewer engagement, content programming, and market insights.
The advances in WiFi technology have encouraged the development of numerous indoor positioning systems. However, their performance varies significantly across different indoor environments, making it challenging to identify the most suitable system for all scenarios.
To address this challenge, we propose an algorithm that dynamically selects the optimal WiFi positioning model for each location. Our algorithm employs a Machine Learning-based weighted model selection method, trained on raw WiFi RSS data, raw WiFi RTT data, statistical RSS & RTT measures, and Access Point line-of-sight information.
We tested our algorithm in four complex indoor environments, and compared its performance to traditional WiFi indoor positioning models and state-of-the-art stacking models, demonstrating an improvement of up to 1.8 meters on average.
Cough segmentation using Machine Learning is known to be sensitive to the effects of class-confounding characteristics in the training data, significantly skewing predictions with the introduction of bias. Mechanisms by which bias may permeate a dataset include small sample sizes and noise in the samples.
In this paper, we propose a novel audio segmentation algorithm as a means to solve these issues through automatic isolation and extraction of biological audio events.
Our algorithm, CoBrS, is based on heuristics derived from physiological assumptions and is designed to accurately isolate all cough types, including the complex peal cough, and provides segmentation support for breaths, a previously undocumented modality in segmentation literature. CoBrS was validated on three public cough datasets with varying segmentation complexity (Coswara, COUGHVID, Virufy) against two state-of-the-art algorithms (COUGHVID and Virufy), achieving mean signal quality increases of 169.3%, 274.2%, and 39.8%, and sample size increases of 250% and 280% respectively. Our findings were also manually verified by two human raters who reported a 94% peal cough segmentation rate and that 88% of coughs in the moderate noise test subset are of high quality. Our algorithm is capable of effectively isolating cough and breath events of all types from samples with low to moderate noise, whilst improving signal quality and retaining high-frequency information that is often lost in the process.
Fine-grained classification is a challenging task that aims to reduce the misclassification errors in the visual classification of similar image samples. Multimodal learning can improve results by combining text and image data. This approach can help minimise misclassification errors caused by intra-class variability and inter-class similarity.
The main contributions of our paper are: (1) We created a new multimodal dataset with 17,000 image-text pairs. (2) We propose a generalised pipeline for collecting text and image multimodal datasets to simplify the data collection process and encourage more researchers to curate such datasets. (3) We provide the baseline results using a CNN-based unimodal architecture (ResNet-152) and a text and image multimodal architecture (CLIP & MultiModal BiTransformers) to quantitatively demonstrate how the fusion of text and image modalities work to improve the results.
Although most WiFi indoor positioning systems achieve high accuracy in the ideal condition, where there is a direct Line-of-Sight between the Access Points and the user, they struggle under Non-Line-of-Sight scenarios.
Thus, we propose a novel feature selection algorithm leveraging Machine Learning weighting methods and Multi-Scale selection, with WiFi RTT and RSS as the input signals.
We evaluated the algorithm's performance on a campus building floor. The results indicated a Line-of-Sight detection accuracy of 93% across 13 Access Points, using only 3 seconds of test samples at any moment, and an accuracy of up to 98% for individual AP detection.
As of 2023, there are various WiFi technologies and algorithms for indoor positioning systems. However, each technology and algorithm comes with its own strengths and weaknesses that may not universally benefit all building locations.
Therefore, we propose a novel algorithm that dynamically switches to the optimal positioning model at any given location, utilising a Machine Learning-based weighted model selection algorithm with WiFi RSS and RTT signal measures as the input features.
We evaluated our algorithm in three real-world indoor scenarios, demonstrating an improvement of up to 1.8 metres compared to the standard WiFi fingerprinting algorithm.
Smartphones are part of our daily lives: they allow users to conduct online transactions, take photos, and even play games. To support and enrich the user experience, sensors have been added to detect the context of a user's interaction with the phone. For example, by detecting the phone's orientation through the sensor readings, app developers may adjust the screen to portrait or landscape for a better viewing experience, switch the screen off to avoid unintended touches when the phone is near the face using the proximity sensor, or adjust the screen brightness using the ambient light sensor.
However, despite being widely used in many mobile apps, these low-powered sensors do not require any permissions from the user and are potential targets for side-channel attacks. Hackers can design apps that harvest the sensor readings to infer the user's activities. Detecting such malicious use is a difficult task for Machine Learning and often results in many unwanted false positives.
Therefore, in this paper, we implement Conformal Prediction to detect potentially sensitive information being leaked via the magnetometer. We assess the validity and accuracy of our algorithm in three different real-world scenarios.
(Impact factor: 2.3)
Indoor positioning systems based on WiFi Round-Trip Time (RTT) measurement were reported to deliver sub-metre level accuracy using trilateration, under ideal indoor conditions. However, the performance of WiFi RTT positioning in complex, non-line-of-sight environments remains an open research question.
To this end, this article investigates the properties of WiFi RTT in several real-world indoor environments on heterogeneous smartphones. We present three datasets collected on a large-scale building floor, an office room and an apartment. The datasets contain both RTT and received signal strength (RSS) signal measures with correct ground-truth labels for further research.
Our results indicated that in a complex indoor environment, the RTT fingerprinting system delivered an accuracy of 0.6 m, 107% better than RSS fingerprinting and 6 m better than RTT trilateration, which failed to deliver sub-metre accuracy as claimed.
(Impact factor: 1.154)
Deep Learning predictions with measurable confidence are increasingly desirable for real-world problems, especially in high-risk settings. The Conformal Prediction (CP) framework is a versatile solution that automatically guarantees a maximum error rate. However, CP suffers from computational inefficiencies that limit its application to large-scale datasets.
In this paper, we propose a novel conformal loss function that approximates the traditionally two-step CP approach in a single step. By evaluating and penalising deviations from the stringent expected CP output distribution, a Deep Learning model may learn the direct relationship between input data and conformal p-values.
Our approach achieves significant training time reductions of up to 86% compared to Aggregated Conformal Prediction (ACP), an accepted CP approximation variant. In terms of approximate validity and predictive efficiency, we carry out a comprehensive empirical evaluation to show our novel loss function's competitiveness with ACP on the well-established MNIST dataset.
This volume contains the Proceedings of the Twelfth Symposium on Conformal and Probabilistic Prediction with Applications (COPA 2023), organised by Frederick University, Cyprus. The Symposium is held in Limassol on September 13–15, 2023.
Overall, 33 full papers and 7 extended abstracts have been accepted for publication in the Proceedings of Machine Learning Research, Volume 204.
(Winner of the Best Paper Award)
COVID-19 cough classification has rapidly become a promising research avenue as an accessible and low-cost screening alternative, needing only a smartphone to collect and process cough samples. However, audio processing of recordings made in uncontrolled environments and prediction confidence are key challenges that need to be addressed before cough-screening could be widely accepted as a trusted testing method.
Therefore, we propose a novel approach for cough event detection that identifies cough clusters instead of individual coughs, significantly reducing onset detection's usual hypersensitivity to energy fluctuations between cough phases.
By using this technique to improve training sample quality and quantity by +200%, we improve Machine Learning performance on the minority COVID-19 class by up to 20%, achieving up to +47% precision and +15% recall. We propose a novel, class-agnostic Conformal Prediction non-conformity measure which takes the cough sample quality into account to counteract the variance caused by limiting segmentation to just the training set. Our Conformal Prediction model introduces uncertainty quantification to COVID-19 cough classification and achieves an additional 34% improvement to precision and recall.
(Winner of the Best Poster Award)
COVID cough data is heavily imbalanced, and it is challenging to collect more samples. Therefore, models are biased and their predictions cannot be trusted.
Thus, we propose a confidence measure for COVID-19 cough classification.
Audio classification using breath and cough samples has recently emerged as a low-cost, non-invasive, and accessible COVID-19 screening method. However, no application has been approved for official use at the time of writing due to the stringent reliability and accuracy requirements of the critical healthcare setting.
To support the development of the Machine Learning classification models, we performed an extensive comparative investigation and ranking of 15 audio features, including less well-known ones. The results were verified on two independent COVID-19 sound datasets.
By using the identified top-performing features, we have increased the COVID-19 classification accuracy by up to 17% on the Cambridge dataset, and up to 10% on the Coswara dataset, compared to the original baseline accuracy without our feature ranking.
(Impact factor: 5.349)
The emerging WiFi Round Trip Time measured by the IEEE 802.11mc standard promised sub-meter-level accuracy for WiFi-based indoor positioning systems, under the assumption of an ideal line-of-sight path to the user. However, most workplaces with furniture and complex interiors cause the wireless signals to reflect, attenuate, and diffract in different directions.
Therefore, detecting the non-line-of-sight condition of WiFi Access Points is crucial for enhancing the performance of indoor positioning systems. To this end, we propose a novel feature selection algorithm for non-line-of-sight identification of the WiFi Access Points.
Using the WiFi Received Signal Strength and Round Trip Time as inputs, our algorithm employs multi-scale selection and Machine Learning-based weighting methods to choose the most optimal feature sets. We evaluate the algorithm on a complex campus WiFi dataset to demonstrate a detection accuracy of 93% for all 13 Access Points using 34 out of 130 features and only 3 s of test samples at any given time. For individual Access Point line-of-sight identification, our algorithm achieved an accuracy of up to 98%. Finally, we make the dataset available publicly for further research.
(Impact factor: 3.352)
The unsustainable take-make-dispose linear economy prevalent in healthcare contributes 4.4% to global Greenhouse Gas emissions. A popular but not yet widely-embraced solution is to remanufacture common single-use medical devices like electrophysiology catheters, significantly extending their lifetimes by enabling a circular life cycle.
To support the adoption of catheter remanufacturing, we propose a comprehensive emission framework and carry out a holistic evaluation of virgin manufactured and remanufactured carbon emissions with Life Cycle Analysis (LCA). We followed ISO modelling standards and NHS reporting guidelines to ensure industry relevance.
We conclude that remanufacturing may lead to a reduction of up to 60% per turn (−1.92 kg CO2eq, burden-free) and 57% per life (−1.87 kg CO2eq, burdened). Our extensive sensitivity analysis and industry-informed buy-back scheme simulation revealed long-term emission reductions of up to 48% per remanufactured catheter life (−1.73 kg CO2eq). Our comprehensive results encourage the adoption of electrophysiology catheter remanufacturing, and highlight the importance of estimating long-term emissions in addition to traditional emission metrics.
Indoor positioning systems based on WiFi Round-Trip Time (RTT) measurements are believed to deliver sub-metre level accuracy with trilateration, under ideal indoor conditions. However, the performance of WiFi RTT positioning in complex, non-line-of-sight environments remains a research challenge.
To this end, this paper investigates the properties of WiFi RTT in several real-world indoor environments on heterogeneous smartphones. We present a large-scale real-world dataset containing both RTT and received signal strength (RSS) signal measures with correct ground-truth labels.
Our results indicated that the RTT fingerprinting system delivered an accuracy below 0.75 m, 98% better than RSS fingerprinting and 166% better than RTT trilateration, which failed to deliver sub-metre accuracy as claimed.
Despite its potential, Machine Learning played little role in the COVID-19 pandemic, due to the lack of data (i.e., there were few COVID-19 samples in the early stages).
Thus, this paper proposes a novel cough audio segmentation framework that may be applied on top of existing COVID-19 cough datasets to increase the number of samples, as well as to filter out noise and uninformative data. We demonstrate the efficiency of our framework on two popular open datasets.
Coresets have been proven useful in accelerating the computation of inductive conformal predictors (ICP) when the training data becomes large in size.
This work shows that coreset-based conformal predictors are not only computationally efficient in the centralised setting, but may also naturally be used in scenarios where the dataset of interest is inherently distributed.
Malicious software (malware) is designed to circumvent the security policy of the host device. Smartphones represent an attractive target to malware authors as they are often a rich source of sensitive information. Attractive targets for attackers are sensors (such as cameras or microphones) which allow observation of the victims in real time.
To counteract this threat, there has been a tightening of privileges on mobile devices with respect to sensors, with app developers being required to declare which sensors they need access to, as well as the users needing to give consent.
We demonstrate, by conducting a survey of publicly accessible malware analysis platforms, that there are still sensor implementations that are trivial to detect without exposing the malicious intent of a program. We also show that, despite changes to the permission model, it is still possible to fingerprint an analysis environment even when the analysis is carried out on a physical device, through a novel use of Android's Activity Recognition API.
This volume contains the Proceedings of the Eleventh Symposium on Conformal and Probabilistic Prediction with Applications (COPA 2022), hosted by the University of Brighton, UK. The Symposium was held in Brighton on August 24–26, 2022.
Overall, 17 full papers have been accepted for publication in the Proceedings of Machine Learning Research, Volume 179.
The continual proliferation of mobile devices has encouraged much effort in using smartphones for indoor positioning.
This article reviews the most recent and interesting smartphone-based indoor navigation systems, ranging from electromagnetic to inertial to visible-light ones, with an emphasis on their unique challenges and potential real-world applications.
A taxonomy of smartphone sensors will be introduced, which serves as the basis for categorising the different positioning systems under review. A set of evaluation criteria will also be devised. For each sensor category, the most recent, interesting and practical systems will be examined, with detailed discussion of the open research questions for academics, and of the practicality for potential clients.
(Impact factor: 0.34)
One of the most popular approaches for indoor positioning is WiFi fingerprinting, which has been tackled as a traditional machine learning problem since the beginning, achieving accuracy of a few metres on average.
In recent years, deep learning has emerged as an alternative approach, with a large number of publications reporting sub-metre positioning accuracy.
Therefore, this survey presents a timely, comprehensive review of the most interesting deep learning methods being used for WiFi fingerprinting. In doing so, we aim to identify the most efficient neural networks, under a variety of positioning evaluation metrics for different readers.
The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms used for constructing them may become computationally expensive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two straightforward methods: Accelerating Clustering via Sampling (ACvS) and the Regressed Data Summarisation Framework (RDSF); the former is an acceleration procedure based on a simple theoretical observation on using Uniform Random Sampling for clustering problems, while the latter is a coreset-based data-summarisation framework that builds on ACvS and extends it by using a regression algorithm as part of the coreset construction.
We tested both procedures on five public datasets, and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating summaries of data using the RDSF framework is the Ordinary Least Squares (OLS).
This work aims to develop an automatic cutting-tool life prediction model for die-cutting machines at Parafix Ltd. Such a model will estimate how long a given tool is likely to last, in order to improve performance and productivity.
This work is part of the KTP project between Parafix Ltd and the University of Brighton.
This volume contains the Proceedings of the Tenth Symposium on Conformal and Probabilistic Prediction with Applications (COPA 2021), co-organised by Royal Holloway, University of London, and the University of Brighton, UK. Due to the ongoing Covid-19 pandemic, this year's Symposium was held online on September 8–10, 2021. For general information about conformal prediction and its sister methods, see the preface to the Proceedings of COPA 2017 (volume 60 of the PMLR), Proceedings of COPA 2018 (volume 91 of the PMLR), Proceedings of COPA 2019 (volume 105 of the PMLR), and Proceedings of COPA 2020 (volume 128 of the PMLR).
Overall, 15 papers have been accepted for publication in the Proceedings of Machine Learning Research, along with an additional paper describing the Orange tool, which is covered in the tutorials.
(Winner of the Best Paper Award)
In the era of datasets of unprecedented sizes, data compression techniques are an attractive approach for speeding up machine learning algorithms. One of the most successful paradigms for achieving good-quality compression is that of coresets: small summaries of data that act as proxies to the original input data. Even though coresets have proved extremely useful in accelerating unsupervised learning problems, applying them to supervised learning problems may bring unexpected computational bottlenecks.
We show that this is the case for Logistic Regression classification, and hence propose two methods for accelerating the computation of coresets for this problem. When coresets are computed using our methods on three public datasets, computing the coreset and learning from it is, in the worst case, 11 times faster than learning directly from the full input data, and 34 times faster in the best case. Furthermore, our results indicate that our accelerating approaches do not degrade the empirical performance of coresets.
Support Vector Machine (SVM) is a powerful paradigm that has proven to be extremely useful for the task of classifying high-dimensional objects. In principle, SVM allows us to train scoring classifiers, i.e., those that output a prediction score; however, it can also be adapted to produce probability-type outputs through the Venn-Abers framework. This allows us to obtain valuable information on the label distribution for each test object. The procedure, however, is restricted to very small datasets given its inherent computational complexity.
We circumvent this limitation by borrowing results from the field of computational geometry. Specifically, we make use of the concept of a coreset: a small summary of data that is constructed by discretising the feature space into enclosing balls, so that each ball will be represented by only one point.
Our results indicate that training Venn-Abers predictors using enclosing balls provides an average acceleration of 8 times compared to the regular Venn-Abers approach while largely retaining probability calibration. These stimulating results imply that we can still enjoy well-calibrated probabilistic outputs for kernel SVM even in the realm of large-scale datasets.
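The enclosing-ball idea above can be sketched as a greedy cover of the feature space: each ball keeps one representative point and a weight counting the points it absorbed. This is our illustrative reading of the concept, not the paper's exact construction.

```python
import numpy as np

def enclosing_ball_summary(X, radius):
    """Greedy ball cover: scan the points once; a point joins the
    first existing ball whose centre is within `radius`, otherwise it
    opens a new ball. Returns representatives and their weights."""
    centres, weights = [], []
    for x in X:
        placed = False
        for i, c in enumerate(centres):
            if np.linalg.norm(x - c) <= radius:
                weights[i] += 1
                placed = True
                break
        if not placed:
            centres.append(x)
            weights.append(1)
    return np.array(centres), np.array(weights)

# Two tight clusters collapse to two weighted representatives
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.05, (500, 2)),
               rng.normal(5, 0.05, (500, 2))])
centres, weights = enclosing_ball_summary(X, radius=1.0)
print(len(centres), int(weights.sum()))  # 2 1000
```

Training on the (weighted) representatives instead of all 1000 points is what yields the reported speed-up, at the cost controlled by the ball radius.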
(Impact factor: 1.36)
Contact tracing is widely considered an effective procedure in the fight against epidemic diseases. However, one of the challenges for technology-based contact tracing is the high number of false positives, which questions its trustworthiness and efficiency amongst the wider population for mass adoption.
To this end, this paper proposes a novel, yet practical smartphone-based contact tracing approach, employing WiFi and acoustic sound for relative distance estimation, in addition to ambient air pressure and magnetic field environment matching. We present a model combining 6 smartphone sensors, prioritising some of them when certain conditions are met.
We empirically verified our approach in various realistic environments, demonstrating up to 95% fewer false positives and 62% better accuracy than a Bluetooth-only system. To the best of our knowledge, this paper was one of the first works to propose a combination of smartphone sensors for contact tracing.
(Impact factor: 3.43)
Passengers travelling on the London underground tubes currently have no means of knowing their whereabouts between stations. The challenge for providing such service is that the London underground tunnels have no GPS, WiFi, Bluetooth or any kind of terrestrial signals to leverage.
This paper presents a novel, yet practical idea to track passengers in real time using the smartphone accelerometer and a training database of the entire London underground network. Our rationales are that London tubes are self-driving transports with predictable accelerations, decelerations and travelling times, and that they always travel on the same fixed rail lines between stations, with distinctive bumps and vibrations, which permits us to generate an accelerometer map of the tubes' movements on each line. Given a passenger's accelerometer data, we identify in real time which line they are travelling on and which station they departed from, using a pattern-matching algorithm, with an accuracy of up to about 90% when the sampling length covers at least 3 station stops. We incorporate Principal Component Analysis to perform inertial tracking of the passenger's position along the line when trains break away from scheduled movements during rush hours.
Our proposal was painstakingly assessed on the entire London underground covering approximately 940 kilometres of travelling distance, spanning across 381 stations on 11 different lines.
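The pattern-matching step can be pictured as sliding a passenger's short accelerometer trace over each line's stored template and scoring windows with normalised cross-correlation. The sketch below is illustrative only; the names, scoring, and toy data are our assumptions, not the paper's algorithm.

```python
import numpy as np

def best_match(query, templates):
    """Match a query accelerometer trace against per-line templates.

    Slides the query over each template, scores each window with the
    normalised cross-correlation, and returns (line, offset, score).
    """
    q = (query - query.mean()) / (query.std() + 1e-12)
    best = (None, -1, -np.inf)
    for line, tmpl in templates.items():
        for off in range(len(tmpl) - len(q) + 1):
            w = tmpl[off:off + len(q)]
            w = (w - w.mean()) / (w.std() + 1e-12)
            score = float(np.dot(q, w)) / len(q)
            if score > best[2]:
                best = (line, off, score)
    return best

# Toy "map": two lines with distinct vibration signatures
rng = np.random.default_rng(0)
victoria = rng.normal(0, 1, 500)
central = rng.normal(0, 1, 500)
query = victoria[120:220] + rng.normal(0, 0.1, 100)  # noisy segment
line, offset, score = best_match(query,
                                 {"Victoria": victoria, "Central": central})
print(line, offset)  # best match lands on the Victoria line
```

Longer queries (more station stops) make the correct line and offset stand out more sharply, which matches the reported accuracy gain at 3+ stops.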
We demonstrate a breach in smartphone location privacy through the accelerometer and magnetometer's footprints. The merits or otherwise of explicitly permissioned location sensors are not the point of this paper. Instead, our proposition is that other non-location-sensitive sensors can track users accurately when the users are in motion, as in travelling on public transport, such as trains, buses, and taxis.
Through field trials, we provide evidence that high accuracy location tracking can be achieved even via non-location-sensitive sensors for which no access authorisation is required from users on a smartphone.
As the volume of data increases rapidly, most traditional machine learning algorithms become computationally prohibitive. Furthermore, the available data can be so big that it easily exceeds a single machine's memory.
We propose Coreset-Based Conformal Prediction, a strategy for dealing with big data by applying conformal predictors to a weighted summary of the data, namely the coreset. We compare our approach against stand-alone inductive conformal predictors on three large competition-grade datasets to demonstrate that our coreset-based strategy may not only significantly improve the learning speed, but also retain prediction validity and the predictors' efficiency.
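The pipeline above (summarise, then run inductive conformal prediction on the summary) can be sketched minimally as follows. A uniform random sample stands in for a true coreset, and a 1-NN nonconformity score stands in for the paper's underlying algorithm; both are assumptions for illustration.

```python
import numpy as np

def nonconformity(x, y, X_proper, y_proper):
    """1-NN nonconformity: distance to the nearest proper-training
    example carrying the same label."""
    same = X_proper[y_proper == y]
    return float(np.min(np.linalg.norm(same - x, axis=1)))

def icp_predict_set(x, X_proper, y_proper, X_cal, y_cal, labels, eps):
    """Inductive conformal prediction set at significance level eps:
    keep every label whose p-value exceeds eps."""
    a_cal = np.array([nonconformity(xc, yc, X_proper, y_proper)
                      for xc, yc in zip(X_cal, y_cal)])
    pred = set()
    for lab in labels:
        a_test = nonconformity(x, lab, X_proper, y_proper)
        p = (np.sum(a_cal >= a_test) + 1) / (len(a_cal) + 1)
        if p > eps:
            pred.add(lab)
    return pred

# Two Gaussian classes; a uniform sample stands in for the coreset
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(4, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
summary = rng.permutation(1000)[:200]       # the "coreset"
proper, cal = summary[:150], summary[150:]
print(icp_predict_set(np.array([0.1, -0.2]), X[proper], y[proper],
                      X[cal], y[cal], [0, 1], eps=0.1))
```

For a point sitting firmly inside class 0, the set contains only label 0; ambiguous points near the class boundary would yield the set {0, 1}, which is exactly the efficiency/validity trade-off being measured.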
Coughing and sneezing are the most common means of spreading respiratory diseases amongst humans. Existing approaches to detecting coughing and sneezing events are either intrusive or do not provide any reliability measure.
This paper offers a novel proposal to reliably and non-intrusively detect such events using a smartwatch as the underlying hardware and Conformal Prediction as the underlying software.
We rigorously analysed the performance of our proposal with the Harvard ESC Environmental Sound dataset, and with real coughing samples taken from a smartwatch under different ambient noises.
This report describes work in progress on analysing the ExCAPE data for the possibility of multi-target learning.
We start by observing the structure of missing values (labels), as the sets of examples overlap but are not identical across different targets.
Then we concentrate on the part of the data with full information, in order to consider the mutual dependence between the targets and the possibility of improving prediction by pooling the information together.
This report summarises the performance of Mondrian Inductive Conformal Prediction on the top 10 largest targets in the ExCAPE dataset.
Out of 526 targets in the ExCAPE database, only the top 150 have more than 100,000 compounds. The number of compounds drops below 5,000 beyond the top 200.
(Nominated for the Best Paper Award)
Public transport provides an ideal medium for the transmission of contagious diseases.
This paper introduces a novel idea to detect the co-location of people in such environments using just the ubiquitous geomagnetic field sensor on the smartphone. Essentially, given that all passengers must share the same journey between at least two consecutive stations, we have a long window in which to match user trajectories.
Our idea was assessed through a painstaking survey of over 150 kilometres of travelling distance, covering different parts of London, using overground trains, underground tubes and buses.
The major challenges for optical-based tracking are the lighting conditions, the similarity of the scenes, and the position of the camera.
This paper demonstrates that under such conditions, the positioning accuracy of Google's Tango platform may deteriorate from fine-grained centimetre level to metre level.
The paper proposes a particle filter based approach to fuse the WiFi signal and the magnetic field, which are not considered by Tango, and outlines a dynamic positioning selection module to deliver seamless tracking service in these challenging environments.
(Impact factor: 0.46)
WiFi fingerprinting has been a popular approach to indoor positioning in the past decade. However, most existing fingerprint-based systems were designed as an on-demand service to guide the user to a desired destination.
This article introduces a novel feature that allows the positioning system to predict in advance which walking route the user may take, and the potential destination. To achieve this goal, a new so-called routine database maintains magnetic field strength readings in the form of training sequences representing walking trajectories. The benefit of the system is that it does not adhere to a single predicted trajectory. Instead, the system dynamically adjusts its prediction as more data are exposed throughout the user's journey. The proposed system was tested in a real indoor environment to demonstrate that it not only successfully estimated the route and the destination, but also improved the single-position prediction.
Indoor navigation provides a positioning service to indoor users, where GPS coverage is not available. The challenges for most signal-based indoor positioning systems are the unpredictable signal propagation caused by complex building interiors, and the dynamics of the environment caused by people's movements. However, most existing systems provide no assessment of the quality of their predictions, which is crucial in such noisy indoor environments.
To address this challenge, this article proposes a confidence measure to reflect the uncertainty of the positioning prediction. More importantly, users may control the size of the prediction set by setting the confidence level, tailored to their personal requirements. The proposed approach has been validated in three real office buildings with challenging indoor environments, where it performed up to 20% more accurately than the traditional Naïve Bayes and Weighted K-nearest neighbours (W-KNN) algorithms.
Indoor localisation provides a positioning service to indoor users, where GPS coverage is not available. Much research effort has been invested into 'Location Fingerprinting', which is considered one of the most effective indoor tracking methods to date. Fingerprint-based approaches piggyback on top of existing indoor communication layers, such as the WiFi network, to provide the location-based service. However, the challenges of fingerprinting are the huge training database, the dynamic indoor environment, and the fact that WiFi fingerprints may struggle to provide fine-grained positioning accuracy at certain indoor positions. This thesis addresses these problems using several machine learning algorithms and additional information observed from the users and the indoor environment.
The proposed approaches in this thesis have been validated in real offices with challenging indoor environments. One test bed spans multiple buildings and floors, and was previously used in the EvAAL 2015 indoor positioning competition, which provides a relative baseline for the proposed techniques. In particular, the regression and classification algorithms in this thesis were ranked second and third out of the 5 contestants, under the same competition's test domain. In addition, they performed up to 20% more accurately than the traditional Naive Bayes and W-KNN algorithms.
An epidemic may be controlled or predicted if we can monitor the history of physical human contacts. As most people carry a smartphone, a contact between two persons can be regarded as a handshake between their two phones. Our task then becomes detecting the moment the two mobile phones are close.
In this paper, we investigate the possibility of using outdoor WLAN signals, provided by public Access Points, for off-line mobile phone co-location detection. Our method requires neither GPS coverage nor real-time monitoring. We designed an Android app running in the phone's background to periodically collect outdoor WLAN signals. These data are then analysed to detect potential contacts. We also discuss several approaches to handling mobile phone diversity and the WLAN scanning latency issue. Based on our measurement campaign in the real world, we conclude that it is feasible to detect the co-location of two phones with WLAN signals only.
One of the challenges of deploying location fingerprinting in real offices is the handling of the training database, which does not scale well with the amount of tracking space to be covered. However, little attention has been paid to this issue, as the majority of previous work focused instead on improving tracking accuracy.
In this paper, we propose a novel idea to enhance fingerprinting's processing speed and positioning accuracy with mixture-of-Gaussians clustering. We identified the key difference between fingerprinting and other unsupervised problems: we do know the label (the Cartesian coordinate) of the signal data in advance. This key information was largely ignored in previous work, where fingerprint clustering was based solely on the signal data. By exploiting this information, we tackle indoor signal multipath and shadowing with two-level clustering on the signal data and on the Cartesian coordinates.
We tested our approach in a real office environment with harsh indoor conditions, and conclude that our clustering scheme not only reduces the fingerprinting processing time, but also improves the positioning accuracy.
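The two-level idea (cluster by known coordinates first, then sub-cluster each area by signal to separate multipath/shadowing modes) can be sketched as below. Plain k-means with a deterministic initialisation stands in for the paper's mixture-of-Gaussians model; all names and the toy data are our assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with farthest-point initialisation (a simple
    stand-in for mixture-of-Gaussians clustering)."""
    X = np.asarray(X, dtype=float)
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min(np.linalg.norm(X[:, None] - np.array(centres), axis=2),
                   axis=1)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centres, axis=2),
                           axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels

def two_level_cluster(coords, rss, k_pos=2, k_sig=2):
    """Level 1: cluster fingerprints by their known coordinates.
    Level 2: sub-cluster each spatial cluster by RSS readings."""
    pos_labels = kmeans(coords, k_pos)
    assignments = {}
    for c in range(k_pos):
        idx = np.where(pos_labels == c)[0]
        sig_labels = kmeans(rss[idx], min(k_sig, len(idx)))
        for i, s in zip(idx, sig_labels):
            assignments[int(i)] = (c, int(s))
    return assignments

# Toy fingerprints: two rooms, each observed under two signal modes
coords = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], float)
rss = np.array([[-40, -70], [-41, -71], [-60, -50],
                [-80, -30], [-81, -31], [-55, -90]], float)
groups = two_level_cluster(coords, rss)
```

At query time, only the matching spatial cluster (and signal mode) needs to be searched, which is where both the speed and accuracy gains come from.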
Indoor localisation helps monitor the position of a person inside a building, without GPS coverage. In the past decade, much research effort has been invested into Indoor Fingerprinting, which is considered one of the most effective indoor tracking methods to date.
In recent years, some researchers have started looking at crowdsourcing the fingerprinting database with contributions from indoor users via mobile phones or laptop PCs. However, the crowdsourcing process has been greatly limited by the lack of indoor references, in contrast to the widespread use of GPS references for outdoor crowdsourcing.
In this paper, we propose a novel idea to crowdsource the fingerprinting database without any preset infrastructure, landmarks, or advanced sensors. Our idea is based on the observations that users often carry a mobile phone with them, and that there are multiple social contacts amongst those users indoors. First, we exploit the user's continuous indoor movement to refine the location prediction set; this approach can also be applied to enhance other systems. Second, we use a unique concept to detect indoor social contacts with NFC, by tapping the backs of two phones together. Third, we propose a novel idea combining these social contacts and the user's continuous movements to identify, with confidence, the exact entries in the fingerprinting database that need updating for crowdsourcing. Finally, we share our thoughts on automating the crowdsourcing process without any user input.
Current indoor localisation systems make use of common wireless signals, such as Bluetooth and WiFi, to track users inside a building. Amongst those, Bluetooth is widely known for its low power consumption, small maintenance cost, and widespread availability in commodity devices. Understanding the properties of such a wireless signal aids tracking-system design. However, amongst current Bluetooth-based tracking systems, little research has been done to understand the properties of the Bluetooth wireless signal.
In this chapter, the most important Bluetooth properties related to indoor localisation are experimentally investigated from a statistical perspective. A Bluetooth-based tracking system using the location fingerprinting technique is then proposed and evaluated, incorporating the Bluetooth properties described in the chapter.
(Impact factor: 0.78)
Indoor localisation is the state of the art for identifying and observing a moving human or object inside a building. However, because of harsh indoor conditions, current indoor localisation systems remain either too expensive or not accurate enough.
In this paper, we tackle the latter issue from a different direction, with a new conformal prediction algorithm to enhance the accuracy of the prediction. We handle the common indoor signal attenuation issue, which introduces errors into the training database, with a reliability measure for our predictions. We show why our approach performs better than other solutions through empirical studies with two testbeds. To the best of our knowledge, we are the first to apply conformal prediction for the localisation purpose in general, and for indoor localisation in particular.
We proposed the first Conformal Prediction (CP) algorithm for indoor localisation with a classification approach. The algorithm can provide a region of predicted locations, and a reliability measurement for each prediction. However, one of the shortcomings of the former approach was the individual treatment of each dimension. In reality, the training database usually contains multiple signal readings at each location, which can be used to improve the prediction accuracy.
In this paper, we enhance our former CP with the Kullback-Leibler divergence and propose two new classification CPs. The empirical studies show that our new CPs performed slightly better than the previous CP when the resolution and density of the training database are high, but much better when the resolution and density are low.
Indoor localisation is the state of the art for identifying and observing a moving human or object inside a building. Location Fingerprinting is a cost-effective, software-based solution utilising the building's built-in wireless signals to estimate the most probable position of a real-time signal reading. In this paper, we apply the Conformal Prediction (CP) algorithm to further enhance the Fingerprinting method. We design a new nonconformity measure with Weighted K-nearest neighbours (W-KNN) as the underlying algorithm. Empirical results show good performance of the CP algorithm.
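A k-NN-style nonconformity measure of the kind commonly paired with CP can be sketched as the ratio of distances to the nearest same-label fingerprints over the nearest other-label ones. The paper's exact measure may differ; this is a generic sketch with hypothetical names and toy data.

```python
import numpy as np

def wknn_nonconformity(x, y, X_train, y_train, k=3):
    """k-NN-ratio nonconformity score: sum of distances to the k
    nearest fingerprints sharing label y, over the k nearest with
    other labels. Larger values mean the (x, y) pairing looks
    stranger, i.e. less conforming."""
    d_same = np.sort(np.linalg.norm(X_train[y_train == y] - x, axis=1))[:k]
    d_other = np.sort(np.linalg.norm(X_train[y_train != y] - x, axis=1))[:k]
    return d_same.sum() / (d_other.sum() + 1e-12)

# RSS fingerprints for two reference locations (labels 0 and 1)
X = np.array([[-40., -70.], [-41., -69.], [-39., -71.],
              [-70., -40.], [-69., -41.], [-71., -39.]])
y = np.array([0, 0, 0, 1, 1, 1])
probe = np.array([-42., -68.])
print(wknn_nonconformity(probe, 0, X, y, k=2))  # small: label 0 fits
print(wknn_nonconformity(probe, 1, X, y, k=2))  # large: label 1 is strange
```

Feeding these scores into the usual CP p-value computation yields, for each candidate location, a measure of how typical that labelling is relative to the training fingerprints.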
The release of the Lego Mindstorms kit has carried the flexibility and creativity of Lego into the world of robotics, targeting a variety of child and adult audiences. To achieve this goal, a programming language called NXT-G was developed to give everyone full control of the Lego Mindstorms kit, regardless of their programming experience.
In this project, this ambition is tested through practical experiments. In a controlled experiment, twelve participants carried out four tasks using the NXT-G software and a Lego robot. Their performance was then analysed to assess the stated claim.
This thesis proposed and implemented a new, affordable indoor tracking system. The Bluetooth signal was found to be stable and reliable for indoor positioning. The Fingerprinting method was employed to map the Bluetooth signal at many positions in the office room. In addition, a robot was built to perform the complex and time-consuming data collection process.
Spam (junk-email) identification is a well-documented research area. A good spam filter is not only judged by its accuracy in identifying spam, but also by its performance.
This project aims to replicate a Naive Bayesian spam filter, as described in the "SpamCop: A Spam Classification & Organization Program" paper. The accuracy and performance of the filter are examined on the GenSpam corpus. In addition, the project investigates the actual effect of the Porter Stemming algorithm on such a filter.
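The core of a Naive Bayesian spam filter can be sketched as a multinomial model with Laplace smoothing over word counts. This is a generic sketch of the technique, not the SpamCop formulation replicated in the project; the class name and toy corpus are our assumptions.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def classify(self, tokens):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            n = sum(counts.values())
            # log prior + sum of Laplace-smoothed log likelihoods
            s = math.log(self.doc_counts[label] / total_docs)
            for w in tokens:
                s += math.log((counts[w] + 1) / (n + len(vocab)))
            scores[label] = s
        return max(scores, key=scores.get)

f = NaiveBayesSpamFilter()
f.train("win free money now".split(), "spam")
f.train("free prize claim now".split(), "spam")
f.train("meeting agenda for monday".split(), "ham")
f.train("lunch on monday".split(), "ham")
print(f.classify("claim your free money".split()))  # spam
```

Stemming (e.g. Porter) would be applied to the tokens before `train` and `classify`, which is exactly the variable whose effect the project measures.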
The thesis was one of the first to investigate the structure of the UK Chip & PIN debit and credit cards. Particularly, we looked into CAMs (Card Authentication Methods) implemented by different banks to understand their policies.
A Java-based reader was developed to exchange information with all UK debit/credit cards; some overseas cards were also tested. The software can simulate an ATM to perform off-line PIN verification.
I am keen on supporting enthusiastic students who would like to apply their ideas to make a real-world impact. If you are interested in doing research with me, feel free to get in touch.
I have had the pleasure to supervise the following PhD students.
- Charles Gadd (2025 - present): researching Machine Learning for Smart Transport.
- Alice Ashby (2025 - present): researching Bias Detection in Large Language Models. "Chris Boyne" Best Undergraduate Thesis Award 2022. COPA 2023 Best Student Paper Award.
- Yinqi Zhang (2025 - present): researching Smart Earables for Indoor Navigation.
co-supervised with Prof. Zhiyuan Luo.
- Javier Carreño (2023 - present): researching Anomaly Detection for Digital Advertising.
- Robert Choudhury (2020 - present): researching Machine Learning for Mobile Security.
co-supervised with Prof. Zhiyuan Luo.
- Ajibola Obayemi (2022 - 2025): researched Multimodal Fine-grained Classification. Head of Technology at BCMY Ltd. Best PhD Project at IEEE RTSI 2025.
Thesis: "Multimodal Learning for fine-grained product classification". (Passed with No Corrections).
- Xu (Sean) Feng (2021 - 2025): researched Machine Learning for Navigation. 85% GPA at Zhejiang University (China).
Thesis: "Machine Learning approaches for WiFi Round Trip Time indoor positioning systems".
- Julia Meister (2021 - 2024): researched Machine Learning for Digital Health. Top 10 students in BSc Computer Science. Bronze Award at STEM for Britain 2023.
Thesis: "Confident COVID-19 detection with Conformal Prediction".
- Nery Riquelme-Granada (2018 - 2021): researched Machine Learning for Data Summary. DATA 2020 Best Student Paper Award.
co-supervised with Prof. Zhiyuan Luo.
Thesis: "Coreset-based protocols for Machine Learning classification".