Method Summary

Nan Xiao (Seven Bridges), Soner Koc (Seven Bridges), Kaushik Ghose (Seven Bridges)



Numerous signal detection methods have been proposed over the past decades for pharmacovigilance monitoring in large databases. These methods typically produce a ranked list of detected signals (anomalies) that warrant further investigation. However, there is still no holistic view of the results from these approaches, and no established way to integrate them. Here, we developed a new signal detection method that combines the vaccine safety signals detected by any set of signal detection methods into an ensemble, and generates an aggregated list of top signals based on distance metric optimization. Our method can potentially enhance weak signals from individual methods and, at the same time, reduce the number of false-positive signals discovered by chance.

A flow diagram of our method:

Base signal detectors and signal rankers

We included four mainstream signal detection methods that have already been applied in real-world pharmacovigilance monitoring by regulatory agencies globally. These methods (or metrics) are:

As expected, the signal detection results from these methods are somewhat similar (in terms of high-ranking vaccine-adverse event pairs), but differ in the ranking details. We used the implementations available from the R packages openEBGM (Canida and Ihrie 2017) and PhViD (Ahmed et al. 2010).
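To make the idea of a "signal ranker" concrete, the sketch below scores each vaccine-adverse event pair with the proportional reporting ratio (PRR) and sorts the pairs by that score. PRR is a widely used disproportionality measure, but whether it is one of the four methods used here is an assumption, since the methods are not named in this section. The function names and the toy counts are hypothetical, and the code does not call openEBGM or PhViD.

```python
def contingency(pair_counts, vaccine, event):
    """Return the 2x2 cell counts (a, b, c, d) for one vaccine-event pair.

    a: target vaccine, target event     b: target vaccine, other events
    c: other vaccines, target event     d: other vaccines, other events
    """
    total = sum(pair_counts.values())
    a = pair_counts.get((vaccine, event), 0)
    vaccine_total = sum(n for (v, _), n in pair_counts.items() if v == vaccine)
    event_total = sum(n for (_, e), n in pair_counts.items() if e == event)
    b = vaccine_total - a
    c = event_total - a
    d = total - a - b - c
    return a, b, c, d


def prr(a, b, c, d):
    """Proportional reporting ratio: (a / (a + b)) / (c / (c + d))."""
    if c == 0:
        # The event was never reported with any other vaccine.
        return float("inf")
    return (a / (a + b)) / (c / (c + d))


def rank_by_prr(pair_counts):
    """Return the vaccine-event pairs sorted from highest to lowest PRR."""
    scores = {
        pair: prr(*contingency(pair_counts, *pair)) for pair in pair_counts
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy counts, invented purely for illustration.
toy_counts = {
    ("VAX_A", "fever"): 30,
    ("VAX_A", "rash"): 10,
    ("VAX_B", "fever"): 20,
    ("VAX_B", "rash"): 40,
}
ranked_pairs = rank_by_prr(toy_counts)
```

Each base method would produce one such ranked list. Because different metrics weight rare and common pairs differently, the lists tend to agree near the top while differing in the details.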

Rank aggregation for ensembled safety signal detection

To aggregate the ranked safety signal lists detected by the multiple methods above, we model it as an optimization problem:

\[ \delta^* = \arg\min_{\delta} \sum_{i=1}^{m} d(\delta, L_i) \]

where \(\delta\) is a candidate ranked aggregated safety signal list of length \(k = |L_i|\), \(d\) is a distance function between rankings (the Spearman footrule distance here), \(L_i\) is the ranked list of detected signals generated by the \(i\)-th base method, and \(m\) is the number of base methods. The idea is to find the \(\delta^*\) that minimizes the total distance between \(\delta\) and the lists \(L_i\).
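The optimization can be illustrated on a toy example. The sketch below computes the Spearman footrule distance between two rankings and finds the aggregated list by exhaustive search over all permutations. Two simplifying assumptions apply: every input list ranks the same set of items, and brute-force search stands in for the stochastic search that a package such as RankAggreg would use on realistically sized lists. The function names and the signal labels are hypothetical.

```python
from itertools import permutations


def footrule(candidate, ranked_list):
    """Spearman footrule distance: sum of absolute rank differences."""
    position = {item: rank for rank, item in enumerate(ranked_list)}
    return sum(abs(rank - position[item]) for rank, item in enumerate(candidate))


def aggregate(ranked_lists):
    """Return the ordering that minimizes the total footrule distance.

    Exhaustive search over k! orderings, so only feasible for very small k.
    """
    items = ranked_lists[0]

    def total_distance(candidate):
        return sum(footrule(candidate, lst) for lst in ranked_lists)

    return list(min(permutations(items), key=total_distance))


# Three hypothetical base methods ranking the same four signals.
base_lists = [
    ["signal_A", "signal_B", "signal_C", "signal_D"],
    ["signal_A", "signal_C", "signal_B", "signal_D"],
    ["signal_B", "signal_A", "signal_C", "signal_D"],
]
consensus = aggregate(base_lists)
total = sum(footrule(consensus, lst) for lst in base_lists)
```

In this example the consensus places each signal at its median position across the three lists, which is why it coincides with the first list even though the other two disagree with it in different places.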

A similar rank aggregation problem also exists in gene list prioritization for high-throughput data analysis, where a number of ordered gene lists discovered by statistical tests are aggregated. The R package RankAggreg (Pihur, Datta, and Datta 2007) was repurposed to solve the same optimization problem here.


The raw data used in this solution were downloaded from the VAERS database and cover 30 years (1990-2019) of domestic vaccine adverse event reports in the United States. The raw data were then cleaned and transformed into an analyzable format. About 3.44 million vaccine-adverse event pairs were extracted and included in the analysis.
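The cleaning steps are not described in detail here, so the following is only a sketch of how report-level records might be turned into vaccine-adverse event pair counts. It assumes a layout similar to the public VAERS CSV files, with one table of vaccines and one table of symptoms, both keyed by a report identifier. The column names `VAERS_ID`, `VAX_TYPE`, and `SYMPTOM1` to `SYMPTOM5` are assumptions about that layout, the rows are invented, and counting each pair at most once per report is a design choice of this sketch rather than a documented step of the pipeline.

```python
from collections import Counter
from itertools import product

# Invented rows that mimic the assumed VAERS-style layout.
vax_rows = [
    {"VAERS_ID": 1, "VAX_TYPE": "FLU3"},
    {"VAERS_ID": 1, "VAX_TYPE": "MMR"},
    {"VAERS_ID": 2, "VAX_TYPE": "FLU3"},
]
symptom_rows = [
    {"VAERS_ID": 1, "SYMPTOM1": "Pyrexia", "SYMPTOM2": "Rash"},
    {"VAERS_ID": 2, "SYMPTOM1": "Pyrexia", "SYMPTOM2": ""},
]


def build_pair_counts(vax_rows, symptom_rows, n_symptom_cols=5):
    """Count vaccine-adverse event pairs, at most once per report."""
    vaccines = {}
    for row in vax_rows:
        vaccines.setdefault(row["VAERS_ID"], set()).add(row["VAX_TYPE"])

    events = {}
    for row in symptom_rows:
        for i in range(1, n_symptom_cols + 1):
            # Missing or blank symptom columns are skipped.
            term = (row.get(f"SYMPTOM{i}") or "").strip()
            if term:
                events.setdefault(row["VAERS_ID"], set()).add(term)

    counts = Counter()
    for report_id, report_vaccines in vaccines.items():
        report_events = events.get(report_id, set())
        # Every vaccine in a report is paired with every event in that report.
        for pair in product(sorted(report_vaccines), sorted(report_events)):
            counts[pair] += 1
    return counts


pair_counts = build_pair_counts(vax_rows, symptom_rows)
```

A count table of this shape is the kind of input that disproportionality-based signal detectors typically expect.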

Code and website

We created a companion website detailing our approach, analysis pipeline, and findings in this challenge. The site is accessible from: All code available on GitHub:


Besides the potentially health-impacting signals, a considerable proportion of the top-ranked vaccine-adverse event pairs point to opportunities to improve the vaccine administration process or vaccine product labeling, and can guide improvements in upstream reporting data quality and data ingestion procedures.

This solution also demonstrates that, by harnessing open data and high-quality open source data analysis software, we can quickly develop new analytical approaches and flexible pipelines for extracting new insights from public health information, and present both the process and the results to the community, thus increasing computational transparency and reproducibility.

Ahmed, I, C Dalmasso, F Haramburu, F Thiessard, P Broët, and P Tubert-Bitter. 2010. “False Discovery Rate Estimation for Frequentist Pharmacovigilance Signal Detection Methods.” Biometrics 66 (1): 301–9.

Canida, Travis, and John Ihrie. 2017. “openEBGM: An R Implementation of the Gamma-Poisson Shrinker Data Mining Model.” The R Journal 9 (2): 499–519.

Pihur, Vasyl, Susmita Datta, and Somnath Datta. 2007. “Weighted Rank Aggregation of Cluster Validation Measures: A Monte Carlo Cross-Entropy Approach.” Bioinformatics 23 (13): 1607–15.

