doi:10.3808/jei.201700375
Copyright © 2017 ISEIS. All rights reserved

Supervised Machine Learning and Heuristic Algorithms for Outlier Detection in Irregular Spatiotemporal Datasets

K. P. Chowdhury

    University of California - Irvine, Paul Merage School of Business, CA 92697-3125, USA

*Corresponding author. Tel.: +(818) 857-9325. E-mail address: kpchowdh@uci.edu (K. P. Chowdhury).

Abstract


A central problem in time series analysis is the detection of outliers, which is further complicated when the data are irregularly sampled and have spatiotemporal components. This paper presents one heuristic and two supervised machine learning algorithms for detecting outliers in univariate time series data in this setting, and compares their results with the automatic outlier detection methodology of Chen and Liu (1993). Many states in the US, and jurisdictions around the world, have recently set up large environmental databases that accept pollutant measurement data from virtually any source. To explore the viability of such databases and of the proposed methods, the procedures are applied to measurements of various surface water pollutants in the California Environmental Data Exchange Network (CEDEN). The proposed methodologies, though not as robust, give results similar to those of existing methodologies given the nature of the data, can be far less time intensive to implement, and provide interesting insights into the database. The algorithms can therefore be used widely, with minimal computing resources and tractable results even on very large datasets. The methodologies are applicable across many disciplines to databases with similar measurement challenges, particularly in the environmental setting. In particular, the results have large potential regulatory impact on accepted levels of different pollutants in California water bodies, as well as on the amounts charged for industrial discharge into those water bodies, and are intended to provide direction for further research and regulatory investments. Based on the results, it seems reasonable to assume that there is further room for including pollutant measurements from nongovernmental agencies in the debate on environmental pollution, specifically in California.
However, the results also indicate that a more inclusive use of such databases in regulatory matters must be carefully evaluated on an individualized basis. This is to ensure that poorly collected or poorly handled measurements do not inundate the database and crowd out those collected with more rigor, which could make inference on the true population distribution of the pollutants more difficult. The concern is especially relevant for pollutant measurements that require more delicate sampling procedures.

Keywords: time series; irregular spatiotemporal time series; outlier detection; water pollution; CEDEN

