Title (eng)

Detecting outliers in claim data using machine learning

Author

Azdejković, Dragan

Publisher

University of Belgrade, Faculty of Economics and Business, Publishing Centre

Description (eng)

In general, outliers are items that differ significantly from the majority of the items they can be compared to. In this chapter, we use the term anomaly synonymously, though often also to indicate a single unusual property of an item: the individual oddities that make an item an outlier. Outlier detection refers to the process of finding unusual items. For tabular data, this usually means identifying unusual rows in a table; for image data, unusual images; for text data, unusual documents; and similarly for other types of data. The specific definitions of “normal” and “unusual” can vary, but at a fundamental level outlier detection rests on the assumption that the majority of items within a dataset can be considered normal, while those that differ significantly from the majority may be considered unusual, or outliers. For instance, when working with a database of claims, we assume that the majority of claims represent normal behaviour, and our goal is to locate the claims that stand out as distinct from these. In statistical theory, outliers can arise from measurement errors, errors in data transmission, elements from outside the population being incorrectly included in it, a flaw in an assumed theory, or genuine variability. Identifying outliers is crucial, as they can skew statistical analysis and lead to misleading conclusions. Applications of outlier detection include fraud detection, bot detection on social media, network security, financial auditing, regulatory oversight of financial markets, medical diagnosis, astronomy, data quality, and the development of autonomous vehicles. It is important to note that not all outliers are necessarily problematic; in fact, many are not even interesting.
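The “majority defines normal” assumption above can be illustrated with a minimal sketch. The helper name, the toy claim amounts, the MAD-based modified z-score, and the 3.5 cutoff are illustrative choices of ours, not values or methods taken from the chapter:

```python
def mad_outliers(values, threshold=3.5):
    """Flag indices whose modified z-score (median/MAD based) exceeds `threshold`.

    Median and MAD are used instead of mean and standard deviation because
    a single extreme claim can inflate the standard deviation enough to
    mask itself; the median-based score is robust to that.
    """
    def median(xs):
        s, n = sorted(xs), len(xs)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread: nothing can be flagged
    # 0.6745 rescales MAD to be comparable to a standard deviation
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# A hypothetical batch of claim amounts: one claim dwarfs the rest.
claims = [120.0, 110.0, 130.0, 125.0, 115.0, 118.0, 122.0, 5000.0]
print(mad_outliers(claims))  # → [7], the 5000.0 claim
```

Under this sketch, the bulk of the claims defines “normal”, and only the claim far from that bulk is reported, which is exactly the fundamental assumption described above.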
Outlier detection is not merely about removing noise; it also aims to find interesting database objects whose behaviour deviates considerably from the majority and which, as such, provide new insights. Inconsistency can mean that a data object comes from a different distribution than the model chosen to describe the data. But inconsistency could also mean that the presupposed model does not describe the data as well as was assumed when the model was selected. Both conclusions can carry rather significant repercussions for the interpretation of the given observations. There are several ways to find fraudulent records in tabular data. If we have economic or financial data, we can use a group of methods based on Benford's law. However, its weakness is sensitivity to sample size. In this chapter, we show how the same task can be solved using unsupervised learning techniques for detecting outliers in insurance data. By applying both techniques, the pool of suspicious records becomes smaller, which could lead to savings in audit work.
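As one illustration of the Benford's-law family of methods mentioned above, the following sketch compares the leading-digit frequencies of a batch of amounts against Benford's expected proportions log10(1 + 1/d) via a chi-square statistic. The function name and the synthetic example data are ours, not from the chapter, and a real audit would also apply a significance threshold appropriate to the sample size (the law's sensitivity to sample size is exactly the weakness noted in the text):

```python
import math
from collections import Counter

def benford_chi2(amounts):
    """Chi-square statistic of leading-digit counts vs. Benford's law.

    Larger values mean the digit distribution departs more from
    Benford's expected proportions log10(1 + 1/d) for d = 1..9.
    """
    digits = []
    for a in amounts:
        # First nonzero digit of the amount, read off its decimal string.
        lead = next((ch for ch in str(abs(a)) if ch.isdigit() and ch != "0"), None)
        if lead is not None:
            digits.append(int(lead))
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Leading digits of 10**u for u uniform in [0, 1) follow Benford's law...
benford_like = [10 ** (i / 1000) for i in range(1000)]
# ...while 1..1000 has roughly uniform leading digits.
uniform_like = list(range(1, 1001))
print(benford_chi2(benford_like))   # small: conforms to Benford's law
print(benford_chi2(uniform_like))   # large: flagged as suspicious
```

Records contributing to a large statistic would form the Benford-based suspect pool, which the chapter proposes intersecting with the pool produced by unsupervised outlier detection to shrink the audit workload.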

Language

English

Date

2025

License

Creative Commons license
This work is licensed under the terms of the
Creative Commons CC BY-NC-ND 4.0 - Attribution-NonCommercial-NoDerivatives 4.0 International License.

http://creativecommons.org/licenses/by-nc-nd/4.0/legalcode

Subject

Keywords: machine learning, outliers, insurance

Part of collection (2)

o:32766 Master's theses of the University of Belgrade, Faculty of Economics
o:28218 Faculty of Economics