Imbalanced data sampling design based on grid boundary domain for big data

Bibliographic Details
Published in: Computational Statistics vol. 40, no. 1 (Jan 2025), p. 27
Published:
Springer Nature B.V.
Subjects:
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3165215246
003 UK-CbPIL
022 |a 0943-4062 
022 |a 1613-9658 
022 |a 0723-712X 
024 7 |a 10.1007/s00180-024-01471-8  |2 doi 
035 |a 3165215246 
045 2 |b d20250101  |b d20250131 
084 |a 108412  |2 nlm 
245 1 |a Imbalanced data sampling design based on grid boundary domain for big data 
260 |b Springer Nature B.V.  |c Jan 2025 
513 |a Journal Article 
520 3 |a The data distribution is often associated with a priori-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications. 
653 |a Data analysis 
653 |a Parameter identification 
653 |a Algorithms 
653 |a Classification 
653 |a Big Data 
653 |a Sampling designs 
653 |a Sampling methods 
653 |a Data sampling 
653 |a Datasets 
653 |a Regression analysis 
653 |a Sampling techniques 
653 |a Approximation 
653 |a Methods 
773 0 |t Computational Statistics  |g vol. 40, no. 1 (Jan 2025), p. 27 
786 0 |d ProQuest  |t ABI/INFORM Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3165215246/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3165215246/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch