sortinghyeok/InsuranceFraudPrediction

๋ณดํ—˜์‚ฌ์˜ ํšจ์œจ์ ์ธ ์šด์˜์„ ์œ„ํ•œ ์˜ˆ์ธก ๋ชจ๋ธ ๊ฐœ๋ฐœ

๋ฐ์ดํ„ฐ ์ฒญ๋…„ ์บ ํผ์Šค ํ”„๋กœ์ ํŠธ 1์กฐ

๊ฐœ๋ฐœ๊ธฐ๊ฐ„ : 2021/08/02 ~ 2021/08/27

Notion Link : https://nervous-stranger-60b.notion.site/Insurance-Fraud-Prediction-9a9d4408c8c64753afa16497aec26d60

Insurance Fraud Status & Project Purpose

According to data from the Financial Supervisory Service, the number of insurance fraud cases has risen steadily since 2015.

In particular, the amount of money involved in insurance fraud rose sharply after the COVID-19 pandemic began in 2020.

In response, several insurers have built insurance fraud prediction systems based on machine learning. Representative examples include Kyobo Life's K-FDS and KB Insurance's SMA system.

However, these existing prediction systems suffer from a trust problem: for example, they can flag consumers who are legitimately entitled to a payout as fraudsters.

This project therefore expresses the fraud prediction as a probability to compensate for that weakness, with the goal of building a prediction model that lets insurers manage their customers more efficiently.

๋ฐ์ดํ„ฐ ์ •์ œ, ์ „์ฒ˜๋ฆฌ

Raw_Data ํด๋” ๋‚ด๋ถ€์˜ 4๊ฐœ์˜ ๋ฐ์ดํ„ฐ ์…‹ ํŒŒ์ผ์€ ๊ฐ๊ฐ ๋ณดํ—˜์‚ฌ ๊ฐ€์ž… ํšŒ์›์ •๋ณด(CUST), ๊ณ ๊ฐ๋ณ„ ์ฒญ๊ตฌ(CLAIM), ๋ณดํ—˜์„ค๊ณ„์‚ฌ ๊ด€๋ จ ๋ฐ์ดํ„ฐ(FPINFO), ๋ณดํ—˜ ๊ณ„์•ฝ ๊ด€๋ จ ๋ฐ์ดํ„ฐ(CNTT)์ด๋‹ค.

์ด๋“ค์˜ ์ •์ œ๋ฅผ ์œ„ํ•˜์—ฌ ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹์„ UTF-8๋กœ ๋งž์ถ”๊ณ , CSVํŒŒ์ผ๋กœ ๋ณ€ํ™˜ํ•œ ๊ฒƒ์ด Converted_Data ํด๋” ๋‚ด๋ถ€ ์ปจํ…์ธ ์ด๋‹ค. ํด๋” ๋‚ด๋ถ€ ๊ฐ ํŒŒ์ผ์— ๋Œ€์‘ํ•˜๋Š” ๋‚ด์šฉ์€ RAW_DATA์™€ ๊ฐ™๋‹ค.

The first goal of cleaning was to produce the dataset best suited to analysis. Merging the columns of all four datasets yields more than 70 columns, and that many columns made the analysis harder rather than easier.

We therefore carried out cleaning on three merged sets: 1. CNTT + CUST + FPINFO, 2. CUST + CLAIM, and 3. all datasets merged. (The related code and the preprocessed CSV files are available in the Data_Processing branch.)

๊ฐ ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ์…‹์˜ ์ปฌ๋Ÿผ ์ •๋ณด์™€ ์œ ํšจ ์ปฌ๋Ÿผ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Cust + Claim


CUST_ID : Unique customer ID. It serves as the primary key (PK) for merging and is unique within the analysis dataset.

SIU_CUST_YN : Whether the customer is an insurance fraudster. This is the target for the analysis: binary data coded 1 : Y, 2 : N, and the most important column for supervised learning.

SEX : Customer's sex. 1 : MALE, 2 : FEMALE

AGE : Customer's age

FP_CAREER : Whether the customer has a history of working as an insurance planner (FP)

MAX_PRM : Maximum premium. The monthly premium level at the time the customer paid the largest premium to the company, increasing by 1 for every 100,000 units.

RESL_CD1 : Result code for the insured accident

ACCI_OCCP_GRP : Occupation code of the claimant

CHME_LICE_NO : License number of the main attending physician

DMND_AMT : Amount of accident insurance money claimed

PAYM_AMT : Amount actually paid

NON_PAY_RATIO : Ratio of non-covered expenses under indemnity insurance

HEED_HOSP_YN : Whether the hospital is on the watch list

CLAIM_CNT : Number of insurance claims

TOTAL_VLID_HOSP_OTDA : Sum of valid hospitalization and outpatient days

HOSP_VARIES : Number of different hospitals visited across the claims

HOSP_DVSN_VARIES : Number of different hospital types visited across the claims

CHME_LICE_COUNT : Number of attending-physician licenses
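The per-customer columns at the end of this list (CLAIM_CNT, TOTAL_VLID_HOSP_OTDA, HOSP_VARIES, HOSP_DVSN_VARIES, CHME_LICE_COUNT) read like aggregates over individual claim rows. A minimal sketch of such an aggregation follows; the per-claim column names `HOSP_CODE`, `HOSP_SPEC_DVSN`, and `VLID_HOSP_OTDA` are hypothetical, and counting distinct values is an assumption about how the project defined these features.

```python
import pandas as pd

# One row per claim (invented values).
claim = pd.DataFrame({
    "CUST_ID":        [1, 1, 1, 2],
    "HOSP_CODE":      ["A", "A", "B", "C"],   # hypothetical: hospital identifier
    "HOSP_SPEC_DVSN": [10, 10, 20, 10],       # hypothetical: hospital type code
    "CHME_LICE_NO":   [501, 502, 503, 504],
    "VLID_HOSP_OTDA": [3, 0, 5, 1],           # hypothetical: valid days per claim
})

# Collapse to one row per customer with named aggregations.
per_customer = claim.groupby("CUST_ID").agg(
    CLAIM_CNT=("CHME_LICE_NO", "count"),
    TOTAL_VLID_HOSP_OTDA=("VLID_HOSP_OTDA", "sum"),
    HOSP_VARIES=("HOSP_CODE", "nunique"),
    HOSP_DVSN_VARIES=("HOSP_SPEC_DVSN", "nunique"),
    CHME_LICE_COUNT=("CHME_LICE_NO", "nunique"),
).reset_index()
```

After this step `CUST_ID` is unique again, so the result can be joined back onto CUST without duplicating customers.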


์ƒ๊ด€๋ถ„์„ ๊ฒฐ๊ณผ

image

์ƒ๊ด€๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, SIU_CUST_YN๊ณผ CLAIM_CNT, TOTAL_VLID_HOSP_OTDA, HOSP_VARIES, HOSP_DVSN_VARIES, CHME_LICE_COUNT๊ฐ€ ์œ ํšจํ•˜๊ฒŒ ๊ด€๋ จ์žˆ๋Š” ์ปฌ๋Ÿผ์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

Multiple regression results

image

The multiple regression shows that MINCRDT, CAUS_CODE, CLAIM_CNT, TOTAL_VLID_HOSP_OTDA, HOSP_VARIES, HOSP_DVSN_VARIES, CHME_LICE_COUNT, and others are significant columns.

RandomForest feature importance results

image

The Random Forest feature importance analysis shows that CLAIM_CNT, TOTAL_VLID_HOSP_OTDA, HOSP_VARIES, HOSP_DVSN_VARIES, CHME_LICE_COUNT, HEED_HOSP_YN, NON_PAY_RATIO, and others are useful columns.
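For reference, a feature importance ranking of this kind can be obtained from scikit-learn's `RandomForestClassifier` through its `feature_importances_` attribute. The sketch below uses synthetic data in which the label depends only on the claim count; the hyperparameters are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
claim_cnt = rng.poisson(3, n)
age = rng.integers(20, 70, n)
y = (claim_cnt >= 5).astype(int)        # synthetic label driven by claim count only

X = np.column_stack([claim_cnt, age])
feature_names = ["CLAIM_CNT", "AGE"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Impurity-based importances tend to favor high-cardinality features, so they are best read alongside the correlation and regression results rather than on their own.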


Taken together, the following five columns showed a relatively strong relationship with SIU_CUST_YN in all three analyses.

  1. CHME_LICE_COUNT
  2. HOSP_DVSN_VARIES
  3. HOSP_VARIES
  4. TOTAL_VLID_HOSP_OTDA
  5. CLAIM_CNT
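The five columns are simply the intersection of the three lists reported above, which can be checked directly:

```python
# Columns named by each analysis in this section.
correlation = {
    "CLAIM_CNT", "TOTAL_VLID_HOSP_OTDA", "HOSP_VARIES",
    "HOSP_DVSN_VARIES", "CHME_LICE_COUNT",
}
regression = {
    "MINCRDT", "CAUS_CODE", "CLAIM_CNT", "TOTAL_VLID_HOSP_OTDA",
    "HOSP_VARIES", "HOSP_DVSN_VARIES", "CHME_LICE_COUNT",
}
random_forest = {
    "CLAIM_CNT", "TOTAL_VLID_HOSP_OTDA", "HOSP_VARIES", "HOSP_DVSN_VARIES",
    "CHME_LICE_COUNT", "HEED_HOSP_YN", "NON_PAY_RATIO",
}

# Columns that all three analyses agree on.
common = correlation & regression & random_forest
```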

Cust + CNTT + FPINFO


CLLT_FP_PRNO : FP (insurance planner) employee number

INCB_DVSN : Employment status

CUST_ID : Customer ID. As the PK, it holds a unique value.

DIVIDED_SET : Dataset split indicator (TEST or TRAIN set). May be dropped during analysis.

SIU_CUST_YN : Whether the customer is an insurance fraudster. Binary target for the analysis.

SEX : Customer's sex (1 : male, 2 : female)

AGE : Customer's age

FP_CAREER : Whether the customer has FP experience

OCCP_GRP : Occupation group code

TOTALPREM : Total premiums paid to date

WEDD_YN : Marital status

MAX_PAYM_YEAR : Year in which the maximum premium was paid

MAX_PAYM_MONTH : Month in which the maximum premium was paid

MAX_PRM : Monthly premium level at the time the customer paid the largest premium to the company

RGST_MONTH : Customer registration month

RGST_YEAR : Customer registration year

MNTH_INCM_AMT_AVG : Average income stated on the application forms

MAIN_INSR_AMT_SUM : Sum of main insurance amounts

SUM_ORIG_PREM_SUM : Total premium of the contracts (main contract + riders)

EXPR_SUM : Sum of whole-life insurance premiums

CNTT_TERM_AVG : Average contract duration in days

WORK_YEARS_MAX : Maximum years of service

WORK_YEARS_MIN : Minimum years of service

EXPR_COUNT : Number of whole-life insurance policies


์ƒ๊ด€๋ถ„์„ ๊ฒฐ๊ณผ

image

์ƒ๊ด€๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, SIU_CUST_YN๊ณผ ์œ ์˜๋ฏธํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ปฌ๋Ÿผ์ด ์•„์˜ˆ ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค.

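Multiple regression results

Explanatory power (0 < R-squared < 1)

| Statistic | Value | Note |
| --- | --- | --- |
| R-squared | 0.016 | very weak explanatory power |
| Adj. R-squared | 0.015 | very weak explanatory power |
| F-statistic | 14.3 | |
| Prob (F-statistic) | 3.53e-23 | |
| Log-Likelihood | -1429.4 | |
| AIC | 2879 | |
| BIC | 2948 | |

Columns significant at p < 0.05

| Column | coef | t | p-value |
| --- | --- | --- | --- |
| SEX | 0.0244 | 3.37 | 0.001 |
| AGE | 0.0084 | 2.381 | 0.017 |
| FP_CAREER | 0.0444 | 3.664 | 0 |
| OCCP_GRP | -0.0017 | -2.979 | 0.003 |
| TOTALPREM | -0.0031 | -4.13 | 0 |
| MAIN_INSR_AMT_SUM | 1.02E-10 | 6.57 | 0 |
| EXPR_SUM | 1.167e-10 | -4.521 | 0 |
| WORK_YEARS_MIN | -0.0013 | -2.581 | 0.01 |
| EXPR_COUNT | 0.0029 | 2.758 | 0.006 |

The multiple regression found the nine columns above to be statistically significant, even though the overall explanatory power of the model is very weak.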

RandomForest feature importance results

image

In the Random Forest feature importance analysis, only about three columns (EXPR_COUNT, WORK_YEARS_MIN, WORK_YEARS_MAX) reached an importance of 0.1 or higher, which is not enough to conclude that any column is strongly related to the target.


Taken together, we concluded that the CNTT and FPINFO tables make it difficult to distinguish SIU_CUST_YN for a given CUST_ID.

Final dataset : Insurance Data (CUST + CLAIM + CNTT + FPINFO)


From the two datasets above we were able to pick out about eight meaningful columns.

However, these columns have limited effect while they sit in separate tables, so we saw the need to merge them into a single dataset for analysis.

The result is the third dataset, Insurance Data, built by extracting and merging the columns judged relatively meaningful from all tables.

The columns in this dataset are listed below. Their definitions are covered by the descriptions above and are not repeated here.

SEX, AGE, FP_CAREER, TOTALPREM, MNTH_INCM_AMT_AVG, MAIN_INSR_AMT_SUM, MINCRDT, CAUS_CODE_COUNT, DMND_RESN_CODE_COUNT, RESL_CD1_COUNT, NON_PAY_RATIO_SUM, CLAIM_CNT, TOTAL_VLID_HOSP_OTDA, HOSP_DVSN_VARIES, CHME_LICE_COUNT

The multiple regression results for these columns are shown below.

image

๋ฐ์ดํ„ฐ ๋ชจ๋ธ๋ง ์ ์šฉ ๋ฐ ๋ถ„์„

์ด์ œ๋ถ€ํ„ฐ์˜ ๊ณผ์ •์€ ์œ„์—์„œ ํ™•์ •๋œ ํ†ตํ•ฉ๋ฐ์ดํ„ฐ InsuranceData.csv๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

์ด ๊ณผ์ •์€ ํ”„๋กœ์ ํŠธ ๋ชฉ์ ์— ์ ํ•ฉํ•œ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๊ณ , overfitting์„ ๋ฐฉ์ง€ํ•˜๊ธฐ์œ„ํ•œ ์ƒ˜ํ”Œ๋ง ๊ธฐ๋ฒ•์„ ์„ ํƒํ•˜๋Š” ๋ฐ์— ์ค‘์ ์„ ์ค€๋‹ค.

๋ชจ๋ธ์ ์šฉ


InsuranceData.csv์— ๋Œ€ํ•˜์—ฌ, 3๊ฐ€์ง€ ๋ชจ๋ธ์„ ์ ์šฉํ•ด ๊ทธ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ด๋ณด๊ธฐ๋กœ ํ•œ๋‹ค. ์ ์šฉํ•  ๋ชจ๋ธ์€ ๋‹ค์Œ์˜ 3๊ฐ€์ง€์ด๋‹ค.


Logistic Regression

Random Forest

Support Vector Machine


We also apply the following sampling methods to each of these models.


SMOTE

BorderlineSMOTE

ADASYN

SVMSMOTE
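All four names match over-sampler classes in the imbalanced-learn package (`imblearn.over_sampling`). They share one core idea: create synthetic minority samples by interpolating between a minority point and one of its minority-class neighbors. The variants differ mainly in which minority points they choose to interpolate from. The hand-rolled sketch below shows only the plain SMOTE interpolation step with NumPy; it is a simplified illustration, not the library implementation, and `smote_sketch` is a made-up name.

```python
import numpy as np


def smote_sketch(minority, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbors per point

    base = rng.integers(0, len(minority), size=n_new)           # starting points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    gap = rng.random((n_new, 1))                                # position along the segment
    return minority[base] + gap * (minority[chosen] - minority[base])


demo_rng = np.random.default_rng(1)
minority_points = demo_rng.normal(size=(10, 2))
synthetic = smote_sketch(minority_points, n_new=20)
```

Over-sampling should be applied to the training split only; resampling before the split leaks synthetic copies of test-like points into training and inflates the scores.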

Logistic Regression

  1. SMOTE

image

image

  2. BorderlineSMOTE

image

image

  3. ADASYN

image

image

  4. SVMSMOTE

image

image

์œ„ ๊ณผ์ •์˜ ๊ฒฝ์šฐ, ๋Œ€์ฒด์ ์œผ๋กœ Accuracy, recall, F1์˜ ์ˆ˜์น˜๋Š” ๋น„์Šทํ–ˆ์œผ๋‚˜ SVMSMOTE๊ฐ€ precision ๋ฉด์—์„œ ๋” ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.

Random Forest

  1. SMOTE

image

image

  2. BorderlineSMOTE

image

image

  3. ADASYN

image

image

  4. SVMSMOTE

image

image

Support Vector Machine

  1. SMOTE

image

image

  2. BorderlineSMOTE

image

image

  3. ADASYN

image

image

  4. SVMSMOTE

image

image


Summary

As the results above show, the performance differences between the models are not pronounced; they perform about the same. Among the sampling methods, however, SVMSMOTE achieved relatively high precision.

Since the purpose of this project is efficient customer management for insurers, we use the probability that the LogisticRegression model assigns to the target class.

In short, the prediction model is a LogisticRegression model trained on data sampled with SVMSMOTE.
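Obtaining a fraud probability rather than a hard label can be sketched with scikit-learn's `LogisticRegression.predict_proba`. The data here is synthetic, the split ratio and hyperparameters are assumptions, and the over-sampling step is left out to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.2).astype(int)   # imbalanced synthetic label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# In the project the training split would be over-sampled first; omitted here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_.
fraud_col = list(model.classes_).index(1)
fraud_probability = model.predict_proba(X_test)[:, fraud_col]
```

Looking up the column through `model.classes_` avoids assuming which position the positive class occupies.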


๊ต์ฐจ ๊ฒ€์ฆ๊ณผ ์˜ˆ์ธก

์œ„ ๊ณผ์ •๊นŒ์ง€์˜ ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐฉ์‹์€ ๋ชจ๋‘ ์ž„์˜์ ์œผ๋กœ, ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํ›ˆ๋ จ์— ์ด์šฉํ•  ์ˆ˜ ์—†์œผ๋ฉฐ ๋…ธ์ด์ฆˆ ๊ฐ’์ด ํฐ ๋ฐ์ดํ„ฐ๋“ค์ด ํ•œ ์ชฝ์— ์ ๋ฆฌ๊ฒŒ ๋  ๊ฒฝ์šฐ ์ œ๋Œ€๋กœ ๊ฒ€์ฆ ๋ฐ ํ›ˆ๋ จ์ด ์ด๋ฃจ์–ด์งˆ ์ˆ˜ ์—†๋‹ค๋Š” ๋ฌธ์ œ์ ์ด ์žˆ์œผ๋ฏ€๋กœ, ๋น„์†Œ๋ชจ์  ๊ต์ฐจ ๊ฒ€์ฆ์„ ์‹ค์‹œํ•˜์—ฌ ๋ฐ์ดํ„ฐ์™€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ์ •ํ™•๋„๋ฅผ ๋†’์ด๋„๋ก ํ•œ๋‹ค.

image

๊ต์ฐจ ๊ฒ€์ฆ์€ ์œ„์™€ ๊ฐ™์ด ๋‹ค์–‘ํ•œ TEST SET๊ณผ TRAIN SET์„ ๋งŒ๋“ค์–ด ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.


์—ฌ๊ธฐ์—, ์šฐ๋ฆฌ๋Š” ๋ถˆ๊ท ํ˜•์ด ์‹ฌํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฏ€๋กœ LABEL๋“ค์ด ๋น„์Šทํ•œ ๋น„์œจ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก Stratified K-fold ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์—ฌ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

image

ACCURACY


image

CONFUSION MATRIX


image

The figure above groups the predictions into two classes, TT and FF versus FT and TF, and visualizes them.

The green portion is TT and FF, where the prediction succeeded and the record was classified correctly; the orange portion is where it did not.

The X axis labels the predicted probability of being a fraudster in bins (0 : 0.00 ~ 0.10, 1 : 0.10 ~ 0.20, and so on). The distribution shows that the model is very likely to predict correctly when the true label is N, but very unlikely to predict correctly when it is Y.


image

๋ฐ”๋กœ ์œ„์—์„œ ๋ณด์ธ ๋ถ„๋ฅ˜ ํ˜„ํ™ฉ์„ ์„ค๋ช…ํ•˜๋“ฏ, ์œ„์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด N์œผ๋กœ ์ œ๋Œ€๋กœ ๋ถ„๋ฅ˜ํ•  ํ™•๋ฅ ์ด ๋†’์•„ ์ฃผ์˜๋ž€์— ๋Œ€๋ถ€๋ถ„์˜ ๋ถ„ํฌ๊ฐ€ ๋ชฐ๋ ค์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๊ทธ ์™ธ ๊ฒฝ๊ณ„๋‚˜ ์œ„ํ—˜๊ตฐ์˜ ๊ฒฝ์šฐ ์ œ๋Œ€๋กœ ๋œ ์˜ˆ์ธก์„ ํ•  ํ™•๋ฅ ์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋–จ์–ด์ ธ ์†Œ๊ทœ๋ชจ ๋ถ„ํฌ๋งŒ์ด ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
