[250421] Advanced Project, Day 2

jeonieee 2025. 4. 21. 22:38

import pandas as pd
import numpy as np
from google.colab import drive
from datetime import datetime

df = pd.read_csv("/2025_Airbnb_NYC_listings.csv")
airbnb = ['neighbourhood_group_cleansed', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'number_of_reviews', 'number_of_reviews_l30d', 'review_scores_rating']
df = df[airbnb]

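# regex=False so that '$' is treated as a literal character, not the end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False).astype(float)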

# df is a DataFrame at this point, so drop() is available.
# 'first_review' is not in the airbnb column list, so there is nothing to drop.

# Convert instant_bookable (if the raw data has an instant_bookable column)
# df['instant_bookable'] = df['instant_bookable'].apply(lambda x: 1 if x == 't' else 0)

# Convert the amenities column to a count of amenities
df['amenities'] = df['amenities'].apply(lambda x: len(x.split(',')))

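# One-hot encode room_type
# Encode only the room_type column; passing the whole df would duplicate every other column after concat
dummies = pd.get_dummies(df['room_type'], prefix='col', drop_first=False)
df = pd.concat([df, dummies], axis=1)
df.drop('room_type', axis=1, inplace=True)

# Convert bathrooms_text to 0 (shared) or 1 (not shared)
# str(x).lower() avoids a TypeError on NaN and catches 'Shared half-bath';
# missing values end up as 1 (not shared), which is an assumption
df['bathrooms_text'] = df['bathrooms_text'].apply(lambda x: 0 if 'shared' in str(x).lower() else 1)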
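A note on the amenities count above: splitting on commas miscounts an empty list (`[]` counts as 1) and any amenity whose name contains a comma. A minimal sketch of a safer count, assuming each cell is a JSON-style list string such as `["Wifi", "Kitchen"]`. The helper name `count_amenities` is made up for this example, and I have not checked every row of the CSV against that assumption.

```python
import json


def count_amenities(raw):
    """Return the number of amenities in one raw `amenities` cell."""
    # Missing values arrive as float NaN (or None); count them as zero.
    if not isinstance(raw, str):
        return 0
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to comma counting when the cell is not valid JSON.
        stripped = raw.strip("[]{} ")
        return len([part for part in stripped.split(",") if part.strip()])
    return len(items) if isinstance(items, list) else 0


print(count_amenities('["Wifi", "Kitchen", "Pets allowed, with fee"]'))  # 3
print(count_amenities("[]"))  # 0
```

It would be used the same way as the lambda: `df['amenities'] = df['amenities'].apply(count_amenities)`.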

# 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score, 
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve, 
    classification_report
)
import warnings
warnings.filterwarnings('ignore')

# Environment setup
# Korean font settings
plt.rcParams['font.family'] = 'NanumGothic' # use the NanumGothic font
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign from rendering as a broken glyph

import platform
from matplotlib import rc
# Pick a font per OS in case NanumGothic is not installed
system = platform.system()
if system == 'Darwin':  # macOS
    rc('font', family='AppleGothic')
elif system == 'Windows':
    rc('font', family='Malgun Gothic')
else:
    rc('font', family='NanumGothic')

# Show all columns
pd.set_option('display.max_columns', None)
## Show all rows
#pd.set_option('display.max_rows', None)
# Restore the defaults
pd.options.display.max_rows = 60
#pd.options.display.max_columns = 20

airbnb = pd.read_csv('/2025_Airbnb_NYC_listings.csv')
airbnb.info()

airbnb = airbnb[['neighbourhood_group_cleansed', 'property_type', 'room_type', 'bathrooms_text',
                 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights',
                 'maximum_nights', 'number_of_reviews', 'number_of_reviews_l30d',
                 'review_scores_rating', 'description', 'amenities','calculated_host_listings_count','price']]

airbnb.isnull().sum()
airbnb.head()
airbnb.describe() # summary statistics (count, mean, standard deviation, min, max)

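Correlation analysis

# Data preprocessing

- Categorical variables to use: 'neighbourhood_group_cleansed', 'property_type', 'room_type', 'bathrooms_text'

- Continuous variables to use: 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews', 'number_of_reviews_l30d', 'review_scores_rating', 'calculated_host_listings_count'

- Columns to convert to numbers: 'description', 'amenities'

Exclude IDs, host names, and dates. Categorical + price: keep the ones whose correlation with price (continuous vs. categorical) is 0.3 or higher. / Continuous: run them against price (plain correlation).
Among the categorical variables, collect only the binary ones and run them against price with point-biserial correlation (continuous vs. categorical).
Run the rest through ANOVA -> none.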
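For the ANOVA step in the plan above, a rough sketch of how a multi-level categorical column could be tested against price with `scipy.stats.f_oneway`. The helper `anova_against_price`, its `min_group_size` parameter, and the toy numbers are all made up for illustration; none of them come from the real dataset.

```python
import pandas as pd
from scipy.stats import f_oneway


def anova_against_price(frame, cat_col, target="price", min_group_size=2):
    """One-way ANOVA of `target` across the levels of `cat_col`."""
    # Drop rows where either the category or the target is missing.
    data = frame[[cat_col, target]].dropna()
    # One array of target values per category level, skipping tiny groups.
    groups = [
        group[target].to_numpy()
        for _, group in data.groupby(cat_col)
        if len(group) >= min_group_size
    ]
    if len(groups) < 2:
        return float("nan"), float("nan")
    f_stat, p_value = f_oneway(*groups)
    return float(f_stat), float(p_value)


# Toy data, invented for this example only.
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] * 4,
    "price": [300.0, 320.0, 280.0, 310.0, 150.0, 160.0, 140.0, 155.0,
              90.0, 100.0, 95.0, 85.0],
})
f_stat, p_value = anova_against_price(toy, "neighbourhood_group_cleansed")
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value only says that at least one group mean differs. It is not a correlation coefficient, so the 0.3 cutoff from the notes does not apply to it directly.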

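# regex=False so that '$' is treated as a literal character, not the end-of-string anchor
airbnb['price'] = airbnb['price'].str.replace('$', '', regex=False)
airbnb['price'] = airbnb['price'].str.replace(',', '', regex=False).astype(float)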

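Correlation analysis of continuous variables

# Correlation between the continuous variables and price
# Note: most of these columns are not in the 17-column subset selected earlier,
# so this cell needs airbnb to hold the full CSV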
continuous_vars = [
    'availability_30', 'availability_60', 'availability_90', 'availability_365',
    'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication',
    'review_scores_location', 'review_scores_value', 'reviews_per_month',
    'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes',
    'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms'
]

# Compute the correlation coefficients
correlations = airbnb[continuous_vars + ['price']].corr()
correlations['price'].sort_values(ascending=False)

plt.figure(figsize=(12, 10))
corr_matrix = airbnb[continuous_vars].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) 
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, mask=mask, fmt=".2f")
plt.title('Correlation between variables')
plt.tight_layout()
plt.show()
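The notes mention keeping only variables whose correlation with price is 0.3 or higher. A minimal sketch of that filter, assuming all candidate columns are numeric. The function name `strong_price_correlations` and the toy frame are invented here, and comparing on the absolute value is my reading of "0.3 or higher".

```python
import pandas as pd


def strong_price_correlations(frame, candidates, target="price", threshold=0.3):
    """Return candidates whose |Pearson r| with `target` is >= threshold."""
    # Skip candidate names that are not present in the frame.
    cols = [c for c in candidates if c in frame.columns]
    corr = frame[cols + [target]].corr()[target].drop(target)
    strong = corr[corr.abs() >= threshold]
    # Strongest relationships first, regardless of sign.
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data, invented for this example only.
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5, 6],
    "minimum_nights": [3, 1, 2, 1, 3, 2],
    "price": [50.0, 80.0, 120.0, 150.0, 210.0, 240.0],
})
strong = strong_price_correlations(
    toy, ["accommodates", "minimum_nights", "availability_30"]
)
print(strong)
```

On the toy frame only `accommodates` passes the cutoff; `availability_30` is skipped because the column is absent.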

Ask about this one too...

Correlation analysis of bool-type variables

## has_availability

airbnb['has_availability'].unique()

array(['t', nan], dtype=object)

I'll deal with this tomorrow...
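One possible way to handle a flag column that only contains `'t'` and NaN, sketched under a loud assumption: NaN is read as "not available" (False). That may well be wrong for this dataset, and dropping the NaN rows is the other obvious option. `flag_to_bool` is a hypothetical helper, not something from pandas.

```python
import numpy as np
import pandas as pd


def flag_to_bool(series, true_value="t"):
    """Map an Airbnb-style 't'/'f'/NaN flag column to plain booleans."""
    # ASSUMPTION: anything that is not exactly 't' (including NaN) is False.
    return series.eq(true_value)


# Toy column shaped like the unique() output above.
toy = pd.Series(["t", np.nan, "t", np.nan], dtype=object)
flag = flag_to_bool(toy)
print(flag.tolist())  # [True, False, True, False]

# A flag with a single distinct value has zero variance,
# so its correlation with price would be undefined.
print(flag.nunique() >= 2)  # True
```

If the NaN rows were dropped instead, every remaining value would be `'t'`, and the column could not be correlated with price at all.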

## 2
# price → convert to float (raw string avoids an invalid-escape warning for \$)
airbnb['price'] = airbnb['price'].replace(r'[\$,]', '', regex=True).astype(float)

# instant_bookable → convert to bool
if airbnb['instant_bookable'].dtype == 'object':
    airbnb['instant_bookable'] = airbnb['instant_bookable'].map({'t': True, 'f': False})

# Retry the correlation analysis
from scipy.stats import pointbiserialr

r2, p2 = pointbiserialr(airbnb['instant_bookable'], airbnb['price'])
print(f"[instant_bookable] correlation: {r2:.4f}, p-value: {p2:.4f}")

[instant_bookable] correlation: 0.1085, p-value: 0.0000

## 3 
r3, p3 = pointbiserialr(airbnb['has_license'], airbnb['price'])
print(f"[has_license] correlation: {r3:.4f}, p-value: {p3:.4f}")

[has_license] correlation: 0.0785, p-value: 0.0000
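`pointbiserialr` returns NaN if either column contains missing values, so here is a small sketch of a wrapper that drops incomplete rows first and reports how many rows were used. The name `pointbiserial_with_price` and the toy frame are made up for this example; the numbers have nothing to do with the results printed above.

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr


def pointbiserial_with_price(frame, flag_col, target="price"):
    """Point-biserial r between a binary column and `target`, ignoring NaN rows."""
    # Keep only rows where both the flag and the target are present.
    pair = frame[[flag_col, target]].dropna()
    flag = pair[flag_col].astype(int)
    if flag.nunique() < 2:
        raise ValueError(f"{flag_col} has a single value after dropping NaN rows")
    r, p = pointbiserialr(flag, pair[target])
    return float(r), float(p), len(pair)


# Toy data, invented for this example only.
toy = pd.DataFrame({
    "instant_bookable": [True, False, True, False, True, None],
    "price": [200.0, 100.0, 220.0, np.nan, 180.0, 90.0],
})
r, p, n = pointbiserial_with_price(toy, "instant_bookable")
print(f"r = {r:.4f}, p = {p:.4f}, rows used = {n}")
```

Reporting the row count next to r and p makes it easier to see how much data the missing values removed.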
