[250702] Final Project Day 22 - Word Cloud

jeonieee 2025. 7. 2. 20:09

import pandas as pd
import re

# Stopword list: function words, units, and generic listing terms
stopwords = {
    'a', 'an', 'the', 'for', 'to', 'of', 'in', 'on', 'at', 'by', 'and', 'or', 'with', 'as', 'from',
    '-', '_', "mm", "cm", "inch", "kg", "g", "oz", "pcs", "pack", "lot", "set", "new", "free",
    "brand", "type", "item", "sale", "hot", "best", "great", "good", "original", "us", "au", "uk"
}

# Regex for words that contain a digit (e.g. "5mm", "abc12", "2020model")
contains_digit = re.compile(r'\d')

# Preprocessing function: turn one title into a list of cleaned keywords
def preprocess_title(title):
    if pd.isna(title):
        return []

    title = str(title).lower()
    words = re.findall(r'\b\w+\b', title)

    filtered = []
    for word in words:
        if (
            word not in stopwords and        # drop stopwords
            len(word) >= 2 and               # drop words that are too short
            not contains_digit.search(word)  # drop words containing digits (covers digit-only words too)
        ):
            filtered.append(word)

    return filtered

# Apply
# Missing titles are handled inside preprocess_title. Calling astype(str) on the
# column first would turn NaN into the string "nan", which would then count as a keyword.
df_pick['word'] = df_pick['title'].apply(preprocess_title)
df_pick = df_pick.explode('word').reset_index(drop=True)

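Before building the word cloud, I preprocessed the words in the titles.

Useless words are removed, and to get rid of tokens like 5mm and 20mm, any word containing a digit is removed as well.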
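To make the effect of the filter concrete, here is a minimal sketch run on a single made-up title. The shortened stopword set, the helper names `keep_word` and `tokenize`, and the sample title are all assumptions for illustration, not the project's actual data.

```python
import re

# Hypothetical, shortened stopword set for illustration only
STOPWORDS = {"for", "with", "new", "pcs", "mm"}
HAS_DIGIT = re.compile(r"\d")

def keep_word(word):
    """Return True when a token should survive the filter."""
    return (
        word not in STOPWORDS
        and len(word) >= 2
        and not HAS_DIGIT.search(word)
    )

def tokenize(title):
    """Lowercase a title, split it into word tokens, and filter them."""
    return [w for w in re.findall(r"\b\w+\b", title.lower()) if keep_word(w)]

# Made-up product title
sample = "New 5mm Black Leather Strap for 20mm Watch 2pcs"
print(tokenize(sample))  # ['black', 'leather', 'strap', 'watch']
```

Tokens such as "5mm", "20mm", and "2pcs" are dropped by the digit check, while "new" and "for" are dropped as stopwords.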

These are our keywords.

The top 1,000 ...
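Picking the top keywords comes down to counting how often each word appears and keeping the most frequent ones. The post does not show how this was done, so the following is only a sketch using the standard library; the function name `top_keywords` and the sample list are assumptions.

```python
from collections import Counter

def top_keywords(words, n=1000):
    """Return the n most frequent words as a {word: count} dict."""
    # Skip non-string entries such as None or NaN left by titles with no keywords
    counts = Counter(w for w in words if isinstance(w, str))
    return dict(counts.most_common(n))

# Hypothetical exploded "word" column; None stands in for an empty title
words = ["black", "strap", "black", None, "watch", "black", "strap"]
freqs = top_keywords(words, n=2)
print(freqs)  # {'black': 3, 'strap': 2}
```

A word-to-count dict like this is the shape that word cloud libraries usually accept; for example, the `wordcloud` package has `WordCloud.generate_from_frequencies`, assuming that is the library in use here.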

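I'd like to split this up and show it per category, but when I do, that big "black" gets split across the categories and comes out different in each one....

Hmm... what should I do...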
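One possible direction, offered only as a sketch and not something decided in this post: count words per category, then score each word by how much more common it is inside a category than overall. A word like "black" that is spread evenly across categories scores close to 1, while category-specific words score higher. The category names, the sample rows, and both function names are hypothetical.

```python
from collections import Counter, defaultdict

def category_frequencies(rows):
    """Group (category, word) pairs into one Counter per category."""
    per_category = defaultdict(Counter)
    for category, word in rows:
        per_category[category][word] += 1
    return per_category

def distinctiveness(per_category):
    """Score each word by its share within a category divided by its overall share."""
    overall = Counter()
    for counts in per_category.values():
        overall.update(counts)
    total = sum(overall.values())

    scores = {}
    for category, counts in per_category.items():
        cat_total = sum(counts.values())
        scores[category] = {
            word: (count / cat_total) / (overall[word] / total)
            for word, count in counts.items()
        }
    return scores

# Made-up (category, word) rows
rows = [
    ("watch", "black"), ("watch", "strap"), ("watch", "strap"),
    ("bag", "black"), ("bag", "leather"), ("bag", "leather"),
]
per_category = category_frequencies(rows)
scores = distinctiveness(per_category)
print(scores["watch"])
```

Whether to draw each category's cloud from the raw counts or from a score like this depends on what the chart is meant to show: raw counts keep common words such as "black" large everywhere, while the ratio highlights what sets each category apart.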
