[Week 2 Assignment] Exploring Google Review Data

Hawaii

New York

Variable summary

Simple EDA

🌀 Code to revisit

Aggregating by state

alias("count"): names the column produced by the aggregation "count"

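from pyspark.sql.functions import count

state_cnt = meta_data.groupBy("state").agg(count("*").alias("count"))
state_cnt.show()  # print(state_cnt) would only show the schema, not the rows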

MISC

explode: a function that takes a column holding an array or a map and turns each array element, or each key-value pair of the map, into its own row

from pyspark.sql.functions import explode, col, count

# Select the nested MISC.Accessibility array and name the column "MISC"
misc_array = meta_data.select(col("MISC.Accessibility").alias("MISC"))

# Explode the array column so each element becomes its own row
exploded_misc = misc_array.select(explode("MISC").alias("key"))

# Group by the "key" column and count the occurrences, keeping only the keys
misc_keys = exploded_misc.groupBy("key").agg(count("*").alias("count")).select("key")

# Collect the results and convert them to a list
misc_keys_list = [row.key for row in misc_keys.collect()]

# Number of distinct keys (named num_keys so it does not shadow pyspark's count)
num_keys = len(misc_keys_list)

print(num_keys, misc_keys_list)
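To make the explode step easier to picture, here is a small plain-Python sketch of the same idea. It does not use Spark at all: the sample rows, the `accessibility` field name, and the `explode_values` helper are made up for illustration. It only mimics how explode emits one row per array element and skips rows whose array is null or empty, followed by the group-and-count step.

```python
from collections import Counter

# Hypothetical sample rows (not from the real dataset): each row holds an
# array of accessibility labels, or None when the field is missing.
rows = [
    {"name": "Cafe A", "accessibility": ["Wheelchair accessible entrance",
                                         "Wheelchair accessible parking lot"]},
    {"name": "Shop B", "accessibility": ["Wheelchair accessible entrance"]},
    {"name": "Park C", "accessibility": None},
]


def explode_values(rows, field):
    """Yield one value per array element, skipping None or empty arrays."""
    for row in rows:
        for value in row[field] or []:
            yield value


# "Explode": three output rows come from the two rows that have labels.
exploded = list(explode_values(rows, "accessibility"))

# "groupBy + count": how often each label occurs.
key_counts = Counter(exploded)

# Distinct labels and how many there are.
misc_keys_list = sorted(key_counts)
num_keys = len(misc_keys_list)

print(num_keys, misc_keys_list)
```

If the rows with a missing array should be kept, Spark offers explode_outer for that; the sketch above follows plain explode.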

That said, this code is a bit clunky.

Why use a UDF?