[Automation] Extracting YouTube Video URLs with BeautifulSoup
pbj0812
2020. 5. 24. 19:01
0. Goal
- Extract YouTube video URLs
- selenium is too slow, so use requests and BeautifulSoup instead
1. Walkthrough
1) Import the libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
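2) Extract the URLs
keyword = '미르방'
# Fetch the search results page for the keyword
req = requests.get('https://www.youtube.com/results?search_query=' + keyword)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
# Select every <a> that is a direct child of an <h3>
my_titles = soup.select('h3 > a')
title = []
url = []
# Collect the link text and href of each match
for idx in my_titles:
    title.append(idx.text)
    url.append(idx.get('href'))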
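The request above glues the keyword straight onto the URL. That works for a simple keyword, but a keyword containing spaces or `&` would change the query string. Below is a minimal sketch of building the same URL with the standard library's `urlencode`. The helper name `build_search_url` is made up for this example and is not part of the original code.

```python
from urllib.parse import urlencode


# Hypothetical helper for illustration; not part of the original post.
def build_search_url(keyword):
    """Return a YouTube search URL with the keyword percent-encoded."""
    base = 'https://www.youtube.com/results'
    return base + '?' + urlencode({'search_query': keyword})


# A keyword with spaces and '&' stays inside the search_query parameter.
print(build_search_url('cats & dogs'))
# A non-ASCII keyword is percent-encoded as UTF-8.
print(build_search_url('미르방'))
```

The returned string can be passed to `requests.get` in place of the concatenated URL.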
3) Build a DataFrame
title_list = pd.DataFrame(url, columns = ['url'])
title_list['title'] = title
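The two lines above create the `url` column first and then attach `title`. As an alternative sketch, both columns can be set in one call by passing a dict to `pd.DataFrame`. The list values here are made-up samples standing in for the scraped results.

```python
import pandas as pd

# Made-up sample values standing in for the scraped lists.
url = ['/watch?v=aaa', '/watch?v=bbb']
title = ['First video', 'Second video']

# Passing a dict creates both columns at once, in the order of the keys.
title_list = pd.DataFrame({'url': url, 'title': title})
print(title_list)
```

Both lists must have the same length, which holds here because each loop iteration appends to both.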
4) Check the result
title_list
- The channel URLs come along with the video URLs
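Why the channel links show up can be seen on a small static snippet. The sketch below runs the same `h3 > a` selector on hand-written HTML. The markup and the IDs in it are invented for illustration and only assume that both video and channel links sit in an `<a>` directly under an `<h3>`.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the structure and IDs are assumptions.
sample_html = """
<div>
  <h3><a href="/watch?v=VIDEO_ID_1">First video</a></h3>
  <h3><a href="/channel/CHANNEL_ID">Some channel</a></h3>
  <h3><a href="/watch?v=VIDEO_ID_2">Second video</a></h3>
  <p><a href="/about">Not inside an h3</a></p>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# 'h3 > a' matches any <a> whose parent is an <h3>, whatever its href is.
hrefs = [a.get('href') for a in soup.select('h3 > a')]
print(hrefs)
# ['/watch?v=VIDEO_ID_1', '/channel/CHANNEL_ID', '/watch?v=VIDEO_ID_2']
```

The selector only looks at tag structure, so any filtering by link type has to happen on the `href` value afterwards.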
5) Improve the code
import requests
import pandas as pd
from bs4 import BeautifulSoup
keyword = '미르방'
req = requests.get('https://www.youtube.com/results?search_query=' + keyword)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
my_titles = soup.select('h3 > a')
title = []
url = []
for idx in my_titles:
    href = idx.get('href')
    # Keep only video links; skip channel links and anchors without an href
    if href and href.startswith('/watch?'):
        title.append(idx.text)
        url.append(href)
title_list = pd.DataFrame(url, columns=['url'])
title_list['title'] = title
6) Check the result
title_list
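The collected `href` values are relative paths such as `/watch?v=...`, so they still need the site prefix before they can be opened. A rough sketch using only the standard library follows. `to_absolute` and `video_id` are hypothetical helper names, and `abc123` is a placeholder ID.

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE = 'https://www.youtube.com'


# Hypothetical helpers for illustration; not part of the original post.
def to_absolute(href):
    """Turn a relative href into a full URL."""
    return urljoin(BASE, href)


def video_id(href):
    """Return the value of the 'v' query parameter, or None if absent."""
    values = parse_qs(urlparse(href).query).get('v')
    return values[0] if values else None


print(to_absolute('/watch?v=abc123'))  # https://www.youtube.com/watch?v=abc123
print(video_id('/watch?v=abc123'))     # abc123
print(video_id('/channel/xyz'))        # None
```

With the DataFrame above, something like `title_list['url'].map(to_absolute)` would produce a column of full URLs.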