SKKU |Python |Day. 03 |
Pandas Crawling II

[Implemented with Google Colab and Python // Day. 03]

Pandas Crawling II study

#09DaysOfCode _ 20220720

This post is a simple study log, written from what I learned in the 격-파이썬 (Gyeok-Python) program run by Sungkyunkwan University.

#09DaysOfCode #Day03

[Day 03 study]

[Pandas crawling]

from bs4 import BeautifulSoup  # HTML parsing
import pandas as pd
import requests  # page requests
import time

#3. Find the tags

#4. Clean up

[Saving pension lottery data // format: [2,7,4,9,0,6,3]]

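- Method 1 / Crawling the round number

> Code

from bs4 import BeautifulSoup
import requests
import time

# Daum search results for "연금복권" (pension lottery)
url = requests.get("https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C")
html = BeautifulSoup(url.text)
html

# Text of the first <span> inside <div class="prize">
html2 = html.find("div",class_="prize").find('span').text

# Characters 1 to 3 hold the round number (this assumes a three-digit round)
html3 = html2[1:4]
current = int(html3)
current

> Output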

115
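The fixed slice `html2[1:4]` only works while the round number has exactly three digits. A minimal sketch of a more tolerant approach is to take the first run of digits with a regular expression. The helper name `extract_round` and the sample strings are assumptions for illustration; the real span text on the Daum page may look different.

```python
import re

def extract_round(span_text):
    """Return the first run of digits in span_text as an int (hypothetical helper)."""
    match = re.search(r"\d+", span_text)
    if match is None:
        raise ValueError(f"no round number found in {span_text!r}")
    return int(match.group())

# Sample strings are assumed, not copied from the live page
print(extract_round("제115회"))
print(extract_round("제1024회"))
```

Because the pattern matches any number of digits, the same call keeps working when the round count grows past 999.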

- Method 1 / Crawling the winning numbers (complete)

> Code

from bs4 import BeautifulSoup
import requests
import time

url = requests.get("https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C")
html = BeautifulSoup(url.text)
html

html2 = html.find("div",class_="prize").find('span').text
html3 = html2[1:4]
current = int(html3)
current

total = []
for n in range(1,101):
    # Search page for round n of 연금복권 720+
    url = requests.get(f"https://search.daum.net/search?w=tot&DA=JIM&rtmaxcoll=JIM&&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C%20720%2B%20{n}%ED%9A%8C%EC%B0%A8")
    html = BeautifulSoup(url.text)
    numbers = html.find("div",class_="inner").text.split()
    #del numbers [-2]
    numbers = list(map(int,numbers))
    total.append(numbers)
    print(f"Round {n} pension lottery data saved ... {numbers}")
    time.sleep(1)  # wait one second between requests

> Output

Round 1 pension lottery data saved ... [1, 6, 2, 1, 3, 2]
Round 2 pension lottery data saved ... [4, 5, 0, 5, 5, 8]
Round 3 pension lottery data saved ... [5, 4, 4, 9, 5, 5]
Round 4 pension lottery data saved ... [1, 2, 4, 4, 2, 0]
Round 5 pension lottery data saved ... [7, 5, 4, 6, 5, 5]
Round 6 pension lottery data saved ... [1, 9, 3, 2, 0, 2]
Round 7 pension lottery data saved ... [5, 9, 7, 0, 9, 3]
Round 8 pension lottery data saved ... [2, 3, 4, 0, 5, 8]
Round 9 pension lottery data saved ... [1, 3, 3, 5, 1, 0]
Round 10 pension lottery data saved ... [7, 7, 0, 1, 7, 3]
Round 11 pension lottery data saved ... [8, 6, 7, 6, 5, 4]
Round 12 pension lottery data saved ... [0, 7, 6, 6, 7, 6]
Round 13 pension lottery data saved ... [6, 6, 9, 2, 4, 5]
Round 14 pension lottery data saved ... [4, 3, 2, 4, 9, 6]
Round 15 pension lottery data saved ... [4, 7, 7, 2, 3, 8]
Round 16 pension lottery data saved ... [6, 6, 4, 0, 5, 6]
Round 17 pension lottery data saved ... [3, 1, 7, 2, 2, 7]
Round 18 pension lottery data saved ... [5, 6, 2, 2, 2, 2]
Round 19 pension lottery data saved ... [8, 5, 9, 2, 1, 9]
Round 20 pension lottery data saved ... [8, 1, 9, 6, 0, 5]
Round 21 pension lottery data saved ... [2, 3, 9, 9, 3, 7]
Round 22 pension lottery data saved ... [9, 1, 3, 6, 2, 2]
Round 23 pension lottery data saved ... [1, 5, 8, 0, 7, 1]
Round 24 pension lottery data saved ... [0, 0, 2, 9, 8, 9]
Round 25 pension lottery data saved ... [4, 6, 3, 6, 3, 7]
Round 26 pension lottery data saved ... [1, 9, 2, 8, 5, 6]
Round 27 pension lottery data saved ... [4, 1, 2, 0, 0, 8]
Round 28 pension lottery data saved ... [0, 8, 0, 8, 3, 9]

...

- Method 2 (standard) / Crawling the round number and the winning numbers

> Code

#>>>
from bs4 import BeautifulSoup
import requests
import time

url = requests.get("https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C")
html = BeautifulSoup(url.text)

# Drop the first and last characters that surround the round number
current = html.find('span',class_='f_red').text[1:-1]
current

# Text of the first-prize row
numbers = html.find("tr",class_="fst").text
numbers

> Output

'1등 월 700만원 x 20년 2조 7 4 9 0 6 3 '
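The row text above still mixes the prize description with the numbers. A small sketch of the cleaning step, applied to exactly that output string, could look like this. The function name `parse_first_prize` is made up for illustration, and the index `5` assumes the row always starts with the same five description tokens.

```python
def parse_first_prize(row_text):
    """Turn the first-prize row text into [group, d1, ..., d6] (hypothetical helper)."""
    tokens = row_text.split()[5:]            # skip "1등 월 700만원 x 20년"
    tokens[0] = tokens[0].replace("조", "")  # "2조" -> "2"
    return [int(t) for t in tokens]

row = '1등 월 700만원 x 20년 2조 7 4 9 0 6 3 '
print(parse_first_prize(row))  # → [2, 7, 4, 9, 0, 6, 3]
```

The result matches the `[2,7,4,9,0,6,3]` format named in the section title: the group (조) first, then the six digits.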

- Method 2 (standard) / Cleaning and saving the data, crawling every round (complete)

> Code

#>>>
from bs4 import BeautifulSoup
import requests
import time
import pickle

url = requests.get("https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C")
html = BeautifulSoup(url.text)
current = int(html.find('span',class_='f_red').text[1:-1])

# Load the rounds saved by an earlier run
try:
    f = open("pension.dat",'rb')
    total = pickle.load(f)
    f.close()
except Exception:
    total = []  # runs when the code inside try raises an error (for example, no file yet)

# Fetch only the rounds that are not saved yet
for n in range(len(total)+1, current+1):
    url = requests.get(f"https://search.daum.net/search?w=tot&DA=JIM&rtmaxcoll=JIM&&q=%EC%97%B0%EA%B8%88%EB%B3%B5%EA%B6%8C%20720%2B%20{n}%ED%9A%8C%EC%B0%A8")
    html = BeautifulSoup(url.text)
    numbers = html.find("tr",class_="fst").text.split()[5:]
    numbers[0] = numbers[0].replace("조","")  # strip the 조 (group) suffix, e.g. "2조" -> "2"
    numbers = list(map(int,numbers))
    total.append(numbers)
    print("Round {} pension lottery saved ... {}".format(n, numbers))
    time.sleep(1)

# Save the accumulated list for the next run
f = open("pension.dat",'wb')
pickle.dump(total, f)
f.close()

> Output

Round 1 pension lottery saved ... [4, 1, 6, 2, 1, 3, 2]
Round 2 pension lottery saved ... [2, 4, 5, 0, 5, 5, 8]
Round 3 pension lottery saved ... [4, 5, 4, 4, 9, 5, 5]
Round 4 pension lottery saved ... [4, 1, 2, 4, 4, 2, 0]
Round 5 pension lottery saved ... [4, 7, 5, 4, 6, 5, 5]
Round 6 pension lottery saved ... [5, 1, 9, 3, 2, 0, 2]
Round 7 pension lottery saved ... [2, 5, 9, 7, 0, 9, 3]
Round 8 pension lottery saved ... [4, 2, 3, 4, 0, 5, 8]
Round 9 pension lottery saved ... [3, 1, 3, 3, 5, 1, 0]
Round 10 pension lottery saved ... [2, 7, 7, 0, 1, 7, 3]
Round 11 pension lottery saved ... [1, 8, 6, 7, 6, 5, 4]
Round 12 pension lottery saved ... [5, 0, 7, 6, 6, 7, 6]
Round 13 pension lottery saved ... [1, 6, 6, 9, 2, 4, 5]
Round 14 pension lottery saved ... [3, 4, 3, 2, 4, 9, 6]
Round 15 pension lottery saved ... [4, 4, 7, 7, 2, 3, 8]
Round 16 pension lottery saved ... [4, 6, 6, 4, 0, 5, 6]
Round 17 pension lottery saved ... [3, 3, 1, 7, 2, 2, 7]
Round 18 pension lottery saved ... [1, 5, 6, 2, 2, 2, 2]

...

# Unlike Method 1, Method 2 also stores the group (조) as the first value
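The resume logic above (load what was saved, then continue from `len(total)+1`) can be tried without touching the network. The sketch below is one possible way to write it with `with` blocks and narrower exceptions; `load_rounds` and `save_rounds` are hypothetical helper names, and the demo writes to a temporary directory rather than a real `pension.dat`.

```python
import os
import pickle
import tempfile

def load_rounds(path):
    """Return the saved list of rounds, or [] when no usable file exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        return []

def save_rounds(path, rounds):
    """Write the list of rounds to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(rounds, f)

# Demo in a temporary directory so no real pension.dat is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pension.dat")
    total = load_rounds(path)             # first run: nothing saved yet
    total.append([4, 1, 6, 2, 1, 3, 2])   # round 1, as printed in the output above
    save_rounds(path, total)
    resumed = load_rounds(path)           # second run: picks up the saved round
    next_round = len(resumed) + 1         # the crawl would continue from here
    print(resumed, next_round)
```

Catching only the file-related exceptions means a typo elsewhere in the block still raises instead of silently resetting the list.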

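- Crawling Naver KOSPI/KOSDAQ data

> Code

from bs4 import BeautifulSoup
import requests
import time

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0")
html = BeautifulSoup(url.text)

# ['href'] reads the link attribute of the "last page" button (class pgRR);
# the page number is whatever follows the final '='.
# Slicing with [-2:] instead would break once the page count has three digits.
kospi_page = int(html.find("td", class_="pgRR").find("a")['href'].split('=')[-1])
kospi_page

> Output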

- Extracting the table from the KOSPI page

> Code

from bs4 import BeautifulSoup
import requests
import time

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0")
html = BeautifulSoup(url.text)

# The market summary lives in <table class="type_2">
table = html.find("table",class_="type_2")
table

> Output

<table cellpadding="0" cellspacing="0" class="type_2" summary="코스피 시세정보를 선택한 항목에 따라 정보를 제공합니다."> <caption>코스피</caption> <colgroup> <col width="2%"/> <col width="*"/> <col width="7%"/> <col width="9%"/> <col width="7%"/> <col width="8%"/> <col width="8%"/> <col width="8%"/> <col width="8%"/> <col width="8%"/> <col width="8%"/> <col width="8%"/> <col width="6%"/> </colgroup> <thead> <tr> <th scope="col">N</th> <th scope="col">종목명</th> <th scope="col">현재가</th> <th class="tr" scope="col" style="padding-right:8px">전일비</th> <th scope="col">등락률</th> <th scope="col">액면가</th> <th scope="col">시가총액</th> <th scope="col">상장주식수</th> <th scope="col">외국인비율</th> <th scope="col">거래량</th> <th scope="col">PER</th> <th scope="col">ROE</th> <th scope="col">토론실</th> </tr> </thead> <tbody> <tr><td class="blank_08" colspan="10"></td></tr> <tr onmouseout="mouseOut(this)" onmouseover="mouseOver(this)"> <td class="no">1</td> <td><a class="tltle" href="/item/main.naver?code=005930">삼성전자</a></td> <td class="number">60,500</td> <td class="number"> <img alt="하락" height="6" src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" style="margin-right:4px;" width="7"/><span class="tah p11 nv01"> 400 </span> </td> <td class="number"> <span class="tah p11 nv01"> -0.66% </span>

...

- Converting the KOSPI table (the <table> tag, not the raw page text) into a pandas DataFrame

> Code

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0")
html = BeautifulSoup(url.text)

table = html.find("table",class_="type_2")
table

# str(table) turns the BeautifulSoup tag back into HTML text that pandas can parse;
# read_html returns a list of DataFrames, so take the first one
table = pd.read_html(str(table))[0]
table

> Output

N	종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE	토론실
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1.0	삼성전자	60700.0	200.0	-0.33%	100.0	3623658.0	5969783.0	49.78	12996594.0	9.53	13.92	NaN
2	2.0	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	146884.0	122.37	10.68	NaN
3	3.0	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3296300.0	7.01	16.84	NaN
4	4.0	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	27264.0	114.91	8.21	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...
76	49.0	한화솔루션	33500.0	750.0	+2.29%	5000.0	64078.0	191278.0	19.14	735161.0	19.32	8.79	NaN
77	50.0	하이브	151500.0	1000.0	-0.66%	500.0	62650.0	41353.0	15.50	155142.0	39.06	6.83	NaN
78	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
79	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
80	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

81 rows × 13 columns

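- Data cleaning

> Code

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

# html is the page parsed in the previous step
table = html.find("table",class_="type_2")
table = pd.read_html(str(table))[0]  # convert the BeautifulSoup tag back into HTML text for pandas

# Drop the 토론실 (discussion board) and N (row number) columns
del table['토론실']
del table['N']
table

> Output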

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	삼성전자	60700.0	200.0	-0.33%	100.0	3623658.0	5969783.0	49.78	12996594.0	9.53	13.92
2	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	146884.0	122.37	10.68
3	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3296300.0	7.01	16.84
4	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	27264.0	114.91	8.21
...	...	...	...	...	...	...	...	...	...	...	...
76	한화솔루션	33500.0	750.0	+2.29%	5000.0	64078.0	191278.0	19.14	735161.0	19.32	8.79
77	하이브	151500.0	1000.0	-0.66%	500.0	62650.0	41353.0	15.50	155142.0	39.06	6.83
78	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
79	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
80	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

81 rows × 11 columns

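- Data cleaning 2 // removing rows whose 종목명 (stock name) is empty

> Code

from bs4 import BeautifulSoup
import requests
import time
import pickle
import pandas as pd

# html is the page parsed in the previous step
table = html.find("table",class_="type_2")
table = pd.read_html(str(table))[0]  # convert the BeautifulSoup tag back into HTML text for pandas

del table['토론실']
del table['N']

# Keep only the rows that have a stock name; the spacer rows are all NaN
table[table['종목명'].notnull()]

> Output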

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE
1	삼성전자	60700.0	200.0	-0.33%	100.0	3623658.0	5969783.0	49.78	12996594.0	9.53	13.92
2	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	146884.0	122.37	10.68
3	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3296300.0	7.01	16.84
4	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	27264.0	114.91	8.21
5	삼성전자우	56200.0	100.0	-0.18%	100.0	462462.0	822887.0	72.15	699541.0	8.82	NaN
9	NAVER	246000.0	2500.0	+1.03%	100.0	403561.0	164049.0	53.39	549289.0	29.91	106.72
10	현대차	188500.0	0.0	0.00%	5000.0	402765.0	213668.0	26.94	873397.0	10.04	6.84
11	삼성SDI	557000.0	12000.0	+2.20%	5000.0	383018.0	68765.0	42.71	224890.0	28.16	8.45
12	LG화학	536000.0	15000.0	+2.88%	5000.0	378375.0	70592.0	47.62	182883.0	13.75	18.47
13	기아	81700.0	400.0	-0.49%	5000.0	331182.0	405363.0	36.21	1041825.0	6.96	14.69
17	카카오	73400.0	600.0	+0.82%	100.0	326592.0	444948.0	28.83	1560941.0	13.22	17.10
18	셀트리온	184500.0	1000.0	+0.54%	1000.0	259717.0	140768.0	20.69	193485.0	51.45	16.04
19	삼성물산	116000.0	500.0	+0.43%	100.0	216789.0	186887.0	16.39	148550.0	16.50	5.40
20	현대모비스	218500.0	1000.0	-0.46%	5000.0	206642.0	94573.0	34.09	212467.0	9.11	6.87
21	POSCO홀딩스	231500.0	1500.0	+0.65%	5000.0	201838.0	87187.0	53.04	148459.0	2.76	13.97
25	KB금융	48400.0	2100.0	+4.54%	5000.0	199579.0	412352.0	72.75	1589763.0	4.38	9.80
26	신한지주	35400.0	800.0	+2.31%	5000.0	181579.0	512934.0	61.91	1389303.0	4.47	8.80
27	SK이노베이션	178000.0	1000.0	+0.56%	5000.0	164589.0	92466.0	25.40	199655.0	11.31	1.91
28	SK	214000.0	2000.0	-0.93%	200.0	158680.0	74149.0	22.08	76923.0	5.22	10.19
29	LG전자	93500.0	600.0	+0.65%	5000.0	153011.0	163648.0	26.82	532874.0	13.42	6.32
33	카카오뱅크	30850.0	250.0	+0.82%	5000.0	146944.0	476317.0	12.54	701619.0	62.45	4.91
34	한국전력	21900.0	0.0	0.00%	5000.0	140590.0	641964.0	14.67	837832.0	-1.24	-7.99
35	LG	77900.0	900.0	+1.17%	5000.0	122537.0	157301.0	35.91	115422.0	4.41	12.36
36	크래프톤	245000.0	10000.0	+4.26%	100.0	120227.0	49072.0	29.10	303572.0	20.15	17.86
37	삼성생명	58900.0	900.0	+1.55%	500.0	117800.0	200000.0	13.21	175550.0	18.10	4.01
41	HMM	24050.0	400.0	+1.69%	5000.0	117614.0	489039.0	9.02	1491344.0	1.23	88.62
42	SK텔레콤	53200.0	800.0	-1.48%	100.0	116419.0	218833.0	48.26	472528.0	7.91	13.63
43	두산에너빌리티	17600.0	50.0	+0.28%	5000.0	112342.0	638308.0	9.99	1715397.0	26.39	10.67
44	KT&G	81600.0	300.0	+0.37%	5000.0	112031.0	137292.0	40.09	205748.0	11.58	10.74
45	LG생활건강	717000.0	7000.0	+0.99%	5000.0	111982.0	15618.0	37.92	34533.0	18.17	16.65
49	현대중공업	122000.0	1000.0	-0.81%	5000.0	108303.0	88773.0	5.77	117409.0	-9.89	-14.87
50	하나금융지주	36450.0	800.0	+2.24%	5000.0	107857.0	295903.0	72.73	1899231.0	3.05	10.86
51	S-Oil	93000.0	200.0	-0.21%	2500.0	104702.0	112583.0	82.68	298748.0	5.69	21.76
52	삼성전기	139500.0	500.0	+0.36%	5000.0	104198.0	74694.0	25.78	370691.0	11.11	14.29
53	삼성에스디에스	132500.0	500.0	+0.38%	500.0	102526.0	77378.0	13.19	80897.0	15.71	8.80
57	SK바이오사이언스	132500.0	0.0	0.00%	500.0	101739.0	76784.0	3.73	775057.0	29.73	38.08
58	KT	36750.0	300.0	-0.81%	5000.0	95959.0	261112.0	45.02	901920.0	6.55	9.36
59	삼성화재	198500.0	1500.0	-0.75%	500.0	94039.0	47375.0	49.57	62953.0	8.86	7.09
60	대한항공	25450.0	250.0	+0.99%	5000.0	93712.0	368221.0	12.98	930785.0	7.63	11.60
61	포스코케미칼	118000.0	4500.0	+3.96%	500.0	91407.0	77463.0	6.36	284517.0	70.07	7.92
65	카카오페이	67800.0	1000.0	+1.50%	500.0	89855.0	132529.0	43.64	362526.0	-342.42	-2.45
66	우리금융지주	11900.0	300.0	+2.59%	5000.0	86639.0	728061.0	40.12	3242519.0	3.13	10.59
67	고려아연	455500.0	1000.0	-0.22%	5000.0	85953.0	18870.0	20.42	32792.0	11.38	11.07
68	엔씨소프트	369500.0	2000.0	+0.54%	500.0	81120.0	21954.0	42.34	91312.0	16.73	12.62
69	아모레퍼시픽	137500.0	4000.0	+3.00%	500.0	80428.0	58493.0	26.08	164241.0	53.96	4.20
73	LG이노텍	336000.0	7500.0	+2.28%	5000.0	79521.0	23667.0	25.45	288853.0	8.79	30.94
74	기업은행	9340.0	140.0	+1.52%	5000.0	69518.0	744301.0	13.84	1122239.0	3.00	9.21
75	현대글로비스	182000.0	2500.0	-1.36%	500.0	68250.0	37500.0	45.15	62338.0	7.22	14.41
76	한화솔루션	33500.0	750.0	+2.29%	5000.0	64078.0	191278.0	19.14	735161.0	19.32	8.79
77	하이브	151500.0	1000.0	-0.66%	500.0	62650.0	41353.0	15.50	155142.0	39.06	6.83

- Creating an Excel file, merging DataFrames

> Code

from bs4 import BeautifulSoup
import requests
import time
import pickle
import pandas as pd
from tqdm import tqdm  # shows the progress of a for loop

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page=1")
html = BeautifulSoup(url.text)
kospi_page = int(html.find("td",class_="pgRR").find("a")['href'].split('=')[-1])

total = []
for n in tqdm(range(1, kospi_page+1)):
    url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page={}".format(n))
    html = BeautifulSoup(url.text)
    table = html.find("table",class_="type_2")
    table = pd.read_html(str(table))[0]  # convert the BeautifulSoup tag back into HTML text for pandas
    del table['토론실']
    del table['N']
    table = table[table['종목명'].notnull()]
    total.append(table)
    time.sleep(1)

# total[0] shows the table of a single page

# pd.concat merges the list of DataFrames into one DataFrame;
# ignore_index=True discards the per-page row numbers and renumbers from 0
kospi = pd.concat(total, ignore_index=True)

kospi['소속'] = ['KOSPI']*len(kospi)
kospi.to_excel('kospi.xlsx')
kospi

# After this code runs, kospi.xlsx appears in the Colab file panel

> Output

100%|██████████| 37/37 [00:52<00:00, 1.41s/it]

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE	소속
0	삼성전자	60500.0	400.0	-0.66%	100.0	3611718.0	5969783.0	49.78	16698160.0	9.49	13.92	KOSPI
1	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	180543.0	122.37	10.68	KOSPI
2	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3962564.0	7.01	16.84	KOSPI
3	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	35494.0	114.91	8.21	KOSPI
4	삼성전자우	56200.0	100.0	-0.18%	100.0	462462.0	822887.0	72.15	779200.0	8.82	NaN	KOSPI
...	...	...	...	...	...	...	...	...	...	...	...	...
1825	KBSTAR 모멘텀밸류	12060.0	100.0	+0.84%	0.0	12.0	100.0	0.00	1190.0	NaN	NaN	KOSPI
1826	KBSTAR 200에너지화학	10035.0	170.0	+1.72%	0.0	10.0	100.0	0.00	136.0	NaN	NaN	KOSPI
1827	KBSTAR 200생활소비재	6650.0	80.0	+1.22%	0.0	9.0	140.0	0.00	462.0	NaN	NaN	KOSPI
1828	KBSTAR 200산업재	10820.0	50.0	+0.46%	0.0	9.0	80.0	0.00	16.0	NaN	NaN	KOSPI
1829	KBSTAR 200경기소비재	9625.0	75.0	+0.79%	0.0	8.0	80.0	0.00	5.0	NaN	NaN	KOSPI

1830 rows × 12 columns

# The index runs from 0 to 1829, one label for each of the 1830 rows
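The page count used for the loop comes from splitting the `pgRR` link's `href` on `=` and taking the last piece, which relies on `page` being the final query parameter. As a hedged alternative sketch, the standard library can read the parameter by name. `last_page_from_href` is a hypothetical helper, and the sample `href` values are assumed examples of what the link might contain (37 matches the page count in the progress bar above).

```python
from urllib.parse import urlparse, parse_qs

def last_page_from_href(href):
    """Read the `page` query parameter from a pagination link (hypothetical helper)."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])

# Assumed example of the href on the "last page" (pgRR) link
href = "/sise/sise_market_sum.naver?sosok=0&page=37"
print(last_page_from_href(href))  # → 37

# Still correct if the parameter order ever changes
print(last_page_from_href("/sise/sise_market_sum.naver?page=120&sosok=1"))  # → 120
```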

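- Adding a 소속 (market) column so KOSPI and KOSDAQ rows can be told apart on one sheet

> Code

# Every row in this DataFrame belongs to KOSPI
kospi['소속'] = ['KOSPI']*len(kospi)
kospi

> Output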

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE	소속
0	삼성전자	60500.0	400.0	-0.66%	100.0	3611718.0	5969783.0	49.78	16698160.0	9.49	13.92	KOSPI
1	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	180543.0	122.37	10.68	KOSPI
2	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3962564.0	7.01	16.84	KOSPI
3	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	35494.0	114.91	8.21	KOSPI
4	삼성전자우	56200.0	100.0	-0.18%	100.0	462462.0	822887.0	72.15	779200.0	8.82	NaN	KOSPI
...	...	...	...	...	...	...	...	...	...	...	...	...
1825	KBSTAR 모멘텀밸류	12060.0	100.0	+0.84%	0.0	12.0	100.0	0.00	1190.0	NaN	NaN	KOSPI
1826	KBSTAR 200에너지화학	10035.0	170.0	+1.72%	0.0	10.0	100.0	0.00	136.0	NaN	NaN	KOSPI
1827	KBSTAR 200생활소비재	6650.0	80.0	+1.22%	0.0	9.0	140.0	0.00	462.0	NaN	NaN	KOSPI
1828	KBSTAR 200산업재	10820.0	50.0	+0.46%	0.0	9.0	80.0	0.00	16.0	NaN	NaN	KOSPI
1829	KBSTAR 200경기소비재	9625.0	75.0	+0.79%	0.0	8.0	80.0	0.00	5.0	NaN	NaN	KOSPI

1830 rows × 12 columns

- Index cleaning

> Code

# Build index labels such as 코스피1등 ("KOSPI rank 1")
box = []
for i in range(len(kospi)):
    box.append('코스피{}등'.format(i+1))

kospi.index = box
kospi.to_excel('kospi.xlsx')
kospi

> Output

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE	소속
코스피1등	삼성전자	60500.0	400.0	-0.66%	100.0	3611718.0	5969783.0	49.78	16698160.0	9.49	13.92	KOSPI
코스피2등	LG에너지솔루션	384000.0	1000.0	+0.26%	500.0	898560.0	234000.0	3.26	180543.0	122.37	10.68	KOSPI
코스피3등	SK하이닉스	102000.0	2000.0	+2.00%	5000.0	742562.0	728002.0	49.71	3962564.0	7.01	16.84	KOSPI
코스피4등	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	35494.0	114.91	8.21	KOSPI
코스피5등	삼성전자우	56200.0	100.0	-0.18%	100.0	462462.0	822887.0	72.15	779200.0	8.82	NaN	KOSPI
...	...	...	...	...	...	...	...	...	...	...	...	...
코스피1826등	KBSTAR 모멘텀밸류	12060.0	100.0	+0.84%	0.0	12.0	100.0	0.00	1190.0	NaN	NaN	KOSPI
코스피1827등	KBSTAR 200에너지화학	10035.0	170.0	+1.72%	0.0	10.0	100.0	0.00	136.0	NaN	NaN	KOSPI
코스피1828등	KBSTAR 200생활소비재	6650.0	80.0	+1.22%	0.0	9.0	140.0	0.00	462.0	NaN	NaN	KOSPI
코스피1829등	KBSTAR 200산업재	10820.0	50.0	+0.46%	0.0	9.0	80.0	0.00	16.0	NaN	NaN	KOSPI
코스피1830등	KBSTAR 200경기소비재	9625.0	75.0	+0.79%	0.0	8.0	80.0	0.00	5.0	NaN	NaN	KOSPI

1830 rows × 12 columns

- Combining the code above, without the rank index // standard version (same as the code above, so it can be skipped)

> Code

#>>>
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from tqdm import tqdm  # shows the progress of a for loop

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page=1")
html = BeautifulSoup(url.text)
kospi_page = int(html.find("td", class_ = "pgRR").find("a")['href'].split('=')[-1])

kospi_box = []
for n in tqdm(range(1, kospi_page + 1)):
    url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page={}".format(n))
    html = BeautifulSoup(url.text)
    table = html.find("table", class_ = "type_2")
    table = pd.read_html(str(table))[0]
    del table['토론실']
    del table['N']
    table = table[table['종목명'].notnull()]
    kospi_box.append(table)
    time.sleep(1)

kospi = pd.concat(kospi_box, ignore_index = True)  # merge the per-page DataFrames into one
kospi['소속'] = ['KOSPI'] * len(kospi)
kospi.to_excel("kospi.xlsx")
kospi

> Output

100%|██████████| 37/37 [00:52<00:00, 1.41s/it]

종목명	현재가	전일비	등락률	액면가	시가총액	상장주식수	외국인비율	거래량	PER	ROE	소속
0	삼성전자	60600.0	300.0	-0.49%	100.0	3617688.0	5969783.0	49.78	15352434.0	9.51	13.92	KOSPI
1	LG에너지솔루션	384500.0	1500.0	+0.39%	500.0	899730.0	234000.0	3.26	173069.0	122.53	10.68	KOSPI
2	SK하이닉스	102500.0	2500.0	+2.50%	5000.0	746202.0	728002.0	49.71	3755502.0	7.05	16.84	KOSPI
3	삼성바이오로직스	823000.0	4000.0	+0.49%	2500.0	585762.0	71174.0	10.71	31036.0	114.91	8.21	KOSPI
4	삼성전자우	56200.0	100.0	-0.18%	100.0	462462.0	822887.0	72.15	779200.0	8.82	NaN	KOSPI
...	...	...	...	...	...	...	...	...	...	...	...	...
1825	KBSTAR 모멘텀밸류	12060.0	100.0	+0.84%	0.0	12.0	100.0	0.00	1190.0	NaN	NaN	KOSPI
1826	KBSTAR 200에너지화학	10035.0	170.0	+1.72%	0.0	10.0	100.0	0.00	136.0	NaN	NaN	KOSPI
1827	KBSTAR 200생활소비재	6650.0	80.0	+1.22%	0.0	9.0	140.0	0.00	462.0	NaN	NaN	KOSPI
1828	KBSTAR 200산업재	10820.0	50.0	+0.46%	0.0	9.0	80.0	0.00	16.0	NaN	NaN	KOSPI
1829	KBSTAR 200경기소비재	9625.0	75.0	+0.79%	0.0	8.0	80.0	0.00	5.0	NaN	NaN	KOSPI

1830 rows × 12 columns

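- Combining the code above, with the rank index // standard version (same as the code above, so it can be skipped)

> Code

#>>>
box = []
for i in range(len(kospi)):
    box.append("코스피{}등".format(i+1))  # e.g. 코스피1등 ("KOSPI rank 1")

kospi.index = box
kospi.to_excel("kospi.xlsx")
kospi

> Output

(skipped, same as the index cleaning output above)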

- Full code covering both KOSPI and KOSDAQ // standard version (complete)

> Code

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from tqdm import tqdm  # shows the progress of a for loop

url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page=1")
html = BeautifulSoup(url.text)
kospi_page = int(html.find("td", class_ = "pgRR").find("a")['href'].split('=')[-1])

kospi_box = []
for n in tqdm(range(1, kospi_page + 1)):
    url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page={}".format(n))
    html = BeautifulSoup(url.text)
    table = html.find("table", class_ = "type_2")
    table = pd.read_html(str(table))[0]
    del table['토론실']
    del table['N']
    table = table[table['종목명'].notnull()]
    kospi_box.append(table)
    time.sleep(1)

kospi = pd.concat(kospi_box, ignore_index = True)  # merge the per-page DataFrames into one
kospi['소속'] = ['KOSPI'] * len(kospi)

### KOSDAQ code below (same steps with sosok=1)
url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=1&page=1")
html = BeautifulSoup(url.text)
kosdaq_page = int(html.find("td", class_ = "pgRR").find("a")['href'].split('=')[-1])

kosdaq_box = []
for n in tqdm(range(1, kosdaq_page + 1)):
    url = requests.get("https://finance.naver.com/sise/sise_market_sum.naver?sosok=1&page={}".format(n))
    html = BeautifulSoup(url.text)
    table = html.find("table", class_ = "type_2")
    table = pd.read_html(str(table))[0]
    del table['토론실']
    del table['N']
    table = table[table['종목명'].notnull()]
    kosdaq_box.append(table)
    time.sleep(1)

kosdaq = pd.concat(kosdaq_box, ignore_index = True)  # merge the per-page DataFrames into one
kosdaq['소속'] = ['KOSDAQ'] * len(kosdaq)

# Stack both markets into a single DataFrame and export it
stock = pd.concat([kospi, kosdaq], ignore_index=True)
stock.to_excel("stock.xlsx")
stock

> Output

100%|██████████| 37/37 [00:52<00:00, 1.41s/it]
100%|██████████| 32/32 [00:44<00:00, 1.39s/it]

# About 3,400 rows should come out

# Applying a filter in Excel makes it easy to separate the two markets and to sort in ascending or descending order
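The merge-and-label pattern used above can be checked on tiny in-memory tables before running the full crawl. The following is only a sketch: `merge_pages` is a hypothetical helper, the KOSPI prices are copied from the tables in this post, and the KOSDAQ row is a made-up placeholder.

```python
import pandas as pd

# Stand-ins for the per-page tables collected in kospi_box / kosdaq_box
kospi_pages = [
    pd.DataFrame({"종목명": ["삼성전자", "SK하이닉스"], "현재가": [60500.0, 102000.0]}),
    pd.DataFrame({"종목명": ["NAVER"], "현재가": [246000.0]}),
]
kosdaq_pages = [
    pd.DataFrame({"종목명": ["placeholder stock"], "현재가": [1000.0]}),  # invented row
]

def merge_pages(pages, market):
    """Concatenate per-page tables and tag every row with its market (hypothetical helper)."""
    merged = pd.concat(pages, ignore_index=True)
    merged["소속"] = market  # a scalar is broadcast to every row
    return merged

stock = pd.concat(
    [merge_pages(kospi_pages, "KOSPI"), merge_pages(kosdaq_pages, "KOSDAQ")],
    ignore_index=True,
)
print(stock)
```

Assigning the scalar `market` does the same job as `['KOSPI'] * len(kospi)`, and wrapping the steps in one function avoids repeating the block for each market.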

[News word cloud]

- Collecting data for the word cloud // headlines and summaries of news about '성균관대학교' (Sungkyunkwan University)

> Code

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from tqdm import tqdm

title_box = []
content_box = []

# n is the index of the first result on each page: 1, 11, 21, ..., 3991 (400 pages).
# The start parameter takes n as-is; appending an extra "1" would request 11, 111, 211, ...
for n in tqdm(range(1,3992,10)):
    url = requests.get("https://search.naver.com/search.naver?where=news&sm=tab_pge&query=%EC%84%B1%EA%B7%A0%EA%B4%80%EB%8C%80%ED%95%99%EA%B5%90&sort=0&photo=0&field=0&pd=0&ds=&de=&cluster_rank=102&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so:r,p:all,a:all&start={}".format(n))
    html = BeautifulSoup(url.text)
    contents = html.find("ul",class_="list_news").find_all("li",class_="bx")
    for i in contents:
        title = i.find('a', class_= "news_tit").text      # headline
        content = i.find('div', class_= "news_dsc").text  # summary text
        title_box.append(title)
        content_box.append(content)
    time.sleep(0.3)

> Output

100%|██████████| 400/400 [06:13<00:00, 1.07it/s]
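The loop above walks through the result pages by the index of the first item on each page. A minimal sketch of how those offsets and URLs line up is shown below. It assumes that only `where`, `query`, and `start` matter for paging (the real URL in this post carries many more parameters), and `news_page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "https://search.naver.com/search.naver"

def news_page_urls(query, last_start=3991, per_page=10):
    """Build one search URL per result page: start = 1, 11, 21, ... (hypothetical helper)."""
    urls = []
    for start in range(1, last_start + 1, per_page):
        params = {"where": "news", "query": query, "start": start}
        urls.append(BASE + "?" + urlencode(params))  # urlencode percent-encodes the Korean query
    return urls

urls = news_page_urls("성균관대학교")
print(len(urls))   # number of pages that would be requested
print(urls[0])
print(urls[-1])
```

With ten results per page, 400 pages give the 4000 headlines counted in the checks that follow.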

- Checking the data 1

> Code

len(title_box)  # number of headlines collected

> Output

4000

- Checking the data 2

> Code

len(content_box)  # number of summaries collected

> Output

4000

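- Joining every headline and summary into a single text

> Code

text = ""
for i in range(4000):
    text += title_box[i]    # news headline
    text += '\n'
    text += content_box[i]  # summary
    text += '\n'

# Each headline is followed by its own summary, separated by newlines:
# title 0 + content 0, title 1 + content 1, title 2 + content 2, ...

len(text)

> Output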

624266

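- Cleaning with regular expressions (re)

> Code

import re

# KoNLPy: a Korean language processing library, mainly used to strip particles (josa)

# re.findall takes two arguments (pattern, text) and returns every match found in the text:
#   ("[0-9]", text)     each single digit
#   ("[0-9]+", text)    runs of one or more digits
#   ("[a-zA-Z]+", text) runs of English letters
#   ("[가-힣]+", text)  runs of Hangul syllables
word_box = re.findall("[가-힣]+",text)
word_box

# Frequency analysis
dic = {}
for i in word_box:
    if i not in dic.keys():
        dic[i] = 1   # first time this word appears
    else:
        dic[i] += 1  # seen before, so add one

dic

> Output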

{'경기력': 1, '좋아지는': 1, '박무빈': 1, '연세대를': 1, '다시': 3, '만난다면': 1, '강': 7, '에서': 374, '성균관대를': 9, '로': 31, '꺾고': 2, '준결승에': 1, '진출했다': 1, '대학농구리그에서': 2, '위를': 3, '차지한': 2, '고려대는': 1, '위': 67, '성균관대와': 26, '플레이오프에서': 1, '만난다': 1, '미리': 2, '보는': 3, '플레이오프였다': 1, '양팀': 1, '모두': 8, '선수들을': 1, '고르게': 1, '기용했다': 1, '고려대가': 4, '성균관대학교': 2308, '박성호': 3, '교수팀': 15, '화학적': 1, '조각': 1, '반응을': 2, '이용한': 9, '차원': 3, '나노': 3, '구조체': 1, '화학과': 7, '사진출처': 380, '총장': 1159, '신동렬': 796, '는': 794, '교수': 1747, '연구팀': 31, '저자': 13, '김정원': 1, '석박통합과정': 3, '이': 429, '수용액': 1, '상에서': 1, '진행되는': 2, '새로운': 6, '화학반응을': 1, '통해': 200, '빛과의': 1, '유충식': 7, '성균관대': 5019, '세계지반학술단체총연합회': 5, '회장': 198, '당선': 2, '성균관대는': 123, '건설환경공학부': 4, '교수가': 212, '회장에': 1, '당선됐다고': 1, '일': 1435, '그동안': 58, '회장으로': 5, '활동하면서': 1, '쌓은': 2, '국제적': 1, '리더십을': 2, '토대로': 1, '년간': 2, '지반': 1, '신소재': 2, '분야에서': 5, '노완주': 1, '지면': 1, '탈락하는': 1, '경기에서': 12, '빛나다': 1, '대학농구리그': 1, '건국대와': 4, '홈': 3, '점': 15, '차이로': 2, '졌다': 1, '이날은': 1, '그': 7, '날': 3, '패배도': 1, '되갚았다': 1, '노완주는': 1, '학년': 8, '형들은': 1, '마지막': 3, '배': 185, '대회이고': 1, '플레이오프가': 1, '남아있지만': 1, '오늘': 1,

...}
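The counting loop above is the classic dict pattern. For comparison, here is a short sketch that does the same job with `collections.Counter` from the standard library. The sample text is invented for the demo; only the `[가-힣]+` pattern comes from the code above.

```python
import re
from collections import Counter

# Invented sample standing in for the joined headlines and summaries
text = "성균관대 교수 연구팀\n성균관대 총장 신동렬\nSKKU 2022"

# Keep only runs of Hangul syllables, as in the post
word_box = re.findall("[가-힣]+", text)

# Counter builds the word -> count mapping in one step
counts = Counter(word_box)
print(counts["성균관대"])       # how often one word appears
print(counts.most_common(3))   # the most frequent words first

# The manual loop gives the same mapping
dic = {}
for word in word_box:
    dic[word] = dic.get(word, 0) + 1
print(dic == dict(counts))
```

`Counter` is a dict subclass, so it can be passed wherever a plain frequency dict is expected.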

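- Frequency analysis with a dict, combined code

> Code

import re

text = ""
for i in range(4000):
    text += title_box[i]
    text += '\n'
    text += content_box[i]
    text += '\n'  # builds one text: title 0 + content 0, title 1 + content 1, ...

# KoNLPy: a Korean language processing library, mainly used to strip particles (josa)

# Extract runs of Hangul syllables (this step defines word_box for the loop below)
word_box = re.findall("[가-힣]+",text)

# Frequency analysis
dic = {}
for i in word_box:
    if i not in dic.keys():
        dic[i] = 1
    else:
        dic[i] += 1

dic

> Output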

...}

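- Building the word cloud

> Code

from wordcloud import WordCloud

# A Korean font must be given so that Hangul does not render as broken glyphs.
# Upload the font file to the Colab file panel before running this.
wc = WordCloud(font_path = "BMDOHYEON_ttf.ttf",background_color='white')

# Draw the cloud from the word -> count dict and save it as an image
cloud = wc.generate_from_frequencies(dic)
cloud.to_file("my_cloud.jpg")

> Output

<wordcloud.wordcloud.WordCloud at 0x7f2cccbb4150>

# my_cloud.jpg can be downloaded from the Colab file panel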

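[Crawling daily quotes]

- Crawling Naver daily quotes // standard version

> Code

#>>>
# The page exists, but its content is not returned
# The request is rejected because of its User-Agent
url = requests.get("https://finance.naver.com/item/sise_day.naver?code=005930")
html = BeautifulSoup(url.text)
html

> Output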

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> <title>네이버 :: 세상의 모든 지식, 네이버</title> <style type="text/css"> .error_content * {margin:0;padding:0;} .error_content img{border:none;} .error_content em {font-style:normal;} .error_content {width:410px; margin:80px auto 0; padding:57px 0 0 0; font-size:12px; font-family:"나눔고딕", "NanumGothic", "돋움", Dotum, AppleGothic, Sans-serif; text-align:left; line-height:14px; background:url(https://ssl.pstatic.net/static/common/error/090610/bg_thumb.gif) no-repeat center top; white-space:nowrap;} .error_content p{margin:0;} .error_content .error_desc {margin-bottom:21px; overflow:hidden; text-align:center;} .error_content .error_desc2 {margin-bottom:11px; padding-bottom:7px; color:#888; line-height:18px; border-bottom:1px solid #eee;} .error_content .error_desc3 {clear:both; color:#888;} .error_content .error_desc3 a {color:#004790; text-decoration:underline;} .error_content .error_list_type {clear:both; float:left; width:410px; _width:428px; margin:0 0 18px 0; *margin:0 0 7px 0; padding-bottom:13px; font-size:13px; color:#000; line-height:18px; border-bottom:1px solid #eee;} .error_content .error_list_type dt {float:left; width:60px; _width /**/:70px; padding-left:10px; background:url(https://ssl.pstatic.net/static/common/error/090610/bg_dot.gif) no-repeat 2px 8px;} .error_content .error_list_type dd {float:left; width:336px; _width /**/:340px; padding:0 0 0 4px;} .error_content .error_list_type dd span {color:#339900; letter-spacing:0;} .error_content .error_list_type dd a{color:#339900;} .error_content p.btn{margin:29px 0 100px; text-align:center;} </style> </head>  <body> <div class="error_content"> <p class="error_desc"><img alt="페이지를 찾을 수 없습니다." 
height="30" src="https://ssl.pstatic.net/static/common/error/090610/txt_desc5.gif" width="319"/></p> <p class="error_desc2">방문하시려는 페이지의 주소가 잘못 입력되었거나,<br/> 페이지의 주소가 변경 혹은 삭제되어 요청하신 페이지를 찾을 수 없습니다.<br/> 입력하신 주소가 정확한지 다시 한번 확인해 주시기 바랍니다. </p> <p class="error_desc3">관련 문의사항은 <a href="https://help.naver.com/" target="_blank">고객센터</a>에 알려주시면 친절히 안내해드리겠습니다. 감사합니다.</p> <p class="btn"> <a href="javascript:history.back()"><img alt="이전 페이지로" height="35" src="https://ssl.pstatic.net/static/common/error/090610/btn_prevpage.gif" width="115"/></a> <a href="https://finance.naver.com"><img alt="금융홈으로" height="35" src="https://ssl.pstatic.net/static/nfinance/btn_home.gif" width="115"/></a> </p> </div> </body> </html>

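- Getting around the block by changing the header // standard version

-# Reference site

: https://www.whatismybrowser.com/guides/the-latest-user-agent/macos

> Code

#>>>
# The page exists, but its content is not returned
# The request is rejected because of its User-Agent

# Present the request as Safari on a Mac instead of as Python
me = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15'}

# The header only takes effect when it is passed to the request
url = requests.get("https://finance.naver.com/item/sise_day.naver?code=005930", headers = me)
html = BeautifulSoup(url.text)
html

> Output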

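- Crawling Naver daily quotes // standard version (complete)

# The address to crawl is the source URL of the table embedded in the quotes page

#https://finance.naver.com/item/sise_day.naver?code=005930

> Code

from bs4 import BeautifulSoup
import requests
import pandas as pd

me = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15'}

url = requests.get("https://finance.naver.com/item/sise_day.naver?code=005930", headers = me)
html = BeautifulSoup(url.text)

table = html.find('table')
table = pd.read_html(str(table))[0].dropna()  # dropna() removes every row that contains at least one NaN

del table['전일비']  # drop the "change from previous day" column
table

> Output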

날짜	종가	시가	고가	저가	거래량
1	2022.07.20	60500.0	61800.0	62100.0	60500.0	16718393.0
2	2022.07.19	60900.0	61400.0	61500.0	60200.0	15248261.0
3	2022.07.18	61900.0	60600.0	62000.0	60500.0	20832517.0
4	2022.07.15	60000.0	58400.0	60000.0	58100.0	18685583.0
5	2022.07.14	57500.0	57500.0	58200.0	57400.0	15067012.0
9	2022.07.13	58000.0	58300.0	58600.0	58000.0	10841315.0
10	2022.07.12	58100.0	58600.0	58700.0	58100.0	9336061.0
11	2022.07.11	58800.0	59300.0	59600.0	58700.0	13042624.0
12	2022.07.08	58700.0	58600.0	59300.0	58200.0	15339271.0
13	2022.07.07	58200.0	56400.0	58700.0	56300.0	21322833.0
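The difference between the failing and the working request is only the `User-Agent` header. The sketch below shows how a header is attached and inspected before anything is sent. Note the swap: it uses the standard library's `urllib.request.Request` instead of `requests`, purely so that the example runs offline; the `me` dict is the one from this post.

```python
from urllib.request import Request

URL = "https://finance.naver.com/item/sise_day.naver?code=005930"
me = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15"}

plain = Request(URL)                  # no custom header attached
disguised = Request(URL, headers=me)  # browser-like User-Agent attached

# urllib stores header names in capitalized form, i.e. "User-agent"
print(plain.get_header("User-agent"))      # None: nothing was set on this request object
print(disguised.get_header("User-agent"))  # the Safari string from `me`
```

Nothing is sent over the network here; the objects are only built and inspected. With `requests`, the equivalent step is passing `headers=me` to `requests.get`.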

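[Crawling upbit customer center titles]

# In the developer tools, the values we want do not show up in the page's HTML

# In that case, open the notices page, open the developer tools, and select the Network tab

# Next, open the project disclosure page and find the disclosure request in the Network tab

# Right-clicking or double-clicking that disclosure request shows the data the site loads

- Crawling upbit customer center titles // standard version (complete)

> Code

# Crawling upbit customer center titles
url = requests.get("https://project-team.upbit.com/api/v1/disclosure?region=kr&per_page=20",headers=me)
url.text

# html=BeautifulSoup(url.text)
# html
# time.sleep(0.3)

import json

data = json.loads(url.text)  # the JSON text becomes a Python dict

for i in data['data']['posts']:
    print(i['text'])  # the 'text' key of each post holds its title

> Output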

[기공개] 리브랜딩 : 피체인(PCHAIN)이 플리안(Plian)으로 바뀝니다 어뷰징 물량 회수 및 소각 공시 RINGX 재단, 롯데슈퍼와 업무 제휴 협약 체결 베트남 기업 '마켓 사이공'에 블록체인 모빌리티 플랫폼 수출(SaaS) [기공개] 카르테시 x Travala 파트너십 체결 [기공개] 엔진, 한국 최대 소셜 게이밍 플랫폼(겜톡톡)과 파트너십... 친환경 NFT 도입 예정 [기공개] 플레이댑, 루데나 프로토콜 NFT아이템 거래 지원 계약 [기공개] Bifrost PAID Network와 업무 협력 파트너쉽 체결 [기공개] 칠리즈, 맨체스터 시티 FC 파트너십 발표 [기공개] 토큰 액면 병합: NPXS가 PUNDIX로 바뀝니다. [기공개] 크립토닷컴, 비자와 글로벌 파트너쉽 체결 및 주요 회원사로 선정 [기공개] 메디블록, 블록체인 기반의 DID 백신패스 출시 예정 [기공개] P2P 마켓플레이스 오리진 프로토콜 , NFT 및 OUSD 라이트페이퍼 출시 5조원 규모 초대형 북미 펀드인 Celsius Network에서 GOM2에 투자 인도네시아 기업 '퀵스'에 MVL 프로토콜 기반 모빌리티 서비스 플랫폼 수출(SaaS) 오브스(Orbs), 블록체인 기업 MOONSTAKE와 협업 쎄타랩스, 분산형 비디오 및 데이터 전송을 지원하기 위한 초고 트랜잭션 처리량 소액 결제에 대한 두번째 미국특허취득 픽션 네트워크, 신임 대표이사 선출 RINGX 재단, OK캐쉬백((주)위페이)과 ‘마이비(Mivy)’ 플랫폼 전환 협업 진행 토카막 다오 베타 출시
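The JSON handling above can be tried on a small hand-written payload. This sketch assumes the response nests posts under `data` → `posts` with a `text` field, as the loop in this post suggests; the real response likely carries more fields, and the two titles are copied from the output above.

```python
import json

# Hand-written stand-in for url.text (only the fields used in this post)
raw = '''
{"data": {"posts": [
    {"text": "토카막 다오 베타 출시"},
    {"text": "픽션 네트워크, 신임 대표이사 선출"}
]}}
'''

data = json.loads(raw)  # JSON text -> nested dicts and lists

# Walk down to the list of posts and collect each title
titles = [post["text"] for post in data["data"]["posts"]]
for title in titles:
    print(title)
```

Collecting the titles in a list, rather than only printing them, keeps them available for later steps such as saving to a file.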

jammanbooboo's Code Log

Wednesday, July 20, 2022