Cracking Image CAPTCHAs with Python and Artificial Intelligence

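We will not use a neural network here. Instead, we borrow the idea behind a vector space search engine. So what is a vector space search engine?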

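Take documents as an example:

Suppose you have 3 documents. How do we compute how similar they are to each other? The more words two documents have in common, the more similar they are. But a document contains far too many words to compare them all, so we pick a few key words. The chosen words are called features, and each feature acts like one dimension in space (x, y, z, and so on). A set of feature values forms a vector, and every document yields one such vector. Computing the angle between two vectors (in practice, its cosine) then gives the similarity of the two documents: the smaller the angle, the more similar they are.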
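As a small illustration of the idea above, the following sketch counts a few hand-picked feature words in three made-up documents and compares the resulting vectors by the cosine of the angle between them. The documents, the feature list, and the helper names (`to_vector`, `cosine`) are assumptions made up for this example; they do not come from the article's code.

```python
import math
from collections import Counter

# Hypothetical feature words chosen by hand; each one is a dimension.
FEATURES = ["python", "image", "captcha", "football"]

def to_vector(text):
    """Count how often each feature word occurs in the text."""
    counts = Counter(text.lower().split())
    return {word: counts[word] for word in FEATURES}

def cosine(v1, v2):
    """Cosine of the angle between two feature vectors (dicts)."""
    dot = sum(v1[w] * v2[w] for w in FEATURES)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a document with none of the features matches nothing
    return dot / (norm1 * norm2)

doc_a = "python reads the captcha image and python decodes the image"
doc_b = "the captcha image is decoded with python"
doc_c = "the football match ended late"

vec_a, vec_b, vec_c = to_vector(doc_a), to_vector(doc_b), to_vector(doc_c)
print("a vs b:", round(cosine(vec_a, vec_b), 3))
print("a vs c:", round(cosine(vec_a, vec_c), 3))
```

Documents a and b share the same feature words and score close to 1, while a and c share none and score 0.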

Implementing the vector space with a Python class:

import math

class VectorCompare:
    # Compute the magnitude (length) of a vector
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    # Compute the cosine of the angle between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

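It compares two Python dictionaries and returns their similarity as a number between 0 and 1 (for non-negative values such as word counts or pixel values).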
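To build some intuition for that 0 to 1 range, it helps to feed the comparison a few hand-made dictionaries. The sketch below re-implements the same cosine calculation as a standalone function so it can run on its own. The name `cosine_similarity`, the sample dictionaries, and the choice to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * d2[key] for key, value in d1.items() if key in d2)
    mag1 = math.sqrt(sum(value ** 2 for value in d1.values()))
    mag2 = math.sqrt(sum(value ** 2 for value in d2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (mag1 * mag2)

a = {0: 255, 1: 0, 2: 255}
same_direction = {0: 51, 1: 0, 2: 51}   # same pattern, lower intensity
no_overlap = {0: 0, 1: 255, 2: 0}       # nonzero only where `a` is zero
partial = {0: 255, 1: 255, 2: 0}        # shares one of two nonzero entries

for name, other in [("same_direction", same_direction),
                    ("no_overlap", no_overlap),
                    ("partial", partial)]:
    print(name, round(cosine_similarity(a, other), 3))
```

Only the direction of a vector matters, not its length, so a scaled copy scores about 1, vectors with no shared nonzero entries score 0, and partial overlap lands in between.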

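Before matching, there is still the job of collecting a large number of CAPTCHAs and extracting single-character images from them to serve as the training set. Anyone who read the previous article carefully already knows how to do this, so it is omitted here. You can use the provided training set directly for the steps below.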

The iconset directory holds our training set.
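Judging by the paths used in the loading code, the training set has one sub-directory per character, for example ./iconset/a/ holding sample images of "a". A quick sanity check of such a layout might look like the sketch below. The function name `count_samples` is an assumption of this example, and the skip list mirrors the file names the loading code ignores.

```python
import os

SKIP = {"Thumbs.db", ".DS_Store"}  # OS metadata files, not training images

def count_samples(root="./iconset"):
    """Return {character: number of sample files} for each sub-directory."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        files = [f for f in os.listdir(folder) if f not in SKIP]
        counts[name] = len(files)
    return counts

# Only report when the directory actually exists next to the script.
if os.path.isdir("./iconset"):
    for char, n in count_samples().items():
        print(char, n, "(no samples!)" if n == 0 else "")
```

A character with zero samples can never be guessed, so it is worth spotting such gaps before running the recognizer.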

Code appended at the end of the script:

# Relies on os, Image (from PIL), im2 and letters defined earlier in the script

# Convert an image into a vector: {pixel index: pixel value}
def buildvector(im):
    return dict(enumerate(im.getdata()))

v = VectorCompare()
iconset = ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

# Load the training set: one {letter: [vector]} entry per sample file
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/' % letter):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s" % (letter, img))))
        imageset.append({letter: temp})

# Cut the CAPTCHA image into single-letter slices
for letter in letters:
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    guess = []
    # Compare the slice with every training sample
    for image in imageset:
        for x, y in image.items():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))
    # Highest similarity first; print the best (score, letter) pair
    guess.sort(reverse=True)
    print(guess[0])
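The comparison loop above can be tried without any image files. The following sketch swaps the PIL images for tiny hand-drawn 3x3 bitmaps stored as nested lists. The glyphs, the helper names (`build_vector`, `relation`, `guess_char`), and the convention that 1 means ink and 0 means background are all assumptions of this illustration. The article's own images use 0 for ink and 255 for background.

```python
import math

def build_vector(bitmap):
    """Flatten a 2D bitmap (list of rows) into {pixel_index: value}."""
    flat = [pixel for row in bitmap for pixel in row]
    return dict(enumerate(flat))

def relation(v1, v2):
    """Cosine similarity between two pixel vectors."""
    dot = sum(value * v2.get(key, 0) for key, value in v1.items())
    mag1 = math.sqrt(sum(x * x for x in v1.values()))
    mag2 = math.sqrt(sum(x * x for x in v2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0
    return dot / (mag1 * mag2)

# Hypothetical 3x3 training glyphs: 1 = ink, 0 = background.
training = {
    "l": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "t": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
    "o": [[1, 1, 1],
          [1, 0, 1],
          [1, 1, 1]],
}

def guess_char(bitmap):
    """Return (score, label) pairs, best match first."""
    vec = build_vector(bitmap)
    guesses = [(relation(build_vector(sample), vec), label)
               for label, sample in training.items()]
    guesses.sort(reverse=True)
    return guesses

# A noisy "t": one ink pixel of the top bar is missing.
unknown = [[1, 1, 0],
           [0, 1, 0],
           [0, 1, 0]]
print(guess_char(unknown)[0])
```

Even with a missing pixel the noisy glyph still scores highest against "t", though "l" comes close. This is why a larger and more varied training set tends to help.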

Full code:

from PIL import Image
import os
import math

class VectorCompare:
    # Magnitude (length) of a vector
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    # Cosine of the angle between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

# Convert an image into a vector: {pixel index: pixel value}
def buildvector(im):
    return dict(enumerate(im.getdata()))

v = VectorCompare()
iconset = ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

# Load the training set: one {letter: [vector]} entry per sample file
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/' % letter):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":  # skip OS metadata files
            temp.append(buildvector(Image.open("./iconset/%s/%s" % (letter, img))))
        imageset.append({letter: temp})

im = Image.open("captcha.gif")
im = im.convert("P")  # convert() returns a new image, so keep the result
im2 = Image.new("P", im.size, 255)

# Copy only the text-colored pixels onto a white image, as black
for x in range(im.size[1]):      # x is the row here
    for y in range(im.size[0]):  # y is the column
        pix = im.getpixel((y, x))
        if pix == 220 or pix == 227:  # these are the numbers to get
            im2.putpixel((y, x), 0)

inletter = False
foundletter = False
start = 0
end = 0
letters = []

# Scan column by column to find where each letter starts and ends
for y in range(im2.size[0]):      # slice across
    for x in range(im2.size[1]):  # slice down
        pix = im2.getpixel((y, x))
        if pix != 255:
            inletter = True
    if foundletter == False and inletter == True:
        foundletter = True
        start = y
    if foundletter == True and inletter == False:
        foundletter = False
        end = y
        letters.append((start, end))
    inletter = False

# Crop each letter and compare it with every training sample
for letter in letters:
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    guess = []
    for image in imageset:
        for x, y in image.items():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))
    # Highest similarity first; print the best (score, letter) pair
    guess.sort(reverse=True)
    print(guess[0])
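The column scan that finds letter boundaries in the listing above can be isolated and tested on plain lists. The sketch below is one possible standalone version, not the article's code. `find_letter_spans` is a hypothetical helper, and the bitmap is a nested list that uses 255 for background, as im2 does. Unlike the loop above, it also closes a letter that runs all the way to the right edge; the loop above only records a letter once it reaches a blank column after it.

```python
def find_letter_spans(bitmap, background=255):
    """Return (start, end) column ranges that contain non-background pixels.

    `end` is exclusive, matching the crop box used in the article.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    spans = []
    start = None
    for x in range(width):
        has_ink = any(bitmap[y][x] != background for y in range(height))
        if has_ink and start is None:
            start = x                    # first inked column of a letter
        elif not has_ink and start is not None:
            spans.append((start, x))     # first blank column after it
            start = None
    if start is not None:                # letter touches the right edge
        spans.append((start, width))
    return spans

B, I = 255, 0  # background and ink values, as in im2
image = [
    [B, I, I, B, B, I, B, I],
    [B, I, B, B, B, I, B, I],
]
print(find_letter_spans(image))
```

Each returned pair can be passed to a crop call as the left and right edges of one letter.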

For details on building the character training set, see my previous article.


Page updated: 2024-05-18
