Support Vector Machine (SVM): Principles and MATLAB Implementation

vlambda
2022-03-25

Plenty of material on SVM theory is already available online; this post is simply my own study notes.

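I. Theoretical derivation of the support vector machine
(Figure sourced from the internet.)

Linearly separable case (left panel, Figure 1)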

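To keep the explanation simple, take the two-dimensional plane as an example. The goal is to find a separating hyperplane (here a straight line, the red line in the figure) satisfying wx + b = 0. Shifting this line in parallel upward and downward until it just touches the nearest samples of each class (the blue circles and the red squares) gives the two dashed lines, and the separating line sits exactly midway between them. The distance d between the two dashed lines is the quantity to optimize: it should be as large as possible. The sample vectors that touch the dashed lines are called support vectors.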
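To make "as large as possible" concrete, here is a short sketch of the standard formulation. It assumes the usual normalization, which the text above does not state explicitly: w and b are rescaled so that the two dashed lines are w·x + b = +1 and w·x + b = −1, and the labels are y_i ∈ {+1, −1}.

```latex
% Distance between the parallel lines w \cdot x + b = +1 and w \cdot x + b = -1
d = \frac{\lvert (+1) - (-1) \rvert}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}

% Maximizing d is therefore equivalent to
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under this normalization the support vectors are exactly the samples for which the constraint holds with equality.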

5. MATLAB implementation of the support vector machine

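clear all;
% Read the data.
fid = fopen('krkopt.DATA');
c = fread(fid, 3);              % skip the first 3 bytes of the file

vec = zeros(6,1);
xapp = [];
yapp = [];
flag = 0;                       % line counter; the original listing never initialized it
while ~feof(fid)
    string = [];
    c = fread(fid,1);
    flag = flag+1;
    while c~=13                 % collect bytes up to the carriage return (ASCII 13)
        string = [string, c];
        c = fread(fid,1);
    end
    fread(fid,1);               % skip the byte that follows the carriage return
    if length(string)>10
        vec(1) = string(1) - 96;    % letter: 'a' (ASCII 97) -> 1
        vec(2) = string(3) - 48;    % digit:  '1' (ASCII 49) -> 1
        vec(3) = string(5) - 96;
        vec(4) = string(7) - 48;
        vec(5) = string(9) - 96;
        vec(6) = string(11) - 48;
        xapp = [xapp,vec];
        if string(13) == 100        % class label starts with 'd' (ASCII 100)
            yapp = [yapp,1];
        else
            yapp = [yapp,-1];
        end
    end
end
fclose(fid);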

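The program starts by preprocessing the data, converting each text line into numeric form. The resulting xapp and yapp have sizes 6×28056 and 1×28056; they hold the samples (one per column) and the labels, respectively.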
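As a small illustration of the character-to-number conversion, the sketch below parses a single line by hand. The sample line is an assumption about the file layout (six comma-separated fields followed by a class name), and the variable names are mine rather than part of the original program.

```matlab
% Hypothetical sample line; double() gives the ASCII codes that fread would return
sampleLine = double('a,1,b,3,c,2,draw');

vec = zeros(6,1);
vec([1 3 5]) = sampleLine([1 5 9]) - 96;    % letters: 'a' (97) -> 1, 'b' -> 2, 'c' -> 3
vec([2 4 6]) = sampleLine([3 7 11]) - 48;   % digits:  '1' (49) -> 1, '3' -> 3, '2' -> 2

if sampleLine(13) == 100                    % class name starts with 'd'
    label = 1;
else
    label = -1;
end

disp(vec');
disp(label);
```

Each parsed line contributes one column to xapp and one entry to yapp.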

[N,M] = size(xapp);
p = randperm(M);    % shuffle all sample indices
numberOfSamplesForTraining = 5000;
xTraining = [];
yTraining = [];
for i=1:numberOfSamplesForTraining
    xTraining = [xTraining,xapp(:,p(i))];
    yTraining = [yTraining,yapp(p(i))];
end;
xTraining = xTraining';
yTraining = yTraining';

xTesting = [];
yTesting = [];
for i=numberOfSamplesForTraining+1:M
    xTesting = [xTesting,xapp(:,p(i))];
    yTesting = [yTesting,yapp(p(i))];
end;
xTesting = xTesting';
yTesting = yTesting';

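This part splits the dataset into a training set (5000 samples) and a test set (all remaining samples). The loops are needlessly verbose, though, and can be vectorized as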

[N,M] = size(xapp);
p = randperm(M);    % shuffle all sample indices
numberOfSamplesForTraining = 5000;
% The transpose applies to the selected block, so each sample becomes a row
xTraining = xapp(:,p(1:numberOfSamplesForTraining))';
yTraining = yapp(:,p(1:numberOfSamplesForTraining))';
xTesting = xapp(:,p(numberOfSamplesForTraining+1:end))';
yTesting = yapp(:,p(numberOfSamplesForTraining+1:end))';

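Next comes data preprocessing (normalization: each feature is shifted to zero mean and scaled to unit standard deviation).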

% Normalization
[numVec,numDim] = size(xTraining);
avgX = mean(xTraining);
stdX = std(xTraining);
for i = 1:numVec
    xTraining(i,:) = (xTraining(i,:)-avgX)./stdX;
end;

% The test set is scaled with the training-set mean and standard deviation
[numVec,numDim] = size(xTesting);
for i = 1:numVec
    xTesting(i,:) = (xTesting(i,:)-avgX)./stdX;
end;
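The same normalization can be written without explicit loops. The following is a minimal sketch on toy data (the matrices and variable names are made up for illustration); it uses bsxfun so that the 1-by-d mean and standard deviation vectors are applied to every row.

```matlab
% Toy data (hypothetical): 4 training rows and 2 test rows, 2 features each
xTrainToy = [1 10; 2 20; 3 30; 4 40];
xTestToy  = [2.5 25; 5 50];

avgToy = mean(xTrainToy);    % column means of the training set only
stdToy = std(xTrainToy);     % column standard deviations of the training set only

% Subtract the mean and divide by the standard deviation, row by row
xTrainNorm = bsxfun(@rdivide, bsxfun(@minus, xTrainToy, avgToy), stdToy);
xTestNorm  = bsxfun(@rdivide, bsxfun(@minus, xTestToy,  avgToy), stdToy);

disp(xTrainNorm);
disp(xTestNorm);
```

After this step the training columns have zero mean and unit standard deviation; the test columns generally do not, because they are scaled with the training statistics.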

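Next, the model is trained using an SVM with a Gaussian (RBF) kernel. This setup has two hyperparameters to choose, the penalty C and the kernel parameter gamma, so the code sweeps over a grid of values and keeps the combination with the best cross-validation accuracy.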

% Firstly, search C and gamma in a crude scale (as recommended in 'A Practical Guide to Support Vector Classification')
CScale = [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15];
gammaScale = [-15, -13, -11, -9, -7, -5, -3, -1, 1, 3];
C = 2.^CScale;
gamma = 2.^gammaScale;
maxRecognitionRate = 0;
for i = 1:length(C)
    for j = 1:length(gamma)
        cmd = ['-t 2 -c ',num2str(C(i)),' -g ',num2str(gamma(j)),' -v 5'];
        recognitionRate = svmtrain(yTraining,xTraining,cmd);
        if recognitionRate>maxRecognitionRate
            maxRecognitionRate = recognitionRate
            maxCIndex = i;
            maxGammaIndex = j;
        end;
    end;
end;
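To see what the search actually passes to the library, the sketch below builds the option string for a single grid point without calling LIBSVM. In that string, -t 2 selects the RBF kernel and -v 5 requests 5-fold cross-validation; in this mode LIBSVM's svmtrain returns the cross-validation accuracy as a scalar instead of a model, which is why the loop can compare the return value directly. The indices i and j are picked arbitrarily for illustration.

```matlab
CScale = [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15];
gammaScale = [-15, -13, -11, -9, -7, -5, -3, -1, 1, 3];
C = 2.^CScale;
gamma = 2.^gammaScale;

% Number of cross-validation runs performed by the coarse search
numCombos = length(C) * length(gamma);

% Option string for one grid point: C = 2^1 and gamma = 2^-1
i = 4;
j = 8;
cmd = ['-t 2 -c ', num2str(C(i)), ' -g ', num2str(gamma(j)), ' -v 5'];

disp(numCombos);
disp(cmd);
```

Once the loop finishes, C(maxCIndex) and gamma(maxGammaIndex) are the best values found on this coarse grid.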

The basic usage of the SVM library is described below. LIBSVM -- A Library for Support Vector Machines

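1. Installing the MATLAB version of the SVM library: download the package from the official website, then add the library to the MATLAB path. (If you don't know how to do that, tutorials are easy to find online; I assume that anyone who needs this library is generally familiar with MATLAB already, so I won't go into detail here.)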

2. Calling the function interface

model= svmtrain(yTraining, xTraining, cmd)

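The first two arguments are the dataset prepared earlier (the labels first, then the samples); cmd is a string of option settings, described in detail below.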

Examples of options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10

options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC
    1 -- nu-SVC
    2 -- one-class SVM
    3 -- epsilon-SVR
    4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
    3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)

The num_features in the -g option means the number of attributes in the input data.
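To connect the kernel options to actual numbers, here is a small sketch that evaluates two of the kernel formulas listed above by hand. The vectors u and v and the variable names are made up for illustration; the code only evaluates the formulas and does not call LIBSVM.

```matlab
% Two toy feature vectors (hypothetical)
u = [1; 2];
v = [3; 4];

% -t 1 -g 1 -r 1 -d 3: polynomial kernel (gamma*u'*v + coef0)^degree
gammaPoly = 1;
coef0 = 1;
degree = 3;
kPoly = (gammaPoly * (u' * v) + coef0)^degree;    % (11 + 1)^3

% -t 2 -g 0.5: radial basis function exp(-gamma*|u-v|^2)
gammaRbf = 0.5;
kRbf = exp(-gammaRbf * sum((u - v).^2));          % exp(-0.5 * 8)

disp(kPoly);
disp(kRbf);
```

With gamma = 1, coef0 = 1 and degree = 3, the polynomial kernel reduces to the (u'v+1)^3 form quoted in the example at the top of the option list.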