## Abstract

The selection of the penalty functional is critical for the performance of a regularized learning algorithm, and thus it deserves special attention. In this article, we present a least square regression algorithm based on *l^{p}*-coefficient regularization. Compared with the classical regularized least square regression, the new algorithm differs in its regularization term. Our primary focus is on the error analysis of the algorithm. An explicit learning rate is derived under some ordinary assumptions.

## 1. Introduction

In this letter, we provide an error analysis for regularized least square regression (RLSR) with an *l ^{p}*-coefficient regularizer.

Let $X$ be a compact subset of $\mathbb{R}^n$, $Y = [-M, M]$ for some $M > 0$, and let $\rho$ be an unknown probability distribution on $Z = X \times Y$. Given a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \subset Z$ drawn independently according to $\rho$, the regression problem in learning theory is to find a function $f_{\mathbf{z}} : X \to \mathbb{R}$, such that $f_{\mathbf{z}}(x)$ is a satisfactory estimate of the output $y$ when a new input $x$ is given. The prediction ability of a measurable function $f : X \to \mathbb{R}$ is measured by the generalization error

$$\mathcal{E}(f) = \int_Z \bigl(f(x) - y\bigr)^2 \, d\rho = \int_X \int_Y \bigl(f(x) - y\bigr)^2 \, d\rho(y \mid x)\, d\rho_X,$$

where $\rho_X$ is the marginal distribution on $X$ and $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. The function minimizing $\mathcal{E}(f)$ is called the regression function, which is defined by

$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X.$$

Undoubtedly, $f_\rho$ is the ideal estimator, but it is unknown because $\rho$, and hence $f_\rho$, is unknown. What we can do is find good approximations of $f_\rho$ from the random samples.
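As a quick numerical sanity check of this definition (a toy distribution of our own, not from the letter, with `f_rho` playing the role of the regression function):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model on X = [0, 1]: y = f_rho(x) + mean-zero noise, so E[y | x] = f_rho(x).
def f_rho(x):
    return np.sin(2 * np.pi * x)

m = 100_000
x = rng.uniform(0.0, 1.0, m)
y = f_rho(x) + rng.uniform(-0.5, 0.5, m)  # conditional mean of y given x is f_rho(x)

def risk(f):
    # Monte Carlo version of the generalization error E(f) = ∫ (f(x) - y)^2 dρ.
    return np.mean((f(x) - y) ** 2)

r_best = risk(f_rho)
r_zero = risk(lambda t: np.zeros_like(t))
print(r_best)           # ≈ 1/12 ≈ 0.083, the conditional noise variance
print(r_best < r_zero)  # True: the regression function beats the zero predictor
```

Here $\mathcal{E}(f) - \mathcal{E}(f_\rho)$ equals the squared $L^2(\rho_X)$ distance between $f$ and $f_\rho$, which is why $f_\rho$ is the ideal estimator.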

A function $K : X \times X \to \mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite, that is, for any finite set of distinct points $\{x_1, \ldots, x_l\} \subset X$, the matrix $\bigl(K(x_i, x_j)\bigr)_{i,j=1}^{l}$ is positive semidefinite. The RKHS $\mathcal{H}_K$ is then defined (see Aronszajn, 1950) as the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_t \rangle_K = K(x, t)$. The reproducing property is given by

$$f(x) = \langle f, K_x \rangle_K, \qquad \forall f \in \mathcal{H}_K,\; x \in X.$$

Denote $C(X)$ as the space of continuous functions on $X$ with the norm $\|\cdot\|_\infty$. Because of the continuity of $K$ and the compactness of $X$, we have $\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty$. So the above reproducing property tells us that $\|f\|_\infty \le \kappa \|f\|_K$ for all $f \in \mathcal{H}_K$.
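These definitions can be checked numerically. The snippet below (our own illustration, with a Gaussian kernel and an arbitrarily chosen width) verifies symmetry, positive semidefiniteness, and the constant $\kappa = \sup_x \sqrt{K(x, x)}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(s, t, sigma=0.5):
    # A classical Mercer kernel on a compact X ⊂ R: continuous, symmetric, PSD.
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

pts = rng.uniform(0.0, 1.0, 8)  # a finite set of points in X = [0, 1]
G = gaussian_kernel(pts[:, None], pts[None, :])

sym = np.allclose(G, G.T)                         # symmetry of the Gram matrix
psd = np.linalg.eigvalsh(G).min() > -1e-8         # positive semidefiniteness
kappa = np.sqrt(gaussian_kernel(pts, pts).max())  # here K(x, x) = 1, so κ = 1
print(sym, psd, kappa)
```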

Given the sample $\mathbf{z}$, the classical RLSR algorithm is

$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2 + \lambda \|f\|_K^2 \right\}. \tag{1.2}$$

Here $\|f\|_K^2$ is called the regularization term, and $\lambda > 0$ is a regularization parameter, which is usually chosen to be some function of the sample size $m$.

In this letter, we consider a new RLSR algorithm in which the regularizer is not the RKHS norm but an *l^{p}*-norm of the coefficients in the kernel ensembles. More precisely,

$$f_{\mathbf{z}} = f_{\alpha^{\mathbf{z}}}, \qquad \alpha^{\mathbf{z}} = \arg\min_{\alpha \in \mathbb{R}^m} \left\{ \frac{1}{m} \sum_{i=1}^{m} \bigl(f_\alpha(x_i) - y_i\bigr)^2 + \lambda \sum_{i=1}^{m} |\alpha_i|^p \right\}, \tag{1.4}$$

where $f_\alpha(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i)$ for $\alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m$.

Algorithms like equation 1.4 are also popular in a wide range of cases. Daubechies, Defrise, and De Mol (2004) pointed out that by taking *p* < 2, and especially for the limit value *p* = 1, the proposed minimization procedure in equation 1.4 can promote the sparsity of the solution. Therefore, this scheme may be useful in the fields of signal processing, data compression, feature selection, and so on.
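To make the scheme concrete, here is a sketch of solving the coefficient problem in equation 1.4 on toy data (our own setup: a Gaussian kernel, $p = 1.5$, and a generic smooth optimizer; none of these choices come from the letter — for $p > 1$ the objective is differentiable, so a gradient-based solver is a reasonable stand-in):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Toy sample z = {(x_i, y_i)} and kernel matrix (K(x_i, x_j))_{i,j}.
m, lam, p, sigma = 40, 1e-3, 1.5, 0.3
x = rng.uniform(0.0, 1.0, m)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def objective(alpha):
    # (1/m) Σ (f_α(x_i) - y_i)^2 + λ Σ |α_i|^p, with f_α = Σ_j α_j K(·, x_j)
    residual = K @ alpha - y
    return residual @ residual / m + lam * np.sum(np.abs(alpha) ** p)

alpha_z = minimize(objective, np.zeros(m), method="L-BFGS-B").x
train_err = np.mean((K @ alpha_z - y) ** 2)
print(train_err < np.mean(y ** 2))  # True: the fit improves on the zero predictor
```

For $p = 2$ the problem is a ridge-type linear system with closed-form solution $(K^\top K + m\lambda I)\alpha = K^\top y$; for $p < 2$ one must iterate, which is the price paid for (approximate) sparsity.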

Mathematical analysis of learning algorithm 1.2 has been well understood (see Caponnetto & De Vito, 2007; Smale & Zhou, 2005, 2007; Wu, Ying, & Zhou, 2006). However, there are essential differences between algorithms 1.4 and 1.2. On one hand, the penalty functional is not a Hilbert space norm, which causes a technical difficulty for mathematical analysis. On the other hand, both the penalty term and the hypothesis space depend on the sample **z**, and thus the classical error decomposition approach (see Cucker & Smale, 2001; Niyogi & Girosi, 1996) cannot be applied directly to algorithm 1.4. In a word, the standard error analysis methods for scheme 1.2 are not appropriate for scheme 1.4. As a result, the theoretical analysis of scheme 1.4 is not yet very rich.

As we know, coefficient regularization learning algorithms were first considered in linear programming support vector machine (SVM) classification (see, e.g., Bradley & Mangasarian, 2000; Kecman & Hadzic, 2000; Vapnik, 1998). Tarigan and Van de Geer (2006) studied SVM-type classifiers that were linear combinations of given base functions and imposed an *l*^{1} penalty on the coefficients. Under some conditions on the margin and base functions, an oracle inequality was given. But they did not consider the approximation error (see the definition in section 2) and thus failed to give any learning rate for the algorithm. Wu and Zhou (2005) considered linear programming SVM on an RKHS. By setting a stepping-stone, they bridged the linear programming SVM with the classical quadratic programming SVM and consequently derived an explicit learning rate for the former. Furthermore, Wu and Zhou (2008) established a general analysis framework for learning with sample-dependent hypothesis spaces. In particular, they proposed the concept of hypothesis error (see also the definition in section 2) and introduced a novel error decomposition technique (involving a hypothesis error) to overcome the difficulties mentioned above.

In this letter, we adopt ideas from Wu and Zhou (2005, 2008) and give an error analysis for algorithm 1.4. Under some standard assumptions, we finally derive an explicit learning rate of this algorithm. The rest of the letter is organized as follows. In section 2, we present an error decomposition approach tailored to algorithm 1.4 and estimate the hypothesis error. The analysis of the sample error is given in section 3. We derive the learning rate in section 4.

## 2. Error Decomposition

Since the corresponding term in equation 2.1 is independent of the sample and characterizes the approximation ability of $\mathcal{H}_K$ with respect to $f_\rho$, we often call it an approximation error. Things are completely different in scheme 1.4, where both the hypothesis space and the penalty term depend on the sample **z**; bounds for an approximation error independent of **z** are therefore not available. To overcome the difficulty, we apply a modified error decomposition approach proposed by Wu and Zhou (2008). But before we do so, we introduce a projection operator.

Note that $|y| \le M$ almost surely. We have $|f_\rho(x)| \le M$ for all $x \in X$, so it is natural to restrict the approximations of $f_\rho$ to also be contained in $[-M, M]$. The idea of projection was introduced in the analysis of classification algorithms (see, e.g., Bartlett, 1998) and subsequently applied to regression as well (see, e.g., Györfi et al., 2002).
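A minimal sketch of such a projection (our own toy example, with $M = 1$ assumed):

```python
import numpy as np

M = 1.0  # |y| ≤ M almost surely, so f_rho takes values in [−M, M]

def project(f):
    # π(f)(x): clip f(x) to [−M, M]; moving a value toward [−M, M] can only
    # decrease its distance to any target y ∈ [−M, M].
    return lambda x: np.clip(f(x), -M, M)

f = lambda x: 3.0 * x  # an estimator that overshoots the interval [−M, M]
x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 1.0, 1.0])  # targets inside [−M, M]

err_raw = np.mean((f(x) - y) ** 2)
err_proj = np.mean((project(f)(x) - y) ** 2)
print(err_proj <= err_raw)  # True: projection never increases the squared error
```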

In comparison with equation 2.1, the extra term in equation 2.5 is called a hypothesis error, which we now estimate by the following theorem:
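Schematically, the decomposition behind equations 2.1 and 2.5 has the following shape (our labels and symbols, for orientation only; the precise terms are those of Wu & Zhou, 2008):

```latex
\mathcal{E}\bigl(\pi(f_{\mathbf{z}})\bigr) - \mathcal{E}(f_\rho)
  \;\le\; \underbrace{\mathcal{S}(\mathbf{z}, \lambda)}_{\text{sample error}}
  \;+\; \underbrace{\mathcal{H}(\mathbf{z}, \lambda)}_{\text{hypothesis error}}
  \;+\; \underbrace{\mathcal{D}(\lambda)}_{\text{approximation error}}
```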

where $\mathbf{I}$ is the identity matrix and $K[\mathbf{x}] = \bigl(K(x_i, x_j)\bigr)_{i,j=1}^{m}$. By the Hölder inequality, we get equation 2.7. Combining equations 2.4, 1.4, and 2.7, we can then derive equation 2.8. By taking the minimizer of equation 1.2 and $f = 0$ in equation 1.2, respectively, we have equations 2.9 and 2.10. Putting equations 2.9 and 2.10 into equation 2.8, one has the claimed bound. This proves the theorem.

The result of theorem 1 for *p*=2 was also obtained by Wu and Zhou (2008).

## 3. Sample Error

Applying proposition 2.1 of Wu et al. (2006), we obtain the following estimate:

*Let the function be given by equation 2.3. For any t > 0, with confidence $1 - e^{-t}$, the stated bound holds.*

Notice that $f_{\mathbf{z}}$ involves the sample $\mathbf{z}$ and thus runs over a set of sample-dependent functions. In order to estimate the sample error, we often need a complexity measure of the underlying function class. For simplicity, in this letter we use only the uniform covering number.

$\mathcal{H}_K$ can be embedded into $C(X)$ with continuous inclusion, and thus the uniform covering number is well defined. We denote the uniform covering number of the unit ball $B_1 = \{f \in \mathcal{H}_K : \|f\|_K \le 1\}$ as $\mathcal{N}(B_1, \eta)$, $\eta > 0$.

The uniform covering number has been extensively studied in learning theory. Zhou (2003) showed that equation 3.2 always holds with $s = 2n/r$ if the kernel $K$ is $C^r$ with $r > 0$ on a bounded subset $X$ of $\mathbb{R}^n$. In particular, for a $C^\infty$ kernel (such as a Gaussian kernel or a polynomial kernel), equation 3.2 is valid for any $s > 0$.
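For reference, the covering-number condition of Zhou (2003) is usually stated in the following polynomial form, which we take equation 3.2 to be (notation $B_1$, $C_0$ assumed):

```latex
\ln \mathcal{N}(B_1, \eta) \;\le\; C_0 \left(\frac{1}{\eta}\right)^{s}, \qquad \forall\, \eta > 0
```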

The following lemma is also adapted from Wu et al. (2006). It is a uniform law of large numbers for a class of functions:

*If equation 3.2 is satisfied, then for any t > 0, with confidence $1 - e^{-t}$, the stated bound holds for all $f \in B_R$, where $C_1$ is a constant independent of m, t, and R.*

Now we need to find a ball containing $f_{\mathbf{z}}$:

*If equation 3.2 is satisfied, then for any t > 0, with confidence $1 - e^{-t}$, it holds that $f_{\mathbf{z}} \in B_R$, where R depends explicitly on m, t, and λ.*

## 4. Learning Rates

Combining the estimations in sections 2 and 3, we can derive an explicit learning rate for algorithm 1.4:

*Suppose equation 3.2 is satisfied and the approximation error condition holds for some exponent. Then, with λ taken as a suitable power of m, for any $0 < \delta < 1$, with confidence at least $1 - \delta$, the stated learning rate holds, where $C_2$ is a constant independent of m and δ.*

So our theorem follows by taking $t = \log(1/\delta)$ together with the stated choice of λ.

We end this letter with some remarks.

When *p* is close to 1, the exponent in the derived learning rate is close to its smallest value, and it increases monotonically with respect to *p*. So it is a reasonable conjecture that one should sacrifice some sparsity in order to improve the learning rate.
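This sparsity–rate tradeoff can be probed empirically. The sketch below (entirely our own toy experiment: a Gaussian kernel, a generic smooth solver, and an arbitrary $10^{-3}$ threshold for "near zero") counts small coefficients of the solution of equation 1.4 as *p* varies:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Toy instance of scheme 1.4 (Gaussian kernel assumed; sizes are arbitrary).
m, lam, sigma = 30, 1e-2, 0.3
x = rng.uniform(0.0, 1.0, m)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def solve(p):
    # Minimize (1/m) Σ (f_α(x_i) - y_i)^2 + λ Σ |α_i|^p over α ∈ R^m.
    obj = lambda a: np.mean((K @ a - y) ** 2) + lam * np.sum(np.abs(a) ** p)
    return minimize(obj, np.zeros(m), method="L-BFGS-B").x

for p in (1.1, 1.5, 2.0):
    alpha = solve(p)
    # Report how many coefficients are (nearly) zero at this value of p.
    print(p, int(np.sum(np.abs(alpha) < 1e-3)))
```

Because the *l^p* penalty with *p* > 1 is smooth at the origin, exact zeros are not expected; only the limit *p* → 1 promotes true sparsity, in line with Daubechies et al. (2004).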

To derive a nontrivial learning rate, *s*+*p*(1−*s*) is required to be positive in theorem 2. We are not sure whether there is some inherent difficulty in using a larger value of *p* on an RKHS that has a larger complexity exponent. However, at least for a kernel such as a Gaussian or a polynomial kernel, *p* can be chosen arbitrarily in (1, 2].

The case *p* = 1 is not included in our result. To obtain a learning rate in that case, we conjecture that some additional assumptions on *X* and ρ are necessary, just as was done in Xiao and Zhou (in press).

## Acknowledgments

We thank the anonymous referees for their careful review and helpful suggestions. This work is partially supported by NSF of China under grants 10871015 and 10872009, and the National Basic Research Program of China under grant 973-2006CB303102.

## References
