Extended Hybrid Image Similarity – Combined Full-Reference Image Quality Metric Linearly Correlated with Subjective Scores

Abstract—Reliable image quality assessment is one of the most relevant issues in image processing and analysis. During the last several years, various researchers have proposed numerous metrics that agree with human perception of various distortions much better than the traditionally used Mean Squared Error or similar metrics. Nevertheless, the direct application of such metrics does not provide high linear correlation with subjective scores, because an additional nonlinear mapping is required. Unfortunately, such fitting, typically applied for each image database using the logistic function, leads to different parameter values for each dataset. As a more universal approach, some nonlinear combinations of various metrics that do not require any nonlinear mapping have been proposed recently. In this paper an extended combined similarity metric is proposed, which provides high prediction accuracy of the image quality together with a highly linear correlation with subjective scores. The results of extensive tests conducted using the most relevant image quality assessment databases are also presented.


I. INTRODUCTION
The role and importance of image analysis in various applications is still growing. Regardless of the specific problem, the accuracy of detection, recognition or classification based on image processing and analysis strongly depends on the quality of input images. In many cases a subjective image quality assessment is conducted by the human operator of the system and may be specific for a given application. Nevertheless, some typical image distortions are common and their impact on the results of the image analysis is similar.
Other important issues are data transmission and visualization, which are directly related to image quality assessment and to some image processing methods. In such applications an objective and reliable image quality assessment, independent of the human subject but well correlated with subjective quality scores, may be useful for the development of new algorithms as well as for their optimization and verification, especially for color images.

II. CLASSICAL IMAGE QUALITY METRICS
A reliable objective image quality metric should be independent of the image content, so different images subjected to the same distortions should give equal results. Because scalar metrics are easy to use for optimization purposes, they are preferred, especially with a dynamic range from 0 to 1.
The objective image quality assessment schemes may be classified into three major groups. The first one is known as the "blind" (no-reference) approach, which does not require access to the original undistorted image at all [1], [2]. Although the potential usage is wide and such metrics are the most desired ones, their universality is currently rather low, as they are usually sensitive to only one or two types of distortions. Some typical examples are blur metrics and measures of JPEG blocking artifacts.
Another, more popular group of methods consists of numerous full-reference metrics, including the classical Mean Squared Error (MSE) and Peak Signal to Noise Ratio (PSNR). Such methods require exact knowledge of the original, perfect-quality image without any distortions, and the quality score is calculated by comparing some features of the distorted and original (reference) images. Although access to the original image is not always possible in practical applications, great progress in this family of metrics has taken place in recent years.
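As an illustration of the classical full-reference metrics mentioned above, the following minimal sketch computes the MSE and PSNR for two images given as flat lists of pixel values. The function names and the default peak value of 255 (8-bit images) are assumptions made for this example only.

```python
import math


def mse(reference, distorted):
    """Mean Squared Error between two equally sized sequences of pixel values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    squared = ((r - d) ** 2 for r, d in zip(reference, distorted))
    return sum(squared) / len(reference)


def psnr(reference, distorted, peak=255.0):
    """Peak Signal to Noise Ratio in dB (assumed peak value: 255 for 8-bit data)."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(peak ** 2 / error)
```

Both functions compare only corresponding pixels, which is the reason for their limited agreement with human perception discussed in the following sections.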
The third, less popular approach is known as reduced-reference and it requires partial knowledge about the reference image, e.g. some DCT coefficients or specified features. Such methods may also be used for the estimation of some full-reference metrics [3].
Regardless of its type, each newly developed metric should be verified in order to determine its concordance with subjective evaluations and with the Human Visual System (HVS), which can be modeled using many techniques describing the way that human observers perceive various kinds of image distortions. For this reason some image quality assessment databases have been developed, containing reference and distorted images together with Mean Opinion Scores (MOS) or Differential Mean Opinion Scores (DMOS) collected during experiments conducted with numerous observers.

III. RECENT FULL-REFERENCE IMAGE QUALITY METRICS
Poor correlation with subjective scores of some traditional metrics based on the comparison of corresponding pixels from the reference and distorted images, such as MSE or PSNR, made it necessary to develop a new approach based on alternative assumptions. The first such idea, known as the Universal Image Quality Index (UIQI), is based on the comparison of three local features corresponding to common distortions, using the sliding window approach [4]. These features are: luminance distortions, loss of contrast and structural distortions. After some modifications, e.g. increasing its stability, the UIQI metric has been extended [5] into one of the most popular image quality metrics, called Structural Similarity (SSIM). The local SSIM formula for each window position can be expressed as the combination of the mean values, variances and the covariance

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{1}$$

assuming that x and y denote the local fragments of the distorted and reference images inside 11×11 pixel windows (weighted using a Gaussian function), with small stabilizing constants $C_1 = (0.01 \cdot L)^2$ and $C_2 = (0.03 \cdot L)^2$, where L is the number of available luminance levels (typically 256).
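The local SSIM computation for a single window position can be sketched as follows. This is a simplified illustration rather than a reference implementation: the window is passed as two flat lists, all pixels are weighted equally instead of using the Gaussian weighting mentioned above, and population (1/N) statistics are assumed. The function name and the `levels` parameter are chosen for this example only.

```python
def local_ssim(x, y, levels=256):
    """SSIM value for one window; x and y are flat lists of luminance values.

    Assumptions of this sketch: uniform (not Gaussian) weights and
    population variance/covariance (division by N).
    """
    if len(x) != len(y) or not x:
        raise ValueError("windows must be non-empty and of equal size")
    n = len(x)
    c1 = (0.01 * levels) ** 2  # stabilizes the luminance term
    c2 = (0.03 * levels) ** 2  # stabilizes the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

The overall quality index would then be obtained by averaging such local values over all window positions.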
Due to the popularity of the SSIM metric, some modifications have also been proposed, e.g. three-component weighted SSIM, gradient SSIM or Multi-Scale SSIM (MS-SSIM) [6], which is defined as

$$\mathrm{MS\text{-}SSIM}(x,y) = [l_M(x,y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x,y)]^{\beta_j}\,[s_j(x,y)]^{\gamma_j}, \tag{2}$$

where the components l, c and s, representing the luminance, contrast and structural distortions respectively, are weighted for each j-th scale (but the luminance changes are considered at only one scale, indexed M). The default values of the exponents $\alpha_M$, $\beta_j$ and $\gamma_j$ have been proposed by the authors of the paper [6] as the result of an optimization procedure with 600 images used for testing, involving 8 observers.
Some other interesting ideas of full-reference image quality assessment are based on the application of Singular Value Decomposition (SVD), wavelets, other transforms or some elements of information theory. An example of this approach is the Visual Information Fidelity (VIF), defined as the mutual information that vision extracts from the distorted image divided by the information extracted from the reference one, calculated for several sub-bands in the wavelet domain [7].
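The multi-scale combination described above can be illustrated with a short sketch that merges already computed per-scale terms into one score. It assumes that the contrast and structure terms have been evaluated at every scale and the luminance term at a single scale; the function name and argument layout are hypothetical, and the exponents have to be supplied by the caller.

```python
def ms_ssim_from_terms(luminance, contrasts, structures, alpha, betas, gammas):
    """Combine per-scale SSIM terms into one multi-scale score.

    luminance  -- luminance term from the single scale at which it is evaluated
    contrasts  -- contrast terms, one per scale
    structures -- structure terms, one per scale
    alpha, betas, gammas -- exponents (weights) supplied by the caller
    All terms are assumed to be positive in this sketch.
    """
    if not (len(contrasts) == len(structures) == len(betas) == len(gammas)):
        raise ValueError("one contrast/structure term and exponent per scale")
    score = luminance ** alpha
    for c, s, beta, gamma in zip(contrasts, structures, betas, gammas):
        score *= (c ** beta) * (s ** gamma)
    return score
```

Because the terms are multiplied, a severe degradation at any single scale lowers the overall score, with the exponents controlling how strongly each scale contributes.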
One of the most promising directions of research seems to be the similarity-based approach represented by Riesz-based Feature Similarity (RFSIM) and Feature Similarity (FSIM), which are quite similar to the idea of the SSIM. The RFSIM metric [8] is defined as

$$\mathrm{RFSIM} = \prod_{i=1}^{5} \frac{\sum_m \sum_n d_i(m,n)\, M(m,n)}{\sum_m \sum_n M(m,n)}, \tag{3}$$

where M is the binary mask obtained as the edge detection result (the well-known Canny filter may be applied for this purpose) and $d_i$ are local similarity values calculated for five features obtained using the 1st and 2nd order Riesz transform coefficients. Denoting the i-th feature maps of the images x and y by $f_i$ and $g_i$, the local similarity at the pixel position (m, n) can be calculated (using a small stabilizing constant C) as

$$d_i(m,n) = \frac{2 f_i(m,n)\, g_i(m,n) + C}{f_i^2(m,n) + g_i^2(m,n) + C}. \tag{4}$$

Further research of the same group has led to the definition of the FSIM metric [9], based on two factors: phase congruency (PC) and gradient magnitude (G). The construction of the overall index is quite similar to (3) and its value can be obtained as

$$\mathrm{FSIM} = \frac{\sum_m \sum_n S(m,n)\, PC_{\max}(m,n)}{\sum_m \sum_n PC_{\max}(m,n)}, \tag{5}$$

where $PC_{\max}$ is the higher of the two local values of phase congruency from the reference and assessed images. The local similarity value is defined as the product of two factors related to phase congruency and gradient (the Scharr filter is recommended for computing the latter)

$$S(m,n) = [S_{PC}(m,n)]^{\alpha}\, [S_G(m,n)]^{\beta}. \tag{6}$$

The color version of the metric (FSIMc) can also be calculated using the YIQ color model in a similar way, replacing the PC and G values by the chrominance components I and Q respectively and multiplying the obtained result (using the exponent value $\lambda = 0.03$) by the formula (6), assuming the values of the exponents $\alpha$ and $\beta$ equal to 1.
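The two building blocks shared by RFSIM and FSIM, i.e. the stabilized local similarity and the weighted pooling of local values, can be sketched as follows. Feature maps are passed as flat lists, the weights stand either for the binary edge mask (RFSIM) or for the maximum phase congruency (FSIM), and the default value of the stabilizing constant is a placeholder assumed for this example. Computing the Riesz, phase congruency and gradient features themselves is outside the scope of the sketch.

```python
def similarity(f, g, c):
    """Stabilized similarity of two feature values: (2fg + C) / (f^2 + g^2 + C)."""
    return (2.0 * f * g + c) / (f * f + g * g + c)


def weighted_pool(similarities, weights):
    """Weighted mean of local similarities (weights: edge mask or PC_max values)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(similarities, weights)) / total


def rfsim_like(features_x, features_y, mask, c=1.0):
    """Product over feature maps of the mask-pooled local similarities.

    features_x, features_y -- lists of flat feature maps (one per feature)
    mask -- flat binary edge mask; c -- placeholder stabilizing constant
    """
    score = 1.0
    for fx, fy in zip(features_x, features_y):
        local = [similarity(a, b, c) for a, b in zip(fx, fy)]
        score *= weighted_pool(local, mask)
    return score
```

With a binary mask only the edge pixels contribute to the pooled value, whereas real-valued weights emphasize the positions with the strongest features.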
All the metrics briefly presented above have an important common disadvantage: their values are not directly related to the subjective scores expressed as MOS or DMOS values, so an additional nonlinear mapping is required in order to achieve high values of the linear correlation coefficients between objective and subjective quality scores.

IV. IMAGE QUALITY ASSESSMENT DATASETS
The compliance of objective metrics with MOS or DMOS values can be expressed as the quality prediction accuracy and the prediction monotonicity. The accuracy is measured using Pearson's linear Correlation Coefficient (CC), whereas the monotonicity can be evaluated by the Spearman Rank Order Correlation Coefficient (SROCC) or the Kendall Rank Order Correlation Coefficient (KROCC).
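The three correlation measures named above can be illustrated by the following dependency-free sketch. It is intended only as an illustration: the Spearman coefficient is computed as the Pearson correlation of average ranks, and the Kendall coefficient is the simple tau-a variant, in which tied pairs count as neither concordant nor discordant. Whether a given study uses this or another tie-handling variant is not specified here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (prediction accuracy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation (prediction monotonicity)."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            balance += (product > 0) - (product < 0)
    return balance / (n * (n - 1) / 2)
```

For a monotonic but nonlinear relationship, e.g. objective scores 1, 2, 3, 4 against subjective scores 1, 4, 9, 16, both rank-based coefficients reach 1 while the Pearson coefficient stays below 1, which is exactly the gap that the nonlinear mapping is meant to close.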
In order to achieve a reliable verification, calculations should be conducted for all available datasets. The most relevant of them is the Tampere Image Database (TID2008) [10], containing 1700 color images with 17 types of distortions and MOS values obtained from 838 observers. The other two relevant datasets are the Categorical Subjective Image Quality (CSIQ) database from Oklahoma State University [11], with 866 images (35 observers and 6 distortion types), and the well-known LIVE dataset from the University of Texas at Austin [12], containing 779 images distorted in five ways and assessed by 29 subjects.
As a supplement to those three databases, less important ones may also be used, such as: IRCCyN/IVC [13] from the University of Nantes (160 images with 4 types of distortions assessed by 15 observers), Wireless Image Quality (WIQ) with 80 distorted greyscale images judged by 30 observers [14], and the A57 dataset [15] consisting of 54 test images with 6 types of contaminations evaluated by 7 experts. The oldest database has been developed by Toyama University in Japan and is known as the MICT database [16]. Nevertheless, its usefulness is currently strongly limited, as it contains 198 images assessed by 16 students but only two types of distortions, related to JPEG and JPEG2000 compression.

V. PROPOSED APPROACH AND THE VERIFICATION RESULTS
A highly linear relationship between the objective and subjective scores is typically obtained by nonlinear mapping, e.g. by a logistic function, with the necessary optimization of the mapping function's parameters. Unfortunately, this optimization has to be conducted independently for each dataset, leading to different values of the coefficients. For this reason such an approach cannot be considered a universal one.
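The nonlinear mapping discussed above can be illustrated with a five-parameter logistic function of the kind frequently used in image quality assessment studies. The exact form of the mapping is not specified in the text, so the formula below, as well as the function and parameter names, should be treated as an assumption of this sketch.

```python
import math


def logistic_mapping(score, b1, b2, b3, b4, b5):
    """Map a raw objective score onto the subjective (MOS/DMOS) scale.

    Assumed form:
        b1 * (1/2 - 1 / (1 + exp(b2 * (score - b3)))) + b4 * score + b5
    The parameters b1..b5 have to be fitted separately for each dataset.
    """
    logistic = 0.5 - 1.0 / (1.0 + math.exp(b2 * (score - b3)))
    return b1 * logistic + b4 * score + b5
```

In practice the parameters would be fitted to the subjective scores of a given database, e.g. by nonlinear least squares, which is why their values differ from one dataset to another.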
Much better results can be obtained using the nonlinear combination of some metrics, as proposed in the paper [17]. Using the weighted product of three (or more) metrics, with exponent values optimized for the largest database (TID), a substantial increase of the CC values can be achieved in comparison to each of the metrics used separately (even after nonlinear mapping). For example, the combination of the MS-SSIM, VIF and R-SVD metrics leads to CC = 0.86 [17], and replacing the R-SVD by the FSIMc metric, proposed in the paper [18] as the CISI metric, leads to CC = 0.8752 for the TID database.
Good results may also be obtained using the nonlinear combination of the RFSIM and FSIMc metrics, leading to the HFSIMc metric with CC = 0.8861 for the same dataset.
Another possibility is changing the weighting exponents inside the calculation procedure of the FSIM ($\alpha$ and $\beta$ in (6)) or FSIMc metric, discussed in [22], leading to the Weighted FSIM (WFSIM) metric and its color version WFSIMc, which increases the values of the rank order correlation coefficients with subjective scores.
The extended version of the approach, based on the combination of four metrics (MS-SSIM, VIF, RFSIM and weighted FSIM), is proposed in this paper. It is defined by (7) as the weighted product of these four metrics, raised to the exponents a, b, c and d, assuming the use of the color version of the WFSIM metric for available color images. The values of the exponents obtained as the result of the optimization conducted using the TID database are

$$a = 1.6131, \quad b = 0.2037, \quad c = 59.7151, \quad d = 0.1989, \tag{8}$$

with the definition of the WFSIM or WFSIMc metric using the values of the coefficients suggested in the paper [19]: 0.01 and 0.05 for WFSIM, or 0.01, 0.05 and 0.004 for WFSIMc.
The obtained CC, SROCC and KROCC values and their comparison with other metrics for the databases described in Section IV are presented in Tables I-III, and the aggregate values for all datasets, weighted according to the number of test images in the respective datasets, are shown in Table IV.
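The weighted-product combination of metrics and the aggregation of correlation values over datasets can be sketched as follows. The function names and the dictionary-based interface are hypothetical; in particular, the pairing of each exponent with a specific metric is left to the caller and is not asserted by this sketch.

```python
def combined_metric(values, exponents):
    """Weighted product of metric values: prod(values[name] ** exponents[name]).

    values    -- dict mapping a metric name to its value for one image
    exponents -- dict mapping the same names to their exponents
                 (the pairing of exponents with metrics is an assumption
                 made by the caller, not fixed by this sketch)
    """
    if set(values) != set(exponents):
        raise ValueError("each metric needs exactly one exponent")
    score = 1.0
    for name, value in values.items():
        score *= value ** exponents[name]
    return score


def weighted_aggregate(correlations, image_counts):
    """Mean of per-dataset correlations weighted by the number of test images."""
    total = sum(image_counts)
    if total == 0:
        raise ValueError("image counts must not sum to zero")
    return sum(c * n for c, n in zip(correlations, image_counts)) / total
```

A larger exponent makes the combined score more sensitive to the corresponding metric, which is useful for metrics whose values occupy only a narrow range close to 1.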

Fig. 1. Scatter plot of the proposed metric versus MOS values for the TID2008 database, illustrating the highly linear relationship between objective and subjective evaluations.

Fig. 2. Lower and upper bounds for the 95% confidence interval calculated for Pearson correlation with subjective scores for major datasets.

TABLE I. PEARSON LINEAR CORRELATION COEFFICIENTS (CC) FOR VARIOUS METRICS AND DATASETS.

TABLE II. SPEARMAN RANK-ORDER CORRELATION (SROCC) FOR VARIOUS METRICS AND DATASETS.

TABLE III. KENDALL RANK-ORDER CORRELATION (KROCC) FOR VARIOUS METRICS AND DATASETS.