I Introduction
Modelling based on graphs has recently attracted an increasing amount of interest in machine learning and signal processing research. On the one hand, many real-world data are intrinsically graph-structured, e.g. individual preferences in social networks or environmental monitoring data from sensor networks. This makes graph-based methods a natural approach to analysing such structured data. On the other hand, graphs are an effective modelling language for revealing relational structure in complex domains and may assist in a variety of learning tasks. For example, knowledge graphs improve the performance in semantic parsing and question answering
[1]. Despite their usefulness, however, a graph is not always readily available or explicitly given. The problem of graph learning therefore concerns the construction of a topological structure among entities from a set of observations on these entities.
Methodologies to learn a graph from structured data include naïve methods such as k-nearest neighbours (k-NN), and approaches from the literature of probabilistic graphical models (PGMs) and, more recently, graph signal processing (GSP) and graph neural networks (GNNs). The basic idea of k-NN is to connect a node to the k other nodes with the smallest pairwise distances in terms of the observations [2, 3, 4, 5]. In PGMs, a graph expresses the conditional dependence with edges between random variables represented by nodes
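As an illustration of the nearest-neighbour construction, the following sketch builds a symmetric k-NN graph from an observation matrix; the plain Euclidean metric and binary edge weights are illustrative assumptions.

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric k-NN adjacency from observations X (rows = nodes)."""
    N = X.shape[0]
    # Pairwise Euclidean distances between the nodes' observation vectors.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)              # exclude self-loops
    A = np.zeros((N, N))
    for i in range(N):
        A[i, np.argsort(D[i])[:k]] = 1.0     # connect i to its k closest nodes
    return np.maximum(A, A.T)                # symmetrise for an undirected graph

A = knn_graph(np.random.default_rng(0).normal(size=(10, 5)), k=3)
```

The symmetrisation step means a node may end up with more than k neighbours, which is one of several common conventions.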
[6]. The GSP literature, on the other hand, focuses on algebraic and spectral characteristics of graph signals [7, 8, 9], which are defined as observations on a collection of nodes. The GSP-based graph learning methods (see [10, 11] for two recent reviews) further fall into two distinct branches, i.e. those based on diffusion processes on graphs [12, 13, 14, 15] and those based on smoothness measures of graph signals [16, 17, 18, 19, 20]. Very recently, GNNs have attracted a surge of interest in the machine learning community, which has led to a number of approaches to graph inference [21, 22].
While many of the above methods can effectively learn a meaningful graph from observations, there is a lack of consideration of additional information, i.e. node-side or observation-side covariates, which may be available for the task at hand. Such covariates provide valuable side information and should be integrated into the graph learning framework. Consider the example of temperature records measured at different locations in a country, where nodes represent weather stations; the latitude, longitude and altitude of each station are useful node-side information. One major benefit is to lessen the reliance of the above models on the quality of the observations. Heavily corrupted or even missing records can be predicted from the relationship between the observations and the side information, which in turn helps improve the efficiency of graph inference.
Furthermore, although node-side dependency is inherently accounted for in the process of graph learning, observation-side dependency is largely ignored in the literature. One example is that of temperature records collected at different timestamps, which could largely affect the evaluation of the strength of relation between stations. Another example is that of a recommender system, where the item ratings collected from different individuals are largely affected by the social relationships between them.
To tackle the above issues, we revisit graph signal observations from a functional viewpoint and propose a framework for learning undirected graphs by considering additional covariates on both the node and observation side. This allows us to capture dependency structure within the graph signals, which leads to more effective graph and signal recovery. More specifically, as shown in Figure 1, the th entry of the graph-structured data matrix , which contains graph signals collected on nodes, can be viewed as a potentially noisy or missing observation of , i.e. the th function evaluated at the th node. To model the node-side information, we introduce a covariate
that can explain the variations in a graph signal, e.g. a vector that contains the latitude, longitude and altitude of stations in the aforementioned temperature example. To model the observation-side information, we also introduce a generic covariate
. For example, could be the timestamp at which the temperature record is collected. Observation-side dependency hence arises due to depending on . Combining the two, the function underlying the graph signals takes the form of .
Specifically, we define the function in a reproducing kernel Hilbert space (RKHS) with a product kernel on . At the same time, the two-side dependency in is encoded in a Kronecker product of two graph Laplacian matrices , where represents the connectivity between nodes to be learned and represents the observation-side dependency, essentially a nuisance dependency for the graph learning problem. We assume can be captured by evaluating at the observation-side covariates . Our key contribution is the Kernel Graph Learning (KGL) framework, which allows us to infer by jointly learning the function and optimising a novel Laplacian quadratic form that effectively expresses the smoothness of over .
In addition, we provide several extensions of KGL for the scenario of a partially observed with known missing-value positions, and that of observations without either node-side or observation-side information. The learning problem is effectively solved via a block coordinate descent algorithm, which has a theoretical guarantee of convergence. We show that KGL can effectively recover the ground-truth graph from two-side dependent data and outperform state-of-the-art smoothness-based graph learning methods in both synthetic and real-world experiments.
In summary, the main contributions of our work are as follows:

A novel graph-based regularisation based on a smoothness measure of dependent graph signals over the Kronecker product of two graph Laplacian matrices;

A graph learning framework that integrates node-side and observation-side covariates from a functional viewpoint;

An efficient method for denoising and imputing missing values in the observed graph signals as a byproduct of the graph learning framework.
II Related Work
In this section, we survey classical methods for learning a graph from a number of different perspectives. From each perspective, we highlight the most related work that considers one or more aspects of 1) node-side information, 2) observation-side dependency, and 3) noisy and missing data.
II-A Nearest Neighbours Methods
The k-nearest neighbours (k-NN) method connects a node to the k other nodes with the smallest pairwise distances in terms of the observations. It is flexible with respect to the choice of distance metric, yet heuristic, since the neighbourhood search is based on pairwise comparisons of observations on nodes. The majority of k-NN variants focus on fast approximate nearest-neighbour (ANN) search algorithms [4, 2, 3, 5], and recent variants apply deep reinforcement learning to explicitly maximise search efficiency
[23, 24]. By comparison, the model-based methods, e.g. PGMs and GSP, directly integrate global properties of the observations into the learning objectives.
II-B Probabilistic Graphical Models
In the field of PGMs, the inverse covariance matrix
is often regarded as an undirected graph that parameterises the joint distribution of random variables representing nodes. There is a rich literature on effective algorithms to estimate a sparse
from Gaussian-distributed data by solving an ℓ1-regularised log-likelihood maximisation [25, 26, 27, 28, 29], including the widely used graphical Lasso [30] and G-ISTA [31]. The recent state-of-the-art algorithm BigQUIC [32] scales to millions of nodes with performance guarantees. Besides computational improvements, models based on attractive Gaussian Markov random fields (GMRFs) [33, 34, 35, 36] further restrict the off-diagonal entries of to be non-positive, which is equivalent to learning the Laplacian matrix of the corresponding graph with non-negative edge weights. The most related extensions of the graphical Lasso were proposed in [37, 38], which simultaneously learn two dependency structures in matrix-variate Gaussian data. While their work focuses on estimating covariance matrices, ours focuses on recovering a graph topology from data.
II-C Structural Equation Models
Structural equation models (SEMs) are another class of models (similar to PGMs) widely used to learn a directed acyclic graph (DAG) that encodes the conditional dependence of random variables [39, 40, 41]. Based on SEMs, the authors in [42] proposed a block coordinate descent algorithm to solve the joint optimisation problem of denoising the data and learning a directed graph. The joint learning framework is further extended to time series [43]
, where the structural vector autoregressive models (SVARMs) replace the linear SEMs to handle temporal dependency. The main difference from our work is that their denoising function is an identity mapping without side information as covariates. The work in
[41] also considers temporal dependency in learning a DAG with SVARMs, but does not consider the denoising scenario.
II-D Graph Signal Processing
In the context of GSP, every observation on a collection of nodes is defined as a graph signal. GSP-based graph learning models have seen increasing interest in the literature [10, 11] and further fall into two distinct branches. The first branch assumes graph signals are outcomes of diffusion processes on graphs and reconstructs a graph from signals according to the diffusion model [12, 13, 14, 15]. The other branch constructs a graph by promoting the global smoothness of graph signals, which is defined by the Laplacian quadratic form [16, 17] or, more generally, via total variation [20]. Smoothness-based methods are related to GMRFs by recognising that the Laplacian quadratic form is closely related to the log-likelihood under a precision matrix defined as the graph Laplacian. Our work can be regarded as an extension of smoothness-based graph construction.
In the literature of smoothness-based GSP graph learning, the authors in [17, 20] adopt a two-step learning algorithm to learn an undirected graph while denoising graph signals. They simply assume an identity mapping between the actual graph signals and noisy observations, which is different from our work that considers side information. The most related work is proposed in [44]
, which uses kernel ridge regression with observationside covariates to infer graph signals. However, their work mainly focuses on data prediction and graph learning is only a byproduct in their approach. In Section
VI-A, we will show, both theoretically and empirically, that their method uses a smoothness term that imprecisely incorporates the observation-side dependency in the learned graph structure, leading to an inferior performance in learning a graph.
In terms of the observation-side dependency, there exist some GSP graph learning models that consider temporal dependency in graph signals. A so-called spatio-temporal smoothness was proposed in [45, 46] to transform the graph signals using a temporally weighted difference operator. If every timestamp is equally important, the operator is equivalent to a preprocessing step that makes the time series observed on each node stationary. It should be noted that there is another branch of research assuming that the temporal dependency in graph signals originates in dynamic changes in the edges [47, 48], and the problem is therefore formulated as learning a dynamic series of graphs, which is different from the goal of our paper.
II-E Graph Neural Networks
A new branch of graph learning models is developed from the perspective of GNNs. Essentially, GNNs discover the patterns in graph-structured data in a hierarchical manner [49, 50, 51]. The activations at intermediate layers, e.g. the th layer, can be interpreted as a new representation of the nodes in an embedding space that incorporates information from a specifically defined neighbourhood of each node. The authors in [52, 21] thus defined the strength of connectivity between nodes and based on the pairwise similarity of their embeddings and at the th layer of the GNN architecture. The authors in [53] extended this method to construct a directed graph in the process of training a GNN that deals with time series data. The main goal of these methods is to improve the performance of node-related tasks (e.g. classification or prediction), and graph learning is only a byproduct whose performance is often not guaranteed. The recent works in [54, 55, 56, 22] have started to incorporate an additional loss for recovering graphs while training the GNNs. However, a significant limitation of most GNN-based methods is that they typically require a large volume of training data, and the learned connectivity is often less explainable compared to PGM and GSP methods.
III Preliminaries
III-A Smoothness-Based GSP Graph Learning
Observing a data matrix whose th entry corresponds to the observation on the th node in the th graph signal, we are interested in constructing an undirected and weighted graph . The node set represents a collection of variables, where . The edge set represents the relational structure among them to be inferred. The structure is completely characterised by the weighted adjacency matrix whose th entry is . If two nodes and are connected by an edge, then ; otherwise . The graph Laplacian matrix is defined as , where denotes the all-one vector. and are equivalent and complete representations of the graph on a given set of nodes.
In the literature of GSP, one typical approach to constructing a graph from is formulated as minimising the variation of signals on the graph, as measured by the Laplacian quadratic form [7, 16, 17]:
(Footnote 1: We acknowledge that the conventional form of the Laplacian quadratic in the GSP literature is , where each column of corresponds to a graph signal. In our case, has two-side dependency such that either a column or a row may be regarded as a graph signal. The term measures the smoothness of row vectors over a column graph. This formulation is, however, consistent with the statistical modelling convention, where each column of is often regarded as a random variable and the graph of main interest is the column graph.)
(1) 
where defines the space of valid Laplacian matrices, and
is a regularisation term with a hyperparameter
. Equivalently, the problem can be formulated using the weighted adjacency matrix such that
(2) 
where defines the space of valid weighted adjacency matrices. Popular choices of regularisation include the sum barrier and (often added as a constraint such that ) to prevent trivial solutions where all edge weights are zero, while controlling the variation of edge weights [17], or the log-barrier to prevent isolated nodes and promote connectivity [16].
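As a minimal numerical illustration of the objects above, the following sketch builds a graph Laplacian from a weighted adjacency matrix and evaluates the Laplacian quadratic form; the 3-node path graph is an arbitrary toy example.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = diag(W @ 1) - W of a weighted adjacency matrix W."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(X, W):
    """Laplacian quadratic form tr(X^T L X) = 0.5 * sum_ij W_ij ||x_i - x_j||^2."""
    return np.trace(X.T @ laplacian(W) @ X)

# Toy 3-node path graph 0 - 1 - 2 with unit edge weights.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
smooth_signal = np.array([[1.], [1.], [1.]])    # constant signal: zero energy
rough_signal = np.array([[1.], [-1.], [1.]])    # alternating signal: high energy
```

A constant signal attains zero energy, while a signal that alternates across edges is penalised, which is exactly the behaviour the minimisation above exploits.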
With a fixed Frobenius norm for , a small value of the objective in Eq.(1) implies that is smooth on in the sense that neighbouring nodes have similar observations. The authors in [17] further propose a probabilistic generative model of the noise-free smooth observations such that
(3) 
where denotes the Moore-Penrose pseudo-inverse of a matrix. This leads to a graph learning framework which solves an optimisation problem similar to Eq.(1).
III-B Kronecker Product Kernel Regression
Taking a functional viewpoint on the generation of the graph-structured data matrix, we can make use of the well-studied formalism of Kronecker product kernel ridge regression to infer the latent function [57]. Specifically, we consider to be an element of a reproducing kernel Hilbert space (RKHS) corresponding to the product kernel function on , where denotes the Kronecker product.
A kernel function can be expressed as an inner product in a corresponding feature space, i.e. where and , where . An explicit representation of the feature maps and is not necessary, and the dimension of the mapped feature vectors can be high or even infinite. By the representer theorem, the function that fits the data takes the form
(4) 
where are the coefficients to be learned, and the estimated value . Denoting the corresponding kernel matrices as and , where and , we have the matrix form
(5) 
where is an approximation to and the coefficient matrix has the th entry . We assume the observation is a noisy version of , where the noise consists of normally distributed random variables such that , where
measures the noise level. This leads to a natural choice of the sum of squared errors as the loss function.
A standard Tikhonov regulariser is often added to the regression model to reduce overfitting and penalise complex functions, which is defined in our case as
(6) 
where is the vectorisation operator for a matrix. We now arrive at the following optimisation problem to infer the function that approximates the observation matrix such that
(7) 
where the hyperparameter controls the penalisation of the complexity of the function to be learned.
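Under the standard kernel ridge regression derivation, and assuming non-singular kernel matrices, this problem admits a closed-form solution of the form vec(C) = (K_M ⊗ K_N + μI)⁻¹ vec(Y). The following sketch illustrates it on synthetic data; the names K_N, K_M for the node- and observation-side kernel matrices, the RBF kernels and the covariates are illustrative stand-ins.

```python
import numpy as np

def rbf(U, gamma=1.0):
    """RBF kernel matrix on 1-D covariates (an illustrative kernel choice)."""
    d = U[:, None] - U[None, :]
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(0)
N, M, mu = 6, 5, 0.1                          # nodes, observations, Tikhonov weight
K_N = rbf(rng.normal(size=N))                 # hypothetical node-side kernel matrix
K_M = rbf(np.arange(M, dtype=float))          # hypothetical observation-side kernel matrix
Y = rng.normal(size=(N, M))                   # observed data matrix

# Closed-form ridge solution, using column-major vectorisation so that
# vec(K_N C K_M) = (K_M ⊗ K_N) vec(C) for symmetric K_M.
G = np.kron(K_M, K_N)
c = np.linalg.solve(G + mu * np.eye(N * M), Y.flatten(order="F"))
C = c.reshape((N, M), order="F")
Z = K_N @ C @ K_M                             # fitted values approximating Y
```

As μ grows, the fitted Z shrinks towards a smoother function at the expense of data fit, which is the usual ridge trade-off.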
To better understand how this model expresses the two-side dependency, we show that the objective in Eq.(7) can be derived from a Bayesian viewpoint. In vector form, i.e. and , we assume that both the data likelihood and the prior follow a Gaussian distribution:
(8a)  
(8b) 
where is a zero-vector of length . Notice that and (and their Kronecker product) can be either singular or non-singular matrices, depending on the kernel choice. For the sake of simplicity, we use the pseudo-inverse notation throughout the paper. Now, the marginal likelihood of is
(9) 
from which we can see that the covariance structure of can be understood as a combination of and . Specifically, in the noise-free scenario where , the covariance matrix over the rows of is
(10) 
Similarly, the covariance matrix over the columns of is
(11) 
The proofs of Eq.(10) and Eq.(11) can be found in [58]. In Appendix A, we show that maximisation of the log-posterior of the coefficient vector leads to the objective in Eq.(7).
IV A Smoothness Measure for Dependent Graph Signals
From Eq.(9), a noise-free version of has a Kronecker product covariance structure . Recall that, for Gaussian graph signals, the covariance is often modelled as the pseudo-inverse of the graph Laplacian matrix (see Eq.(3) and [17, 33]), which is used for measuring signal smoothness. Inspired by this observation, we define a notion of smoothness for graph signals with two-side dependency using the Laplacian quadratic form as follows
(12) 
where can be interpreted as a Laplacian-like operator with a Kronecker product structure. To see this more clearly, first, let us define as an undirected weighted column graph that represents the structure among column vectors, and correspondingly as a row graph that represents the structure among row vectors. Second, let us connect the kernel matrices and to the Laplacian matrices and by recognising that the former can be defined as functions of the latter, as kernels on graphs [59], e.g.
Therefore, we have , and we further show in Appendix B that is a Laplacian-like operator on which the notion of frequencies of can be defined.
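For concreteness, two standard "kernels on graphs" constructions, both positive-definite functions of the Laplacian, can be sketched as follows; the example graph and the value of the hyperparameter β are arbitrary choices, not ones used in this paper.

```python
import numpy as np
from scipy.linalg import expm

def laplacian(W):
    """Graph Laplacian L = diag(W @ 1) - W."""
    return np.diag(W.sum(axis=1)) - W

# A small example graph: a triangle with one weak edge.
W = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 1.0],
              [0.2, 1.0, 0.0]])
L = laplacian(W)

beta = 0.5                                    # kernel hyperparameter (arbitrary here)
K_reg = np.linalg.inv(np.eye(3) + beta * L)   # regularised Laplacian kernel
K_diff = expm(-beta * L)                      # diffusion (heat) kernel
```

Both matrices are symmetric positive-definite, so either could serve as a valid kernel matrix built from a graph Laplacian.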
In practice, the observation-side dependency is often given or easy to obtain. For example, for graph signals with temporal Markovian dependency, is often modelled as a path graph representing that the observation at time only depends on the observation at time . By comparison, is the primary variable of interest that is typically estimated in the graph learning literature. Therefore, in this paper, we assume that can be encoded in the observation-side information via such that . We simply write from this section onwards and define a smoothness measure in which we replace the kernel matrix in Eq.(12) with the Laplacian matrix
(13) 
This effectively disentangles the relationship among nodes (i.e. the graph to be learned) from the observation-side dependency in the graph signals.
The smoothness term, in the vectorised form, can be viewed as a Laplacian regulariser which can be added to the problem of inferring a function that fits the graph signals in Eq.(7). Specifically, the graph signals in the smoothness term can be replaced with the estimates from the function such that
(14) 
where denotes a compact manifold (Footnote 2: We refer the interested reader to [60, 61] for the theory of manifold regularisation with the Laplace-Beltrami operator.). We will make use of the Laplacian regulariser in Eq.(14) to derive the proposed graph learning models in the following section.
V Kernel Graph Learning
V-A Learning Framework
We propose a joint learning framework for inferring the function that fits the graph signals as in Eq.(7), as well as the underlying graph that captures the relationship between the nodes as in Eq.(14). This relationship is disentangled from the observation-side dependency of the graph signals with the notion of smoothness introduced in Section IV. We name this framework Kernel Graph Learning (KGL), which aims at solving the following problem:
(15) 
where , and denotes the Frobenius norm. The first two terms correspond to the functional learning part, where the hyperparameter controls the complexity of the function for fitting . The last two terms and the constraints can be viewed as a graph learning model as in Eq.(1), with the fitted values of as input graph signals and the sum barrier as the graph regulariser. The hyperparameter controls the relative importance of fitting the function versus learning the graph, and controls the distribution of edge weights. The trace constraint acts as a normalisation term such that the sum of learned edge weights equals the number of nodes. The model is compatible with constraints that enforce other properties on the learned graph, e.g. the log-barrier introduced in Section III-A. This paper mainly adopts one choice of constraints in order to maintain focus on the general framework.
V-B Optimisation: Alternating Minimisation
We first recognise that Eq.(15) is a biconvex optimisation problem, i.e. convex w.r.t. while is fixed, and vice versa. This motivates an iterative block-coordinate descent algorithm that alternates between minimisation in and [62, 63]. In this section, we derive the update steps of and separately, present the main algorithm in Algorithm 1, and prove its convergence.
V-B1 Update of
The update of the coefficients can be regarded as solving a Laplacian-regularised kernel regression [61]. Given , the optimisation problem of Eq.(15) becomes
(16) 
and, after dropping constant terms,
(17) 
Denoting and , we obtain a dual form for such that
(18) 
where , and for simplicity. We prove in Appendix C that is positive semi-definite, and thus Eq.(18) is an unconstrained quadratic programme. The gradient of w.r.t. is
(19) 
Strictly speaking, the matrix
may contain zero eigenvalues, which makes it non-invertible. However, a majority of popular kernel functions for
and are positive-definite, e.g. the RBF kernel. Since the Kronecker product preserves positive definiteness, the whole matrix is then positive-definite and invertible. Setting and cancelling out , we have:
(20) 
Denote , where has a dimension of . We can therefore obtain a closed-form solution such that
(21) 
where the inverse of requires .
To further reduce the complexity, we make use of the Kronecker structure and matrix tricks. We first recognise as
where is the Kronecker sum. With the eigendecomposition , and , we have
(22) 
Here, is an diagonal matrix with entries being all the pairwise sums of eigenvalues in and . Inversion of that matrix is thus . We can now obtain cheap inversion with
(23) 
The operation simply rescales each term in the vector by the corresponding diagonal entry of . If we denote and as the column vectors containing the diagonal entries, this can be expressed as with
where denotes the Hadamard product and denotes entry-wise inversion. The remaining operations are now direct and give the closed-form solution as
Notice that this solution requires only matrix multiplications, as well as inversions and eigendecompositions of or matrices, giving an overall computational cost of .
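The Kronecker eigen-trick above can be illustrated on a generic Kronecker-sum system (K_M ⊕ K_N) vec(C) = vec(Y), which is equivalent to the Sylvester equation K_N C + C K_M = Y; the symmetric positive-definite matrices below are synthetic stand-ins rather than the exact KGL quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 4

def random_spd(n):
    """Random symmetric positive-definite matrix (a synthetic stand-in)."""
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

K_N, K_M = random_spd(N), random_spd(M)
Y = rng.normal(size=(N, M))

# Solve (K_M ⊕ K_N) vec(C) = vec(Y), i.e. K_N C + C K_M = Y, using only
# eigendecompositions of the small factors, never the NM x NM matrix.
lam_N, U_N = np.linalg.eigh(K_N)
lam_M, U_M = np.linalg.eigh(K_M)
S = lam_N[:, None] + lam_M[None, :]           # all pairwise eigenvalue sums
C = U_N @ ((U_N.T @ Y @ U_M) / S) @ U_M.T

# Cross-check against the explicit (and much larger) Kronecker-sum system.
A_big = np.kron(K_M, np.eye(N)) + np.kron(np.eye(M), K_N)
```

The elementwise division by S is the "inversion of the diagonal matrix of pairwise eigenvalue sums" described in the text, and is where the complexity saving comes from.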
When and are not invertible, we suggest using gradient descent to avoid inverting a large matrix of dimension . The update step using Eq.(19) is:
(25) 
where is the learning rate.
V-B2 Update of
Given , the optimisation problem of Eq.(15) becomes
(26) 
which is a constrained quadratic programme w.r.t. . By taking , the problem fits into the learning framework of Eq.(1). We use the package CVXPY [64] to solve this problem.
The overall KGL framework is presented in Algorithm 1. Convergence of each update step of and is guaranteed by solving the respective convex optimisation problems of Eq.(17) and Eq.(26). It should be noted that the step size in Eq.(25) needs to be set appropriately for the gradient descent to converge. We suggest a value of based on empirical results. We now prove the following lemma.
Lemma V.1.
Proof.
We follow the convergence results of the alternate convex search in [63] and those of the more general cyclic block-coordinate descent algorithm in [62]. By recognising that Eq.(17) and Eq.(26) are quadratic programmes (with Lemma C.1 in Appendix C), the problem of Eq.(15) is a biconvex problem with all terms differentiable and the objective function continuous and bounded from below. Theorem 4.5 in [63] states that the sequence generated by Algorithm 1 converges monotonically. Theorem 4.1 in [62] states that the sequences and generated by Algorithm 1 are defined and bounded. Furthermore, according to Theorem 5.1 in [62], every cluster point is a coordinate-wise minimum point of , hence the solution is a stationary point of Eq.(15).
∎
Our empirical results suggest that, after only 10 iterations or fewer, the sequence does not change by more than the tolerance level. The computational complexity of KGL in Algorithm 1 is dominated by the step of updating . It requires to compute the closed-form solution of , or to compute the gradient in Eq.(19) if the closed-form solution is not applicable when and are not invertible. Updating requires . Overall, for iterations that guarantee the convergence of Algorithm 1, it requires either or operations, depending on whether and are chosen to be invertible. We note that one could readily appeal to large-scale kernel approximation methods for further reduction of the computational and storage complexity of the KGL framework, and hence broaden its applicability to larger datasets. There are two main approaches to large-scale kernel approximation, and both can be applied to KGL. The first focuses on kernel matrix approximations using methods such as Nyström sampling [65], while the second deals with approximating the kernel function itself, using methods such as random Fourier features [66]. Nonetheless, this paper focuses on the modelling perspective, and we leave the algorithmic improvement as a future direction.
V-C Special Cases of Kernel Graph Learning
Independent observations
It is often assumed that graph signals are i.i.d., hence there exists no dependency along the observation side. This is equivalent to setting in our framework. We refer to this special case of KGL as Node-side Kernel Graph Learning (KGLN):
(27) 
No node-side information
It may also be the case that no node-side information is available for the problem at hand. In this case, we can simply set in KGL, leading to Observation-side Kernel Graph Learning (KGLO):
(28) 
In the cases of one-side KGL, the optimisation again follows the alternating minimisation, i.e. to solve KGLN (or KGLO), one can simply set or in Algorithm 1. It should be noted, however, that the update step of requires less computational cost when either or . Indeed, the objective function of KGLN can be decomposed into a sum over the functions such that
(29) 
where and . Consequently, the update step can be parallelised.
V-D Learning with Missing Observations
By modifying the least-squares loss in KGL, we propose an extension to jointly learn the underlying graph and function from graph-structured data with missing values. We encode the positions of missing values with a mask matrix such that if is missing, and otherwise. Now, we only need to minimise the least-squares loss over the observed in the functional learning part, which leads to the formulation:
(30) 
This formulation also applies to one-side kernel graph learning, i.e. KGLN or KGLO, with or .
The optimisation problem in Eq.(30) is a biconvex problem and alternating minimisation can still be applied. The update step of remains the same as in Eq.(26), but the gradient in the update step of (Step 4. in Algorithm 1) becomes
(31) 
where . The detailed derivation of the gradient is provided in Appendix D. We further assume is invertible, which is a mild assumption, as there are many choices of kernel functions for and that are invertible. Also noting , the gradient becomes
(32) 
One can either derive a closed-form solution or use gradient descent based on Eq.(32).
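The masked gradient has the same structure as the unmasked one, with the residual zeroed at the missing entries. A simplified sketch follows, keeping only the masked data-fit term (the Tikhonov and smoothness terms of the full objective are omitted) and assuming RBF kernels on hypothetical covariates:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M_obs, step = 5, 4, 1e-3

def rbf(U, gamma=1.0):
    d = U[:, None] - U[None, :]
    return np.exp(-gamma * d ** 2)

K_N = rbf(rng.normal(size=N))                        # hypothetical node-side kernel
K_M = rbf(np.arange(M_obs, dtype=float))             # hypothetical observation-side kernel
Y = rng.normal(size=(N, M_obs))
Mask = (rng.random((N, M_obs)) > 0.3).astype(float)  # 1 = observed, 0 = missing

def masked_loss(C):
    return np.sum((Mask * (Y - K_N @ C @ K_M)) ** 2)

def grad(C):
    # Gradient of the masked data-fit term only; the mask zeroes the
    # residual at missing entries before back-multiplying the kernels.
    R = Mask * (Y - K_N @ C @ K_M)
    return -2.0 * K_N @ R @ K_M

C = np.zeros((N, M_obs))
for _ in range(200):
    C -= step * grad(C)                              # plain gradient descent
```

The fitted values at masked positions then serve as imputations of the missing observations, which is the byproduct mentioned in the contributions.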
VI Synthetic Experiments
VI-A General Settings
Ground-truth Graphs
Random graphs of nodes are drawn from the Erdős-Rényi (ER), Barabási-Albert (BA) and stochastic block model (SBM) families as ground-truth, denoted as , and , respectively. The parameters of each network model are chosen to yield an edge density of 0.3. The edge weights are randomly drawn from a uniform distribution . The weighted adjacency matrix is set as for symmetry and normalised such that the sum of edge weights is equal to for ease of comparison. The graph Laplacian is calculated from .
Ground-truth Data
We generate mildly noisy data from , where is drawn from according to Eq.(8b). Every entry of the noise matrix is an i.i.d. sample from . To test the proposed model against different levels of noise, we vary the value of in Section VI-B. For all other synthetic experiments, we add mild-level noise with . We choose , as it is a popular method to generate smooth signals in related work [16, 44]. We consider both dependence and independence along the observation side:

Independent Data: ;

Dependent Data: is obtained from an RBF kernel evaluated on synthetic observation-side information , which can be interpreted as the timestamps of a discrete-time Markov chain. The bandwidth parameter is chosen according to the median heuristic
[67].
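The median heuristic sets the RBF bandwidth to the median of the pairwise distances between the covariates; a minimal sketch (the integer timestamps are illustrative):

```python
import numpy as np

def rbf_median_heuristic(t):
    """RBF kernel on 1-D covariates with the bandwidth set to the
    median of the non-zero pairwise distances (median heuristic)."""
    d = np.abs(t[:, None] - t[None, :])
    sigma = np.median(d[d > 0])
    return np.exp(-d ** 2 / (2 * sigma ** 2))

t = np.arange(10, dtype=float)   # e.g. timestamps of a discrete-time Markov chain
K_M = rbf_median_heuristic(t)
```

This data-driven bandwidth avoids hand-tuning and adapts the kernel's decay to the spacing of the observation-side covariates.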
Model Candidates
To have a fair comparison, the models are divided into two groups. The first group contains the baseline models that cannot deal with observation-side dependence:

GL2step (Eq.(16) in [17]): a two-step GSP graph learning framework with an identity mapping as the denoising function. From a modelling perspective, Eq.(13) in [20] proposed a similar model with more constraints on edge weights and a different optimisation algorithm. We treat them as the same kind of technique.

KGLN (proposed model in Eq. (27)): in KGL.
The second group is examined with observation-side dependent data:

KGL (proposed model in Eq. (15)): the main learning framework.

KGLO (proposed model in Eq.(28)): to have a fair comparison with KGL-agnostic, we also assume the graph is agnostic to the model (i.e. as model input).
For each method, we determine the hyperparameters via a grid search, and report the highest performance achieved by the best set of hyperparameters.
Evaluation Metrics
Average precision score (APS) and normalised sum of squared errors () are used to evaluate the graph estimates, and out-of-sample mean squared error () is used to evaluate the estimated entries of the graph-structured data matrix that were not observed (or missing). The APS is defined in a binary classification scenario for graph structure recovery, which automatically varies the threshold of weights above which edges are declared as learned edges. An APS of 1 indicates that the algorithm precisely detects the ground-truth edges and non-edges. The is defined over the learned adjacency matrix and the ground-truth adjacency matrix as
The out-of-sample MSE of the data matrix is defined with a mask matrix $M$ (the same as in Eq. (30)), where $M_{ij} = 1$ indicates that the $(i,j)$-th entry is missing:
$$\mathrm{MSE}_{\text{test}} = \frac{\sum_{i,j} M_{ij} \, (X_{ij} - \hat{X}_{ij})^2}{\sum_{i,j} M_{ij}},$$
where $\hat{X}$ is the prediction obtained from the model estimates. Similarly, we are interested in the training MSE, computed analogously over the observed entries (with $M_{ij}$ replaced by $1 - M_{ij}$), for analysing overfitting.
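Under the assumption that APS is the standard average precision over edge/non-edge classification and that the normalised SSE is computed between Frobenius-normalised adjacency matrices (our reading of the definitions above), the three metrics can be sketched as:

```python
import numpy as np


def aps(W_true, W_hat):
    """Average precision for edge detection: ground-truth edges are the
    positive class, learned edge weights are the scores (upper triangle
    only, since the graphs are undirected)."""
    iu = np.triu_indices_from(W_true, k=1)
    labels = (W_true[iu] > 0).astype(float)
    order = np.argsort(-W_hat[iu])          # rank edges by learned weight
    labels = labels[order]
    precision = np.cumsum(labels) / (np.arange(labels.size) + 1)
    return float(np.sum(precision * labels) / labels.sum())


def norm_sse(W_true, W_hat):
    """SSE between the adjacency matrices after normalising each by its
    Frobenius norm, so the metric is insensitive to global scale."""
    a = W_hat / np.linalg.norm(W_hat)
    b = W_true / np.linalg.norm(W_true)
    return float(np.sum((a - b) ** 2))


def masked_mse(X_true, X_hat, M):
    """Out-of-sample MSE over the entries flagged as missing (M == 1);
    passing 1 - M instead gives the training MSE."""
    M = np.asarray(M, dtype=bool)
    return float(np.mean((X_true[M] - X_hat[M]) ** 2))
```

A perfect graph estimate yields an APS of 1 and an SSE of 0, matching the interpretation of the metrics given above.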
VI-B Learning a Graph from Noisy Data
In order to evaluate the performance of the proposed model in learning a graph from noisy data, we add noise to the ground-truth data such that $X_{\text{noisy}} = X + E$, where every entry of the noise matrix $E$ is an i.i.d. sample from a zero-mean Gaussian distribution. We vary the noise level (the standard deviation) from 0 to 2, against which we plot the evaluation metrics in Figure 2. Under the same settings of noise level, random graph and model candidate, we repeat the experiment 10 times and report the mean (the solid curves) as well as the 5th and 95th percentiles (the error bars) of the evaluation metrics.

From Figure 2, Figure 8 and Figure 9 (the latter two in Appendix E-A), the proposed models outperform the baseline models in terms of all evaluation metrics. Specifically, when the data are independent, the performance of KGL-N drops slowly as the noise level increases, while that of GL and GL-2step drops quickly once the noise level goes above 0.5. It is worth mentioning that the curves of the two baselines almost overlap in terms of APS. This indicates that the identity mapping used as the denoising function in GL-2step does not help much in recovering the graph structure, although it yields a slightly smaller SSE than GL when the noise level is greater than 0.75.
For the dependent data, the proposed model KGL achieves a high performance when the noise level is less than 0.75. Even without the node-side information as model input (note that the ground-truth data are generated in a way consistent with [44], which proposes KGL-Agnostic), the proposed method KGL-O can still learn a meaningful graph, with slightly worse performance than KGL. By contrast, KGL-Agnostic cannot recover the ground-truth graph to a satisfying level even with little noise, as its smoothness term does not capture the dependence structure on the observation side.
VI-C Learning a Graph from Missing Data
To examine the performance of learning a graph from incomplete data with the approach described in Section V-D, we generate a mask matrix $M$ indicating the missing entries: each entry $M_{ij}$ equals 1 with probability $r$ (the missing rate), indicating that the corresponding data entry is missing, and 0 otherwise. The preprocessed data matrix, with 0 replacing the missing entries, is a natural choice in practice for the model candidates that cannot directly deal with missing entries, as the mean value of the entries in the data matrix is 0 by design. We use it as the input to the baseline models GL and GL-2step. For KGL-Agnostic in Eq. (33), we also add a mask matrix to the least-squares loss term for a fair comparison.

We vary $r$ from 0 to 0.9, against which we plot the evaluation metrics, APS and SSE, in Figure 3. The plots for the remaining metrics are in Appendix E-B. Similar to the noisy scenario, we repeat the experiments 10 times under the same settings of missing rate, random graph and model candidate, and report the mean (the solid curves) as well as the 5th and 95th percentiles (the error bars) of the evaluation metrics.
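A minimal sketch of this masking scheme, assuming independent Bernoulli sampling of missing entries with rate `missing_rate` and zero-filling for the baselines that cannot handle missing entries directly:

```python
import numpy as np

rng = np.random.default_rng(0)


def make_mask(shape, missing_rate):
    """Bernoulli mask: an entry of 1 marks the corresponding data entry
    as missing (each entry is missing independently with the given rate)."""
    return (rng.random(shape) < missing_rate).astype(int)


def zero_fill(X, M):
    """Replace the missing entries by 0 (a natural choice here, since the
    data are zero-mean by design); this is the input for the baselines
    that cannot deal with missing entries directly."""
    return np.where(M == 1, 0.0, X)
```

Models that handle missingness natively instead carry `M` into the least-squares loss term, as described above.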
The proposed methods KGL and KGL-N can recover the ground-truth graphs reasonably well even when 80% of the entries are missing. In the independent data scenario, the performance of the baseline models with the preprocessed data drops steeply as the missing rate increases. Although they can still recover the ground-truth graphs with a high APS and a low SSE when the missing rate is less than 20%, the performance is not as good as that of KGL-N. By contrast, for KGL-N, the APS only drops by 0.1 from no missing entries to around 90% missing entries, while the SSE only increases by around 0.05.
For the dependent data scenario, all three model candidates can deal with missing data directly by adding a mask matrix to the least-squares loss term. Consequently, their curves are relatively stable as the missing rate increases from 0 to 80%. However, the accuracy levels at which the models stabilise are different. The proposed method KGL, aware of both node-side and observation-side information, achieves the highest APS and the lowest SSE. KGL-O, without access to the node-side information, can still recover a meaningful graph, but with fewer correct edges and less accurate edge weights when the missing rate is less than 0.6. The performance of KGL-Agnostic is not as good as that of KGL-O, because the smoothness term in Eq. (33) is not able to disentangle the influence of the observation-side dependency from the graph structure.
VI-D Graph-Structured Matrix Completion
In the experiment of learning a graph from an incomplete data matrix in Section VI-C, we are also interested in the performance of matrix completion. In Figure 4, we plot the MSE of the missing entries (i.e. the out-of-sample MSE), averaged over all types of graphs, against the varying missing rate. The proposed methods lead to much smaller errors than the baseline models, for both independent and dependent data. It should be noted that GL does not offer a mechanism for inferring missing data and is hence not included in this experiment.
VI-E Learning a Graph of Different Sizes
The graph learning performance of the proposed method varies with the size of the graph and the number of observations. As shown in Figure 5, a hundred observations are sufficient to recover a small graph with high accuracy. As the graph size increases, the number of observations required for KGL to achieve a high APS increases roughly exponentially. On the other hand, as the number of observations increases, the variance of the APS of the learned graphs decreases.
VI-F Impact of Regularisation Hyperparameters
Three hyperparameters are involved in KGL and its variants. As introduced in Section V, the first controls the complexity of the functional learning and prevents overfitting; the second controls the relative importance of graph learning compared to functional learning, and at the same time determines the smoothness of the predicted data over the graph; finally, the third controls the Frobenius norm of the graph Laplacian which, together with the trace constraint, bears similarity to an elastic-net regularisation [68]. The larger this last hyperparameter, the less sparse the graph, with more uniform edge weights. The accuracy in learning the graph Laplacian and inferring the data matrix is determined by the combination of these three hyperparameters, which is not straightforward to visualise and analyse simultaneously. Fortunately, we may still gain some insights by examining their distinct effects separately.
Firstly, the Frobenius-norm hyperparameter should be chosen according to the prior belief about the graph sparsity, defined as the number of edges with nonzero weights. As shown in Figure 12 (in Appendix E-C), when this hyperparameter is small, the learned graph contains only a few of the most significant edges: due to the constraint on the sum of the edge weights, the total weight is allocated to a few significant edges. When it is large, the learned graph becomes fully connected with equal edge weights. In the synthetic experiment, where we have knowledge of the ground-truth graph, the best accuracy is obtained when the learned sparsity coincides with that of the ground-truth graph.
The value of this hyperparameter, on the other hand, has little effect on the accuracy of predicting the missing entries, as the corresponding update step does not involve the term it weights. As shown in Figure 13(a)-(d), the out-of-sample MSE is determined by the combination of the other two hyperparameters. We first notice that the error plateaus when the functional-complexity penalty is too large, showing that we overly penalise the function complexity in this case: the function becomes overly smooth, such that the entries of the coefficient matrix are all zero, leading to the entries of the prediction being all zero as well. A similar degeneracy occurs when the smoothness weight is too large, where the prediction is forced to be overly smooth: every row vector of the prediction (i.e. each predicted graph signal) has constant entries, as a result of minimising the smoothness term to zero, and in the extreme all the entries of the prediction take constant values such that this term is minimised to zero.
The ranges of these two hyperparameters for which the out-of-sample MSE is the smallest generally coincide with those for which the APS is the largest in Figure 13 (in Appendix E-C). However, the out-of-sample MSE is also generally small when both penalties are small. This is understandable, as the function can be very complex with little penalisation, but this does not guarantee good performance in recovering the ground-truth graph.
VII Real-World Experiments
VII-A Swiss Temperature Data
In this experiment, we test the proposed models in learning a meteorological graph of 89 weather stations in Switzerland from incomplete temperature data (obtained from https://www.meteoswiss.admin.ch/home/climate/swissclimateindetail/climatenormals/normalvaluespermeasuredparameter.html). The raw data matrix contains 12 rows, representing the temperatures of the 12 months averaged over the 30 years from 1981 to 2010, and 89 columns, representing the 89 measuring stations. The raw data are preprocessed such that each row has a zero mean.
To have a fair comparison, we deliberately omit a portion of the data as input for the model candidates and treat as ground truth the graph learned by GL from the complete 12-month data. Specifically, we only use the first three months' temperature records (i.e. the first three rows) to learn a graph. To test the performance in the missing-data scenario, we further generate mask matrices with different rates of missing values, as described in Section VI-C, and apply them to the three-month data. Similar to the synthetic experiments, we choose the hyperparameters that yield a graph density of 30% (i.e. keeping the 30% most significant edges) to make the learned graphs comparable.
The altitude of a weather station is useful node-side information for predicting temperature and learning a meteorological graph. Therefore, the corresponding kernel matrix is obtained from an RBF kernel evaluated at the altitudes of each pair of weather stations, for use in KGL and KGL-N. Monthly timestamps constitute the known observation-side information; the bandwidth parameter is chosen according to the median heuristic [67]. As described in Section VI-A, the observation-side kernel matrix is thus obtained from an RBF kernel evaluated at the three timestamps of the three-month graph signals, and is used as input in KGL, KGL-Agnostic and KGL-O. For GL and GL-2step, no side information is used. The hyperparameters are tuned via a grid search, and the highest performance achieved by the best set of hyperparameters is reported. For real-world scenarios where a ground-truth graph is not easy to obtain, the hyperparameters can be chosen according to the results in Section VI-F.
We present the results in Figure 6. In terms of both APS and SSE, KGL and KGL-N outperform the other candidates. This indicates that altitude is a reliable node-side covariate with which we can learn a meaningful meteorological graph despite a small number of signals. When there are more missing values, KGL, with the known temporal information, slightly outperforms KGL-N; however, since we only have three months of signals, the temporal information is less predictive. Since the ground-truth graph is learned by GL from the 12-month signals, the recovery ability of GL and GL-2step is not far behind when there are no missing values in the three-month data, but it drops sharply with an increasing missing rate. Compared to KGL-O, the poor performance of KGL-Agnostic indicates that the imprecise smoothness term in Eq. (33) is the main reason why we cannot recover an annual meteorological graph from only three months of temperature records, as both models are agnostic to the node-side information.
VII-B Sushi Review Data
In this experiment, we evaluate the performance of the proposed methods by comparing the recovered graphs with the ground truth, using the Sushi review data collected in [69]. The authors asked 5000 reviewers to rate a random selection of 10 out of 100 sushi types with a score from 1 (least preferred) to 5 (most preferred); the reviews for each sushi are treated as one graph signal in this experiment. For each reviewer, we have 10 descriptive features covering demographic information, such as age, gender and the region in which the reviewer currently lives. We also have 7 attributes describing each sushi, including its oiliness in taste, its normalised price and its grouping (for example, red-meat fish, white-meat fish or shrimp). We treat the grouping attribute as the underlying ground-truth label for each sushi and do not use it in the KGL algorithm.
We consider 32 sushis from 5 sushi groups, namely red meat (7 sushis), clam (6 sushis), blue-skinned fish (8 sushis), vegetable (6 sushis) and roe (5 sushis). These groups are treated as the ground-truth labels. Our goal is to recover a graph of sushis that contains clusters corresponding to these group labels (while omitting the group attribute from the node-side information, i.e. we retain only the 6 remaining attributes).
We pick an increasing number of reviewers at random for our experiment, in order to demonstrate how the algorithm performs under different numbers of signals. After preprocessing, we arrive at a data matrix with 32 columns, each representing a type of sushi, and one row per reviewer containing that reviewer's ratings of the sushis. This is a highly sparse matrix. We run KGL, KGL-N, KGL-O, GL, GL-2step and KGL-Agnostic to obtain a graph of sushis. To evaluate the results quantitatively, we compute the normalised mutual information (NMI) between the underlying sushi grouping and the cluster assignments obtained by applying spectral clustering [70] to the recovered graphs. NMI measures the agreement between two grouping assignments, with 0 indicating no mutual information and 1 indicating perfect correlation. To emphasise that node-side attributes alone are not sufficient to recover the ground-truth labels, we also apply clustering to the RBF graph obtained from the remaining six sushi attributes, which results in an NMI score of 0.34.

Figure 7 illustrates our results. KGL is the best performer, followed by KGL-N, and both significantly outperform KGL-Agnostic, KGL-O, GL-2step and GL. Moreover, both KGL and KGL-N outperform the case where we solely use the RBF graph. The results demonstrate the merit of the proposed methods in incorporating side information for graph recovery, in particular the observation-side information (i.e. the reviewers' information) that captures the dependency between the observed signals.
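The clustering-based evaluation can be sketched as follows, assuming scikit-learn's spectral clustering with a precomputed affinity (the learned adjacency matrix) and its NMI implementation:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score


def cluster_nmi(W, labels_true, n_clusters):
    """Cluster the nodes of a learned graph and score against known groups.

    W is the (symmetric, non-negative) learned adjacency matrix, used
    directly as a precomputed affinity for spectral clustering.
    """
    sc = SpectralClustering(n_clusters=n_clusters,
                            affinity="precomputed", random_state=0)
    labels_pred = sc.fit_predict(W)
    return normalized_mutual_info_score(labels_true, labels_pred)
```

For the sushi experiment, this would be called with the recovered 32-node adjacency matrix, the 5 group labels and `n_clusters=5`.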
VIII Conclusion
In this paper, we have revisited smooth graph signals from a functional viewpoint and proposed a kernel-based graph learning framework that can integrate node-side and observation-side covariates. Specifically, we have designed a novel notion of smoothness of graph signals over the Kronecker product of two graph Laplacian matrices and combined it with a Kronecker product kernel regression of graph signals in order to capture the two-sided dependency. We have shown the effectiveness and efficiency of the proposed method via extensive synthetic and real-world experiments, demonstrating its usefulness in learning a meaningful topology from noisy, incomplete and dependent graph signals. Although we have proposed a fast implementation exploiting the Kronecker structure of the kernel matrices, the computational complexity remains cubic in the maximum of the number of nodes and the number of signals. Hence, a natural future direction is to further reduce the computational complexity with state-of-the-art methods for large-scale kernel-based learning, such as random Fourier features. Another interesting direction is to develop a generative graph learning model based upon the framework presented here, using connections between Gaussian processes and kernel methods.
References

[1] J. Berant, A. Chou, R. Frostig, and P. Liang, "Semantic parsing on Freebase from question-answer pairs," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1533–1544.
[2] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Proc. Annu. IEEE Symp. Found. Comput. Sci. (FOCS), vol. 51, no. 1, pp. 459–468, 2006.
[3] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. 4th Int. Conf. Comput. Vis. Theory Appl. (VISAPP), vol. 1, pp. 331–340, 2009.
[4] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[5] W. Dong, M. Charikar, and K. Li, "Efficient K-nearest neighbor graph construction for generic similarity measures," in Proc. 20th Int. Conf. World Wide Web (WWW), 2011, pp. 577–586.
[6] S. L. Lauritzen, Graphical Models. Clarendon Press, 1996, vol. 17.
[7] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
[8] A. Ortega, P. Frossard, J. Kovacevic, J. M. Moura, and P. Vandergheynst, "Graph signal processing: Overview, challenges, and applications," Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
[9] A. Sandryhaila and J. M. Moura, "Discrete signal processing on graphs," IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
[10] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, "Connecting the dots: Identifying network structure via graph signal processing," IEEE Signal Processing Magazine, vol. 36, no. 3, pp. 16–43, 2019.
[11] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, "Learning graphs from data: A signal representation perspective," IEEE Signal Processing Magazine, vol. 36, no. 3, pp. 44–63, 2019.
[12] D. Thanou, X. Dong, D. Kressner, and P. Frossard, "Learning heat diffusion graphs," IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 484–499, 2017.
[13] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, "Characterization and inference of graph diffusion processes from observations of stationary signals," IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 3, pp. 481–496, 2018.
[14] R. Shafipour, S. Segarra, A. G. Marques, and G. Mateos, "Identifying the topology of undirected networks from diffused non-stationary graph signals," 2018. [Online]. Available: http://arxiv.org/abs/1801.03862
[15] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, "Network topology inference from spectral templates," IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, 2017.
[16] V. Kalofolias, "How to learn a graph from smooth signals," in Artificial Intelligence and Statistics, 2016, pp. 920–929.
[17] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160–6173, 2016.
[18] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning from data under Laplacian and structural constraints," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
[19] S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero, "Learning sparse graphs under smoothness prior," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 6508–6512.
[20] P. Berger, G. Hannak, and G. Matz, "Efficient graph learning from noisy and incomplete data," IEEE Transactions on Signal and Information Processing over Networks, vol. 6, pp. 105–119, 2020.
[21] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, "Neural relational inference for interacting systems," in International Conference on Machine Learning, 2018.