A recent conversation with a colleague reminded me of earlier work exploring the potential of recommender systems in the financial services sector. As quite some time had passed since then, I took the opportunity to revisit the topic and write down some key insights for future reference. Although deploying such systems for fully automated investment advice involves significant regulatory challenges—such as suitability requirements and GDPR compliance—the underlying concepts remain highly relevant. When applied thoughtfully, these techniques can yield deeper client insights, improve engagement, and help surface relevant products or content in a compliant manner, even without relying on direct automation.


Understanding the Data Landscape


Exploring recommender concepts in finance involves grappling with data characteristics common to many domains, but with specific nuances in this industry. As on many online platforms, client interaction histories (trades, views, inquiries) vary dramatically in length, frequency, and type: a new client might have sparse data, while a long-term active trader has a rich history. While this variability isn’t unique to finance, the nature of these interactions – involving significant monetary values, diverse asset classes (equities, bonds, derivatives, funds), and varying client objectives – necessitates careful feature engineering. Distilling these variable-length sequences into meaningful, fixed-dimension feature vectors (e.g., aggregate metrics like trading frequency by asset class, average holding periods, or engagement with specific themes like ESG) becomes crucial for building informative models.

Central to many recommender concepts is the user-item interaction matrix – clients mapped against financial assets, products, or even content. This matrix is typically very sparse. Defining what constitutes an “interaction” is key: is it a trade, a view, adding an item to a watchlist? Should value or duration be considered? These choices shape the signals derived from the data.
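To make this concrete, here is a minimal sketch in Python of condensing a variable-length trade log into fixed-dimension client features and a sparse interaction matrix. The column names and toy values are purely illustrative:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical raw interaction log; column names and values are illustrative only.
trades = pd.DataFrame({
    "client_id":   ["c1", "c1", "c2", "c3", "c3", "c3"],
    "asset_id":    ["AAPL", "BND_X", "AAPL", "ESG_FUND", "AAPL", "ESG_FUND"],
    "asset_class": ["equity", "bond", "equity", "fund", "equity", "fund"],
    "notional":    [5000, 20000, 1500, 8000, 3000, 4000],
})

# Fixed-dimension client features: condense the variable-length history into
# aggregates such as trade counts and average notional per asset class.
client_features = (
    trades.pivot_table(index="client_id", columns="asset_class",
                       values="notional", aggfunc=["count", "mean"])
          .fillna(0.0)
)

# Sparse user-item interaction matrix; here an "interaction" is simply the number
# of trades per asset (views or watchlist adds could be counted as well).
clients = trades["client_id"].astype("category")
assets = trades["asset_id"].astype("category")
interactions = csr_matrix(
    (np.ones(len(trades)), (clients.cat.codes.to_numpy(), assets.cat.codes.to_numpy())),
    shape=(len(clients.cat.categories), len(assets.cat.categories)),
)
```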


The Core Recommendation Engine Concept: Matrix Factorization & SVD


Matrix factorization techniques decompose the large, sparse interaction matrix ($A$) into lower-dimensional matrices representing latent features. Textbook explanations often use Singular Value Decomposition (SVD), but it’s worth noting that pure SVD, as formally defined below, requires a complete matrix and doesn’t directly handle the missing values in interaction data. In practice, recommender systems typically employ related algorithms optimized for sparse matrices, such as Alternating Least Squares (ALS), which approximate the factorization without computing the full decomposition. The underlying principle of finding latent features, however, remains the same. Formally, the SVD expresses the matrix $A$ as:

\[A = U \Sigma V^T\]

Where:

  • $U$: User-feature matrix (orthogonal). Rows represent users in a latent feature space.
  • $\Sigma$: A diagonal matrix containing singular values ($\sigma_1, \sigma_2, …, \sigma_r$), ordered from largest to smallest. These values represent the “importance” or magnitude of each latent dimension.
  • $V$: Item-feature matrix (orthogonal). Rows of $V$ (i.e., the columns of $V^T$) represent items in the same latent feature space.

By truncating these matrices to keep only the top k latent dimensions (the ones corresponding to the largest k singular values in $\Sigma$), we get an approximation: $A_k = U_k \Sigma_k V_k^T$. This $A_k$ provides predicted interaction scores, filling in the gaps of the original sparse matrix $A$.
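As a sketch of what this truncation looks like in code, the following uses SciPy’s sparse SVD on a synthetic interaction matrix (stand-in data; a production system would more likely use an implicit-feedback method such as ALS):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Synthetic sparse client-item matrix standing in for real interaction data.
A = sparse_random(500, 200, density=0.02, format="csr", random_state=0)

k = 20  # number of latent dimensions to keep (must be < min(A.shape))
U_k, sigma_k, Vt_k = svds(A, k=k)

# svds returns singular values in ascending order; reorder to largest-first.
order = np.argsort(sigma_k)[::-1]
U_k, sigma_k, Vt_k = U_k[:, order], sigma_k[order], Vt_k[order, :]

# Rank-k approximation: predicted scores for every client-item pair, filling the gaps.
A_k = U_k @ np.diag(sigma_k) @ Vt_k
```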


The magnitude of the singular values in $\Sigma$ directly indicates the importance of each corresponding latent dimension, and the squared singular values relate to the variance captured by each dimension. Plotting the cumulative sum of squared singular values (normalized) against the number of dimensions (k) shows the percentage of the original matrix’s variance explained by the top k factors.

Selecting the optimal number of latent factors, k (the rank of the approximation), is crucial. Too few factors might underfit, missing important patterns, while too many might overfit, capturing noise. One approach to choosing k is explained variance: a common heuristic is to select k such that a desired percentage (e.g., 80-90%) of the total variance (the sum of squared singular values) is captured. Another is to treat k as a hyperparameter: train models with different k values on a training set, evaluate their prediction performance (e.g., using metrics like RMSE or Precision@K) on a separate validation set, and choose the k that yields the best validation performance.
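The explained-variance heuristic is straightforward to compute from the singular values. A small sketch, reusing the synthetic matrix `A` from above:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Compute more factors than we expect to need, then inspect explained variance.
max_k = 100                       # must be < min(A.shape)
_, sigma, _ = svds(A, k=max_k)
sigma = np.sort(sigma)[::-1]

# ||A||_F^2 equals the sum of ALL squared singular values, i.e. the total variance.
total_variance = A.power(2).sum()
explained = np.cumsum(sigma**2) / total_variance

# Smallest k that captures roughly 90% of the total variance (the heuristic above);
# if the threshold is not reached within max_k, compute more factors.
k_opt = int(np.searchsorted(explained, 0.90)) + 1
```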


Leveraging Client History Snapshots Over Time


Financial data is inherently time-ordered. Client preferences aren’t static; they evolve, often influenced by the prevailing market environment. For instance, appetite for riskier assets like stocks might increase during bull markets, while preference shifts towards safer assets during downturns. Capturing these dynamics is important. A key technique involves treating different points in a client’s history as distinct observations. Instead of relying on one static profile per client, we can create time-stamped snapshots, such as (timestamp_T, client_id, client_features_at_T, item_interactions_up_to_T).
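A minimal sketch of building such snapshots from a toy, time-stamped trade log (column names, dates, and the aggregated features are purely illustrative):

```python
import pandas as pd

# Hypothetical interaction log with timestamps; everything here is illustrative.
trades = pd.DataFrame({
    "client_id":  ["c1", "c1", "c2", "c2", "c2"],
    "asset_id":   ["AAPL", "BND_X", "AAPL", "ESG_FUND", "ESG_FUND"],
    "notional":   [5000, 20000, 1500, 8000, 4000],
    "trade_date": pd.to_datetime(["2023-01-15", "2023-05-02",
                                  "2023-02-10", "2023-04-05", "2023-08-20"]),
})

# Snapshot dates at which we "freeze" each client's history.
snapshot_dates = pd.to_datetime(["2023-03-31", "2023-06-30", "2023-09-30"])

snapshots = []
for ts in snapshot_dates:
    history = trades[trades["trade_date"] <= ts]       # only data known at time ts
    feats = history.groupby("client_id").agg(
        n_trades=("asset_id", "count"),
        avg_notional=("notional", "mean"),
    ).reset_index()
    feats["snapshot_date"] = ts
    snapshots.append(feats)

# One row per (snapshot_date, client): multiple time-stamped observations per client.
snapshot_table = pd.concat(snapshots, ignore_index=True)
```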


There are two main advantages to this approach. First, it allows the model to potentially learn how preferences change in response to market conditions or over the client’s lifecycle. Second, using multiple observations per client significantly increases the size of the training dataset, which can be beneficial for training more complex models and potentially learning more nuanced patterns.


However, this approach introduces a critical challenge: the increased risk of data snooping or temporal leakage if not handled carefully during model training and evaluation. Standard random train/test splits are invalid because they can inadvertently expose future information (about the same client at a later time) to the model during training, leading to overly optimistic performance estimates. Therefore, data must be split chronologically. A typical setup involves training on all data up to a certain time $T_{train}$, validating (for tuning hyperparameters) on data between $T_{train}$ and $T_{val}$, and finally testing the model’s performance on data entirely after $T_{val}$. It’s crucial that for any given client, their validation and test data points always occur chronologically after all of their training data points.
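A sketch of such a chronological split, applied to the `snapshot_table` from the previous example (the cut-off dates are illustrative):

```python
import pandas as pd

# Chronological cut-offs (dates are illustrative).
T_train = pd.Timestamp("2023-03-31")
T_val = pd.Timestamp("2023-06-30")

train = snapshot_table[snapshot_table["snapshot_date"] <= T_train]
val = snapshot_table[(snapshot_table["snapshot_date"] > T_train)
                     & (snapshot_table["snapshot_date"] <= T_val)]
test = snapshot_table[snapshot_table["snapshot_date"] > T_val]
# For each client, every validation/test snapshot now lies strictly after all of
# their training snapshots, avoiding temporal leakage.
```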


Evaluating SVD-Based Rankings


If we use matrix factorization methods (like ALS or SVD approximations) trained on historical interaction data up to a certain point in time (T), the resulting model ($A_k \approx U_k \Sigma_k V_k^T$) provides a predicted affinity score between each client and potentially all items. These scores allow us to generate a ranked list of items for each client, suggesting those they might have the highest preference for based on latent patterns.


However, a ranked list by itself doesn’t tell us if the recommendations are meaningful or actionable. The crucial next step is evaluation. How do we determine if these rankings derived from past behavior actually predict future behavior? This requires defining a clear prediction task tied to a specific timeframe. For example, we could train the model on data up to time T. Then, we look at the actual client behavior in a subsequent period, say from T+1 to T+delta. We need to decide what constitutes a “successful” recommendation within that period. Does a high rank for item X mean the client is likely to buy item X tomorrow? Or within the next week? Or the next month? Evaluating the model thus involves checking if the items ranked highly for a client at time T are indeed interacted with (e.g., bought, viewed) by that client within the defined future evaluation window (delta), using appropriate metrics like Precision@K or Recall@K while respecting the temporal data split.
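As an illustration, a simple Precision@K check against a hypothetical set of held-out future interactions, reusing the rank-k scores `A_k` from the earlier SVD sketch:

```python
import numpy as np

def precision_at_k(ranked_items, future_interactions, k=10):
    """Fraction of the top-k ranked items the client actually interacted with
    in the evaluation window (T+1 .. T+delta)."""
    top_k = list(ranked_items)[:k]
    hits = sum(1 for item in top_k if item in future_interactions)
    return hits / k

# Illustrative usage for a single client.
client_scores = A_k[0]                      # predicted affinities for one client
ranking = np.argsort(client_scores)[::-1]   # item indices, best score first
held_out = {3, 17}                          # hypothetical items bought between T+1 and T+delta
print(precision_at_k(ranking, held_out, k=10))
```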


Beyond Matrix Factorization: Incorporating Additional Features


While matrix factorization provides valuable latent features, we often have additional explicit information about clients (e.g., demographics, stated risk tolerance, account tenure) and items (e.g., asset class, sector, ESG rating, volatility). Incorporating these can lead to more nuanced predictive models.


One approach is a client-centric model. Here, we combine the client’s latent features from matrix factorization with their other available explicit features. The goal is then to train a model that takes all these client features as input at time T and predicts the likelihood of interaction with various items in the future window (T+1 to T+delta). Again, defining the target interaction (e.g., purchase) and the window length (delta) is crucial. Too short a window might result in very few positive interaction examples for training. Models like XGBoost, an efficient gradient boosting algorithm, are often suitable for such predictive tasks. A key advantage of this client-centric approach is inference efficiency: for a given client, a single run of the model can potentially generate scores or probabilities for all relevant items simultaneously.
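A rough sketch of this client-centric setup, here framed as a multi-class problem on synthetic data so that a single predict_proba call returns a score per item (the feature construction and labels are placeholders, not the real pipeline):

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in data: latent client factors (e.g. rows of U_k) concatenated
# with explicit features such as risk tolerance and account tenure.
rng = np.random.default_rng(0)
n_clients, n_latent, n_items = 500, 5, 4
X = np.hstack([rng.normal(size=(n_clients, n_latent)),     # latent factors
               rng.normal(size=(n_clients, 2))])           # explicit client features

# Target: which item the client interacted with in the window T+1 .. T+delta
# (random here; in practice derived from held-out future interactions).
y = rng.integers(0, n_items, size=n_clients)

# Multi-class formulation: one predict_proba call yields a score for every item.
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
item_probs = model.predict_proba(X[:1])     # shape (1, n_items): all item scores in one run
```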


Alternatively, if we also have rich item features, we can build a user-item interaction model. This model takes the combined features of both the client (latent + explicit) and a specific item as input at time T to predict the likelihood of that specific client interacting with that specific item in the future window. This allows for modeling potentially complex interplay between client and item characteristics. However, this comes at the cost of inference complexity. To generate a ranked list for a single client, the model must be run separately for each candidate item, which can be significantly more computationally expensive than the client-centric approach where one inference run covers all items. The choice between these approaches depends on the available data, the desired level of interaction modeling, and computational constraints.
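For contrast, a sketch of the pairwise user-item formulation on synthetic data, highlighting that ranking items for one client means scoring each candidate pair separately (labels are random placeholders):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
n_clients, n_items = 300, 4
client_feats = rng.normal(size=(n_clients, 6))   # latent + explicit client features
item_feats = rng.normal(size=(n_items, 3))       # e.g. asset class encoding, volatility, ESG score

# Training set: one row per (client, item) pair with a binary "interacted" label.
X_pairs = np.vstack([np.concatenate([client_feats[c], item_feats[i]])
                     for c in range(n_clients) for i in range(n_items)])
y_pairs = rng.integers(0, 2, size=n_clients * n_items)   # placeholder labels for the sketch

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X_pairs, y_pairs)

# Inference for ONE client requires scoring every candidate item separately.
client = client_feats[0]
candidates = np.vstack([np.concatenate([client, item_feats[i]]) for i in range(n_items)])
scores = model.predict_proba(candidates)[:, 1]   # one probability per (client, item) pair
ranking = np.argsort(scores)[::-1]
```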


It’s important to recognize that using matrix factorization-derived features within a subsequent predictive model (like XGBoost) creates a two-step procedure. First, factorization is used to generate latent features. Second, these latent features, along with other explicit features, are fed into the predictive model.

A practical challenge arises when dealing with new clients or new interaction data for existing clients that were not part of the original factorization training. Their latent feature vectors are not directly available. A common technique, often called “folding-in,” can address this if some initial interaction data exists: assuming the item-feature matrix ($V_k^T$) is stable, we can keep it fixed. Then, using the new interaction data for a specific client, we can mathematically solve (often via least squares) for the corresponding user-feature vector ($u_{new}$) that best explains their observed interactions. This allows mapping new data into the existing latent space, but it’s a warm-start approach and doesn’t solve the true cold-start problem for clients with absolutely no prior interactions (which requires other strategies like content-based or hybrid models).

Furthermore, this two-step setup introduces potential fragility. The latent features derived from factorization are abstract mathematical constructs ordered by their associated singular values (or equivalent importance measure). They don’t inherently represent easily interpretable concepts like “ESG affinity,” and any such interpretation is speculative. As data evolves and the factorization model is retrained, the relative importance and ordering of these abstract concepts can shift. What was captured by the first latent feature in one run might be captured by the second or third in a later run. This means the meaning of a specific feature position (e.g., “latent_feature_1”) can change. Therefore, it’s crucial to be mindful of this potential instability and ensure that the downstream predictive model (e.g., XGBoost) is always retrained or recalibrated in alignment with the specific set and ordering of latent features generated by the corresponding factorization model run.
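Returning to the folding-in idea above, here is a minimal sketch that reuses `sigma_k` and `Vt_k` from the earlier SVD example and solves for a new client’s latent vector by least squares (the observed interactions are made up):

```python
import numpy as np

# Folding-in sketch: keep the item factors (and singular values) fixed and solve
# for a new client's latent vector from their observed interactions.
n_items = Vt_k.shape[1]
a_new = np.zeros(n_items)
a_new[[3, 17]] = 1.0              # hypothetical interactions observed for the new client

# We look for u_new such that  a_new ≈ u_new @ diag(sigma_k) @ Vt_k.
M = np.diag(sigma_k) @ Vt_k       # shape (k, n_items), assumed stable
u_new, *_ = np.linalg.lstsq(M.T, a_new, rcond=None)
# u_new places the new client in the existing latent space; predicted scores for
# all items then follow from u_new @ M.
```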


Understanding Model Predictions: Explainability (XAI)


Understanding why a model makes a certain prediction or surfaces specific information is crucial in finance for building trust, debugging models, and meeting potential regulatory requirements. SHAP (SHapley Additive exPlanations) offers a powerful, model-agnostic technique rooted in cooperative game theory. The core concept involves fairly distributing the model’s prediction “payout” (the deviation from a baseline prediction) among the contributing features based on their average marginal contribution across all possible feature combinations. SHAP answers: “How much did this specific feature value contribute to this specific prediction?” SHAP algorithms provide efficient ways to estimate these Shapley values, enabling both local explainability (understanding the drivers of a single prediction, like which factors caused asset Y to be suggested for client Z) and global explainability (aggregating SHAP values across many predictions provides a robust measure of overall feature importance, often visualized to show magnitude and direction of effects). Using SHAP allows for a much richer understanding than basic feature importance alone, although it’s worth noting that for very large datasets or models with highly correlated features, SHAP calculations can become computationally expensive and potentially less stable.
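A small illustration with the shap package on a toy XGBoost model (synthetic data; exact output shapes can vary slightly between shap versions):

```python
import numpy as np
import shap
import xgboost as xgb

# Toy stand-in for the client-centric model: 6 features, binary "interacted" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=500) > 0).astype(int)
model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # (n_samples, n_features) for a binary XGBoost model

# Local explanation: per-feature contributions to one specific prediction.
print(shap_values[0])

# Global importance: mean absolute SHAP value per feature across all predictions.
print(np.abs(shap_values).mean(axis=0))
```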


Final Polish: Post-Processing and Practical Realities


Raw model outputs invariably require careful post-processing and consideration of the practical context. This includes filtering out items the client might already hold significantly or have recently interacted with, depending on the objective. Critically, it involves applying business and regulatory rules. This step is non-negotiable in finance; any surfaced items must be rigorously checked against compliance regulations, suitability rules (matching client risk profile, objectives, knowledge), diversification guidelines, concentration limits, and internal policies.
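A deliberately simplified sketch of this filtering step (the holdings, risk ratings, and rule are entirely made up; real suitability and compliance checks are far more involved):

```python
# Hypothetical post-processing of a ranked recommendation list.
current_holdings = {"AAPL"}
item_risk = {"AAPL": 3, "BND_X": 1, "CRYPTO_ETN": 5, "ESG_FUND": 2}   # illustrative risk ratings
client_max_risk = 3                                                    # from the client's risk profile

ranked_candidates = ["CRYPTO_ETN", "AAPL", "ESG_FUND", "BND_X"]        # model output, best first
final_list = [item for item in ranked_candidates
              if item not in current_holdings
              and item_risk[item] <= client_max_risk]
# -> ["ESG_FUND", "BND_X"]; real suitability and compliance rules would be far richer.
```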


Disclaimer: This post is the result of me chatting with an AI to dust off my knowledge on this topic. The AI then kindly drafted this summary based on our talk and my outline, serving as my personal ‘don’t forget!’ note for the future – because apparently, my brain isn’t a perfect recording device. I’ve made minor edits for clarity.