{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Code Labs Posts",
  "description": "Code deep-dives.",
  "home_page_url": "https://ouatu.ro/",
  "feed_url": "https://ouatu.ro/blog/lab/feed.json",
  "items": [
    {
      "id": "https://ouatu.ro/blog/als/",
      "url": "https://ouatu.ro/blog/als/",
      "title": "Implementing Alternating Least Squares (ALS) from Scratch in Python",
      "content_html": "<p>Alternating Least Squares (ALS) is a cornerstone of classic recommendation systems. But how well does a pure implementation actually perform on a real-world dataset like MovieLens? I built it from scratch in Python to find out. In this lab, we'll derive the math, build an efficient vectorized implementation, and discover a surprising truth: without specific modifications, it can fail to beat even simple baselines. Here's the full story.</p>\n<h2>The Matrix Factorization Problem</h2>\n<p>In recommendation systems, we typically have:</p>\n<ul>\n<li>A set of users</li>\n<li>A set of items</li>\n<li>A sparse matrix of known ratings or interactions</li>\n</ul>\n<p>Let's call this sparse matrix $R$ where $R_{ui}$ represents user $u$'s interaction with item $i$. Most entries in $R$ are missing.</p>\n<p>The basic ALS formulation is particularly useful in <strong>implicit feedback settings</strong>, where zeros are assumed to be meaningful rather than missing data. This approach is commonly used in <strong>implicit recommendation systems</strong>, such as collaborative filtering for user interactions (e.g., clicks, purchases, or views).<br />\nHowever, in <strong>explicit feedback settings</strong> (e.g., movie ratings), missing values should be properly handled to avoid bias in factorization.</p>\n<p>The goal of matrix factorization is to approximate $R$ as the product of two lower-dimensional matrices:</p>\n<p>$$\nR \\approx U \\cdot V^T\n$$</p>\n<p>Where:\n$$\n\\begin{aligned}\n&amp; U \\text{ is a user-feature matrix of shape } [\\text{num_users}, \\text{num_features}] \\\n&amp; V \\text{ is an item-feature matrix of shape } [\\text{num_items}, \\text{num_features}]\n\\end{aligned}\n$$</p>\n<p>Each row of $ U $ represents a <strong>user’s latent preference vector</strong>, which encodes how much they tend to interact with certain hidden factors. 
Likewise, each row of $ V $ represents an <strong>item’s latent feature vector</strong>, which captures how strongly an item aligns with those same factors. These vectors are learned such that their interactions best approximate the observed data in $ R $, revealing patterns in user behavior and item similarities.</p>\n<h3><strong>Solving a Non-Convex Optimization Problem</strong></h3>\n<p>To reiterate, we aim to approximate $R$ using two lower-rank matrices, capturing the latent structure in the data:</p>\n<p>$$\n\\min_{U, V} \\sum_{(u,i) \\in \\text{observed}} (R_{ui} - U_u V_i^T)^2\n$$</p>\n<p>This is a <strong>non-convex optimization problem</strong> because both $ U $ and $ V $ are unknown and multiply each other. <strong>Simultaneously</strong> optimizing both matrices would require solving a problem with multiple local minima, making direct gradient-based optimization difficult.</p>\n<h3><strong>Why ALS Works</strong></h3>\n<p>Instead of solving for both $ U $ and $ V $ at once, <strong>Alternating Least Squares (ALS)</strong> breaks the problem into two <strong>convex subproblems</strong>:</p>\n<ol>\n<li><strong>Fix $ V $ and solve for $ U $</strong></li>\n<li><strong>Fix $ U $ and solve for $ V $</strong></li>\n</ol>\n<p>Since each of these steps is a standard <strong>least squares problem</strong>, they have a closed-form solution that can be computed efficiently.</p>\n<h2>Setting Up Our Example</h2>\n<p>Let's create a small example with synthetic data to demonstrate ALS.</p>\n<pre><code class=\"language-python\">import numpy as np\nnp.set_printoptions(precision=3)\nnp.random.seed(42)\n</code></pre>\n<pre><code class=\"language-python\">users = 50\nitems = 30\nfeatures = 10\n</code></pre>\n<p>The following matrix $R$ represents our sparse user-item interaction matrix. 
For simplicity, we're using binary values (0 or 1) to indicate whether a user interacted with an item:</p>\n<pre><code class=\"language-python\">R = np.random.choice([0, 1], size= [users,items], p=[.9, .1])\nR\n</code></pre>\n<pre><code>array([[0, 1, 0, ..., 0, 0, 0],\n       [0, 0, 0, ..., 0, 0, 0],\n       [0, 0, 0, ..., 0, 0, 0],\n       ...,\n       [0, 0, 0, ..., 0, 0, 0],\n       [1, 0, 0, ..., 0, 0, 0],\n       [0, 0, 0, ..., 0, 0, 0]], shape=(50, 30))\n</code></pre>\n<p>Here we initialize our user and item latent factor matrices with random values from a normal distribution. In a real application, we might use different initialization strategies:</p>\n<pre><code class=\"language-python\">U = np.random.normal(0,1, [users, features])\nV = np.random.normal(0,1, [items, features])\n</code></pre>\n<h3><strong>Understanding the ALS Updates</strong></h3>\n<p>Matrix factorization assumes we approximate $R$ using two lower-rank matrices $U$ and $V$:</p>\n<p>$$\nU V^T = R \\quad \\Leftrightarrow \\quad V U^T = R^T\n$$</p>\n<p>To find optimal values for these matrices, ALS alternates between solving for $U$ with $V$ fixed and vice versa. 
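A single half-step of this alternation can be sketched per row on toy data, with <code>np.linalg.lstsq</code> standing in for the closed-form solution derived below:</p>\n<pre><code class=\"language-python\">import numpy as np\n\nrng = np.random.default_rng(0)\nR = rng.choice([0.0, 1.0], size=(6, 4), p=[0.8, 0.2])  # toy interactions\nV = rng.normal(size=(4, 2))  # item factors held fixed\n# With V fixed, each row U[i] is an ordinary least squares solution\nU = np.vstack([np.linalg.lstsq(V, R[i], rcond=None)[0] for i in range(R.shape[0])])\nassert U.shape == (6, 2)\n</code></pre>\n<p>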
This alternating approach makes each sub-problem convex, so each step has a closed-form solution (unique whenever the corresponding normal-equations matrix is invertible).</p>\n<p>Let's illustrate this clearly by deriving the update rule explicitly for one row of $U$:</p>\n<ul>\n<li>\n<p><strong>For each row $U[i]$:</strong></p>\n<p>$$\nU[i] V^T = R[i]\n$$</p>\n<p>This corresponds to solving the least squares problem[^1]:</p>\n<p>$$\n\\min_{U[i]} \\sum_{j}\\left(R_{ij} - U[i] V[j]^T\\right)^2\n$$</p>\n<p>Expanding the squared term gives:</p>\n<p>$$\n\\sum_{j}\\left(R_{ij}^2 - 2 R_{ij} U[i] V[j]^T + (U[i] V[j]^T)^2\\right)\n$$</p>\n<p>Since we fixed $V$, this function is convex in $U[i]$, and the minimum is found by setting the derivative to zero:</p>\n<p>$$\n\\sum_j\\left(-2 R_{ij} V[j] + 2 U[i] V[j]^T V[j]\\right) = 0\n$$</p>\n<p>Simplifying, we have:</p>\n<p>$$\nU[i] \\sum_j V[j]^T V[j] = \\sum_j R_{ij} V[j]\n$$</p>\n<p>Recognizing the matrix forms:</p>\n<p>$$\nV^T V = \\sum_j V[j]^T V[j], \\quad V^T R[i] = \\sum_j R_{ij} V[j]\n$$</p>\n<p>we obtain the closed-form solution:</p>\n<p>$$\nU[i] = (V^T V)^{-1} V^T R[i]\n$$</p>\n</li>\n<li>\n<p><strong>For each row $V[i]$:</strong></p>\n<p>$$\nV[i] U^T = R[:, i]\n$$</p>\n<p>which leads to the least squares update:</p>\n<p>$$\nV[i] = (U^T U)^{-1} U^T R[:, i]\n$$</p>\n</li>\n</ul>\n<p>Now, instead of iterating over rows, these updates can be <strong>vectorized</strong> for efficiency and conciseness:</p>\n<ul>\n<li>\n<p><strong>Solving for $U$ in matrix form:</strong></p>\n<p>$$\nU = R V (V^T V)^{-1}\n$$</p>\n</li>\n<li>\n<p><strong>Solving for $V$ in matrix form:</strong></p>\n<p>$$\nV = R^T U (U^T U)^{-1}\n$$</p>\n</li>\n</ul>\n<h3><strong>Solving the Least Squares System with <code>numpy.linalg.solve</code></strong></h3>\n<p>The least squares updates derived above involve computing matrix inverses. However, explicitly computing the inverse (or pseudoinverse) can be numerically unstable. 
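The two approaches are easy to compare on a small, well-conditioned toy system (an illustrative sketch only; on ill-conditioned problems the explicit inverse degrades first):</p>\n<pre><code class=\"language-python\">import numpy as np\n\nrng = np.random.default_rng(0)\nV = rng.normal(size=(30, 10))\nb = rng.normal(size=10)\nA = V.T @ V                      # normal-equations matrix\nx_inv = np.linalg.inv(A) @ b     # explicit inverse: works here, but less stable\nx_solve = np.linalg.solve(A, b)  # direct solve: preferred\nassert np.allclose(x_inv, x_solve)\n</code></pre>\n<p>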
Instead, we solve the linear system directly using <code>numpy.linalg.solve</code>, which finds the solution $X$ to the equation:</p>\n<p>$$\nA X = B\n$$</p>\n<p>where $A = V^T V$ and $B = V^T R^T$ (for solving $U$), or $A = U^T U$ and $B = U^T R$ (for solving $V$).</p>\n<p>Thus, the least squares updates can be computed efficiently and with better numerical stability as:</p>\n<p>$$\nU = \\text{np.linalg.solve}(V^T V, V^T R^T)^T\n$$</p>\n<p>$$\nV = \\text{np.linalg.solve}(U^T U, U^T R)^T\n$$</p>\n<p>We will use this formulation in the ALS implementation below.</p>\n<pre><code class=\"language-python\"># Compute initial reconstruction error\nprev_score = np.linalg.norm(U @ V.T - R)\n\n# Store errors for visualization\nerrors = [prev_score]\n\n# ALS Iterations\nnum_iterations = 100\nfor iteration in range(num_iterations):\n    # Without vectorization (per-row computation)\n    # for i in range(U.shape[0]):\n    #     U[i] = np.linalg.pinv(V.T @ V) @ V.T @ R[i]\n\n    # Equivalent vectorized forms for solving U:\n    # U = (np.linalg.pinv(V.T @ V) @ V.T @ R.T).T\n    # U = R @ V @ np.linalg.pinv(V.T @ V)\n    U = np.linalg.solve(V.T @ V, V.T @ R.T).T  # numerically stable solution\n\n    # Without vectorization (per-row computation)\n    # for i in range(V.shape[0]):\n    #     V[i] = np.linalg.pinv(U.T @ U) @ U.T @ R[:, i]\n\n    # Equivalent vectorized forms for solving V:\n    # V = (np.linalg.pinv(U.T @ U) @ U.T @ R).T\n    # V = R.T @ U @ np.linalg.pinv(U.T @ U)\n    V = np.linalg.solve(U.T @ U, U.T @ R).T  # numerically stable solution\n\n    # Compute and store error after each iteration\n    errors.append(np.linalg.norm(U @ V.T - R))\n\n# Compute final reconstruction error\nfinal_score = errors[-1]\n\nprint(f\"Initial error: {errors[0]:.4f}\")\nprint(f\"Error after 1 iteration: {errors[1]:.4f}\")\nprint(f\"Final error after {num_iterations} iterations: {final_score:.4f}\")\n\n</code></pre>\n<pre><code>Initial error: 120.4196\nError after 1 iteration: 
8.3655\nFinal error after 100 iterations: 6.6819\n</code></pre>\n<pre><code class=\"language-python\">import matplotlib.pyplot as plt\nplt.style.use('dark_background')\nplt.figure(figsize=(12, 6))\nplt.plot(errors[1:], marker=\"o\", linestyle=\"-\", markersize=3, label=\"Reconstruction Error\") # Exclude the first error value to better visualize the convergence process without the initial large drop.\nplt.xlabel(\"Iteration\")\nplt.ylabel(\"Error (Frobenius Norm)\")\nplt.title(\"ALS Error Convergence\")\nplt.legend()\nplt.grid()\nplt.show()\n\n</code></pre>\n<p><img src=\"als_files/als_14_0.png\" alt=\"png\" /></p>\n<h3><strong>Regularized ALS</strong></h3>\n<p>In practice, it's common to add regularization terms to ALS to prevent overfitting and improve numerical stability. Regularized ALS optimizes the following loss function:</p>\n<p>$$\nL = \\sum_{m,n}(R_{mn} - U_m^T V_n)^2 + \\lambda \\sum_m ||U_m||^2 + \\lambda \\sum_n ||V_n||^2\n$$</p>\n<p>This leads to the regularized update equations, the ridge-regularized counterparts of the matrix forms above:</p>\n<p>$$\nU = R V (V^T V + \\lambda I)^{-1}\n\\quad \\text{and} \\quad\nV = R^T U (U^T U + \\lambda I)^{-1}\n$$</p>\n<p>Regularization is especially valuable when working with sparse datasets, as it helps avoid <strong>singular matrix issues</strong> during matrix inversion steps.</p>\n<h3><strong>Weighted ALS (WALS)</strong></h3>\n<p>Weighted ALS generalizes ALS by assigning different importance (weights) to observed ratings. This method is particularly beneficial in explicit feedback scenarios, such as rating systems (e.g., movie ratings), where some items or users have significantly more interactions than others. 
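The imbalance is easy to see from per-item interaction counts, sketched here on a toy matrix like the one used earlier:</p>\n<pre><code class=\"language-python\">import numpy as np\n\nrng = np.random.default_rng(0)\nR = rng.choice([0, 1], size=(50, 30), p=[0.9, 0.1])  # sparse toy interactions\ncounts = (R &gt; 0).sum(axis=0)  # interactions per item (popularity)\nassert counts.shape == (30,)  # some items end up far more popular than others\n</code></pre>\n<p>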
By applying weights, WALS compensates for this imbalance, boosting underrepresented items and improving recommendation fairness.</p>\n<p>WALS optimizes the following loss function:</p>\n<p>$$\nL^w = \\sum_{m,n} w_{mn}(R_{mn} - U_m^T V_n)^2 + \\lambda \\sum_m ||U_m||^2 + \\lambda \\sum_n ||V_n||^2\n$$</p>\n<p>Here, each rating's squared error is scaled individually by a weight $w_{mn}$.</p>\n<h4><strong>Choosing the weights</strong></h4>\n<p>Inspired by <a href=\"https://cs229.stanford.edu/proj2017/final-posters/5147271.pdf\">this resource</a>[^2], we propose a practical method for computing these weights, accounting for item popularity:</p>\n<ul>\n<li>Each weight $w_{mn}$ is computed as a baseline plus a scaling factor dependent on how frequently the item $n$ has been reviewed:</li>\n</ul>\n<p>$$\nw_{mn} = w_0 + f(c_n)\n$$</p>\n<ul>\n<li>$w_0$ is a baseline weight, ensuring every interaction has a minimal influence.</li>\n<li>$c_n = \\sum_{m} \\mathbf{1}(R_{mn} &gt; 0)$ is the number of non-zero ratings for item $n$, representing the item's popularity.</li>\n</ul>\n<p>Two common choices for the scaling function $f(c_n)$ are:</p>\n<ul>\n<li>\n<p><strong>Linear (explicit) scaling</strong>, suitable for explicit feedback datasets (such as movie ratings):</p>\n<p>$$\nf(c_n) = \\frac{w_k}{c_n}\n$$</p>\n<p>Here, more popular items (higher $c_n$) receive lower additional weight, balancing their influence.</p>\n</li>\n<li>\n<p><strong>Exponential (implicit) scaling</strong>, suitable for implicit feedback scenarios (such as clicks or views):</p>\n<p>$$\nf(c_n) = \\left(\\frac{1}{c_n}\\right)^e\n$$</p>\n<p>This sharply decreases the influence of very popular items, controlled by the exponent $e$.</p>\n</li>\n</ul>\n<h4><strong>Weighted ALS Update Step</strong></h4>\n<p>When performing updates in WALS, the weight vector for each user or item is expanded into a diagonal matrix, with the weights placed along the diagonal:</p>\n<ul>\n<li>\n<p>For user factors 
$U_m$:</p>\n<p>$$\nU_m = \\left(V^T (\\text{diag}(w_m)) V + \\lambda I\\right)^{-1} V^T (\\text{diag}(w_m)) R_m\n$$</p>\n</li>\n<li>\n<p>For item factors $V_n$:</p>\n<p>$$\nV_n = \\left(U^T (\\text{diag}(w_n)) U + \\lambda I\\right)^{-1} U^T (\\text{diag}(w_n)) R_n\n$$</p>\n</li>\n</ul>\n<p>In these equations, $ \\text{diag}(w_m) $ and $ \\text{diag}(w_n) $ explicitly create diagonal matrices from weight vectors $w_m$ and $w_n$, respectively, ensuring that each interaction is weighted correctly and independently.</p>\n<p>A complete, efficient implementation of Weighted ALS using these updates will be provided in the full code example at the end of this blog post.</p>\n<h3><strong>Alternatives and Extra Resources</strong></h3>\n<p>While ALS and its weighted variant are effective, other optimization methods like <strong>Stochastic Gradient Descent (SGD)</strong> are frequently employed:</p>\n<ul>\n<li><strong>Stochastic Gradient Descent (SGD)</strong> updates parameters iteratively, adjusting each user-item interaction individually. This characteristic makes SGD well-suited for <strong>online recommendation systems</strong>, though typically slower for large batch-processed datasets.</li>\n</ul>\n<p>Notable resources and advanced readings include:</p>\n<ul>\n<li><a href=\"https://arxiv.org/pdf/1708.05024\">\"Fast Matrix Factorization for Online Recommendation with Implicit Feedback\"</a>[^3], presenting efficient algorithms specifically tailored for implicit-feedback online scenarios.</li>\n</ul>\n<p>These alternatives and resources are valuable considerations when adapting matrix factorization methods to diverse real-world scenarios.</p>\n<h2>Conclusion</h2>\n<p>Alternating Least Squares is a powerful technique for matrix factorization in recommendation systems. 
The algorithm's key advantage is that it handles the non-convex optimization problem by alternating between convex subproblems, each of which has a closed-form solution.</p>\n<p>While more advanced techniques like neural collaborative filtering have emerged in recent years, ALS remains relevant for its simplicity, interpretability, and effectiveness, especially for large-scale recommendation tasks.</p>\n<h2>Full Python implementations</h2>\n<pre><code class=\"language-python\">import numpy as np\n\n# ---------- weighting utils ----------\ndef linear_weight_fn(c_n, w0: float = 0.1, wk: float = 1.0):\n    \"\"\"Per-item popularity weights; smaller extra weight for very popular items.\"\"\"\n    return w0 + wk / (c_n + 1e-8)\n\ndef make_mask(R: np.ndarray, zero_means_missing: bool = True) -&gt; np.ndarray:\n    \"\"\"\n    Build observation mask M.\n    - If zero_means_missing: treat 0 as missing (implicit/binary clicks).\n    - Else: any entry counts as observed (or non-NaN for float matrices).\n    \"\"\"\n    if zero_means_missing:\n        return (R &gt; 0).astype(float)\n    return (~np.isnan(R)).astype(float) if np.issubdtype(R.dtype, np.floating) else np.ones_like(R, dtype=float)\n\ndef item_weights_from_mask(M: np.ndarray, weight_fn=linear_weight_fn) -&gt; np.ndarray:\n    \"\"\"Compute per-item weights from popularity (column sums of M).\"\"\"\n    c_n = M.sum(axis=0)          # popularity per item\n    return weight_fn(c_n)         # shape: (items,)\n\n# ---------- metrics ----------\ndef observed_rmse(R_true: np.ndarray, R_pred: np.ndarray, M: np.ndarray) -&gt; float:\n    num = ((R_true - R_pred)**2 * M).sum()\n    den = M.sum()\n    return np.sqrt(num / max(den, 1))\n\ndef weighted_rmse(R_true: np.ndarray, R_pred: np.ndarray, w_item: np.ndarray, M: np.ndarray) -&gt; float:\n    W = M * w_item               # broadcasts over columns\n    num = ((R_true - R_pred)**2 * W).sum()\n    den = W.sum()\n    return np.sqrt(num / max(den, 
1))\n\n</code></pre>\n<pre><code class=\"language-python\">import numpy as np\n\ndef als_with_regularization(\n    R: np.ndarray,\n    rank: int = 10,\n    iters: int = 100,\n    reg: float = 0.1,\n    *,\n    zero_means_missing: bool = True,\n    seed: int | None = 42,\n    dtype=np.float32,\n):\n    \"\"\"\n    Unweighted ALS. Zeros are treated as real observations inside the update;\n    M is returned so you can ignore zeros as 'missing' in metrics.\n    Returns: U, V, w_item (all ones), M\n    \"\"\"\n    rng = np.random.default_rng(seed) if seed is not None else np.random.default_rng()\n    users, items = R.shape\n    I = np.eye(rank, dtype=dtype)\n\n    U = rng.normal(0, 1, (users, rank)).astype(dtype)\n    V = rng.normal(0, 1, (items, rank)).astype(dtype)\n    Rt = R.astype(dtype)\n\n    for _ in range(iters):\n        Gv = V.T @ V + reg * I\n        U = (np.linalg.solve(Gv, V.T @ Rt.T).T).astype(dtype)\n\n        Gu = U.T @ U + reg * I\n        V = (np.linalg.solve(Gu, U.T @ Rt).T).astype(dtype)\n\n    M = make_mask(R, zero_means_missing=zero_means_missing)\n    w_item = np.ones(items, dtype=float)\n    return U, V, w_item, M\n\n# ----- demo -----\nusers, items, rank = 50, 30, 10\nR = np.random.default_rng(42).choice([0, 1], size=(users, items), p=[0.9, 0.1])\n\nU, V, w_item, M = als_with_regularization(R, rank=rank, iters=100, reg=0.1, seed=42, zero_means_missing=True)\nR_hat = U @ V.T\n\nprint(f\"Observed RMSE (mask only): {observed_rmse(R, R_hat, M):.4f}\")\nprint(f\"Weighted RMSE (matches training): {weighted_rmse(R, R_hat, w_item, M):.4f}\")\nprint(f\"Full Frobenius Norm (incl. missing zeros): {np.linalg.norm(R - R_hat):.4f}\")\n\n</code></pre>\n<pre><code>Observed RMSE (mask only): 0.4026\nWeighted RMSE (matches training): 0.4026\nFull Frobenius Norm (incl. 
missing zeros): 6.9064\n</code></pre>\n<pre><code class=\"language-python\">import numpy as np\n\ndef weighted_als(\n    R: np.ndarray,\n    rank: int = 10,\n    iters: int = 100,\n    reg: float = 0.1,\n    *,\n    weight_fn=linear_weight_fn,\n    zero_means_missing: bool = True,\n    seed: int | None = 42,\n    dtype=np.float32,\n):\n    \"\"\"\n    Weighted ALS where observed entries are scaled by per-item weights.\n    Returns: U, V, w_item, M\n    \"\"\"\n    rng = np.random.default_rng(seed) if seed is not None else np.random.default_rng()\n    users, items = R.shape\n    I = np.eye(rank, dtype=dtype)\n\n    U = rng.normal(0, 1, (users, rank)).astype(dtype)\n    V = rng.normal(0, 1, (items, rank)).astype(dtype)\n    Rt = R.astype(dtype)\n\n    M = make_mask(R, zero_means_missing=zero_means_missing)     # (users, items)\n    w_item = item_weights_from_mask(M, weight_fn)               # (items,)\n    W = M * w_item                                              # per-entry weights\n\n    for _ in range(iters):\n        # users\n        for m in range(users):\n            wm = W[m]                       # (items,)\n            Vw = V * wm[:, None]\n            A = V.T @ Vw + reg * I\n            b = (Rt[m] * wm) @ V\n            U[m] = np.linalg.solve(A, b).astype(dtype)\n        # items\n        for n in range(items):\n            wn = W[:, n]                    # (users,)\n            Uw = U * wn[:, None]\n            A = U.T @ Uw + reg * I\n            b = (Rt[:, n] * wn) @ U\n            V[n] = np.linalg.solve(A, b).astype(dtype)\n\n    return U, V, w_item, M\n\n# ----- demo -----\nusers, items, rank = 50, 30, 10\nR = np.random.default_rng(42).choice([0, 1], size=(users, items), p=[0.9, 0.1])\n\nU, V, w_item, M = weighted_als(R, rank=rank, iters=100, reg=0.1, weight_fn=linear_weight_fn, seed=42, zero_means_missing=True)\nR_hat = U @ V.T\n\nprint(f\"Observed RMSE (mask only): {observed_rmse(R, R_hat, M):.4f}\")\nprint(f\"Weighted RMSE (matches training): 
{weighted_rmse(R, R_hat, w_item, M):.4f}\")\nprint(f\"Full Frobenius Norm (incl. missing zeros): {np.linalg.norm(R - R_hat):.4f}\")\n\n</code></pre>\n<pre><code>Observed RMSE (mask only): 0.0925\nWeighted RMSE (matches training): 0.0931\nFull Frobenius Norm (incl. missing zeros): 31.8075\n</code></pre>\n<h1>Real-world example</h1>\n<pre><code class=\"language-python\"># --- deps &amp; insecure download (requested) ---\nimport io, time, numpy as np, pandas as pd, requests, urllib3\nfrom functools import partial\n\nurl = \"https://files.grouplens.org/datasets/movielens/ml-100k/u.data\"\nurllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\nresp = requests.get(url, verify=False, timeout=30)\nresp.raise_for_status()\n\nratings = pd.read_csv(io.BytesIO(resp.content), sep=\"\\t\",\n                      names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n\n# --- preprocess: zero-index, time-based 80/20 split, filter cold test users ---\nratings[\"user_id\"] -= 1\nratings[\"item_id\"] -= 1\n\nratings_sorted = ratings.sort_values(\"timestamp\").reset_index(drop=True)\nsplit_idx = int(len(ratings_sorted) * 0.8)\ntrain_data, test_data = ratings_sorted[:split_idx], ratings_sorted[split_idx:]\n\nuser_rating_counts = train_data['user_id'].value_counts()\nvalid_users = user_rating_counts[user_rating_counts &gt;= 10].index\ntest_data = test_data[test_data['user_id'].isin(valid_users)]\n\nprint(f\"Train users: {train_data['user_id'].nunique()} | Train ratings: {len(train_data)}\")\nprint(f\"Test  users: {test_data['user_id'].nunique()} | Test  ratings: {len(test_data)}\")\n\nnum_users = ratings.user_id.max() + 1\nnum_items = ratings.item_id.max() + 1\n\ndef to_matrix(df):\n    R = np.zeros((num_users, num_items), dtype=np.float32)\n    for r in df.itertuples(index=False):\n        R[r.user_id, r.item_id] = r.rating\n    return R\n\nR_train = to_matrix(train_data)\nR_test  = to_matrix(test_data)\n\n# --- masks &amp; popularity weights (computed **only from 
train**) ---\nM_train = make_mask(R_train, zero_means_missing=True)\nM_test  = make_mask(R_test,  zero_means_missing=True)\n\n# default linear popularity weights; will override with partials in grid search\nw_item_train_default = item_weights_from_mask(M_train, weight_fn=linear_weight_fn)\n\n# --- baselines: unweighted ALS vs weighted ALS (default linear weights) ---\nrank, iters, reg = 10, 100, 1.5\n\nt0 = time.time()\nU_reg, V_reg, _, _ = als_with_regularization(\n    R_train, rank=rank, iters=iters, reg=reg, seed=42, zero_means_missing=True\n)\ntime_reg = time.time() - t0\n\nt0 = time.time()\nU_wlin, V_wlin, w_item_used, _ = weighted_als(\n    R_train, rank=rank, iters=iters, reg=reg,\n    weight_fn=linear_weight_fn, seed=42, zero_means_missing=True\n)\ntime_wlin = time.time() - t0\n\nRhat_reg  = U_reg @ V_reg.T\nRhat_wlin = U_wlin @ V_wlin.T\n\n# --- evaluation policy:\n# For model selection: weighted RMSE with **train-derived** item weights, masking on the eval split.\n# Also report plain observed RMSE for comparability.\ntrain_weighted_rmse_reg  = weighted_rmse(R_train, Rhat_reg,  w_item_train_default, M_train)\ntest_weighted_rmse_reg   = weighted_rmse(R_test,  Rhat_reg,  w_item_train_default, M_test)\n\ntrain_weighted_rmse_wlin = weighted_rmse(R_train, Rhat_wlin, w_item_train_default, M_train)\ntest_weighted_rmse_wlin  = weighted_rmse(R_test,  Rhat_wlin, w_item_train_default, M_test)\n\ntrain_obs_rmse_reg  = observed_rmse(R_train, Rhat_reg,  M_train)\ntest_obs_rmse_reg   = observed_rmse(R_test,  Rhat_reg,  M_test)\n\ntrain_obs_rmse_wlin = observed_rmse(R_train, Rhat_wlin, M_train)\ntest_obs_rmse_wlin  = observed_rmse(R_test,  Rhat_wlin, M_test)\n\nresults_baselines = pd.DataFrame({\n    \"Method\": [\"ALS (unweighted)\", \"ALS (weighted linear default)\"],\n    \"Train RMSE (weighted)\": [train_weighted_rmse_reg,  train_weighted_rmse_wlin],\n    \"Test RMSE (weighted)\":  [test_weighted_rmse_reg,   test_weighted_rmse_wlin],\n    \"Train RMSE (observed)\": 
[train_obs_rmse_reg,       train_obs_rmse_wlin],\n    \"Test RMSE (observed)\":  [test_obs_rmse_reg,        test_obs_rmse_wlin],\n    \"Time (s)\": [time_reg, time_wlin],\n})\nprint(\"\\n--- Baseline Results ---\")\nprint(results_baselines)\n\n# --- global mean &amp; item-bias baselines (observed RMSE only; included for context) ---\nglobal_mean = R_train[M_train &gt; 0].mean() if M_train.sum() &gt; 0 else 0.0\nR_global = np.full_like(R_train, global_mean, dtype=np.float32)\n\nitem_sums   = (R_train * M_train).sum(axis=0)\nitem_counts = M_train.sum(axis=0)\nitem_biases = np.divide(item_sums - item_counts * global_mean, item_counts + 1e-8)\nR_itembias  = np.tile(global_mean, (num_users, num_items)).astype(np.float32) + item_biases\n\ngm_train = observed_rmse(R_train, R_global,   M_train)\ngm_test  = observed_rmse(R_test,  R_global,   M_test)\nib_train = observed_rmse(R_train, R_itembias, M_train)\nib_test  = observed_rmse(R_test,  R_itembias, M_test)\n\nresults_context = pd.DataFrame({\n    \"Method\": [\"Global Mean\", \"Item Bias\"],\n    \"Train RMSE (observed)\": [gm_train, ib_train],\n    \"Test RMSE (observed)\":  [gm_test,  ib_test],\n})\nprint(\"\\n--- Context Baselines (observed RMSE) ---\")\nprint(results_context)\n\n# --- grid search over **linear** weight parameters (train weights; eval policy honored) ---\nprint(\"\\nGrid search: Linear popularity weights (train-derived)\")\ngrid_rows = []\nfor w0 in [0.5, 1.0, 1.5]:\n    for wk in [0.05, 0.1, 0.2]:\n        wf = partial(linear_weight_fn, w0=w0, wk=wk)\n        # train model with these weights\n        start = time.time()\n        U_g, V_g, _, _ = weighted_als(\n            R_train, rank=rank, iters=iters, reg=reg,\n            weight_fn=wf, seed=42, zero_means_missing=True\n        )\n        elapsed = time.time() - start\n        Rhat_g = U_g @ V_g.T\n\n        # compute **train-derived weights** for scoring\n        w_item_train = item_weights_from_mask(M_train, weight_fn=wf)\n        test_wrmse = 
weighted_rmse(R_test, Rhat_g, w_item_train, M_test)\n        train_wrmse = weighted_rmse(R_train, Rhat_g, w_item_train, M_train)\n        test_obs = observed_rmse(R_test, Rhat_g, M_test)\n\n        grid_rows.append({\n            \"w0\": w0, \"wk\": wk,\n            \"Train RMSE (weighted)\": train_wrmse,\n            \"Test RMSE (weighted)\":  test_wrmse,\n            \"Test RMSE (observed)\":  test_obs,\n            \"Time (s)\": elapsed\n        })\n        print(f\"w0={w0:.2f}, wk={wk:.2f} | test(weighted)={test_wrmse:.4f}, test(observed)={test_obs:.4f}, time={elapsed:.2f}s\")\n\nlin_df = pd.DataFrame(grid_rows)\nbest_idx = lin_df[\"Test RMSE (weighted)\"].idxmin()\nbest_lin = lin_df.loc[best_idx]\nprint(\"\\nBest Linear Weights by Test RMSE (weighted, train-derived):\")\nprint(best_lin)\n\n</code></pre>\n<pre><code>Train users: 751 | Train ratings: 80000\nTest  users: 107 | Test  ratings: 2875\n\n--- Baseline Results ---\n                          Method  Train RMSE (weighted)  Test RMSE (weighted)  \\\n0               ALS (unweighted)               2.334016              3.395304   \n1  ALS (weighted linear default)               0.800927              3.395304   \n\n   Train RMSE (observed)  Test RMSE (observed)   Time (s)  \n0               2.276690              3.122586   0.108951  \n1               0.803892              1.201897  12.054622  \n\n--- Context Baselines (observed RMSE) ---\n        Method  Train RMSE (observed)  Test RMSE (observed)\n0  Global Mean               1.127381              1.128035\n1    Item Bias               0.995549              1.029807\n\nGrid search: Linear popularity weights (train-derived)\nw0=0.50, wk=0.05 | test(weighted)=3.3953, test(observed)=1.1765, time=11.87s\nw0=0.50, wk=0.10 | test(weighted)=3.3953, test(observed)=1.1763, time=12.13s\nw0=0.50, wk=0.20 | test(weighted)=3.3953, test(observed)=1.1760, time=12.03s\nw0=1.00, wk=0.05 | test(weighted)=3.3953, test(observed)=1.2039, time=11.59s\nw0=1.00, wk=0.10 | 
test(weighted)=3.3953, test(observed)=1.2042, time=11.49s\nw0=1.00, wk=0.20 | test(weighted)=3.3953, test(observed)=1.2049, time=11.33s\nw0=1.50, wk=0.05 | test(weighted)=3.3953, test(observed)=1.2274, time=11.85s\nw0=1.50, wk=0.10 | test(weighted)=3.3953, test(observed)=1.2278, time=12.19s\nw0=1.50, wk=0.20 | test(weighted)=3.3953, test(observed)=1.2274, time=11.48s\n\nBest Linear Weights by Test RMSE (weighted, train-derived):\nw0                        1.500000\nwk                        0.050000\nTrain RMSE (weighted)     0.670400\nTest RMSE (weighted)      3.395290\nTest RMSE (observed)      1.227372\nTime (s)                 11.847758\nName: 6, dtype: float64\n</code></pre>\n<h1>Implicit feedback performance</h1>\n<pre><code class=\"language-python\"># --- Exponential popularity weight fn (new) ---\ndef exponential_weight_fn(c_n, w0: float = 1.0, e: float = 0.1):\n    # w_i = w0 + (1 / count_i)^e ; keeps weights finite and softer than linear\n    return w0 + np.power(1.0 / (c_n + 1e-8), e)\n\n# --- Signed prefs ---\ndef to_signed_matrix(df):\n    R = np.zeros((num_users, num_items), dtype=np.float32)\n    for row in df.itertuples():\n        R[row.user_id, row.item_id] = 1.0 if row.rating &gt;= 4 else -1.0\n    return R\n\nR_train = to_signed_matrix(train_data)\nR_test  = to_signed_matrix(test_data)\n\n# deterministic sign with tie-break to +1\ndef sign_with_tiebreak(x):\n    s = np.sign(x)\n    s[s == 0] = 1.0\n    return s\n\ndef compute_sign_accuracy(R_true, R_pred):\n    mask = (R_true != 0)\n    preds = sign_with_tiebreak(R_pred)\n    return np.mean((preds[mask] == R_true[mask]).astype(np.float32))\n\n# ---- Train models (100 iters for all) ----\nrank, iters, reg = 10, 100, 10.0\n\nstart = time.time()\nU_reg, V_reg, _, _ = als_with_regularization(\n    R_train, rank=rank, iters=iters, reg=reg,\n    seed=42, zero_means_missing=True\n)\ntime_reg = time.time() - start\n\nstart = time.time()\nU_wals_lin, V_wals_lin, _, _ = weighted_als(\n    R_train, 
rank=rank, iters=iters, reg=reg,\n    weight_fn=linear_weight_fn, seed=42, zero_means_missing=True\n)\ntime_wals_lin = time.time() - start\n\nstart = time.time()\nU_wals_exp, V_wals_exp, _, _ = weighted_als(\n    R_train, rank=rank, iters=iters, reg=reg,\n    weight_fn=exponential_weight_fn, seed=42, zero_means_missing=True\n)\ntime_wals_exp = time.time() - start\n\n# ---- Baselines (sign) ----\n# Global-mean sign with tie-break\nnonzero = (R_train != 0)\ngm = R_train[nonzero].mean() if nonzero.any() else 0.0\ngm_sign = 1.0 if gm == 0 else float(np.sign(gm))\n\nbaseline_train_acc = np.mean((R_train[nonzero] == gm_sign).astype(np.float32))\nnonzero_test = (R_test != 0)\nbaseline_test_acc  = np.mean((R_test[nonzero_test] == gm_sign).astype(np.float32))\n\n# Item-bias score then sign\nitem_sums   = R_train.sum(axis=0)\nitem_counts = (R_train != 0).sum(axis=0)\nitem_biases = (item_sums - item_counts * gm) / (item_counts + 1e-8)\nR_itembias  = np.tile(gm, (num_users, num_items)).astype(np.float32) + item_biases\nR_itembias_signed = sign_with_tiebreak(R_itembias)\n\nitembias_train_acc = compute_sign_accuracy(R_train, R_itembias_signed)\nitembias_test_acc  = compute_sign_accuracy(R_test,  R_itembias_signed)\n\n# ---- Collect results ----\nmethods = [\n    \"ALS Regularized\",\n    \"Weighted ALS (Linear)\",\n    \"Weighted ALS (Exponential)\",\n    \"Global Mean Baseline\",\n    \"Item Bias Baseline\",\n]\n\ntrain_acc = [\n    compute_sign_accuracy(R_train, U_reg @ V_reg.T),\n    compute_sign_accuracy(R_train, U_wals_lin @ V_wals_lin.T),\n    compute_sign_accuracy(R_train, U_wals_exp @ V_wals_exp.T),\n    baseline_train_acc,\n    itembias_train_acc,\n]\n\ntest_acc = [\n    compute_sign_accuracy(R_test, U_reg @ V_reg.T),\n    compute_sign_accuracy(R_test, U_wals_lin @ V_wals_lin.T),\n    compute_sign_accuracy(R_test, U_wals_exp @ V_wals_exp.T),\n    baseline_test_acc,\n    itembias_test_acc,\n]\n\ntimes = [time_reg, time_wals_lin, time_wals_exp, 0.0, 0.0]\n\nresults = 
pd.DataFrame({\n    \"Method\": methods,\n    \"Train Accuracy\": train_acc,\n    \"Test Accuracy\": test_acc,\n    \"Time (s)\": times,\n})\nprint(results)\n\n</code></pre>\n<pre><code>                       Method  Train Accuracy  Test Accuracy   Time (s)\n0             ALS Regularized        0.788575       0.654957   0.108554\n1       Weighted ALS (Linear)        0.550900       0.546087  14.665292\n2  Weighted ALS (Exponential)        0.550900       0.546087  17.718337\n3        Global Mean Baseline        0.550900       0.546087   0.000000\n4          Item Bias Baseline        0.677163       0.670261   0.000000\n</code></pre>\n<h1>Comparing with surprise SVD</h1>\n<pre><code class=\"language-bash\">%%bash\nuv run --python 3.11 --no-project \\\n  --with surprise --with \"numpy&lt;2\" --with pandas --with requests - &lt;&lt;'EOF'\nfrom surprise import SVD, Dataset, Reader, accuracy\nimport pandas as pd\nimport urllib3, requests, io\n\n# Insecure download (ignore cert)\nurllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\nurl = \"https://files.grouplens.org/datasets/movielens/ml-100k/u.data\"\nresp = requests.get(url, verify=False, timeout=30)\nresp.raise_for_status()\nratings = pd.read_csv(io.BytesIO(resp.content), sep=\"\\t\",\n                      names=[\"user_id\", \"item_id\", \"rating\", \"timestamp\"])\n\n# Zero-index\nratings[\"user_id\"] -= 1\nratings[\"item_id\"] -= 1\n\n# Time split\nratings_sorted = ratings.sort_values(\"timestamp\").reset_index(drop=True)\nsplit_idx = int(len(ratings_sorted) * 0.8)\ntrain_data = ratings_sorted[:split_idx].copy()\ntest_data  = ratings_sorted[split_idx:].copy()\n\n# Keep only test users with &gt;=10 ratings in train (don't shrink train)\nuser_counts = train_data['user_id'].value_counts()\nvalid_users = set(user_counts[user_counts &gt;= 10].index)\ntest_data = test_data[test_data['user_id'].isin(valid_users)]\n\n# Ensure test items exist in train (avoid Surprise cold-start)\ntrain_items = 
set(train_data['item_id'].unique())\ntest_data = test_data[test_data['item_id'].isin(train_items)]\n\nprint(f\"Train users: {train_data['user_id'].nunique()}, Ratings: {len(train_data)}\")\nprint(f\"Test  users: {test_data['user_id'].nunique()}, Ratings: {len(test_data)}\")\n\n# Surprise setup\nreader = Reader(rating_scale=(1, 5))\ntrain_dataset = Dataset.load_from_df(train_data[['user_id','item_id','rating']], reader)\ntrainset = train_dataset.build_full_trainset()\ntestset = list(test_data[['user_id','item_id','rating']].itertuples(index=False, name=None))\n\n# Train SVD (biased MF)\nalgo = SVD()\nalgo.fit(trainset)\n\n# Evaluate observed RMSE\npred_test  = algo.test(testset)\ntest_rmse  = accuracy.rmse(pred_test, verbose=False)\n\npred_train = algo.test(trainset.build_testset())\ntrain_rmse = accuracy.rmse(pred_train, verbose=False)\n\nprint(f\"Train RMSE: {train_rmse:.4f}\")\nprint(f\"Test  RMSE: {test_rmse:.4f}\")\nEOF\n\n</code></pre>\n<pre><code>Train users: 751, Ratings: 80000\nTest  users: 105, Ratings: 2786\nTrain RMSE: 0.6786\nTest  RMSE: 0.9774\n</code></pre>\n<pre><code class=\"language-bash\">%%bash\nuv run --python 3.11 --no-project \\\n  --with surprise --with \"numpy&lt;2\" --with pandas --with requests - &lt;&lt;'EOF'\nfrom surprise import SVD, Dataset, Reader\nimport pandas as pd\nimport urllib3, requests, io\n\n# Insecure download (ignore cert)\nurllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\nurl = \"https://files.grouplens.org/datasets/movielens/ml-100k/u.data\"\nresp = requests.get(url, verify=False, timeout=30)\nresp.raise_for_status()\nratings = pd.read_csv(io.BytesIO(resp.content), sep=\"\\t\",\n                      names=[\"user_id\",\"item_id\",\"rating\",\"timestamp\"])\n\n# Zero-index\nratings[\"user_id\"] -= 1\nratings[\"item_id\"] -= 1\n\n# Time split\nratings_sorted = ratings.sort_values(\"timestamp\").reset_index(drop=True)\nsplit_idx = int(len(ratings_sorted) * 0.8)\ntrain_data = 
ratings_sorted[:split_idx].copy()\ntest_data  = ratings_sorted[split_idx:].copy()\n\n# Filter test users with &gt;=10 ratings in TRAIN (do NOT shrink train)\nuser_counts = train_data['user_id'].value_counts()\nvalid_users = set(user_counts[user_counts &gt;= 10].index)\ntest_data = test_data[test_data['user_id'].isin(valid_users)]\n\n# Ensure test items exist in train (avoid cold-start)\ntrain_items = set(train_data['item_id'].unique())\ntest_data = test_data[test_data['item_id'].isin(train_items)]\n\n# Binarize ratings to ±1\ntrain_data = train_data.copy()\ntest_data  = test_data.copy()\ntrain_data[\"rating\"] = (train_data[\"rating\"] &gt;= 4).astype(int).replace({0:-1, 1:1})\ntest_data[\"rating\"]  = (test_data[\"rating\"]  &gt;= 4).astype(int).replace({0:-1, 1:1})\n\nprint(f\"Train users: {train_data['user_id'].nunique()}, Ratings: {len(train_data)}\")\nprint(f\"Test  users: {test_data['user_id'].nunique()}, Ratings: {len(test_data)}\")\n\n# Surprise setup (scale is -1..1 after binarization)\nreader = Reader(rating_scale=(-1, 1))\ntrain_dataset = Dataset.load_from_df(train_data[['user_id','item_id','rating']], reader)\ntrainset = train_dataset.build_full_trainset()\ntestset  = list(test_data[['user_id','item_id','rating']].itertuples(index=False, name=None))\n\n# Train SVD (biased MF) — regression on [-1,1]\nalgo = SVD(n_factors=20, n_epochs=10, lr_all=0.005, reg_all=0.02)\nalgo.fit(trainset)\n\n# Sign-accuracy (tie at 0 =&gt; positive)\ndef sign_accuracy(preds):\n    correct = 0\n    for p in preds:\n        pred_label = 1 if p.est &gt;= 0 else -1\n        correct += (pred_label == p.r_ui)\n    return correct / len(preds) if preds else 0.0\n\npred_test  = algo.test(testset)\ntest_acc   = sign_accuracy(pred_test)\n\npred_train = algo.test(trainset.build_testset())\ntrain_acc  = sign_accuracy(pred_train)\n\nprint(f\"Train Accuracy: {train_acc:.4f}\")\nprint(f\"Test  Accuracy: {test_acc:.4f}\")\nEOF\n\n</code></pre>\n<pre><code>Train users: 751, Ratings: 
80000\nTest  users: 105, Ratings: 2786\nTrain Accuracy: 0.7398\nTest  Accuracy: 0.6831\n</code></pre>\n<h1>Real-World Conclusion</h1>\n<p>Under a realistic temporal split, <strong>plain ALS did best on implicit feedback</strong>: it trained in roughly 0.1 seconds and reached 65.5% test sign-accuracy, well above the global-mean baseline (54.6%), though the item-bias baseline (67.0%) and Surprise’s SVD (68.3%) still edged it out. Notably, both WALS variants matched the global-mean baseline exactly here, suggesting their predictions collapsed to a single sign. On explicit feedback, <strong>Weighted ALS (WALS)</strong> outperformed plain ALS but still failed to beat strong baselines like item bias or Surprise’s SVD. <strong>WALS is also much slower in Python</strong>, because each user and item requires its own weighted matrix solve. Despite that, ALS remains a strong, efficient choice, especially for large-scale or implicit recommendation tasks.</p>\n<p><em>Note</em>: on closer analysis, our ALS implementations did not include user/item bias terms, which are known to capture a large portion of the signal in explicit rating data. Models like Surprise’s SVD include these biases, so part of their performance edge comes from that. We’ll come back with a better implementation.</p>\n<p>[^1]: Weisstein, Eric W. \"Normal Equation.\" From MathWorld--A Wolfram Web Resource. <a href=\"https://mathworld.wolfram.com/NormalEquation.html\">https://mathworld.wolfram.com/NormalEquation.html</a></p>\n<p>[^2]: Stanford CS229 Project. \"Weighted Alternating Least Squares.\" Accessed from: <a href=\"https://cs229.stanford.edu/proj2017/final-posters/5147271.pdf\">https://cs229.stanford.edu/proj2017/final-posters/5147271.pdf</a></p>\n<p>[^3]: He, Xiangnan, et al. (2017). \"Fast Matrix Factorization for Online Recommendation with Implicit Feedback.\" <em>Proceedings of SIGIR</em>. <a href=\"https://arxiv.org/pdf/1708.05024\">https://arxiv.org/pdf/1708.05024</a></p>\n",
      "summary": "A deep dive into building ALS for recommendation systems in Python. Includes the full derivation, a vectorized implementation, and an analysis of its real-world performance limits.",
      "date_published": "2022-08-07T00:53:16.000Z",
      "tags": [
        "lab",
        "python",
        "machine-learning"
      ]
    }
  ]
}