Architecture Analysis
Knowing what a CNN, RNN, or Transformer is represents the first level of understanding. The second level — the one that distinguishes engineers who can build new systems from those who can only replicate tutorials — is the ability to analyze an unfamiliar problem, reason about which architecture fits it, identify what will go wrong with plausible alternatives, and defend the choice with explicit reference to the properties of the data and the requirements of the task. That is what this lesson trains.
A Framework for Architecture Selection
Every architecture encodes inductive biases: assumptions about the structure of the problem baked into the design. A CNN assumes local spatial correlations and that a pattern means the same thing wherever it appears (translation equivariance in the convolutions, approximate invariance after pooling). An RNN assumes ordered, sequential dependency. A Transformer assumes that arbitrary pairwise relationships between positions are important and that the entire sequence is available at once. Choosing an architecture is choosing which inductive biases to impose. A principled selection process asks six questions.

1. What is the input structure? Is it a fixed-size grid (images: CNN), a variable-length sequence (text, audio: Transformer or RNN), a graph (molecules, social networks: Graph Neural Network), or tabular data with no spatial or sequential structure (gradient-boosted trees often outperform neural networks here)?
2. What are the relevant relationships? Are important features local (edges in images: CNN filters) or global (long-range dependencies in text: Transformer attention)? Are the relationships ordered (time series: RNN or causal Transformer) or unordered (a set of features: Transformer without positional encoding)?
3. What is the output structure? Classification (fixed discrete label), regression (continuous value), generation (variable-length sequence), segmentation (per-pixel label), or structured prediction (sequence of labels)? Generation tasks favor autoregressive decoder models; segmentation favors encoder-decoder architectures like U-Net.
4. How much data is available? Large labeled datasets favor training from scratch or full fine-tuning. Small datasets strongly favor pretrained backbones with frozen feature extraction.
5. What are the latency and compute constraints? A Transformer with 70 billion parameters cannot practically run on a mobile device. A tiny CNN can. Deployment constraints may force a less accurate but faster model.
6. Is interpretability required? If regulatory requirements demand explanations, simpler models (linear, shallow trees) may be mandated regardless of accuracy.
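The six questions can be written down as a small checklist in code. The sketch below is only an illustration of the reasoning order, not a definitive selection rule: the names `ProblemProfile`, `suggest_family`, and `SMALL_DATA_THRESHOLD` are hypothetical, the 10,000-example cutoff is an assumed placeholder, and the order in which the checks fire is one reasonable choice among several.

```python
from dataclasses import dataclass

# Assumed cutoff for "small data"; pick a value that fits your domain.
SMALL_DATA_THRESHOLD = 10_000


@dataclass
class ProblemProfile:
    """Answers to the six selection questions (field names are illustrative)."""
    input_structure: str      # "grid", "sequence", "graph", or "tabular"
    long_range: bool          # do important relationships span distant positions?
    output_structure: str     # "classification", "regression", "generation", "segmentation"
    labeled_examples: int     # how much labeled data is available
    tight_compute: bool       # strict latency or power budget?
    needs_explanations: bool  # is interpretability mandated?


def suggest_family(p: ProblemProfile) -> str:
    """Return a starting-point architecture family, not a final answer."""
    if p.input_structure not in {"grid", "sequence", "graph", "tabular"}:
        raise ValueError(f"unknown input structure: {p.input_structure}")
    # Hard requirements come first: they can override accuracy.
    if p.needs_explanations:
        return "linear model or shallow trees"
    if p.input_structure == "tabular":
        return "gradient-boosted trees"
    if p.input_structure == "graph":
        return "graph neural network"
    if p.output_structure == "segmentation":
        return "encoder-decoder CNN (U-Net style)"
    if p.output_structure == "generation":
        return "autoregressive decoder"
    # Remaining cases: match the bias to local vs. global relationships.
    if p.input_structure == "grid":
        family = "CNN"
    elif p.long_range:
        family = "Transformer"
    else:
        family = "RNN"
    if p.tight_compute:
        family = "small " + family
    if p.labeled_examples < SMALL_DATA_THRESHOLD:
        family = "pretrained " + family
    return family


# Spectrogram input, local patterns, on-device budget.
print(suggest_family(ProblemProfile("grid", False, "classification", 1_000_000, True, False)))
# prints "small CNN"
```

A function like this is most useful as a forcing device: it makes you answer all six questions before naming a model, and the result should still be checked against a simple baseline.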
An inductive bias is an assumption. If the assumption matches the data's true structure, it is a powerful advantage — fewer examples are needed to learn the right function. If the assumption is wrong, it is a constraint that prevents the model from learning well. The skill is choosing architectures whose assumptions match the problem.
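One concrete way to see a CNN's inductive bias is to check what weight sharing does to a shifted input. The following is a minimal pure-Python sketch (the helper `conv1d_valid` and the toy signals are made up for illustration): the same kernel is applied at every position, so moving the pattern moves the response by the same amount, and a global max over the response does not change at all.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` across `signal` with no padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 1]  # fires +1 on a rising edge, -1 on a falling edge

signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]  # same pattern, two positions later

out = conv1d_valid(signal, kernel)           # [0, 1, 0, -1, 0, 0, 0]
out_shifted = conv1d_valid(shifted, kernel)  # [0, 0, 0, 1, 0, -1, 0]

# Equivariance: shifting the input by 2 shifts the response by 2.
print(out_shifted == [0, 0] + out[:-2])  # True

# Invariance after global max pooling: the pooled value ignores position.
print(max(out) == max(out_shifted))      # True
```

If the task really does not care where the pattern occurs, this sharing is free data efficiency. If position carries meaning, the same property becomes the constraint described above.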
Let us work through five analysis scenarios.

Scenario 1: Audio keyword spotting (detect 'Hey Siri' in a stream of audio)
Input: raw waveform or short-time Fourier transform spectrogram (a 2D time-frequency image)
Relevant relationships: local patterns in both time and frequency; temporal ordering matters
Output: binary classification per window
Data: millions of labeled utterances
Constraints: must run in real time on a microphone chip with a milliwatt power budget
Analysis: A small CNN applied to the spectrogram captures local time-frequency patterns efficiently. A full Transformer would exceed the compute budget. A tiny depthwise-separable CNN (used in models like MobileNet) is the practical choice. RNNs are viable but slower to deploy in this constrained setting. The key inductive bias match: spectrograms have local structure similar to images.

Scenario 2: Predicting hospital readmission from electronic health records (EHR)
Input: a variable-length sequence of clinical events (diagnoses, procedures, medications) over 12 months, each event occurring at irregular time intervals
Relevant relationships: some dependencies are short-range (drug interaction), some long-range (chronic condition diagnosed months ago predicting current episode)
Output: binary classification (readmitted within 30 days)
Data: 100,000 patients
Constraints: predictions needed within seconds; result must be explainable to clinicians
Analysis: A Transformer encoder handles variable-length sequences and long-range dependencies well. Irregular timestamps can be handled with learned time embeddings. However, interpretability is required: attention weights provide some evidence of which events influenced the prediction, though they are not fully reliable explanations. A simpler logistic regression or gradient-boosted model on hand-engineered features may be mandated by clinical governance requirements, trading accuracy for auditability.
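To make the compute argument in Scenario 1 concrete, here is a back-of-the-envelope parameter count for one convolutional layer. It is a sketch under stated assumptions: biases are ignored, the function names are illustrative, and the 3×3 kernel with 64 input and 64 output channels is an arbitrary example rather than a layer from any particular model.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 64  # assumed example layer
std = standard_conv_params(k, c_in, c_out)        # 36864
sep = depthwise_separable_params(k, c_in, c_out)  # 4672
print(std, sep, round(std / sep, 2))              # 36864 4672 7.89
```

For this example layer the separable version needs roughly an eighth of the weights, which is the kind of saving that makes a milliwatt budget plausible.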
Scenario 3: Protein structure prediction from amino acid sequence
Input: a sequence of amino acids (a string of up to ~2000 characters from an alphabet of 20 characters)
Relevant relationships: residues far apart in sequence can be spatially close (forming disulfide bonds, hydrophobic cores); the entire sequence must be considered to predict 3D structure
Output: 3D coordinates of every atom
Data: ~200,000 known structures (Protein Data Bank), plus billions of evolutionary sequences without known structures
Analysis: AlphaFold2 uses a Transformer-based architecture (Evoformer) that computes attention between all pairs of residues, precisely because long-range interactions are critical and cannot be captured by local filters. The quadratic attention cost is manageable at sequence lengths of ~2000. Self-supervised pretraining on evolutionary sequence data provides a huge boost. This is a case where the Transformer's ability to model arbitrary pairwise relationships is not a luxury but a necessity.

Scenario 4: Real-time fraud detection on credit card transactions
Input: structured tabular features per transaction (merchant category, amount, time since last transaction, geographic distance from last transaction, ~50 features total)
Output: binary fraud probability (score between 0 and 1)
Data: millions of labeled historical transactions; fraud rate ~0.1%
Constraints: decision in under 10 milliseconds; highly imbalanced classes
Analysis: This is a tabular data problem. Neural networks do not consistently outperform gradient-boosted trees (XGBoost, LightGBM) here, and trees are typically faster to train and serve, easier to inspect, and cope well with imbalanced data once class weights or thresholds are tuned. The severe class imbalance requires techniques like SMOTE (synthetic minority oversampling), class-weighted loss, or threshold tuning. If a neural network is used, it is typically shallow (2-3 layers). Architecture selection here correctly says: do not default to deep learning.

Scenario 5: Video action recognition (classify 'playing basketball' vs. 'swimming' in a 10-second clip)
Input: a sequence of video frames (each a height × width × channels array), so the input is 4D: time × height × width × channels
Relevant relationships: spatial patterns within each frame (CNN) and temporal patterns across frames
Output: single class label per clip
Analysis: There are two dominant approaches. 3D CNNs (C3D, I3D) apply 3D convolutions across time and space jointly, capturing local spatiotemporal features. Transformer-based video models (ViT applied to frame patches, with temporal attention) capture long-range temporal dependencies but are computationally expensive. A practical pipeline: extract per-frame features with a pretrained image CNN, then apply a lightweight Transformer or LSTM across the frame sequence for temporal reasoning. This leverages strong pretrained spatial features while learning temporal dynamics efficiently.
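The class-weighted loss mentioned in Scenario 4 can be sketched in a few lines. This is a minimal pure-Python illustration, not a production loss: the function names are hypothetical, the weight is set to the negative-to-positive ratio (a common heuristic, assumed here rather than taken from the scenario), and the counts correspond to a 0.1% fraud rate in one million transactions.

```python
import math


def positive_class_weight(n_negative, n_positive):
    """Heuristic weight: how many negatives there are per positive."""
    return n_negative / n_positive


def weighted_log_loss(y_true, p_pred, pos_weight):
    """Binary cross-entropy where errors on positives count `pos_weight` times."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# 0.1% fraud in 1,000,000 transactions: 1,000 fraud vs. 999,000 legitimate.
w = positive_class_weight(n_negative=999_000, n_positive=1_000)  # 999.0

missed_fraud = weighted_log_loss([1], [0.01], w)  # fraud scored as 1% likely
false_alarm = weighted_log_loss([0], [0.99], w)   # legit scored as 99% likely
print(missed_fraud > false_alarm)                 # True
```

With the weight applied, an equally confident mistake on a fraud case costs about 999 times more than one on a legitimate transaction, so the model can no longer reach a low loss by predicting "not fraud" everywhere. The decision threshold still needs tuning on held-out data afterwards.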
Critiquing Architecture Choices
A good architecture critique identifies three elements: the assumption the choice makes, whether the data supports that assumption, and the alternative that would be chosen if the assumption were wrong.

Example: A team proposes using a vanilla RNN to translate legal documents of 2000 words.
Critique: The RNN's fixed-size hidden state creates a bottleneck, because it must compress 2000 words of legal meaning into one vector. Legal language has complex long-range dependencies (a definition in clause 1 governs meaning throughout the document). The sequential computation prevents training parallelism. The correct choice is a Transformer encoder-decoder, which handles long-range dependencies via direct attention and trains efficiently in parallel. If compute is severely constrained, an LSTM with attention is a compromise.

Example: A team proposes a 175-billion-parameter GPT-style model for an app that classifies user reviews as positive, negative, or neutral, trained on 50,000 reviews.
Critique: This is severe overkill. A 50,000-sample classification task can be solved effectively by a fine-tuned DistilBERT (66 million parameters) or even a pretrained sentence embedding plus logistic regression. Using 175 billion parameters introduces enormous inference cost, requires expensive serving infrastructure, provides no accuracy advantage for a simple three-class problem, and makes debugging and monitoring harder. Right-size the model to the task.
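The serving-cost point in the second critique can be checked with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, caches, and framework overhead; the function name is illustrative.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


large = weight_memory_gb(175_000_000_000)  # 350.0 GB
small = weight_memory_gb(66_000_000)       # 0.132 GB
print(large, small, round(large / small))  # 350.0 0.132 2652
```

Under these assumptions the 175-billion-parameter model needs hundreds of gigabytes spread across several accelerators before it serves a single request, while the 66-million-parameter model fits comfortably on one modest device.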
The most common error in applied deep learning is over-engineering: reaching for the largest model, most complex architecture, and most sophisticated training setup before establishing a simple baseline. A logistic regression baseline tells you whether the features contain signal. A small pretrained model tells you whether transfer helps. Only escalate to large-scale training when the simpler approaches have a documented ceiling.
A team needs to classify customer support emails into 12 categories. They have 8,000 labeled emails. They propose training a 1-billion-parameter Transformer from scratch. What is the primary problem with this approach?
A researcher replaces an LSTM with a Transformer for classifying 500-character tweets. The Transformer achieves slightly better accuracy but takes 8x longer to train. Is the trade-off justified?
Architecture Decision Brief
- Step 1. Choose one of these problems: (a) classify 10 years of sensor readings from a wind turbine into 'normal' vs. 5 types of fault, with 500,000 labeled windows of 100 timesteps each; (b) generate captions for satellite images of disaster zones for emergency responders, given 20,000 image-caption pairs; (c) predict the next note in a melody given the preceding 32 notes, trained on 50,000 MIDI songs.
- Step 2. Apply the six selection questions: input structure, relevant relationships, output structure, data size, compute constraints, interpretability.
- Step 3. Recommend a specific architecture (name it; do not just say 'a neural network'). Identify the inductive bias that makes it a good match.
- Step 4. Describe the most plausible alternative and state one concrete reason to prefer your recommendation over it.
- Step 5. Identify one assumption your recommendation makes that could turn out to be wrong, and describe how you would detect this after training.
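If you pick option (c) in Step 1, it helps to see how the task turns into training examples before reasoning about the architecture. The sketch below is one possible framing, assuming each melody is a list of MIDI pitch numbers and every position after the first 32 notes yields one (context, target) pair; the function name and the toy melody are made up for illustration.

```python
def next_note_pairs(melody, context_len=32):
    """Turn one melody into (preceding notes, next note) training pairs."""
    pairs = []
    for end in range(context_len, len(melody)):
        context = melody[end - context_len:end]  # the preceding 32 notes
        target = melody[end]                     # the note to predict
        pairs.append((context, target))
    return pairs


# Toy melody: 40 ascending MIDI pitches starting at middle C (60).
melody = list(range(60, 100))
pairs = next_note_pairs(melody)

print(len(pairs))       # 8
print(pairs[0][1])      # 92
print(len(pairs[0][0])) # 32
```

Framed this way, the six questions become easier to answer: the input is a short fixed-length ordered sequence, the output is a single class over possible notes, and the number of training pairs is far larger than the number of songs.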