Key Steps
- Setting up the problem: ask clarifying questions to narrow down the problem scope
- Understanding scale and latency requirements: performance and capacity
- Latency: Do we need to return results within X milliseconds?
- Capacity: How many items/requests/queries must the system handle?
- Defining metrics:
- Offline: Test models during development (AUC, precision, recall, F1, NDCG); see the sketch after these bullets
- Online: Test model in production
- Component based: metric for specific components (search ranking: NDCG)
- End-to-end: Metric for the overall system (user engagement & retention rate)
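A minimal sketch of computing the offline metrics above with scikit-learn; the labels and scores are made-up placeholders, not real data:

```python
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, ndcg_score

y_true = [1, 0, 1, 1, 0]              # ground-truth relevance labels (placeholder)
y_score = [0.9, 0.3, 0.8, 0.4, 0.2]   # model scores (placeholder)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded predictions

print("AUC:      ", roc_auc_score(y_true, y_score))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
# NDCG takes per-query lists of true relevances and predicted scores
print("NDCG:     ", ndcg_score([y_true], [y_score]))
```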
- Architecture Discussion:
- Which components are needed and how data flows through the system
- Architecting for scale: one complex ML system vs. a funnel approach (staged models where each subsequent stage receives less data, progressively filtering down to the final result); see the funnel sketch below
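A minimal sketch of the funnel idea; the stage scoring functions here are hypothetical stand-ins, not a real library:

```python
def funnel_rank(candidates, stages, cutoffs):
    """Run candidates through successively costlier scoring stages,
    keeping only the top `keep` items after each stage."""
    for score_fn, keep in zip(stages, cutoffs):
        candidates = sorted(candidates, key=score_fn, reverse=True)[:keep]
    return candidates

# e.g. 100k candidates -> 1,000 after a cheap model -> 100 after a heavy ranker:
# final = funnel_rank(all_items, [cheap_model.score, heavy_model.score], [1000, 100])
```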
- Offline model building and evaluation:
- 2 ways to get training data: collection / generation
- Collection: human labelers, open-source datasets, a pre-existing system (user interactions)
- Feature Engineering
- pinpoint the actors involved in the given task: users, system, context (see the feature sketch after this group)
- Ex: (user, movies available on Netflix, upcoming holiday)
- Historical engagement: user’s interaction in the last 3 months
- Cross features: user-media cross features
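A hypothetical feature sketch for the (user, media, context) actors above; the field names are illustrative assumptions:

```python
def build_features(user, media, context):
    return {
        "user_watch_count_3m": user["watch_count_3m"],  # historical engagement (last 3 months)
        "media_genre": media["genre"],                   # item-side feature
        "is_holiday": context["is_holiday"],             # context feature
        # cross feature: how often this user watches this media's genre
        "user_genre_watch_rate": user["genre_watch_rates"].get(media["genre"], 0.0),
    }
```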
- Model Training
- Choose an ML model for the given task, keeping performance and capacity in mind; tune hyperparameters
- Funnel approach: select simpler models for the top of the funnel and more complex ones further down
- use a pre-trained SOTA model for transfer learning (see the sketch below)
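A minimal transfer-learning sketch, assuming PyTorch/torchvision and a ResNet backbone as the example pre-trained model:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")   # pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                    # freeze pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 2)      # new trainable head for our task
```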
- Offline Eval:
- use validation and test datasets (a split sketch follows this group)
- Top models that show most promise are taken to the next stage
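A sketch of carving out validation and test sets with scikit-learn; the data and split ratios are placeholder assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)  # placeholder data

# 70% train, 15% validation (model selection), 15% test (final offline check)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```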
- Online Eval:
- After selecting the top models, evaluate them online
- component and end-to-end eval
- If there is a significant improvement in online metrics, we can deploy the model (see the significance-check sketch below)
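A sketch of checking whether an online A/B lift is statistically significant, using a two-proportion z-test from statsmodels; the counts are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

engaged = [5400, 5150]     # engaged users: new model vs. current model (made-up counts)
exposed = [50000, 50000]   # users exposed to each variant
stat, p_value = proportions_ztest(engaged, exposed)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # deploy only if the lift is significant
```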
- Iterative Model Development
- Model may perform well offline but not online
- Debug: is a specific component failing? Are feature distributions different between training and serving? (see the drift-check sketch after this list)
- Monitor performance
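A sketch of flagging a train-vs.-serving feature distribution mismatch with a two-sample KS test (scipy); the feature columns are synthetic stand-ins:

```python
import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(0.0, 1.0, 10_000)    # stand-in for a training feature column
serving_feature = np.random.normal(0.3, 1.0, 10_000)  # same feature, drifted at serving time
stat, p_value = ks_2samp(train_feature, serving_feature)
if p_value < 0.01:
    print(f"Distribution shift detected (KS statistic = {stat:.3f}) -- investigate this feature")
```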
Performance and Capacity considerations
Major capacity discussions happen in 2 phases:
- Training time: How much training data and capacity is needed to build the model?
- Evaluation time: What are the SLAs to meet while serving the model, and what are the capacity needs?
We need to measure the complexity of the ML model and use it in the decision process when building the ML system architecture and selecting the ML modeling technique.
3 types of complexities (a timing sketch follows this list):
- Training complexity: time taken to train the ML model for a given task
- Evaluation complexity: time taken by the model to evaluate an input at testing time
- Sample complexity: total # of training examples needed
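A sketch of measuring training vs. evaluation time for one candidate model; the dataset and model choice are illustrative assumptions:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

t0 = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X, y)   # training complexity
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X[:1000])                               # evaluation complexity
eval_time = time.perf_counter() - t0

print(f"train: {train_time:.3f}s, eval (1k rows): {eval_time * 1000:.1f} ms")
```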
Comparison of complexities
Assume that