Domain-driven Data Science

Pietro Peterlongo

PyData NYC, Nov 8 2024

github.com/pietroppeter/domain-driven-data-science

Agenda

  1. Why is domain important? 🤹
  2. Logistics and Supply Chain 📦
     • Forecasting 🔮
  3. Stories and Ideas 💡

👋 Pietro (he/him) 👨‍👩‍👧🇮🇹🐍🏔️⛵️🎭

  • 🧮👨‍🔬 (Applied) Math

1. Domain 🤹

Data Science Venn Diagram, Drew Conway, 2010

📚 Content Production

🧮 Math

🧑‍💻 Code

🚚 Domain

Success of DS/ML/AI Projects

Technology

  • data quality
  • model accuracy
  • implementation

Business

  • stakeholders
  • valuable
  • used

Inspiration 💡

Domain Driven Design

  • from SWE
  • domain is important for devs
  • ubiquitous language

Data Mesh

Our product: witboost

Data Mesh

Data Architecture

🎧 podcast

2. Logistics & Supply Chain 🚚

What is Logistics?

"an army without its baggage train is lost; without provisions it is lost; without bases of supply it is lost" Art of War, Sun Tzu, 5th BC

Storage

Transportation

Automation

Modern Supply Chains

  • extended networks of suppliers
  • demand-driven market
  • compete on both service and cost
  • turbulence and volatility

Planning

Demand

Service

Stock

Uncertainty

Demand is uncertain, Supply is uncertain

Lead Time

Safety Stock

  • extra stock held to mitigate the risk of stockouts caused by uncertainty
  • linked to demand variability
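
To make the link between safety stock, demand variability and lead time concrete, here is a minimal sketch of one common textbook formula. It is not taken from the talk. It assumes independent, roughly normal daily demand, a fixed lead time and a target cycle service level. The function name and the numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def safety_stock(demand_std_per_day: float, lead_time_days: float, service_level: float) -> float:
    """Safety stock = z * sigma_d * sqrt(L), assuming normal i.i.d. daily demand."""
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must be strictly between 0 and 1")
    # Safety factor z: the standard normal quantile of the target service level.
    z = NormalDist().inv_cdf(service_level)
    return z * demand_std_per_day * sqrt(lead_time_days)


# Illustrative numbers (assumptions): same lead time, different demand variability.
low_var = safety_stock(demand_std_per_day=10, lead_time_days=4, service_level=0.95)
high_var = safety_stock(demand_std_per_day=20, lead_time_days=4, service_level=0.95)
```

Under these assumptions, doubling the demand standard deviation doubles the buffer, while the lead time only enters through its square root.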

Constraints

  • capacity: production, budget, storage, time
  • integrality: minimum lot, incremental lot, units of measure
  • feasibility: schedules, opening hours

Optimization techniques (Operations Research)
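
As a small illustration of the lot constraints listed above, the sketch below rounds a required quantity up to the nearest feasible order. It assumes an order must be at least the minimum lot and can then grow only in whole incremental lots. The function name and this exact rule are assumptions for illustration, since real systems differ.

```python
from math import ceil


def round_to_lot(quantity: float, minimum_lot: int, incremental_lot: int) -> int:
    """Smallest feasible order >= quantity: minimum lot plus whole incremental lots."""
    if minimum_lot <= 0 or incremental_lot <= 0:
        raise ValueError("lot sizes must be positive")
    if quantity <= 0:
        return 0  # nothing is needed, so nothing is ordered
    if quantity <= minimum_lot:
        return minimum_lot
    # Number of whole incremental lots needed on top of the minimum lot.
    extra_lots = ceil((quantity - minimum_lot) / incremental_lot)
    return minimum_lot + extra_lots * incremental_lot


order = round_to_lot(130, minimum_lot=100, incremental_lot=25)
```

Even this simple rule makes the feasible set discrete, which is one reason planning problems are usually handed to integer optimization rather than solved with continuous formulas.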

2.1 Forecasting 🔮

Modelling

Dimensions

  • Product (SKUs, Family, Brands, Divisions)
  • Market (Warehouses, Regions, Customers, Channels)
  • Time (day, week, month)

Algorithms

Statistical

Exponential smoothing:

$$y_{t+1} = \alpha x_t + (1 - \alpha)y_t$$

  • Holt-Winters (trend, seasonality)
  • Croston (intermittent demand)
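
The simple exponential smoothing recursion above can be sketched in a few lines of plain Python. Here x_t is the observed demand and y_t the forecast, as in the formula. Starting from the first observation is an assumption made for this sketch, and the function name is illustrative.

```python
def exponential_smoothing(observations, alpha, initial_forecast=None):
    """One-step-ahead forecasts using y_{t+1} = alpha * x_t + (1 - alpha) * y_t.

    The i-th returned value is the forecast for the period after observations[i].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if not observations:
        return []
    # Assumption: with no initial forecast given, start from the first observation.
    forecast = observations[0] if initial_forecast is None else initial_forecast
    forecasts = []
    for x in observations:
        forecast = alpha * x + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


forecasts = exponential_smoothing([10, 20, 30], alpha=0.5)
```

A larger alpha reacts faster to recent demand, and a smaller alpha smooths more. With alpha equal to 1 the forecast is simply the last observation.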

ML

  • boosting methods (LightGBM, ...)
  • neural methods (DeepAR, ...)
  • foundation models (TimeGPT)

Evaluation

Metrics

  • Bias
  • RMSE
  • MAE
  • ❌ MAPE

Vandeput's article
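
The metrics listed above can be sketched as follows. The sign convention (error = forecast minus actual, so a positive bias means over-forecasting) is a choice made for this sketch, and the function name is illustrative. One reason MAPE is often discouraged for demand data is visible directly in the code: it is undefined whenever actual demand is zero.

```python
from math import sqrt


def forecast_metrics(actuals, forecasts):
    """Bias, MAE, RMSE and MAPE for paired actuals and forecasts."""
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("actuals and forecasts must be non-empty and of equal length")
    errors = [f - a for a, f in zip(actuals, forecasts)]
    n = len(errors)
    metrics = {
        "bias": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(sum(e * e for e in errors) / n),
    }
    # MAPE divides by the actuals, so it cannot be computed if any actual is zero.
    if any(a == 0 for a in actuals):
        metrics["mape"] = None
    else:
        metrics["mape"] = sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n
    return metrics


# Illustrative data (assumption): one period has zero demand.
result = forecast_metrics(actuals=[10, 0, 5, 20], forecasts=[8, 1, 5, 26])
```

With intermittent demand, which is common at SKU level, periods with zero sales are frequent, so a metric that breaks on zeros is hard to use in practice.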

M5 Competition

  • Kaggle competition
  • Walmart Sales data
  • Point and probabilistic forecasts
  • 12 aggregation levels (product and market)
  • WRMSSE (Weighted Root Mean Squared Scaled Error)

research article
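
To give an idea of what WRMSSE measures, here is a simplified sketch. RMSSE scales the forecast error of a series by the in-sample error of a naive "same as previous period" forecast, and WRMSSE is a weighted average of RMSSE over series. The weights are taken as given here. The official M5 evaluation has further details, such as how the weights are derived, that this sketch leaves out, and the function names are illustrative.

```python
from math import sqrt


def rmsse(train, actuals, forecasts):
    """Forecast MSE scaled by the in-sample MSE of the naive one-step forecast."""
    if len(train) < 2:
        raise ValueError("need at least two training observations")
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("actuals and forecasts must be non-empty and of equal length")
    naive_mse = sum((train[t] - train[t - 1]) ** 2 for t in range(1, len(train))) / (len(train) - 1)
    if naive_mse == 0:
        raise ValueError("training series is constant, so the scale is zero")
    forecast_mse = sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)
    return sqrt(forecast_mse / naive_mse)


def wrmsse(scores, weights):
    """Weighted average of per-series RMSSE scores (weights are normalized here)."""
    total = sum(weights)
    if len(scores) != len(weights) or total <= 0:
        raise ValueError("scores and weights must align and weights must sum to a positive value")
    return sum(w * s for s, w in zip(scores, weights)) / total


score = rmsse(train=[1, 2, 3, 4], actuals=[5, 6], forecasts=[5, 8])
```

Because each series is scaled by its own naive error, series with very different volumes can be compared and combined in one number.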

Seasonality

Promotions

Initialization

Forecasting with Nixtla

github.com/pietroppeter/pymi-timeseries-forecasting-nixtla

3. Stories 💡

🔮 Trust in the model

Story: a model to forecast new product sales

  • model is better than baseline
  • users do not trust the numbers
  • they are used to a different process
  • does interpretability help?

🏗️ Data Generating Process

Story: Data Exploration of Warehouse data

  • order data: many more rows for outbound than inbound orders, why?
  • pick task data: missing a day in a specific week, why?
  • ...

Rules of ML

  1. First, design and implement metrics
  2. Choose machine learning over a complex heuristic.

Google's rules of ML

Talk to experts

Interview

Try to uncover opportunities and risks for a data-driven tool you might want to build.

Interface

It helps to iterate with an interactive tool that shows data and visualizations.

Document the domain

e.g. Business, Domain and Data essentials in README

- Company X is investing strategically in 3rd party logistics
- Low costs are key to a successful operation
- Main cost component in a Warehouse is picking (time)
- Using an ABC-class-based positioning of items could help
- Data comes from a WMS
- We have 2 years of data, 1 year of clean data
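
As an illustration of the ABC idea mentioned in the README example, the sketch below classes items by their cumulative share of picks, so that the most-picked items (class A) can be placed in the easiest-to-reach positions. The 80% and 95% thresholds, the function name and the sample data are assumptions for illustration, not values from the talk.

```python
def abc_classes(picks_by_item, a_share=0.80, b_share=0.95):
    """Assign A/B/C classes by cumulative share of picks, most-picked items first."""
    total = sum(picks_by_item.values())
    if total <= 0:
        raise ValueError("total picks must be positive")
    classes = {}
    cumulative = 0
    ranked = sorted(picks_by_item.items(), key=lambda kv: kv[1], reverse=True)
    for item, picks in ranked:
        # Use the share accumulated before this item, so the top item is always class A.
        share_before = cumulative / total
        if share_before < a_share:
            classes[item] = "A"
        elif share_before < b_share:
            classes[item] = "B"
        else:
            classes[item] = "C"
        cumulative += picks
    return classes


# Hypothetical pick counts per SKU.
classes = abc_classes({"sku1": 700, "sku2": 150, "sku3": 120, "sku4": 20, "sku5": 10})
```

Writing the rule down next to the business context makes it easy for a domain expert to challenge the thresholds or the choice of picks as the ranking quantity.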

Random forest vs XGBoost

Context: AutoML for Forecast Initialization

  • boosted trees may predict outside the training range
  • boosted trees are more difficult to calibrate than random forests
  • lessons learned: 1) resist the hype; 2) watch your results.

Learn the Domain

Socials

Experts

Books

Get inspired by Domain

5S methodology

  1. 整理 (seiri) Sort
  2. 整頓 (seiton) Straighten
  3. 清掃 (seiso) Shine
  4. 清潔 (seiketsu) Standardize
  5. しつけ (shitsuke) Sustain

Conclusions

  • Domain: a worthy and relatively untapped opportunity for data scientists
  • More complexity than you might expect
  • Inspired to learn and talk about your domain? 💡

🙏

🧑‍💻 github.com/pietroppeter

🦋 @pietroppeter.bsky.social

🐘 @pietroppeter@fosstodon

👨‍💼 LinkedIn - Pietro Peterlongo

🔵⚪️ agilelab.it

🐍 Come to PyCon Italy! 🤌

May 28-31, Bologna | pycon.it
