How do you feel about the Kedro project template? #208

yetudada · 2020-01-28T11:59:05Z

Introduction

The joy of open-sourcing Kedro is that we've been exposed to diverse opinions and use cases that we didn't think Kedro covered. We're going to proactively ask for feedback, and you will see a lot more "How do you feel about ..." GitHub issues raised by the Kedro maintainers as we try to capture your thoughts on specific issues.

First in the series was a question around introducing telemetry into Kedro and this one is about the project template.

Context

The Kedro project template is based on a template derived from CookieCutter Data Science. Some of our open-source users immediately picked up this relationship and have said that Kedro is a version of CookieCutter Data Science that thought about a pipeline framework, data abstraction, and versioning.

The project template is core to us being able to help you create reusable analytics code, according to the CookieCutter Data Science philosophy, but we've had feedback that the template is considered overwhelming for new users because they're not sure why we create so many directories. We've also observed users not using all of the template, or even removing generated folders in their templates.

Examples of directory removal are present here:

Possible Implementation

We are considering removing non-essential folders and creating directories only when certain actions are taken.

We're proposing the following categorization:

- core directories are essential for Kedro
- nice to have directories are linked to functionality that extends Kedro
- non-essential directories can be removed and do not extend functionality in Kedro

| Folder | Description | Category | Proposed Action |
| --- | --- | --- | --- |
| `conf` | The `conf` directory is the place where all your project configuration is located. Using `conf` encourages a clear and strict separation between project code and configuration. | Core | Keep |
| `data` | A place to store local project data according to a suggested Data Engineering Convention. For production workloads we do not recommend storing data locally, but rather utilizing cloud storage (AWS S3, Azure Blob Storage), distributed file storage or database interfaces through Kedro's Data Catalog. | Nice to have | Keep but remove sub-directories that indicate Data Engineering Convention |
| `docs` | `docs` is where your auto-generated project documentation is saved | Nice to have | Create this directory when `kedro build-docs` is run |
| `logs` | A directory for your Kedro pipeline execution logs | Nice to have | Create this directory when `kedro run` is run |
| `notebooks` | Kedro supports a Jupyter workflow, that allows you to experiment and iterate quickly on your models. `notebooks` is the folder where you can store your Jupyter Notebooks | Nice to have | Keep |
| `references` | Auxiliary folders for project references and standalone results like model artifacts, plots, papers, and statistics | Non-essential | Remove |
| `results` | Auxiliary folders for project references and standalone results like model artifacts, plots, papers, and statistics | Non-essential | Remove |
| `src` | Source directory that contains all your pipeline code | Core | Keep |

This would create the following template when you run kedro new:

conf/
data/
notebooks/
src/
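
The "create this directory when a command is run" actions in the table above amount to lazy directory creation. A minimal sketch of that idea, assuming a hypothetical helper named `ensure_project_dir` (it is not part of Kedro) that a command could call just before it writes to, say, `logs/`:

```python
from pathlib import Path


def ensure_project_dir(project_root: Path, name: str) -> Path:
    """Create `<project_root>/<name>` on first use and return its path.

    Hypothetical helper for illustration only; it is not part of Kedro.
    """
    target = project_root / name
    # exist_ok=True makes repeated calls harmless once the directory exists.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

With something like this, the generated template can start with only the core directories, and the others appear the first time they are needed.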

Questions for you

Note: These are "yes" or "no" questions, but we would like each answer to come with the reason behind it.

We need your help in answering the following:

Are our assumptions around priority for directories correct?
Do you agree with the proposed actions? Yes, no and why?
Do you think that this change would help make Kedro less intimidating for new users of Kedro?
Do you have any other thoughts we should consider for the project template?


datajoely · 2020-01-28T12:28:19Z

I'm fine with removing data; we shouldn't encourage local storage. Same with results.
I'd keep logs since it will be created immediately anyway.
In conf I feel base is not immediately clear. Whilst potentially a breaking change - I'd suggest renaming this public or shared.

ThomasLittrell1 · 2020-01-28T14:22:05Z

I'm not as big a fan of removing the data subdirectories because I utilize them for personal projects (where local storage makes sense). Would it be possible to somehow have different initialization settings? Like 'core' would create the proposed structure and 'full' could create the old full structure? Not sure if that would make things less confusing though...
I never use references or results. In particular, I tend to put references in some other system and results tend to be model outputs in the catalog
Totally agreed with @datajoely's suggestion about conf naming. I've had to explain to a bunch of people what base means.
To be honest, I'm not sure how much this would do to make Kedro less intimidating for new users. It would probably help, but when I work with new users all the struggles are in learning the data abstraction and pipeline abstraction not figuring out what goes where.

khdlim · 2020-01-29T05:21:24Z

I am a big fan of keeping the data subdirectories because I think the data hierarchy that is being used makes sense. It's also good to keep intermediate data artifacts cached on disk for debugging and partial reruns, and doing so guides new users into thinking about how to manage such artifacts in their projects.

I extended ProjectContext to save all outputs to the results folder in timestamped directories but there's no such functionality out of the box (I point the data catalog to data/08_reporting)
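
For readers who want timestamped output folders, the idea can be sketched without touching Kedro internals. This is not the ProjectContext extension described above; the helper name `timestamped_output_dir` and the timestamp format are assumptions made for illustration:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_output_dir(base: Path, now: Optional[datetime] = None) -> Path:
    """Create and return `<base>/<timestamp>` for a single pipeline run.

    Hypothetical helper; the timestamp format is an assumption.
    """
    now = now or datetime.now()
    # A sortable, filesystem-safe timestamp such as 2020-01-29T05.21.24
    run_dir = base / now.strftime("%Y-%m-%dT%H.%M.%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Each run then writes into its own directory under `results/`, so earlier outputs are never overwritten.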

datajoely · 2020-01-29T12:32:16Z

Perhaps we should include a short readme within the /data directory?

yetudada · 2020-01-31T13:27:49Z

Thank you! This is great insight!

I have a further question: is anyone using the global nodes directory in src/<project-name>? This applies to kedro 0.15.4 and above.

khdlim · 2020-02-01T01:09:10Z

@yetudada Yes, I put the node logic there and in each module only try to expose functions which are used as nodes (helper functions are set to private).

And then for most of the actual heavy duty code like models, evaluation, metrics I put them into submodules in another folder like src/<project-name>/engine/models.py, src/<project-name>/engine/metrics.py etc.

nraw · 2020-02-03T10:18:46Z

Agreed with most of what @ThomasLittrell1 wrote.

Suggestion:
Having an option of full project vs core would be nice, so that the bloat is removed in case not needed, but it is kept for larger projects to still be standardised.
In general, I think kedro should try to be modular. It offers a lot, but maybe people want to buy only into a specific feature, like the catalog, without changing all of their practices.

Opinions:
I got used to the data structure and going forward would ask for a way to keep it. It's easier for me to delete a folder (or just not push it to git) than to recreate the exact structure every time.

I used results once where I dumped some charts and I used notebooks once where I created an analysis of what was done. I think it's okay to have them as then people know where to look for them, but would keep them

WaylonWalker · 2020-02-03T17:40:35Z

I really like gatsby's idea of starters. You can start a new project with any starter from the command line with gatsby new <project-name> <starter-url>. You could easily start off with legacy (current format), slim (only core directories), or full (give me the full structure).

A similar effect can be achieved with options in cookiecutter, but I think the community aspect with the gallery makes it really nice.

gatsby starters
https://www.gatsbyjs.org/starters/?v=2

how users contribute starters
https://www.gatsbyjs.org/contributing/submit-to-starter-library/

WaylonWalker · 2020-02-03T17:58:31Z

We are currently using a custom version of node_global from kedro's predecessor. All of our nodes live in <proj_name>/<proj_name>/nodes/<layer>

We do not use src. We use <proj_name>/<proj_name>. It's a personal preference.

For normal use I would not recommend notebooks/, data/, or results/. I do see it being beneficial for learning and creating simple examples. I think there needs to be an option to keep them to reduce friction while learning, but not necessarily encourage their use in real projects intended for use across a team.

What happened to the .ipython directory? One of my biggest complaints with the template was the magic behind %run, and how it changes your current working directory. Our template makes it so that you can run `from project import Project` and `project_instance = Project()`. With those two lines you now have a project_instance with all of the necessary kedro data and functions. It also keeps you in your current working directory.

benjaminjack · 2020-02-03T18:03:38Z

I actually like kedro because it has such an opinionated, detailed project template. If anything, more documentation (e.g. READMEs in each directory) would be helpful. I want to minimize the number of decisions I have to make about where data, results, code, docs, etc. have to go for both myself and my team. Just my two cents.

yetudada · 2020-02-06T23:33:54Z

@ThomasLittrell1 What else do you think would help with making Kedro less intimidating for new users? And how do you usually explain data and pipeline abstraction to them? And how can we make this easier for you?

@khdlim What type of outputs do you have and are they listed in the Data Catalog? It sounds like you could use versioning if you are after a way to save the outputs every single time the pipeline is run. It saves things in timestamped directories.

@nraw We do know about the library components of Kedro like the DataCatalog and Pipeline. How would you change Kedro's messaging so that people knew they could just use the one element, instead of everything?

@WaylonWalker Would you want a way to create a directory structure your way? And, we'll look into the changes to the .ipython directory! Thanks for raising this.

@benjaminjack This is super feedback!

Thank you guys so much for this!

WaylonWalker · 2020-02-07T15:06:36Z

@yetudada I think a curated list of starters is a very powerful tool and can help provide different recommended ways to start a project. I also see that certain teams will have their own internal things they need to add that do not make sense, or may not be possible, to open source. For these cases, should the recommendation be to fork the template repo you want to build from and have your users run kedro new <internal-template-url> or cookiecutter <internal-template-url>? The difference there is minimal really. I think the power comes from having the curated list of official starters that you can browse and determine what works best for your use case.

Not sure if this answers the question you were asking.

ThomasLittrell1 · 2020-02-07T15:06:36Z

@yetudada Usually I'm just showing them a function, then showing the function in the context of the pipeline and, hey, that thing in outputs='...' has an entry in the catalog, and, oh look, here's some parameter value that I'm getting in the function by parsing out a parameters dictionary that I'm passing in through the inputs.

To explain the abstractions, I try to make it as non-abstract as possible. Most of the people I work with come from notebooks, so I like to show how kedro builds on those. Ideas like: if you're used to having a project broken into different cells, each of those cells is now a node and executing a notebook or section of a notebook is like executing a pipeline. Then I point to how they read in data and maybe save a model after training to talk about a catalog and data abstraction.

I'm honestly not sure how to improve on that process because the kedro docs/tutorial are already great. Maybe two quick ideas:

A video/screencast showing how all the parts cohere, e.g. somebody recording themselves doing the spaceflights tutorial. That could reduce the cost of getting started for somebody who's nervous about even doing the tutorial on their own (I know I watch videos on tons of things I could just read about...) and it would replicate my process with new users.
Building on the above, it might be useful to have a guide that shows users coming from notebooks how kedro is just a small step from what they already know.
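
The walkthrough described in this comment (a function, its place in a pipeline, an output name that has a catalog entry, and a parameters dictionary passed in as an input) can be mimicked with a toy in plain Python. This is not Kedro's API: `run_pipeline`, the tuple-based "nodes", and the dictionary standing in for the catalog are all illustrative assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

# A "node" here is just (function, input names, output name).
Node = Tuple[Callable[..., Any], List[str], str]


def scale(values: List[float], parameters: Dict[str, Any]) -> List[float]:
    # The parameter value is parsed out of a parameters dictionary.
    return [v * parameters["factor"] for v in values]


def total(scaled: List[float]) -> float:
    return sum(scaled)


def run_pipeline(nodes: List[Node], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in order; each output is stored in the catalog under its name."""
    for func, inputs, output in nodes:
        catalog[output] = func(*(catalog[name] for name in inputs))
    return catalog


# The dictionary plays the role of the catalog: names map to data.
catalog = {"raw_values": [1.0, 2.0, 3.0], "parameters": {"factor": 2.0}}
pipeline = [
    (scale, ["raw_values", "parameters"], "scaled_values"),
    (total, ["scaled_values"], "total_value"),
]
result = run_pipeline(pipeline, catalog)
print(result["total_value"])  # 12.0
```

Each tuple corresponds to a notebook cell turned into a node, and running the list in order corresponds to executing the notebook from top to bottom.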

MigQ2 · 2020-02-11T21:37:22Z

Good initiative @yetudada, thank you for sharing!

I really like the suggested changes, although I would definitely keep Data Engineering Convention subdirectories in data. Even if the local directories are not used because data is in a cloud storage, the convention can be followed there.

Another suggestion would be to provide a more opinionated template or framework for src on where to define nodes and where to add them to a pipeline (e.g should I add the nodes to a giant pipeline in a pipeline directory or should I create a little pipeline in the same file I wrote the node and import it later?). I think this is one of the key points where kedro projects differ significantly among each other

yetudada · 2020-02-24T11:39:19Z

Hey everyone! Here's an action on the proposed changes to the project template:

| Folder | Priority Change | Action |
| --- | --- | --- |
| `conf` | Medium | Keep, however `base` and `local` will be renamed |
| `data` | Low | Keep sub-directories and include a `README.md` that explains the convention specified here |
| `docs` | Low | Create this directory when `kedro build-docs` is run |
| `logs` | Low | Create this directory when `kedro run` is run |
| `notebooks` | Low | Create this directory when `kedro jupyter notebook` or `kedro jupyter lab` is run |
| `references` | High | Remove, this is implemented in the upcoming `kedro 0.15.6` |
| `results` | High | Remove, this is implemented in the upcoming `kedro 0.15.6` |
| `src` | Minor | Keep, and add a `README.md` about what you'll find in it |

Eventually, resulting in a kedro new template that looks like:

conf/
data/
src/

What we haven't figured out yet:

A way for you to call src by your project name e.g. <proj_name>/<proj_name>. This affects our framework.
And how to create starters as suggested by @WaylonWalker, we really enjoyed the idea of being able to do something like kedro new <internal-template-url>

What else we're going to work on:

Two videos, inspired by @ThomasLittrell1's comments:
- One explaining a walk through of the Spaceflights tutorial
- And an animation deconstructing a Jupyter Notebook into a Kedro project

WaylonWalker · 2020-02-24T15:24:13Z

> And how to create starters as suggested by @WaylonWalker, we really enjoyed the idea of being able to do something like kedro new

The way that cookiecutter is typically used is to run cookiecutter <url-to-git-repo>, so these could simply be separate git repos. To offer more official suggested ones, you could embed an alias inside of kedro such that running kedro new simple would call cookiecutter https://github.com/quantumblack/simple.
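
The alias idea can be sketched as a small lookup that runs before the template is fetched. The table `STARTER_ALIASES` and the function `resolve_starter` are hypothetical names, and the URL is the illustrative one from this comment rather than a known repository:

```python
# Hypothetical alias table; the URL is the illustrative one from the comment.
STARTER_ALIASES = {
    "simple": "https://github.com/quantumblack/simple",
}


def resolve_starter(name_or_url: str) -> str:
    """Return the template location for a known alias, else the argument unchanged."""
    return STARTER_ALIASES.get(name_or_url, name_or_url)
```

Anything that is not a known alias passes through untouched, so an internal template URL keeps working alongside the curated names.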

I think that it would be really cool to have various projects like the spaceflights tutorial already completed as examples that could easily be accessed with kedro new spaceflights. Then folks can use that as a template, or as a place to start trying out what a kedro pipeline feels like when you are interacting with it.

> A way for you to call src by your project name e.g. <proj_name>/<proj_name>. This is a known issue with the way Cookiecutter Data Science creates templates, see here. If you have any ideas then please brainstorm this with us.

There are quite a number of other cookie cutter templates out there. If memory serves me right, many of them use the <proj_name>/<proj_name> format. The first one I pulled up was cookiecutter-flask, and it used that format.

limdauto · 2020-04-17T10:22:38Z

I completely agree with @WaylonWalker here regarding starters. To make the separation of Kedro's library and framework more explicit, I think we should remove the template altogether from the kedro codebase. Instead, we could have a dedicated git repo for templates (all officially supported templates could be in one repo) and pull them according to the CLI arguments supplied to kedro new.

One cautionary note about having too many starters though: A huge value Kedro brings to an organisation is the shared mental model of an analytics pipeline among different functions (DE, DS, SWE, etc.). This shared mental model is enforced by the convention laid out in the template. If we introduce more templates, or starters, we need to make sure they reinforce, and do not take away from, the shared mental model already established in a Kedro pipeline. It'd be a problem if someone creates a template which refers to node as task, for example.

Minyus · 2020-07-01T04:05:23Z

As discussed in #397 and the Zoom meeting with Lais and @921kiyo yesterday, I prepared a simplified and enhanced Kedro project template suitable for both beginners and experts at:

https://github.com/Minyus/kedro_template

The major change is a top-level main.py and an src directory restructured into 4 folders (pipelines, nodes, hooks, and catalogs), like this:

├── conf
├── data
├── kedro_cli.py
├── logs
├── main.py
└── src
    ├── __init__.py
    ├── catalogs
    │   ├── __init__.py
    │   └── catalog.py
    ├── hooks
    │   ├── __init__.py
    │   └── add_catalog_dict.py
    ├── nodes
    │   ├── __init__.py
    │   └── my_module.py
    └── pipelines
        ├── __init__.py
        └── pipeline.py

I would suggest this Kedro project template because users can:

- run the project by either:
  - python main.py (or, for example, /opt/conda/bin/python main.py or /usr/bin/python main.py to use a non-default Python environment):
    - This can allow users to use debugging features of IDEs (VS Code, PyCharm, etc.)
  - kedro run
- declare datasets in either:
  - catalog.py
  - catalog.yml
- add hooks easily

yetudada · 2020-07-06T19:11:42Z

This is great @Minyus! I think you might just be a candidate for the Kedro starters project that is due out in the next release of Kedro. @limdauto took inspiration from @WaylonWalker's idea and we have a way to use your version of the Kedro template to jumpstart your project.

limdauto · 2020-07-13T17:31:01Z

Hi everyone, we have just released 0.16.3. In this version, we add the ability for you to specify your own template when creating a new Kedro project by calling kedro new with a --starter flag, i.e.

kedro new --starter=<path-to-my-starter>

Documentation for this feature can be found here: https://kedro.readthedocs.io/en/latest/02_get_started/06_starters.html. Please take a look and let me know if you have any questions, comments, or feedback.

To add a bit more information, the mental model that we are trying to reinforce, which is also in line with what everyone is discussing here, is: modular pipelines are for business logic reusability and starters are for convention reusability. If you have any ideas on how to improve the synergy of these two features to make reusability in Kedro even more robust, we would love to learn more.

yetudada · 2020-07-20T15:55:43Z

You can see a walk-through of Kedro Starters by checking out @dataengineerone.

lorenabalan · 2020-10-05T14:57:54Z

Thank you everyone for weighing in. In light of the starters being released, I believe this issue can now be closed.

yetudada added Type: Discussion labels Jan 28, 2020

yetudada self-assigned this Jan 28, 2020

WaylonWalker mentioned this issue Feb 24, 2020

remove src from template in favor of <proj_name>/<proj_name> #229

Closed

6 tasks

yetudada removed the Component: Template label Mar 11, 2020

lorenabalan closed this as completed Oct 5, 2020

This was referenced Aug 7, 2023

Insights and opportunities related to helping Kedro impact more users #2901

Closed

Research summary of insights for improving Kedro's value #2902

Closed
