Define a "standard" process to follow when solving simple data science problems #26

Open
NickSeagull opened this issue Jan 18, 2018 · 3 comments

Comments

@NickSeagull
Member

No description provided.

@NickSeagull changed the title from **[documentation]** - Define a "standard" process to follow when solving simple data science problems to Define a "standard" process to follow when solving simple data science problems on Jan 18, 2018
@Drezil
Member

Drezil commented Jan 19, 2018

A short list of things this should cover:

  • Data preparation

    • cleaning (removing incomplete data, looking up additional data, getting everything into one matrix/table/format)
    • whitening (normalize, center & decorrelate - Warning: throws away correlation information!)
    • dimensionality reduction (PCA, ICA, ... - Warning: throws away information!)
  • Algorithm selection

    • supervised? unsupervised?
    • typical solutions for typical problems (classification, correlation, non-metric solutions (e.g. NLP with suffix trees, edit distance, etc.))
    • for each algorithm
      • when and when NOT to use
      • further reading
  • Ways to present/interpret results

    • statistical significance?
    • typical tests/metrics (AUC, F_1 score, sensitivity/specificity, etc.)
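
As a rough illustration of the "typical tests/metrics" item, here is a minimal sketch of sensitivity, specificity and the F_1 score for a binary classifier, computed from a confusion matrix. It uses only the Python standard library; the names `Confusion`, `confusion`, `sensitivity`, `specificity` and `f1_score` are made up for this example and do not come from any particular library, and the labels are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def confusion(y_true, y_pred):
    """Count outcomes for paired binary labels (truthy = positive class)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif not truth and not pred:
            tn += 1
        else:
            fn += 1
    return Confusion(tp, fp, tn, fn)


def sensitivity(c):
    """True positive rate: TP / (TP + FN); 0.0 if there are no positives."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def specificity(c):
    """True negative rate: TN / (TN + FP); 0.0 if there are no negatives."""
    negatives = c.tn + c.fp
    return c.tn / negatives if negatives else 0.0


def f1_score(c):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Toy labels, invented for this sketch.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
c = confusion(y_true, y_pred)
print(c, sensitivity(c), specificity(c), f1_score(c))
```

Returning 0.0 for an empty denominator is a choice made for the sketch; a guide might prefer to report such cases as undefined.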

@ocramz
Member

ocramz commented Jan 25, 2018

@ixxie wants to write about reproducibility with Jupyter and Nix. I've added him to the DH members' list, so he should see this soon as well.

@ixxie

ixxie commented Apr 28, 2018

Hmmm, I am not sure this is quite relevant here; my goal is more to create easily reproducible infrastructure as code, i.e. to allow anybody to deploy a data science platform relatively easily. Reproducibility of individual computations is also of great interest, and Nix can help with this, but I don't know much about it atm (would be willing to look into it some time!).

FWIW, it seems a bit far-fetched to be able to specify a simple decision-tree recipe for doing data science; the way I would approach this is to think of it as a bipartite graph: list some problems (e.g. tokenization, classification, clustering, etc.) and some algorithms (CRFs, RNNs, HDBSCAN) and draw links between them.
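
To make the bipartite-graph idea a bit more concrete, here is a minimal sketch that stores problem/algorithm links as an edge list and indexes them in both directions. The specific links in `EDGES` are illustrative assumptions, not a vetted recommendation, and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Edges of the bipartite graph: (problem, algorithm).
# These particular links are placeholders chosen for illustration only.
EDGES = [
    ("tokenization", "CRF"),
    ("tokenization", "RNN"),
    ("classification", "RNN"),
    ("clustering", "HDBSCAN"),
]


def build_index(edges):
    """Return two lookups: problem -> algorithms and algorithm -> problems."""
    by_problem = defaultdict(set)
    by_algorithm = defaultdict(set)
    for problem, algorithm in edges:
        by_problem[problem].add(algorithm)
        by_algorithm[algorithm].add(problem)
    return dict(by_problem), dict(by_algorithm)


by_problem, by_algorithm = build_index(EDGES)

# Which algorithms are linked to a given problem?
print(sorted(by_problem["tokenization"]))

# Which problems is a given algorithm linked to?
print(sorted(by_algorithm["RNN"]))
```

Each edge could later carry the "when and when NOT to use" notes and further reading, so the graph would double as an index into the documentation.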
