it's been useful to just try things out without booting up Python locally, or to experiment on the go on mobile
the official Python website has an interactive Python shell built into the front page, with libraries like numpy and scikit-learn already installed
need a quick Python REPL while you're away from your main computer, with libraries like numpy and scikit-learn available? the python.org homepage has you covered
the autocomplete for everything is so amazing
i may be behind the times regarding Python REPLs, but i just found ptpython. it works super well! i'm impressed. the last innovation in Python REPLs i saw was ipython, years ago
github.com/prompt-toolk...
need to reproducibly take a screenshot of a webpage using R? you can use the chromote package to do this, so cool
rstudio.github.io/chromote/art...
how to validate a marketing mix model
www.stellaheystella.com/blog/how-do-...
another way to do propensity matching without regression, using only SQL, is to do some bucketing
www.datacult.com/post/propens...
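a minimal sketch of that bucketing idea in pandas rather than SQL: coarsen a covariate into bins, then compare treated and control rows that share a bucket. the column names and bin edges here are invented for illustration, not from the article

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),        # ages 18..64
    "treated": rng.integers(0, 2, n),      # 0 = control, 1 = treated
})

# coarsen the covariate into buckets; rows sharing a bucket are "comparable"
df["age_bucket"] = pd.cut(df["age"], bins=[17, 30, 45, 65],
                          labels=["18-30", "31-45", "46-64"])

# count treated vs control within each bucket
summary = df.groupby(["age_bucket", "treated"], observed=True).size().unstack(fill_value=0)
print(summary)
```

in SQL the same thing would be a CASE expression to form the bucket, then a GROUP BY over bucket and treatment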
Python package to do propensity matching
pypi.org/project/psmpy/
preparing for questions on propensity matching, especially in the marketing world. the idea: fit a regression of the intervention on the covariates you want to balance, then use the predicted propensity scores to match cases to controls
towardsdatascience.com/psmpy-propen...
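a small sketch of that regression-based approach with scikit-learn rather than psmpy: logistic regression of treatment on the covariates, then greedy 1:1 nearest-neighbor matching on the score. the data and column names are synthetic, made up for the example

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "spend": rng.normal(100, 25, n),
})
# treatment assignment loosely depends on age, so the groups are imbalanced
df["treated"] = (rng.random(n) < 1 / (1 + np.exp(-(df["age"] - 40) / 10))).astype(int)

# 1) regress treatment on the covariates to get a propensity score
model = LogisticRegression().fit(df[["age", "spend"]], df["treated"])
df["pscore"] = model.predict_proba(df[["age", "spend"]])[:, 1]

# 2) greedy 1:1 nearest-neighbor match on the score, without replacement
treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0].copy()
pairs = []
for i, row in treated.iterrows():
    if controls.empty:
        break
    j = (controls["pscore"] - row["pscore"]).abs().idxmin()
    pairs.append((i, j))
    controls = controls.drop(j)
print(len(pairs), "matched pairs")
```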
trying to prepare for questions about how to detect outliers and what you might do with them
en.wikipedia.org/wiki/Outlier
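one common, simple rule worth having ready: the 1.5 × IQR fences used in box plots. just a sketch of one detection method, not the only approach the Wikipedia page covers

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 13, 12])

# interquartile range and the usual box-plot fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # the 95 stands out
```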
Wikipedia page and a similar external page on model validation, especially for regressions, like looking at the residuals vs. fitted values plot
en.wikipedia.org/wiki/Statist...
library.virginia.edu/data/article...
because i was looking at some marketing roles and thinking about survival analysis, i thought about the Weibull distribution, which can model a time-to-failure rate that changes over time
en.wikipedia.org/wiki/Weibull...
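a quick sketch of why the Weibull shape parameter matters for time-to-failure: shape < 1 gives a decreasing hazard, shape > 1 an increasing one, and shape = 1 reduces to the exponential's constant hazard

```python
import numpy as np
from scipy.stats import weibull_min

t = np.linspace(0.1, 5, 50)
for shape in (0.5, 1.0, 2.0):
    # hazard rate = pdf / survival function
    hazard = weibull_min.pdf(t, shape) / weibull_min.sf(t, shape)
    trend = "decreasing" if hazard[-1] < hazard[0] else "increasing or flat"
    print(f"shape={shape}: hazard is {trend}")
```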
more emphasis on multicollinearity: detect it using Variance Inflation Factors (VIF), and deal with it by dropping correlated variables or centering the variables
statisticsbyjim.com/regression/m...
false discovery rate is the expected proportion of type I errors among all discoveries when conducting multiple comparisons, defined as FDR = FP / (FP + TP)
en.wikipedia.org/wiki/False_d...
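that formula on a toy example, plus the Benjamini-Hochberg procedure from statsmodels, which is the usual way to control FDR from p-values in practice. the counts and p-values here are made up

```python
from statsmodels.stats.multitest import multipletests

# suppose 4 tests were declared significant: 1 false positive, 3 true positives
fp, tp = 1, 3
fdr = fp / (fp + tp)
print(fdr)  # 0.25

# controlling FDR from p-values with Benjamini-Hochberg
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject)
```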
more info on selecting a model from regression or machine learning
en.wikipedia.org/wiki/Model_s...
i considered looking at terminal/CLI ways of exploring data. not sure if i'll use it much, but this is a good reference book for doing data science at the command line
jeroenjanssens.com/dsatcl/chapt...
techniques to reduce overfitting:
- Hold-out
- Cross-validation
- Data augmentation (transform data to make more)
- Feature selection
- L1 / L2 regularization
- Remove layers/number of units per layer (in neural network)
- Dropout (of connections in NN)
- Early stopping
towardsdatascience.com/8-simple-tec...
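two items from the list above, sketched together: L2 regularization (Ridge) evaluated with cross-validation, using scikit-learn on synthetic data where the feature count is large relative to the sample size

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# many features relative to samples -> plain OLS tends to overfit
X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)

for name, model in [("ols", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5)  # R^2 per fold
    print(name, round(scores.mean(), 3))
```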
regression diagnostics and assumptions: linearity and additivity of the independent variables, statistical independence of errors, homoscedasticity of errors, and normality of errors
people.duke.edu/~rnau/testin...
a general reference: Wikipedia's list of all its statistics articles
en.wikipedia.org/wiki/List_of...
just in case i need to talk about statistical model building. it was good to catch up on terms like multicollinearity, overfitting, model evaluation metrics (like BIC and AUC), feature selection, and feature engineering
domystats.com/advanced-met...
a more conceptual and philosophical read about the elements of data analysis and data science
arxiv.org/abs/1903.076...
for those technical live-coding interviews, a quick overview of what pandas can do (i haven't used it in a bit, so this was a good review for me), like:
df.dtypes                                # column types
df.describe()                            # summary statistics for numeric columns
df['categorical_column'].value_counts()  # frequency counts for one column
df.dropna()                              # drop rows with missing values
df.drop_duplicates()                     # drop exact duplicate rows
diogoribeiro7.github.io/data%20scien...
a nice read about the basics of Bayesian modeling: the importance of focusing on p(model | data), a shift in mindset toward the distribution of models themselves, the fact that prior distributions matter more with less data, and quantifying uncertainty
statisticalbiophysicsblog.org?p=233
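a tiny Beta-Binomial sketch of those ideas: the posterior p(model | data) tightens as data grows, so the prior matters most when data is scarce. the prior and counts here are made up for illustration

```python
from scipy.stats import beta

prior_a, prior_b = 2, 2  # a weakly informative prior on a conversion rate

widths = []
for successes, trials in [(3, 10), (300, 1000)]:
    # conjugate update: posterior is Beta(a + successes, b + failures)
    post = beta(prior_a + successes, prior_b + trials - successes)
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    widths.append(hi - lo)
    print(f"n={trials}: 95% credible interval ({lo:.3f}, {hi:.3f})")
```

the interval for n=1000 is much narrower than for n=10, which is the "uncertainty shrinks with data" point made concrete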
a linked article on the SQL interview at Instacart that was interesting to read, especially since it incorporates the fact that on the job you'll likely have access to an LLM to help you out anyway, and focuses more on "prompt engineering" to get to the core business problem
tech.instacart.com/data-science...
"Advice for Data Scientists/Statisticians interested in working at Instacart"
some good specific but general questions and tips to consider when interviewing for data roles
docs.google.com/document/d/1...
piqued my interest to open up a resource on how to use AI in data analysis and found this short course
gabors-data-analysis.com/ai-course/
an oldie but a goodie regarding technical debt in machine learning systems, probably applicable to other technical systems too. it lists the following as risk factors:
- boundary erosion (abstractions become fluid)
- entanglement (of features)
- hidden feedback loops
- undeclared consumers
- data dependencies
- configuration issues
- changes in the external world
- a variety of system-level anti-patterns
proceedings.neurips.cc/paper_files/...
the standard behavioral interview questions are good to review
www.themuse.com/advice/behav...
preparing for some interviews, and here are some things i've looked up in prep