NumPy arrays are homogeneous (all elements share one dtype), support vectorized operations, and are more memory efficient than Python lists. They offer broadcasting, advanced indexing, and a rich set of mathematical operations, giving much better performance for numerical computation. Arrays have a fixed size once created, while lists grow and shrink dynamically.
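A minimal sketch of these differences (the array and list names are illustrative):

```python
import numpy as np

# Vectorized arithmetic on a homogeneous array vs. an element-wise list loop.
values = list(range(1_000_000))
arr = np.array(values)                   # fixed-size, single dtype

squared_list = [v * v for v in values]   # Python-level loop
squared_arr = arr * arr                  # vectorized, runs in C

# Broadcasting and advanced indexing in one line each.
scaled = arr * 0.5                       # scalar broadcast over the whole array
evens = arr[arr % 2 == 0]                # boolean-mask indexing
```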
Use fillna(), dropna(), and interpolate(). Be aware of the different missing markers (NaN, None, NaT), pick an imputation strategy (mean, median, forward/backward fill) that fits the column, and inspect the missingness pattern first with isna(). Remember that most pandas aggregations skip NaN by default, which affects downstream calculations.
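A short sketch of the common options, using a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, np.nan, 25.0],
                   "city": ["A", "B", None, "B", "A"]})

print(df.isna().sum())                          # inspect the missingness pattern
filled = df["temp"].fillna(df["temp"].mean())   # mean imputation
ffilled = df["temp"].ffill()                    # forward fill
interp = df["temp"].interpolate()               # linear interpolation
dropped = df.dropna(subset=["city"])            # drop rows missing a key column
```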
Use Matplotlib for basic plots (line, scatter, bar) and Seaborn for statistical visualizations (distributions, regressions). Customize labels, legends, and styling, choose the plot type that matches the data, and reach for interactive backends or tools when exploration calls for it.
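A small example combining both libraries (the data is synthetic):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.2, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, y, label="noisy sine")    # basic Matplotlib line plot
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.legend()

sns.regplot(x=x, y=y, ax=ax2)         # Seaborn scatter plus regression fit
ax2.set_title("regression fit")
plt.tight_layout()
plt.show()
```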
Use groupby(), agg(), and pivot_table(). Apply different aggregation functions per column, handle multi-level grouping, and watch performance on large frames. Custom aggregation functions can be passed to agg(), and grouping criteria can be columns, index levels, or computed keys.
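A sketch with a made-up sales frame:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "product": ["a", "b", "a", "a", "b"],
    "units": [10, 4, 7, 3, 8],
    "price": [2.5, 4.0, 2.5, 2.5, 4.0],
})

# Different aggregations per column, plus a custom function.
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    avg_price=("price", "mean"),
    unit_range=("units", lambda s: s.max() - s.min()),
)

# Two grouping criteria reshaped into a table.
table = sales.pivot_table(index="region", columns="product",
                          values="units", aggfunc="sum", fill_value=0)
```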
Broadcasting allows operations between arrays of different shapes. Shapes are compared from the trailing dimension backwards: two dimensions are compatible if they are equal, one of them is 1, or one is missing. NumPy then virtually expands the size-1 axes to match without copying the inputs, though the result is materialized at the full broadcast shape, so keep memory in mind.
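A minimal example of the shape rules:

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)   # shape (3, 4)
row_means = matrix.mean(axis=1)        # shape (3,)

# (3, 4) and (3,) do not line up from the trailing axis; add a length-1 axis
# so the shapes become (3, 4) and (3, 1), which broadcast.
centered = matrix - row_means[:, np.newaxis]

# Scalar broadcasting: the scalar acts like an array of shape (1, 1).
scaled = matrix * 0.1
```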
Use get_dummies() or OneHotEncoder for one-hot encoding and OrdinalEncoder for integer codes (LabelEncoder is intended for target labels). Treat ordinal and nominal categories differently, consider feature hashing for very high cardinality, and fit encoders on the training data only so the same mapping is applied at prediction time.
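A short sketch of both cases (column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "L", "M"]})

# Nominal feature: one-hot encoding.
dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: preserve the order explicitly.
size_enc = OrdinalEncoder(categories=[["S", "M", "L"]])
df["size_code"] = size_enc.fit_transform(df[["size"]]).ravel()
```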
Use StandardScaler, MinMaxScaler, or RobustScaler depending on the feature distribution; RobustScaler is the least sensitive to outliers. Fit the scaler on the training split only and reuse it to transform the test split, so no information leaks from test data.
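A minimal sketch of leakage-free scaling on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

X = np.random.normal(size=(200, 3))
X[:5] *= 50                                     # inject a few outliers
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = RobustScaler()                         # robust to the injected outliers
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics
```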
Use a DatetimeIndex with resample() and rolling(). Handle time zones and frequencies explicitly, apply time-based operations, and consider seasonal decomposition for trend and seasonality. Reinstate missing timestamps (asfreq or reindex) before filling them, and parse dates properly on load (pd.to_datetime or parse_dates in read_csv).
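A sketch on a synthetic hourly series with a gap:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=72, freq="h", tz="UTC")
ts = pd.Series(np.random.randn(72).cumsum(), index=idx).drop(idx[10:14])

ts = ts.asfreq("h")                        # reinstate missing timestamps as NaN
ts = ts.interpolate()                      # fill the gap

daily_mean = ts.resample("D").mean()       # downsample to daily means
smooth = ts.rolling(window=6).mean()       # 6-hour rolling average
local = ts.tz_convert("Europe/Berlin")     # time-zone conversion
```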
Use SMOTE (from imbalanced-learn) for oversampling or random undersampling, or pass class weights to the model; ensemble methods can also help. Evaluate with metrics suited to imbalance (precision/recall, F1, PR-AUC) rather than accuracy, use stratified cross-validation, and resample only inside each training fold.
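A sketch assuming the imbalanced-learn package is installed; placing SMOTE inside the pipeline keeps resampling out of the validation folds:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 95/5 imbalanced problem.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),           # applied per training fold
                 ("clf", LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")    # F1, not plain accuracy
print(scores.mean())
```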
Tokenize, then stem or lemmatize. Remove stop words and special characters where appropriate, define one consistent cleaning pipeline, account for language-specific rules, and normalize encodings (decode everything to Unicode up front).
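A small sketch using a regex tokenizer and NLTK's PorterStemmer (the stop-word list here is a tiny illustrative one):

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "were"}
stemmer = PorterStemmer()

def clean_text(raw: str) -> list[str]:
    text = raw.lower()
    text = re.sub(r"[^a-z\s]", " ", text)           # strip digits and punctuation
    tokens = text.split()                           # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]        # reduce words to their stems

print(clean_text("The runners were running to the finish line!"))
```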
Use correlation analysis and the variance inflation factor (VIF). Drop, combine, or regularize highly correlated features, validate any elimination step, and remember that multicollinearity inflates coefficient variance and muddies interpretation even when predictions look fine.
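A sketch assuming statsmodels is available for the VIF calculation; the frame is synthetic, with x2 built to be nearly collinear with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=200)   # nearly collinear
df["x3"] = rng.normal(size=200)

print(df.corr())                                    # pairwise correlations
X = df.assign(const=1.0).to_numpy()                 # VIF expects an intercept column
vifs = [variance_inflation_factor(X, i) for i in range(df.shape[1])]
print(dict(zip(df.columns, vifs)))
```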
Prefer vectorized array operations over Python loops, mind memory layout and dtype (contiguity, float32 vs float64), use in-place operations to avoid temporaries, process very large arrays in chunks or via memory mapping, and consider parallel options such as Numba or multiprocessing.
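A few of these points in one sketch ("big.dat" is a hypothetical scratch file):

```python
import numpy as np

a = np.random.rand(1_000_000)

# Vectorized replaces an explicit Python loop.
loop_sum = sum(x * x for x in a)        # slow: one Python-level op per element
vec_sum = np.dot(a, a)                  # fast: single C/BLAS call

# Dtype matters: float32 halves memory relative to float64.
a32 = a.astype(np.float32)

# In-place operations avoid allocating a temporary array.
a *= 2.0

# Very large arrays can live on disk via memory mapping.
mm = np.memmap("big.dat", dtype=np.float32, mode="w+", shape=(10_000, 100))
mm[:100] = a32[:10_000].reshape(100, 100)
```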
Use random, stratified, or systematic sampling. Size the sample for adequate representation, match the strategy to the question, avoid random shuffling for time series (sample contiguous windows or split chronologically instead), and watch for sampling bias.
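A sketch of the three schemes on a synthetic frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.choice(["a", "b", "c"], size=1000, p=[0.6, 0.3, 0.1]),
                   "value": rng.normal(size=1000)})

simple = df.sample(n=100, random_state=0)                   # simple random sample
strat = df.groupby("group").sample(frac=0.1, random_state=0)  # preserves group proportions
systematic = df.iloc[::10]                                  # every 10th row
```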
Use merge(), concat(), and join(). Pick the join type (inner, left, right, outer) deliberately, watch memory on large merges, make sure key columns have matching dtypes, and guard against unexpected duplicate keys (e.g. with merge's validate argument).
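A small sketch with made-up order and customer tables:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 11],
                       "amount": [5.0, 7.5, 3.0]})
customers = pd.DataFrame({"cust_id": [10, 11, 12], "name": ["Ann", "Bob", "Cem"]})

# Left join keeps every order; validate guards against duplicate keys.
joined = orders.merge(customers, on="cust_id", how="left", validate="many_to_one")

# Stacking frames with the same columns.
more_orders = pd.DataFrame({"order_id": [4], "cust_id": [12], "amount": [9.9]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
```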
Use KFold, StratifiedKFold, or TimeSeriesSplit to match the data: stratify for imbalanced classification, split chronologically for time series. Pick scoring metrics that fit the problem, and nest parameter tuning inside the cross-validation so the reported score is not optimistically biased.
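A sketch of nested cross-validation on a bundled dataset; the C grid is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Tuning lives inside the inner loop; the outer loop estimates generalization.
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]},
                      cv=inner, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```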
Use multiprocessing or Dask for parallel and out-of-core operations. Mind memory (each worker may hold its own copy of the data), plan for scalability and error handling, and weigh process-startup and serialization overhead against the benefit, which can be negative for small workloads.
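A standard-library sketch with multiprocessing; the worker function is a stand-in for any expensive, independent task:

```python
import random
from multiprocessing import Pool

def simulate(seed: int) -> float:
    # Stand-in for an expensive, independent computation.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100_000))

if __name__ == "__main__":          # required with the spawn start method
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))   # distribute tasks over workers
    print(len(results))
```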
Pick augmentation strategies that fit the domain (flips and crops for images, synonym replacement for text, noise injection for tabular or signal data). Keep class balance in mind, augment only the training split so validation stays clean, and wire the augmentation step into the pipeline.
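A minimal tabular sketch that appends noisy copies of the training rows; the helper and its parameters are made up for illustration:

```python
import numpy as np

def augment_with_noise(X: np.ndarray, y: np.ndarray, copies: int = 2,
                       scale: float = 0.05, seed: int = 0):
    """Append noisy copies of the training rows (simple tabular augmentation)."""
    rng = np.random.default_rng(seed)
    noisy = [X + rng.normal(scale=scale * X.std(axis=0), size=X.shape)
             for _ in range(copies)]
    X_aug = np.vstack([X, *noisy])
    y_aug = np.concatenate([y] * (copies + 1))   # labels repeat with their copies
    return X_aug, y_aug

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)
X_aug, y_aug = augment_with_noise(X_train, y_train)   # training split only
```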
Use a streaming library suited to the source (message queues, chunked file readers) and buffer incoming records so memory stays bounded. Process incrementally, handle errors and malformed records without stopping the stream, and think about ordering and consistency when updates arrive in real time.
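A library-free sketch of the buffering idea: a generator consumes a stream with a bounded deque and skips malformed records (the rolling-mean computation is just a placeholder workload):

```python
from collections import deque
from typing import Iterable, Iterator

def rolling_mean_stream(stream: Iterable[float], window: int = 10) -> Iterator[float]:
    """Consume a stream incrementally with a bounded buffer."""
    buffer = deque(maxlen=window)        # memory stays bounded
    for value in stream:
        try:
            buffer.append(float(value))
        except (TypeError, ValueError):
            continue                     # skip malformed records, keep streaming
        yield sum(buffer) / len(buffer)

# Simulated stream; in practice this would come from a socket, queue, or file.
for mean in rolling_mean_stream(iter(range(100)), window=5):
    pass
```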
Use scipy.stats: t-tests for comparing means, chi-square for categorical associations, and so on. Check assumptions (normality, equal variances, independence) and sample size, choose the test that matches the question, and correct for multiple comparisons (e.g. Bonferroni or FDR) when running many tests.
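A sketch of two common tests on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=10.8, scale=2.0, size=50)

# Two-sample t-test; Welch's version does not assume equal variances.
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)

# Chi-square test of independence on a 2x2 contingency table.
table = np.array([[30, 10], [20, 25]])
chi2, p, dof, expected = stats.chi2_contingency(table)

# Bonferroni correction for running the two tests above.
alpha = 0.05 / 2
```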
Use polynomial features or spline transformations to capture nonlinearity, or transform features directly (log, sqrt). Validate carefully because high-degree expansions overfit quickly, and compare against models that handle nonlinearity natively (trees, gradient boosting) during model selection.
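A sketch comparing a polynomial and a spline expansion under cross-validation (SplineTransformer assumes scikit-learn 1.0+):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

poly_model = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))
spline_model = make_pipeline(SplineTransformer(degree=3, n_knots=8), Ridge(alpha=1.0))

# Cross-validation guards against overfitting from high-degree expansions.
for name, model in [("poly", poly_model), ("spline", spline_model)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean())
```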
Use SHAP values and feature importance analysis, plus model-specific techniques (coefficients for linear models, impurity importances for trees). Distinguish global interpretation (overall importance) from local interpretation (a single prediction), and treat post-hoc explanations of complex models with appropriate caution.
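A sketch assuming the shap package is installed, contrasting built-in global importances with local SHAP values:

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Global view: impurity-based importances built into the model.
global_importance = dict(zip(X.columns, model.feature_importances_))

# Local view: SHAP values give per-sample, per-feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
```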
Use SelectKBest, RFE, or model-based importances (SelectFromModel); correlation analysis and mutual information help too. Validate the selection, ideally inside cross-validation, and put the selector in a Pipeline so it is fit only on training folds.
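A sketch of univariate selection inside a pipeline, plus RFE, on the same bundled dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Univariate selection by mutual information, wired into a pipeline so the
# selector is re-fit on each training fold.
pipe = make_pipeline(StandardScaler(),
                     SelectKBest(mutual_info_classif, k=10),
                     LogisticRegression(max_iter=5000))
print(cross_val_score(pipe, X, y, cv=5).mean())

# Recursive feature elimination driven by model coefficients.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print(rfe.support_.sum())
```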
Use the IQR rule (flag points outside Q1 - 1.5*IQR to Q3 + 1.5*IQR) or z-scores (e.g. |z| > 3), but let domain knowledge define what counts as an outlier. Choose a treatment per feature (drop, cap/winsorize, or keep and use robust methods) and check the impact on model performance either way.
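Both rules and a capping treatment in one sketch on synthetic data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.concatenate([np.random.normal(50, 5, 500), [120, -40]]))

# IQR rule.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score rule.
z = (s - s.mean()) / s.std()
z_mask = z.abs() > 3

# Winsorize-style treatment: cap instead of dropping.
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(iqr_mask.sum(), z_mask.sum())
```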