# Questions tagged [scikit-learn]

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody, reusable in various contexts, and built on NumPy and SciPy. The project is open source and ...

1,294 questions

**100**

votes

**13** answers

59k views

### How to extract the decision rules from scikit-learn decision-tree?

Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree as a textual list? Something like: `if A>0.4 then if B<0.2 then if C>0.8 then class='...`

**78**

votes

**5** answers

39k views

### How to split data into 3 sets (train, validation and test)?

I have a pandas DataFrame and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I ...
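One way to approach the three-way split asked about above is to shuffle the row indices once and cut them at two points. The sketch below is only illustrative: the helper name `split_three`, the 60/20/20 proportions, and the fixed seed are assumptions, not part of the question.

```python
import numpy as np

def split_three(n_rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Return shuffled index arrays for train, validation and test.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)                 # shuffled row positions
    train_end = int(round(train_frac * n_rows))   # first cut point
    val_end = train_end + int(round(val_frac * n_rows))  # second cut point
    train_idx, val_idx, test_idx = np.split(idx, [train_end, val_end])
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_three(100)
```

With a pandas DataFrame `df`, the three parts could then be selected by position, for example `df.iloc[train_idx]`.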

**143**

votes

**17** answers

84k views

### Label encoding across multiple columns in scikit-learn

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'...

**136**

votes

**6** answers

50k views

### Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

**59**

votes

**7** answers

36k views

### How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features: `viagra=None ok : spam ...`

**125**

votes

**7** answers

321k views

### How to normalize an array in NumPy?

I would like to have the norm of one NumPy array. More specifically, I am looking for an equivalent version of this function: `def normalize(v):` `norm=np.linalg.norm(v)` `if norm==0: ...`
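For reference, a small self-contained version of the kind of function the excerpt starts to define might look like the sketch below. The excerpt is cut off after the zero-norm check, so returning the input unchanged in that case is an assumption made here, not necessarily what the asker wrote.

```python
import numpy as np

def normalize(v):
    """Scale v to unit Euclidean (L2) length."""
    norm = np.linalg.norm(v)
    if norm == 0:
        # Assumption: leave an all-zero vector untouched to avoid dividing by zero.
        return v
    return v / norm

unit = normalize(np.array([3.0, 4.0]))
zero = normalize(np.zeros(3))
```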

**145**

votes

**5** answers

76k views

### Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data? I have the following sample program from the scikit-learn website: `from sklearn import datasets` `iris=datasets....`

**69**

votes

**5** answers

53k views

### Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match....

**88**

votes

**8** answers

113k views

### sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Just trying to do a simple linear regression but I'm baffled by this error for: `regr=LinearRegression()` `regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values)` which produces: `ValueError:...`

**34**

votes

**2** answers

28k views

### How does sklearn.svm.SVC's function predict_proba() work internally?

I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally ...

**21**

votes

**5** answers

6k views

### How to one-hot-encode from a pandas column containing a list?

I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with value 1 representing a given element ...

**58**

votes

**18** answers

97k views

### Why is pydot unable to find GraphViz's executables in Windows 8?

I have GraphViz 2.32 installed in Windows 8 and have added C:\Program Files (x86)\Graphviz2.32\bin to the System PATH variable. Still pydot is unable to find its executables. `Traceback (most recent ...`

**93**

votes

**16** answers

117k views

### ImportError in importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn: `>>> from sklearn import svm` `Traceback (most recent call last):` `File "<pyshell#17>", line 1, in <module>...`

**7**

votes

**2** answers

894 views

### How to one hot encode variant length features?

Given a list of variant length features: `features=[['f1', 'f2', 'f3'],['f2', 'f4', 'f5', 'f6'],['f1', 'f2']]` where each sample has a variable number of features and the feature dtype ...
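To make the target of this question concrete, here is a plain-Python sketch of one-hot encoding the `features` list from the excerpt. It is only an illustration of the encoding itself, with hypothetical names (`vocab`, `column`, `encoded`); scikit-learn's `MultiLabelBinarizer` in `sklearn.preprocessing` performs this kind of encoding as well.

```python
# Sample data copied from the question excerpt.
features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

# One column per distinct feature name, in sorted order.
vocab = sorted({name for sample in features for name in sample})
column = {name: j for j, name in enumerate(vocab)}

# One row per sample, with 1 where the sample contains that feature.
encoded = []
for sample in features:
    row = [0] * len(vocab)
    for name in sample:
        row[column[name]] = 1
    encoded.append(row)

# vocab   -> ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
# encoded -> [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0]]
```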

**83**

votes

**7** answers

83k views

### Find p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient? `lm=sklearn.linear_model.LinearRegression()` `lm.fit(x,y)`

**17**

votes

**4** answers

9k views

### scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...

**33**

votes

**5** answers

56k views

### Preprocessing in scikit-learn - single sample - Deprecation warning

On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using scikit-learn. `from sklearn import preprocessing` `scaler=...`

**6**

votes

**2** answers

4k views

### scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from a previous test I noticed that the split into Training/Test set ...

**0**

votes

**2** answers

1k views

### Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting

Below is my code. I know why the error is occurring during transform. It is because of the feature list mismatch during fit and transform. How can I solve this? How can I get 0 for all the rest ...

**117**

votes

**2** answers

35k views

### Why does one hot encoding improve machine learning performance?

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to ...

**107**

votes

**9** answers

59k views

### RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, and reinstalling the latest versions all together again (using pip). I am still getting this ...

**19**

votes

**2** answers

26k views

### Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...

**26**

votes

**6** answers

15k views

### Efficiently count word frequencies in python

I'd like to count frequencies of all words in a text file. `>>> countInFile('test.txt')` should return `{'aaa':1, 'bbb': 2, 'ccc':1}` if the target text file is like: `# test.txt` `aaa bbb ccc...`
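A minimal sketch of the counting the excerpt describes, using `collections.Counter` from the standard library. The helper name `count_words` and the sample lines are assumptions for illustration (the file contents in the excerpt are truncated); the function accepts any iterable of lines, so an open file handle could be passed instead of a list.

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # split() with no argument splits on any whitespace
    return counts

# Hypothetical sample text standing in for the truncated test.txt.
sample_lines = ["aaa bbb ccc", "bbb"]
word_counts = count_words(sample_lines)
```

For a real file, something like `with open('test.txt') as fh: count_words(fh)` would stream the file line by line rather than loading it whole.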

**23**

votes

**1** answer

13k views

### Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?

I'm using scikit-learn and numpy and I want to set the global seed so that my work is reproducible. Should I use numpy.random.seed or random.seed? Edit: From the link in the comments, I understand ...

**16**

votes

**7** answers

16k views

### Error importing scikit-learn modules

I'm trying to call a function from the cluster module, like so: `import sklearn` `db=sklearn.cluster.DBSCAN()` and I get the following error: `AttributeError: 'module' object has no attribute 'cluster'...`

**85**

votes

**6** answers

41k views

### How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, ...

**32**

votes

**6** answers

18k views

### Does the SVM in sklearn support incremental (online) learning?

I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update ...

**27**

votes

**1** answer

21k views

### scikit-learn cross validation, negative values with mean squared error

When I use the following code with Data matrix X of size (952,144) and output vector y of size (952), mean_squared_error metric returns negative values, which is unexpected. Do you have any idea?...

**10**

votes

**2** answers

3k views

### sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers, i.e. certain columns have text only and the rest have integers (or ...

**91**

votes

**4** answers

139k views

### Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example: `import pandas as pd` `df=pd.DataFrame({"A": [10,20,30,...`