Questions tagged [scikit-learn]

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody, reusable in various contexts, and built on NumPy and SciPy. The project is open source and ...

100
votes
13 answers
59k views

How to extract the decision rules from scikit-learn decision-tree?

Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree as a textual list? Something like: if A>0.4 then if B<0.2 then if C>0.8 then class='...
78
votes
5 answers
39k views

How to split data into 3 sets (train, validation and test)?

I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I ...
143
votes
17 answers
84k views

Label encoding across multiple columns in scikit-learn

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'...
136
votes
6 answers
50k views

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
59
votes
7 answers
36k views

How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features: viagra=None ok : spam ...
125
votes
7 answers
321k views

How to normalize an array in NumPy?

I would like to have the norm of one NumPy array. More specifically, I am looking for an equivalent version of this function: def normalize(v): norm=np.linalg.norm(v) if norm==0: ...
145
votes
5 answers
76k views

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data? I have the following sample program from the scikit-learn website: from sklearn import datasets iris=datasets....
69
votes
5 answers
53k views

Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match....
88
votes
8 answers
113k views

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Just trying to do a simple linear regression but I'm baffled by this error for: regr=LinearRegression() regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values) which produces: ValueError:...
34
votes
2 answers
28k views

How does sklearn.svm.SVC's function predict_proba() work internally?

I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally ...
21
votes
5 answers
6k views

How to one-hot-encode from a pandas column containing a list?

I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with value 1 representing a given element ...
58
votes
18 answers
97k views

Why is pydot unable to find GraphViz's executables in Windows 8?

I have GraphViz 2.32 installed in Windows 8 and have added C:\Program Files (x86)\Graphviz2.32\bin to the System PATH variable. Still pydot is unable to find its executables. Traceback (most recent ...
93
votes
16 answers
117k views

ImportError in importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn: >>> from sklearn import svm Traceback (most recent call last): File "<pyshell#17>", line 1, in <module>...
7
votes
2 answers
894 views

How to one hot encode variant length features?

Given a list of variable-length features: features=[['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']] where each sample has a variable number of features and the feature dtype ...
83
votes
7 answers
83k views

Find p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient? lm=sklearn.linear_model.LinearRegression() lm.fit(x,y)
17
votes
4 answers
9k views

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...
33
votes
5 answers
56k views

Preprocessing in scikit learn - single sample - Deprecation warning

On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using Scikit-Learn. from sklearn import preprocessing scaler=...
6
votes
2 answers
4k views

scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from a previous test I noticed that the split into Training/Test set ...
0
votes
2 answers
1k views

Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting

Below is my code. I know why the error is occurring during transform: the feature list differs between fit and transform. How can I solve this? How can I get 0 for all the rest ...
117
votes
2 answers
35k views

Why does one hot encoding improve machine learning performance?

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to ...
107
votes
9 answers
59k views

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, then reinstalling the latest versions all together (using pip). I am still getting this ...
19
votes
2 answers
26k views

Clustering text documents using scikit-learn kmeans in Python

I need to use scikit-learn's KMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
26
votes
6 answers
15k views

Efficiently count word frequencies in python

I'd like to count frequencies of all words in a text file. >>> countInFile('test.txt') should return {'aaa':1, 'bbb': 2, 'ccc':1} if the target text file is like: # test.txt aaa bbb ccc...
23
votes
1 answer
13k views

Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?

I'm using scikit-learn and numpy and I want to set the global seed so that my work is reproducible. Should I use numpy.random.seed or random.seed? Edit: From the link in the comments, I understand ...
16
votes
7 answers
16k views

Error importing scikit-learn modules

I'm trying to call a function from the cluster module, like so: import sklearn db=sklearn.cluster.DBSCAN() and I get the following error: AttributeError: 'module' object has no attribute 'cluster'...
85
votes
6 answers
41k views

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, ...
32
votes
6 answers
18k views

Does the SVM in sklearn support incremental (online) learning?

I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update ...
27
votes
1 answer
21k views

scikit-learn cross validation, negative values with mean squared error

When I use the following code with Data matrix X of size (952,144) and output vector y of size (952), mean_squared_error metric returns negative values, which is unexpected. Do you have any idea?...
10
votes
2 answers
3k views

sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers, i.e. certain columns have text only and the rest have integers (or ...
91
votes
4 answers
139k views

Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example: import pandas as pd df=pd.DataFrame({"A": [10,20,30,...
