Is there no way for them to become a data scientist then?
With the recent boom in data science, a lot of people are interested in getting into this domain but don’t have the slightest idea about coding. I understand how terrible it feels when something you have never learned haunts you at every step.
The good news is that there is a way for you to become a data scientist, regardless of your programming skills. There are tools that largely remove the programming aspect and provide a user-friendly GUI (Graphical User Interface), so that anyone with minimal knowledge of algorithms can use them to build high-quality machine learning models.
Note: All the information provided is gathered from open-source information sources. We are just presenting some facts and not opinions. In no manner do we intend to promote/advertise any of the products/services.
List of Tools
RapidMiner (RM) was originally started in 2006 as open-source stand-alone software named Rapid-I. Over the years, it was renamed RapidMiner and attained ~35Mn USD in funding. Older versions (below v6) are open-source, but the latest versions come with a 14-day trial period and require a license after that.
RM covers the entire life-cycle of prediction modeling, starting from data preparation to model building and finally validation and deployment. The GUI is based on a block-diagram approach, something very similar to Matlab Simulink. There are predefined blocks which act as plug and play devices. You just have to connect them in the right manner and a large variety of algorithms can be run without a single line of code. On top of this, they allow custom R and Python scripts to be integrated into the system.
Their current product offerings include the following:
RapidMiner Studio: A stand-alone software which can be used for data preparation, visualization and statistical modeling
RapidMiner Server: It is an enterprise-grade environment with central repositories which allow easy team work, project management and model deployment
RapidMiner Radoop: Implements big-data analytics capabilities centered around Hadoop
RapidMiner Cloud: A cloud-based repository which allows easy sharing of information among various devices
RM is currently being used in various industries including automotive, banking, insurance, life sciences, manufacturing, oil and gas, retail, telecommunication and utilities.
DataRobot (DR) is a highly automated machine learning platform built by some of the all-time best Kagglers, including Jeremy Achin, Thomas DeGodoy and Owen Zhang. Their platform claims to have obviated the need for data scientists. This is evident from a phrase on their website: “Data science requires math and stats aptitude, programming skills, and business knowledge. With DataRobot, you bring the business knowledge and data, and our cutting-edge automation takes care of the rest.”
DR claims to have the following benefits:
Platform automatically detects the best data pre-processing and feature engineering by employing text mining, variable type detection, encoding, imputation, scaling, transformation, etc.
Hyper-parameters are automatically chosen depending on the error-metric and the validation set score
Computation is divided over thousands of multi-core servers
Uses distributed algorithms to scale to large data sets
Easy deployment facilities with just a few clicks (no need to write any new code)
For Software Engineers
Python SDK and APIs are available for quick integration of models into tools and software.
BigML provides a good GUI which takes the user through six steps, as follows:
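To make the idea of choosing hyper-parameters from a validation-set score more concrete, here is a minimal sketch in plain Python. It is not DataRobot's implementation or API: the toy data, the k-nearest-neighbour model and the helper names (`knn_predict`, `mean_absolute_error`, `select_k`) are all assumptions made for illustration.

```python
# Illustrative sketch only: picks a hyper-parameter (k for k-nearest neighbours)
# by comparing an error metric on a held-out validation set.
# All names and data here are hypothetical, not part of any DataRobot API.

def knn_predict(train, x, k):
    """Predict the mean target of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_absolute_error(train, validation, k):
    """Error metric: average absolute difference on the validation set."""
    errors = [abs(knn_predict(train, x, k) - y) for x, y in validation]
    return sum(errors) / len(errors)

def select_k(train, validation, candidates):
    """Return the candidate k with the lowest validation error, plus all scores."""
    scores = {k: mean_absolute_error(train, validation, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Toy data following y = 2x, split into training and validation parts.
train = [(x, 2 * x) for x in range(0, 20, 2)]       # x = 0, 2, ..., 18
validation = [(x, 2 * x) for x in range(1, 19, 4)]  # x = 1, 5, 9, 13, 17

best_k, scores = select_k(train, validation, candidates=[1, 2, 5])
print(best_k, scores)
```

An automated platform would presumably search far more models and settings, but the principle sketched here is the same: try candidates, score each on data the model has not seen, and keep the best.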
Sources: use various sources of information
Datasets: use the defined sources to create a dataset
Models: make predictive models
Predictions: generate predictions based on the model
Ensembles: create an ensemble of various models
Evaluation: verify the model against validation sets
These processes will obviously iterate in different orders. The BigML platform provides nice visualizations of results and has algorithms for solving classification, regression, clustering, anomaly detection and association discovery problems. They offer several packages bundled together in monthly, quarterly and yearly subscriptions. They even offer a free package but the size of the dataset you can upload is limited to 16MB.
Cloud AutoML is part of Google’s Machine Learning suite of offerings and enables people with limited ML expertise to build high-quality models. The first product in the Cloud AutoML portfolio is Cloud AutoML Vision. This service makes it simpler to train image recognition models. It has a drag-and-drop interface that lets the user upload images, train models, and then deploy those models directly on Google Cloud.
Cloud AutoML Vision is built on Google’s transfer learning and neural architecture search technologies (among others). This tool is already being used by a lot of organizations.
Paxata is one of the few organizations which focus on data cleaning and preparation, and not the machine learning or statistical modeling part. It is an MS Excel-like application that is easy to use. It also provides visual guidance making it easy to bring together data, find and fix dirty or missing data, and share and re-use data projects across teams. Like the other tools mentioned in this article, Paxata eliminates coding or scripting, hence overcoming technical barriers involved in handling data.
The Paxata platform follows this process:
Add Data: use a wide range of sources to acquire data
Explore: perform data exploration using powerful visuals allowing the user to easily identify gaps in data
Clean+Change: perform data cleaning using steps like imputation, normalization of similar values using NLP, detecting duplicates
Shape: make pivots on data, perform grouping and aggregation
Share+Govern: allows sharing and collaborating across teams with strong authentication and authorization in place
Combine: a proprietary technology called SmartFusion allows combining data frames with 1 click as it automatically detects the best combination possible; multiple data sets can be combined into a single AnswerSet
BI Tools: allows easy visualization of the final AnswerSet in commonly used BI tools; also allows easy iterations between data preprocessing and visualization
Paxata has established itself in the financial services, consumer goods and networking domains. It might be a good tool to use if your work requires extensive data cleaning.
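For readers curious about what a Clean+Change step does under the hood, here is a rough sketch of three of the operations listed above: normalizing similar values, detecting duplicates and imputation. It is only an illustration in plain Python. The records, column names and helper functions are made up, and simple string normalization stands in for the NLP-based matching that Paxata describes.

```python
# Illustrative sketch only: three common cleaning steps on a tiny table.
# The records, column names and helper functions are hypothetical.

def normalize_text(value):
    """Collapse case and extra whitespace so similar values match."""
    return " ".join(value.split()).lower()

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_mean(rows, column):
    """Replace missing (None) numeric values with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [{**row, column: mean if row[column] is None else row[column]} for row in rows]

raw = [
    {"city": "New York ", "sales": 100.0},
    {"city": "new  york", "sales": 100.0},  # same record, typed differently
    {"city": "Boston", "sales": None},      # missing value
    {"city": "Chicago", "sales": 300.0},
]

cleaned = [{**row, "city": normalize_text(row["city"])} for row in raw]
cleaned = drop_duplicates(cleaned)       # remove duplicates before imputing
cleaned = impute_mean(cleaned, "sales")  # fill the gap with the column mean
print(cleaned)
```

Duplicates are removed before imputation so that a repeated record does not skew the mean used to fill the missing value.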
Trifacta is another startup with a heavy focus on data preparation. It has 3 product offerings:
Wrangler: A free stand-alone software. Allows up to 100MB of data
Wrangler Pro: An upgraded version of the above. It supports both single-user and multi-user setups, and the data volume limit is 40GB
Wrangler Enterprise: The ultimate offering from Trifacta. It does not have any limit on the amount of data you process and allows unlimited users. Ideal for big organizations
Trifacta offers a very intuitive GUI for performing data cleaning. It takes data as input and provides a summary with various statistics by column. Also, for each column it automatically recommends some transformations which can be selected using a single click. Various transformations can be performed on the data using some pre-defined functions which can be called easily in the interface.
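As a rough picture of what a per-column summary with suggested transformations might involve, consider the sketch below. It is not Trifacta's logic: the rows, the statistics chosen and the two suggestion rules are assumptions picked purely for illustration.

```python
# Illustrative sketch only: a per-column summary with a naive suggested fix.
# Column names, data and the suggestion rules are hypothetical.
from statistics import mean

def summarize(rows):
    """Return {column: stats dict} for a list of row dicts."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v not in (None, "")]
        stats = {"missing": len(values) - len(present), "distinct": len(set(present))}
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric column: report range and mean, suggest imputation if needed.
            stats.update(min=min(present), max=max(present), mean=mean(present))
            stats["suggestion"] = "fill missing with mean" if stats["missing"] else None
        else:
            # Text column: flag values that differ only by case or whitespace.
            trimmed = {v.strip().lower() for v in present}
            stats["suggestion"] = "standardize text" if len(trimmed) < len(set(present)) else None
        summary[column] = stats
    return summary

rows = [
    {"age": 30, "country": "India"},
    {"age": None, "country": "india "},
    {"age": 40, "country": "USA"},
]
report = summarize(rows)
for column, stats in report.items():
    print(column, stats)
```

In a GUI tool, each suggestion would be something you accept with a click rather than a string in a dictionary.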
The Trifacta platform uses the following steps of data preparation:
Discovering: this involves getting a first look at the data and distributions to get a quick sense of what you have
Structuring: this involves assigning proper shape and variable types to the data and resolving anomalies
Cleaning: this step includes processes like imputation, text standardization, etc. which are required to make the data model-ready
Enriching: this step helps in improving the quality of analysis that can be done by either adding data from more sources or performing some feature engineering on existing data
Validating: this step performs final sense checks on the data
Publishing: finally the data is exported for further use
Trifacta is primarily used in the financial, life sciences and telecommunication industries.
MLBase is an open-source project developed by the AMP (Algorithms Machines People) Lab at the University of California, Berkeley. The core idea behind it is to provide an easy solution for applying machine learning to large-scale problems.
It has 3 offerings:
MLlib: It works as the core distributed ML library in Apache Spark. It was originally developed as part of the MLBase project, but now the Spark community supports it
MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions
ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI and MLlib
Auto-WEKA is data mining software written in Java, developed by the Machine Learning Group at the University of Waikato, New Zealand. It is a GUI-based tool which is very good for beginners in data science. The best part about it is that it is open-source and the developers have provided tutorials and papers to help you get started.
Driverless AI is a magical platform for enterprises from h2o.ai that supports automatic machine learning. A 1-month trial version is available as a Docker image at this link. All you have to do is use simple dropdowns to select the train and test files and specify the metric you want to use to track model performance. Sit back and watch as the platform, with its intuitive interface, trains on your dataset to give excellent results on par with a good solution from an experienced data scientist.
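The "search problem over feature extractors and ML algorithms" can be pictured with a small sketch. This is not the MLBase optimizer or its API: it is a brute-force search in plain Python over two made-up feature extractors and two toy learners, scored on a validation set, and every name in it is hypothetical.

```python
# Illustrative sketch only: exhaustive search over (feature extractor, algorithm)
# pairs, scored on a validation set. This is not the MLBase optimizer itself;
# every name below is hypothetical.
from itertools import product

def identity(x):
    return x

def square(x):
    return x * x

def fit_mean(features, targets):
    """Baseline model: always predict the mean training target."""
    mean = sum(targets) / len(targets)
    return lambda f: mean

def fit_line(features, targets):
    """Ordinary least squares for a single feature: y = a * f + b."""
    n = len(features)
    mf, mt = sum(features) / n, sum(targets) / n
    var = sum((f - mf) ** 2 for f in features)
    cov = sum((f - mf) * (t - mt) for f, t in zip(features, targets))
    a = cov / var
    b = mt - a * mf
    return lambda f: a * f + b

def search(train, validation, extractors, algorithms):
    """Return (validation MSE, extractor name, algorithm name) of the best pipeline."""
    results = []
    for extract, fit in product(extractors, algorithms):
        model = fit([extract(x) for x, _ in train], [y for _, y in train])
        mse = sum((model(extract(x)) - y) ** 2 for x, y in validation) / len(validation)
        results.append((mse, extract.__name__, fit.__name__))
    return min(results)

# Toy data where the target is quadratic in x, so square + fit_line should win.
train = [(x, x * x + 1) for x in range(-5, 6)]
validation = [(x + 0.5, (x + 0.5) ** 2 + 1) for x in range(-4, 4)]

best = search(train, validation, [identity, square], [fit_mean, fit_line])
print(best)
```

A real optimizer would likely prune or prioritize this search instead of trying every combination, but the objective it optimizes is of this kind.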
These are some mind-blowing features of Driverless AI:
It offers multi-GPU support for XGBoost, GLM, K-Means and more, which results in excellent training speeds even for large, complex datasets
Automatic feature engineering, tuning and ensembling of a variety of models to produce highly accurate predictions
Great features for interpreting the model along with a panel for real time feature importance ranks during the training process
When there are so many big-name players in this field, how could Microsoft lag behind? The Azure ML Studio is a simple yet powerful browser-based ML platform. It has a visual drag-and-drop environment where there is no requirement of coding. They have published comprehensive tutorials and sample experiments for newcomers to get the hang of the tool quickly. It employs a simple five-step process:
Import your dataset
Perform data cleaning and other preprocessing steps, if necessary
Split the data into training and testing sets
Apply built-in ML algorithms to train your model
Score your model and get your predictions
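If you are wondering what those five steps hide, the sketch below walks through the same sequence by hand in plain Python. It is only an illustration: the inline CSV, the column names and the nearest-centroid classifier are assumptions, not Azure ML Studio modules.

```python
# Illustrative sketch only: the same five steps written out by hand.
# The inline CSV, column names and nearest-centroid model are hypothetical
# stand-ins, not Azure ML Studio modules.
import csv
import io

CSV_TEXT = """height_cm,label
150,short
152,short
,short
155,short
158,short
180,tall
182,tall
185,tall
188,tall
190,tall
"""

# 1. Import the dataset.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# 2. Clean: drop rows with a missing feature and convert text to numbers.
clean = [(float(r["height_cm"]), r["label"]) for r in rows if r["height_cm"]]

# 3. Split into training and testing sets (every fourth row is held out).
test = clean[::4]
train = [row for i, row in enumerate(clean) if i % 4 != 0]

# 4. Train: a nearest-centroid classifier stores the mean feature per label.
def train_centroids(data):
    sums = {}
    for value, label in data:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

centroids = train_centroids(train)

# 5. Score: predict the label whose centroid is closest, then measure accuracy.
def predict(value):
    return min(centroids, key=lambda label: abs(centroids[label] - value))

predictions = [predict(value) for value, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

In a drag-and-drop tool, each numbered comment corresponds roughly to one block on the canvas.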
MLJar is a browser-based platform for quickly building and deploying machine learning models. It has an intuitive interface and allows you to train models in parallel. It comes with built-in hyper-parameter search and makes deploying your model easier. MLJar offers integration with NVIDIA’s CUDA, Python and TensorFlow, among others.
You only need to perform three steps to build a decent model:
Upload your dataset
Train and tune many Machine Learning algorithms and select the best one
Use the best models for predictions and share your results
Currently the tool works on a subscription plan. It has a free plan as well with a 0.25GB dataset limit. It’s definitely worth checking out.
Amazon Lex provides an easy-to-use console for building your own chatbot in a matter of minutes. You can build conversational interfaces in your applications or website using Lex. All you need to do is supply a few phrases and Amazon Lex does the rest! It builds a complete natural language model through which a customer can interact with your app, using both voice and text.
It also comes with built-in integration with the Amazon Web Services (AWS) platform. Amazon Lex is a fully managed service so as your user engagement increases, you don’t need to worry about provisioning hardware and managing infrastructure to improve your bot experience.
How could we leave out IBM Watson from this list? It is one of the most recognizable brands in the world. IBM Watson Studio provides a beautiful platform for building and deploying your machine learning and deep learning models. You can interactively discover, clean and transform your data, use familiar open source tools with Jupyter notebooks and RStudio, access the most popular libraries, and train deep neural networks, among a vast array of other things.
For people just starting out in this field, they have provided a bunch of videos to ease the introductory phase. You can choose to take a free trial and check out this awesome tool for yourself. The above video guides you through how to create a project in Watson Studio.
Automatic Statistician is not a product per se but a research organization which is creating a data exploration and analysis tool. It can take in various kinds of data and uses natural language processing at its core to generate a detailed report. It is being developed by researchers who have worked at Cambridge and MIT and who also won Google’s Focused Research Award, with a prize of $750,000.
It is still under active development but it’s one to keep an eye on in the near future.
KNIME – This tool is great for training machine learning models. It takes some getting used to initially, but the GUI makes it easy to get started. It produces results on par with most tools and is free of cost as well
FeatureLab – It allows easy predictive modeling and deployment using a GUI. One of its best selling points is automated feature engineering
MarketSwitch – This tool is focused more on optimization than on predictive analytics
Logical Glue – Another GUI-based machine learning platform which works from raw data to deployment
Pure Predictive – This tool uses a patented Artificial Intelligence system which obviates the part of data preparation and model tuning; it uses AI to combine 1000s of models into what they call “supermodels”
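Copyright notice: All content on this site comes from Tencent WeChat official accounts and is included through third-party self-service recommendation. The copyright of "Sharing | 19 Tools for Doing Data Science and Machine Learning Without Knowing How to Program" belongs to the original author, "52Psychology". The views expressed in the article do not represent those of Lambda在线, and Lambda在线 bears no legal responsibility. To request removal, contact QQ: 516101458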