Build Machine Learning Solutions for Data Lakes and Data Warehouses

Make machine learning and data science projects simpler with cloud-based technologies

1 Machine Learning: What It Is and How It’s Used

2 Barriers to Machine Learning Adoption

3 Oracle and Machine Learning

4 About Oracle Cloud Infrastructure Data Science

5 About Oracle Machine Learning

6 How Oracle Simplifies Machine Learning

7 The Unique Capabilities of Oracle’s Data Science Service

8 The Unique Capabilities of Oracle Machine Learning

9 The Oracle Data Science Supporting Services

1. Machine Learning: What It Is and How It’s Used

What is Machine Learning?

Data science technology is growing in popularity—and it is fueling one of the hottest segments of the software industry. According to a report by Market Research Future, the global machine learning market will see a compound annual growth rate of 42 percent between 2018 and 2024, driven in part by the rise in the ability to ingest and process unstructured data at scale, and the widespread adoption of cloud-based services.1

Data scientists use machine learning algorithms to extract knowledge and insights from a given data set. These insights can be used by applications and business analytics systems to drive decision-making.

This guide describes how you can use machine learning technology in conjunction with a comprehensive data platform that includes mature Oracle technologies for data management across data lakes and data warehouses, data preparation, and analysis—backed by complementary applications, tools, frameworks, and infrastructure.

Who Uses Machine Learning?

Whenever we interact with banks, shop online, or use social media networks, machine learning algorithms are making our experiences more valuable, efficient, and secure. According to The Data Warehouse Institute (TDWI), which surveyed a wide range of companies about their use of AI and machine learning, 92 percent of today’s companies use machine learning technology in some fashion and 85 percent are building predictive models with machine learning tools.2

For example, financial institutions use machine learning to determine a person’s credit score to aid in loan approval decisions. Manufacturers use machine learning to monitor production equipment and avert potential downtime, or identify root causes of product defects.

From improving crop yields to predicting fraud, forward-looking organizations depend on machine learning technologies to gain better insights, make better decisions, and improve their competitive advantage in rapidly evolving markets.

Use Cases

Machine learning has blossomed across a wide range of industries and supports a variety of use cases, including:

  • Customer lifetime value
  • Anomaly detection
  • Dynamic pricing
  • Predictive maintenance
  • Image classification
  • Recommendation engines
  • Fraud detection
  • Retention / churn models

1 Market Research Future, Machine Learning Market Research Report - Global Forecast to 2024, September 2019.
2 Halper, Fern, Ph.D., Best Practices Report: “Driving Digital Transformation Using AI and Machine Learning” (tdwi.org/bpreports, September 25, 2019).

2. Barriers to Machine Learning Adoption

In the past, businesses have typically made decisions based solely on historical data, gut feelings, or intuition. Today, machine learning enables business leaders to make data-driven decisions that are more insightful, forward-looking, and proactive.

To derive the most value from these machine learning initiatives, you need data management systems to store and secure your data. You also need hardware and software systems to run data science tools. Finally, you need to deliver results in a form that can be consumed by nontechnical users, as well as by other information systems.

The Challenges

The challenges of meeting these requirements fall into four main categories:

  • Access: Machine learning algorithms require lots of data, which is often spread across different applications and databases and controlled by different business units.
  • Complexity: Preparing data and developing machine learning models are often time-consuming tasks and may involve the use of multiple specialized software tools, open-source libraries, and the coordination of multiple contributors and business units.
  • Scalability: Data scientists need the ability to analyze more data, build more models, and score more data at scale. All of this drives data productivity.
  • Deployment: To achieve tangible results from machine learning initiatives, models must be put in an operational context, i.e., deployed, and then monitored and updated as circumstances change.

Defining Terms

Data science uses scientific methods and algorithms to derive knowledge and insights from structured and unstructured data.

Artificial intelligence (AI) is a branch of computer science dealing with the simulation of intelligent behavior in computers.

Machine learning (ML) entails the use of algorithms and statistical models to discover patterns and predict outcomes from data.
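To make the last definition concrete, here is a minimal sketch of "discovering a pattern and predicting an outcome": fitting a straight line to a handful of points by ordinary least squares and using it to predict a new value. The data (machine operating hours versus observed wear) and all names are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical observations: operating hours vs. measured wear.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
wear = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(hours, wear)          # the "discovered pattern"
predicted_wear_at_6h = slope * 6.0 + intercept    # the "predicted outcome"
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"prediction={predicted_wear_at_6h:.2f}")
```

Real projects use far richer models and data, but the loop is the same: learn parameters from historical data, then apply them to unseen inputs.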

3. Oracle and Machine Learning

Oracle Expertise

Oracle has infused machine learning technology throughout nearly every facet of the company. Machine learning helps automate internal operations and has a huge and growing impact on the development of new products and services.

We are accelerating the adoption of machine learning technology as part of a larger focus on data science. Our data science platform:

  • makes data scientists more productive by helping them accelerate and deliver models faster with fewer errors
  • makes it easier for data scientists to work with large volumes and varieties of data
  • delivers trusted, enterprise-grade artificial intelligence that’s bias-free, auditable, and reproducible.

This is not a new venture for Oracle or our customers, many of whom have been building solutions with machine learning-based technology for decades.

At the core of our data science platform are Oracle Cloud Infrastructure Data Science and Oracle Machine Learning. Cloud Infrastructure Data Science helps enterprises collaboratively build, evaluate, manage, and deploy machine learning models. Designed for expert data scientists, it provides access to the best open-source algorithms and libraries, best-in-class hardware for CPUs, and fast networking and storage. It is also data source agnostic and lets users work in Python for machine learning.

For those who wish to use the power of databases and data lakes for machine learning, Oracle Machine Learning provides in-database data preparation and exploration as well as model building, evaluation, and deployment. Designed for multiple user roles and collaboration, Oracle Machine Learning has Oracle-optimized algorithms, uses data from Oracle’s extended data platform, and enables users to work with SQL and R.

Both Oracle Cloud Infrastructure Data Science and Oracle Machine Learning can be used at the same time and will work together so users can take advantage of either option, depending on their needs and desires. Both of these services take advantage of several other services that complement them, and extend support for the rest of the data science lifecycle and data science team at large.

CUSTOMER STORY

AgroScout

AgroScout developers can iterate new software features faster since moving to Oracle Cloud. Deployments of new versions that used to take 24 hours now happen in minutes.

“Success of this vision relies on the ability to manage a continuous and increasing flow of input data and our own AI-based solution to transform that data into precision and decision agriculture, at scale. The speed, scale, and agility of Oracle Cloud have helped us realize our dream. Now, new horizons have opened up with the recent addition of Oracle Cloud Infrastructure Data Science that improves our data scientists’ ability to collaboratively build, train, and deploy machine learning models.” Simcha Shore, Founder and CEO, AgroScout

4. About Oracle Cloud Infrastructure Data Science

Oracle’s Data Science & Machine Learning Capabilities

Oracle Cloud Infrastructure Data Science, based on the acquisition of DataScience.com, was built to accelerate the workload of individual data scientists while also supporting the data science team.

Cloud Infrastructure Data Science is data source agnostic and gives data scientists flexible tools to build, manage, deploy, and monitor machine learning models. These skilled technology professionals can also use their favorite Python machine learning or deep learning frameworks while writing their code interactively in the JupyterLab notebook environment.

Cloud Infrastructure Data Science delivers powerful team capabilities, including:

Shared projects that help data scientists organize and segregate work in different compartments with specific access control policies. In a project, data scientists can view and share models, access notebook sessions, and write version-controlled code.

Inside notebook sessions, users can also connect to Git providers and pull code, make changes tracked with version control, and sync back their changes with Git. All of this ensures that everyone can work off the same versioned code base.

Notebook Sessions that give data scientists the power to specify the compute and storage resources that will be used as they work in their JupyterLab notebooks. A notebook session represents a virtual machine containing the tools and libraries that data scientists use to build, train, and evaluate models. The virtual machine is preloaded with the JupyterLab user interface for Jupyter notebooks. It also comes with over 300 open-source machine learning libraries including:

  • scikit-learn
  • XGBoost
  • LightGBM
  • Keras
  • TensorFlow
  • Dask
  • pandas

Samples and tutorial notebooks are also included.

Increased Productivity Tools

In addition to open source, Oracle provides other capabilities to enhance user productivity, such as:

Accelerated Data Science (ADS) SDK to make common data science tasks faster, easier, and less error prone. This is a Python library unique to Oracle that offers capabilities for accessing, profiling, and manipulating data. ADS also offers a simple interface for model evaluation and interpretation, and Oracle’s AutoML engine for automated model training.

Model Catalogs to enable team members to reliably share their models and capture artifacts needed for model deployment. The model catalog also enables model auditability and reproducibility, allowing data scientists to re-run the same code in the same environment with the same dataset that was used to train a given model. The model catalog in Cloud Infrastructure Data Science tracks model metadata (including the creator, creation date, name, and provenance), and allows for saving model artifacts in service-managed object storage, loading models into notebook sessions for testing, and deploying models to Oracle Functions.

Users can also pull the model from a catalog that a colleague has created, compare the model with one that has already been created, perform all operations on the model, and have all work on models and projects documented in a governed way.
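The catalog idea described above can be sketched in a few lines of plain Python. This is an illustrative in-memory stand-in, not the Oracle Cloud Infrastructure model catalog API: the class names, fields, and the example model are all hypothetical. It shows only the core contract, which is that each model artifact is stored together with its creator, creation date, name, and provenance, so a colleague can load it later.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelEntry:
    """One catalog record: the artifact plus the metadata named in the text."""
    name: str
    creator: str
    provenance: str  # e.g. the Git commit and notebook that produced the model
    artifact: bytes  # serialized model
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


class InMemoryModelCatalog:
    """Hypothetical stand-in for a governed, shared model store."""

    def __init__(self):
        self._entries = {}

    def save(self, entry):
        # Refuse silent overwrites so every saved model stays auditable.
        if entry.name in self._entries:
            raise ValueError(f"model {entry.name!r} already exists")
        self._entries[entry.name] = entry

    def load(self, name):
        return self._entries[name]

    def list_models(self):
        return sorted(self._entries)


catalog = InMemoryModelCatalog()
weights = {"income": 0.5, "debt": -2.0}  # hypothetical trained parameters
catalog.save(ModelEntry(
    name="credit-risk-v1",
    creator="alice",
    provenance="git:abc123 train.ipynb",
    artifact=json.dumps(weights).encode("utf-8"),
))

# A colleague pulls the model back out of the catalog.
restored = json.loads(catalog.load("credit-risk-v1").artifact)
print(catalog.list_models(), restored)
```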

5. About Oracle Machine Learning

Data Science and Machine Learning with Oracle

Oracle Machine Learning refers to the family of products used to analyze and prepare data, and build, evaluate, and deploy machine learning models in Oracle data management systems.

Oracle Machine Learning has had its roots in Oracle Database since the acquisition of Thinking Machines Corporation in 1999. Initially designed to help database users prepare data and build and deploy machine learning models using SQL on data stored in Oracle databases, Oracle Machine Learning has expanded to include R APIs and to take advantage of newer technologies such as big data, Oracle Autonomous Database, and other Oracle Database cloud services.

Oracle provides you with multiple ways to take advantage of the machine learning in Oracle Database. Oracle Machine Learning’s in-database algorithms have been designed to take advantage of parallelism (multi-threading) and distributed execution (across multiple nodes). This provides scalability and performance benefits; model building and scoring can take advantage of multiple CPUs and compute nodes on large-scale hardware such as Oracle Exadata and cloud-based solutions like Oracle Autonomous Database.
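The partition-and-score pattern behind that claim can be sketched as follows. This is not Oracle's in-database implementation. It is a small illustration using Python's standard thread pool, with a hypothetical scoring rule standing in for a trained model. Python threads do not provide true CPU parallelism for this kind of work, so treat the sketch as a picture of the data flow rather than a performance demonstration.

```python
from concurrent.futures import ThreadPoolExecutor


def score_row(row):
    """Hypothetical linear scoring rule standing in for a trained model."""
    return 0.5 * row["amount"] + 2.0 * row["visits"]


def partition(rows, n_parts):
    """Split rows into at most n_parts contiguous, order-preserving chunks."""
    if not rows:
        return []
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def score_partition(rows):
    """Score one chunk; each worker handles its chunk independently."""
    return [score_row(r) for r in rows]


def parallel_score(rows, n_workers=4):
    """Score chunks on separate workers, then stitch results back in order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = list(pool.map(score_partition, partition(rows, n_workers)))
    return [score for chunk in chunks for score in chunk]


rows = [{"amount": float(i), "visits": i % 3} for i in range(10)]
scores = parallel_score(rows)
print(scores[:3])
```

Because each chunk is scored independently, the same structure extends from threads on one machine to processes on many nodes. That independence is what lets model scoring scale with the available hardware.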

CUSTOMER STORY

CaixaBank

CaixaBank needed a solution that would enable it to add value to its business and adapt to an evolving sector quickly and seamlessly. Oracle provides a solution that responds to the bank’s need for cutting-edge information management, enabling it to gain a 360° understanding of its customers from internal and external data and to offer them tailored, on-demand solutions.

"Oracle Big Data Appliance, Oracle Advanced Analytics, and Oracle Real-Time Decisions enable us to quickly find patterns and correlations in our customers’ online interaction with us. We’ve improved our business agility and flexibility as well as our ability to know and serve our customers, ultimately focusing on creating value for them rather than solving IT issues." Luis Esteban Grifoll, Chief Data Officer, CaixaBank

Oracle ML Products

The Oracle family of products for in-database machine learning:

Oracle Machine Learning for SQL (OML4SQL) supports Oracle Database and Oracle Autonomous Database, where data scientists use SQL and PL/SQL code to build machine learning models and score data. You can use Oracle Machine Learning Notebooks on Autonomous Database, or Oracle SQL Developer or another SQL integrated development environment (IDE), to write SQL code. The SQL interface makes machine learning more accessible to users skilled in SQL.

Oracle Machine Learning Notebooks are based on Apache Zeppelin notebooks. With Autonomous Database, they provide a collaborative notebook interface for developing and executing SQL, PL/SQL, and OML4SQL code. Oracle Machine Learning Notebooks support data scientists, data analysts, application developers, and DBAs with easily shared notebooks and templates. They deliver a collaborative framework that provides access permissions, versioning, and execution scheduling.

Notebooks are organized at two levels: workspaces that control sharing, and projects for finer-grained organization.

Oracle Machine Learning for R (OML4R) supports Oracle Database, both in the cloud and on-premises, enabling scalable data exploration, data preparation, model building, and scoring using in-database algorithms—through a natural R interface with the user’s R IDE of choice. R users can also deploy user-defined functions that leverage third-party R packages, which can be invoked via SQL for ease of deployment.

Oracle Data Miner is a no-code graphical tool for data exploration, preparation, model building and evaluation, data scoring, as well as solution deployment. Users construct analytical workflows in this drag-and-drop extension to Oracle SQL Developer. This makes machine learning accessible to a broader set of users, including DBAs and other Oracle Database experts.

Oracle Machine Learning for Spark (OML4Spark) supports data science projects involving data in the data lake, leveraging the full Spark cluster. It provides an R interface for data exploration and preparation, as well as Spark-based machine learning algorithms—both scalable proprietary implementations and those tightly integrated from third-party libraries.

6. How Oracle Simplifies Machine Learning

The Process

Machine learning can be complicated, but having the right tools makes the entire process simpler. Here are just a few examples of how Oracle works to continually improve the machine learning process.

Managing the Entire Model Lifecycle

The process of building a machine learning model is an iterative one. Oracle Cloud Infrastructure Data Science and Oracle Machine Learning make it easier to manage models throughout their entire lifecycle.

Building Models

Cloud Infrastructure Data Science’s JupyterLab environment offers a variety of open-source libraries for building machine learning models. It also includes the Accelerated Data Science (ADS) SDK. Data scientists can automate model training through the ADS AutoML API, which provides automated feature engineering, algorithm selection, and hyperparameter tuning. This automates the process of training and selecting model candidates, and saves significant time for data scientists.
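To show what automated hyperparameter tuning does in principle, here is a small sketch that tries several candidate settings, scores each on held-out data, and keeps the best. It does not use the ADS AutoML API. The model (a one-dimensional k-nearest-neighbor regressor), the data, and the function names are all hypothetical, and exhaustive search is used only because it is the simplest strategy to read.

```python
from statistics import fmean


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return fmean(y for _, y in nearest)


def mean_abs_error(train, valid, k):
    """Average absolute prediction error on the validation set."""
    return fmean(abs(knn_predict(train, x, k) - y) for x, y in valid)


def tune_k(train, valid, candidates):
    """Evaluate every candidate k and return the best one with all scores."""
    scores = {k: mean_abs_error(train, valid, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical data following y = 2x, split into training and validation sets.
data = [(float(x), 2.0 * x) for x in range(20)]
train, valid = data[::2], data[1::2]

best_k, scores = tune_k(train, valid, candidates=[1, 2, 5])
print(best_k, scores)
```

An AutoML engine applies the same evaluate-and-select loop across many algorithms and many more settings, usually with smarter search strategies than trying every combination.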

Oracle Machine Learning enables you to build models that reside in Oracle Database using SQL and R, key languages used by data scientists. In-database algorithms provide automatic data preparation (ADP) to enable production of a quick, yet reasonable, first-cut model from data. ADP performs normalization, outlier treatment, missing value treatment, and binning as needed by individual algorithms.
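The four preparation steps named above can be illustrated with a short plain-Python sketch. This is not how Oracle's ADP is implemented. The specific choices here (clipping to fixed bounds, mean imputation, min-max scaling, equal-width bins) are assumptions made for illustration, and the sample ages are hypothetical.

```python
from statistics import fmean


def clip_outliers(values, lower, upper):
    """Outlier treatment: pull extreme values back to the given bounds."""
    return [v if v is None else min(max(v, lower), upper) for v in values]


def impute_missing(values):
    """Missing value treatment: replace None with the observed mean."""
    fill = fmean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Binning: map values in [0, 1] to bin indexes 0 .. n_bins - 1."""
    return [min(int(v * n_bins), n_bins - 1) for v in values]


raw_ages = [25, None, 40, 31, 250, 58]  # 250 is an obvious data-entry error
clipped = clip_outliers(raw_ages, lower=0, upper=100)
imputed = impute_missing(clipped)
normalized = min_max_normalize(imputed)
binned = equal_width_bins(normalized, n_bins=4)
print(binned)
```

Clipping before imputing keeps the outlier from distorting the fill value. The right order and method depend on the algorithm, which is why automating the choice per algorithm saves time.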

Evaluating Models

Oracle Cloud Infrastructure Data Science’s model evaluation generates a comprehensive suite of evaluation metrics and suitable visualizations to measure model performance against new data. It can also rank models over time to enable optimal behavior in production.

Oracle Machine Learning provides statistics for evaluating model quality, including a cost model to account for different impacts of false positives and false negatives. It provides visual comparison of automatically generated in-database model quality statistics.
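A cost model matters because two models with the same accuracy can have very different business impact. The sketch below uses hypothetical fraud labels and costs, and is not Oracle's implementation. It counts false positives and false negatives for two models and weights them by what each kind of error costs.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1  # flagged, but actually legitimate
        elif a:
            counts["fn"] += 1  # missed fraud
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def total_cost(counts, cost_fp, cost_fn):
    """Weight each error type by its (assumed) business cost."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


actual = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 1, 0, 1, 1, 0]  # one missed fraud, one false alarm
model_b = [1, 1, 1, 1, 0, 1, 1, 0]  # no missed fraud, two false alarms

counts_a = confusion_counts(actual, model_a)
counts_b = confusion_counts(actual, model_b)

# Assumed costs: a missed fraud costs 100, a false alarm costs 5.
cost_a = total_cost(counts_a, cost_fp=5, cost_fn=100)
cost_b = total_cost(counts_b, cost_fp=5, cost_fn=100)
print(accuracy(counts_a), accuracy(counts_b), cost_a, cost_b)
```

Both models are correct on six of eight cases, yet under these assumed costs model B is much cheaper to operate because it never misses a fraud.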

Explaining Models

Cloud Infrastructure Data Science offers a model explanation capability which provides automated explanation of the relative weighting and importance of the factors that go into generating a prediction.

Oracle Machine Learning’s in-database models provide model-specific explanatory details on model predictions to understand why a given prediction was made. Model details are also available for understanding model behavior.
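For intuition about what a prediction-level explanation contains, consider the simplest case, a linear model, where each factor's contribution is its weight times its value. The sketch below is an illustration under that assumption, with hypothetical weights and applicant data. It is not the explanation technique used by either Oracle service, which the text above does not specify.

```python
def explain_prediction(weights, bias, features):
    """Return the prediction, each feature's contribution, and relative shares."""
    contributions = {
        name: weights[name] * value for name, value in features.items()
    }
    prediction = bias + sum(contributions.values())
    total = sum(abs(c) for c in contributions.values())
    shares = {
        name: (abs(c) / total if total else 0.0)
        for name, c in contributions.items()
    }
    return prediction, contributions, shares


# Hypothetical credit-scoring model and applicant.
weights = {"income": 0.5, "debt": -2.0, "years_employed": 1.0}
applicant = {"income": 60.0, "debt": 10.0, "years_employed": 5.0}

prediction, contributions, shares = explain_prediction(
    weights, bias=10.0, features=applicant)

# List factors from most to least influential for this one prediction.
for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name}: contribution={contributions[name]:+.1f} "
          f"share={shares[name]:.0%}")
```

Non-linear models need more elaborate techniques, but the output has the same shape: a ranked list of factors and how strongly each pushed the prediction up or down.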

Deploying Models

Cloud Infrastructure Data Science enables teams to operationalize models as scalable and secure APIs. With Cloud Infrastructure Data Science, the model can be deployed to Oracle Functions, which means that the model can be deployed as a Docker container. This container can be consumed from anywhere: from the database, from inside a notebook session, or from an application inside or outside Oracle Cloud Infrastructure, or even from another cloud.
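Whatever the hosting platform, a model served as an API comes down to a handler that accepts a request, runs the model, and returns a response. The sketch below shows that shape using only the Python standard library. It does not use the Oracle Functions SDK, and the model, payload format, and function names are hypothetical.

```python
import json

# Hypothetical trained model: a bias plus per-feature weights.
MODEL = {"bias": 10.0, "weights": {"income": 0.5, "debt": -2.0}}


def predict(features):
    """Apply the linear model to one record of named features."""
    return MODEL["bias"] + sum(
        weight * features[name] for name, weight in MODEL["weights"].items()
    )


def handle_request(body):
    """Take a JSON request body (str) and return a JSON response body (str)."""
    try:
        payload = json.loads(body)
        score = predict(payload["features"])
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing fields: report the problem to the caller.
        return json.dumps({"error": f"bad request: {exc!r}"})
    return json.dumps({"prediction": score})


print(handle_request('{"features": {"income": 60, "debt": 10}}'))
print(handle_request("not json"))
```

Because the handler exchanges plain JSON, any client that can make an HTTP call can consume the model once it is packaged in a container and placed behind an endpoint.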

With Oracle Machine Learning, in-database models are immediately deployable through SQL. User-defined R functions can also be deployed through Oracle Database—invoked from R, SQL or REST APIs.

7. Unique Capabilities of Cloud Infrastructure Data Science

We designed Cloud Infrastructure Data Science services to manage the full model lifecycle. The service was also created to serve different needs, with distinctive features and capabilities to best suit different use cases.

Ensure a Collaborative Workflow

Cloud Infrastructure Data Science is designed so that data scientists can work in teams and so that all work is centralized, with no loss of knowledge. Data scientists can work in “projects” that give a high-level view of what’s happening. They can share and reuse data science assets and test their colleagues’ models. They can also check their code in and out of a Git repository, save their models in the model catalog, and deploy them as Oracle Functions, which makes it easier to work collaboratively with each other and with external teams to deploy successful machine learning models.

Use Your Favorite Open Source Tools

With Oracle Cloud Infrastructure Data Science, you can use the best of open source, including:

  • tools and languages like Python and JupyterLab
  • visualization libraries like Plotly, Matplotlib, and seaborn
  • data manipulation tools like Dask, pandas, and NumPy
  • machine-learning libraries like TensorFlow, Keras, scikit-learn, and XGBoost
  • version control with Git.

Oracle Cloud Infrastructure Data Science was designed to take advantage of the benefits of open source technology. For example, users can utilize open source tools and utilities to:

  • train their models by adding a new library that includes the latest and greatest from the AI/machine learning research community
  • easily get started by integrating pre-built use cases created for time-series modeling, Bayesian modeling, deep learning, anomaly detection, supervised modeling, and unsupervised modeling
  • use client libraries that connect to a variety of data sources and ingest data in a variety of formats
  • use Oracle Cloud Infrastructure’s tutorial content and example notebooks that show users how to connect to these different curated libraries for modeling in a variety of use cases.

Be Data Source Agnostic

Data scientists often want the ability to connect to data from many sources, and the majority of data doesn’t reside in databases. Oracle Cloud Infrastructure Data Science is data source agnostic. It contains a variety of client libraries to connect to multiple data sources on Oracle Cloud Infrastructure, as well as other clouds.

Use the Power of Oracle Cloud Infrastructure

Cloud Infrastructure Data Science is native to Oracle Cloud Infrastructure (OCI), which means it can use the best of the newest cloud on the market.

Oracle’s broad range of compute options, fast networking speeds, scalable independent storage, modern architecture, and competitive pricing make it an ideal infrastructure platform for machine learning and AI.

“Employing data scientists or using augmented intelligence isn’t enough to create a successful AI deployment. AI requires a modern data infrastructure to support new data types and often massive amounts of data. Many organizations are moving to the cloud for data management.” Fern Halper, Ph.D., The Data Warehouse Institute

8. The Unique Capabilities of Oracle Machine Learning

Enable Data Science Contributors Across Roles

Oracle Machine Learning supports a range of user roles, including data scientists, data and business analysts, DBAs and IT professionals, and application and dashboard developers. The Oracle Data Miner user interface empowers both experts and non-experts to use in-database machine learning through a drag-and-drop interface.

Designed for Performance and Scaling

With Autonomous Database, users can take advantage of auto scaling with Oracle Machine Learning. With auto scaling enabled, the database can use up to three times more CPU and IO resources than specified by the number of OCPUs explicitly allocated. When auto scaling is enabled, if your workload requires additional CPU and IO resources the database automatically uses needed resources without any manual intervention required.
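The arithmetic of that ceiling is simple enough to sketch. The factor of three comes from the text above. The function names and the idea of expressing workload demand in OCPUs are assumptions made for illustration, and this is not how the service itself is configured or queried.

```python
AUTO_SCALE_FACTOR = 3  # "up to three times", per the description above


def max_ocpus(allocated_ocpus, auto_scaling_enabled):
    """Upper bound on OCPUs the database may use."""
    if auto_scaling_enabled:
        return allocated_ocpus * AUTO_SCALE_FACTOR
    return allocated_ocpus


def ocpus_used(allocated_ocpus, demand, auto_scaling_enabled):
    """OCPUs actually consumed: demand, capped at the applicable ceiling."""
    return min(demand, max_ocpus(allocated_ocpus, auto_scaling_enabled))


# A database with 4 allocated OCPUs facing a burst that needs 10.
print(ocpus_used(4, demand=10, auto_scaling_enabled=False))
print(ocpus_used(4, demand=10, auto_scaling_enabled=True))
print(ocpus_used(4, demand=20, auto_scaling_enabled=True))
```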

Whether running on-premises, on Oracle Database Cloud Service, or on Autonomous Database, users benefit from the underlying Exadata architecture. Oracle Machine Learning’s algorithms are parallel and distributed to leverage multi-node and high CPU/RAM environments. Scoring data with in-database models also takes advantage of storage-level “smart scan” technology for faster scoring times.

Build and Execute Machine Learning Within Oracle Database

Oracle Machine Learning was designed to satisfy a fundamental need: to maximize the value of your data by making it easy for data science teams to take advantage of machine learning technology, scale to handle large volume data, and deploy data science solutions quickly and easily.

The Oracle Machine Learning platform includes mature technologies to profile, explore, and prepare data, and build, evaluate, and deploy machine learning models within Oracle Database. This opens up machine learning capabilities to application developers, database administrators, DevOps professionals, analysts, and many other people who already work with Oracle Database—enabling them to build on their existing skills and extract value from these corporate assets.

Minimize Data Movement with In-Database Processing

You don’t have to move data out of your secure Oracle environment to create machine learning models. Instead, Oracle has brought the algorithms to where the data reside. Oracle allows you to store many types of data in their native form and to apply machine learning algorithms directly to the data, without having to transform it into another format or move it to another analytical engine.

These in-database machine learning capabilities include more than 30 algorithms optimized for supporting machine learning projects on data stored right in Oracle Autonomous Data Warehouse and Oracle Database.
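The idea of bringing the algorithm to the data can be demonstrated in miniature with Python's built-in sqlite3 module. SQLite is only a stand-in here and works differently from Oracle Database. The table, the scoring rule, and the function name churn_score are hypothetical. What carries over is the pattern: the scoring logic is registered with the database engine and invoked from SQL, so rows are scored by a query instead of being exported to a separate tool.

```python
import sqlite3


def churn_score(monthly_spend, support_calls):
    """Hypothetical scoring rule standing in for a trained model."""
    return round(0.1 * support_calls - 0.001 * monthly_spend, 3)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, monthly_spend REAL, support_calls INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 120.0, 0), (2, 40.0, 6), (3, 80.0, 2)],
)

# Register the model with the engine so that SQL can call it directly.
conn.create_function("churn_score", 2, churn_score)

# Score and rank every customer with a single query.
rows = conn.execute(
    "SELECT id, churn_score(monthly_spend, support_calls) AS score "
    "FROM customers ORDER BY score DESC"
).fetchall()
conn.close()

for customer_id, score in rows:
    print(customer_id, score)
```

Because scoring is just a SQL expression, any application, report, or dashboard that can already query the database can consume predictions without a separate data pipeline.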

9. The Oracle Data Science Supporting Services

Machine learning with Oracle extends beyond individual services. Oracle offers a wide range of options to suit organizations and their machine learning needs.

Create Your Own Machine Learning Setup

Oracle provides additional solutions through Oracle Cloud Infrastructure, which offers industry-leading compute, network, storage, and database services that have been optimized to run customized computing environments for AI and machine learning.

From Oracle Autonomous Database and Oracle Cloud Infrastructure Virtual Machines for Data Science to Oracle Cloud Infrastructure Data Catalog and Oracle Cloud Infrastructure Data Flow, Oracle’s services help users enhance and expand upon what they can accomplish with data science.

If you’re ready to get started, Oracle offers several hands-on labs and guided tutorials to experience its Data Science and Machine Learning services.

Other Important Data Science Platform Services

We’ve built a comprehensive set of products and capabilities to help organizations achieve their data science goals:

  • Oracle Autonomous Data Warehouse: Build a self-service data warehouse that eliminates management complexity by relying on automation for provisioning, tuning, scaling, patching, and encryption.
  • Oracle Cloud Infrastructure Data Catalog: Allows users to discover, find, organize, enrich, and trace data assets on Oracle Cloud. It also has a built-in business glossary, making it easy to curate and discover the right data.
  • Oracle Big Data Service: Offers a full Cloudera Hadoop implementation, with dramatically simpler management than other Hadoop offerings, including just one click to make a cluster highly available and to implement security. It also includes Oracle Machine Learning for Spark, which allows organizations to run machine learning in memory with one product and with minimal data movement.
  • Oracle Cloud SQL: Enables SQL queries on data in HDFS, Hive, Kafka, NoSQL, and Object Storage. Only Oracle Cloud SQL enables any user, application, or analytics tool that can talk to Oracle databases to transparently work with data in other data stores, with the benefit of pushdown, scale-out processing to minimize data movement.
  • Oracle Cloud Infrastructure Data Flow: A fully managed big data service that allows users to run Apache Spark applications with no infrastructure to deploy or manage. It enables enterprises to deliver big data and AI applications faster. Unlike competing Hadoop and Spark services, Cloud Infrastructure Data Flow includes a single window to track all Spark jobs, making it simple to identify expensive tasks and troubleshoot problems.
  • Oracle Cloud Infrastructure Virtual Machines for Data Science: Preconfigured GPU-based environments with common IDEs, notebooks, and frameworks that can be up and running in under 15 minutes, for $30 a day.
  • Oracle Graph: Augment machine learning with more than 60 graph algorithms to increase accuracy and bring a faster, more powerful new perspective to machine learning.

A Complete Data Platform for Machine Learning

To make full use of today’s machine learning technologies, enterprises need a centralized platform that includes the best data science tools, frameworks, and infrastructure. The Oracle data science platform transforms the way your organization approaches data science projects by helping you extract more value from your data.
