Internal data is data that's generated within an organization by its business processes and operations. These business processes can generate large volumes of data that are specific to that organization's operations. This data can take the form of net revenues, sales to customers, new customer acquisitions, employee turnover, units produced, cost of raw materials, and many other kinds of time series or transactional information. This historical and current data is valuable to organizations that wish to identify patterns and trends, as well as for forecasting and future planning. Importantly, the data relevant to a domain and question is almost never housed within a single data source; organizations inevitably have multiple sources of relevant data.
Business intelligence is most effective when internal data is combined with external data. External data is data that is generated outside the boundaries of an organization's operations. Such data includes overall global economic performance, census information, and competitor prices. All of this data exists irrespective of any particular organization.
Each domain and question will have internal and external data that is relevant and irrelevant to answering the question at hand. However, do not assume that choosing manufacturing/production as the domain means that other domains, such as sales and marketing, have no relevant sources of data. If you are trying to forecast the required production levels, sales data such as the sales pipeline can be very relevant. Similarly, external data that points toward overall economic growth may also be extremely relevant, while data such as the cost of raw materials may very well be irrelevant.
Structured data is data that conforms to a rather formal specification of tables with rows and columns. Think of a spreadsheet where you might have columns for the transaction ID, customer, units purchased, and price per unit. Each row represents a sales transaction. Structured data sources are the easiest sources for business intelligence tools to consume and analyze. These sources are most often relational databases, which include technologies such as Microsoft SQL Server, Microsoft Access, Azure SQL Database, Oracle, MySQL, IBM DB2, Teradata, PostgreSQL, Informix, and Sybase. In addition, this category of data sources includes relational database standards and APIs such as Open Database Connectivity (ODBC) and Object Linking and Embedding, Database (OLE DB).
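As a minimal sketch of what this looks like in practice, the following Python example builds the spreadsheet described above as a relational table, using the standard library's sqlite3 module in place of a server database. The table name, column names, customer names, and values are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical sales table mirroring the spreadsheet columns described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        transaction_id  INTEGER PRIMARY KEY,
        customer        TEXT    NOT NULL,
        units_purchased INTEGER NOT NULL,
        price_per_unit  REAL    NOT NULL
    )
    """
)

# Each tuple is one row, that is, one sales transaction (made-up values).
rows = [
    (1, "Contoso", 10, 2.50),
    (2, "Fabrikam", 3, 10.00),
    (3, "Contoso", 6, 2.50),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Because every row has the same typed columns, aggregation is a single query.
revenue_by_customer = conn.execute(
    """
    SELECT customer, SUM(units_purchased * price_per_unit) AS revenue
    FROM sales
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()
conn.close()

print(revenue_by_customer)
```

Every row shares the same columns and types, so a tool can group and sum the data without first having to work out its shape. This is the property that makes structured sources easy to consume.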
Unstructured data is effectively the opposite of structured data. Unstructured data cannot be organized into simple tables with rows and columns. Such data includes videos, audio, images, and text. Word processing documents, emails, social media posts, and web pages are also examples of largely unstructured data. Unstructured data sources are the most difficult types of sources for business intelligence tools to consume and analyze. This type of data is stored either as binary large objects (BLOBs) or as files in a filesystem such as the New Technology File System (NTFS) or the Hadoop Distributed File System (HDFS).
Unstructured data also includes so-called NoSQL databases, which include data stores such as document databases, graph databases, key-value stores, and wide-column stores. These databases are specifically designed to store unstructured data. Document databases include Microsoft Azure Cosmos DB, MongoDB (originally developed by 10gen), Cloudant (IBM), Couchbase, and MarkLogic. Graph databases include Neo4j and HyperGraphDB. Key-value stores include Microsoft's Cosmos DB, Basho Technologies' Riak, Redis, Aerospike, Amazon Web Services' DynamoDB, and Couchbase, as well as offerings from MapR Technologies and Oracle. Finally, wide-column stores include Apache Cassandra (commercially backed by DataStax) and HBase.
Semi-structured data has a structure but does not conform to the formal definition of structured data, that is, tables with rows and columns. Examples of semi-structured data include tab-delimited and other delimited text files, XML, other markup languages such as HTML and XSL, JavaScript Object Notation (JSON), and electronic data interchange (EDI). Semi-structured data sources have a self-defining structure that makes them easier to consume and analyze than unstructured data sources, but they require more work than true structured data sources.
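To illustrate the extra work involved, the following Python sketch flattens a nested JSON document into rows and columns. The order document, its field names, and the flatten_order helper are hypothetical and exist only for this example; real sources will have their own shapes.

```python
import json

# Hypothetical order document; field names and values are made up.
raw = """
{
  "order_id": 1001,
  "customer": {"name": "Contoso", "region": "West"},
  "lines": [
    {"sku": "A-1", "units": 2, "price": 9.5},
    {"sku": "B-7", "units": 1, "price": 20.0}
  ]
}
"""


def flatten_order(order):
    """Return one flat dict (row) per order line, repeating order-level fields."""
    return [
        {
            "order_id": order["order_id"],
            "customer_name": order["customer"]["name"],
            "customer_region": order["customer"]["region"],
            "sku": line["sku"],
            "units": line["units"],
            "price": line["price"],
        }
        for line in order["lines"]
    ]


rows = flatten_order(json.loads(raw))
for row in rows:
    print(row)
```

The JSON names its own fields, which is the self-defining structure mentioned above, but the nesting has to be unrolled before the data fits a table: the single order becomes two rows, one per order line, with the customer fields repeated on each.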
Semi-structured data also includes data access protocols, such as the Open Data Protocol (OData) and other Representational State Transfer (REST) APIs. These protocols provide interfaces to data sources such as Microsoft SharePoint, Microsoft Exchange, Microsoft Active Directory, and Microsoft Dynamics; social media systems such as Twitter and Facebook; as well as other online systems such as MailChimp, Salesforce, Smartsheet, Twilio, Google Analytics, and GitHub, to name a few. These data protocols abstract how the data is stored, whether that is a relational database, NoSQL database, or simply a bunch of files.
The vast majority of business intelligence tools, such as Power BI, are optimized for handling structured and semi-structured data. Structured data sources integrate natively with how business intelligence tools are designed. In addition, business intelligence tools are designed to ingest semi-structured data sources and transform them into structured data. Unstructured data is more difficult but not impossible to analyze with business intelligence tools. In fact, Power BI has a number of features that are designed to ease the ingestion and analysis of unstructured data sources. However, analyzing such unstructured data has its limitations.
A model, or data model, refers to the way in which one or more data sources are organized in order to support analysis and visualization. Models are built by transforming and cleansing data, defining the types of data within those sources, and assigning data categories to specific data types.
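The steps named above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular business intelligence tool implements it; the column names, the sample values, the clean_row helper, and the category labels are all assumptions.

```python
from datetime import date, datetime

# Raw rows as they might arrive from a text export (made-up values).
raw_rows = [
    {"order_date": "2023-01-15", "country": " usa ", "amount": "1,200.50"},
    {"order_date": "2023-02-01", "country": "Canada", "amount": "300"},
    {"order_date": "", "country": "USA", "amount": "75.25"},  # missing date
]

# Defining data types: the type each column should have in the model.
column_types = {"order_date": date, "country": str, "amount": float}

# Defining data categories: illustrative labels that tell a tool how to
# interpret a column, for example when plotting it on a map.
column_categories = {"country": "Country"}


def clean_row(row):
    """Return a typed, trimmed copy of row, or None if the date is missing."""
    if not row["order_date"]:
        return None
    return {
        "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
        "country": row["country"].strip().upper(),
        "amount": float(row["amount"].replace(",", "")),
    }


# Transforming and cleansing: convert each row and drop the unusable ones.
cleaned = [r for r in map(clean_row, raw_rows) if r is not None]

for row in cleaned:
    # Check that each cleaned value matches its declared column type.
    assert all(isinstance(row[col], typ) for col, typ in column_types.items())
    print(row)
```

Here the row with the missing date is dropped, text values are trimmed and normalized, and strings are converted to dates and numbers so that later calculations behave as expected.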
Models can be extremely simple, such as a single table with columns and rows. However, business intelligence almost always involves multiple tables of data, and most often involves multiple tables of data coming from multiple sources. Thus, the model becomes more complex as the various sources and tables of data must be combined into a cohesive whole. This is done by defining how eac...