How to extract Twitter data with a Windows service created with Python in Visual Studio 2017


Hi everyone,

In this post, we will write a Python script in Visual Studio 2017 and run it as a Windows service in order to extract, in almost real time, the tweets related to certain words or hashtags, store them in a SQL Server database, and then consume them with Power BI.

The script connects to the Twitter streaming API using tokens generated from the official Twitter developer portal. It then makes a request to the API, passing a list of hashtags or terms as a parameter, and receives indefinitely, in “real time”, the tweets that are published containing those terms. The tweets obtained are stored in a buffer of parameterizable size; when the buffer reaches its maximum size, the “n” tweets it holds are written to a SQL Server table.

In order to do this, the first thing we must do is create a new Python project in the Visual Studio 2017 environment:

The next step is to add the libraries that we will use in our script and that will do a great deal of the work for us. In this case we will make use of the following:

  • The json library, to handle JSON data (the Twitter API returns data in this format).
  • The pandas library, to store data in data frames (in-memory data structures similar to SQL tables).
  • The tweepy library, which gives us access to the Twitter API.
  • pyodbc and SQLAlchemy, which provide the tools to connect to a SQL Server database.

To be able to make use of these libraries, we must add them to our Python environment from Solution Explorer:

Search for each library in the Python environment:

Then we will add the following code to the main Python script:

import json
import pandas as pd
import pyodbc
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from pandas.io import sql
from sqlalchemy import create_engine
from pandas.io.json import json_normalize

In order to access the Twitter API, we need to create an application in the Twitter developer portal at https://apps.twitter.com/ (applications can be registered with any Twitter user). When an application is registered, some tokens are generated that are necessary to access the API; these tokens will be stored in variables in our script:

#Declare variables that contains the user credentials to access Twitter API
#You can get your own keys in https://apps.twitter.com/
#--------------------------------------------------------------------------------
aToken =       "ENTER YOUR ACCESS TOKEN"
aTokenSecret = "ENTER YOUR ACCESS TOKEN SECRET"
cKey =         "ENTER YOUR API KEY"
cSecret =      "ENTER YOUR API SECRET"

 

In our example, we use the tokens of the app created earlier:

The workflow of our script will be this:

The whole code is here:

#Import libraries

import json
import pandas as pd
import pyodbc
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from pandas.io import sql
from sqlalchemy import create_engine
from pandas.io.json import json_normalize


#Declare variables that contains the user credentials to access Twitter API
#You can get your own keys in https://apps.twitter.com/
#--------------------------------------------------------------------------------
aToken =       "ENTER YOUR ACCESS TOKEN"
aTokenSecret = "ENTER YOUR ACCESS TOKEN SECRET"
cKey =         "ENTER YOUR API KEY"
cSecret =      "ENTER YOUR API SECRET"
#--------------------------------------------------------------------------------


#Define after how many tweets we do an insert into the database.
bufferSize = 5
#Define an array to store the tweets read from the streaming API
twittsBuffer = []
#Define a connection to read and write to the SQL Server database
engine = create_engine("mssql+pyodbc://MyDbUser:MyPassword@MySQLServer/TwitterDB?driver=SQL+Server+Native+Client+11.0")

#Define a function that receives a tweet as a parameter and stores it in the twittsBuffer variable.
#When twittsBuffer reaches the defined bufferSize, it calls AddTwittsToDB, which inserts the tweets
#held in twittsBuffer into the SQL Server database and clears the buffer.
#--------------------------------------------------------------------------------
def AddTwittToBuffer(twitt):
    
    global twittsBuffer
    twittsBuffer.append(twitt)
    
    if (len(twittsBuffer) == bufferSize):
        AddTwittsToDB(twittsBuffer)
        twittsBuffer = []
    print(twitt['coordinates'] if twitt['coordinates'] is not None else 'no coordinates')

    return 

#This function writes the tweets received in the twitts parameter to the SQL Server database
#--------------------------------------------------------------------------------
def AddTwittsToDB(twitts):
    tData = {'id': [], 
             'text': [], 
             'screen_name': [], 
             'created_at': [],
             'retweet_count': [], 
             'favorite_count': [],
             'friends_count': [], 
             'followers_count': [], 
             'lang':[], 
             'country':[], 
             'latitude':[], 
             'longitude':[]}
    
    for t in twitts:
        tData['id'].append(t['id'])
        tData['text'].append(t['text'])
        tData['screen_name'].append(t['user']['screen_name'])
        tData['created_at'].append(t['created_at'])
        tData['retweet_count'].append(t['retweet_count'])
        tData['favorite_count'].append(t['favorite_count'])
        tData['friends_count'].append(t['user']['friends_count'])
        tData['followers_count'].append(t['user']['followers_count'])
        tData['lang'].append(t['lang'])

        if t['place'] is not None:
            tData['country'].append(t['place']['country'])
        else:
            tData['country'].append(None)
        
        if t['coordinates'] is not None:
            tData['longitude'].append(t['coordinates']['coordinates'][0])
            tData['latitude'].append(t['coordinates']['coordinates'][1])
        else:
            tData['longitude'].append(None)
            tData['latitude'].append(None)
    tweets = pd.DataFrame(tData)
    tweets.set_index('id', inplace=True)
    tweets.to_sql("Tweets",engine,None,if_exists='append')
    return True
#--------------------------------------------------------------------------------
#Create a listener class that processes received tweets
#On error, print the status code
#--------------------------------------------------------------------------------
class StdOutListener(StreamListener):

    def on_data(self, data):
        t= json.loads(data)
        AddTwittToBuffer(t)
        return True

    def on_error(self, status):
        print(status)
#--------------------------------------------------------------------------------
#Define the main entry point of the program
if __name__ == '__main__':

    #This object handles Twitter authentication and the connection to the Twitter Streaming API
    myListener = StdOutListener()
    authenticator = OAuthHandler(cKey, cSecret)
    authenticator.set_access_token(aToken, aTokenSecret)
    stream = Stream(authenticator, myListener)

    #This line filters the Twitter stream to capture tweets that contain any of the tracked terms listed below
    stream.filter(track=['PowerBI', 'Tableau', 'Qlikview','Microstrategy','Pyramid Analytics','Business Objects', 'Ibm cognos'])
#--------------------------------------------------------------------------------

 

During the execution of the script, if the destination table does not exist, it will be created and the tweets will be inserted into the table.

The table we created will have the following columns:

  1. The id of the tweet
  2. The country of origin of the tweet (if provided)
  3. The creation date in text format (we will later transform it to a date format)
  4. The number of favorites of the tweet (at the time it was captured)
  5. The number of followers of the user who tweeted (at the time of the tweet)
  6. The number of friends of the user who tweeted (at the time of the tweet)
  7. The language detected by the Twitter API
  8. The latitude (if provided)
  9. The longitude (if provided)
  10. The number of retweets of the message
  11. The screen name of the user who tweeted
  12. The text of the tweet

If we query the database we will see the following:
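
As a quick check, you could also query the table from Python; a minimal sketch that reuses the same engine defined in the script (table and column names are the ones created above):

#Quick sanity check of the inserted tweets (reusing the same engine defined in the script)
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://MyDbUser:MyPassword@MySQLServer/TwitterDB?driver=SQL+Server+Native+Client+11.0")
#Last tweets inserted and a simple count per language
lastTweets = pd.read_sql("SELECT TOP 10 id, screen_name, lang, created_at, [text] FROM Tweets ORDER BY id DESC", engine)
tweetsPerLang = pd.read_sql("SELECT lang, COUNT(*) AS tweets FROM Tweets GROUP BY lang", engine)
print(lastTweets)
print(tweetsPerLang)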

In our example, we are getting tweets that contain the following terms: ‘PowerBI’, ‘Tableau’, ‘Qlikview’, ‘Microstrategy’, ‘Pyramid Analytics’, ‘Business Objects’ and ‘IBM Cognos’, in order to compare the tweet counts for these terms.

Once we have the script created and tested, we can run it as a Windows service so that it downloads tweets continuously. This can be done easily with tools such as NSSM (the Non-Sucking Service Manager), which you can download from https://nssm.cc

Once NSSM is downloaded, we go to the folder where the 64-bit executable is located (or the 32-bit executable, if that is your case) and execute the following from the command line:

This window will be opened:

In Path, we enter the full path of the Python executable, including the python.exe file name; in Startup directory, we enter only the directory where the Python executable is stored; and in Arguments, we enter the path and file name of the Python script we created with Visual Studio 2017.

With this, the service is created; we just have to start it, and we will see tweets begin to appear in our SQL Server table.

Now that we have our table with the stored tweets, we must create a couple of tables that relate the terms to the table of tweets, and build a model that we can consume with Power BI. We do this in several steps:

First, we create a dimension with the terms and a code for each one in bitmask format. There are 7 terms, so the table will have 7 rows, and the ids will be powers of 2 from 1 to 64 (1, 2, 4, 8, 16, 32, and 64); in this way we can relate each tweet to its terms by means of a bit mask.

For example if a tweet contains the words PowerBI and QlickView, the id that relates it to its terms will be the sum of the two ids of those terms, in this case, the relationship ID would be 5, the sum of id 1 corresponding to Power Bi and Id 4 which is the id of the term QlickView. If a tweet contains all the 7 terms its relationship id will be 127 (the sum of all the ids 1+2+4+8+16+32+64)
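
To illustrate the idea, here is a small Python sketch (a hypothetical helper, not part of the service script) that computes the relationship id for a tweet text and decodes an id back into its terms, which is the same bit-mask logic the SQL code in this post implements:

#Hypothetical helper (not part of the service script) to illustrate the bit mask ids
terms = {'PowerBI': 1, 'Tableau': 2, 'Qlikview': 4, 'Microstrategy': 8,
         'Pyramid Analytics': 16, 'Business Objects': 32, 'Ibm cognos': 64}

def TermId(text):
    #Sum the ids of every tracked term found in the tweet text (case-insensitive)
    return sum(i for t, i in terms.items() if t.lower() in text.lower())

def DecodeId(idTrend):
    #Return the list of terms contained in a relationship id
    return [t for t, i in terms.items() if idTrend & i]

print(TermId('Comparing PowerBI and Qlikview dashboards'))   #5
print(DecodeId(5))                                           #['PowerBI', 'Qlikview']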

The script that generates this dimension table is:

with Dim_Terms as(
SELECT 1  as Id, 'PowerBI'			AS Term union all 
SELECT 2  as Id, 'Tableau'			AS Term union all 
SELECT 4  as Id, 'Qlikview'			AS Term union all 
SELECT 8  as Id, 'Microstrategy'		AS Term union all 
SELECT 16 as Id, 'Pyramid Analytics'	        AS Term union all 
SELECT 32 as Id, 'Business Objects'		AS Term union all 
SELECT 64 as Id, 'Ibm cognos'			AS Term )
select * into DimTerm from Dim_Terms

 

The queried table will look like this:

The relationship between this terms dimension and the table of “facts” is a many-to-many relationship, so we are going to need another 2 tables that materialize the relationship in order to use it in Power BI.

The first of these relationship tables is the one that contains all the possible combinations of terms, from a single term up to all 7 at a time. This table will have 127 records (the 0, which would correspond to no term at all, is not considered for the moment).

The query that will provide us this table is:

with nums as
(
  select 1 as id
  union all 
  select id + 1
    from nums
    where id + 1 < 128
)
,
datos as(
select t.id,
  REPLACE(REPLACE( 
  ltrim(
  iif(cast(t.id & 64 as bit)  = 1, 'Ibm_cognos','') + 
    iif(cast(t.id & 32 as bit)  = 1, ' Business_Objects','') +
    iif(cast(t.id & 16 as bit) 	= 1, ' Pyramid_Analytics','') +
    iif(cast(t.id & 8 as bit)	= 1, ' Microstrategy','') +
    iif(cast(t.id & 4 as bit)	= 1, ' Qlikview','') +
    iif(cast(t.id & 2 as bit)  	= 1, ' Tableau','') +
    iif(cast(t.id & 1 as bit)	= 1, ' PowerBI','')),' ', ' - '), '_',' ')  as Terms
from nums t
)
select * into Dim_AllTerms from datos option (maxrecursion 0)

 

And its structure will be:

Now we must relate, through a many-to-many bridge table, each “combination of terms” (Dim_AllTerms) with its individual terms in the dimension (DimTerm). We do this with the next query:

with nums as
(
 select 1 as id
 union all 
 select id + 1
 from nums
 where id + 1 < 128
),
datos as(
select t.id, 
 cast(t.id & 64 as bit) as bit7
 ,cast(t.id & 32 as bit) as bit6
 ,cast(t.id & 16 as bit) as bit5
 ,cast(t.id & 8 as bit) as bit4
 ,cast(t.id & 4 as bit) as bit3
 ,cast(t.id & 2 as bit) as bit2
 ,cast(t.id & 1 as bit) as bit1
from nums t
)
 select 
 id, 
 case trend
 when 'bit1' then 1
 when 'bit2' then 2
 when 'bit3' then 4
 when 'bit4' then 8
 when 'bit5' then 16
 when 'bit6' then 32
 when 'bit7' then 64
 else 0
 end as IdTrend
 into F_MtM_Terms
 From 
 (select * from datos ) p
 UNPIVOT 
 (
 valor for trend in (bit1, bit2, bit3, bit4,bit5,bit6,bit7)
 ) as unpivo
 where
 valor = 1
option (maxrecursion 0)

 

The table will have the following appearance. As you can see, the “Terms” id 5 is repeated 2 times, once for each individual term it contains: the 1 of PowerBI and the 4 of Qlikview:

Finally, we create a view that returns the tweet data in the form of a fact table:

SELECT 
  country
  ,favorite_count
  ,followers_count
  ,friends_count
  ,lang
  ,latitude
  ,longitude
  ,retweet_count
  ,screen_name as UserName
  ,[text]
  , CONVERT(DATETIME, SUBSTRING(created_at,9,2)+'/'
  + CASE SUBSTRING(created_at,5,3)
    WHEN 'Jan' then '01/'
    WHEN 'Feb' then '02/'
    WHEN 'Mar' then '03/'
    WHEN 'Apr' then '04/'
    WHEN 'May' then '05/'
    WHEN 'Jun' then '06/'
    WHEN 'Jul' then '07/'
    WHEN 'Aug' then '08/'
    WHEN 'Sep' then '09/'
    WHEN 'Oct' then '10/'
    WHEN 'Nov' then '11/'
    WHEN 'Dec' then '12/'
  else '' end
  +RIGHT(created_at,4) + ' '
  + SUBSTRING(created_at,12,8), 105) AS Created_Date
  , iif([text] like '%PowerBI%',1,0) 
  + iif([text] like '%Tableau%',2,0) 
  + iif([text] like '%Qlikview%',4,0) 
  + iif([text] like '%Microstrategy%',8,0) 
  + iif([text] like '%Pyramid Analytics%',16,0) 
  + iif([text] like '%Business Objects%',32,0) 
  + iif([text] like '%Ibm cognos%',64,0) IdTrend
FROM 
  [TwitterDB].[dbo].[Tweets]
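
As an alternative, the created_at text could be converted to a real datetime in Python before the insert, so the view would not need the string manipulation above. A minimal sketch, assuming the standard Twitter date format (e.g. 'Wed Oct 10 20:19:24 +0000 2018') and a pandas version that supports the %z directive:

#Optional: convert created_at to a real datetime before calling to_sql (sketch)
import pandas as pd

def ParseCreatedAt(tweets):
    #Twitter returns dates like 'Wed Oct 10 20:19:24 +0000 2018'
    tweets['created_at'] = pd.to_datetime(tweets['created_at'],
                                          format='%a %b %d %H:%M:%S %z %Y')
    return tweets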

 

The model in Power BI has the following structure:

* It is important to note that the many-to-many relationships must have the cross filter direction set to “Both” in order to work properly.

As you can see, the measures used are very simple: a sum of favorites, followers and friends, and a count of the number of tweets.

The final result is a pair of example dashboards, one showing the tweets by region and language, and another showing the evolution over the last 2 days in “dayNumber-hourNumber” format.

I hope this is useful, or at least encourages you to experiment a bit with Python; it is a very useful tool with a gentle learning curve.

Greetings

 

Python for SQL Server Specialists Part 3: Graphs and Machine Learning


After learning about Python fundamentals and basics about working with data, it is time to start with more exciting parts of this Python for SQL Server Specialists series.

In this article you will learn about the most important libraries for advanced graphing, namely matplotlib and seaborn, and about the most popular data science library, the scikit-learn library.

Creating Graphs

You will learn how to create graphs with two Python libraries: matplotlib and seaborn. Matplotlib is a mature, well-tested, cross-platform graphics engine. In order to work with it, you need to import it; however, you also need to import an interface to it. Matplotlib is the whole library, and matplotlib.pyplot is a module in matplotlib. Pyplot is the interface to the underlying plotting library that knows how to automatically create the figure, axes, and other elements necessary for the desired plot. Seaborn is a visualization library built on matplotlib that adds enhanced graphing options and makes working with pandas data frames easy.

Anyway, without further talking, let’s start developing. First, let’s import all necessary packages for this section.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

The next step is to create sample data. An array of 100 evenly spaced numbers between 0.1 and 10 is the data for the independent variable, and then the following code creates two dependent variables, one as the sine of the independent one, and the other as its natural logarithm.

# Creating data and functions
x = np.linspace(0.1, 10, 100)
y = np.sin(x)
z = np.log(x)

The following code defines the style to use for the graph and then plots two lines, one for each function. The plt.show() command is needed to show the graph interactively.

# Basic graph
plt.style.use('classic')
plt.plot(x, y)
plt.plot(x, z)
plt.show()

If you execute the code above in Visual Studio 2017, you should get a pop-up window with the desired graph. I am not showing the graph yet; before showing it, I want to make some enhancements and, besides showing it, also save it to a file. The following code uses the plt.figure() function to create an object that will store the graph. Then, for each function, it defines the line style, line width, line color, and label. The plt.axis() line redefines the axes range. The next three lines define the axes titles and the title of the graph, and set the font size for the text. The plt.legend() line draws the legend. The last two lines show the graph interactively and save it to a file.

# Enhanced graph
f = plt.figure()
plt.plot(x, y, color = 'blue', linestyle = 'solid',
         linewidth = 4, label = 'sin')
plt.plot(x, z, color = 'red', linestyle = 'dashdot',
         linewidth = 4, label = 'log')
plt.axis([-1, 11, -2, 3.5])
plt.xlabel("X", fontsize = 16)
plt.ylabel("sin(x) & log(x)", fontsize = 16)
plt.title("Enhanced Line Plot", fontsize = 25)
plt.legend(fontsize = 16)
plt.show()
f.savefig('C:\\PythonSolidQ\\SinLog.png')

Here is the result of the code above – the first nice graph.

Graphing SQL Server Data

Now it's time to switch to some more realistic examples. First, let's import the dbo.vTargetMail data from the AdventureWorksDW2016 demo database into a pandas data frame.

# Connecting and reading the data
import pyodbc
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, 
             TotalChildren, NumberChildrenAtHome,
             HouseOwnerFlag, NumberCarsOwned,
             EnglishEducation as Education,
             YearlyIncome, Age, BikeBuyer
           FROM dbo.vTargetMail;"""
TM = pd.read_sql(query, con)

The next graph you can create is a scatterplot. The following code plots YearlyIncome over Age. Note that the code creates a smaller data frame with the first hundred rows only, in order to get a less cluttered graph for the demo. Again, for the sake of brevity, I am not showing this graph.

# Scatterplot
TM1 = TM.head(100)
plt.scatter(TM1['Age'], TM1['YearlyIncome'])
plt.xlabel("Age", fontsize = 16)
plt.ylabel("YearlyIncome", fontsize = 16)
plt.title("YearlyIncome over Age", fontsize = 25)
plt.show()

For categorical variables, you usually create bar charts for a quick overview of the distribution. You can do it with the countplot() function from the seaborn package. Let’s try to plot counts for the BikeBuyer variable in the classes of the Education variable.

# Bar chart
sns.countplot(x="Education", hue="BikeBuyer", data=TM);
plt.show()
# Note the wrong order of Education

If you executed the previous code, you should have noticed that the Education variable is not sorted correctly. You need to inform Python about the intrinsic order of a categorical (ordinal) variable. The following code defines the Education variable as categorical and then shows the categories.

# Define Education as categorical
TM['Education'] = TM['Education'].astype('category')
TM['Education']

In the next step, the code defines the correct order.

 # Proper order 
TM['Education'].cat.reorder_categories( 
    ["Partial High School",  
     "High School","Partial College",  
     "Bachelors", "Graduate Degree"], inplace=True)

Now it is time to create the bar chart again. This time, I am also saving it to a file, and showing it here.

# Correct graph
f = plt.figure()
sns.countplot(x="Education", hue="BikeBuyer", data=TM);
plt.show()
f.savefig('C:\\PythonSolidQ\\EducationBikeBuyer.png')

So here is the bar chart.

Machine Learning with Scikit-Learn

You can find many different libraries for statistics, data mining and machine learning in Python. Probably the best-known one is the scikit-learn package. It provides most of the commonly used algorithms, and also tools for data preparation and model evaluation.

In scikit-learn, you work with data in a tabular representation by using pandas data frames. The input table (actually a two-dimensional array, not a table in the relational sense) has columns used to train the model. Columns, or attributes, represent some features, and therefore this table is also called the features matrix. There is no prescribed naming convention; however, in most of the Python code, you will note that this features matrix is stored in variable X.

If you have a directed, or supervised algorithm, then you also need the target variable. This is represented as a vector or one-dimensional target array. Commonly, this target array is stored in a variable named y.

Without further hesitation, let’s create some mining models. First, the following code imports all necessary libraries for this section.

# sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.mixture import GaussianMixture

The next step is to prepare the features matrix and the target array. The following code also checks the shape of both.

# Preparing the data for Naive Bayes
X = TM[['TotalChildren', 'NumberChildrenAtHome',
        'HouseOwnerFlag', 'NumberCarsOwned',
        'YearlyIncome', 'Age']]
X.shape
y = TM['BikeBuyer']
y.shape

The first model will be a supervised one, using the Naïve Bayes classification. For testing the accuracy of the model, you need to split the data into the training and the test set. You can use the train_test_split() function from the scikit-learn library for this task.

# Split into the training and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, random_state = 0, train_size = 0.7)

Note that the code above puts 70% of the data into the training set and 30% into the test set. The next step is to initialize and train the model with the training data set.

# Initialize and train the model
model = GaussianNB()
model.fit(Xtrain, ytrain)

That's it. The model is prepared and trained. You can start using it for making predictions. You can use the test set for predictions and then evaluate the model. A very well-known measure is the accuracy. The accuracy is the proportion of the total number of predictions that were correct, defined as the sum of true positive and true negative predictions divided by the total number of cases predicted. The following code uses the test set for the predictions and then measures the accuracy.

# Predictions and accuracy
ymodel = model.predict(Xtest)
accuracy_score(ytest, ymodel)
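
For intuition, the same number can be computed by hand from the confusion matrix; a short sketch, not part of the original code, that assumes the ytest and ymodel variables created above:

# Accuracy computed by hand from the confusion matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(ytest, ymodel).ravel()
(tp + tn) / (tn + fp + fn + tp)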

You can see that you can do quite advanced analyses with just a few lines of code. Let's make another model, this time an undirected one, using a clustering algorithm. For this one, you don't need training and test sets, nor the target array. The only thing you need to prepare is the features matrix.

# Preparing the data for Clustering
X = TM[['TotalChildren', 'NumberChildrenAtHome',
        'HouseOwnerFlag', 'NumberCarsOwned',
        'YearlyIncome', 'Age', 'BikeBuyer']]

Again, you need to initialize and fit the model. Note that the following code tries to group the cases into two clusters.

# Initialize and train the model
model = GaussianMixture(n_components = 2, covariance_type = 'full')
model.fit(X)

The predict() function for the clustering model creates the cluster information for each case in the form of a resulting vector. The following code creates this vector and shows it.

# Predictions
ymodel = model.predict(X)
ymodel

You can add the cluster information to the input feature matrix.

# Add the cluster membership to the source data
X['Cluster'] = ymodel
X.head()

Now you need to understand the clusters. You can get this understanding graphically. The following code shows how you can use the seaborn lmplot() function to create a scatterplot showing the cluster membership of the cases spread over income and age.

# Analyze the clusters
sns.set(font_scale = 3)
lm = sns.lmplot(x = 'YearlyIncome', y = 'Age', 
                hue = 'Cluster',  markers = ['o', 'x'],
                palette = ["orange", "blue"], scatter_kws={"s": 200},
                data = X, fit_reg = False,
                sharex = False, legend = True)
axes = lm.axes
axes[0,0].set_xlim(0, 190000)
plt.show(lm)

The following figure shows the result. You can see that cluster 0 contains older people with lower income, while cluster 1 consists of younger people with only slightly higher income.

Python for SQL Server 3
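
Besides the graph, a quick numeric summary can confirm this interpretation; a small sketch, not in the original code, using the X data frame with the Cluster column added above:

# Average age and income per cluster, and cluster sizes
X.groupby('Cluster')[['Age', 'YearlyIncome']].mean()
X['Cluster'].value_counts()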

Conclusion

Now this was something, right? With a couple of lines of code, we managed to create very nice graphs and perform quite advanced analyses. We analyzed SQL Server data. However, we used neither the scalable Microsoft machine learning libraries nor Python code inside SQL Server yet. Stay tuned – that is left for the last article in this series.

Click here to download the code

 

Read the whole series:

Python for SQL Server Specialists Part 1: Introducing Python

Python for SQL Server Specialists Part 2: Working with Data

Python for SQL Server Specialists Part 4: Python and SQL Server

How to refresh Power BI dataset from an on-premise Power Shell script


Hi friends,

Today we will show you how to refresh a dataset published in Power BI from a PowerShell script that we can invoke at the end of our ETL process.

We will use the Power BI libraries for PowerShell to connect to our Power BI portal and send an instruction to refresh a dataset. This can be useful to improve our ETL processes, refreshing the online datasets used in the Power BI portal as soon as the data has been loaded into our data warehouse and/or our OLAP/Tabular database.

Note that if the dataset is not in a workspace assigned to Premium capacity, then you will be limited to eight refreshes per day. Datasets in workspaces assigned to Premium will support up to 48 refreshes a day.

In order to perform this operation, we will need to create an Azure application with access to our Power BI portal. To do so, follow these steps:

Connect to https://dev.powerbi.com/apps and log in as a Power BI admin user:

Create a new application using this information:
-App Type: Native app
-Redirect URL: “urn:ietf:wg:oauth:2.0:oob” (without quotes)
-Level of access: check all dataset APIs permissions

Then we need to assign permissions to our Power BI admin user to use the application. To do that, connect to the Azure portal and enter the “Azure Active Directory” tool, following these steps:
1.- Enter the App registrations section (marked in yellow)

2.- Then click on “All applications” to see all the applications you own, and select the application that will refresh our dataset (marked in yellow):

3.- Add our user as the owner of the application:

4.- At this point we must grant our application permission to access Azure Active Directory:

We must grant access to the datasets of Power BI too:

Once these steps are done, it's time to code (let's rock!)

To create the script, we must install and reference the following libraries:
Microsoft.IdentityModel.Clients.ActiveDirectory.dll
Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll

We do that with this command (see https://docs.microsoft.com/es-es/powershell/azure/install-azurerm-ps?view=azurermps-5.5.0):

# Install the Azure Resource Manager modules from the PowerShell Gallery
Install-Module -Name AzureRM -AllowClobber

In our code we need to parameterize the client ID (that is, our app's client ID) and the dataset ID, which we can take from the Power BI portal when we open the dataset settings (app.powerbi.com/groups/{groupID}/settings/datasets/{datasetID}). Note that the group ID must be “me” in the script because we are using the “My Workspace” workspace.

The entire code, with comments, is here:

# Parameters - fill these in before running the script!
# =====================================================

# An easy way to get group and dataset ID is to go to dataset settings and click on the dataset
# that you'd like to refresh. Once you do, the URL in the address bar will show the group ID and 
# dataset ID, in the format: 
# app.powerbi.com/groups/{groupID}/settings/datasets/{datasetID} 

$groupID = "me" # the ID of the group that hosts the dataset. Use "me" if this is your My Workspace
$datasetID = "d9b4fd0c-7ac9-4e5d-be42-76686ce6b2db" # the ID of the dataset to refresh

# AAD Client ID
# To get this, go to the following page and follow the steps to provision an app
# https://dev.powerbi.com/apps
# To get the sample to work, ensure that you have the following fields:
# App Type: Native app
# Redirect URL: urn:ietf:wg:oauth:2.0:oob
#  Level of access: all dataset APIs
$clientId = "Client ID" 

# End Parameters =======================================

# Calls the Active Directory Authentication Library (ADAL) to authenticate against AAD
function GetAuthToken
{
       $adal = "C:\Program Files\WindowsPowerShell\Modules\Azure\5.1.1\Automation\Microsoft.IdentityModel.Clients.ActiveDirectory.dll"
       
 
       $adalforms = "C:\Program Files\WindowsPowerShell\Modules\Azure\5.1.1\Networking\Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll"
 
       [System.Reflection.Assembly]::LoadFrom($adal) | Out-Null
 
       [System.Reflection.Assembly]::LoadFrom($adalforms) | Out-Null
 
       $redirectUri = "urn:ietf:wg:oauth:2.0:oob"
 
       $resourceAppIdURI = "https://analysis.windows.net/powerbi/api"
 
       $authority = "https://login.microsoftonline.com/common/oauth2/authorize";
 
       $authContext = New-Object "Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext" -ArgumentList $authority

       $UserCred = New-Object "Microsoft.IdentityModel.Clients.ActiveDirectory.UserCredential" -ArgumentList "UserName@domain.com", "myPassword"
 
       $authResult = $authContext.AcquireTokenAsync($resourceAppIdURI, $clientId, $UserCred).Result
 
       return $authResult
}

# Get the auth token from AAD
$token = GetAuthToken

# Building Rest API header with authorization token
$authHeader = @{
   'Content-Type'='application/json'
   'Authorization'=$token.CreateAuthorizationHeader()
}

# properly format groups path
$groupsPath = ""
if ($groupID -eq "me") {
    $groupsPath = "myorg"
} else {
    $groupsPath = "myorg/groups/$groupID"
}

# Refresh the dataset
$uri = "https://api.powerbi.com/v1.0/$groupsPath/datasets/$datasetID/refreshes"
Invoke-RestMethod -Uri $uri -Headers $authHeader -Method POST -Verbose

# Check the refresh history
$uri = "https://api.powerbi.com/v1.0/$groupsPath/datasets/$datasetID/refreshes"
Invoke-RestMethod -Uri $uri -Headers $authHeader -Method GET -Verbose

And you’re all set!

This post was possible thanks to PatAltimore and his work in https://github.com/Azure-Samples/powerbi-powershell/blob/master/manageRefresh.ps1. We only made some changes to his code to make it work with our libraries. Thanks, Pat!

Subscribe to our blog if you enjoyed this post 😉

Python for SQL Server Specialists Part 2: Working with Data


In my previous article, you learned Python fundamentals. I also introduced the basic data structures. You can imagine you need more advanced data structures for analyzing SQL Server data, which comes in tabular format. In Python, there is also the data frame object, like in R. It is defined in the pandas library. You communicate with SQL Server through the pandas data frames. But before getting there, you need first to learn about arrays and other objects from the numpy library.

In this article, you will learn about the objects from the two of the most important Python libraries, namely, as mentioned, numpy and pandas.

A Quick Graph Demo

For a start, let me intrigue you by showing some analytical and graphic capabilities of Python. I am explaining the code just briefly in this section; you will learn more about Python programming in the rest of this article. The following code imports necessary libraries for this demonstration.

import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt

Then we need some data. I am using the data from the AdventureWorksDW2016 demo database, selecting from the dbo.vTargetMail view.

Before reading the data from SQL Server, you need to perform two additional tasks. First, you need to create a login and a database user for the Python session and give the user permission to read the data. Then, you need to create an ODBC data source name (DSN) that points to this database. In SQL Server Management Studio, connect to your SQL Server, and then in Object Explorer, expand the Security folder. Right-click on the Logins subfolder. Create a new login and a database user in the AdventureWorksDW2016 database and add this user to the db_datareader role. I created a SQL Server login called RUser with password Pa$$w0rd, and a user with the same name.

After that, I used the ODBC Data Sources tool to create a system DSN called AWDW. I configured the DSN to connect to my local SQL Server with the RUser SQL Server login and the appropriate password, and to change the database context to the AdventureWorksDW2016 database. If you've successfully finished both steps, you can execute the following Python code to read the data from SQL Server.
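
If you prefer not to create a DSN, pyodbc also accepts an explicit connection string; here is a hedged sketch, where the driver name is an assumption that depends on what is installed on your machine:

# Alternative: connect without a DSN, using an explicit connection string
import pyodbc
con = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};'
                     'SERVER=localhost;DATABASE=AdventureWorksDW2016;'
                     'UID=RUser;PWD=Pa$$w0rd')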

# Connecting and reading the data
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, Age,
             YearlyIncome, TotalChildren,
             NumberCarsOwned
           FROM dbo.vTargetMail;"""
TM = pd.read_sql(query, con)

You can get a quick info about the data you read with the following code:

# Get the information
TM.head(5)
TM.shape

The code shows you the first five rows and the shape of the data you just read.

Now I can do a quick crosstabulation of the NumberCarsOwned variable by the TotalChildren variable.

# Crosstabulation
obb = pd.crosstab(TM.NumberCarsOwned, TM.TotalChildren)
obb

And here are the first results, a pivot table of the aforementioned variables.

TotalChildren       0     1     2     3     4    5
NumberCarsOwned
0                 990  1668   602   419   449  110
1                1747  1523   967   290   286   70
2                1752   162  1876  1047  1064  556
3                 384   130   182   157   339  453
4                 292   136   152   281   165  235

Let me show the results of the pivot table in a graph. I need just the following two lines:

# Bar chart
obb.plot(kind = 'bar')
plt.show()

You can see the graph in the following figure.

Using the NumPy Data Structures and Methods

NumPy is short for Numerical Python; the library name is numpy. The library provides arrays with much more efficient storage and faster operations than basic lists and dictionaries. Unlike basic lists, numpy arrays must have elements of a single data type. The following code imports the numpy package with the alias np and checks the version of the library. Then the code creates two one-dimensional arrays from two lists, one with the implicit element data type integer, and one with an explicit float data type.

# NumPy intro
np.__version__
np.array([1, 2, 3, 4])
np.array([1, 2, 3, 4], dtype = "float32")

You can create multidimensional arrays as well. The following code creates three arrays with three rows and five columns, one filled with zeroes, one with ones, and one with the number pi. Note the functions used for populating arrays.

np.zeros((3, 5), dtype = int)
np.ones((3, 5), dtype = int)
np.full((3, 5), 3.14)

For the sake of brevity, I am showing here only the last array.

array([[ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14]])

There are many additional functions that help you populate your arrays. The following code creates four different arrays. The first line creates a linear sequence of numbers between 0 and 20 with step 2; note that the upper bound 20 is not included in the array. The second line creates ten uniformly distributed numbers between 0 and 1. The third line creates ten numbers drawn from a standard normal distribution with mean 0 and standard deviation 1. The fourth line creates a 3 by 3 matrix of uniformly distributed integer numbers between 0 and 9.

np.arange(0, 20, 2)
np.random.random((1, 10))
np.random.normal(0, 1, (1, 10))
np.random.randint(0, 10, (3, 3))

Again, for the sake of brevity, I am showing only the last result here.

array([[0, 1, 7],
       [5, 9, 4],
       [5, 5, 6]])

In order to perform some calculations on array elements, you could use mathematical functions and operators from the default Python engine and operate in loops, element by element. However, the numpy library also includes vectorized versions of the functions and operators, which operate on vectors and matrices as a whole and are much faster than the basic ones. The following code creates a 3 by 3 array of numbers between 0 and 8, shows the array, and then calculates the sine of each element using a numpy vectorized function.

# Numpy vectorized functions
x = np.arange(0, 9).reshape((3, 3))
x
np.sin(x)

And here is the result.

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
array([[ 0.        ,  0.84147098,  0.90929743],
       [ 0.14112001, -0.7568025 , -0.95892427],
       [-0.2794155 ,  0.6569866 ,  0.98935825]])

NumPy also includes vectorized aggregate functions. You can use them for a quick overview of the data in an array, using descriptive statistics calculations. The following code initializes an array of five sequential numbers:

# Aggregate functions
x = np.arange(1,6)
x

Here is the array:

array([1, 2, 3, 4, 5])

Now you can calculate the sum and the product of the elements, the minimum and the maximum, the mean and the standard deviation:

# Aggregations
np.sum(x), np.prod(x)
np.min(x), np.max(x)
np.mean(x), np.std(x)

Here are the results:

(15, 120)
 (1, 5)
(3.0, 1.4142135623730951)

In addition, you can also calculate running aggregates, like running sum in the next example.

# Running totals
np.add.accumulate(x)

The running sum result is here.

array([ 1,  3,  6, 10, 15], dtype=int32)

There are many more operations on arrays available in the numpy module. However, I am switching to the next topic in this Python learning tour, to the pandas library.

Organizing Data with Pandas

The pandas library is built on top of the numpy library. Therefore, in order to use pandas, you need to import numpy first. The pandas library introduces many additional data structures and functions. Let's start our pandas tour with the pandas Series object. This is a one-dimensional array, like a numpy array; however, you can define an explicitly named index and refer to those names to retrieve the data, not just to the positional index. Therefore, a pandas Series object already looks like a tuple in the relational model, or a row in a table. The following code imports both packages, numpy and pandas. Then it defines a simple pandas Series, without an explicit index. The series looks like a simple single-dimensional array, and you can refer to elements through the positional index.

# Pandas series
ser1 = pd.Series([1, 2, 3, 4])
ser1[1:3]

Here are the results. I retrieved the second and the third element, position 1 and 2 with zero-based positional index.

1    2
2    3

Now I will create a series with explicitly named index:

# Explicit index
ser1 = pd.Series([1, 2, 3, 4],
                 index = ['a', 'b', 'c', 'd'])
ser1['b':'c']

As you could see from the last example, you can refer to elements using the names of the index, which serve like column names in a SQL Server row. And below is the result.

b    2
c    3

Imagine you have multiple series with the same structure stacked vertically. This is the pandas DataFrame object. If you know R, let me tell you that it looks and behaves like the R data frame. You use the pandas DataFrame object to store and analyze tabular data from relational sources, or to export the result to the tabular destinations, like SQL Server. When I read SQL Server data at the beginning of this article, I read the tabular data in a Python data frame. The main difference, compared to a SQL Server table, is that a data frame is a matrix, meaning that you still can refer to the data positionally, and that the order of the data is meaningful and preserved.
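
To get a feel for the object, here is a tiny sketch, not tied to the SQL Server data, that builds a data frame from a dictionary and shows both label-based and positional access:

# A small data frame built from a dictionary
import pandas as pd
df = pd.DataFrame({'CustomerKey': [1, 2, 3],
                   'Age': [34, 45, 29]},
                  index = ['a', 'b', 'c'])
df.loc['b', 'Age']    # label-based access -> 45
df.iloc[1, 1]         # positional access  -> 45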

Pandas data frame is a very powerful object. You have already seen its graphic capabilities at the beginning of this article, when I created a quite nice bar chart. In addition, you can use other data frame methods to get information about your data with the help of descriptive statistics. The following code shows how to use the describe() function on the whole data frame to calculate basic descriptive statistics on every single column, and then how to calculate the mean, standard deviation, skewness, and kurtosis, i.e. the first four population moments, for the Age variable.

# Descriptive statistics
TM.describe()
TM['Age'].mean(), TM['Age'].std()
TM['Age'].skew(), TM['Age'].kurt()

Let me finish this article with another fancy example. It is quite simple to create even more complex graphs. The following code shows the distribution of the Age variable in a histogram and with a kernel density plot.

# Another graph
(TM['Age'] - 20).hist(bins = 25, normed = True,
                      color = 'lightblue')
(TM['Age'] - 20).plot(kind='kde', style='r--', xlim = [0, 80])
plt.show()

You can see the results in the following figure. Note that in the code, I subtracted 20 from the actual age, to get slightly younger population than exists in the demo database.

Conclusion

I guess the first article in this series about Python was a bit dull. Nevertheless, you need to learn the basics before doing anything fancier. I hope that in this article you got more intrigued by the Python capabilities for working on data. In the forthcoming articles, I will go deeper into the graphing capabilities and in data science with Python.

 

Click here to download the code

 

Read the whole series:

Python for SQL Server Specialists Part 1: Introducing Python

Python for SQL Server Specialists Part 3: Graphs and Machine Learning

Python for SQL Server Specialists Part 4: Python and SQL Server

Python for SQL Server Specialists Part 1: Introducing Python


Python is one of the most popular programming languages. It is a general-purpose, high-level language created by Guido van Rossum and publicly released in 1991. SQL Server 2016 started to support R, and SQL Server 2017 adds support for Python. Now you can select your preferred language for data science and even other tasks. R has more statistical, data mining and machine learning libraries, because it is more widely used in the data science community; however, Python has a broader purpose than just data science, and is more readable and might thus be simpler to learn. This is the first of four articles that introduce Python to SQL Server developers and business intelligence (BI) specialists. This means that the articles are more focused on Python basics and data science, and less on general programming with Python.

 

Starting with Python

Python is an interpreted language. The philosophy of the language emphasizes code readability. For example, you use white space to delimit code blocks instead of special characters like semicolons or curly brackets. Python supports automatic memory management and has a dynamic type system. You can use multiple programming paradigms in Python, including procedural, object-oriented, and functional programming. You can find Python interpreters for all major operating systems. The reference implementation of Python, namely CPython, is open source software, managed by the non-profit Python Software Foundation. Being open source also means that there is a rich set of libraries available. Even the standard library is impressive and comprehensive.

In order to start working with Python and R, you need to do some installation. I am not covering general SQL Server and Visual Studio installation; I am just explaining what you need to do to start using Python with SQL Server.

 

Installing ML Services and VS 2017 for Data Science

You just start SQL Server setup and then, on the Feature Selection page, select Database Engine Services and, underneath it, Machine Learning (ML) Services (In-Database), with Python only, or both languages, R and Python, selected. After that, all you need are client tools, and you can start writing the code. The following figure shows the SQL Server setup Feature Selection page with the appropriate features selected.

The next step is installing client tools. Of course, you need SQL Server Management Studio (SSMS). In addition, you might want to install Visual Studio (VS) 2017. You can use either the Professional or even the free Community edition to develop Python (and also R) code.

When installing Visual Studio 2017, be sure to select Python development workload, and then Data science and analytical applications, like the following figure shows. This will install Python language templates, including data science templates, and also R Tools for Visual Studio.

Selecting the Appropriate Python Engine

There you go, you are nearly ready. There is a small trick here: VS 2017 also installs its own Python interpreter. In order to use the scalable one, the one installed with SQL Server that enables executing code in the Database Engine context and includes the Microsoft scalable libraries, you need to set up an additional Python environment pointing to the scalable version of the interpreter. If you installed the default instance of SQL Server, the path for this scalable interpreter is C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\python.exe. You can see how to set up this environment in the following figure.
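
Once the environment is selected, a quick check, shown below as a small sketch, confirms which interpreter your code is actually running on:

# Check which Python interpreter the current environment is using
import sys
print(sys.executable)
print(sys.version)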

That's it. You are ready to start Python programming. Just start a new project and select the Python Application template from the Python folder. You can also explore the Python Machine Learning templates, which include Classifier, Clustering, and Regression projects. If you selected the Python Application template, you should see an empty Python script open, with the same default name as the project and the default extension .py, waiting for you to write and interactively execute Python code.

Python Language Basics

Python uses the hash mark to start a comment. You can execute Python code in VS 2017 by highlighting the code and pressing the Ctrl and Enter keys simultaneously. You can use either single or double quotes for delimiting strings. The first command you will learn is the print() command. Write and execute the following code:

# Hash starts a comment
print("Hello World!")
# This line ignored
print('Printing again.')
print('O"Hara')   # In-line comment
print("O'Hara")

You can observe the code you wrote and the results in the Interactive window, which is by default below the script window, at the bottom left side of the screen.

Python supports all basic mathematical and comparison operators, like you would expect. The following code introduces them. Note that you can combine strings and expressions in a single print() statement.

1 + 2
print("The result of 3 + 20 / 4 is:", 3 + 20 / 4)
10 * 2 - 7
10 % 4
print("Is 7 less or equal to 5?", 7 <= 5)
print("Is 7 greater than 5?", 7 > 5)

The next step is to introduce the variables. Note that Python is case-sensitive. The following code shows how you can assign values to variables and use them for direct computations and as the arguments of a function.

# Integer
a = 2
b = 3
a ** b
# Float
c = 7.0
d = float(5)
print(c, d)

You can define strings inside double or single quotes. This enables you to use single quotes inside a double-quoted string, and vice-versa. You can use the “%?” operator for formatting strings to include variables, where the question mark stands for a single letter denoting the data type of the variable, for example “s” for strings and “d” for numbers. The str.format() method of the string data type allows you to do variable substitutions in a string. Here are some examples.

e = "String 1"
f = 10
print("Let's concatenate string %s and number %d." % (e, f))
four_cb = "String {} {} {} {}"
print(four_cb.format(1, 2, 3, 4))

The result of the previous code is:

Let’s concatenate string String 1 and number 10.
String 1 2 3 4

You can also create multi-line strings. Just enclose the strings in a pair of three double quotes. You can also use special characters, like tab and line feed, by escaping them with a single backslash character plus a letter, for example the letter t for a tab and the letter n for a line feed.
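
Here is a short example of both:

g = """This is
a multi-line string."""
print(g)
print("Tab:\tdone.\nNew line: done.")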

You can always get interactive help with the help() command. A Python module is a file with default extension .py containing Python definitions and statements. You can import a module into your current script with the import command, and then use the functions and variables defined in that module. Besides modules provided with the installation, you can, of course, develop your own modules, distribute them, and reuse the code.
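
For example:

import math              # import a module from the standard library
math.sqrt(81)            # use a function defined in that module
help(math.sqrt)          # interactive help on that function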

Using Functions, Branches, and Loops

Like in any serious programming language, you can encapsulate your code inside a function. You define a function with the def name(): command. Functions can use arguments. Functions can also return values. The following code defines two functions, one that has no arguments, and one that has two arguments and returns a value. Note that there is no special ending mark of a function body – the correct indentation tells the Python interpreter where the body of the first function ends, and the definition of the second function starts.

def p_n():
    print("No args...")
def add(a, b):
    return a + b

When you call a function, you can pass parameters as literals, or through variables. You can also do some manipulation with the variables when you pass them as the arguments to a function. The following code shows these possibilities.

p_n()
# Call with variables and math
a = 10
b = 20
add(a / 5, b / 4)

You can make branches in the flow of your code with the if..elif..else: statement. The following code shows you an example.

a = 10
b = 20
c = 30
if a > b:
    print("a > b")
elif a > c:
    print("a > c")
elif (b < c):
    print("b < c")
    if a < c:
        print("a < c")
    if b in range(10, 30):
        print("b is between a and c")
else:
    print("a is less than b and less than c")

The results of the code are:

b < c
a < c
b is between a and c

The simplest data structure is the list. A Python list is a set of comma-separated values (or items) between square brackets. You can use a for loop, which behaves like a foreach loop, to iterate over a list. There are many methods supported by a list. For example, you can use the list.append() method to append an element to a list. The following code shows how to create lists and loop over them with for loops. Finally, it shows a while loop.

animals = ["cat", "dog", "pig"]
nums = []
for animal in animals:
    print("Animal: ", animal)
for i in range(2, 5):
    nums.append(i)
print(nums)
i = 1
while i <= 10:
    print(i)
    i = i + 1

The last data structure presented in this introductory article is the dictionary. A dictionary is a set of key-value pairs. You can see an example of a dictionary in the following code.

states = {
    "Oregon": "OR",
    "Florida": "FL",
    "Michigan": "MI"}
for state, abbrev in list(states.items()):
    print("{} is abbreviated {}.".format(state, abbrev))

I mentioned that in Python you can also use object-oriented paradigm. However, going deeper with object-oriented programming with Python is beyond the scope of this and the following articles.

Conclusion

I guess the Python programming introduced so far was not overly exciting. However, you always need to start with the basics, and only after you embrace the basics does the exciting part start. Therefore, don't miss my next article, where I will introduce the most important data structure for advanced analytics, the data frame.

 

Read the whole series:

Python for SQL Server Specialists Part 2: Working with Data

Python for SQL Server Specialists Part 3: Graphs and Machine Learning

Python for SQL Server Specialists Part 4: Python and SQL Server

SQL Server performance with Spectre and Meltdown patches


As you know, unless you are totally oblivious to the technology world, you will have heard about one of the biggest bugs in the history of computer science (Spectre and Meltdown) and that its effects are real. So real that we at SolidQ have experienced them ourselves in our own Query Analytics software. In this post I will try to shed some light on how to proceed if you detect performance regression in your SQL Server solution, explaining how I solved it in my own system.

(more…)

I use columnstore indexes on SQL 2014. What would be the advantages of upgrading to SQL 2016?


SQL Server 2016 really represents a drastic change for the better in regards to general performance. And the proof is that, thanks to the engine changes, this is the first SQL Server version where Microsoft actively provides a widespread increase in performance levels simply by upgrading. But what impact can it have on your daily loads if you use columnstore indexes in a SQL Server 2014 data warehouse? There was already an amazing increase in performance with SQL Server 2014 and columnstore indexes. Is it worth upgrading to SQL Server 2016? (more…)

Virtual SANs and the trend towards SDS


Lately, we are noticing a trend whereby traditional SANs are increasingly losing ground. This can be due to various reasons, such as forced migrations to the cloud, limited-budget environments, or even customers who are getting annoyed with their storage providers' draconian terms and conditions. A common example is to find very costly maintenance agreements, higher-than-market disk size upgrade costs, as well as difficulties in expanding or improving hardware without having to completely replace it. There is also a strong trend to continue going virtual and trading servers as if they were commodities. All of the above is therefore contributing to an increase in this trend towards SDS, while VSANs continue to attract new supporters.

(more…)