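Reading .mobi files in the Kindle app on Android
- Copy the file into Internal storage\Android\data\com.amazon.kindle\files\ and it can be opened in the app.
"__init__" and self
- "__init__": the method that performs initialization when an instance of the class is created
- self: refers to the instance itself
- "__str__":
Meaning of the underscore (_): it marks names used for processing that is not meant to be shown to the user.
A name that starts with two underscores is name-mangled by Python, so it is hard to reference from outside the class. Names surrounded by double underscores on both sides, such as __init__, are special methods that Python itself calls.
A name that starts with a single underscore can still be referenced, but apparently the convention is that it should not be accessed from outside.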

class person():
    def __init__(self, name):
        self.name = name

hunter = person('Elmer Fudd')
print(hunter.name)
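The notes above can be sketched in one small class. This is only an illustration: the `Person` class, the `_nickname` attribute and the `__secret` attribute are made-up names, not part of the original example.

```python
class Person:
    def __init__(self, name):
        self.name = name        # public attribute
        self._nickname = None   # single underscore: internal by convention only
        self.__secret = 'x'     # two leading underscores: stored as _Person__secret

    def __str__(self):
        # Called by str() and print()
        return 'Person(' + self.name + ')'


hunter = Person('Elmer Fudd')
print(hunter)                        # uses __str__
print(hunter._nickname)              # still reachable, just discouraged
print(hunter._Person__secret)        # the mangled name is what actually exists
print(hasattr(hunter, '__secret'))   # the unmangled name does not exist
```

The single underscore changes nothing at runtime, while the double underscore only renames the attribute, so neither gives real privacy.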

Changing the current directory
- A path that starts with '/' is absolute; without it, the path is relative to the current directory

import os
os.chdir(path)
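A small sketch of the difference between the two kinds of path, assuming a scratch directory created with `tempfile` (the directory name `sub` is made up for the example).

```python
import os
import tempfile

original = os.getcwd()

# Absolute path to a fresh scratch directory containing one subdirectory
base = os.path.realpath(tempfile.mkdtemp())
os.mkdir(os.path.join(base, 'sub'))

os.chdir(base)           # absolute path: works from anywhere
os.chdir('sub')          # relative path: resolved against the current directory
inside_sub = os.getcwd()

os.chdir('..')           # relative path back to the parent
back_in_base = os.getcwd()

os.chdir(original)       # restore the starting directory
print(inside_sub)
print(back_in_base)
```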

StringIO

pandas

Dates and times

Measuring elapsed time

import time
start = time.time()
# code to be timed goes here
duration = time.time() - start
print(duration)
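The same pattern can be wrapped in a helper. This is a sketch only: `timed` is a hypothetical function name, and it uses `time.perf_counter()`, which is a standard-library clock intended for measuring short durations, instead of `time.time()`.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


total, seconds = timed(sum, range(100000))
print(total, seconds)
```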

.dt.days: convert pandas Timedelta values to int
- It is an accessor on Series, so it cannot be used on a plain list.

import pandas as pd
import datetime as dt

x = pd.to_datetime(pd.Series(['20170701', '20170702']))
y = pd.to_datetime(pd.Series(['20170710', '20170730']))

d = y - x

print(d.dt.days)
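When the dates start out as plain lists, one option is to convert them with `pd.to_datetime` first. A minimal sketch: passing a list gives a DatetimeIndex, whose difference has a `.days` attribute directly (no `.dt`), and a single Timedelta has `.days` as well.

```python
import pandas as pd

start = pd.to_datetime(['20170701', '20170702'])   # DatetimeIndex
end = pd.to_datetime(['20170710', '20170730'])

diff = end - start              # TimedeltaIndex
days_list = list(diff.days)     # whole days as integers

# A single pair of timestamps gives one Timedelta
single = (pd.Timestamp('2017-07-10') - pd.Timestamp('2017-07-01')).days

print(days_list, single)
```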

merge

# Build sample DataFrames (from <http://sinhrks.hatenablog.com/entry/2015/01/28/073327>, with thanks)

import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[0, 1, 2, 5])

Merging on the index
- Set right_index=True and left_index=True

pd.merge(df1, df2, right_index=True, left_index=True, how='left')

  A_x B_x C_x D_x  A_y  B_y  C_y  D_y
0  A0  B0  C0  D0   A4   B4   C4   D4
1  A1  B1  C1  D1   A5   B5   C5   D5
2  A2  B2  C2  D2   A6   B6   C6   D6
3  A3  B3  C3  D3  NaN  NaN  NaN  NaN
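For comparison, here is a sketch of the other `how` options on the same index layout, using smaller made-up frames (`left` and `right` are illustrative names). `DataFrame.join` is shown as a shorthand for a left merge on the index.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']}, index=[0, 1, 2, 3])
right = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7']}, index=[0, 1, 2, 5])

# Only index labels present in both frames
inner = pd.merge(left, right, left_index=True, right_index=True, how='inner')

# Every index label from either frame
outer = pd.merge(left, right, left_index=True, right_index=True, how='outer')

# join() merges on the index by default (left join); suffixes must be given
# explicitly because both frames have a column named 'A'
joined = left.join(right, lsuffix='_x', rsuffix='_y')

print(inner)
print(outer)
print(joined)
```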

DataFrames

Removing outliers based on multiple columns of a DataFrame

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
# Insert 100 as an outlier
df.loc[0, 0] = 100
print(df.head())

# Drop every row that contains a value of 100 or more
df[(df < 100).all(axis=1)]

# Standardize each column and drop rows where any |z| is 3 or more
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Computing an outlier threshold per variable and excluding outliers

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3))
df.head()

def elout(df, fv= 0.05):
    print(df.shape)
    r = pd.DataFrame()
    for c in df:
        v = df[[c]]
        is_in = ((v >= v.quantile(fv)) & (v < v.quantile(1-fv)))
        r = pd.concat([r, is_in], axis=1)
    df2 = df[np.array(r).all(axis=1)]
    print(df2.shape)
    return(df2)

elout(df)
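The same filter can probably be written without the loop. This is a sketch under the same rule as `elout` (keep values that are at or above the lower quantile and below the upper quantile in every column); `trim_by_quantile` is a hypothetical name and the seeded random data is only for illustration.

```python
import numpy as np
import pandas as pd


def trim_by_quantile(df, fv=0.05):
    """Keep rows whose value in every column lies in [q(fv), q(1 - fv))."""
    lower = df.quantile(fv)        # one threshold per column
    upper = df.quantile(1 - fv)
    keep = (df.ge(lower, axis=1) & df.lt(upper, axis=1)).all(axis=1)
    return df[keep]


rng = np.random.RandomState(0)
sample = pd.DataFrame(rng.randn(100, 3))
trimmed = trim_by_quantile(sample)
print(sample.shape, trimmed.shape)
```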

Building up a DataFrame in a for loop

import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

# Collect one-row DataFrames, then concatenate them once along the rows
rows = []
for i in range(4):
    rows.append(df1.loc[[i], :])
x = pd.concat(rows, axis=0)

print(x)
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

Duplicates

Dropping duplicate rows

d = pd.DataFrame({'a': [1, 2, 3, 1],
                  'b': [1, 3, 3, 1]})
d.drop_duplicates()

   a  b
0  1  1
1  2  3
2  3  3
3  1  1
   a  b
0  1  1
1  2  3
2  3  3
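A short sketch of two other options on the same data: `subset` restricts the comparison to some columns, `keep='last'` keeps the last occurrence instead of the first, and `duplicated()` returns the flags without dropping anything.

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3, 1],
                  'b': [1, 3, 3, 1]})

by_b = d.drop_duplicates(subset='b')         # compare only column 'b'
keep_last = d.drop_duplicates(keep='last')   # keep the last of each duplicate group
flags = d.duplicated()                       # True for rows that repeat an earlier row

print(by_b)
print(keep_last)
print(flags)
```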

Renaming some columns: pass a dictionary to .rename. To modify the original data, use inplace=True
- To rename rows, use index={}

d = pd.DataFrame({'a': [1, 2, 3, 1],
                  'b': [1, 3, 3, 1]})
d.rename(columns={'a': 'xx'})

   xx  b
0   1  1
1   2  3
2   3  3
3   1  1
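A sketch of the two variants mentioned above, on a small made-up frame: renaming a row label with `index=`, and changing the original frame with `inplace=True`.

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Returns a new frame; d itself is unchanged
renamed = d.rename(index={0: 'first'})

# Modifies d directly and returns None
d.rename(columns={'a': 'xx'}, inplace=True)

print(renamed)
print(d)
```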

Dividing by row sums or column sums

import numpy as np
import pandas as pd


dat = pd.DataFrame({'a': np.random.randn(5),
                    'b': np.random.randn(5),
                    'c': np.random.randn(5),
                    })

rsum = dat.sum(axis=1)

>>> dat.div(rsum, axis=0)
          a          b          c
0  1.109889  -1.521357   1.411468
1 -0.221592   0.231812   0.989781
2 -0.045156   0.534391   0.510766
3 -0.127811   0.783268   0.344543
4  4.216794  27.485861 -30.702655
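The column-sum case works the same way with the axes swapped. A minimal sketch with small fixed numbers (instead of random data) so the result can be checked by eye:

```python
import pandas as pd

dat = pd.DataFrame({'a': [1.0, 3.0],
                    'b': [2.0, 6.0]})

# Divide each row by its row sum: every row then adds up to 1
row_share = dat.div(dat.sum(axis=1), axis=0)

# Divide each column by its column sum: every column then adds up to 1
col_share = dat.div(dat.sum(axis=0), axis=1)

print(row_share)
print(col_share)
```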

Handling missing values

np.isnan(): returns whether each element is NaN (Not a Number).

import numpy as np
v = [1, 2, np.nan]
[np.isnan(x) for x in v]
... 
[False, False, True]

Series.isnull(): checks an entire Series at once

import numpy as np
import pandas as pd
s = pd.Series([1, 2, np.nan])
s.isnull()
... 
0    False
1    False
2     True
dtype: bool

.values.any(): whether there is a missing value anywhere

s.isnull().values.any()
...
True

Returns True if there is a missing value, and False otherwise
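The same checks carry over to a whole DataFrame. A small sketch with made-up data: count the missing values per column, test whether anything is missing at all, and keep only the complete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 6]})

per_column = df.isnull().sum()            # number of missing values in each column
any_missing = df.isnull().values.any()    # True if anything is missing anywhere
complete_rows = df.dropna()               # rows without any missing value

print(per_column)
print(any_missing)
print(complete_rows)
```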

Slicing

Removing only some rows or columns: .drop

d = pd.DataFrame({'a': [1, 2, 3, 1],
                  'b': [1, 3, 3, 1]})

# Drop a row
d.drop(0)

   a  b
1  2  3
2  3  3
3  1  1


# Drop a column
d.drop('a', axis=1)
   b
0  1
1  3
2  3
3  1

Pivot tables

Aggregating across rows and columns

import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                   'B' : ['A', 'B', 'C'] * 8,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D' : np.random.randn(24),
                   'E' : np.random.randn(24)})

>>> pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)
... 
... 
A       one               three                 two          
C       bar       foo       bar       foo       bar       foo
B                                                            
A  1.989904 -0.544853 -1.382726       NaN       NaN  0.929173
B  0.370564 -1.321093       NaN -0.116037  0.125153       NaN
C -0.975050 -3.429380 -2.340092       NaN       NaN -0.735084

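Reshaping a groupby aggregate with pivot
- After aggregation the key variables become the index, and pivot cannot use them while they are still index levels.
- Use reset_index to turn them back into columns, then pivot.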

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})


gr = df.groupby([df['key1'], df['key2']])
grs = gr.sum()
x = grs.reset_index()
>>> x.pivot(index='key2', columns ='key1')
... 
... 
         data1               data2          
key1         a         b         a         b
key2                                        
one  -2.106043  0.190401  0.232389 -1.706491
two   0.470106  1.871696 -1.518614 -0.059299
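An alternative that skips the reset_index step is `unstack`, which moves one index level of the aggregated result into the columns. A sketch with fixed numbers in place of the random data, and only one data column for brevity:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Sum per (key1, key2), then move the key1 level from the index to the columns
wide = df.groupby(['key1', 'key2'])['data1'].sum().unstack('key1')

print(wide)
```

The rows are key2 and the columns are key1, which is the same layout as the pivot result above.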

Statistics

Scatter plots and correlation coefficients

Scatter plot

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(1000)
y = np.random.randn(1000)
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

ax.scatter(x,y)

ax.set_title('first scatter plot')
ax.set_xlabel('x')
ax.set_ylabel('y')

fig.show()

Scatter plot with a regression line

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(1000)
y = np.random.randn(1000)

r = stats.linregress(x, y)
x_line = [x.min(), x.max()]
y_line = [r.slope * i + r.intercept for i in x_line]
plt.scatter(x, y, color='blue', label='data', s=50)
plt.plot(x_line, y_line, color='green', linestyle='-', lw=5)

Scatter plot matrix

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

dat = pd.DataFrame({'a': np.random.randn(10), 
                    'b': np.random.randn(10), 
                    'c': np.random.randn(10), 
                    'd': np.random.randn(10), 
                    'e': np.random.randn(10)})

# Simple correlation matrix
dat.corr()


# pandas (scatter_matrix lives in pandas.plotting in current versions)
pd.plotting.scatter_matrix(dat)


# seaborn
sns.pairplot(dat)

Drawing multiple regression lines in different colors

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from matplotlib import cm
from scipy import stats

dat = pd.DataFrame({'y': np.random.randn(1000),
                    'x1': np.random.randn(1000),
                    'x2': np.random.randn(1000),
                    'x3': np.random.randn(1000)})

# Column names of the x variables
xlabels = ['x1', 'x2', 'x3']
colors = cm.rainbow(np.linspace(0, 1, len(xlabels)))


for x, c in zip(xlabels, colors):
    xv = dat[x]
    slope, intercept, r_value, _, _ = stats.linregress(xv, dat.y)
    x_line = [xv.min(), xv.max()]
    y_line = [slope * i + intercept for i in x_line]
    plt.scatter(xv, dat.y, color=c, label=x, s=50)
    plt.plot(x_line, y_line, color=c, linestyle='-', lw=5)

plt.legend(prop={'size':14})
plt.xlim(-4, 3)
plt.ylim(-4, 3)

Multiple regression

Using statsmodels
- The results match those from R

import pandas as pd
from statsmodels.formula.api import ols

x1 = pd.Series([12, 12, 7, 17, 14, 9, 10, 13, 15, 12, 12, 15, 11, 14, 17, 17,
                16, 15, 15, 10, 12, 9, 12, 12, 19, 11, 14, 15, 15, 15, 16, 15,
                12, 10, 11, 12, 15, 13, 15, 12, 12, 12, 13, 17, 13, 11, 14,
                16, 12, 12])  # mother's values
x2 = pd.Series([2, 2, 2, 3, 2, 2, 3, 3, 3, 1, 3, 3, 2, 2, 4, 2, 4, 3, 4, 2, 2,
                1, 2, 2, 4, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 1, 2, 3, 2, 2, 2, 3
                , 3, 3, 2, 3, 2, 4, 2, 2])  # years of preschool attendance
y = pd.Series([6, 11, 11, 13, 13, 10, 10, 15, 11, 11, 16, 14, 10, 13, 12, 15,
               16, 14, 14, 8, 13, 12, 12, 11, 16, 9, 12, 13, 13, 14, 12, 15,
               8, 12, 11, 6, 12, 15, 9, 13, 9, 11, 14, 12, 13, 9, 11, 14, 16,
               8])  # cooperativeness
# Name the columns explicitly so the formula finds x1, x2 and y in dat
dat = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

model = ols("y ~ x1 + x2", data=dat).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.314
Model:                            OLS   Adj. R-squared:                  0.284
Method:                 Least Squares   F-statistic:                     10.74
Date:                Thu, 20 Jul 2017   Prob (F-statistic):           0.000144
Time:                        11:36:54   Log-Likelihood:                -106.98
No. Observations:                  50   AIC:                             220.0
Df Residuals:                      47   BIC:                             225.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.1716      1.665      3.107      0.003       1.823       8.520
x1             0.3027      0.147      2.064      0.045       0.008       0.598
x2             1.1259      0.471      2.389      0.021       0.178       2.074
==============================================================================
Omnibus:                        0.276   Durbin-Watson:                   2.411
Prob(Omnibus):                  0.871   Jarque-Bera (JB):                0.348
Skew:                          -0.164   Prob(JB):                        0.840
Kurtosis:                       2.756   Cond. No.                         76.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using scikit-learn

from sklearn import linear_model
clf = linear_model.LinearRegression()

# Explanatory variables: x1 and x2
X = pd.concat([x1, x2], axis=1).to_numpy()

# Response variable: y
Y = y.to_numpy()

# Fit the model
clf.fit(X, Y)

# Partial regression coefficients
print(pd.DataFrame({"Name":['x1', 'x2'],
                    "Coefficients":clf.coef_}).sort_values(by='Coefficients') )

# Intercept
print(clf.intercept_)

## References
http://pythondatascience.plavox.info/scikit-learn/%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0
http://pythondatascience.plavox.info/pandas/%E8%A1%8C%E3%83%BB%E5%88%97%E3%82%92%E5%89%8A%E9%99%A4
https://nkmk.github.io/blog/python-pandas-dataframe-rename/
http://nekoyukimmm.hatenablog.com/entry/2015/04/10/094432
http://sinhrks.hatenablog.com/entry/2014/10/13/005327
https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/

2017-07-03

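Trying out the Gmail API

I followed the official guide as written. What follows is a rough translation.

Step 1: Enable the API

Creating a project in the Google Developers Console enables the API automatically. Continue through the wizard to obtain credentials.
On the screen for adding credentials to the project, click Cancel.
From the tabs at the top, select the OAuth consent screen. Enter your Google account email address and any service name, then save.
Select the Credentials tab -> Create credentials -> OAuth client ID
Select "Other" as the application type, enter "Gmail API Quickstart" as the name, and click the Create button
Save the displayed client ID and client secret just in case
Click the download button at the far right and save the file as client_secret.json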

Step 2: Install the Google client library

Just install it with pip

$ pip install --upgrade google-api-python-client

Step 3: Create the sample file

Create a file named quickstart.py. The contents are exactly as in the official guide.

Step 4: Run it

Put quickstart.py and client_secret.json in the same directory and run it.

$ python quickstart.py

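It ran without problems and retrieved the labels of my Gmail account.
Once you have authenticated, it seems to work from inside Spyder as well. Handy.

Searching Gmail

The official docs have Python sample code, so it is just a matter of running it:
Obtain the Google API credentials
Search the mailbox and get the IDs of the messages that match a given string
Display the body of the message with that ID

# Function that obtains the credentials. Everything below is copied from the official samples.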

from __future__ import print_function
import httplib2
import os

from apiclient import discovery
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage

try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

# If modifying these scopes, delete your previously saved credentials
# at ~/.credentials/gmail-python-quickstart.json
SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Gmail API Python Quickstart'

def get_credentials():
    """Gets valid user credentials from storage.

    If nothing has been stored, or if the stored credentials are invalid,
    the OAuth2 flow is completed to obtain the new credentials.

    Returns:
        Credentials, the obtained credential.
    """
    home_dir = os.path.expanduser('~')
    credential_dir = os.path.join(home_dir, '.credentials')
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                   'gmail-python-quickstart.json')

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        if flags:
            credentials = tools.run_flow(flow, store, flags)
        else: # Needed only for compatibility with Python 2.6
            credentials = tools.run(flow, store)
        print('Storing credentials to ' + credential_path)
    return credentials


# Function that searches mail for a given string and returns the message IDs. Some print statements were rewritten for Python 3; Japanese text also works.

def ListMessagesMatchingQuery(service, user_id, query=''):
  """List all Messages of the user's mailbox matching the query.

  Args:
    service: Authorized Gmail API service instance.
    user_id: User's email address. The special value "me"
    can be used to indicate the authenticated user.
    query: String used to filter messages returned.
    Eg.- 'from:user@some_domain.com' for Messages from a particular sender.

  Returns:
    List of Messages that match the criteria of the query. Note that the
    returned list contains Message IDs, you must use get with the
    appropriate ID to get the details of a Message.
  """
  try:
    response = service.users().messages().list(userId=user_id,
                                               q=query).execute()
    messages = []
    if 'messages' in response:
      messages.extend(response['messages'])

    while 'nextPageToken' in response:
      page_token = response['nextPageToken']
      response = service.users().messages().list(userId=user_id, q=query,
                                         pageToken=page_token).execute()
      messages.extend(response['messages'])

    return messages
  except errors.HttpError as error:
    print('An error occurred: %s' % error)


# Function that fetches a message by its message ID
import base64
import email
from apiclient import errors

def GetMessage(service, user_id, msg_id):
  """Get a Message with given ID.

  Args:
    service: Authorized Gmail API service instance.
    user_id: User's email address. The special value "me"
    can be used to indicate the authenticated user.
    msg_id: The ID of the Message required.

  Returns:
    A Message.
  """
  try:
    message = service.users().messages().get(userId=user_id, id=msg_id).execute()

    print('Message snippet: %s' % message['snippet'])

    return message
  except errors.HttpError as error:
    print('An error occurred: %s' % error)

What I want to do: search Gmail and extract the messages that match a particular regular expression.
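One possible approach is to narrow the candidates with the search query first and then apply the regular expression locally to what comes back. The sketch below covers only the local step: `filter_by_regex` is a hypothetical helper, and the sample dictionaries are made up to mimic the 'snippet' field that GetMessage prints. No API call is made here.

```python
import re


def filter_by_regex(messages, pattern):
    """Return the messages whose 'snippet' text matches the regular expression."""
    regex = re.compile(pattern)
    return [m for m in messages if regex.search(m.get('snippet', ''))]


# Made-up stand-ins for the dictionaries returned by GetMessage
sample_messages = [
    {'id': '1', 'snippet': 'Your order #12345 has shipped'},
    {'id': '2', 'snippet': 'Weekly newsletter'},
    {'id': '3', 'snippet': 'Order #987 was delivered'},
]

# Keep only messages that mention an order number with four or more digits
matched = filter_by_regex(sample_messages, r'#\d{4,}')
print([m['id'] for m in matched])
```

With real data, the list would presumably be built by calling GetMessage for each ID returned by ListMessagesMatchingQuery.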

2017-06-29

Fetching stock price data via an API and plotting moving averages

import pandas as pd
import quandl
import seaborn

# Fetch stock price data from Quandl
# SoftBank data from the Tokyo Stock Exchange
prices = quandl.get('TSE/9984.4', returns='pandas',
                     start_date='2016-08-17', end_date='2017-06-28')

prices.plot()

# 25-day and 75-day moving averages
x25 = prices.rolling(window=25, center=False).mean()
x75 = prices.rolling(window=75, center=False).mean()

px = pd.concat([x25, x75], axis=1)

px.plot()
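To see what `rolling(...).mean()` produces without fetching anything, here is a sketch on a made-up price series (the values 1 to 10 and the 3-day window are only for illustration). The first window - 1 entries are NaN because a full window is not yet available.

```python
import pandas as pd

# Made-up daily prices: 1.0, 2.0, ..., 10.0
prices = pd.Series(range(1, 11),
                   index=pd.date_range('2017-06-01', periods=10),
                   dtype=float)

# 3-day moving average: each value is the mean of that day and the two before it
ma3 = prices.rolling(window=3).mean()

print(pd.concat([prices, ma3], axis=1, keys=['price', 'ma3']))
```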


# Was a given stock up as of a given date (today)?
# Did it rise?

start = pd.to_datetime('2017-06-01')
now = pd.to_datetime('today')

value_start = prices.loc[start]
value_now = prices.iloc[-1]

print(value_now - value_start)

2017-06-18

Reading O'Reilly and other Kindle books on an Android device

Drop the file into the Internal storage > Android > data > com.amazon.kindle > files folder.

2017-06-12

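Stock terms. Updated from time to time

Market order (成行): an order placed without specifying a price, e.g. "a market order to buy 1,000 shares" or "a market order to sell 3,000 shares".
Limit order (指値): an order placed at a specified buy or sell price, e.g. "a limit order to buy 1,000 shares at 300 yen" or "a limit order to sell 2,000 shares at 500 yen".

References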

第70回「指値注文」「成行注文」の意味と使い方～超初心者向けコラム第2回～ | 足立武志「知って納得！株式投資で負けないための実践的基礎知識」 | 楽天証券

haogrove’s blog
