Digital History: The Story So Far
As the field of Digital History continues to grow, so too does the number of tools, software packages, and coding libraries built to support and advance digital history in practice. The range is at times staggering: from applications suitable for the most novice of digital historians to coding guides and tools for those working towards more nuanced and specific end-goals, researchers can engage with their materials in digital, quantitative ways on a scale never seen before. Often we focus primarily on the new findings that come out of this new way of approaching research - but what about the ways we get to those findings?
Regardless of the type of digital analysis being performed or even the software being used, the process is normally the same: input some data, click some buttons or run some code (perhaps a couple of times over to edit the code and adjust the outcomes), and get your end result.
You've got an outcome - but do you know how you got from A to B? It's likely that variables have been overwritten several times along the way, the data has changed from one type to another, been filtered or added to, and decision after decision has been made without you necessarily noticing. Each little adjustment or re-run of the code has contributed to the research process and is critical to the end output or findings.
But how do we keep track?
Hello kiara.
Introducing kiara, a new data orchestration tool.
This new tool incorporates a number of different digital research approaches, but most importantly it documents the research process and encourages users to critically reflect on their use of DH tools. In doing so, the software opens up the black box of digital research, moving away from button-clicking software and making digital research more transparent and open to commentary, replicability, and criticism. It not only makes the research process itself more open, allowing users to visualise and examine the individual steps from start to finish, but also lets them track changes to the data itself, something that is either imperceptible or, perhaps more importantly, forgotten about in traditional digital history methods and tools. kiara therefore acts as a 'wrapper' around this digital research process, tracking and documenting the steps and changes to the data and producing a veritable map of the journey that can be reflected upon and shared. As well as tracing the data and the actions themselves, kiara asks you to record your own decisions each time you run an operation - why are you doing this? What data are you using? What parameters have you chosen and why? Adding comments or notes to each stage of the process not only makes the work clearer for others looking at the research, but for your own records as well!
This tutorial will walk you through how to start kiara in Jupyter Notebooks, and some basic but essential functions that can be built on in further notebooks. At the end, it will showcase the way kiara has tracked the decisions made, the research process and changes to the data from start to finish.
This tutorial assumes some basic knowledge of Python and SQL.
Running kiara
Before running this notebook, you need to install kiara and its dependencies in a virtual environment (such as Conda). Check out the installation instructions here.
As a minimum, make sure you have the onboarding, tabular, and core-types plugins. For the lineage visuals later, run pip install observable_jupyter in your CLI if you don't have it installed already!
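If you prefer to stay inside Jupyter, a minimal sketch for installing that extra visualisation dependency from a notebook cell looks like the following. This assumes pip is available in your active environment; the plugins themselves should be installed following the instructions linked above.

# Install the visualisation dependency from inside the notebook
# (only needed if observable_jupyter is not already in your environment).
%pip install observable_jupyter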
In order to use kiara in Jupyter Notebooks, we need to create a KiaraAPI instance. An API allows us to control and interact with kiara and its functions. In kiara this also allows us to get more information about what can be done (and what is happening) to our data as we go. For more on what can be done with the API, see the kiara API documentation here.
from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()
Creating a Project: kiara contexts
First up, we want to think about creating a project space, known in kiara as a context. Contexts store all of the information on the data we import or create, the jobs we run, and all the processes and decisions we tell kiara about. kiara comes with a 'default' context in place, meaning we can use it straight away, or we can create a new one for a specific purpose. Let's create one called 'hello_kiara' for this notebook by passing this name to set_active_context and setting the create option to True.
We can also then get a list of all available contexts, and double check which context we are working in.
kiara.set_active_context(context_name='hello_kiara', create=True)
print('Available Contexts:', kiara.list_context_names())
print('Current Context:', kiara.get_current_context_name())
Available Contexts: ['default', 'hello_kiara']
Current Context: hello_kiara
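If you ever want to hop between projects, the same call can be reused to switch contexts. The cells below are just a sketch recombining the calls we have already seen, and assume that an existing context can be activated without the create option:

# Switch back to the built-in default context...
kiara.set_active_context(context_name='default')
print('Current Context:', kiara.get_current_context_name())

# ...and return to our project context for the rest of this notebook.
kiara.set_active_context(context_name='hello_kiara')
print('Current Context:', kiara.get_current_context_name())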
Now that we have an API and our project context in place, we can get more information about what we can do in kiara. Let's start by asking kiara to list all the operations that are included with the plugins we just installed.
kiara.list_operation_ids()
['assemble.network_graph',
'assemble.tables',
'calculate.betweenness_score',
'calculate.closeness_score',
'calculate.degree_score',
'calculate.eigenvector_score',
'compute.modularity_group',
'create.cut_point_list',
'create.database.from.file',
'create.database.from.file_bundle',
'create.database.from.table',
'create.database.from.tables',
'create.job_log',
'create.network_graph.from.file',
'create.table.from.file',
'create.table.from.file_bundle',
'create.tables.from.file_bundle',
'date.check_range',
'date.extract_from_string',
'download.file',
'download.file.from.github',
'download.file.from.zenodo',
'download.file_bundle',
'download.file_bundle.from.github',
'download.file_bundle.from.zenodo',
'export.database.as.csv_files',
'export.database.as.sql_dump',
'export.database.as.sqlite_db',
'export.file.as.file',
'export.network_graph',
'export.table.as.csv_file',
'export.tables.as.csv_files',
'export.tables.as.sql_dump',
'export.tables.as.sqlite_db',
'extract.date_array.from.table',
'extract.largest_component',
'file_bundle.pick.file',
'file_bundle.pick.sub_folder',
'import.database.from.local_file_path',
'import.local.file',
'import.local.file_bundle',
'import.network_graph.from.file',
'import.table.from.local_file_path',
'import.table.from.local_folder_path',
'list.contains',
'logic.and',
'logic.nand',
'logic.nor',
'logic.not',
'logic.or',
'logic.xor',
'parse.date_array',
'preview.network_info',
'query.database',
'query.table',
'string_filter.tokens',
'table.add_column',
'table.pick.column',
'table_filter.drop_columns',
'table_filter.select_columns',
'table_filter.select_rows',
'tables.pick.column',
'tables.pick.table']
Downloading Files
Great, now we know the different kinds of operations we can use with kiara. Let's start by introducing some files to our notebook, using the download.file function.
First we want to find out what this operation does, and just as importantly, what inputs it needs to work.
kiara.retrieve_operation_info('download.file')
Author(s)      Markus Binsteiner  markus@frkl.io
Context        Tags: onboarding
               Labels: package: kiara_plugin.onboarding
References     source_repo: https://github.com/DHARPA-Project/kiara_plugin.onboarding
               documentation: https://DHARPA-Project.github.io/kiara_plugin.onboarding/
Module type    download.file

Operation details

Documentation  Download a single file from a remote location. The result of this operation is a single value of
               type 'file' (basically an array of raw bytes + some light metadata), which can then be used in
               other modules to create more meaningful data structures.

Inputs
  field name   type     description                                                     Required   Default
  url          string   The url of the file to download.                                yes        -- no default --
  file_name    string   The file name to use for the downloaded file, if not provided   no         -- no default --
                        it will be generated from the last token of the url.

Outputs
  field name   type     description
  file         file     The file that was onboarded.
So from this, we know that download.file will download a single file from a remote location for us to use in kiara.
We need to give the function a url and, if we want, a file name. These are the inputs.
In return, we will get the file and metadata about the file as our outputs.
Let's give this a go using some kiara sample data.
To run a module in kiara, we call kiara.run_job with the name of the module, the inputs, and a comment containing your notes about any of the decisions you have made. You don't have to save your outputs if you don't want to, but you do need to provide a comment each time.
First we define our inputs, then use kiara.run_job with our chosen operation, download.file, our input variable, and our comment, and save the result as our outputs. As we'll see in a couple of moments, we can put our inputs directly into the operation if we want to, but we'll start like this for now.
For the moment we've just written 'importing journal nodes' in our comments, but feel free to change this - and make sure to add your own information for the rest of the modules in this notebook!
inputs = {
"url": "https://raw.githubusercontent.com/DHARPA-Project/kiara.examples/main/examples/data/network_analysis/journals/JournalNodes1902.csv",
"file_name": "JournalNodes1902.csv"
}
outputs = kiara.run_job('download.file', inputs=inputs, comment="importing journal nodes")
Let's print out our outputs and see what that looks like.
outputs
field   value
──────────────────────────────────────────────────────────────────────────────────────────────
file

  Preview

  Id,Label,JournalType,City,CountryNetworkTime,PresentDayCountry,Latitude,Longitude,Language
  75,Psychiatrische en neurologische bladen,specialized: psychiatry and neurology,Amsterdam,Netherlands,Netherlands,52.366667,4.9,Dutch
  36,The American Journal of Insanity,specialized: psychiatry and neurology,Baltimore,United States,United States,39.289444,-76.615278,English
  208,The American Journal of Psychology,specialized: psychology,Baltimore,United States,United States,39.289444,-76.615278,English
  295,Die Krankenpflege,specialized: therapy,Berlin,German Empire,Germany,52.52,13.405,German
  296,Die deutsche Klinik am Eingange des zwanzigsten Jahrhunderts,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  300,Therapeutische Monatshefte,specialized: therapy,Berlin,German Empire,Germany,52.52,13.405,German
  1,Allgemeine Zeitschrift für Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  7,Archiv für Psychiatrie und Nervenkrankheiten,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  10,Berliner klinische Wochenschrift,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  13,Charité Annalen,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  21,Monatsschrift für Psychiatrie und Neurologie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  29,Virchows Archiv,"specialized: anatomy, physiology and pathology",Berlin,German Empire,Germany,52.52,13.405,German
  31,Zeitschrift für pädagogische Psychologie und Pathologie,specialized: psychology and pedagogy,Berlin,German Empire,Germany,52.52,13.405,German
  42,Vierteljahrsschrift für gerichtliche Medizin und öffentliches Sanitätswesen,"specialized: anthropology, criminology and forensics",Berlin,German Empire,Germany,52.52,13.405,German
  47,Centralblatt für Nervenheilkunde und Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  50,Russische medicinische Rundschau,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  76,Deutsche Aerzte-Zeitung,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  87,Monatsschrift für Geburtshülfe und Gynäkologie,specialized: gynecology,Berlin,German Empire,Germany,52.52,13.405,German
  108,Archiv für klinische Chirurgie,specialized: surgery,Berlin,German Empire,Germany,52.52,13.405,German
  113,Zeitschrift für klinische Medicin,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  159,Deutsche militärärztliche Zeitschrift,specialized: military medicine,Berlin,German Empire,Germany,52.52,13.405,German
  162,Jahresbericht über die Leistungen und Fortschritte auf dem Gebiete der Neurologie und Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  192,Ärztliche Sachverständigen-Zeitung,general medicine,Berlin,German Empire,Germany,52.52,13.405,German
  198,Zeitschrift für die Behandlung Schwachsinniger und Epileptischer,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German
  258,Der Pfarrbote,news media,Berlin,German Empire,Germany,52.52,13.405,German
  71,Correspondenz-Blatt für Schweizer Aerzte,general medicine,Bern,Switzerland,Switzerland,46.948056,7.4475,German
  6,Archiv für mikroskopische Anatomie,"specialized: anatomy, physiology and pathology",Bonn,German Empire,Germany,50.733333,7.1,German
  203,The Journal of Abnormal Psychology,specialized: psychology,Boston,United States,United States,42.358056,-71.063611,English
  273,"Correspondenz-Blatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte","specialized: anthropology, criminology and forensics",Braunschweig,German Empire,Germany,52.266667,10.516667,German
  303,Policlinique de Bruxelles,general medicine,Brussels,Belgium,Belgium,50.85,4.35,French
  306,Annales de la Société Belge de Neurologie,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French
  19,Journal de neurologie,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French
  25,"Revue internationale d'électrothérapie, de physiologie, de médecine, de chirurgie, d'obstétrique, de thérapeutique, de chimie et de pharmacie",general medicine,Brussels,Belgium,Belgium,50.85,4.35,French
  35,Bulletin de la Société de Médecine Mentale de Belgique,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French
  ...

  Metadata

  download_info {
      "url": "https://raw.githubusercontent.com/DHARPA-Project/kiara.examples/main/examples/data/netw…",
      "response_headers": [
          {
              "connection": "keep-alive",
              "content-length": "7436",
              "cache-control": "max-age=300",
              "content-security-policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",
              "content-type": "text/plain; charset=utf-8",
              "etag": "W/\"641ae85d69e5836d27ea8906aba0a33b48b0f3ed0ed4c40d21a07fccebdd238d\"",
              "strict-transport-security": "max-age=31536000",
              "x-content-type-options": "nosniff",
              "x-frame-options": "deny",
              "x-xss-protection": "1; mode=block",
              "x-github-request-id": "AF48:2127D9:46F4B:CA184:67925C87",
              "content-encoding": "gzip",
              "accept-ranges": "bytes",
              "date": "Thu, 23 Jan 2025 15:39:48 GMT",
              "via": "1.1 varnish",
              "x-served-by": "cache-dub4326-DUB",
              "x-cache": "HIT",
              "x-cache-hits": "1",
              "x-timer": "S1737646789.557509,VS0,VE1",
              "vary": "Authorization,Accept-Encoding,Origin",
              "access-control-allow-origin": "*",
              "cross-origin-resource-policy": "cross-origin",
              "x-fastly-request-id": "ccfb74c9dcdff832237bbcd4506ec9e46952bf47",
              "expires": "Thu, 23 Jan 2025 15:44:48 GMT",
              "source-age": "52"
          }
      ],
      "request_time": "2025-01-23T15:39:48.465421+00:00",
      "download_time_in_seconds": 0.07359
  }
Great! We've successfully downloaded the file, and we can see there's lots of information here.
At the moment, we're most interested in the file output. This contains the actual contents of the file that we have just downloaded.
Let's separate this out and save it. If we working in python, we can store it as a variable, but with kiara we can also save it under an alias. This saves the data to a store inside kiara for us to access at a later date, so we don't lose it if we restart the notebook, or want to access the information through a different access point instead. Let's save the downloaded file under the alias 'Journal_Nodes'. For this, we need to access the unique identifier stored in kiara, which can be found with value_id. We can then use the alias later, much like a variable!
downloaded_file = outputs['file']
kiara.store_value(value=downloaded_file.value_id, alias='Journal_Nodes')
Store operation result
  value_id    e16e90d0-5942-4aab-8a4c-a5df80e3118d
  alias       Journal_Nodes
  data type   file
  size        35.47 KB
  success     yes
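As a quick sanity check, you can pull the stored value back out by its alias with kiara.get_value, the same call we will use again in the next section. This is just a sketch confirming the alias resolves to the value we stored:

# Retrieve the stored file by its alias and confirm it is the same value we just stored.
stored_file = kiara.get_value('Journal_Nodes')
print(stored_file.value_id)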
New Formats: Creating and Converting
What next? We could transform the downloaded file contents into a different format.
Let's use the operation list earlier, and look for something that allows us to create something out of our new file.
kiara.list_operation_ids('create')
['create.cut_point_list', 'create.database.from.file', 'create.database.from.file_bundle', 'create.database.from.table', 'create.database.from.tables', 'create.job_log', 'create.network_graph.from.file', 'create.table.from.file', 'create.table.from.file_bundle', 'create.tables.from.file_bundle']
Our file was originally in a CSV format, so let's make a table using create.table.from.file.
Just like when we used download.file, we can double check what this does, and what inputs and outputs this involves.
This time, we're also going to use a variable to store the operation in - this is especially handy if the operation has a long name, or if you want to use the same operation more than once without retyping it.
op_id = 'create.table.from.file'
kiara.retrieve_operation_info(op_id)
Author(s)      Markus Binsteiner  markus@frkl.io
Context        Tags: tabular
               Labels: package: kiara_plugin.tabular
References     source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular
               documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/
Module type    create.table
Module config  {"source_type": "file", "target_type": "table"}

Operation details

Documentation  Create a table from a file, trying to auto-determine the format of said file.
               Currently supported input file types:
               • csv
               • parquet

Inputs
  field name            type      description                                                   Required   Default
  file                  file      The source value (of type 'file').                            yes        -- no default --
  first_row_is_header   boolean   Whether the first row of a (csv) file is a header row. If     no         -- no default --
                                  not provided, kiara will try to auto-determine. Ignored if
                                  not a csv file.
  delimiter             string    The delimiter that is used in the csv file. If not            no         -- no default --
                                  provided, kiara will try to auto-determine. Ignored if not
                                  a csv file.

Outputs
  field name   type    description
  table        table   The result value (of type 'table').
Great, we have all the information we need now.
Let's go again.
First we define our inputs - here we can use kiara.get_value to get back the downloaded file using the alias that we stored earlier. We also want to tell kiara that the first row should be read as a header.
Then use kiara.run_job with our chosen operation, this time stored as op_id.
Once this is saved as our outputs, we can print it out.
inputs = {
"file": kiara.get_value('Journal_Nodes'),
"first_row_is_header": True
}
outputs = kiara.run_job(op_id, inputs=inputs, comment="")
outputs
field   value
──────────────────────────────────────────────────────────────────────────────────────────────
table

  Id    Label            JournalType     City        CountryNetworkT  PresentDayCoun   Latitude   Longitude    Language
  ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  75    Psychiatrische   specialized: p  Amsterdam   Netherlands      Netherlands      52.366667  4.9          Dutch
  36    The American J   specialized: p  Baltimore   United States    United States    39.289444  -76.615278   English
  208   The American J   specialized: p  Baltimore   United States    United States    39.289444  -76.615278   English
  295   Die Krankenpfl   specialized: t  Berlin      German Empire    Germany          52.52      13.405       German
  296   Die deutsche K   general medici  Berlin      German Empire    Germany          52.52      13.405       German
  300   Therapeutische   specialized: t  Berlin      German Empire    Germany          52.52      13.405       German
  1     Allgemeine Zei   specialized: p  Berlin      German Empire    Germany          52.52      13.405       German
  7     Archiv für Psy   specialized: p  Berlin      German Empire    Germany          52.52      13.405       German
  10    Berliner klini   general medici  Berlin      German Empire    Germany          52.52      13.405       German
  13    Charité Annale   general medici  Berlin      German Empire    Germany          52.52      13.405       German
  21    Monatsschrift    specialized: p  Berlin      German Empire    Germany          52.52      13.405       German
  29    Virchows Archi   specialized: a  Berlin      German Empire    Germany          52.52      13.405       German
  31    Zeitschrift fü   specialized: p  Berlin      German Empire    Germany          52.52      13.405       German
  42    Vierteljahrssc   specialized: a  Berlin      German Empire    Germany          52.52      13.405       German
  47    Centralblatt f   specialized: p  Berlin      German Empire    Germany          52.52      13.405       German
  50    Russische medi   general medici  Berlin      German Empire    Germany          52.52      13.405       German
  ...   ...              ...             ...         ...              ...              ...        ...          ...
  ...   ...              ...             ...         ...              ...              ...        ...          ...
  277   L'arte medica    general medici  Turin       Italy            Italy            45.079167  7.676111     Italian
  288   Allgemeine öst   specialized: a  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  18    Jahrbücher für   specialized: p  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  30    Wiener klinisc   general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  44    Wiener klinisc   general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  45    Wiener medizin   general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  72    Wiener medizin   general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  81    Monatsschrift    general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  93    Klinisch-thera   general medici  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  151   Medicinisch-ch   specialized: s  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  199   Der Militärazt   specialized: m  Vienna      Austro-Hungaria  Austria          48.2       16.366667    German
  261   Медицинская бе   general medici  Voronezh    Russian Empire   Russia           51.671667  39.210556    Russian
  77    Medycyna         general medici  Warsaw      Russian Empire   Poland           52.233333  21.016667    Polish
  150   Kronika Lekars   general medici  Warsaw      Russian Empire   Poland           52.233333  21.016667    Polish
  86    Grenzfragen de   specialized: p  Wiesbaden   German Empire    Germany          50.0825    8.24         German
  206   Ergebnisse der   specialized: a  Wiesbaden   German Empire    Germany          50.0825    8.24         German
This has done exactly what we wanted, and shown the contents from the downloaded file as a table. But we are also interested in some general (mostly internal) information and metadata, this time for the new table we have just created, rather than the original file itself.
Let's save it again under an alias.
outputs_table = outputs['table']
kiara.store_value(value=outputs_table.value_id, alias="Journal_Nodes_table")
Store operation result
  value_id    977ccb15-625d-420c-a490-b7ae180d4cc1
  alias       Journal_Nodes_table
  data type   table
  size        43.28 KB
  success     yes
Querying our Data
Now that we have downloaded our file and converted it into a table, we want to actually explore it.
To do this, we can query the table using SQL and some functions already included in kiara.
Let's take another look at that operation list, this time looking for functions that let us 'query'.
kiara.list_operation_ids('query')
['query.database', 'query.table']
Well, we already know our file has been converted into a table, so let's have a look at query.table.
kiara.retrieve_operation_info('query.table')
Author(s)      Markus Binsteiner  markus@frkl.io
Context        Tags: tabular
               Labels: package: kiara_plugin.tabular
References     source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular
               documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/
Module type    query.table

Operation details

Documentation  Execute a sql query against an (Arrow) table. The default relation name for the sql query is
               'data', but can be modified by the 'relation_name' config option/input. If the 'query' module
               config option is not set, users can provide their own query, otherwise the pre-set one will be
               used.

Inputs
  field name      type     description                                                  Required   Default
  table           table    The table to query                                           yes        -- no default --
  query           string   The query, use the value of the 'relation_name' input as     yes        -- no default --
                           table, e.g. 'select * from data'.
  relation_name   string   The name the table is referred to in the sql query.          no         data

Outputs
  field name     type    description
  query_result   table   The query result.
So from this information, we only need to provide the table itself and our query. The query uses SQL - for more on forming these queries, have a look at tutorials such as this.
Let's work out how many of these journals were published in Berlin.
inputs = {
"table": kiara.get_value('Journal_Nodes_table'),
"query": "SELECT * from data where City like 'Berlin'"
}
outputs = kiara.run_job('query.table', inputs=inputs, comment="")
outputs
field   value
──────────────────────────────────────────────────────────────────────────────────────────────
query_result

  Id    Label            JournalType     City     CountryNetwor   PresentDayCoun   Latitude  Longitude  Language
  ────────────────────────────────────────────────────────────────────────────────────────────────────────────
  295   Die Krankenpfl   specialized: t  Berlin   German Empire   Germany          52.52     13.405     German
  296   Die deutsche K   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  300   Therapeutische   specialized: t  Berlin   German Empire   Germany          52.52     13.405     German
  1     Allgemeine Zei   specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  7     Archiv für Psy   specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  10    Berliner klini   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  13    Charité Annale   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  21    Monatsschrift    specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  29    Virchows Archi   specialized: a  Berlin   German Empire   Germany          52.52     13.405     German
  31    Zeitschrift fü   specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  42    Vierteljahrssc   specialized: a  Berlin   German Empire   Germany          52.52     13.405     German
  47    Centralblatt f   specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  50    Russische medi   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  76    Deutsche Aerzt   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  87    Monatsschrift    specialized: g  Berlin   German Empire   Germany          52.52     13.405     German
  108   Archiv für kli   specialized: s  Berlin   German Empire   Germany          52.52     13.405     German
  113   Zeitschrift fü   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  159   Deutsche milit   specialized: m  Berlin   German Empire   Germany          52.52     13.405     German
  162   Jahresbericht    specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  192   Ärztliche Sach   general medici  Berlin   German Empire   Germany          52.52     13.405     German
  198   Zeitschrift fü   specialized: p  Berlin   German Empire   Germany          52.52     13.405     German
  258   Der Pfarrbote    news media      Berlin   German Empire   Germany          52.52     13.405     German
The function has returned the table with just the results we were looking for from the SQL query.
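Since we asked how many of these journals were published in Berlin, we could also let SQL do the counting for us. The next cell is a sketch that re-runs query.table with a COUNT(*) query against the stored table; the variable names, column alias and comment text are just examples:

# Count the Berlin journals directly, instead of counting the returned rows by eye.
count_inputs = {
    "table": kiara.get_value('Journal_Nodes_table'),
    "query": "SELECT COUNT(*) AS berlin_journals FROM data WHERE City LIKE 'Berlin'"
}
count_outputs = kiara.run_job('query.table', inputs=count_inputs, comment="counting journals published in Berlin")
count_outputs['query_result']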
Let's narrow this further, and find all the journals that are just about general medicine and published in Berlin.
We can re-use the query.table function and the table we've just made, stored in outputs['query_result'].
inputs = {
"table" : outputs['query_result'],
"query" : "SELECT * from data where JournalType like 'general medicine'"
}
outputs = kiara.run_job('query.table', inputs=inputs, comment="")
outputs
field   value
──────────────────────────────────────────────────────────────────────────────────────────────
query_result

  Id    Label            JournalType     City     CountryNetwork  PresentDayCou   Latitude  Longitude  Language
  ───────────────────────────────────────────────────────────────────────────────────────────────────────────
  296   Die deutsche K   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  10    Berliner klini   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  13    Charité Annale   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  50    Russische medi   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  76    Deutsche Aerzt   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  113   Zeitschrift fü   general medici  Berlin   German Empire   Germany         52.52     13.405     German
  192   Ärztliche Sach   general medici  Berlin   German Empire   Germany         52.52     13.405     German
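If you want to hold on to this filtered table, you could store it under an alias too, following the same store_value pattern we used earlier. The alias name below is just an example:

# Store the Berlin general-medicine subset so it survives a notebook restart.
berlin_general = outputs['query_result']
kiara.store_value(value=berlin_general.value_id, alias='Berlin_general_medicine')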
Recording and Tracing our Data
We've made quite a few changes to this table, so let's double check the information about this new table we've created with our queries.
query_output = outputs['query_result']
query_output
value_id           251d34c9-fe50-47a3-b8e3-14e1baf9b8a7
kiara_id           c8dcfe3a-a996-4269-b3ad-6c25be787227
────────────────────────────────────────────────────────────────────────────────────────────
data_type_info
  data_type_name     table
  data_type_config   {}
  characteristics    { "is_scalar": false, "is_json_serializable": false }
  data_type_class
    python_class_name    TableType
    python_module_name   kiara_plugin.tabular.data_types.table
    full_name            kiara_plugin.tabular.data_types.table.TableType
destiny_backlinks  {}
job_id             d47a5a91-ddfa-489b-a4b5-f5793753915d
property_links     {
                     "metadata.python_class": "cb8ef26b-ca05-4171-bd95-ae376eb90829",
                     "metadata.table": "809c47a6-cdf4-433f-9593-c66128bdcee3"
                   }
value_created      2025-01-23 15:39:49.180617+00:00
value_hash         zdpuAqVeH2LWHhmCyV2rRSXKemJiodfymKvc8BR28xUHwkJdM
value_schema
  type          table
  type_config   {}
  default       not_set
  optional      False
  is_constant   False
  doc           The query result.
value_size         5.71 KB
value_status       -- set --
Looks good!
We might have changed things around, but we can still get lots of information about all our data.
More importantly, kiara is able to trace all of these changes, tracking the inputs and outputs and giving them all different identifiers, so you know exactly what has happened to your data.
First let's have a look at our basic lineage function - this gets us the 'backstage' view of what has been going on, showing the inputs for each of the functions that we have run, and where they feed into one another. In each case, kiara has assigned the inputs a unique identifier. Check it out!
query_output.lineage
query.table
├── input: query (string) = d5432a97-1907-4c2c-98aa-9e742e2aff8f
├── input: relation_name (string) = 31ca3cba-bdcc-470b-8aea-6454ed679689
└── input: table (table) = 25d488c1-7679-40fe-a990-1771fd7b12a1
    └── query.table
        ├── input: query (string) = fd5cefac-e382-4e3b-924c-5bc1df644a08
        ├── input: relation_name (string) = 1dcde20a-b8c2-43f4-97b3-36377b086fb8
        └── input: table (table) = 977ccb15-625d-420c-a490-b7ae180d4cc1
            └── create.table
                ├── input: delimiter (string) = 019eab23-3e66-4dc4-b123-732521bd9d5f
                ├── input: file (file) = e16e90d0-5942-4aab-8a4c-a5df80e3118d
                │   └── download.file
                │       ├── input: file_name (string) = a2991134-2878-43a8-b770-61b988fd0227
                │       └── input: url (string) = b5431bb2-8770-45cd-a0e3-9be296103b20
                └── input: first_row_is_header (boolean) = ea0d378a-f045-46ac-a9dc-7fb690b0dce0
We can also visualise this, allowing us to view the different functions and their inputs and outputs as a series of steps - the 'workflow' we've been talking about.
lineage = kiara.retrieve_augmented_value_lineage(query_output)
from observable_jupyter import embed
embed('@dharpa-project/kiara-data-lineage', cells=['displayViz', 'style'], inputs={'dataset':lineage})
This gives us the lineage - the history - of the steps we took to get to our final output, or any output that we want to look at. We can also take a look at all the steps we've taken, including those that don't lead to our final output - by getting all the information about the jobs we've run in our context, we can look at all the decisions we have made. Though this notebook has been fairly straightforward, if any of the parameters or inputs were changed at any point, we can view all the re-runs and edits, and all the comments that go with them. Let's have a look.
kiara.print_all_jobs_info_data(show_inputs=True, show_outputs=True, max_char=100)
Module name     Comment             Time submitted                 Runtime    Inputs                             Outputs
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
download.file   importing journal   2025-01-23                     0.085227   file_name  Journ…                  file  Journa…
                nodes               15:39:48.462755+00…                       url        https…
create.table                        2025-01-23                     0.124732   delimiter  none                    table  Journ…
                                    15:39:48.657577+00…                       file       Journ…
                                                                              first_row_is_header  True
query.table                         2025-01-23                     0.155754   query  SELECT * from …             query_result  Id…
                                    15:39:48.940532+00…
query.table                         2025-01-23                     0.01207    query  SELECT * from …             …
                                    15:39:49.174867+00…
And if we want to keep it, or share it with others? We can export it straight to CSV, making it even easier to make all our decisions traceable and replicable.
import pandas as pd
job_table = pd.DataFrame(kiara.get_all_jobs_info_data())
job_table.to_csv('job_log.csv', index=False)
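To double-check the export, you can read the file straight back in with pandas. This is just a quick verification sketch:

# Read the exported job log back in to confirm the CSV was written as expected.
pd.read_csv('job_log.csv').head()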
What next...?
That's great, you've completed the first notebook and successfully installed kiara, downloaded files, tested out some functions, and are able to see what this does to your data.
Now you can check out the other plugin packages to explore how this helps you manage and trace your data while using digital analysis tools!