Opened 11 years ago

Last modified 4 years ago

#226 new defect

WinGRASS fails to create .gislock opening a mapset

Reported by: msieczka
Owned by: grass-dev@…
Priority: major
Milestone: 6.4.6
Component: Default
Version: svn-develbranch6
Keywords: wingrass, qgis
Cc:
CPU: All
Platform: MSWindows XP

Description

GRASS fails to create a .gislock file on Windows, thus QGIS cannot close a mapset it opened, reporting e.g.:

Cannot close mapset. Cannot remove mapset lock: H:/GRASSDATA/bug/PERMANENT/.gislock

Change History (21)

comment:1 Changed 11 years ago by msieczka

This issue is discussed in the QGIS Trac under https://trac.osgeo.org/qgis/ticket/808.

comment:2 Changed 10 years ago by pcav

Summary: WinGRASS fials to create .gislock opening a mapset → WinGRASS fails to create .gislock opening a mapset

comment:3 Changed 10 years ago by hamish

Keywords: wingrass added

comment:4 Changed 10 years ago by hamish

Keywords: qgis added

comment:5 Changed 9 years ago by neteler

The bug has been fixed in QGIS, can we close here, too?

comment:6 Changed 9 years ago by hamish

well, WinGRASS still doesn't have any support for mapset locking AFAIK, so the QGIS bug would just be a symptom of that, even if they've implemented a workaround.

no idea if anyone is trying multi-user over SMB in the classroom; otherwise the main danger on Windows is trying to restart a session when you've already got the same one minimized. (?)

Hamish

comment:7 in reply to:  6 Changed 9 years ago by glynn

Replying to hamish:

no idea if anyone is trying multi-user over SMB in the classroom; otherwise the main danger on Windows is trying to restart a session when you've already got the same one minimized. (?)

Which isn't much of a danger. The biggest issue with concurrent use is that the WIND file is per mapset, not per session. The mere existence of another session with the same mapset isn't an issue; it only becomes an issue if both sessions are actually running commands.

comment:8 Changed 9 years ago by hellik

Milestone: 6.4.0 → 6.4.2

comment:9 Changed 8 years ago by martinl

Time to close this ticket?

comment:10 in reply to:  9 Changed 8 years ago by hamish

Replying to martinl:

Time to close this ticket?

No, it isn't fixed...

comment:11 Changed 8 years ago by mmetz

IIUC, the .gislock file is not created under Windows because kill(), which is used by find_process(), does not exist under Windows.

Under Linux, assume the following scenario: a group of people are working from different machines on the same location, in different mapsets. The location is on a network drive accessible by everyone. Now g.mapset mapset=othermapset using lock (GIS_LOCK) checks whether it could kill the PID written in .gislock. But if the PID in .gislock has been written by a different machine/system, then the PID in .gislock has nothing to do with the PIDs visible to lock, and the kill() test is completely moot. Right? In this case it would be more helpful if .gislock held not a PID but the name of the user, e.g. user@host, currently accessing the mapset.

Therefore I would suggest skipping the find_process() step and assuming that a mapset is locked as long as the file .gislock exists. And always write the .gislock file, also on Windows.

Markus M
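Markus M's proposal above (treat the mere existence of .gislock as a lock, with no staleness test) can be sketched in a few lines; this is a hypothetical illustration, not GRASS code, and the function name is mine:

```python
import os

def mapset_locked(mapset_dir):
    # Proposal above: a mapset counts as locked whenever .gislock
    # exists, with no attempt to decide whether the lock is stale.
    return os.path.exists(os.path.join(mapset_dir, ".gislock"))
```

The cost of this simplicity, as discussed below, is that stale lock files would always have to be removed manually.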

comment:12 in reply to:  11 ; Changed 8 years ago by glynn

Replying to mmetz:

IIUC, the .gislock file is not created under windows because kill() does not exist under windows, used by find_process().

Sort of. The existing code won't work in its entirety on Windows. As Windows systems aren't generally multi-user, simply ignoring the entire locking issue was the easiest solution.

Under Linux, assume the following scenario: a group of people are working from different machines on the same location, in different mapsets. The location is on a network drive accessible by everyone. Now g.mapset mapset=othermapset using lock (GIS_LOCK) checks whether it could kill the PID written in .gislock. But if the PID in .gislock has been written by a different machine/system, then the PID in .gislock has nothing to do with the PIDs visible to lock, and the kill() test is completely moot. Right?

The purpose of the kill() test is to check whether the .gislock file is "stale", i.e. whether the session which created the .gislock file terminated without removing it. If kill() fails with ESRCH, the PID stored in the .gislock file doesn't refer to an existing process on the local system, so the lock is assumed to be stale and is ignored. This test isn't particularly reliable; it will consider the lock as stale if it was created by a session on another machine, even if that session is still alive, and will consider the lock as alive if the session has terminated but its PID is now used by another process.

Therefore I would suggest skipping the find_process() step and assuming that a mapset is locked as long as the file .gislock exists.

That would avoid the issue with the lock file being considered stale due to having been created on a different system. OTOH, it would require stale lock files to always be removed manually. If that is considered a problem, writing a hostname along with the PID would solve the first issue without abandoning automatic removal on non-shared filesystems.
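Glynn's description of the kill()-based staleness heuristic can be sketched in Python; this is a hypothetical stand-in for the C code in etc/lock (the function name is mine), included only to make the failure modes concrete:

```python
import errno
import os

def lock_is_stale(pid):
    """Heuristic described above: the lock is presumed stale when the
    PID recorded in .gislock does not refer to any process on the
    local system.

    Known failure modes: a lock written on another machine looks stale
    even though that session is alive, and a recycled PID makes a dead
    session look alive.
    """
    try:
        os.kill(pid, 0)          # signal 0 checks existence only
    except OSError as e:
        if e.errno == errno.ESRCH:
            return True          # no such process: presumed stale
        return False             # e.g. EPERM: process exists, other user
    return False
```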

comment:13 in reply to:  12 ; Changed 8 years ago by mmetz

Replying to glynn:

Replying to mmetz:

The existing code won't work in its entirety on Windows. As Windows systems aren't generally multi-user, simply ignoring the entire locking issue was the easiest solution.

But GRASS databases and locations are multi-user by design. Whether the system used to access a mapset is single-user or not does not matter if the GRASS database is somewhere on a network.

Under Linux, assume the following scenario: a group of people are working from different machines on the same location, in different mapsets. The location is on a network drive accessible by everyone. Now g.mapset mapset=othermapset using lock (GIS_LOCK) checks whether it could kill the PID written in .gislock. But if the PID in .gislock has been written by a different machine/system, then the PID in .gislock has nothing to do with the PIDs visible to lock, and the kill() test is completely moot. Right?

The purpose of the kill() test is to check whether the .gislock file is "stale", i.e. whether the session which created the .gislock file terminated without removing it. If kill() fails with ESRCH, the PID stored in the .gislock file doesn't refer to an existing process on the local system, so the lock is assumed to be stale and is ignored. This test isn't particularly reliable; it will consider the lock as stale if it was created by a session on another machine, even if that session is still alive, and will consider the lock as alive if the session has terminated but its PID is now used by another process.

It sounds like using the PID is not a reliable solution; it can easily result in both false positives and false negatives.

Therefore I would suggest skipping the find_process() step and assuming that a mapset is locked as long as the file .gislock exists.

That would avoid the issue with the lock file being considered stale due to having been created on a different system. OTOH, it would require stale lock files to always be removed manually. If that is considered a problem, writing a hostname along with the PID would solve the first issue without abandoning automatic removal on non-shared filesystems.

I don't see an easy way to reliably automate handling based on GIS_LOCK and the PID in .gislock, because a lock may be removed even though it is alive, or not be removed even though it is stale. How about avoiding PID altogether and writing 'user@host' to .gislock? Currently, in trunk, the wxGUI asks at startup if an existing lock should really, really be removed. Starting trunk in text mode silently removes any gislock (needs to be fixed). How about a new flag for yes-I-know-what-I'm-doing to try to force-remove an existing lock, both at startup and for g.mapset?

Markus M

comment:14 in reply to:  13 ; Changed 8 years ago by glynn

Replying to mmetz:

How about avoiding PID altogether and writing 'user@host' to .gislock?

That places the burden of determining whether or not the lock is stale entirely on the user, as there's no mechanism (even an unreliable one) for determining whether the session which created the lock file is alive. Writing pid@host would solve the shared-filesystem issue insofar as it lets etc/lock know whether the kill() test can be used. If the lock file was created by a session running on a different system, there's no portable way to determine whether the session is still alive. However, displaying the PID and host to the user may allow them to make the determination manually.

Currently, in trunk, the wxGUI asks at startup if an existing lock should really, really be removed. Starting trunk in text mode silently removes any gislock (needs to be fixed).

It shouldn't remove the lock file. It should only re-write the lock file if the PID contained within doesn't match that of an existing process on the local system. etc/lock terminates with an exit code of 2 if the .gislock file exists and the PID contained within matches an existing process, an exit code of 1 if an error occurred (e.g. couldn't create the file or couldn't write to it) and an exit code of 0 if the file was written successfully.

How about a new flag for yes-I-know-what-I'm-doing to try to force-remove an existing lock, both at startup and for g.mapset?

In practice, stale lock files are sufficiently rare that it's debatable whether it's worth the effort of adding a simpler alternative to manually deleting the lock file.
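The exit-code contract Glynn describes (0 = lock written, 1 = write error, 2 = mapset in use) can be illustrated with a minimal Python stand-in for etc/lock; the function and parameter names are mine, and the staleness test is passed in as a callable so the sketch stays self-contained:

```python
def write_lock(gislock_path, my_pid, pid_alive):
    """Sketch of the contract described above for etc/lock:
    return 2 if the lock file holds the PID of a live process,
    1 if the lock file cannot be created or written,
    0 after writing my_pid into it."""
    try:
        with open(gislock_path) as f:
            old_pid = int(f.read().strip() or "0")
        if old_pid and pid_alive(old_pid):
            return 2                     # mapset already in use
    except (OSError, ValueError):
        pass                             # no lock, or unreadable content
    try:
        with open(gislock_path, "w") as f:
            f.write("%d\n" % my_pid)
    except OSError:
        return 1                         # couldn't create or write
    return 0
```

A caller (like the startup script) would then only re-enter the mapset on exit code 0, as described above.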

comment:15 in reply to:  14 ; Changed 8 years ago by mmetz

Replying to glynn:

Replying to mmetz:

How about avoiding PID altogether and writing 'user@host' to .gislock?

That places the burden of determining whether or not the lock is stale entirely on the user, as there's no mechanism (even an unreliable one) for determining whether the session which created the lock file is alive. Writing pid@host would solve the shared-filesystem issue insofar as it lets etc/lock know whether the kill() test can be used. If the lock file was created by a session running on a different system, there's no portable way to determine whether the session is still alive. However, displaying the PID and host to the user may allow them to make the determination manually.

Displaying the PID and host to the user assumes that the user knows the meaning of PID and host. This is IMHO a false assumption considering the current state of Linux, Mac, and Windows, where users usually do not have to know or worry about PIDs. These are the officially supported OSes, and many of their users do not need to be (and thus nowadays probably are not) familiar with the inner workings of their OS, e.g. PIDs and which process in particular corresponds to a given PID. This is apart from the issue mentioned earlier that the PID written to .gislock does not refer to a PID of the current system if the mapset is accessible to multiple users and located on a network drive.

Currently, in trunk, the wxGUI asks at startup if an existing lock should really, really be removed. Starting trunk in text mode silently removes any gislock (needs to be fixed).

It shouldn't remove the lock file. It should only re-write the lock file if the PID contained within doesn't match that of an existing process on the local system. etc/lock terminates with an exit code of 2 if the .gislock file exists and the PID contained within matches an existing process, an exit code of 1 if an error occurred (e.g. couldn't create the file or couldn't write to it) and an exit code of 0 if the file was written successfully.

How about a new flag for yes-I-know-what-I'm-doing to try to force-remove an existing lock, both at startup and for g.mapset?

In practice, stale lock files are sufficiently rare that it's debatable whether it's worth the effort of adding a simpler alternative to manually deleting the lock file.

Define practice. What about a Windows user starting GRASS with msys and just killing the msys terminal at the end, or not even bothering about the terminal? I have seen that. I guess that nowadays many GRASS users are working on a single-user system with a single-user GRASS database, and for these users GRASS must IMHO work 100%. In this case, a lock is probably not needed. But there are also other users, e.g. some public administration institutes where many different users access the same GRASS location from individual clients, the location being held on a central server. For these, the GRASS locking mechanism must also work, although admittedly this is first regulated by file system permissions. Then there are users like e.g. Markus Neteler, Sören Gebbert, and me who use GRASS on a cluster system where several hundred nodes may want to write to the same mapset at the same time (we have a hack solution for that); these users would be found in scientific research environments. GRASS makes quite some effort to appease exactly such users (scientists), so they need to be accommodated, too.

Practice 1: single user single GRASS database

  • best practice would be to ignore a lock and proceed. The question here is when a stale lock could occur. A stale lock should be a rare exception in this case.

Practice 2: multiple users, single GRASS database

  • best practice would be to acknowledge a lock and quit (usually this would be regulated through write permissions, though). No chance to determine if a PID lock is stale.

Practice 3: single user acting as multiple users from different systems, single GRASS database

  • best practice would be to acknowledge a lock and quit (ignore mapset write permissions, use lock info only). No chance to determine if a PID lock is stale.

The motivation behind displaying user@host is that most users would know their user name and ideally the system (host) where they are currently logged in. This (in addition to write permissions) should suffice to let the user decide if he wants to try to remove a mapset lock.

Markus M

comment:16 in reply to:  15 ; Changed 8 years ago by glynn

Replying to mmetz:

Displaying the PID and host to the user assumes that the user knows the meaning of PID and host.

If they don't, they probably can't safely deal with a stale lock file.

On an unshared filesystem, the automatic resolution via the kill() test will usually work. The exceptions are where the PID has since been re-used for an unrelated process (false positive), or where the shell itself has been killed but child processes are still running (false negative). If they're using a shared filesystem, there's probably some form of technical support available.

What about a Windows user starting GRASS with msys and just killing the msys terminal at the end, or not even bothering about the terminal?

If the session on that terminal is still "working" (i.e. not just waiting for the next command), it's fairly important that they don't just start up another session using the same mapset.

Practice 1: single user single GRASS database

  • best practice would be to ignore a lock and proceed.

Note that this could result in a corrupted database. I don't know how likely that is in practice, i.e. whether it's common to have long-running background jobs on Windows systems.

The motivation behind displaying user@host is that most users would know their user name and ideally the system (host) where they are currently logged in. This (in addition to write permissions) should suffice to let the user decide if he wants to try to remove a mapset lock.

They really need the PID in order to make that determination. Some people have jobs which run for days. In a complex environment (where users have accounts on several multi-user systems), it's not inconceivable that someone can forget which jobs are running on which systems using which mapsets.

In normal use, stale lock files shouldn't occur, so the presumption should be that any existing lock file isn't stale. That presumption may be overridden in the presence of additional evidence; e.g. if the PID contained in the lock file doesn't refer to an existing process (particularly if the lock file also contains a host and the host is the local system), that tends to indicate staleness (the case where the shell has terminated but child processes survive is rather hard to detect).

The solution to the problems with the existing mechanism should be to fix it, e.g. by adding a Windows equivalent of the PID test and adding the host to the PID file, rather than assuming that a lock file is stale solely because the assumption is convenient, regardless of its accuracy.
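The suggested fix (store the host next to the PID) could look roughly like this in Python; it is a sketch only, the file format (PID, space, hostname) is my assumption rather than anything GRASS actually writes, and the function names are mine:

```python
import errno
import os
import socket

def write_lock_info(path):
    # Record both PID and host so a later check can tell whether
    # the local kill() test is meaningful at all.
    with open(path, "w") as f:
        f.write("%d %s\n" % (os.getpid(), socket.gethostname()))

def classify_lock(path):
    """Return 'live', 'stale', or 'unknown' (written on another host,
    where no portable liveness test exists)."""
    with open(path) as f:
        pid_str, host = f.read().split()
    if host != socket.gethostname():
        return "unknown"
    try:
        os.kill(int(pid_str), 0)
        return "live"
    except OSError as e:
        return "stale" if e.errno == errno.ESRCH else "live"
```

Only the "stale" answer would justify automatic removal; "unknown" would defer to the user, as discussed below.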

comment:17 in reply to:  16 ; Changed 8 years ago by mmetz

My main concern is that valid lock files are regarded as stale. Since it seems non-trivial to distinguish valid from stale lock files, I would opt to keep the lock file and deny access to the mapset in question if the state of the lock file cannot be safely determined. The check for the state of a lock file should return one of three answers: valid, can't say, or stale. The lock file should be removed if stale; if its status is unknown, user interaction is probably the only option.

Replying to glynn:

Replying to mmetz:

Displaying the PID and host to the user assumes that the user knows the meaning of PID and host.

If they don't, they probably can't safely deal with a stale lock file.

??? There is no danger in removing a stale lock file.

On an unshared filesystem, the automatic resolution via the kill() test will usually work. The exceptions are where the PID has since been re-used for an unrelated process (false positive), or where the shell itself has been killed but child processes are still running (false negative). If they're using a shared filesystem, there's probably some form of technical support available.

I'm afraid that's beyond my knowledge. What form of technical support do you have in mind that could be used by the lock executable?

What about a Windows user starting GRASS with msys and just killing the msys terminal at the end, or not even bothering about the terminal?

If the session on that terminal is still "working" (i.e. not just waiting for the next command), it's fairly important that they don't just start up another session using the same mapset.

That scenario was a bit provocative, but unfortunately I have seen it happening. In this case I would go for a check including the host name.

Practice 1: single user single GRASS database

  • best practice would be to ignore a lock and proceed.

Note that this could result in a corrupted database. I don't know how likely that is in practice, i.e. whether it's common to have long-running background jobs on Windows systems.

OK. If in doubt, assume a valid lock.

The motivation behind displaying user@host is that most users would know their user name and ideally the system (host) where they are currently logged in. This (in addition to write permissions) should suffice to let the user decide if he wants to try to remove a mapset lock.

They really need the PID in order to make that determination. Some people have jobs which run for days. In a complex environment (where users have accounts on several multi-user systems), it's not inconceivable that someone can forget which jobs are running on which systems using which mapsets.

I assume that on multi-user systems, user names are unique. In a complex environment, host names should also be unique. PIDs, however, are not unique, or at least not fail-safe, as you have pointed out earlier: a new process may be started with the same PID as in the valid or stale lock file.

Note that the PID in the lock file does not refer to the (long running) job they are currently busy with. The PID in the lock file refers to the instance of init.sh for GRASS 6 and grass.py for GRASS 7. That is, a user may do nothing while logged into a mapset, but as long as he/she is logged in, it's blocked for others. A consequence of the modular design of GRASS, I guess. As long as you are logged in to a given mapset, this is yours only, no matter if you actually do something there or not.

In normal use, stale lock files shouldn't occur, so the presumption should be that any existing lock file isn't stale. That presumption may be overridden in the presence of additional evidence; e.g. if the PID contained in the lock file doesn't refer to an existing process (particularly if the lock file also contains a host and the host is the local system), that tends to indicate staleness (the case where the shell has terminated but child processes survive is rather hard to detect).

I agree.

The solution to the problems with the existing mechanism should be to fix it, e.g. by adding a Windows equivalent of the PID test and adding the host to the PID file, rather than assuming that a lock file is stale solely because the assumption is convenient, regardless of its accuracy.

Sorry for being stubborn. You have provided only one argument for using PID@host instead of user@host. I argue that the PID does not say anything about long-running jobs; it may as well be an abandoned GRASS session. I would, as dangerous as it may be, offer the user the option to override the lock if there are doubts about whether the lock is valid. And I think that user@host is, particularly in complex environments, more reliable than PID@host.

Markus M

comment:18 in reply to:  17 ; Changed 8 years ago by glynn

Replying to mmetz:

The check for the state of a lock file should return one of three answers: valid, can't say, or stale. The lock file should be removed if stale; if its status is unknown, user interaction is probably the only option.

The main requirement for this is that the lock file needs to contain both the host and the PID. As for the check: if the host is not the local system, the status is unknown, otherwise the lock file is considered live if the PID exists and stale otherwise. False positives can be reduced by testing for the PID twice with a delay in between.

If they don't, they probably can't safely deal with a stale lock file.

??? There is no danger in removing a stale lock file.

I should have said "possibly-stale lock file". If it's definitely stale, there's no problem deleting it, but we don't actually know that with the current code. If we don't know whether it's stale (because it was created on a different host, or because we don't have a Windows equivalent for the kill() test), the determination will need to be made manually by someone who understands what "host" and "PID" mean.

I'm afraid that's beyond my knowledge. What form of technical support do you have in mind that could be used by the lock executable?

I'm talking about the case where the lock program is unable to reliably determine stale-ness and has to defer to the user.

Note that the PID in the lock file does not refer to the (long running) job they are currently busy with. The PID in the lock file refers to the instance of init.sh for GRASS 6 and grass.py for GRASS 7. That is, a user may do nothing while logged into a mapset, but as long as he/she is logged in, it's blocked for others. A consequence of the modular design of GRASS, I guess. As long as you are logged in to a given mapset, this is yours only, no matter if you actually do something there or not.

The PID refers to the GRASS session, specifically to the process running the script which starts the GRASS session and persists for the lifetime of the session. There isn't any practical alternative. Process groups won't work, as the shell typically creates a new process group for each command or pipeline. The (Unix) session ID (SID) would be more accurate, but there's no way to tell whether a session exists other than by parsing the output from "ps" (both the format and the flags required to list the SID tend to be platform-specific).

Sorry for being stubborn. You have provided only one argument for using PID@host instead of user@host. I argue that the PID does not say anything about long-running jobs; it may as well be an abandoned GRASS session. I would, as dangerous as it may be, offer the user the option to override the lock if there are doubts about whether the lock is valid. And I think that user@host is, particularly in complex environments, more reliable than PID@host.

Even if the session which created the lock file is idle, there's no guarantee that the user won't subsequently "revive" it, so it isn't safe to override the lock file so long as that session exists.

Having the PID in the lock file is the only way that an "obviously stale" lock file can be detected and overridden automatically. Without that, any lock file must be assumed to be live, with any override requiring user involvement.

Storing the user in the lock file is harmless but redundant. The user stored in the lock file will always be the user who owns both the mapset directory and the .gislock file. The start-up code won't let you select a mapset which you do not own as the current mapset.
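The three-state check converged on in this exchange, including the repeated PID test with a delay in between to reduce false positives from PID reuse, could be sketched like this; the names and the default delay are mine, not GRASS's:

```python
import errno
import os
import time

def pid_alive(pid):
    try:
        os.kill(pid, 0)
        return True
    except OSError as e:
        return e.errno != errno.ESRCH   # EPERM still means "exists"

def lock_state(lock_host, lock_pid, local_host, delay=0.1):
    """'unknown' for a lock from another host (no liveness test is
    possible); otherwise test the PID twice, with a delay in between,
    before calling the lock live."""
    if lock_host != local_host:
        return "unknown"
    if pid_alive(lock_pid):
        time.sleep(delay)
        if pid_alive(lock_pid):
            return "live"
    return "stale"
```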

comment:19 Changed 8 years ago by lutra

I don't know if it adds something to the discussion, but using GRASS 6.4.2RC2 through QGIS (osgeo4w) still creates a functional .gislock file, while opening a mapset with the very same GRASS installation, but not through QGIS, does not create the .gislock file.

I would be very happy to see the .gislock file gone for good also when using GRASS through QGIS. Is it necessary to modify the QGIS/GRASS plugin?

comment:20 in reply to:  18 Changed 8 years ago by mmetz

Replying to glynn:

Replying to mmetz:

The check for the state of a lock file should return one of three answers: valid, can't say, or stale. The lock file should be removed if stale; if its status is unknown, user interaction is probably the only option.

The main requirement for this is that the lock file needs to contain both the host and the PID. As for the check: if the host is not the local system, the status is unknown, otherwise the lock file is considered live if the PID exists and stale otherwise. False positives can be reduced by testing for the PID twice with a delay in between.

Sounds good to me. If the host is not the local system, the status is unknown, but (most of the time) it should be safe to assume that the lock is alive and not stale.

Having the PID in the lock file is the only way that an "obviously stale" lock file can be detected and overridden automatically. Without that, any lock file must be assumed to be live, with any override requiring user involvement.

This is what we probably have to settle for with windows and lock files created from different hosts.

Storing the user in the lock file is harmless but redundant. The user stored in the lock file will always be the user who owns both the mapset directory and the .gislock file. The start-up code won't let you select a mapset which you do not own as the current mapset.

Ah, OK, I thought write permission sufficed.

Markus M

comment:21 Changed 4 years ago by neteler

Milestone: 6.4.2 → 6.4.6